并行架构与编程(十一)异构并行

1. Amdahl’s law

  • speedup(f, n) = $\frac{1}{(1-f) + \frac{f}{n}}$
    • f: fraction of program that is parallelizable
    • n: parallel processors

1.1. resource limits symmetric version

Suppose the total resource is fixed, and each processor uses the same amount of resources.

  • speedup(f, n, r) = $\frac{1}{\frac{1-f}{perf(r)} + \frac{f}{perf(r) * \frac{n}{r}}}$
    • f: fraction of program that is parallelizable
    • n: total processing resources
    • r: resources dedicated to each processing core
  • BCEs: Generic unit of cost
    • area, power, etc

  • x-axis: resources per core
    • right: fewer fatter cores
    • left: many small cores
  • perf(r) modeled as $\sqrt{r}$
  • graph:
    • f = 0.5: for workload with small parallelism ability, less but fatter core wins
    • f = 0.999: for workload with huge parallelism ability, more but thinner core wins
    • optimal point: peak point
  • analysis 1: fraction f should be as high as possible
  • analysis 2: having r > 1 BCEs per core can be beneficial (for n = 256, f = 0.975, maximum speedup at 7.1 BCEs per core)

1.2. asymmetric set of processing cores

  • 上图给出了n资源异构处理器相对于单位资源单处理器的加速比
    • 1-f串行部分通过perf(r)执行,f可并行部分通过(n-r)perf(1) + pref(r)执行
    • perf(r)相当于拥有r资源的单个核相对于单位资源单个核的性能

  • 大小核架构不仅最大加速比提升
  • 而且peak加速比的位置偏移
  • 并且不同workload的加速比峰值位置相对更加集中

1.3. heterogeneous processing

real world

并行 串行
components that can be widely parallelized components that are difficult to parallelize
components that are amenable to wide SIMD execution components are not (divergent control flow)
components with predictable data access components with unpredictable access, but might cache well

Idea: the most efficient processor is a heterogeneous mixture of resources (“use the most efficient tool for the job”)

real world heterogeneous

choosing the right tool for the job

  • challenges of heterogeneity
    • use preferred processor for each task
      • what is the right mixture of resources to meet performance, cost, and energy goals?
    • implication: increased pressure to understand workloads accurately at chip design time
  • pitfalls of heterogeneous designs
    • how to map programs onto a heterogeneous collection of resources?
    • pick the right tool for the job: software portability nightmare
    • reducing energy consumption idea 1: use specialized processing
    • reducing energy consumption idea 2: move less data

note: traditional rule of thumb in “good system design” is to design simple, general-purpose components. not the case with emerging processing systems

note: today want collection of components that meet perf requirement AND minimize energy use

note challenge of using these resources effectively is pushed up to the programmer