Parallel Architecture and Programming (3): Performance Optimization

Optimize the part of the program taking the most time.

Goals of high-performance computing

  • balance the load across processors
  • reduce communication overhead
  • increase parallelism by whatever means available, shrinking the fraction of serial work (Amdahl’s law)
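
Amdahl’s law quantifies the last point: if a fraction $s$ of the work is inherently serial, then on $P$ processors $\text{speedup}(P) \le \frac{1}{s + (1-s)/P}$, which approaches $1/s$ as $P \to \infty$; e.g. with $s = 0.01$ the speedup can never exceed 100x, no matter how many processors are used.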

parallel programming rules of thumb

  • there should be at least as much independent work as the machine’s parallel execution capability, and generally much more
  • “parallel slack” = ratio of independent work to the machine’s parallel execution capability (~8)

1. Performance optimization 1: Work distribution and scheduling

Always implement the simplest solution first, then measure performance to determine if you need to do better

1.1. Work assignment

  • static assignment
    • the amount of work and its execution time are fixed and known
    • the amount of work is fixed; execution times differ but are predictable
    • assignment schemes (see the sketch after this list)
      • blocked assignment
      • interleaved assignment
  • semi-static assignment
    • execution times are predictable over the near term; the assignment is periodically re-adjusted based on the measured cost of the work
  • dynamic assignment
    • the execution time or the number of tasks is unpredictable
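
A minimal sketch of the two static schemes, assuming N independent iterations split among P threads (thread_id, N, P, and process are illustrative names, not from the original notes):

```cpp
#include <algorithm>

void process(int i);   // hypothetical per-iteration work

// Blocked assignment: each thread owns one contiguous chunk of iterations.
void blocked_assignment(int thread_id, int N, int P) {
    int chunk = (N + P - 1) / P;            // ceil(N / P)
    int start = thread_id * chunk;
    int end   = std::min(N, start + chunk);
    for (int i = start; i < end; i++)
        process(i);
}

// Interleaved assignment: thread k gets iterations k, k+P, k+2P, ...
// This spreads expensive iterations across threads when cost varies smoothly with i.
void interleaved_assignment(int thread_id, int N, int P) {
    for (int i = thread_id; i < N; i += P)
        process(i);
}
```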

1.1.1. Dynamic assignment 1: shared task queue

  • one shared queue, multiple worker threads
  • tradeoff: task granularity
    • small granularity -> better workload balance, more synchronization cost
    • large granularity -> minimizes the overhead of managing the assignment
    • ideal granularity depends on both the workload and the machine
  • scheduler: schedule long tasks first
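
A minimal sketch of the shared-queue scheme with one lock protecting the queue; the Task alias and worker function are illustrative names, not part of the original notes:

```cpp
#include <functional>
#include <mutex>
#include <queue>

using Task = std::function<void()>;

std::queue<Task> task_queue;   // the single shared queue
std::mutex       queue_lock;   // every dequeue pays this synchronization cost

// Each worker thread repeatedly pulls the next task from the shared queue.
// Finer-grained tasks balance load better but contend on this lock more often;
// coarser tasks amortize the queue overhead but risk load imbalance.
void worker() {
    while (true) {
        Task t;
        {
            std::lock_guard<std::mutex> guard(queue_lock);
            if (task_queue.empty()) return;        // no work left
            t = std::move(task_queue.front());
            task_queue.pop();
        }
        t();                                       // run the task outside the lock
    }
}
```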

1.1.2. Dynamic assignment 2: distributed task queues

  • a set of work queues, one per worker thread
    • the initial assignment of tasks is just like the static assignment
    • on demand assignment later
  • steal: when its local queue is empty, a thread can grab work from another thread’s queue
  • performance characteristics
    • synchronization cost is only paid when stealing
    • improves locality
  • other design questions
    • who to steal from?
      • Generally randomly pick one to steal from.
    • how much to steal?
    • how to detect program termination?
    • how to ensure local queue access is fast?
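
A sketch of per-thread queues with random-victim stealing; for clarity every queue is protected by a lock here, whereas a real implementation would use a lock-free deque (see below). Task is the same illustrative alias as in the previous sketch:

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <random>
#include <vector>

using Task = std::function<void()>;

struct WorkQueue {
    std::deque<Task> tasks;   // owner uses the tail, thieves take from the head
    std::mutex       lock;
};

std::vector<WorkQueue> queues;   // one queue per worker thread

bool get_task(int self, Task& out) {
    // Fast path: pop from our own queue's tail.
    {
        std::lock_guard<std::mutex> g(queues[self].lock);
        if (!queues[self].tasks.empty()) {
            out = std::move(queues[self].tasks.back());
            queues[self].tasks.pop_back();
            return true;
        }
    }
    // Local queue empty: pick a random victim and steal from the head of its queue.
    static thread_local std::mt19937 rng(self + 1);
    int victim = std::uniform_int_distribution<int>(0, (int)queues.size() - 1)(rng);
    std::lock_guard<std::mutex> g(queues[victim].lock);
    if (victim == self || queues[victim].tasks.empty()) return false;   // retry later
    out = std::move(queues[victim].tasks.front());
    queues[victim].tasks.pop_front();
    return true;
}
```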

1.2. Fork-join parallelism

idea: for divide-and-conquer algorithms, treat each node of the recursion tree as a task.

  • cilk_spawn: create new logical thread of control
  • cilk_sync: join
  • program termination semantics
    • implicit cilk_sync at the end of every function: returns when all calls spawned by current function have completed

  • main idea: expose independent work to the system using cilk_spawn
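
A canonical fork-join example in Cilk-style C++ (quicksort is used purely for illustration; PARALLEL_CUTOFF and partition are hypothetical helpers, and <cilk/cilk.h> provides the keywords):

```cpp
#include <algorithm>
#include <cilk/cilk.h>

const int PARALLEL_CUTOFF = 1000;     // hypothetical grain size
int* partition(int* begin, int* end); // hypothetical partition helper

// Each recursive call is exposed to the runtime as independent work via cilk_spawn;
// the implicit cilk_sync at function exit joins the spawned children.
void quick_sort(int* begin, int* end) {
    if (end - begin <= PARALLEL_CUTOFF) {
        std::sort(begin, end);                 // small problems: run serially
    } else {
        int* mid = partition(begin, end);
        cilk_spawn quick_sort(begin, mid);     // child may run in parallel...
        quick_sort(mid + 1, end);              // ...with this continuation
        // implicit cilk_sync: return only after the spawned call completes
    }
}
```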

1.2.1. Cilk runtime implementation: locality-aware work-stealing scheduler

  • strawman
    • launch thread for each cilk_spawn using pthread_create
    • join threads at each cilk_sync using pthread_join
    • problems
      • heavy cost of spawning
      • more running threads than cores
        • context switching
        • larger working set, less cache locality

real world: thread pool + per thread work queue + steal continuation + random victim

  • thread pool initialization: all threads created when encountering the first call to cilk_spawn
  • number of threads: equal to the number of execution contexts in the machine
  • work: an entire context, e.g. the spawned child (foo()) and the continuation that follows it (bar();) in the figure below

continuation first (child stealing)

  • put child context into work queue
  • bfs: O(N) space for spawned work

child first (continuation stealing)

  • put continuation context into work queue
  • dfs: anticipates divide-and-conquer parallelism; new work is generated in parallel rather than all by one thread
  • storage space: for T threads, no more than T times that of stack storage for single threaded execution
  • execution order: same as the program with the spawns removed
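
A small spawning loop makes the two policies concrete (foo is a hypothetical task function); the comments note what each policy places in the work queue:

```cpp
#include <cilk/cilk.h>

void foo(int i);   // hypothetical task

void run_all(int N) {
    for (int i = 0; i < N; i++)
        cilk_spawn foo(i);
        // child stealing: foo(i) is enqueued and this thread keeps running the loop,
        //   so it enqueues all N children itself (O(N) queued work, BFS order).
        // continuation stealing: this thread runs foo(i) immediately and enqueues the
        //   rest of the loop (iterations i+1..N-1) for a thief, so new work is generated
        //   in parallel and per-thread queue space stays bounded (DFS order).
    // implicit cilk_sync at the end of the function
}
```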

other implementation details

  • work queue: lock-free deque
  • work stealing: random choice of victim
    • the local thread pushes/pops at the “tail”
    • remote threads steal from the “head”

remote threads steal from the top (head) of the deque

  • reduces contention with the local thread
  • amortizes steal cost: work near the head sits higher in the call tree, so a thief steals a larger piece of work and the cost of the steal is amortized over more future computation
  • maximizes locality: the local thread keeps working on the local (deeper) part of the call tree

As the figure above shows, continuation stealing tends to distribute the children of the same root across threads so they execute in parallel, while child stealing keeps executing at the root node and pushes all of that root’s children into the same queue.

2. Performance optimization 2: locality, communication, and contention

2.1. communication

  • synchronous (blocking) send/recv
  • asynchronous (non-blocking) send/recv

2.1.1. Communication model

  1. non-pipelined communication (worked example after this list)
    • transfer time: $T(n) = T_0 + \frac{n}{B}$
    • T0: start-up latency
    • n: bytes transferred in operation
    • B: transfer rate, bandwidth of the link
    • effective bandwidth: $\frac{n}{T(n)}$
  2. a more general model of communication
    • total communication time = overhead + occupancy + network delay
    • overhead: time spent on the communication by a processor
    • occupancy: time for data to pass through slowest component of system, $T_0 + n/B_{smallest}$
    • network delay: everything else
  3. pipelined communication
    • total communication cost = communication time - overlap
    • overlap: portion of communication performed concurrently with other work
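
A worked instance of the non-pipelined model (item 1), using hypothetical numbers $T_0 = 10\,\mu s$ and $B = 1\,GB/s$: sending $n = 1\,KB$ takes $T(n) = 10 + 1 = 11\,\mu s$, an effective bandwidth of only about $0.09\,GB/s$, while sending $n = 1\,MB$ takes about $1010\,\mu s$, an effective bandwidth of roughly $0.99\,GB/s$. Larger transfers amortize the start-up latency, which is why batching small messages (section 2.3) pays off.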

There are two types of communication cost:

  • inherent communication: communication that must occur in a parallel algorithm, information that fundamentally must be moved between processors to carry out the algorithm
    • reduced via better work assignment decisions
  • artificial communication: all other communication
    • examples
      • the system has a minimum transfer granularity: a program that needs only a 4-byte float must still load the entire 64-byte cache line
      • data placement in distributed memory: data that lives far from the processor that accesses it most often causes unnecessary communication
      • limited replication capacity: the same data has to be communicated multiple times

2.1.2. metrics

  • communication-to-computation ratio
    • amount of communication / amount of computation
  • arithmetic intensity
    • 1 / communication-to-computation ratio
    • improve arithmetic intensity
      • schedule threads working on the same data structure at the same time on the same processor
      • reduce inherent communication
      • ex. CUDA thread block
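
An illustrative calculation (the standard grid-solver analysis, not taken from these notes): partition an $N \times N$ grid over $P$ processors. With 2D blocks, each processor computes $N^2/P$ elements per iteration but communicates only its boundary, roughly $4N/\sqrt{P}$ elements, so arithmetic intensity grows as $N/\sqrt{P}$. With 1D row blocks, each processor still computes $N^2/P$ elements but communicates about $2N$ boundary elements, giving a lower intensity of about $N/(2P)$; the 2D assignment therefore reduces inherent communication even though the work distribution is identical.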

2.2. contention

  • distributed work queues
  • grid of particles data structure on large parallel machine

Goal: compute which particles each cell contains.

  • approach 1: for each cell, serially test every particle
    • work-inefficient: 16x extra work
    • parallelism-inefficient: only 16 parallel tasks
  • approach 2: for each particle, compute its cell, then atomically update that cell’s list
    • massive contention
      • all cell lists sharing one lock
      • or fine-grained, per-cell locks
  • approach 3: compute partial results per thread, then merge them at the end
  • approach 4: data-parallel approach (see the sketch after this list)
    • compute the cell of each particle and build an array of (particle index, grid index) pairs, indexed by particle
    • sort the array by grid index
    • in parallel over the array elements, find each cell’s start and end index within the sorted array
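
A sketch of approach 4, written serially for clarity; each step corresponds to a data-parallel primitive (map, sort, map), e.g. a parallel radix sort on a GPU. Particle, grid_dim, and cell_of are illustrative names, and particle positions are assumed to already be in cell coordinates:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct Particle { float x, y; };

// Hypothetical helper: map a particle position to a cell index on a grid_dim x grid_dim grid.
int cell_of(const Particle& p, int grid_dim) {
    return (int)p.y * grid_dim + (int)p.x;
}

void bin_particles(const std::vector<Particle>& particles, int grid_dim,
                   std::vector<int>& cell_start, std::vector<int>& cell_end) {
    int num_cells = grid_dim * grid_dim;
    // Step 1 (parallel map): compute (grid index, particle index) for each particle.
    std::vector<std::pair<int, int>> keyed(particles.size());
    for (size_t i = 0; i < particles.size(); i++)
        keyed[i] = { cell_of(particles[i], grid_dim), (int)i };
    // Step 2 (parallel sort): sort by grid index so each cell's particles become contiguous.
    std::sort(keyed.begin(), keyed.end());
    // Step 3 (parallel map): element i starts a cell if its grid index differs from
    // element i-1's, and ends a cell if it differs from element i+1's; no contention,
    // since each element writes distinct entries.
    cell_start.assign(num_cells, -1);
    cell_end.assign(num_cells, -1);
    for (size_t i = 0; i < keyed.size(); i++) {
        int c = keyed[i].first;
        if (i == 0 || c != keyed[i - 1].first)                cell_start[c] = (int)i;
        if (i + 1 == keyed.size() || c != keyed[i + 1].first) cell_end[c] = (int)i + 1;
    }
}
```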

2.3. reduce communication costs

  • reduce overhead of communication to sender/receiver
    • send fewer messages, make messages larger, amortize the overhead
    • batch small messages into large ones
  • reduce delay
    • programmer: utilize locality in code
    • hw implementor: improve communication architecture
  • reduce contention
    • replicate contended resources (local copies, fine-grained locks, per thread queue)
    • stagger access to contended resources
  • increase communication/computation overlap
    • programmer: use async communication (see the sketch after this list)
    • hw implementor: pipelining, multi-threading, pre-fetching, out-of-order
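
A sketch of overlapping communication with independent computation using non-blocking MPI calls; the buffer sizes, tag, and compute_* functions are placeholders, not from the original notes:

```cpp
#include <mpi.h>

void compute_on_local_data();                       // hypothetical work needing no remote data
void compute_on_remote_data(const double*, int);    // hypothetical consumer of received data

void exchange_and_compute(double* send_buf, double* recv_buf, int n, int peer) {
    MPI_Request reqs[2];
    // Start both transfers asynchronously; they proceed while we compute.
    MPI_Isend(send_buf, n, MPI_DOUBLE, peer, /*tag=*/0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, peer, /*tag=*/0, MPI_COMM_WORLD, &reqs[1]);

    compute_on_local_data();                        // overlap: work that does not touch recv_buf

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);      // block only when the remote data is needed
    compute_on_remote_data(recv_buf, n);
}
```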

3. Summary: parallelization workflow

  • decomposition
  • assignment
    • load balancing
    • maximize locality
    • minimize extra work, e.g. contention and communication
  • orchestration
  • mapping