Parallel Architecture and Programming (12): Tolerating Latency Through Prefetching

Processor speeds have been increasing much faster than memory speeds. Caches help, but they are not a panacea, and NUMA architectures make the problem worse.

challenge - NUMA multiprocessors:

  • remote memory accesses
  • cache coherence costs more
    • typically directory-based cache coherence

idea: tolerate latency through prefetching, by overlapping memory accesses with computation and with other accesses

  • benefits
    • prefetching early enough completely hides the memory latency
    • issuing prefetches in blocks
      • pipelines the misses
      • only the first reference stalls
    • prefetching with ownership
      • reduces write latency and coherence messages

1. types of prefetching

  • large cache blocks
    • limitations: relies on spatial locality, causes false sharing
  • hardware-controlled prefetching: the hardware can detect simple strided access patterns
    • limitations: only simple patterns, stops at page boundaries, cache pollution
  • software-controlled prefetching
    • modern instruction sets provide explicit prefetch instructions for software to insert (a small example follows this list)
    • limitations: instruction overhead
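
For instance, with GCC/Clang the programmer (or compiler) can drop in `__builtin_prefetch` a fixed distance ahead of the use; the distance of 16 elements below is purely illustrative, not a tuned value.

```c
/* Software-controlled prefetching with an explicit prefetch instruction
 * (GCC/Clang __builtin_prefetch).  The distance of 16 elements is only
 * illustrative; the prefetch is a non-binding hint, so running slightly
 * past the end of the array is harmless. */
double sum_array(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16]);
        s += a[i];
    }
    return s;
}
```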

2. compiler-based prefetching, a.k.a. fully automated prefetch insertion

  • access patterns
    • arrays, pointers
  • goal
    • maximize benefit
    • minimize overhead

2.1. prefetching concepts

  • pre-conditions: address is predictable
  • coverage factor: fraction of misses that are prefetched
    • unnecessary prefetch: the data is already in the cache
    • effective prefetch: the prefetched data is in the cache when it is later referenced
  • analysis: what to prefetch
    • maximize coverage factor
    • minimize unnecessary prefetches
  • scheduling: when/how to schedule prefetches
    • when: at the points where a cache miss is expected to happen
    • minimize overhead per prefetch

Note: many programs already achieve a high cache hit rate, so reducing unnecessary prefetches is very important.

2.2. compiler algorithm

  • step 1: what to prefetch
    • locality analysis
  • step 2: instruction scheduling (when/how to issue prefetches)
    • loop splitting
    • software pipelining

2.2.1. locality analysis steps

  1. find data reuse
    • prefetch only the memory accesses that will suffer cache misses
    • reused data has the potential to hit in the cache; accesses that will hit do not need to be prefetched
  2. determine “localized iteration space”
    • set of inner loops where the data accessed by an iteration is expected to fit within the cache
  3. find data locality
    • reuse and localized iteration space

2.2.1.1. data reuse

suppose one cache line can hold 2 elements

the graph is called the iteration space: the two axes represent the inner and outer loop indices, i.e., the iterations over time
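
The loop nest from the original figure is not reproduced in these notes; judging from the bullets below, it is presumably of roughly this form (the bounds and declarations are assumptions).

```c
#define N 1024
double A[N][N];
double B[N + 1][1];

/* Assumed example loop: each iteration (i, j) touches A[i][j], B[j][0],
 * and B[j+1][0].  With 2 elements per cache line:
 *   - A[i][j]   : spatial locality  (shares a line with A[i][j+1])
 *   - B[j+1][0] : temporal locality (reused for every outer iteration i)
 *   - B[j][0]   : group locality    (its line was loaded by B[j+1][0]
 *                                    one inner iteration earlier)        */
void example(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = B[j][0] + B[j + 1][0];
}
```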

  • spatial locality: accesses to elements in the same cache line
    • e.g., A[i][j]
  • temporal locality: accesses to the same element
    • e.g., B[j+1][0]: in the example, the B array is reused across different values of i (the outer loop)
  • group locality:
    • e.g., B[j][0]: only the first access misses; later accesses hit because B[j+1][0] has already brought the line into the cache

2.2.1.2. localized iteration space
  • the compiler is given the cache size as a parameter; it estimates how much data a loop touches and uses that to decide whether the accesses will suffer capacity misses

2.2.1.3. find data locality

| locality type | miss instances                           | predicate     |
|---------------|------------------------------------------|---------------|
| none          | every iteration                          | true          |
| temporal      | first iteration                          | i = 0         |
| spatial       | every l iterations (l = cache line size) | (i mod l) = 0 |

The memory accesses that miss in the cache are the potential prefetching opportunities.

2.2.2. schedule

2.2.2.1. loop splitting
  • strawman method: guard each prefetch with an IF statement that tests the miss predicate
    • the test itself costs too much
  • instead, decompose the loops to isolate the cache-miss instances

loop transformation method

| locality type | loop transformation                       |
|---------------|-------------------------------------------|
| none          | insert the prefetch directly              |
| temporal      | peel out loop i, then insert the prefetch |
| spatial       | unroll loop i by l, then insert the prefetch once |
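
As a sketch of the spatial case, continuing the assumed example loop from above: unrolling the inner loop by the cache-line factor (2 elements per line) lets a single prefetch cover each line. For clarity the prefetch here is only one line ahead; software pipelining (next) schedules it further ahead.

```c
void example_spatial(void) {       /* A, B, N as declared in the example above */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j += 2) {
            __builtin_prefetch(&A[i][j + 2]);        /* next cache line of A */
            A[i][j]     = B[j][0]     + B[j + 1][0];
            A[i][j + 1] = B[j + 1][0] + B[j + 2][0];
        }
    }
}
```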

2.2.2.2. software pipelining
  • prefetch a fixed number of iterations ahead
  • given parameters
    • l = memory latency (cycles)
    • s = cycles along the shortest path through the loop body

$Iterations\;Ahead = \lceil\frac{l}{s}\rceil$
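
A minimal sketch of the resulting code structure for a simple 1-D reduction, with an assumed prefetch distance of D = 16 iterations standing in for $\lceil l/s \rceil$; a real compiler would also combine this with the unrolling above so that only one prefetch is issued per cache line.

```c
/* Software-pipelined prefetching for a simple 1-D reduction.
 * The loop is split into a prolog, a steady state, and an epilog. */
enum { D = 16 };                            /* illustrative prefetch distance */

double reduce(const double *a, int n) {
    double s = 0.0;
    int i;
    /* prolog: issue the first D prefetches, no work yet */
    for (i = 0; i < D && i < n; i++)
        __builtin_prefetch(&a[i]);
    /* steady state: prefetch D iterations ahead while doing the work */
    for (i = 0; i + D < n; i++) {
        __builtin_prefetch(&a[i + D]);
        s += a[i];
    }
    /* epilog: the last D iterations, nothing left to prefetch */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```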

2.2.3. example

2.3. evaluation

performance

effectiveness

effectiveness of software pipelining

2.4. prefetching indirections

A[index[i]]

  • analysis: what to prefetch
    • hard to predict whether an indirect reference hits or misses
    • in general, just prefetch it
  • scheduling: when/how to issue prefetches

2.4.1. software pipelining for indirections

a 2-level-deep pipeline: the dense reference index[i] is prefetched farther ahead than the indirect reference A[index[i]]
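
A sketch of that two-level pipeline, assuming a prefetch distance D = 8: index[] is prefetched 2D iterations ahead and A[index[]] D iterations ahead, so each index value is already cached when its indirect prefetch is issued (prolog/epilog handling is replaced by bounds checks for brevity).

```c
/* Two-level-deep software pipeline for the indirect reference A[index[i]]:
 * stage 1 prefetches the dense array index[] 2*D ahead, stage 2 prefetches
 * the indirect target A[index[]] D ahead, using the index value that
 * stage 1 made cache-resident D iterations earlier. */
enum { D = 8 };

double indirect_sum(const double *A, const int *index, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 2 * D < n)
            __builtin_prefetch(&index[i + 2 * D]);   /* dense reference    */
        if (i + D < n)
            __builtin_prefetch(&A[index[i + D]]);    /* indirect reference */
        s += A[index[i]];
    }
    return s;
}
```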

2.4.2. evaluation

no prefetching vs. prefetching only the dense references vs. prefetching both the dense and the indirect references

3. prefetching for parallel shared-address-space machines

  • non-binding vs. binding prefetches: is the value bound at prefetch time or at load time?
    • binding prefetch: the value is determined at prefetch time, so it can become stale
      • e.g., prefetching into a register; the value may no longer be correct when it is used
    • non-binding prefetch: the value is determined at load time; the data is prefetched into the cache and may still be invalidated
      • no restrictions on when prefetches can be issued; the coherence protocol guarantees correctness even under an aggressive prefetching strategy
  • dealing with coherence misses
    • take explicit synchronization into account
  • optimization
    • exclusive-mode (read-exclusive) prefetch, shown in the sketch after this list
      • e.g., for X = X + 1, prefetch the line with ownership instead of issuing a plain read prefetch
      • a write buffer already hides part of the write latency
      • but exclusive-mode prefetching also hides the invalidation traffic
  • analysis
    • less time holding locks and less time inside critical sections, hence lower synchronization cost
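
A minimal sketch of the exclusive-mode idea, assuming a hypothetical pthread-protected shared counter: GCC's `__builtin_prefetch` with rw = 1 requests the line for writing (with ownership) before the critical section.

```c
#include <pthread.h>

/* Hypothetical shared counter protected by a lock. */
extern pthread_mutex_t lock;
extern long X;

void increment(void) {
    /* Exclusive-mode (prefetch-for-write) hint: rw = 1 asks for the line
     * with ownership, so the X = X + 1 below neither stalls on a read miss
     * nor needs a separate ownership upgrade and its invalidation traffic. */
    __builtin_prefetch(&X, 1);
    pthread_mutex_lock(&lock);
    X = X + 1;
    pthread_mutex_unlock(&lock);
}
```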

with multiple processors, the execution-time breakdown now also includes synchronization cost

  • pf-miss: found in the secondary cache
    • good: the data can still be found in the second-level cache
  • pf-miss: invalidated before use
  • pf-miss: replaced before use
  • pf-miss: issued too late

reduce message traffic

4. prefetch for database

4.1. hash join

  • build a hash table to index all tuples of the smaller relation
  • probe the hash table using all tuples of the larger relation
    • bad spatial locality

group prefetching
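
A rough sketch of group prefetching on the probe side. The hash-table layout, hash function, and group size G are all assumptions, and for simplicity only the bucket headers are prefetched; the published algorithms pipeline the remaining pointer dereferences as well.

```c
/* Group prefetching for hash-join probes (sketch).
 * Probe tuples are processed in groups of G: phase 1 issues prefetches
 * for all G bucket headers, phase 2 does the actual lookups, so the
 * cache misses of a whole group overlap instead of being serialized. */
enum { G = 16 };

typedef struct bucket { int key; int payload; struct bucket *next; } bucket_t;

static unsigned hash(int k) { return (unsigned)k * 2654435761u; }

long probe_all(bucket_t **table, unsigned mask, const int *keys, int n) {
    long matches = 0;
    for (int g = 0; g < n; g += G) {
        int end = (g + G < n) ? g + G : n;
        /* phase 1: prefetch the bucket header of every probe in the group */
        for (int i = g; i < end; i++)
            __builtin_prefetch(table[hash(keys[i]) & mask]);
        /* phase 2: walk the buckets; the headers should now be cached */
        for (int i = g; i < end; i++)
            for (bucket_t *b = table[hash(keys[i]) & mask]; b; b = b->next)
                if (b->key == keys[i])
                    matches++;
    }
    return matches;
}
```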

5. limitations: memory bandwidth

  • if the program is already bandwidth-limited, prefetching cannot help
  • configuration
    • last level cache
    • separate memory
  • same latency but higher bandwidth

6. prefetching for pointer-based data structures

in this example, the load latency is three times the per-node work: L = 3W

cache miss + work -> cache miss + work -> …

Without prefetching, the computation rate is 1/(L+W): every L+W cycles one node is processed.

Once the current node has been loaded, its next pointer is available, so the load of the next node can begin.

Prefetching a single node ahead overlaps one load with one node's work, giving a rate of 1/L: every L cycles one node is processed.

To prefetch, say, three nodes ahead, the addresses of the next three nodes are needed before the work can start, but they can only be obtained by following the intervening pointers: the pointer-chasing problem.

So any scheme that follows the pointer chain is limited to a rate of 1/L.

goal and solution

we want to achieve a computation rate of 1/W

key:

  • $n_i$ needs to know $\&n_{i+d}$ without referencing the $d-1$ intermediate nodes (where $d = \lceil L/W \rceil$, as in the earlier scheduling formula)

proposals

  • greedy prefetching (sketched below)
    • use the pointers that already exist in $n_i$ to approximate $\&n_{i+d}$
    • e.g., in a tree, prefetch all of the node's children
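
A sketch of greedy prefetching for a binary-tree traversal: the node's existing child pointers serve as the approximation of the nodes visited next (the node layout and process() hook are illustrative, not from the source).

```c
/* Greedy prefetching on a binary tree: the existing child pointers in the
 * current node approximate the nodes that will be visited a few steps later,
 * so both children are prefetched before the node itself is processed. */
typedef struct node { int value; struct node *left, *right; } node_t;

void process(node_t *n);            /* hypothetical per-node work */

void visit(node_t *n) {
    if (!n) return;
    __builtin_prefetch(n->left);    /* prefetch all children ...          */
    __builtin_prefetch(n->right);   /* ... whichever is visited next wins */
    process(n);
    visit(n->left);
    visit(n->right);
}
```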

  • history-pointer prefetching (sketched after the note below)
    • add new pointers to $n_i$ to approximate $\&n_{i+d}$
    • costs extra space, plus run-time overhead to construct the pointers

Note: this only pays off when the data structure is traversed repeatedly.
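
A sketch of how the jump pointers might be used and maintained on a linked list, assuming a distance D = 4 and a hypothetical process_l() work function; on the first traversal the jump pointers are still null, which is exactly why repeated traversals are required.

```c
/* History-pointer (jump-pointer) prefetching on a linked list: each node
 * carries a pointer to the node D hops ahead, recorded during an earlier
 * traversal and refreshed on this one. */
enum { D = 4 };

typedef struct lnode { int value; struct lnode *next; struct lnode *jump; } lnode_t;

void process_l(lnode_t *n);                 /* hypothetical per-node work */

void traverse(lnode_t *head) {
    lnode_t *hist[D] = { 0 };               /* the last D nodes visited */
    int i = 0;
    for (lnode_t *n = head; n; n = n->next, i++) {
        if (n->jump)
            __builtin_prefetch(n->jump);    /* approximation of &n_{i+D} */
        process_l(n);
        if (hist[i % D])                    /* the node visited D steps ago ... */
            hist[i % D]->jump = n;          /* ... now learns where we are      */
        hist[i % D] = n;
    }
}
```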

  • data-linearization prefetching (sketched below)
    • compute $\&n_{i+d}$ directly from $\&n_i$ (no pointer dereferences)
    • map nodes that are close in the traversal order to contiguous memory
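
A sketch of the linearized layout: if the nodes are stored in an array in traversal order, $\&n_{i+d}$ is plain address arithmetic and no intermediate node is dereferenced (the pool layout and process_p() are assumptions).

```c
/* Data-linearization prefetching: the nodes are stored contiguously in
 * traversal order, so the node d steps ahead sits at a computable address. */
enum { DIST = 4 };

typedef struct pnode { int value; struct pnode *next; } pnode_t;

void process_p(pnode_t *n);                 /* hypothetical per-node work */

void traverse_linear(pnode_t *nodes, int count) {
    for (int i = 0; i < count; i++) {
        if (i + DIST < count)
            __builtin_prefetch(&nodes[i + DIST]);   /* no pointer chasing */
        process_p(&nodes[i]);
    }
}
```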