Measure then build.
1 |
|
- declarative construct: perform this set of operations atomically
1. motivation of transactional memory
- memory transaction
- an atomic and isolated
- atomicity (all or nothing)
- upon transaction commit, all memory writes in a transaction take effect at once
- on transaction abort, none of the writes appear to take effect
- isolation
- no other processor can observe writes before transaction commits
- serializability
- transactions appear to commit in a single serial order
- durability
- changes persist
- but memory is not durable…
- fine-grained lock: may prevent concurrency in tree-type data structures
- transactional memory: only abort when read-write or write-write conflicts
1.1. LL/SC
- LL/SC: lite version of transactional memory (single address), lock-free atomic read-modify-write operation
- load_linked(x): load value from address and return current value
- store_conditional(x, value): store value to x only if x hasn’t been written to since corresponding LL
- enable multiple temporary local operations between these two instructions
ll(account value); tmp+=xxx; sc(account value);
- no ABA problem
- implementation
- CPU track the load-linked address at a cache-line granularity, such that any modification to any portion of the cache line (another core’s store-conditional or ordinary store) will cause the store-conditional to fail
1.2. locks vs. transactions
1 |
|
- lock
- programmer provides “undo” code on a case-by-case basis
- locks may be held by a failing thread
- transaction
- system responsible for processing exceptions
- no partial commit: transaction is aborted and memory updates are undone
1 |
|
- fine grained lock: good for performance, but possibility of dead lock
- requires system-wide policies to get correct
- system-wide policies can break software modularity
- transactions
- system manages concurrency as well as possible serialization
1.3. pros/cons of TM
- pros
- easy to use
- often perform as well as fine-grained locks
- performance portability: locking scheme for 4 cpus may not be the best scheme for 64 cpus
- failure atomicity and recovery
- no lost locks when a thread fails
- failure recovery = transaction abort + restart
- composability
- safe and scalable composition of software modules
- cons
- not mutual exclusion
- can not just replace
synchronized
withatomic
- can not just replace
- atomic code sequence
- not mutual exclusion
1 |
|
2. implementation of transactional memory
- TM systems must provide atomicity and isolation
- implementation requirements
- data versioning
- conflict detection and resolution
- implementation policy
- hardware transactional memory (HTM)
- software transactional memory (STM)
- hybrid transactional memory
- hardware-accelerated STMs
2.1. data versioning
- data versioning: manage uncommitted and previously committed versions of data for concurrent transactions
- eager versioning (undo-log based)
- lazy versioning (write-buffer based)
2.1.1. eager versioning
update memory immediately, maintain “undo log” in case of abort
- pro: faster commit
- con: slower aborts, fault tolerance issues (roll back)
2.1.2. lazy versioning
log memory updates in transaction write buffer, flush buffer on commit
- pro: faster abort (only need to clear log)
- con: slower commits
2.2. conflict detection
TM system must detect and handle conflicts between transactions.
- read-write conflict: transaction A reads address X, which was written to by pending transaction B
- write-write conflict: transactions A and B both pending, and both write to address X
Solution: system must track a transaction’s read set and write set
- read-set: addresses read within the transaction
- write-set: addresses written within the transaction
check: whether a transaction’s write set is overlapping with another transaction’s read/write set
2.2.1. pessimistic detection
- check for conflicts during each loads or stores
- pessimistic: I suspect conflicts might happen, so let’s always check to see if one has occurred after each memory operation
- “contention manager”: decides to stall or abort transaction
2.2.2. optimistic detection
- detect conflicts when a transaction attempts to commit
- HW: only when we want to commit then we send cache coherence requests
- optimistic: let’s hope for the best and sort out all the conflicts only when the transaction tries to commit
- on a conflict, give priority to committing transaction: can ensure we make progress
Note: can use optimistic and pessimistic schemes together (optimistic for reads and pessimistic for writes)
注: 其他事务的提交可能导致某个正在进行的事务的restart。
2.2.3. tradeoffs
- pessimistic
- pro: detect conflicts early
- cons:
- no forward progress guarantees
- fine-grained communication
- detection on critical path
- optimistic
- pros:
- forward progress guarantees
- bulk communication and conflict detection
- cons:
- detects conflicts late
- fairness issue
- pros:
conflicts detection granularity
- object
- machine word
- cache line
2.3. real world
- Hardware TM systems
- lazy + optimistic
- lazy + pessimistic
- eager + pessimistic
- eager + optimistic: not practical, isolation is violated by this combination. eager情况下内存已经更新了,乐观仅在commit时check,因此其它事务/非事务可能读到未提交的事务的更新。
- Software TM systems
- lazy + optimistic(rd/wr)
- lazy + optimistic(rd)/pessimistic(wr)
- eager + optimistic(rd)/pessimistic(wr)
- eager + pessimistic(rd/wr)
- Optimal design remains an open question
2.4. hardware transactional memory
- data versioning is implemented in caches
- cache the write buffer or the undo log
- add new cache line metadata to track transaction read set and write set
- conflict detection through cache coherence protocol
- coherence lookups detect conflicts between transactions
- Note
- register checkpoint be taken at transaction begin, so that it is possible to restore execution context state on abort
2.4.1. implementation details
- data versioning
- R bit: read set
- W bit: write set
- R/W bits cleared on transaction commit or abort
- eager versioning: 2nd cache write for undo log
- conflicts detection
- observing shared request on W-word is a read-write conflict
- observing exclusive request to R-word is a write-read conflict
- observing exclusive request to W-word is a write-write conflict
2.4.2. lazy-optimistic HTM implementation
- CPU changes
- ability to checkpoint register state
- TM state registers (status, pointers to abort handlers)
- Cache changes
- R/W bit
- transaction begin
- initialize CPU and cache state
- take register checkpoint
- load operation
- serve cache miss if needed
- mark data as part of read set
- store operation
- serve cache miss if needed
- mark data as part of write set (not a load into exclusive state because lazy)
- transaction commit: fast two-phase commit
- validate: request RdX access to write set lines (UpgradeX)
- commit: gang-reset R and W bits, turns write set data to valid (dirty) data
- fast conflict detection and abort
- check: receive coherence requests from another core’s commit, lookup read set and write set
- abort: invalidate write set, gang-reset R/W bits, restore to register checkpoint
2.4.3. real world: hardware transactional memory support in Intel Haswell
- instructions
xbegin
: takes pointer to “fallback address” in case of abortxend
xabort
- implementation: tracks read and write set in L1 cache
- processor makes sure all memory operations commit atomically
- processor may automatically abort transaction: eviction of line in read or write set, even in the same transaction
1 |
|
_stop_the_world
: make sure fallback can proceed- in other words programmer should make sure
- the fallback path must not overlap with other transactions
- the lock path must prevent transactions from committing
- TM implementation itself doesn’t guarantee progress
- in other words programmer should make sure
2.4.4. performance
- TSX can only track a limited number of locations
- minimize memory touched in case cache line eviction within a single transaction
- transaction cost
- approximately equal to the cost of six atomic primitives to the same cache line