并行架构与编程(十)synchronization - Transactional Memory

Measure then build.

1
2
3
4
5
6
7
void deposit(Acct account, int amount) {
    atomic {
        int tmp = bank.get(account);
        tmp += amount;
        bank.put(account, tmp);
    }
}
  • declarative construct: perform this set of operations atomically

1. motivation of transactional memory

  • memory transaction
    • an atomic and isolated
  • atomicity (all or nothing)
    • upon transaction commit, all memory writes in a transaction take effect at once
    • on transaction abort, none of the writes appear to take effect
  • isolation
    • no other processor can observe writes before transaction commits
  • serializability
    • transactions appear to commit in a single serial order
  • durability
    • changes persist
    • but memory is not durable…

  • fine-grained lock: may prevent concurrency in tree-type data structures
  • transactional memory: only abort when read-write or write-write conflicts

1.1. LL/SC

  • LL/SC: lite version of transactional memory (single address), lock-free atomic read-modify-write operation
    • load_linked(x): load value from address and return current value
    • store_conditional(x, value): store value to x only if x hasn’t been written to since corresponding LL
    • enable multiple temporary local operations between these two instructions
      • ll(account value); tmp+=xxx; sc(account value);
    • no ABA problem
  • implementation
    • CPU track the load-linked address at a cache-line granularity, such that any modification to any portion of the cache line (another core’s store-conditional or ordinary store) will cause the store-conditional to fail

1.2. locks vs. transactions

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
void transfer(A, B, amount) {
  synchronized(bank) {
    try {
      withdraw(A, amount);
      deposit(B, amount);
    }
    catch(exception1) {}
    catch(exception2) {}
  }
}

void transfer(A, B, amount) {
  atomic {
    withdraw(A, amount);
    deposit(B, amount);
  }
}
  • lock
    • programmer provides “undo” code on a case-by-case basis
    • locks may be held by a failing thread
  • transaction
    • system responsible for processing exceptions
    • no partial commit: transaction is aborted and memory updates are undone
1
2
3
4
5
6
7
8
9
10
11
12
void transfer(A, B, amount) {
  synchronized(A) {
    synchronized(B) {
      withdraw(A, amount);
      deposit(B, amount);
    }
  }
}

// dead lock
transfer(A, B, 10);
transfer(B, A, 10);
  • fine grained lock: good for performance, but possibility of dead lock
    • requires system-wide policies to get correct
    • system-wide policies can break software modularity
  • transactions
    • system manages concurrency as well as possible serialization

1.3. pros/cons of TM

  • pros
    • easy to use
    • often perform as well as fine-grained locks
      • performance portability: locking scheme for 4 cpus may not be the best scheme for 64 cpus
    • failure atomicity and recovery
      • no lost locks when a thread fails
      • failure recovery = transaction abort + restart
    • composability
      • safe and scalable composition of software modules
  • cons
    • not mutual exclusion
      • can not just replace synchronized with atomic
    • atomic code sequence
1
2
3
4
5
6
7
8
9
synchronized(lock1) {
  flagA = true;  // if atomic, cause code below to abort
  while (flagB == 0);
}

synchronized(lock2) {
  flagB = true;  // if atomic, cause code above to abort
  while (flagA == 0);
}

2. implementation of transactional memory

  • TM systems must provide atomicity and isolation
  • implementation requirements
    • data versioning
    • conflict detection and resolution
  • implementation policy
    • hardware transactional memory (HTM)
    • software transactional memory (STM)
    • hybrid transactional memory
      • hardware-accelerated STMs

2.1. data versioning

  • data versioning: manage uncommitted and previously committed versions of data for concurrent transactions
    • eager versioning (undo-log based)
    • lazy versioning (write-buffer based)

2.1.1. eager versioning

update memory immediately, maintain “undo log” in case of abort

  • pro: faster commit
  • con: slower aborts, fault tolerance issues (roll back)

2.1.2. lazy versioning

log memory updates in transaction write buffer, flush buffer on commit

  • pro: faster abort (only need to clear log)
  • con: slower commits

2.2. conflict detection

TM system must detect and handle conflicts between transactions.

  • read-write conflict: transaction A reads address X, which was written to by pending transaction B
  • write-write conflict: transactions A and B both pending, and both write to address X

Solution: system must track a transaction’s read set and write set

  • read-set: addresses read within the transaction
  • write-set: addresses written within the transaction

check: whether a transaction’s write set is overlapping with another transaction’s read/write set

2.2.1. pessimistic detection

  • check for conflicts during each loads or stores
    • pessimistic: I suspect conflicts might happen, so let’s always check to see if one has occurred after each memory operation
  • “contention manager”: decides to stall or abort transaction

2.2.2. optimistic detection

  • detect conflicts when a transaction attempts to commit
    • HW: only when we want to commit then we send cache coherence requests
    • optimistic: let’s hope for the best and sort out all the conflicts only when the transaction tries to commit
  • on a conflict, give priority to committing transaction: can ensure we make progress

Note: can use optimistic and pessimistic schemes together (optimistic for reads and pessimistic for writes)

注: 其他事务的提交可能导致某个正在进行的事务的restart。

2.2.3. tradeoffs

  • pessimistic
    • pro: detect conflicts early
    • cons:
      • no forward progress guarantees
      • fine-grained communication
      • detection on critical path
  • optimistic
    • pros:
      • forward progress guarantees
      • bulk communication and conflict detection
    • cons:
      • detects conflicts late
      • fairness issue

conflicts detection granularity

  • object
  • machine word
  • cache line

2.3. real world

  • Hardware TM systems
    • lazy + optimistic
    • lazy + pessimistic
    • eager + pessimistic
    • eager + optimistic: not practical, isolation is violated by this combination. eager情况下内存已经更新了,乐观仅在commit时check,因此其它事务/非事务可能读到未提交的事务的更新。
  • Software TM systems
    • lazy + optimistic(rd/wr)
    • lazy + optimistic(rd)/pessimistic(wr)
    • eager + optimistic(rd)/pessimistic(wr)
    • eager + pessimistic(rd/wr)
  • Optimal design remains an open question

2.4. hardware transactional memory

  • data versioning is implemented in caches
    • cache the write buffer or the undo log
    • add new cache line metadata to track transaction read set and write set
  • conflict detection through cache coherence protocol
    • coherence lookups detect conflicts between transactions
  • Note
    • register checkpoint be taken at transaction begin, so that it is possible to restore execution context state on abort

2.4.1. implementation details

  • data versioning
    • R bit: read set
    • W bit: write set
    • R/W bits cleared on transaction commit or abort
    • eager versioning: 2nd cache write for undo log
  • conflicts detection
    • observing shared request on W-word is a read-write conflict
    • observing exclusive request to R-word is a write-read conflict
    • observing exclusive request to W-word is a write-write conflict

2.4.2. lazy-optimistic HTM implementation

  • CPU changes
    • ability to checkpoint register state
    • TM state registers (status, pointers to abort handlers)
  • Cache changes
    • R/W bit
  • transaction begin
    • initialize CPU and cache state
    • take register checkpoint
  • load operation
    • serve cache miss if needed
    • mark data as part of read set
  • store operation
    • serve cache miss if needed
    • mark data as part of write set (not a load into exclusive state because lazy)
  • transaction commit: fast two-phase commit
    • validate: request RdX access to write set lines (UpgradeX)
    • commit: gang-reset R and W bits, turns write set data to valid (dirty) data

  • fast conflict detection and abort
    • check: receive coherence requests from another core’s commit, lookup read set and write set
    • abort: invalidate write set, gang-reset R/W bits, restore to register checkpoint

2.4.3. real world: hardware transactional memory support in Intel Haswell

  • instructions
    • xbegin: takes pointer to “fallback address” in case of abort
    • xend
    • xabort
  • implementation: tracks read and write set in L1 cache
  • processor makes sure all memory operations commit atomically
    • processor may automatically abort transaction: eviction of line in read or write set, even in the same transaction
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// fallback address example
Result status = _xbegin();
if (status == SUCCESS) {
  if (_stop_the_world)
    _xabort();
  ...
  _xend();
} else {
  /* Fall back path */
  lock();
  _stop_the_world = true;
  ...
  _stop_the_world = false;
  unlock();
}
  • _stop_the_world: make sure fallback can proceed
    • in other words programmer should make sure
      • the fallback path must not overlap with other transactions
      • the lock path must prevent transactions from committing
    • TM implementation itself doesn’t guarantee progress

2.4.4. performance

  • TSX can only track a limited number of locations
    • minimize memory touched in case cache line eviction within a single transaction
  • transaction cost
    • approximately equal to the cost of six atomic primitives to the same cache line