Parallel Architecture and Programming (8): Synchronization - the Implementation of Locks and Barriers

Perfect is the enemy of the good.

1. basic framework

  • acquire method: how threads gain access to the protected resource
  • waiting algorithm: how threads wait for access to be granted
  • release method: how a thread enables other threads to gain access to the resource

2. lock implementation

lock:
    ld R0, mem[addr]    // load lock value
    cmp R0, #0          // compare to 0 (unlocked)
    bnz lock            // lock is held: retry
    st mem[addr], #1    // lock appeared free: store 1 to claim it

unlock:
    st mem[addr], #0    // store 0 to release

This load-test-store sequence cannot guarantee atomicity: two threads can both load 0 before either stores 1, so both believe they hold the lock.

2.1. test-and-set lock

idea: read and update the lock variable atomically, in a single instruction

ts R0, mem[addr] // load mem[addr] into R0
                 // if mem[addr] is 0, set mem[addr] to 1
lock:
    ts R0, mem[addr] // atomically read the lock and try to claim it
    bnz R0, lock     // retry if the old value was nonzero (lock was held)

unlock:
    st mem[addr], #0 // store 0 to address

Every test-and-set issues a BusRdX transaction (the write requires exclusive ownership of the cache line), even when the attempt fails to acquire the lock.

As the number of processors grows, bus contention grows with it, and even the lock holder must wait for the bus just to release the lock.
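
For concreteness, here is a minimal sketch of the same lock in portable C11 (the ts_lock_* names are mine, not from the lecture); atomic_exchange is the standard library's counterpart of the ts instruction:

#include <stdatomic.h>

typedef atomic_int ts_lock; // 0 = free, 1 = held

void ts_lock_acquire(ts_lock *l) {
    // atomically write 1 and get the previous value, like ts above
    while (atomic_exchange(l, 1) != 0)
        ; // previous value was 1: someone else holds the lock, retry
}

void ts_lock_release(ts_lock *l) {
    atomic_store(l, 0); // store 0 to release, as in the unlock sequence
}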

2.2. lock performance characteristics

  • low latency under no contention
  • low interconnect traffic
  • scalability
    • latency and traffic should scale reasonably
  • low storage cost
  • fairness
    • avoid starvation or substantial unfairness

2.3. test-and-test-and-set lock

idea: change most BusRdX transactions to BusRd, i.e. spin with ordinary reads on the cached value and attempt the atomic test-and-set only when the lock appears free

void Lock(volatile int *lock) {
    while (1) {
        while (*lock != 0); // spin with ordinary reads while the lock is held

        if (test_and_set(lock) == 0) // lock appeared free: one atomic attempt
            return;
    }
}

void Unlock(volatile int *lock) {
    *lock = 0;
}
  • slightly higher latency
  • much less interconnect traffic
  • more scalable (due to less traffic)
  • low storage cost
  • no provisions for fairness

2.4. test-and-set lock with back-off

void Lock(volatile int *lock) {
    int amount = 1;
    while (1) {
        if (test_and_set(lock) == 0)
            return;
        delay(amount); // back off before retrying
        amount *= 2;   // exponential back-off: double the wait after each failure
    }
}
  • same uncontended latency, but potentially higher latency under contention
  • less traffic
  • improves scalability
  • low storage cost
  • exponential back-off can cause severe unfairness
    • newer requesters back off for shorter intervals
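
One common mitigation, sketched here as an assumption rather than taken from the lecture, is to cap the delay so that no requester ever backs off longer than a fixed bound (max_amount is an arbitrary tuning choice):

void Lock(volatile int *lock) {
    int amount = 1;
    const int max_amount = 1024; // cap on the back-off interval
    while (1) {
        if (test_and_set(lock) == 0)
            return;
        delay(amount);
        if (amount < max_amount)
            amount *= 2; // double the wait, but never beyond the cap
    }
}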

2.5. ticket lock

thundering herd: upon a release, every waiting processor attempts to acquire the lock at once using test-and-set, producing a burst of traffic. The ticket lock avoids this by granting the lock in arrival order.

struct lock {
    volatile int next_ticket;
    volatile int now_serving;
};

void Lock(lock *l) {
    // take a ticket: the only atomic operation in the acquire path
    // (atomic_increment returns the value before the increment)
    int my_ticket = atomic_increment(&l->next_ticket);
    while (my_ticket != l->now_serving); // read-only spin until my turn
}

void unlock(lock *l) {
    l->now_serving++; // admit the holder of the next ticket
}
  • only one atomic operation per acquire (the ticket fetch-and-increment); the spin itself uses ordinary reads
  • still O(P) interconnect traffic per release: updating now_serving invalidates every waiter's cached copy, and all P processors re-read it
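
A self-contained C11 sketch of the same structure (the ticket_lock_* names are mine; atomic_fetch_add is the standard fetch-and-increment, playing the role of atomic_increment above):

#include <stdatomic.h>

typedef struct {
    atomic_int next_ticket; // both fields start at 0
    atomic_int now_serving;
} ticket_lock;

void ticket_lock_acquire(ticket_lock *l) {
    // the only atomic read-modify-write in the acquire path
    int my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my_ticket)
        ; // read-only spin: traffic only when now_serving changes
}

void ticket_lock_release(ticket_lock *l) {
    atomic_fetch_add(&l->now_serving, 1); // admit the next ticket holder
}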

2.6. array-based lock

idea: each processor spins on a different memory address

struct lock {
    volatile padded_int status[P]; // one flag per processor, each on its own cache line
    volatile int head;
};

int my_element;

void Lock(lock *l) {
    my_element = atomic_circ_increment(&l->head); // circular increment: claim my slot
    while (l->status[my_element] == 1); // spin on my own slot only
}

void unlock(lock *l) {
    l->status[my_element] = 1;           // mark my slot as "must wait" for its next user
    l->status[cir_next(my_element)] = 0; // release exactly the next waiter in line
}
  • O(P) storage cost
  • O(1) interconnect traffic per release
  • atomic circular increment is a more complex operation (slightly higher overhead)
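
The padded_int type is assumed to give each status word its own cache line so that waiters never invalidate one another (the pseudocode above elides the .value access); a sketch assuming 64-byte lines:

#define CACHE_LINE_SIZE 64 // assumed line size

typedef struct {
    volatile int value;
    char pad[CACHE_LINE_SIZE - sizeof(int)]; // pad out to a full cache line
} padded_int;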

2.7. queue-based lock (MCS lock)

x86 cmpxchg:

lock cmpxchg dst, src

// semantics (pseudocode):
if (dst == EAX) {
    ZF = 1
    dst = src   // expected value found: swap in the new value
} else {
    ZF = 0
    EAX = dst   // report the value actually found
}

// use cmpxchg to implement test-and-set:
xor EAX, EAX              // expected value: 0 (unlocked)
lock cmpxchg mem[addr], 1 // if mem[addr] == 0, set it to 1 (acquired)
  • idea: build a queue of waiters in which each thread allocates a local node to spin on
  • queued spinlocks
    • queue head: current lock holder
    • queue tail: pointed to by glock, the most recently arrived waiter
    • operations
      • acquire: atomically swap glock to append your node at the tail
      • release: if the holder's node is still the tail (no successor yet), the holder atomically CMPXCHGs glock from its node to NULL to free the lock; otherwise it hands the lock directly to its successor
// glock - global lock (queue tail pointer)
// mlock - my lock node (state, next pointer)

AcquireQLock(*glock, *mlock) {
    mlock->next = NULL;
    mlock->state = UNLOCKED;
    prev = mlock;
    ATOMIC_SWAP(glock, prev); // glock now points to my node;
                              // prev receives the previous tail
    if (prev == NULL) return; // queue was empty: no lock holder, I own the lock
    mlock->state = LOCKED;
    prev->next = mlock;       // link my node behind the previous tail
    while (mlock->state == LOCKED); // spin on my own node only
}

ReleaseQLock(*glock, *mlock) {
    // I'm the head of the queue
    do {
        if (mlock->next == NULL) {
            // no visible successor: try to swing the tail back to NULL
            x = CMPXCHG(glock, mlock, NULL); // if glock == mlock, set glock to NULL and return glock
                                             // otherwise return the value in glock
            if (x == mlock) return;
            // failed: a new waiter swapped glock but has not linked itself
            // yet, so loop until its next pointer becomes visible
        } else {
            mlock->next->state = UNLOCKED; // hand the lock to my successor
            return;
        }
    } while (1);
}
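
Unlike the earlier locks, the caller supplies the queue node. A typical usage pattern, sketched under the assumption that the node type is called qnode and that a node can live on the acquiring thread's stack:

void with_lock(qnode **glock) {
    qnode mlock;                // my queue node: one per acquisition
    AcquireQLock(glock, &mlock);
    // critical section: each waiter spins on its own node's state,
    // so a release invalidates only the successor's cache line
    ReleaseQLock(glock, &mlock);
}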

3. barrier implementation

3.1. centralized barrier

buggy version: when the first thread to arrive at the next barrier clears the flag, some threads may still be waiting in the flag-check loop of the previous barrier; they would then spin forever.

struct Barrier_t {
    LOCK lock;
    int counter;
    int flag;
};

// barrier for p processors
void Barrier(Barrier_t *b, int p) {
    lock(b->lock);
    if (b->counter == 0) {
        // first thread arrived clears the flag
        b->flag = 0;
    }
    int num_arrived = ++(b->counter);
    unlock(b->lock);

    if (num_arrived == p) {
        // last arrived thread sets the flag
        b->counter = 0;
        b->flag = 1;
    } else {
        // many other threads may wait in the loop
        while (b->flag == 0);
    }
}

solution: use a second counter to count the threads that have left the barrier, and do not clear the flag until every thread has left the previous barrier.

struct Barrier_t {
    LOCK lock;
    int arrive_counter;
    int leave_counter;
    int flag;
};

// barrier for p processors
void Barrier(Barrier_t *b, int p) {
    lock(b->lock);
    if (b->arrive_counter == 0) {
        if (b->leave_counter == p) {
            b->flag = 0;
        } else {
            unlock(b->lock);
            while (b->leave_counter != p); // wait for all threads to leave the previous barrier
            lock(b->lock);
            b->flag = 0;
        }
    }
    int num_arrived = ++(b->arrive_counter);
    unlock(b->lock);

    if (num_arrived == p) {
        b->arrive_counter = 0;
        b->leave_counter = 1; // the last arriver counts as already having left
        b->flag = 1;
    } else {
        while (b->flag == 0);
        lock(b->lock);
        b->leave_counter++;
        unlock(b->lock);
    }
}

3.2. centralized barrier with sense reversal

Main idea: each processor flips a private local_sense on every barrier entry and waits for the shared flag to equal it. The last arriver sets the flag to the new sense at the exit point (when counter == p), so the flag never needs to be cleared.

struct Barrier_t {
    LOCK lock;
    int counter;
    int flag;
};

int local_sense = 0; // private per processor

// barrier for p processors
void Barrier(Barrier_t *b, int p) {
    local_sense = (local_sense == 0) ? 1 : 0; // flip sense on every barrier entry
    lock(b->lock);
    int num_arrived = ++(b->counter);
    if (num_arrived == p) {
        // last arriver resets the counter and releases everyone
        unlock(b->lock);
        b->counter = 0;
        b->flag = local_sense;
    } else {
        unlock(b->lock);
        while (b->flag != local_sense); // wait for flag to match my sense
    }
}

3.3. combining tree implementation of barrier
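
The centralized barriers above funnel all P processors through a single lock, counter, and flag, so contention at that one location grows with P. A combining tree splits the barrier into a tree of small barriers: each group of threads synchronizes at a leaf node, the last arriver of each group ascends to the parent, and the release propagates back down the tree, so no single memory location sees more than fan-in-way contention. Below is a sketch combining this idea with the sense-reversal flag from 3.2 (the types, names, and fan-in of 2 are my assumptions, not from the lecture):

#include <stdatomic.h>

#define FANIN 2 // arbitrary here; real trees often match the machine topology

typedef struct TreeNode {
    atomic_int count;        // arrivals at this node in the current episode
    atomic_int sense;        // flips once per episode, releasing this node's waiters
    struct TreeNode *parent; // NULL at the root
} TreeNode;

// local_sense is per-thread and flipped by the caller before every barrier,
// exactly as in the sense-reversal barrier above
void tree_barrier(TreeNode *n, int local_sense) {
    if (atomic_fetch_add(&n->count, 1) + 1 == FANIN) {
        // last arriver here: report the whole subtree's arrival upward
        if (n->parent)
            tree_barrier(n->parent, local_sense);
        atomic_store(&n->count, 0);           // reset before releasing
        atomic_store(&n->sense, local_sense); // release this node's waiters
    } else {
        while (atomic_load(&n->sense) != local_sense)
            ; // spin on this node's flag only, never on a global location
    }
}

Each thread is statically assigned to a leaf, so a release cascades down the tree level by level instead of broadcasting a single flag update to all P waiters at once.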