Tuesday, May 20, 2014

Relaxed Atomics optimizations for the Ticket Lock

The Ticket Lock is a very simple mechanism for implementing a mutual exclusion lock:
http://web.mit.edu/6.173/www/currentsemester/readings/R06-scalable-synchronization-1991.pdf
It was described by John Mellor-Crummey and Michael Scott in 1991 (see section 2.2 of the pdf above).

Thanks to the new memory model in C11 and C++11, we can use atomic operations on a variable, with or without acquire/release barriers.
When no barrier is used and only the atomicity property is guaranteed, it is called a "relaxed atomic":
http://en.cppreference.com/w/cpp/atomic/memory_order
Notice that in the C11/C++11 memory model, an atomic variable is not intrinsically "sequentially consistent" or "relaxed"; instead, we can choose its behavior on a per-access basis.
The only thing the atomic qualifier gives is the guarantee of atomicity, i.e. that the variable will be read or written in "one shot". This is different from other languages like Java or Scala, where (almost) all native types provide that guarantee and therefore act as a "relaxed atomic", and where a variable declared as sequentially consistent will be so for the entirety of the program (unless, in the case of Java, you use sun.misc.Unsafe). Although this subtle difference may seem unimportant, it has some relevance when the goal is to squeeze the maximum performance out of a synchronization or concurrent algorithm.

Suppose you want to declare an atomic integer in different languages; this is how you would go about it:
C11 (can be relaxed or sequentially consistent)
_Atomic int x;

C++11 or C++14 (can be relaxed or sequentially consistent)
atomic<int> x;

Java (sequentially consistent)
volatile int x;


With that info at hand, we can now write a simple implementation of the Ticket Lock in C11. Here's how it would look:


#include <stdatomic.h>
#include <sched.h>         // for sched_yield()

typedef struct
{
    _Atomic long ingress;  // each thread takes a ticket by incrementing this
    char padding[64];      // To avoid false sharing between ingress and egress
    _Atomic long egress;   // the ticket currently being served
} ticket_mutex_t;

void ticket_mutex_lock(ticket_mutex_t * self)
{
    // Take a ticket (the default seq_cst order implies acquire/release)
    long lingress = atomic_fetch_add(&self->ingress, 1);
    // Spin until it is our turn
    while (lingress != atomic_load(&self->egress)) {
        sched_yield();
    }
    // This thread has acquired the lock on the mutex
}

void ticket_mutex_unlock(ticket_mutex_t * self)
{
    // Serve the next ticket
    atomic_fetch_add(&self->egress, 1);
}
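
As a quick usage sketch (increment_counter and shared_counter are hypothetical names, not part of the original code), the mutex would be used like this:

static ticket_mutex_t mutex;     // static storage zero-initializes ingress and egress
static long shared_counter = 0;  // protected by the mutex

void increment_counter(void)
{
    ticket_mutex_lock(&mutex);
    shared_counter++;            // only one thread at a time executes this
    ticket_mutex_unlock(&mutex);
}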


There are three non-obvious optimizations we can do to this code by using relaxed atomics. Let's look at them one at a time.

1) In the lock() function, the first time egress is read, it can be done with a relaxed load.
The code for lock() would look like this:


void ticket_mutex_lock(ticket_mutex_t * self)
{
    long lingress = atomic_fetch_add(&self->ingress, 1);

    // If the ingress and egress match, then the lock has been acquired and
    // we don't even need to do an acquire barrier.
    if (lingress == atomic_load_explicit(&self->egress, memory_order_relaxed)) return ;

    while (lingress != atomic_load(&self->egress)) {
        sched_yield();  // Replace this with thrd_yield() if you use <threads.h>
    }
    // This thread has acquired the lock on the mutex
}
We can do this because there is an acquire barrier in the atomic_fetch_add() on ingress, which means that not only can the atomic_load_explicit(egress, memory_order_relaxed) never be executed before the atomic_fetch_add(), but it will also get an up-to-date value of egress, at least as recent as the moment the atomic_fetch_add(ingress) executed. This implies that the value read from egress will never be higher than the value returned by atomic_fetch_add(ingress).

On x86, this optimization will not give any performance gains because the acquire barrier "comes for free", so saving one acquire barrier gains nothing; but on other architectures, like ARMv8, it may provide some small gain. I'll let you know when I get my hands on an ARMv8 and a gcc with C11 or C++11 support for it.
PS: It turns out this optimization is not valid in a very special case when the CPU does branch prediction and re-orders instructions (see the discussion in the comments below). In any case, this first optimization doesn't play well with the C11/C++1x memory model, so better not to use it. I've updated the source code accordingly.


2) In the unlock() function, we can replace the atomic_fetch_add() with an atomic_load() and an atomic_store().
There is no need to do the read and the addition in a single atomic operation in unlock(), because only one thread at a time (the lock holder) is able to call this function. This means we can implement the increment on egress with an atomic_load() to read the current value, followed by an atomic_store() to write the current value plus one, as sketched below.
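
Here is how this intermediate version could look (a sketch, before optimization 3 is applied):

void ticket_mutex_unlock(ticket_mutex_t * self)
{
    // Splitting the fetch_add into a load and a store is safe because
    // only the lock holder ever writes to egress
    long legress = atomic_load(&self->egress);
    atomic_store(&self->egress, legress+1);
}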

3) The atomic_load() described above can be made relaxed.
Step 2 can be optimized even further by making the atomic_load() relaxed. An acquire barrier was executed in the atomic_fetch_add() on ingress in the lock() function or, in the worst case, in the atomic_load() of egress inside the while() loop, which guarantees that the current thread has the most up-to-date value of egress. We also have the guarantee that, between then and the call to unlock(), no other thread has modified the egress variable, so this is safe to do.
The final code is the following:


void ticket_mutex_unlock(ticket_mutex_t * self)
{
    long legress = atomic_load_explicit(&self->egress, memory_order_relaxed);
    atomic_store(&self->egress, legress+1);
}
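
Note that the atomic_store() above still uses the default memory_order_seq_cst, which includes the release semantics needed to publish the writes made inside the critical section; as pointed out in the comments below, a store with memory_order_release would also suffice.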


These are just a few examples of how to use relaxed atomics to optimize your code.
Now, to be honest, the improvements from these particular optimizations are very small, and some of them can be tricky to understand or even to prove correct. The upside is that this code is still cross-platform, i.e., you can run it on x86, or ppc, or mips, or arm, and with the memory model you have the confidence that the code is still safe (correct).
This is why I like the C11/C++1x memory model so much :)

Source code for this C11 Ticket Lock is available on github:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/C11/locks/ticket_mutex.h
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/C11/locks/ticket_mutex.c

4 comments:

  1. I believe your optimization (1) is unsound in the C11/C++11 model. Indeed,
    while fetch_add is an acquire operation, it only synchronizes on the location
    it is operating on. So the fetch_add from the lock cannot synchronize with the
    fetch_add from the unlock (ingress is not the same location as egress). So your
    code could break provided a sufficiently aggressive compiler and processor (I
    suspect Itanium would do the trick, but I don't know its memory model well
    enough to say from memory). For example one thread could take the lock,
    initialize a variable, release the lock; then another thread could take the lock
    and still see the variable uninitialized.

    I agree with your optimizations (2) and (3).

    Two other possibilities exist:
    - use the release/acquire memory orders instead of the default seq_cst (you are not using its stronger guarantees, and it is slower).
    - use an atomic_thread_fence(acquire) after the loop in lock(), turning all the accesses in the loop into relaxed accesses.

    This would result in the following code:

    void ticket_mutex_lock(ticket_mutex_t * self)
    {
        long lingress = atomic_fetch_add(&self->ingress, 1);
        while (lingress != atomic_load_explicit(&self->egress, memory_order_relaxed)) {
            sched_yield();
        }
        atomic_thread_fence(memory_order_acquire);
    }

    void ticket_mutex_unlock(ticket_mutex_t * self)
    {
        long legress = atomic_load_explicit(&self->egress, memory_order_relaxed);
        atomic_store_explicit(&self->egress, legress+1, memory_order_release);
    }

    I suspect the fetch_add in lock() could be made relaxed too, but it is not obvious, and may be wrong.

    Best regards,
    Robin

  2. Hi Robin,
    Thanks for the comment.
    I think you may be confusing memory_order_consume with memory_order_seq_cst:
    http://en.cppreference.com/w/cpp/atomic/memory_order
    If we were doing atomic_fetch_add_explicit(1, std::memory_order_consume), then the reasoning you describe would be correct, because memory_order_consume only acts on the affected memory location. But we are using the default memory order, which is memory_order_seq_cst; it guarantees an "implicit" acquire and release when doing the atomic_fetch_add(), and as such prevents re-ordering with other code (at either the CPU level or the compiler level).
    Therefore, I believe (1) to be a valid optimization.

  3. Thanks, but I am not confusing it with memory_order_consume. The 'implicit' acquire means that *if* the fetch_add were to synchronize with another access, then every access afterwards would see everything that happened before that other access (which is indeed not even true for consume).
    But fetch_add can only synchronize with accesses to the same variable. In the jargon of the standard: "An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic
    operation B that performs an acquire operation on M and takes its value from any side effect in the release
    sequence headed by A." (It may be more readable as the 'synchronizes_with' predicate in the mathematical formalization of that part of the standard: http://www.cl.cam.ac.uk/~pes20/cpp/)
    So if thread 1 takes the lock, does stuff, and unlocks, and then thread 2 takes the lock, thread 2 will see everything that was done before thread 1 took the lock (thanks to the synchronization between the fetch_add in each lock()), but not necessarily the do_stuff, as there is no release access to ingress in unlock() that it could synchronize with.

    Best regards,
    Robin

    Replies
    1. Hi Robin,
      I think I understand what you're saying. You are absolutely correct when you say that "(...) there is no release access to ingress in unlock() that it could synchronize with (...)", but
      you are forgetting that there is a happens-before relationship between "ingress" and "egress".

      Assuming the scenario you describe with:
      Thread 1:
      ticket_mutex_lock();
      do_stuff();
      ticket_mutex_unlock();
      Thread 2:
      ticket_mutex_lock();
      do_stuff();
      ticket_mutex_unlock();

      Let me do some assertions and then you can say if you disagree with any of these statements (I might be missing something which you find obvious):
      a) In ticket_mutex_lock() the value of egress will always be read _after_ the value of ingress, due to the implicit acquire barrier in fetch_add(ingress,1)
      b) ticket_mutex_lock() will only return if ingress equals egress, implying that do_stuff() will be called only in such a situation
      c) To some thread synchronizing on egress, ticket_mutex_unlock() will always occur _after_ do_stuff()

      Summarizing, in ticket_mutex_lock(), the fetch_add(ingress,1) happens-before the atomic_load_explicit(egress, relaxed), and only when
      ingress equals egress will the lock be acquired and do_stuff() called.


      However, while thinking about your comment, I thought of a different scenario, one which occurs only at the CPU re-ordering level due
      to branch prediction, and after talking with Andreia for a while, she showed me a (rare but) possible issue:
      Suppose the CPU predicts that the following expression will be true and starts executing part of do_stuff() before it has all the values from memory:
      if (lingress == atomic_load_explicit(&self->egress, memory_order_relaxed)) return ;
      In this case, even if the prediction turns out to be true (ingress == egress), for a small window of time the CPU was using values for variables in do_stuff() that
      could have been simultaneously written by another thread which in the meantime called ticket_mutex_unlock(), thus releasing the lock.
      Notice that this scenario happens at the CPU level only if the CPU decides to predict true for this branch AND egress is still not equal to ingress
      AND egress is equal to ingress by the time its value arrives.
      If the expression in the if() is false, then the code goes through a load with acquire, and would be correct.

      This is a very tricky and rare scenario, but it means that optimization (1) is not a valid optimization.
      I think this is not the scenario that you were talking about, but thanks for making us think about this, Robin!
      I will update the post and the code accordingly in the next few days.

      Once again, thank you for your comment,
      Pedro
