Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
236 changes: 232 additions & 4 deletions concurrency-primer.tex
Original file line number Diff line number Diff line change
Expand Up @@ -1179,6 +1179,214 @@ \section{Do we always need sequentially consistent operations?}

\section{Memory orderings}

\subsection{Memory consistency models}

Modern concurrent programs execute atop at least two layers that freely reorder memory operations:
optimizing compilers transform the source program, and modern microprocessors retire loads and stores through pipelines, caches, speculation, and store buffers.
The program therefore does not have to run in the exact order in which it was written.
Instead, the system is allowed to change the sequence of operations so long as the observable result remains within an agreed set of valid outcomes.
This agreement between the programmer, the compiler, and the hardware is called a memory consistency model.

Memory consistency models operate at several levels.
For example, a compiler may rearrange instructions while lowering \cplusplus{} source into assembly, and the processor may further reorder the resulting machine instructions while they execute.
The exact implementation details differ from one layer to another, but each layer must still preserve the outcomes promised by its memory model.

\subsubsection{Sequential consistency (SC)}

In the 1970s, Leslie Lamport proposed the most widely cited memory consistency model, sequential consistency (SC), defined as follows:

\begin{quote}
A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
\end{quote}

Sequential consistency is easy to reason about because every execution appears to be some interleaving of each thread's program order.
That simplicity comes at a cost: modern hardware and optimizing compilers must give up profitable reorderings in order to preserve the illusion.

A memory consistency model is a semantic contract, not a prescribed implementation.
The machine may execute instructions in a very different order from the source text, provided the final result still matches one of the outcomes allowed by the model.
Sequential consistency therefore does not imply a single execution order or a single result.
It only requires that the execution appear equivalent to some legal interleaving of the participating threads.

To make that concrete, consider the following message-passing litmus test.
Two threads write to and read from two shared variables \monobox{x} and \monobox{y}, both initially set to \monobox{0}.

\begin{ccode}
// Litmus Test: Message Passing
int x = 0;
int y = 0;

// Thread 1 // Thread 2
x = 1; r1 = y;
y = 1; r2 = x;
\end{ccode}

If this program is sequentially consistent, then in the global interleaving Thread~1's operations must appear with \monobox{x = 1} before \monobox{y = 1}, and Thread~2's with \monobox{r1 = y} before \monobox{r2 = x}.
Across the whole program, the following six interleavings are possible:

\begin{center}
\noindent
\begin{tabular}{|c|c|c|} \hline
\begin{minipage}{0.28\linewidth}
\begin{lstlisting}
x = 1
y = 1
r1 = y(1)
r2 = x(1)
\end{lstlisting}
\end{minipage} &
\begin{minipage}{0.28\linewidth}
\begin{lstlisting}
x = 1
r1 = y(0)
y = 1
r2 = x(1)
\end{lstlisting}
\end{minipage} &
\begin{minipage}{0.28\linewidth}
\begin{lstlisting}
x = 1
r1 = y(0)
r2 = x(1)
y = 1
\end{lstlisting}
\end{minipage} \\ \hline
\begin{minipage}{0.28\linewidth}
\begin{lstlisting}
r1 = y(0)
x = 1
y = 1
r2 = x(1)
\end{lstlisting}
\end{minipage} &
\begin{minipage}{0.28\linewidth}
\begin{lstlisting}
r1 = y(0)
x = 1
r2 = x(1)
y = 1
\end{lstlisting}
\end{minipage} &
\begin{minipage}{0.28\linewidth}
\begin{lstlisting}
r1 = y(0)
r2 = x(0)
x = 1
y = 1
\end{lstlisting}
\end{minipage} \\ \hline
\end{tabular}
\captionof{table}{Six possible executions of the message-passing litmus test under sequential consistency.}
\end{center}

None of these executions produce \monobox{r1 = 1} and \monobox{r2 = 0}.
Sequential consistency therefore allows only \monobox{(r1, r2)} to be \monobox{(1, 1)}, \monobox{(0, 1)}, or \monobox{(0, 0)}.
Software may rely on \monobox{(1, 0)} never occurring, while hardware remains free to optimize as long as it preserves that guarantee.

\begin{center}
\includegraphics[keepaspectratio,width=0.7\linewidth]{images/hw-seq-cst}
\captionof{figure}{A simple model of sequentially consistent hardware.}
\label{hw-seq-cst}
\end{center}

Figure \ref{hw-seq-cst} sketches one intuitive implementation: each thread accesses a single shared memory, and that memory processes one read or write at a time.
Real machines are more complicated than that.
They may include private caches, queues, and multiple banks, but they still qualify as sequentially consistent if the externally visible behavior matches the same abstract model.

\subsubsection{Total store order (TSO)}

Although sequential consistency is often treated as the gold standard for reasoning about multi-threaded programs, it leaves limited room for performance optimization.
Modern processors therefore tend to implement weaker models.
For example, x86 processors are usually described using the total store order (TSO) model, which can be approximated by the following picture:

\begin{center}
\includegraphics[keepaspectratio,width=0.7\linewidth]{images/hw-tso}
\captionof{figure}{A simplified model of x86-TSO hardware.}
\label{hw-tso}
\end{center}

Under TSO, all processors can read from a single shared memory, but each processor first places its own writes into a per-core write queue, often called a store buffer.

Consider the following write-queue litmus test:

\begin{ccode}
// Litmus Test: Write Queue (Store Buffer)
int x = 0;
int y = 0;

// Thread 1 // Thread 2
x = 1; y = 1;
r1 = y; r2 = x;
\end{ccode}

Sequential consistency does not allow \monobox{r1 = r2 = 0}, but TSO does.
Under SC, at least one of \monobox{x = 1} or \monobox{y = 1} must become visible before the reads occur.
Under TSO, however, both writes may still be sitting in their respective store buffers when the reads execute, so each thread can still observe \monobox{0}.

Non-sequentially consistent hardware typically provides memory barriers, or fences, to restore stronger ordering when needed.
On a TSO machine, placing a barrier between the write and the read forces older writes to drain before later reads execute:

\begin{ccode}
// Thread 1 // Thread 2
x = 1; y = 1;
barrier; barrier;
r1 = y; r2 = x;
\end{ccode}

The name total store order comes from the fact that once a write leaves the store buffer and reaches shared memory, every processor agrees on where that write sits relative to other writes.
Different processors may see a write late, but they do not disagree about the final order in which committed writes become visible.

Consider the following Independent Reads of Independent Writes (IRIW) litmus test:

\begin{ccode}
// Litmus Test: Independent Reads of Independent Writes (IRIW)
int x = 0;
int y = 0;

// Thread 1 // Thread 2 // Thread 3 // Thread 4
x = 1; y = 1; r1 = x; r3 = y;
r2 = y; r4 = x;
\end{ccode}

If Thread~3 observes \monobox{r1 = 1} and \monobox{r2 = 0}, then in the single global store order \monobox{x = 1} must have committed before \monobox{y = 1}.
If Thread~4 also observes \monobox{r3 = 1}, that same global order forces \monobox{r4} to see \monobox{x = 1} as well, so \monobox{r4} can only be \monobox{1}.
Under TSO, all threads agree on the order in which committed writes become visible, which rules out the disagreement that IRIW would require.

\subsubsection{Relaxed memory models}

\begin{center}
\includegraphics[keepaspectratio,width=0.4\linewidth]{images/hw-relaxed}
\captionof{figure}{A simplified relaxed model resembling \textsc{Arm} hardware.}
\label{hw-relaxed}
\end{center}

Figure \ref{hw-relaxed} sketches a more relaxed model similar to that used by modern \textsc{Arm} processors.
Each core can read and write through its own local structures, and writes may propagate to other cores in different orders.
Reads may also be delayed or speculated until their values are needed.
This gives the hardware substantially more freedom than either SC or TSO.

One property still remains essential: accesses to the same memory location must obey coherence.
All threads must eventually agree on the order in which writes to a single address become visible.
Without coherence, even simple shared-state programs would be almost impossible to reason about.

The following coherence litmus test is therefore disallowed not only on \textsc{Arm}, but also on x86-TSO and under sequential consistency:

\begin{ccode}
// Litmus Test: Coherence
int x = 0;

// Thread 1 // Thread 2 // Thread 3 // Thread 4
x = 1; x = 2; r1 = x; r3 = x;
r2 = x; r4 = x;
\end{ccode}

No execution may produce \monobox{r1 = 1}, \monobox{r2 = 2}, \monobox{r3 = 2}, and \monobox{r4 = 1}, because that would require different threads to disagree about the order of writes to \monobox{x}.

Litmus tests like these are commonly checked with tools rather than by hand.
The \href{https://diy.inria.fr/}{\texttt{diy}} and \href{https://github.com/herd/herdtools7}{\texttt{herd7}} tools let you describe a small concurrent program, choose an architecture or language memory model, and enumerate the outcomes that are allowed.
That workflow is especially useful when validating whether a surprising execution is genuinely permitted by \textsc{Arm}, x86-TSO, or the \cplusplus{} memory model.

\subsection{C11/C++11 atomics}

By default, all atomic operations, including loads, stores, and various forms of \textsc{RMW},
are considered sequentially consistent.
However, this is just one among many possible orderings.
Expand Down Expand Up @@ -1225,7 +1433,7 @@ \section{Memory orderings}
let's look at what these orderings are and how we can use them.
As it turns out, almost all of the examples we have seen so far do not actually need sequentially consistent operations.

\subsection{Acquire and release}
\subsubsection{Acquire and release}

We have just examined the acquire and release operations in the context of the lock example from \secref{lock-example}.
You can think of them as ``one-way'' barriers: an acquire operation permits other reads and writes to move past it,
Expand Down Expand Up @@ -1290,7 +1498,7 @@ \subsection{Acquire and release}
}
\end{cppcode}

\subsection{Relaxed}
\subsubsection{Relaxed}
Relaxed atomic operations are useful for variables shared between threads where \emph{no specific order} of operations is needed.
Although it may seem like a niche requirement, such scenarios are quite common.

Expand Down Expand Up @@ -1332,7 +1540,7 @@ \subsection{Relaxed}
a \textsc{CAS} loop is performed to claim a job.
All of the loads can be relaxed as we do not need to enforce any order until we have successfully modified our value.

\subsection{Acquire-Release}
\subsubsection{Acquire-Release}

\cc|memory_order_acq_rel| is used with atomic \textsc{RMW} operations that need to both load-acquire \emph{and} store-release a value.
A typical example involves thread-safe reference counting,
Expand Down Expand Up @@ -1382,7 +1590,7 @@ \subsection{Acquire-Release}
experts-only construct we have in the language.
\end{quote}

\subsection{Consume}
\subsubsection{Consume}

Last but not least, we introduce \cc|memory_order_consume|.
Imagine a situation where data changes rarely but is frequently read by many threads.
Expand Down Expand Up @@ -1598,6 +1806,21 @@ \section{Additional Resources}
by Fedor Pikus,
a hour-long talk on this topic.

\href{https://lamport.azurewebsites.net/pubs/multi.pdf}{%
\textit{How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs}},
Leslie Lamport's classic paper introducing sequential consistency.

\href{https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf}{%
\textit{x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors}},
by Peter Sewell et al.,
for a precise description of the x86 memory model.

\href{https://research.swtch.com/hwmm}{%
\textit{Hardware Memory Models}},
by Russ Cox,
the essay whose SC/TSO/relaxed progression and litmus tests this chapter's
``Memory consistency models'' section is adapted from.

\href{https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/}{%
\textit{\cpp|atomic<> Weapons|: The \cplusplus{11} Memory Model and Modern Hardware}}
by Herb Sutter,
Expand All @@ -1619,6 +1842,11 @@ \section{Additional Resources}
an older but much shorter piece by McKenney explaining how memory barriers are implemented
in the Linux kernel on various architectures.

\href{https://diy.inria.fr/}{\texttt{diy}},
plus the related
\href{https://github.com/herd/herdtools7}{\texttt{herdtools7}} suite,
for generating and checking litmus tests against hardware and language memory models.

\href{https://preshing.com/archives/}{\textit{Preshing On Programming}},
a blog with many excellent articles on lockless concurrency.

Expand Down
Binary file added images/hw-relaxed.pdf
Binary file not shown.
Binary file added images/hw-seq-cst.pdf
Binary file not shown.
Binary file added images/hw-tso.pdf
Binary file not shown.
Loading