diff --git a/concurrency-primer.tex b/concurrency-primer.tex index 81b5f90..71ecc7c 100644 --- a/concurrency-primer.tex +++ b/concurrency-primer.tex @@ -1179,6 +1179,214 @@ \section{Do we always need sequentially consistent operations?} \section{Memory orderings} +\subsection{Memory consistency models} + +Modern concurrent programs execute atop at least two layers that freely reorder memory operations: +optimizing compilers transform the source program, and modern microprocessors retire loads and stores through pipelines, caches, speculation, and store buffers. +The program therefore does not have to run in the exact order in which it was written. +Instead, the system is allowed to change the sequence of operations so long as the observable result remains within an agreed set of valid outcomes. +This agreement between the programmer, the compiler, and the hardware is called a memory consistency model. + +Memory consistency models operate at several levels. +For example, a compiler may rearrange instructions while lowering \cplusplus{} source into assembly, and the processor may further reorder the resulting machine instructions while they execute. +The exact implementation details differ from one layer to another, but each layer must still preserve the outcomes promised by its memory model. + +\subsubsection{Sequential consistency (SC)} + +In the 1970s, Leslie Lamport proposed the most widely cited memory consistency model, sequential consistency (SC), defined as follows: + +\begin{quote} +A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. +\end{quote} + +Sequential consistency is easy to reason about because every execution appears to be some interleaving of each thread's program order. +That simplicity comes at a cost: modern hardware and optimizing compilers must give up profitable reorderings in order to preserve the illusion. + +A memory consistency model is a semantic contract, not a prescribed implementation. +The machine may execute instructions in a very different order from the source text, provided the final result still matches one of the outcomes allowed by the model. +Sequential consistency therefore does not imply a single execution order or a single result. +It only requires that the execution appear equivalent to some legal interleaving of the participating threads. + +To make that concrete, consider the following message-passing litmus test. +Two threads write to and read from two shared variables \monobox{x} and \monobox{y}, both initially set to \monobox{0}. + +\begin{ccode} +// Litmus Test: Message Passing +int x = 0; +int y = 0; + +// Thread 1 // Thread 2 +x = 1; r1 = y; +y = 1; r2 = x; +\end{ccode} + +If this program is sequentially consistent, then in the global interleaving Thread~1's operations must appear with \monobox{x = 1} before \monobox{y = 1}, and Thread~2's with \monobox{r1 = y} before \monobox{r2 = x}. +Across the whole program, the following six interleavings are possible: + +\begin{center} +\noindent +\begin{tabular}{|c|c|c|} \hline +\begin{minipage}{0.28\linewidth} +\begin{lstlisting} +x = 1 +y = 1 + r1 = y(1) + r2 = x(1) +\end{lstlisting} +\end{minipage} & +\begin{minipage}{0.28\linewidth} +\begin{lstlisting} +x = 1 + r1 = y(0) +y = 1 + r2 = x(1) +\end{lstlisting} +\end{minipage} & +\begin{minipage}{0.28\linewidth} +\begin{lstlisting} +x = 1 + r1 = y(0) + r2 = x(1) +y = 1 +\end{lstlisting} +\end{minipage} \\ \hline +\begin{minipage}{0.28\linewidth} +\begin{lstlisting} + r1 = y(0) +x = 1 +y = 1 + r2 = x(1) +\end{lstlisting} +\end{minipage} & +\begin{minipage}{0.28\linewidth} +\begin{lstlisting} + r1 = y(0) +x = 1 + r2 = x(1) +y = 1 +\end{lstlisting} +\end{minipage} & +\begin{minipage}{0.28\linewidth} +\begin{lstlisting} + r1 = y(0) + r2 = x(0) +x = 1 +y = 1 +\end{lstlisting} +\end{minipage} \\ \hline +\end{tabular} +\captionof{table}{Six possible executions of the message-passing litmus test under sequential consistency.} +\end{center} + +None of these executions produce \monobox{r1 = 1} and \monobox{r2 = 0}. +Sequential consistency therefore allows only \monobox{(r1, r2)} to be \monobox{(1, 1)}, \monobox{(0, 1)}, or \monobox{(0, 0)}. +Software may rely on \monobox{(1, 0)} never occurring, while hardware remains free to optimize as long as it preserves that guarantee. + +\begin{center} +\includegraphics[keepaspectratio,width=0.7\linewidth]{images/hw-seq-cst} +\captionof{figure}{A simple model of sequentially consistent hardware.} +\label{hw-seq-cst} +\end{center} + +Figure \ref{hw-seq-cst} sketches one intuitive implementation: each thread accesses a single shared memory, and that memory processes one read or write at a time. +Real machines are more complicated than that. +They may include private caches, queues, and multiple banks, but they still qualify as sequentially consistent if the externally visible behavior matches the same abstract model. + +\subsubsection{Total store order (TSO)} + +Although sequential consistency is often treated as the gold standard for reasoning about multi-threaded programs, it leaves limited room for performance optimization. +Modern processors therefore tend to implement weaker models. +For example, x86 processors are usually described using the total store order (TSO) model, which can be approximated by the following picture: + +\begin{center} +\includegraphics[keepaspectratio,width=0.7\linewidth]{images/hw-tso} +\captionof{figure}{A simplified model of x86-TSO hardware.} +\label{hw-tso} +\end{center} + +Under TSO, all processors can read from a single shared memory, but each processor first places its own writes into a per-core write queue, often called a store buffer. + +Consider the following write-queue litmus test: + +\begin{ccode} +// Litmus Test: Write Queue (Store Buffer) +int x = 0; +int y = 0; + +// Thread 1 // Thread 2 +x = 1; y = 1; +r1 = y; r2 = x; +\end{ccode} + +Sequential consistency does not allow \monobox{r1 = r2 = 0}, but TSO does. +Under SC, at least one of \monobox{x = 1} or \monobox{y = 1} must become visible before the reads occur. +Under TSO, however, both writes may still be sitting in their respective store buffers when the reads execute, so each thread can still observe \monobox{0}. + +Non-sequentially consistent hardware typically provides memory barriers, or fences, to restore stronger ordering when needed. +On a TSO machine, placing a barrier between the write and the read forces older writes to drain before later reads execute: + +\begin{ccode} +// Thread 1 // Thread 2 +x = 1; y = 1; +barrier; barrier; +r1 = y; r2 = x; +\end{ccode} + +The name total store order comes from the fact that once a write leaves the store buffer and reaches shared memory, every processor agrees on where that write sits relative to other writes. +Different processors may see a write late, but they do not disagree about the final order in which committed writes become visible. + +Consider the following Independent Reads of Independent Writes (IRIW) litmus test: + +\begin{ccode} +// Litmus Test: Independent Reads of Independent Writes (IRIW) +int x = 0; +int y = 0; + +// Thread 1 // Thread 2 // Thread 3 // Thread 4 +x = 1; y = 1; r1 = x; r3 = y; + r2 = y; r4 = x; +\end{ccode} + +If Thread~3 observes \monobox{r1 = 1} and \monobox{r2 = 0}, then in the single global store order \monobox{x = 1} must have committed before \monobox{y = 1}. +If Thread~4 also observes \monobox{r3 = 1}, that same global order forces \monobox{r4} to see \monobox{x = 1} as well, so \monobox{r4} can only be \monobox{1}. +Under TSO, all threads agree on the order in which committed writes become visible, which rules out the disagreement that IRIW would require. + +\subsubsection{Relaxed memory models} + +\begin{center} +\includegraphics[keepaspectratio,width=0.4\linewidth]{images/hw-relaxed} +\captionof{figure}{A simplified relaxed model resembling \textsc{Arm} hardware.} +\label{hw-relaxed} +\end{center} + +Figure \ref{hw-relaxed} sketches a more relaxed model similar to that used by modern \textsc{Arm} processors. +Each core can read and write through its own local structures, and writes may propagate to other cores in different orders. +Reads may also be delayed or speculated until their values are needed. +This gives the hardware substantially more freedom than either SC or TSO. + +One property still remains essential: accesses to the same memory location must obey coherence. +All threads must eventually agree on the order in which writes to a single address become visible. +Without coherence, even simple shared-state programs would be almost impossible to reason about. + +The following coherence litmus test is therefore disallowed not only on \textsc{Arm}, but also on x86-TSO and under sequential consistency: + +\begin{ccode} +// Litmus Test: Coherence +int x = 0; + +// Thread 1 // Thread 2 // Thread 3 // Thread 4 +x = 1; x = 2; r1 = x; r3 = x; + r2 = x; r4 = x; +\end{ccode} + +No execution may produce \monobox{r1 = 1}, \monobox{r2 = 2}, \monobox{r3 = 2}, and \monobox{r4 = 1}, because that would require different threads to disagree about the order of writes to \monobox{x}. + +Litmus tests like these are commonly checked with tools rather than by hand. +The \href{https://diy.inria.fr/}{\texttt{diy}} and \href{https://github.com/herd/herdtools7}{\texttt{herd7}} tools let you describe a small concurrent program, choose an architecture or language memory model, and enumerate the outcomes that are allowed. +That workflow is especially useful when validating whether a surprising execution is genuinely permitted by \textsc{Arm}, x86-TSO, or the \cplusplus{} memory model. + +\subsection{C11/C++11 atomics} + By default, all atomic operations, including loads, stores, and various forms of \textsc{RMW}, are considered sequentially consistent. However, this is just one among many possible orderings. @@ -1225,7 +1433,7 @@ \section{Memory orderings} let's look at what these orderings are and how we can use them. As it turns out, almost all of the examples we have seen so far do not actually need sequentially consistent operations. -\subsection{Acquire and release} +\subsubsection{Acquire and release} We have just examined the acquire and release operations in the context of the lock example from \secref{lock-example}. You can think of them as ``one-way'' barriers: an acquire operation permits other reads and writes to move past it, @@ -1290,7 +1498,7 @@ \subsection{Acquire and release} } \end{cppcode} -\subsection{Relaxed} +\subsubsection{Relaxed} Relaxed atomic operations are useful for variables shared between threads where \emph{no specific order} of operations is needed. Although it may seem like a niche requirement, such scenarios are quite common. @@ -1332,7 +1540,7 @@ \subsection{Relaxed} a \textsc{CAS} loop is performed to claim a job. All of the loads can be relaxed as we do not need to enforce any order until we have successfully modified our value. -\subsection{Acquire-Release} +\subsubsection{Acquire-Release} \cc|memory_order_acq_rel| is used with atomic \textsc{RMW} operations that need to both load-acquire \emph{and} store-release a value. A typical example involves thread-safe reference counting, @@ -1382,7 +1590,7 @@ \subsection{Acquire-Release} experts-only construct we have in the language. \end{quote} -\subsection{Consume} +\subsubsection{Consume} Last but not least, we introduce \cc|memory_order_consume|. Imagine a situation where data changes rarely but is frequently read by many threads. @@ -1598,6 +1806,21 @@ \section{Additional Resources} by Fedor Pikus, a hour-long talk on this topic. +\href{https://lamport.azurewebsites.net/pubs/multi.pdf}{% +\textit{How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs}}, +Leslie Lamport's classic paper introducing sequential consistency. + +\href{https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf}{% +\textit{x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors}}, +by Peter Sewell et al., +for a precise description of the x86 memory model. + +\href{https://research.swtch.com/hwmm}{% +\textit{Hardware Memory Models}}, +by Russ Cox, +the essay whose SC/TSO/relaxed progression and litmus tests this chapter's +``Memory consistency models'' section is adapted from. + \href{https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/}{% \textit{\cpp|atomic<> Weapons|: The \cplusplus{11} Memory Model and Modern Hardware}} by Herb Sutter, @@ -1619,6 +1842,11 @@ \section{Additional Resources} an older but much shorter piece by McKenney explaining how memory barriers are implemented in the Linux kernel on various architectures. +\href{https://diy.inria.fr/}{\texttt{diy}}, +plus the related +\href{https://github.com/herd/herdtools7}{\texttt{herdtools7}} suite, +for generating and checking litmus tests against hardware and language memory models. + \href{https://preshing.com/archives/}{\textit{Preshing On Programming}}, a blog with many excellent articles on lockless concurrency. diff --git a/images/hw-relaxed.pdf b/images/hw-relaxed.pdf new file mode 100644 index 0000000..a9a0c2c Binary files /dev/null and b/images/hw-relaxed.pdf differ diff --git a/images/hw-seq-cst.pdf b/images/hw-seq-cst.pdf new file mode 100644 index 0000000..301b788 Binary files /dev/null and b/images/hw-seq-cst.pdf differ diff --git a/images/hw-tso.pdf b/images/hw-tso.pdf new file mode 100644 index 0000000..12f4870 Binary files /dev/null and b/images/hw-tso.pdf differ