The Leaky Semicolon
Compositional Semantic Dependencies for Relaxed-Memory Concurrency

ALAN JEFFREY, Roblox, USA
JAMES RIELY, DePaul University, USA
MARK BATTY, University of Kent, UK
SIMON COOKSEY, University of Kent, UK
ILYA KAYSIN, JetBrains Research, Russia and University of Cambridge, UK
ANTON PODKOPAEV, HSE University, Russia

Program logics and semantics tell a pleasant story about sequential composition: when executing \( S_1; S_2 \), we first execute \( S_1 \) then \( S_2 \). To improve performance, however, processors execute instructions out of order, and compilers reorder programs even more dramatically. By design, single-threaded systems cannot observe these reorderings; however, multiple-threaded systems can, making the story considerably less pleasant. A formal attempt to understand the resulting mess is known as a “relaxed memory model.” Prior models either fail to address sequential composition directly, or overly restrict processors and compilers, or permit nonsense thin-air behaviors which are unobservable in practice.

To support sequential composition while targeting modern hardware, we enrich the standard event-based approach with \textit{preconditions} and \textit{families of predicate transformers}. When calculating the meaning of \( S_1; S_2 \), the predicate transformer applied to the precondition of an event \( e \) from \( S_2 \) is chosen based on the set of events in \( S_1 \) upon which \( e \) depends. We apply this approach to two existing memory models.

CCS Concepts: • Theory of computation → Parallel computing models; Preconditions.

Additional Key Words and Phrases: Concurrency, Relaxed Memory Models, Pomsets, Preconditions, Predicate Transformers, Multi-Copy Atomicity, Arm8, C11, Thin-Air Reads, Compiler Optimizations

1 INTRODUCTION

\textit{Sequentiality} is a \textit{leaky abstraction} [Spolsky 2002]. For example, sequentiality tells us that when executing \( r_1 := x ; y := r_2 \), the assignment \( r_1 := x \) is executed before \( y := r_2 \). Thus, one might reasonably expect that the final value of \( r_1 \) is independent of the initial value of \( r_2 \). In most modern languages, however, this fails to hold when the program is run concurrently with \( s := y ; x := s \), which copies \( y \) to \( x \).

In certain cases it is possible to ban concurrent access using separation [O’Hearn 2007], or to accept inefficient implementation in order to obtain sequential consistency (SC) [Marino et al. 2015].

Authors’ addresses: Alan Jeffrey, Roblox, Chicago, USA, ajeffrey@roblox.com; James Riely, DePaul University, Chicago, USA, jriely@cs.depaul.edu; Mark Batty, University of Kent, Canterbury, UK, m.j.batty@kent.ac.uk; Simon Cooksey, University of Kent, Canterbury, UK, simon@graymalk.in; Ilya Kaysin, JetBrains Research, Russia and University of Cambridge, Cambridge, UK, ik404@cam.ac.uk; Anton Podkopaev, HSE University, Saint Petersburg, Russia, apodkopaev@hse.ru.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

© 2022 Copyright held by the owner/author(s).
2473-1421/2022/1-ART54
https://doi.org/10.1145/3498716
When these approaches are not available, however, the humble semicolon becomes shrouded in mystery, covered in the cloak of something known as a memory model. Every language has such a model: For each read operation, it determines the set of available values. Compilers and runtime systems are allowed to choose any value in the set. To allow efficient implementation, the set must not be too small. To allow invariant reasoning, the set must not be too large.

For optimized concurrent languages, it is surprisingly difficult to define a model that allows common compiler optimizations and hardware reorderings yet disallows nonsense behaviors that don’t arise in practice. The latter are commonly known as “thin-air” behaviors [Batty et al. 2015]. There are only a handful of solutions, and all have deficiencies. These can be classified by their approach to dependency tracking (from strongest to weakest):

- Syntactic dependencies [Boehm and Demsky 2014; Kavanagh and Brookes 2018; Lahav et al. 2017; Vafeiadis and Narayan 2013]. These models require inefficient implementation of relaxed access. This is a non-starter for safe languages like Java and JavaScript, and may be an unacceptable cost for low-level languages like C11.
- Semantic dependencies [Chakraborty and Vafeiadis 2019; Cho et al. 2021; Jagadeesan et al. 2010; Kang et al. 2017; Lee et al. 2020; Manson et al. 2005]. These models compute dependencies operationally using alternate worlds, making it impossible to understand a single execution in isolation; they also allow executions that violate temporal reasoning (see §9).
- No dependencies, as in C11 [Batty et al. 2015] and JavaScript [Watt et al. 2019]. This allows thin-air executions.

These models are all non-compositional in the sense that in order to calculate the meaning of any thread, all threads must be known. Using the axiomatic approach of C11, for example, execution graphs are first constructed for each thread, using an operational semantics that allows a read to see any value. The combined graphs are then filtered using a set of acyclicity axioms that determine which reads are valid. These axioms use existentially defined global relations, such as memory order (mo), which must be a per-location total order on write actions.

Part of this non-compositionality is essential: In a concurrent system, the complete set of writes is known only at top-level. However, much of it is incidental. Two recent models have attempted to limit non-compositionality. Jagadeesan et al. [2020] defined Pomsets with Preconditions (PwP), which use preconditions and logic to calculate dependencies for a Java-like language. Paviotti et al. [2020] defined Modular Relaxed Dependencies (MRD), which use event structures to calculate a semantic dependency relation (sdep). PwP is defined using (acyclic) labeled partial orders, or pomsets [Gischer 1988]. MRD adds a causality axiom to C11, stating that (sdep ∪ rf) must be acyclic. In both approaches, acyclicity enables inductive reasoning.

While PwP and MRD both treat concurrency compositionally, neither gives a compositional account of sequentiality. PwP uses prefixing, adding one event at a time on the left. MRD encodes sequential composition using continuation-passing. In both, adding an event requires perfect knowledge of the future. For example, suppose that you are writing system call code and you wish to know if you can reorder a couple of statements. Using PwP or MRD, you cannot tell whether this is possible without having the calling code! More formally, Jagadeesan et al. state the equivalence allowing reordering independent writes as follows:

\[ [x := M; y := N; S] = [y := N; x := M; S] \text{ if } x \neq y \]

This requires a quantification over all continuations S. This is problematic, both from a theoretical point of view—the syntax of programs is now mentioned in the definition of the semantics—and in practice—tools cannot quantify over infinite sets. This problem is related to contextual equivalence, full abstraction [Milner 1977; Plotkin 1977] and the CIU theorem of Mason and Talcott [1992].
In this paper, we show that PwP can be extended with families of predicate transformers (PwT) to calculate sequential dependencies in a way that is compositional and direct: compositional in that the denotation of \((S_1; S_2)\) can be computed from the denotation of \(S_1\) and the denotation of \(S_2\), and direct in that these can be calculated independently. With this formulation, we can show:

\[
[x := M; y := N] = [y := N; x := M] \text{ if } x \neq y
\]

Then the equivalence holds in any context—this form of the equivalence enables reasoning about peephole optimizations. Said differently, unlike prior work, PwT allows the presence or absence of a dependency to be understood in isolation—this enables incremental and modular validation of assumptions about program dependencies in larger blocks of code.

Our main insight is that for language models, sequentiality is the hard part. Concurrency is easy! Or at least, it is no more difficult than it is for hardware. Compilers make the difference, since they typically do little optimization between threads. We motivate our approach to sequential dependencies in §2 and provide formal definitions in §3. In §8, we extend the model to include additional features, such as address calculation and RMWs. We discuss related and future work in §9–10.

We extend PwT to a full memory model in §4, based on PwP [Jagadeesan et al. 2020]. §5 summarizes the results for this model. In addition to powering such a bespoke model, the dependency relation calculated by PwT can also be used with off-the-shelf models. For example, in §6 we show that it can be used as an \(sdep\) relation for C11, adapting the approach of MRD [Paviotti et al. 2020]. §7 describes a tool for automatic evaluation of litmus tests in this model. C11 allows thin-air in order to avoid overhead in the implementation of relaxed reads. Safe languages like OCaml [Dolan et al. 2018] have typically made the opposite choice, accepting a performance penalty in order to avoid thin-air. Just as PwT can be used to strengthen C11, it could also be used to weaken these models, allowing optimal lowering for relaxed reads while banning thin-air.

PwT has been formalized in Coq. We have formally verified that sequential composition satisfies the expected monoid laws (Lemma 3.5). In addition, we have formally verified that \([\text{if}(\phi)\{S_1; S_3\}\text{ else }\{S_2; S_3\}] \supseteq [\text{if}(\phi)\{S_1\}\text{ else }\{S_2\}; S_3]\) (Lemma 3.6e).

Supplementary material for this paper is available at https://weakmemory.github.io/pwt.

2 OVERVIEW

This paper is about the interaction of two of the fundamental building blocks of computing: sequential composition and mutable state. One would like to think that these are well-worn topics, where every issue has been settled, but this is not the case.

2.1 Sequential Composition

Novice programmers are taught sequential abstraction: that the program \(S_1; S_2\) executes \(S_1\) before \(S_2\). Since the late 1960s, we’ve been able to explain this using logic [Hoare 1969]. In Dijkstra’s [1975] formulation, we think of programs as predicate transformers, where predicates describe the state of memory in the system. In the calculus of weakest preconditions, programs map postconditions to preconditions. We recall the definition of \(wp_S(\psi)\) for loop-free code below (where \(r\)–\(s\) range over thread-local registers and \(M\)–\(N\) range over side-effect-free expressions).

\[
wp_{r:=M}(\psi) = \psi[M/r] \quad wp_{S_1;S_2}(\psi) = wp_{S_1}(wp_{S_2}(\psi)) \quad wp_{\text{skip}}(\psi) = \psi
\]

\[
wp_{\text{if}(M)\{S_1\}\text{ else }\{S_2\}}(\psi) = ((M \neq 0) \Rightarrow wp_{S_1}(\psi)) \land ((M = 0) \Rightarrow wp_{S_2}(\psi))
\]

Without loops, the Hoare triple \(\{\phi\} S \{\psi\}\) holds exactly when \(\phi \Rightarrow wp_S(\psi)\). This is an elegant explanation of sequential computation in a sequential context. Note that the assignment rule is sound because a read from a thread-local register must be fulfilled by a preceding write in the same thread. In a concurrent context, with shared variables (\(x\)–\(z\)), the obvious generalization of the assignment rule for reads, \(wp_{r:=x}(\psi) = \psi[x/r]\), is unsound! In particular, a read from a shared memory location may be fulfilled by a write in another thread.
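The sequential rules above can be read as a small interpreter. The following is a minimal sketch, not part of the formal development: it assumes predicates and expressions are modeled as Python functions over a register state, and the names `wp_assign`, `wp_seq`, `wp_if`, and `wp_skip` are hypothetical. Substitution \(\psi[M/r]\) is modeled semantically, by evaluating \(\psi\) in the state where \(r\) has been updated to the value of \(M\).

```python
from typing import Callable, Dict

State = Dict[str, int]            # register name -> value
Pred = Callable[[State], bool]    # a predicate over register states
Expr = Callable[[State], int]     # a side-effect-free expression
WP = Callable[[Pred], Pred]       # maps a postcondition to a precondition


def wp_assign(r: str, m: Expr) -> WP:
    """wp_{r:=M}(psi) = psi[M/r], modeled by updating r before evaluating psi."""
    return lambda psi: (lambda s: psi({**s, r: m(s)}))


def wp_skip() -> WP:
    """wp_skip(psi) = psi."""
    return lambda psi: psi


def wp_seq(wp1: WP, wp2: WP) -> WP:
    """wp_{S1;S2}(psi) = wp_{S1}(wp_{S2}(psi))."""
    return lambda psi: wp1(wp2(psi))


def wp_if(m: Expr, wp1: WP, wp2: WP) -> WP:
    """((M != 0) => wp_{S1}(psi)) and ((M = 0) => wp_{S2}(psi))."""
    return lambda psi: (lambda s: wp1(psi)(s) if m(s) != 0 else wp2(psi)(s))


# r := 1; if (r) { s := r + 1 } else { s := 0 }, with postcondition s = 2.
program = wp_seq(
    wp_assign("r", lambda st: 1),
    wp_if(lambda st: st["r"],
          wp_assign("s", lambda st: st["r"] + 1),
          wp_assign("s", lambda st: 0)),
)
pre = program(lambda st: st["s"] == 2)

# s := r, with postcondition s = 2: the precondition is r = 2.
copy_pre = wp_assign("s", lambda st: st["r"])(lambda st: st["s"] == 2)
```

In this sketch every read of a register is served by the local state, which is exactly the assumption that fails for shared variables.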

In this paper we answer the following question: what does sequential composition mean in a concurrent context? An acceptable answer must satisfy several desiderata:

1. it should not impose too much order, overconstraining the implementation,
2. it should not impose too little order, allowing bogus executions, and
3. it should be compositional and direct, as described in §1.

Memory models differ in how they navigate between desiderata 1 and 2. In one direction there are both more valid compiler optimizations and more potentially dubious executions; in the other direction, fewer of both. To understand the tradeoffs, one must first understand the underlying hardware and compilers.

2.2 Memory Models

For single-threaded programs, memory can be thought of as you might expect: programs write to, and read from, memory references. This can be thought of as a total order over memory actions (\(\rightarrow\)), where each read has a matching fulfilling write (\(\rightarrow\)), for example:

\[
\begin{align*}
x &:= 0; \ x := 1; \ y := 2; \ r := y; \ s := x
\end{align*}
\]

This model extends naturally to the case of shared-memory concurrency, leading to a sequentially consistent semantics [Lamport 1979], in which program order inside a thread implies a total causal order between read and write events, for example (where ; has higher precedence than ||):

\[
\begin{align*}
x &:= 0; \ x := 1; \ y := 2 \ || \ r := y; \ s := x
\end{align*}
\]

We can represent such an execution as a labeled partial order, or pomset [Gischer 1988; Pratt 1985]. A program may give rise to many executions, each reflecting a different interleaving of the threads.

Unfortunately, this model does not compile efficiently to commodity hardware, resulting in a 37–73% increase in CPU time on Arm8 [Liu et al. 2019] and, hence, in power consumption. Developers of software and compilers have therefore been faced with a difficult trade-off, between an elegant model of memory, and its impact on resource usage (such as size of data centers, electricity bills and carbon footprint). Unsurprisingly, many have chosen to prioritize efficiency over elegance.

This has led to relaxed memory models, in which the requirement of sequential consistency is weakened to only apply per-location. This allows executions that are inconsistent with program order, such as the following, which contains an antidependency (\(\rightarrow\)):

\[
\begin{align*}
x &:= 0; \ x := 1; \ y := 2 \ || \ r := y; \ s := x
\end{align*}
\]

In such models, the causal order between events is important, and includes control and data dependencies (\(\rightarrow\)) to avoid paradoxical “out of thin air” examples such as the following. (We routinely elide initializing writes when they are uninteresting.)

\[
\begin{align*}
r &:= x; \ \text{if}(r)\{ y := 1 \} \ || \ s := y; \ x := s
\end{align*}
\]

This candidate execution forms a cycle in causal order, so is disallowed, but this depends crucially on the control dependency from \((R_x1)\) to \((W_y1)\), and the data dependency from \((R_y1)\) to \((W_x1)\). If either is missing, then this execution is acyclic and hence allowed. For example, dropping the control dependency results in the following execution, which should be allowed:

\[
\begin{align*}
\textstyle & \textcolor{red}{R_x1} \quad \textcolor{red}{W_y1} \quad \textcolor{red}{R_y1} \quad \textcolor{red}{W_x1} \\
& r := x; y := 1 \parallel s := y; x := s
\end{align*}
\]
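The acyclicity argument can be checked mechanically. The following is a minimal sketch under stated assumptions: events are plain strings, causal order is the union of dependency edges and reads-from edges, and `has_cycle` is a hypothetical helper using depth-first search. It is an illustration of the cycle check only, not the definition of any memory model.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def has_cycle(edges: Set[Edge]) -> bool:
    """Return True if the directed graph given by `edges` contains a cycle."""
    graph: Dict[str, List[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the stack, finished
    colour = {node: WHITE for node in graph}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for succ in graph[node]:
            if colour[succ] == GREY:      # back edge: a cycle exists
                return True
            if colour[succ] == WHITE and visit(succ):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in graph)


# Candidate execution where both reads see 1.
reads_from = {("Wy1", "Ry1"), ("Wx1", "Rx1")}
data_dep = {("Ry1", "Wx1")}       # s := y; x := s
control_dep = {("Rx1", "Wy1")}    # r := x; if (r) { y := 1 }

with_control = reads_from | data_dep | control_dep
without_control = reads_from | data_dep
```

With the control dependency the four events form a cycle; without it the remaining edges form a chain from `Wy1` to `Rx1`.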

While syntactic dependency calculation suffices for hardware models, it is not preserved by common compiler optimizations. For example, consider the following program:

\[
\begin{align*}
& r := x; \text{if}(r)\{y := 1\} \text{else } \{y := 1\} \parallel s := y; x := s
\end{align*}
\]

Because \(y := 1\) occurs on both branches of the conditional, a compiler may lift it out. With the dependency removed, the compiler could reorder the read of \(x\) and write to \(y\), allowing both reads to see \(1\). Attempting to generate this execution with syntactic dependencies, however, results in the following candidate execution, which has a cycle and therefore is disallowed:

\[
\begin{align*}
\textstyle & \textcolor{red}{R_x1} \quad \textcolor{red}{W_y1} \quad \textcolor{red}{R_y1} \quad \textcolor{red}{W_x1} \\
\end{align*}
\]

To address this, Jagadeesan et al. [2020] introduced Pomsets with Preconditions (PwP), where events are labeled with logical formulae. Nontrivial preconditions are introduced by store actions (modeling data dependencies) and conditionals (modeling control dependencies):

\[
\begin{align*}
& \text{if}(s>0)\{z := r*(s-1)\} \\
& (s>0) \land (r*(s-1))=0 \quad W_z0
\end{align*}
\]

In this diagram, \((s>0)\) is a control dependency and \((r*(s-1))=0\) is a data dependency. Preconditions are updated as events are prepended (we assume the usual precedence for logical operators):

\[
\begin{align*}
& r := x; s := y; \text{if}(s>0)\{z := r*(s-1)\} \\
\textcolor{red}{R_x1} \quad \textcolor{red}{R_y1} \quad (1=s) \Rightarrow (s>0) \land (r*(s-1))=0 \quad W_z0
\end{align*}
\]

In this diagram there are two reads. As evidenced by the arrow, the read of \(y\) is ordered before the write, reflecting possible dependency; the read of \(x\) is not, reflecting independency. The dependent read of \(y\) allows the precondition of the write to weaken: now the old precondition need only be satisfied assuming the hypothesis \((1=s)\). The independent read of \(x\) allows no such weakening. Nonetheless, the precondition of the write is now a tautology, and so can be elided in the diagram.
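The claim that the weakened precondition is a tautology can be checked by brute force. The sketch below assumes a small finite value domain (`VALUES`, chosen arbitrarily here) and models formulae as Python functions over register valuations; the helper names are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, Sequence

Env = Dict[str, int]
Formula = Callable[[Env], bool]

VALUES = range(4)  # assumption: a small finite value domain for illustration


def implies(a: bool, b: bool) -> bool:
    return (not a) or b


def is_tautology(formula: Formula, registers: Sequence[str] = ("r", "s")) -> bool:
    """True if `formula` holds for every valuation of `registers` over VALUES."""
    return all(
        formula(dict(zip(registers, vals)))
        for vals in product(VALUES, repeat=len(registers))
    )


def write_pre(env: Env) -> bool:
    """Precondition of (W z 0): (s > 0) and (r * (s - 1)) = 0."""
    return env["s"] > 0 and env["r"] * (env["s"] - 1) == 0


def after_read_y(env: Env) -> bool:
    """Precondition after the dependent read of y: (1 = s) => write_pre."""
    return implies(env["s"] == 1, write_pre(env))


def after_read_x_only(env: Env) -> bool:
    """Hypothetical alternative: depend on the read of x instead of y."""
    return implies(env["r"] == 1, write_pre(env))
```

Only the hypothesis \((1=s)\) discharges the precondition; depending on the read of \(x\) alone leaves the valuation \(r=1, s=0\) as a counterexample.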

We can complete the execution by adding the required writes:

\[
\begin{align*}
& x := 1; y := 1 \parallel r := x; s := y; \text{if}(s>0)\{z := r*(s-1)\} \\
\textcolor{red}{W_x1} \quad \textcolor{red}{W_y1} \quad \textcolor{red}{R_x1} \quad \textcolor{red}{R_y1} \quad W_z0
\end{align*}
\]

In order for a PwP to be complete, all preconditions must be tautologies and all reads must be fulfilled by matching writes. The first requirement captures the sequential semantics. The second requirement captures the concurrent semantics. These correspond to two views of memory for each thread: thread-local and global. In a multicopy-atomic (mca) architecture, there is only one global view, shared by all processors, which is neatly captured by the order of the pomset (see §4).

An untaken conditional produces no events. PwP models this by including the empty pomset in the semantics of every program fragment. To then ensure that \textit{skip} is not a refinement of \(x := 1\), PwP includes a termination action, \(\checkmark\), which we have elided in the examples above.

2.3 Predicate Transformers For Relaxed Memory

PwP shows how the logical approach to sequential dependency calculation can be mixed into a relaxed memory model. Our contribution is to extend PwP with predicate transformers to arrive at a model of sequential composition. Predicate transformers are a good fit for logical models of dependency calculation, since both are concerned with preconditions.

Our first attempt is to associate a predicate transformer with each pomset. We visualize this in diagrams by showing how \( \psi \) is transformed, for example:

\[
\begin{align*}
\forall x \quad r := x & \quad s := y & \quad \text{if}\,(s>0)\{ z := r*(s-1) \} \\
\text{Rx1} & \quad (1=r) \Rightarrow \psi & \quad \text{Ry1} & \quad (1=s) \Rightarrow \psi & \quad (s>0) \land (r*(s-1)) = 0 & \quad \text{Wz0} & \quad \psi[r*(s-1)/z] \\
\end{align*}
\]

The predicate transformer for a write \( z := M \) matches Dijkstra: taking \( \psi \) to \( \psi[M/z] \). For a read \( r := x \), however, Dijkstra would transform \( \psi \) to \( \psi[x/r] \), which is equivalent to \( (x=r) \Rightarrow \psi \) under the assumption that registers are assigned at most once. Instead, we use \( (1=r) \Rightarrow \psi \), reflecting the fact that 1 may come from a concurrent write. The obligation to find a matching write is moved from the sequential semantics of substitution and implication to the concurrent semantics of fulfillment.

For a sequentially consistent semantics, sequential composition is straightforward: we apply each predicate transformer to subsequent preconditions, composing the predicate transformers.

\[
\begin{align*}
\forall x \quad r := x & \quad s := y & \quad \text{if}\,(s>0)\{ z := r*(s-1) \} \\
\text{Rx1} & \quad (1=r) \Rightarrow (1=s) \Rightarrow (s>0) \land (r*(s-1)) = 0 & \quad \text{Wz0} & \quad (1=r) \Rightarrow (1=s) \Rightarrow \psi[r*(s-1)/z] \\
\end{align*}
\]

This works for the sequentially consistent case, but needs to be weakened for the relaxed case.

The key observation of this paper is that rather than working with one predicate transformer, we should work with a family of predicate transformers, indexed by sets of events. For example, for single-event pomsets, there are two predicate transformers, since there are two subsets of any one-element set. The independent transformer is indexed by the empty set, whereas the dependent transformer is indexed by the singleton. We visualize this by including more than one transformed predicate, with a dotted edge leading to the dependent one (\( \cdots \)). For example:

\[
\begin{align*}
\forall x \quad r := x & \quad s := y \\
\psi \quad \text{Rx1} & \quad (1=r) \Rightarrow \psi & \quad \psi \quad \text{Ry1} & \quad (1=s) \Rightarrow \psi \\
\end{align*}
\]

The model of sequential composition then picks which predicate transformer to apply to an event’s precondition by picking the one indexed by all the events before it in causal order.
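The selection of a transformer by index set can be sketched as follows. This is an illustration under stated assumptions, not the formal definition: formulae are Python functions over register valuations, a family is a function from a set of event names to a transformer, and composing two families at the same index (`seq_family`) is assumed to mirror Dijkstra-style composition. All names are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, FrozenSet

Env = Dict[str, int]
Formula = Callable[[Env], bool]
Transformer = Callable[[Formula], Formula]
Family = Callable[[FrozenSet[str]], Transformer]


def read_family(event: str, register: str, value: int) -> Family:
    """Family for a read event: dependent if the event is in the index set."""
    def family(index: FrozenSet[str]) -> Transformer:
        if event in index:
            # Dependent transformer: psi becomes (value = register) => psi.
            return lambda psi: (lambda env: env[register] != value or psi(env))
        # Independent transformer: psi is unchanged.
        return lambda psi: psi
    return family


def seq_family(first: Family, second: Family) -> Family:
    """Assumed composition: apply both families at the same index set."""
    return lambda index: (lambda psi: first(index)(second(index)(psi)))


def is_tautology(formula: Formula, registers=("r", "s"), values=range(4)) -> bool:
    return all(
        formula(dict(zip(registers, vals)))
        for vals in product(values, repeat=len(registers))
    )


def write_pre(env: Env) -> bool:
    """Precondition of (W z 0) in if (s > 0) { z := r * (s - 1) }."""
    return env["s"] > 0 and env["r"] * (env["s"] - 1) == 0


reads = seq_family(read_family("Rx1", "r", 1), read_family("Ry1", "s", 1))

after_y = reads(frozenset({"Ry1"}))(write_pre)       # ordered after Ry1 only
after_none = reads(frozenset())(write_pre)           # ordered after no read
after_x = reads(frozenset({"Rx1"}))(write_pre)       # ordered after Rx1 only
```

Ordering the write after the read of \(y\) is what makes its precondition a tautology; ordering it after the read of \(x\) alone does not.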

For example, we can recover the expected semantics for (*) by choosing the predicate transformer which is independent of (Rx1) but dependent on (Ry1), which is the transformer which maps \( \psi \) to \( (1=s) \Rightarrow \psi \). (In subsequent diagrams, we only show predicate transformers for reads.)

\[
\begin{align*}
\forall x \quad r := x & \quad s := y & \quad \text{if}\,(s>0)\{ z := r*(s-1) \} \\
\text{Rx1} & \quad (1=r) \Rightarrow \psi & \quad \text{Ry1} & \quad (1=s) \Rightarrow (s>0) \land (r*(s-1)) = 0 & \quad \text{Wz0} \\
\end{align*}
\]

In the diagram, the dotted lines indicate set inclusion into the index of the transformer-family. As a quick correctness check, we can see that sequential composition is associative in this case, since it does not matter whether we associate to the left—with the intermediate step as in the diagram above, eliding the write action—or to the right—with the intermediate step:

\[
\begin{align*}
\forall x \quad r := x & \quad s := y & \quad \text{if}\,(s>0)\{ z := r*(s-1) \} \\
\psi \quad (1=s) \Rightarrow \psi \quad \text{Ry1} & \quad (1=s) \Rightarrow (s>0) \land (r*(s-1)) = 0 & \quad \text{Wz0} \\
\end{align*}
\]

This is an instance of the general result that sequential composition forms a monoid (Lemma 3.5).

3 SEQUENTIAL SEMANTICS

After some preliminaries (§3.1–3.2), we define the model and establish some basic properties (§3.3 and Fig. 1). We then explain the model using examples (§3.4–3.9). We encourage readers to skim the definitions and then skip to §3.4, coming back as needed.

In this section, we concentrate on the sequential semantics, ignoring the requirement that concurrent reads be fulfilled by matching writes. We extend the model to a full concurrent semantics in §4 and §6 by defining a reads-from relation (rf) subject to various constraints.

3.1 Preliminaries

The syntax is built from the following:

- a set of values \( \mathcal{V} \), ranged over by \( v, w, \ell, k \),
- a set of registers \( \mathcal{R} \), ranged over by \( r, s \),
- a set of expressions \( \mathcal{M} \), ranged over by \( M, N, L \).

Memory references, aka locations, are tagged values, written \([ \ell ]\). Let \( X \) be the set of memory references, ranged over by \( x, y, z \). We require that

- values and registers are disjoint,
- values are finite\(^1\) and include at least the constants 0 and 1,
- expressions include at least registers and values,
- expressions do not include memory references: \( M[N/x] = M \) (for all \( x \)).

We model the following language,

\[ \mu, \nu ::= \text{rlx} \mid \text{rel} \mid \text{acq} \mid \text{sc} \]

\[ S ::= r := M \mid r := [L]^\mu \mid [L]^\mu := M \mid F^\mu \mid \text{skip} \mid S_1 ; S_2 \mid \text{if}(M) \{ S_1 \} \text{else} \{ S_2 \} \mid S_1 \parallel S_2 \]

Access modes, \( \mu, \nu \), are relaxed (rlx), release (rel), acquire (acq), and sequentially consistent (sc). Reads \( r := [L]^\mu \) support rlx, acq, sc. Writes \( [L]^\mu := M \) support rlx, rel, sc. Fences \( F^\mu \) support rel, acq, sc. Register assignments \( r := M \) only affect thread-local state and therefore have no mode. In examples, the default mode for reads and writes is rlx; we systematically drop the annotation.

Commands, aka statements, \( S \), include fences and memory accesses at a given mode, as well as the usual structural constructs. Following Ferreira et al. [1996], \( \parallel \) denotes parallel composition, preserving thread state on the right after a join. In examples without join, we use the symmetric \( \mid \) operator.

We use common syntactic sugar, such as extended expressions, \( \mathcal{M} \), which include memory locations. For example, if \( \mathcal{M} \) includes a single occurrence of \( x \), then \( y := \mathcal{M}; S \) is shorthand for \( r := x; y := \mathcal{M}[r/x]; S \). Each occurrence of \( x \) in an extended expression corresponds to a separate read. We also write \( \text{if}(M) \{ S \} \) as shorthand for \( \text{if}(M) \{ S \} \text{else} \{ \text{skip} \} \).

Throughout §1–7 we require that each register is assigned at most once in a program. In §8, we drop this restriction, requiring instead that there are registers that do not appear in programs.

The semantics is built from the following:

- a set of events \( \mathcal{E} \), ranged over by \( e, d, c \), and subsets ranged over by \( E, D, C \),
- a set of logical formulae \( \Phi \), ranged over by \( \phi, \psi, \theta \),
- a set of actions \( \mathcal{A} \), ranged over by \( a, b \),
- a family of quiescence symbols \( Q_x \), indexed by location.

We require that

- formulae include \( \text{tt}, \text{ff} \), \( Q_x \), and the equalities \( (M=N) \) and \( (x=M) \),
- formulae are closed under ¬, ∧, ∨, ⇒, and substitutions [M/r], [M/x], [ϕ/Qx],
- there is a relation ⊨ between formulae, capturing entailment,
- ⊨ has the expected semantics for =, ¬, ∧, ∨, ⇒ and substitutions [M/r], [M/x], [ϕ/Qx],
- there is a subset of \( \mathcal{A} \), distinguishing read actions,
- there are four binary relations over \( \mathcal{A} \times \mathcal{A} \): delays and matches ⊆ blocks ⊆ overlaps.

\(^1\)We require finiteness for the semantics of address calculation (§8.4), which quantifies over all values. Using types, one could limit the finiteness assumption to the subset of values used for address calculation.

Logical formulae include equations over registers and memory references, such as (r=s+1) and (x=1). We use expressions as formulae, coercing M to M≠0.

We write \( \phi \equiv \psi \) when \( \phi \models \psi \) and \( \psi \models \phi \). We say \( \phi \) is a tautology if \( \text{tt} \equiv \phi \). We say \( \phi \) is unsatisfiable if \( \phi \equiv \text{ff} \), and satisfiable otherwise.

3.2 Actions in This Paper

In this paper, each action is either a read, a write, or a fence:

\[ a, b ::= R^\mu x v \mid W^\mu x v \mid F^\mu \]

We use shorthand when referring to actions. In definitions, we drop elements of actions that are existentially quantified. In examples, we drop elements of actions, using defaults. Let \( \sqsubseteq \) be the smallest order over access and fence modes such that rlx ⊑ rel ⊑ sc and rlx ⊑ acq ⊑ sc. We write \( (W^{\sqsupseteq rel}) \) to stand for either \( (W^{rel}) \) or \( (W^{sc}) \), and similarly for the other actions and modes.

Definition 3.1. Actions (R) are read actions.
We say a matches b if \( a = (W x v) \) and \( b = (R x v) \).
We say a blocks b if \( a = (W x) \) and \( b = (R x) \), regardless of value.
We say a overlaps b if they access the same location, regardless of whether they read or write.
Let \( \prec_{co} \) capture write-write, read-write, and write-read coherence: \( \prec_{co} = \{ (W x, W x), (R x, W x), (W x, R x) \} \). Let \( \prec_{sync} \) capture conflict due to synchronization:

\[ \prec_{sync} = \{ (a, W^{\sqsupseteq rel}), (a, F^{\sqsupseteq rel}), (R, F^{\sqsupseteq acq}), \ldots \} \]

Let \( \prec_{sc} \) capture conflict due to sc access: \( \prec_{sc} = \{ (W^{sc}, W^{sc}), (R^{sc}, W^{sc}), (W^{sc}, R^{sc}), (R^{sc}, R^{sc}) \} \).
We say a delays b if \( a \prec_{co} b \) or \( a \prec_{sync} b \) or \( a \prec_{sc} b \).
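The value-and-location parts of Definition 3.1 can be sketched directly. The code below is illustrative only: it assumes actions carry just a kind, a location, and a value, omits fences and access modes, and therefore covers matches, blocks, overlaps, and the coherence component of delays, but not the synchronization or sc components. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Action:
    kind: str   # "R" for read, "W" for write (fences omitted in this sketch)
    loc: str    # memory location, e.g. "x"
    val: int    # value read or written


def matches(a: Action, b: Action) -> bool:
    """a = (W x v) and b = (R x v)."""
    return a.kind == "W" and b.kind == "R" and a.loc == b.loc and a.val == b.val


def blocks(a: Action, b: Action) -> bool:
    """a = (W x) and b = (R x), regardless of value."""
    return a.kind == "W" and b.kind == "R" and a.loc == b.loc


def overlaps(a: Action, b: Action) -> bool:
    """Both actions access the same location."""
    return a.loc == b.loc


def coherence_delays(a: Action, b: Action) -> bool:
    """Coherence component only: (W x, W x), (R x, W x), (W x, R x)."""
    return a.loc == b.loc and (a.kind, b.kind) in {("W", "W"), ("R", "W"), ("W", "R")}


# A small universe of actions for checking matches ⊆ blocks ⊆ overlaps.
UNIVERSE = [Action(k, l, v) for k, l, v in product("RW", "xy", (0, 1))]
```

Over this universe every matching pair also blocks, and every blocking pair also overlaps, as required of the relations in §3.1.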

3.3 PwT: Pomsets with Predicate Transformers

Predicate transformers are functions on formulæ that preserve logical structure, providing a natural model of sequential composition. The definition follows Dijkstra [1975].

Definition 3.2. A predicate transformer is a function \( \tau : \Phi \to \Phi \) such that

\[ \begin{align*}
\tau(\psi_1 \land \psi_2) & \equiv \tau(\psi_1) \land \tau(\psi_2), \\
\tau(\psi_1 \lor \psi_2) & \equiv \tau(\psi_1) \lor \tau(\psi_2), \\
\tau(\psi_1 \Rightarrow \psi_2) & \equiv \tau(\psi_1) \Rightarrow \tau(\psi_2), \\
\text{if } \phi & \models \psi, \text{ then } \tau(\phi) \models \tau(\psi).
\end{align*} \]

We consistently use \( \psi \) as the parameter of predicate transformers. Note that substitutions (\( \psi[M/r] \) and \( \psi[M/x] \)) and implications on the right (\( \phi \Rightarrow \psi \)) are predicate transformers.
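As a worked instance, the following short derivation checks that implication on the right, \( \tau(\psi) = (\phi \Rightarrow \psi) \) for a fixed \( \phi \), distributes over conjunction and disjunction. It uses only the standard reading of \( \Rightarrow \) as \( \neg\phi \lor \psi \) and is an illustrative check rather than part of the formal development.

\[
\begin{align*}
\tau(\psi_1 \land \psi_2) &= \neg\phi \lor (\psi_1 \land \psi_2) \equiv (\neg\phi \lor \psi_1) \land (\neg\phi \lor \psi_2) = \tau(\psi_1) \land \tau(\psi_2), \\
\tau(\psi_1 \lor \psi_2) &= \neg\phi \lor (\psi_1 \lor \psi_2) \equiv (\neg\phi \lor \psi_1) \lor (\neg\phi \lor \psi_2) = \tau(\psi_1) \lor \tau(\psi_2).
\end{align*}
\]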

As discussed in §1, predicate transformers suffice for sequentially consistent models, but not relaxed models, where dependency calculation is crucial. For dependency calculation, we use a family of predicate transformers, indexed by sets of events. When computing \([S_1 ; S_2]\), we will use \( \tau^C \) as the predicate transformer for event \( e \in [S_2] \), where \( C \) includes all of the events in \([S_1] \) that precede \( e \) in causal order (\( d <_1 e \) implies \( d \in C \)). Under the following definition, the larger \( C \) is, the better, at least in terms of satisfying preconditions. Adding more order can only increase the size of \( C \). Thus more order means weaker preconditions.

\(^2\)This formalization includes release sequences \( (W^{rel} x, W x) \). Symmetry would suggest that we include \( (R x, R^{acq} x) \), but this is not sound for Arm8.

\(^3\)In addition to the three criteria of Def. 3.2, Dijkstra [1975] requires (x4′) \( \tau(\text{ff}) \equiv \text{ff} \). The dependent transformer for read actions (x4a) fails x4′, since \( \text{ff} \) is not equivalent to \( (v=r) \Rightarrow \text{ff} \). We can define an analog of x4′ for our model using the register naming conventions of §8. Define \( \theta_\lambda \) to capture the register state of a pomset: \( \theta_\lambda = \bigwedge_{\{(e,v) \in E \times \mathcal{V} \,\mid\, \lambda(e) = (R\,v)\}} (s_e = v) \), where \( E = \text{dom}(\lambda) \). We say that \( \phi \) is \( \lambda \)-inconsistent if \( \phi \land \theta_\lambda \) is unsatisfiable. We can then require (x4) if \( \psi \) is \( \lambda \)-inconsistent then \( \tau(\psi) \) is \( \lambda \)-inconsistent. x4 is not needed for the results of this paper, therefore we have elided it from the main development.

**Definition 3.3.** A family of predicate transformers over \( E \) consists of a predicate transformer \( \tau^D \) for each \( D \subseteq \mathcal{E} \), such that if \( C \cap E \subseteq D \) then \( \tau^C(\psi) \models \tau^D(\psi) \).

In a family of predicate transformers, the transformer of a smaller set must entail the transformer of a larger set. Thus bigger sets are better and \( \tau^E(\psi) \)—the transformer of the biggest set—is the best. (The definition is insensitive to events outside \( E \)—it is for this reason that we have taken \( D \subseteq \mathcal{E} \) rather than \( D \subseteq E \).)
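
The ordering requirement of Def. 3.3 can be checked by brute force for the two read transformers of Fig. 1. The following Python sketch is illustrative only. It assumes \( Q_x \) is taken to be tt, models formulae as Boolean functions of a register value `r` and a location value `x` over a small assumed domain `VALUES`, and uses hypothetical helper names (`entails`, `dependent`, `independent`) that are not part of the paper.

```python
from itertools import product

VALUES = range(3)  # assumed finite domain, for illustration only


def entails(p, q):
    """p |= q: every valuation of (r, x) satisfying p also satisfies q."""
    return all(q(r, x) for r, x in product(VALUES, VALUES) if p(r, x))


def dependent(v):
    """Read transformer when the read is in D (r4a, with Q_x = tt)."""
    return lambda psi: (lambda r, x: not (v == r) or psi(r, x))


def independent(v):
    """Read transformer when the read is not in D (r4b, with Q_x = tt)."""
    return lambda psi: (lambda r, x: not (v == r or x == r) or psi(r, x))


SAMPLES = [
    lambda r, x: r == 1,
    lambda r, x: r >= 0,
    lambda r, x: x == r,
    lambda r, x: False,
]

# The transformer of the smaller set (read excluded) must entail
# the transformer of the larger set (read included).
family_ok = all(entails(independent(1)(psi), dependent(1)(psi)) for psi in SAMPLES)

# The converse entailment does not hold for every sample formula.
converse_ok = all(entails(dependent(1)(psi), independent(1)(psi)) for psi in SAMPLES)
```

Under these assumptions `family_ok` holds while `converse_ok` does not, which matches the reading that the transformer of the larger set is the weaker, and therefore "better", one.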

**Definition 3.4.** A pomset with predicate transformers (PwT) is a tuple \((E, \lambda, \kappa, \tau, \checkmark, <)\) where

- (m1) \( E \subseteq \mathcal{E} \) is a set of events,
- (m2) \( \lambda : E \rightarrow \mathcal{A} \) defines an action for each event,
- (m3) \( \kappa : \mathcal{E} \rightarrow \Phi \) defines a precondition for each event, such that
  - (m3a) \( \kappa(e) \equiv \text{ff} \) if \( e \notin E \),
- (m4) \( \tau : 2^{\mathcal{E}} \rightarrow \Phi \rightarrow \Phi \) is a family of predicate transformers over \( E \),
- (m5) \( \checkmark : \Phi \) is a termination condition, such that
  - (m5a) \( \checkmark \models \tau^{E}(\text{tt}) \),
- (m6) \( {<} \subseteq E \times E \) is a strict partial order capturing causality.

A PwT is complete if

- (c3) \( \kappa(e) \) is a tautology (for every \( e \in E \)),
- (c5) \( \checkmark \) is a tautology.

We refer to PwTs simply as pomsets. Let \( P \) range over pomsets, and \( \mathcal{P} \) over sets of pomsets. Throughout the rest of this section, we endeavor to explain Fig. 1, which gives the semantics of programs \([\_]\). We use consistent sub- and super-scripts to refer to the components of a pomset. For example, \( <_1 \) is the order of \( P_1 \), \( <' \) is the order of \( P' \), and \( < \) is the order of \( P \). We also use consistent numbering. For example, item 3 always refers to \( \kappa \) and item 5 always refers to \( \checkmark \). As usual, we write \( d \leq e \) to mean \( d < e \) or \( d = e \).

The core of the model is a labeled partial order, including a set of events (m1), a labeling (m2), and an order (m6). On top of this basic structure, m3–m5 add a layer of logic. For each pomset, m5 provides a termination condition. For each event in a pomset, m3 provides a precondition. For each set of events in a pomset, m4 provides a predicate transformer. The partial order and the logic are tied together formally in the definition of \( \kappa'_2 \) in SEQ in Fig. 1, which calculates dependencies.

Before discussing the details, we note that the semantics satisfies the expected monoid laws, as well as some laws concerning the conditional. We have verified Lemma 3.5 and Lemma 3.6e in Coq\(^4\). Similar laws apply to parallel composition: for example, \([S] = [(\text{skip} \parallel S)]\). Note, however, that \([S] \neq [(S \parallel \text{skip})]\): this asymmetric operator throws away thread state from the left.

**Lemma 3.5.** (a) \([S] = [(S; \text{skip})] = [(\text{skip}; S)]\). (b) \([(S_1; S_2); S_3] = [S_1; (S_2; S_3)]\).

The proof of (a) requires m5a for the termination condition in \((S; \text{skip})\). The proof of (b) requires both conjunction closure (x1, for the termination condition) and disjunction closure (x2, for the predicate transformers themselves). The proof of (b) also requires that s6 enforce projection as well as inclusion (see the definition of respects in Fig. 1).

**Lemma 3.6.** (c) \([\text{if}(\phi)\{S_1\} \text{ else } \{S_2\}] \supseteq [S_1]\) if \( \phi \) is a tautology.
(d) \([\text{if}(\phi)\{S\} \text{ else } \{S\}] \supseteq [S]\).
(e) \([\text{if}(\phi)\{S_1; S_3\} \text{ else } \{S_2; S_3\}] \supseteq [\text{if}(\phi)\{S_1\} \text{ else } \{S_2\}; S_3]\).

\(^4\)Specifically, we have proven these results for the semantics of Fig. 1 with the refinements of §8.1 and §8.3.


In §8.3, we refine the semantics to validate the reverse inclusions for (d)–(f) using if-introduction. Although the semantics of Fig. 1 validates the reverse inclusions for (g), these do not hold for PwT-MCA (see §10).

The semantics is closed with respect to augmentation: \(P_2\) is an augment of \(P_1\) if all fields are equal except, perhaps, the order, where we require \(<_2 \supseteq <_1\).

**Lemma 3.7.** If \(P_1 \in [S]\) and \(P_2\) augments \(P_1\) then \(P_2 \in [S]\).

Augment closure captures the intuition that it is always sound for a compiler to make more conservative assumptions about dependencies than the semantics.

Unless otherwise noted, all pomsets in examples are complete and augment-minimal.

### 3.4 Pomsets and Complete Pomsets: Termination

Ignoring the logic, the definitions of Fig. 1 are straightforward. Reads, writes and fences map to pomsets with at most one event—we allow the empty pomset so that these may appear in the untaken branch of a conditional. skip and register assignment map to the empty pomset. The structural rules combine pomsets: PAR performs disjoint union, inheriting labeling and order from the two sides. SEQ and IF both perform a union.

We say that \(d \in E_1\) and \(e \in E_2\) coalesce if \(d = e\). As a trivial consequence of using union rather than disjoint union, s1 validates mumbling [Brookes 1996] by coalescing events. For example, \([x := 1; x := 1]\) includes the singleton pomset \((Wx1)\). From this it is easy to see that \([x := 1; x := 1] \supseteq [x := 1]\) is a valid refinement. It is equally obvious that \([x := 1] \supseteq [x := 1; x := 1]\) is not a valid refinement, since the latter includes a two-element pomset, but the former does not. (These are observationally distinguished by the context \([-] \parallel r := x; x := 2; s := x; \text{if}(r=s)\{z := 1\}\).)

In complete pomsets, c3 requires that all preconditions be tautologies. In order to allow complete pomsets with untaken conditionals, such as \(\text{if}(\text{ff})\{x := 1\}\), we allow the empty pomset in the semantics of all statements. Termination conditions ensure that the empty pomset is not used inappropriately. At top level, c5 requires that \(\checkmark\) is a tautology. w5 and f5 ensure that writes and fences are included in complete pomsets, unless they are inside an untaken conditional. For example, termination conditions ensure that \([x := 1] \not\supseteq [\text{skip}]\), since \([\text{skip}]\) includes the empty pomset with \(\checkmark \equiv \text{tt}\), but \([x := 1]\) can only include the empty pomset with \(\checkmark \equiv K(\emptyset) \equiv \text{ff}\).

For reads, the definition of \(\checkmark\) depends on the mode: relaxed reads may be elided in complete pomsets (r5a), but acquiring reads must be included (r5b). From this, it is easy to see that \([r := x] \supseteq [\text{skip}]\) is a valid refinement (where the default mode is rlx).

Note that \([x := 2]\) can write any value \(v\); the fact that \(v\) must be 2 is captured in the logic. In particular, w5 requires that \(\checkmark \equiv 2=v\) for this program and c5 requires that \(\checkmark\) be a tautology at top-level. In combination, these ensure that complete pomsets do not include bogus writes. Consider the following incomplete pomsets:

\[
\begin{array}{ccc}
x := 1 & x := 2 & \text{if}(M)\{x := 3\} \\
(Wx1) & (2=3 \mid Wx3) & (M \neq 0 \mid Wx3)
\end{array}
\]

By merging, the semantics allows the following:

\[
\begin{array}{c}
x := 1;\ x := 2;\ \text{if}(M)\{x := 3\} \\
(Wx1) \qquad (M \neq 0 \mid Wx3)
\end{array}
\]

However, this pomset is incomplete—regardless of \(M\)—since \(\checkmark \equiv 2=3 \equiv \text{ff}\).
If $P \in \text{SKIP}$ then $E = \emptyset$ and $\tau^D(\psi) \equiv \psi$ and $\checkmark \equiv \text{tt}$.

If $P \in \text{ASSIGN}(r, M)$ then $E = \emptyset$ and $\tau^D(\psi) \equiv \psi[M/r]$ and $\checkmark \equiv \text{tt}$.

Suppose $R_i$ is a relation in $E_i \times E_i$. We say $R$ respects $R_i$ if $R \supseteq R_i$ and $R \cap (E_i \times E_i) = R_i$.

If $P \in \text{PAR}(\mathcal{P}_1, \mathcal{P}_2)$ then $(\exists P_1 \in \mathcal{P}_1)$ $(\exists P_2 \in \mathcal{P}_2)$

(p1) $E = (E_1 \uplus E_2)$,

(p2) $\lambda = (\lambda_1 \cup \lambda_2)$,

(p3) $\kappa(e) \equiv \kappa_1(e) \lor \kappa_2(e)$,

(p4) $\tau^D(\psi) \equiv \tau_2^D(\psi)$,

(p5) $\checkmark \equiv \checkmark_1 \land \checkmark_2$,

(p6) ${<}$ respects ${<_1}$ and ${<_2}$.

If $P \in \text{SEQ}(\mathcal{P}_1, \mathcal{P}_2)$ then $(\exists P_1 \in \mathcal{P}_1)$ $(\exists P_2 \in \mathcal{P}_2)$

let $\kappa'_2(e) = \tau_1^C(\kappa_2(e))$ where $C = \{c \mid c < e\}$

(s1) $E = (E_1 \cup E_2)$,

(s2) $\lambda = (\lambda_1 \cup \lambda_2)$,

(s3) $\kappa(e) \equiv \kappa_1(e) \lor \kappa'_2(e)$,

(s4) $\tau^D(\psi) \equiv \tau_1^D(\tau_2^D(\psi))$,

(s5) $\checkmark \equiv \checkmark_1 \land \tau_1^{E_1}(\checkmark_2)$,

(s6) ${<}$ respects ${<_1}$ and ${<_2}$.

If $P \in \text{IF}(\phi, \mathcal{P}_1, \mathcal{P}_2)$ then $(\exists P_1 \in \mathcal{P}_1)$ $(\exists P_2 \in \mathcal{P}_2)$

(i1) $E = (E_1 \cup E_2)$,

(i2) $\lambda = (\lambda_1 \cup \lambda_2)$,

(i3) $\kappa(e) \equiv (\phi \land \kappa_1(e)) \lor (\neg \phi \land \kappa_2(e))$,

(i4) $\tau^D(\psi) \equiv (\phi \land \tau_1^D(\psi)) \lor (\neg \phi \land \tau_2^D(\psi))$,

(i5) $\checkmark \equiv (\phi \land \checkmark_1) \lor (\neg \phi \land \checkmark_2)$,

(i6) ${<}$ respects ${<_1}$ and ${<_2}$.

Let $K(D) = \bigvee_{d \in D} \kappa(d)$. Note that $K(\emptyset) \equiv \text{ff}$.

If $P \in \text{FENCE}(\mu)$ then

(f1) $|E| \leq 1$,

(f2) $\lambda(e) = F^\mu$,

(f3) $\kappa(e) \equiv \tt$,

(f4) $\tau^D(\psi) \equiv \psi$,

(f5) $\checkmark \equiv K(E)$.

If $P \in \text{WRITE}(x, M, \mu)$ then $(\exists v \in V)$

(w1) $|E| \leq 1$,

(w2) $\lambda(e) = W^\mu xv$,

(w3) $\kappa(e) \equiv M = v$,

(w4) $\tau^D(\psi) \equiv \psi[M/x][K(E)/Q_x]$,

(w5) $\checkmark \equiv K(E)$.

If $P \in \text{READ}(r, x, \mu)$ then $(\exists v \in V)$

(r1) $|E| \leq 1$,

(r2) $\lambda(e) = R^\mu xv$,

(r3) $\kappa(e) \equiv Q_x$,

(r4a) If $e \in E \cap D$ then $\tau^D(\psi) \equiv (\kappa(e) \Rightarrow v = r) \Rightarrow \psi$,

(r4b) If $e \in E \setminus D$ then $\tau^D(\psi) \equiv (\kappa(e) \Rightarrow (v = r \lor x = r)) \Rightarrow \psi$,

(r4c) If $E = \emptyset$ then $\tau^D(\psi) \equiv \psi$,

(r5a) If $\mu = \text{rlx}$ then $\checkmark \equiv \text{tt}$,

(r5b) If $\mu \neq \text{rlx}$ then $\checkmark \equiv K(E)$.

Fig. 1. PwT Semantics

Ignoring predicate transformers, p5 and s5 both take $\checkmark \equiv \checkmark_1 \land \checkmark_2$. This is as expected: the program terminates if both subprograms terminate. In i5, $\checkmark \equiv (\phi \land \checkmark_1) \lor (\neg \phi \land \checkmark_2)$: the program terminates as long as the taken branch terminates. Thus $[\text{if}(\text{tt})\{x := 1\} \text{ else } \{y := 1\}]$ contains a complete pomset with exactly one event: $(Wx1)$. To construct this pomset, we take the singleton from the left and the empty set from the right. This is a general principle: for code that contributes no events at top-level, use the empty set.
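
The termination calculation for the conditional can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: formulae are collapsed to Python Booleans because the guard and the written values are constants, an elided write is assumed to have termination condition ff (as described for $[x := 1]$ above), and the function names `write_termination` and `if_termination` are hypothetical.

```python
def write_termination(included, M, v):
    """Termination condition of a write: M=v if the event is present, ff if elided."""
    return included and (M == v)


def if_termination(phi, tick1, tick2):
    """Termination condition of a conditional: (phi and tick1) or (not phi and tick2)."""
    return (phi and tick1) or ((not phi) and tick2)


# if (tt) {x := 1} else {y := 1}: singleton pomset on the left, empty pomset on the right.
taken = write_termination(True, 1, 1)      # (Wx1) is present
untaken = write_termination(False, 1, 1)   # the write to y is elided
complete_tick = if_termination(True, taken, untaken)

# Eliding the write from the taken branch instead leaves the pomset incomplete.
elided_tick = if_termination(True, write_termination(False, 1, 1), untaken)
```

In this sketch only the termination condition of the taken branch matters, so the untaken branch may contribute the empty pomset without blocking completeness.
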

### 3.5 Preconditions, Predicate Transformers, and Data Dependencies

In this section, we ignore the $Q_x$ symbols that appear in the semantics of read and write, taking $Q_x = \text{tt}$, for all $x$. We also introduce the independent transformer for reads $(\text{r4b})$ without explaining why it is defined as it is. We take up both subjects in §3.8.

Preconditions are discharged during sequential composition by applying predicate transformers $\tau_1$ from the left to preconditions $\kappa_2(e)$ on the right. The specific rule is s3, which uses the transformed predicate $\kappa'_2(e) = \tau_1^C(\kappa_2(e))$, where $C = \{c \mid c < e\}$ is the set of events that precede $e$ in causal order. We call $C$ the dependent set for $e$. Then $E \setminus C$ is the independent set.

Before looking at the details, it is useful to have a high-level view of how nontrivial preconditions and predicate transformers are introduced.

Preconditions are introduced in:

- (w3) for data dependencies,
- (i3) for control dependencies.

Predicate transformers are introduced in:

- (r4a) for reads in the dependent set,
- (r4b) for reads in the independent set,
- (w4) for writes.

The rules track dependencies. We discuss data dependencies (w3) here and control dependencies (i3) in §3.6. We enrich the semantics to handle address dependencies in §8.4.

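A simple example of a data dependency is a pomset $P \in [r := x; y := r]$. If $P$ is complete, it must have two events. Then $\text{SEQ}$ (Fig. 1) requires $P_1 \in [r := x]$ and $P_2 \in [y := r]$ of the following form. (We only show the independent transformer for writes; ignoring $Q_x$, the dependent and independent transformers for writes are the same.)

$$
\begin{array}{lll}
 & r := x & y := r \\
\text{event} & (Rxv)^d & (r=w \mid Wyw)^e \\
\text{dependent transformer} & v=r \Rightarrow \psi & \\
\text{independent transformer} & (v=r \lor x=r) \Rightarrow \psi & \psi[r/y]
\end{array}
\qquad (\dagger)
$$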

First we consider the case that $v = w$. For example, if $v = w = 1$, we have:

$$
\begin{array}{llll}
 & r := x & y := r & r := x;\ y := r \\
\text{events} & (Rx1)^d & (r=1 \mid Wy1)^e & (Rx1)^d \qquad (\phi \mid Wy1)^e \\
\text{dependent transformer} & 1=r \Rightarrow \psi & & \\
\text{independent transformer} & (1=r \lor x=r) \Rightarrow \psi & \psi[r/y] &
\end{array}
$$

For the read, the dependent transformer $\tau_1^{\{d\}}$ is $1=r \Rightarrow \psi$; the independent transformer $\tau_1^{\emptyset}$ is $(1=r \lor x=r) \Rightarrow \psi$.

Looking at the precondition $φ$ of the write, recall that in order for $e$ to participate in a top-level pomset, the precondition $φ$ must be a tautology at top-level. There are two possibilities.

- If $d < e$ then we apply the dependent transformer and $φ \equiv (1 = r \Rightarrow r = 1)$, a tautology.
- If $d \not< e$ then we apply the independent transformer and $\phi \equiv ((1=r \lor x=r) \Rightarrow r=1)$. Under the assumption that $r$ is bound (see footnote 3), this is logically equivalent to $(x=1)$.

Eliding transformers and tautological preconditions, the two outcomes are:

$$
\begin{array}{cc}
r := x;\ y := r & r := x;\ y := r \\
(Rx1)^d \rightarrow (Wy1)^e & (Rx1)^d \qquad (x=1 \mid Wy1)^e
\end{array}
$$

The independent case on the right can only participate in a top-level pomset if the precondition $(x=1)$ is discharged. To do so, we can prepend a program that writes $1$ to $x$:

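$$
\begin{array}{ccc}
x := 1 & r := x;\ y := r & x := 1;\ r := x;\ y := r \\
(Wx1) & (Rx1)^d \qquad (x=1 \mid Wy1)^e & (Wx1) \qquad (Rx1)^d \qquad (Wy1)^e
\end{array}
$$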

Here we apply the transformer from the left $(ψ[1/x])$ to $(x = 1)$, resulting in the tautology $(1 = 1)$. 
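
The two preconditions of this example can be checked mechanically. The Python sketch below is illustrative only: it assumes a small finite domain `VALUES` for both $r$ and $x$, treats $r$ as bound by quantifying over every value of `r`, and uses hypothetical function names (`dep_pre`, `indep_pre`, `is_tautology_in_r`).

```python
VALUES = range(4)  # assumed finite domain for r and x


def is_tautology_in_r(pred, x):
    """True if pred(r, x) holds for every r in VALUES (r is treated as bound)."""
    return all(pred(r, x) for r in VALUES)


def dep_pre(r, x):
    """Dependent case: (1=r) => (r=1)."""
    return not (1 == r) or r == 1


def indep_pre(r, x):
    """Independent case: (1=r or x=r) => (r=1)."""
    return not (1 == r or x == r) or r == 1


# One entry per candidate local value of x.
dependent_holds = [is_tautology_in_r(dep_pre, x) for x in VALUES]
independent_holds = [is_tautology_in_r(indep_pre, x) for x in VALUES]
```

Over this domain the dependent precondition holds for every `x`, while the independent precondition holds exactly when `x == 1`, which is why a preceding write of 1 to $x$ discharges it.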

Now suppose that \( v \neq w \) in (†). Again there are two possibilities. Taking \( v=0 \) and \( w=1 \):

\[
\begin{array}{cc}
r := x;\ y := r & r := x;\ y := r \\
(Rx0)^d \rightarrow (0=r \Rightarrow r=1 \mid Wy1)^e & (Rx0)^d \qquad ((0=r \lor x=r) \Rightarrow r=1 \mid Wy1)^e
\end{array}
\]

Assuming that \( r \) is bound, both preconditions on \( e \) are unsatisfiable.

If a write is independent of a read, then clearly no order is imposed between them. For example, the precondition of \( e \) is a tautology in:

\[
\begin{array}{llll}
 & r := x & y := 1 & r := x;\ y := 1 \\
\text{events} & (Rx0)^d & (1=1 \mid Wy1)^e & (Rx0)^d \qquad ((0=r \lor x=r) \Rightarrow 1=1 \mid Wy1)^e \\
\text{independent transformer} & (0=r \lor x=r) \Rightarrow \psi & \psi[1/y] &
\end{array}
\]

Note that both r4a and r4b degenerate to the identity transformer when \( \kappa(e) \equiv \text{ff} \). This is the same as the transformer for the empty pomset (r4c).

Also note that \( [S_1 \parallel S_2] \) is asymmetric, taking the predicate transformer for \( S_2 \) in p4.

### 3.6 Control Dependencies

In \( IF(\phi, P_1, P_2) \), the predicate transformer (i4) is \( (\phi \land \tau_1^D(\psi)) \lor (\neg \phi \land \tau_2^D(\psi)) \), which is the disjunctive equivalent of Dijkstra's conjunctive formulation: \( (\phi \Rightarrow \tau_1^D(\psi)) \land (\neg \phi \Rightarrow \tau_2^D(\psi)) \).
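
The equivalence between the disjunctive form and the conjunction of the two implications can be confirmed by a truth table. The following Python sketch is only a propositional check: `phi`, `t1` and `t2` are assumed Boolean stand-ins for the guard and the two transformed formulae, and the function names are hypothetical.

```python
from itertools import product


def disjunctive(phi, t1, t2):
    """(phi and t1) or (not phi and t2)."""
    return (phi and t1) or ((not phi) and t2)


def conjunctive(phi, t1, t2):
    """(phi => t1) and (not phi => t2)."""
    return ((not phi) or t1) and (phi or t2)


# Compare the two forms on all eight valuations.
forms_agree = all(
    disjunctive(phi, t1, t2) == conjunctive(phi, t1, t2)
    for phi, t1, t2 in product([False, True], repeat=3)
)
```

Both forms select `t1` when the guard holds and `t2` otherwise, so they agree on every valuation.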

Control dependencies are introduced by the conditional. For coalescing events in \( E_1 \cap E_2 \), i3 requires \( (\phi \land \kappa_1(e)) \lor (\neg \phi \land \kappa_2(e)) \). For other events from \( E_1 \), it requires \( \phi \land \kappa_1(e) \), using m3a; symmetrically, for other events from \( E_2 \), it requires \( \neg \phi \land \kappa_2(e) \).

Control dependencies are eliminated in the same way as data dependencies. Consider:

\[
\begin{array}{lll}
 & r := x & \text{if}(r=1)\{y := 1\} \\
\text{event} & (Rxv)^d & (r=1 \mid Wy1)^e \\
\text{dependent transformer} & v=r \Rightarrow \psi & \\
\text{independent transformer} & (v=r \lor x=r) \Rightarrow \psi & \psi[1/y]
\end{array}
\]

As for (†), there are two possibilities:

\[
\begin{array}{cc}
r := x;\ \text{if}(r=1)\{y := 1\} & r := x;\ \text{if}(r=1)\{y := 1\} \\
(Rxv)^d \rightarrow (v=r \Rightarrow r=1 \mid Wy1)^e & (Rxv)^d \qquad ((v=r \lor x=r) \Rightarrow r=1 \mid Wy1)^e
\end{array}
\]

When events coalesce, i3 ensures that control dependencies are calculated semantically, rather than syntactically. For example, consider \( P \in [\text{if}(r=1)\{y := r\} \text{ else } \{y := 1\}] \), which is built from \( P_1 \in [y := r] \) and \( P_2 \in [y := 1] \):

\[
\begin{array}{ccc}
y := r & y := 1 & \text{if}(r=1)\{y := r\} \text{ else } \{y := 1\} \\
(r=1 \mid Wy1)^e & (1=1 \mid Wy1)^e & ((r=1 \land r=1) \lor (r \neq 1 \land 1=1) \mid Wy1)^e
\end{array}
\]

Here, the precondition in the combined pomset (on the right) is a tautology, independent of \( r \).
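
A brute-force check makes the point concrete. In this Python sketch (an illustration under the assumption of a small finite range `VALUES` for \( r \), with a hypothetical function name `combined_precondition`), the coalesced write has a tautological precondition, whereas a write that appears only in the then-branch keeps its control dependency.

```python
VALUES = range(4)  # assumed finite range for the register r


def combined_precondition(r):
    """(phi and kappa_1) or (not phi and kappa_2) for the coalesced write of 1 to y."""
    phi = (r == 1)
    kappa_then = (r == 1)   # precondition of (Wy1) arising from y := r
    kappa_else = (1 == 1)   # precondition of (Wy1) arising from y := 1
    return (phi and kappa_then) or ((not phi) and kappa_else)


coalesced_is_tautology = all(combined_precondition(r) for r in VALUES)

# If the event came from the then-branch only, the precondition would be phi and kappa_1.
then_only_is_tautology = all((r == 1) and (r == 1) for r in VALUES)
```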

The semantics allows common code to be lifted out of a conditional, validating the transformation \( [\text{if}(M)\{S\} \text{ else } \{S\}] \supseteq [S] \). The semantics also validates dead code elimination: if \( M \neq 0 \) is a tautology then \( [\text{if}(M)\{S_1\} \text{ else } \{S_2\}] \supseteq [S_1] \). Here, we take the empty pomset as the denotation of \( S_2 \). Since \( M=0 \) is unsatisfiable, i5 ignores the termination condition of \( S_2 \). It is worth noting that the reverse inclusion, dead-code-introduction, holds for complete pomsets, but not in general.

### 3.7 A Refinement: No Dependencies into Reads

To avoid stalling the CPU pipeline unnecessarily, hardware does not enforce control dependencies between reads. To support if-introduction (§8.3), software models must not distinguish control dependencies from other dependencies. Thus, we are forced to drop all dependencies into reads. To achieve this, we modify the definition of \( \kappa'_2 \) in Fig. 1.

\[
\kappa'_2(e) = \begin{cases}
    \tau_1^{E_1}(\kappa_2(e)) & \text{if } \lambda(e) \text{ is a read} \\
    \tau_1^{C}(\kappa_2(e)) & \text{otherwise, where } C = \{ c \mid c < e \}
\end{cases}
\]
Thus reads always use the "best" transformer, $\tau_1^{E_1}$. In order for non-reads to get a good transformer, they need to add order. Throughout the remainder of the paper, we use this definition.

### 3.8 Local State

Several of the JMM Causality Test Cases [Pugh 2004] center on compiler optimizations that result from limiting the range of variables. Because the compiler is allowed to collude with the scheduler when estimating the range, we refer to this as local invariant reasoning. The basic idea is that a write to $y$ is independent of a read of $x$ that precedes it, as long as the local state of $x$ prior to the read justifies the write. For example, consider **tc1:**

$$x := 0; (r := x; \text{if}(r \geq 0)\{y := 1\} \parallel x := y)$$

Using local invariant reasoning, a compiler could determine that $x$ is always either 0 or 1, and therefore that the write to $y$ does not depend on the read of $x$, allowing these to be reordered, resulting in the execution shown above. This is captured by our semantics as follows. Using r4b and w4, the precondition $\phi$ is $((1=r \lor x=r) \Rightarrow r \geq 0)[0/x]$, which is $((1=r \lor 0=r) \Rightarrow r \geq 0)$, which is indeed a tautology, justifying the independency. When used to form complete pomsets, r4b requires that subsequent preconditions be tautological under the assumption that the value of the read is used $(1=r)$ and under the assumption that the local value of $x$ is used instead $(x=r)$.
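
The effect of the substitution $[0/x]$ can be seen by evaluating the precondition directly. The Python sketch below is illustrative: it assumes a small range of integers `VALUES` that includes negative values, and the function name `precondition` is hypothetical.

```python
VALUES = range(-2, 3)  # assumed range, including negative values


def precondition(r, x):
    """(1=r or x=r) => r >= 0."""
    return not (1 == r or x == r) or r >= 0


# After the initializing write x := 0, the location x is replaced by 0.
after_write = all(precondition(r, 0) for r in VALUES)

# Without knowledge of the local state of x, the formula is not valid.
before_write = all(precondition(r, x) for r in VALUES for x in VALUES)
```

Over this range the formula is valid once $x$ is fixed to 0, and fails otherwise (for example when both `x` and `r` are negative), which is the sense in which the local state of $x$ justifies the write.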

This requires that we put locations into logical formulae, in addition to registers. While logical formulae involving registers are discharged by predicate transformers from **ASSIGN** or **READ** (Fig. 1), logical formulae involving locations are discharged by predicate transformers from **WRITE**. In other words, registers track the value of reads, whereas locations track the value of the most recent local write. This provides a local view of memory, distinct from the global view manifest in the labels on events. See [Jagadeesan et al. 2020] for further discussion.

A related concern arises when eliding changes to local state from the untaken branch of a conditional, creating indirect dependencies. Consider the following example [Paviotti et al. 2020, §6.3]:

$$x := 1;\ r := y;\ \text{if}(r=0)\{x := 0;\ s := x;\ \text{if}(s)\{z := 1\}\} \text{ else } \{s := x;\ \text{if}(s)\{z := 1\}\} \parallel \text{if}(z)\{y := 1\}$$

In SC executions, the left thread always takes the then-branch of the conditional, reading 0 for $x$ and therefore not writing $z$. As a result the second thread does not write $y$, and the program is data-race-free under SC. To satisfy the DRF-SC theorem, no other executions should be possible. Complete executions of the left thread that take the then-branch must include $(Wx0)$, whereas those that take the else-branch must not include $(Wx0)$. A problem arises if events from the subsequent code of the left thread—common to the two branches—coalesce, thus removing an essential control dependency. Consider the following candidate execution:

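$$
\begin{array}{c}
x := 1;\ r := y;\ \text{if}(r=0)\{x := 0;\ s := x;\ \text{if}(s)\{z := 1\}\} \text{ else } \{s := x;\ \text{if}(s)\{z := 1\}\} \parallel \text{if}(z)\{y := 1\} \\
(Wx1) \qquad (Ry1) \qquad (Rx1) \rightarrow (\phi \mid Wz1) \qquad (Rz1) \qquad (Wy1)
\end{array}
\qquad (\dagger\dagger\dagger)
$$

Note that the write to $z$ depends on the read of $x$, but not the read of $y$. Ignoring $Q_x$, as we have done up to now, the precondition $\phi$ is:

$$\phi \equiv (1=r \lor y=r) \Rightarrow \big((r=0 \land (1=s \Rightarrow s \neq 0)) \lor (r \neq 0 \land (1=s \Rightarrow s \neq 0))\big)$$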

Since $(1=s)$ implies $(s \neq 0)$, the precondition is a tautology and $(† † †)$ is allowed, violating DRF-SC.

---

\(^5\)TC6 and TC8–9 are similar. TC2 and TC17–18 require both local invariant reasoning and resolving the nondeterminism of reads using redundant read elimination (see §8.1).
Without \( Q_x \), the semantics enforces the direct dependency of \((Wz1)\) on \((Rx1)\), but not its indirect dependency on \((Ry1)\). By eliding \((Wx0)\), we have forgotten the local state of \( x \) in the untaken branch of the execution. Nonetheless, we are using the subsequent, stale read of \( x \), by merging it with the read from the taken branch. This half-stale merged read is then used to justify \((Wz1)\).

In Fig. 1, r4 corrects this by introducing quiescence symbols into predicate transformers. Quiescence symbols capture the intuition that, in the untaken branch of a conditional, the value of a read from \( x \) can only be used if the most recent local write to \( x \) is included in the execution. Quiescence symbols are eliminated from formulae by the closest preceding write (w4). With quiescence, the precondition of (†††) becomes the following:

\[
\begin{aligned}
\phi' \equiv (Q_y \Rightarrow (1=r \lor y=r)) \Rightarrow \big(\ & (r=0 \land ((Q_x[\text{ff}/Q_x] \Rightarrow 1=s) \Rightarrow s\neq 0)) \\
\lor\ & (r\neq 0 \land ((Q_x[1=1/Q_x] \Rightarrow 1=s) \Rightarrow s\neq 0))\ \big)
\end{aligned}
\]

Adding initializing writes, \( Q_y \) becomes \( tt \) at top-level. Regardless, \( \phi' \) is non-tautological: in the first disjunct, we have lost the ability to use \( 1=s \) to prove \( s\neq 0 \). Intuitively, \( Q_x \) is true when the local state of \( x \) is up to date, and false when it is stale. In order to read \( x \), \( Q_x \) requires that the most recent prior write to \( x \) must be in the pomset.
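
The difference that quiescence makes can be checked by enumeration. The following Python sketch is a hedged illustration: it assumes the precondition of the write to \( z \) has the shape \( (1=r \lor y=r) \Rightarrow ((r=0 \land A) \lor (r\neq 0 \land B)) \), takes \( Q_y \) to be tt, ranges \( r \), \( s \) and \( y \) over a small assumed domain, and represents the value substituted for \( Q_x \) in each branch by the hypothetical parameters `q_then` and `q_else`.

```python
from itertools import product

VALUES = range(3)  # assumed finite domain for r, s and y


def implies(a, b):
    return (not a) or b


def phi(r, s, y, q_then, q_else):
    """Precondition of (Wz1); q_then and q_else stand for Q_x in each branch."""
    then_part = r == 0 and implies(implies(q_then, 1 == s), s != 0)
    else_part = r != 0 and implies(implies(q_else, 1 == s), s != 0)
    return implies(1 == r or y == r, then_part or else_part)


def tautology(q_then, q_else):
    return all(phi(r, s, y, q_then, q_else) for r, s, y in product(VALUES, repeat=3))


without_quiescence = tautology(True, True)   # Q_x ignored (taken as tt everywhere)
with_quiescence = tautology(False, True)     # then-branch elides (Wx0), so Q_x becomes ff
```

Without quiescence the formula is valid, so the problematic execution would be admitted; with the then-branch value of \( Q_x \) set to ff, the valuation `r = 0, s = 0, y = 0` falsifies it.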

We also include quiescence symbols directly in preconditions of reads (r3). This guarantees initialization in complete pomsets: every \((Rx)\) must have a sequentially preceding \((Wx)\) in order to eliminate the precondition \( Q_x \).

We end this subsection by noting that the value range analysis of MRD [Paviotti et al. 2020] is overly conservative. Consider the following execution:

\[
x := 0;\ (r := x;\ \text{if}(r \leq 1)\{x := 2;\ y := 1\} \parallel x := y)
\]

\[
(Wx0) \qquad (Rx1) \qquad (Wx2) \qquad (Wy1) \qquad (Ry1) \qquad (Wx1)
\]

PwT correctly allows this execution; MRD forbids it by requiring \((Rx1) \rightarrow (Wy1)\). The co-product mechanism in MRD seeks an isomorphic justification under the \((Rx2)\) branch of the read in the event structure and, failing to find such a justification, leaves the dependency in place.

### 3.9 The Burdens of Associativity

Many of the design choices in PwT are motivated by Lemma 3.5—in particular, the need for sequential composition to be associative. In this subsection, we give three examples.

First, the predicate transformers we have chosen for r4a and r4b are different from the ones used traditionally, which are written using substitution. Attempting to write r4a and r4b in this style we would have (as in [Jagadeesan et al. 2020]):

- (r4a′) if \( e \in E \cap D \) then \( \tau^D(\psi) \equiv \psi[v/r] \),
- (r4b′) if \( e \in E \setminus D \) then \( \tau^D(\psi) \equiv \psi[v/r] \land \psi[x/r] \).

r4b′ does not distribute through disjunction (x2), and therefore is not a predicate transformer. This is not merely a theoretical inconvenience: adopting r4b′ would also break associativity. Consider the following example, where "!" represents logical negation:

\[
\begin{array}{lccc}
 & r := y & x := !r & x := !!r \\
\text{events} & (Ry1) & (r = 0 \mid Wx1) & (r \neq 0 \mid Wx1) \\
\text{independent transformer} & \psi[1/r] \land \psi[y/r] & &
\end{array}
\]

Associating to the right, we coalesce the writes then prepend the read:

\[
\begin{array}{ccc}
r := y & (x := !r;\ x := !!r) & r := y;\ (x := !r;\ x := !!r) \\
(Ry1) & (r = 0 \lor r \neq 0 \mid Wx1) & (Ry1) \qquad (Wx1)
\end{array}
\]

Associating to the left, we instead prepend the read to the first write and then coalesce with the second write:

\[
\begin{array}{ccc}
(r := y;\ x := !r) & x := !!r & (r := y;\ x := !r);\ x := !!r \\
(Ry1) \qquad (1=0 \land y=0 \mid Wx1) & (r \neq 0 \mid Wx1) & (Ry1) \qquad (\phi \mid Wx1)
\end{array}
\]

The precondition \( \phi \) is \( (1=0 \land y=0) \lor (1\neq 0 \land y\neq 0) \), which is not a tautology.

Our solution is to Skolemize, replacing substitution by implication, with uniquely chosen registers. Using Skolemization, Fig. 1 computes \( \phi' \equiv ((1=r \lor y=r) \Rightarrow r=0) \lor ((1=r \lor y=r) \Rightarrow r\neq 0) \), which is equivalent to \( \phi'' \equiv (1=r \lor y=r) \Rightarrow (r=0 \lor r\neq 0) \). Both are tautologies.
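
The contrast between the two styles can be reproduced on this example. The Python sketch below is illustrative only: it assumes a small finite domain `VALUES`, fixes the value read to `V = 1`, and uses hypothetical names `subst` (the substitution style \( \psi[v/r] \land \psi[y/r] \)) and `skolem` (the implication style \( (v=r \lor y=r) \Rightarrow \psi \)).

```python
VALUES = range(3)  # assumed finite domain for r and y
V = 1              # value read by (Ry1)


def subst(psi):
    """Substitution style: psi[v/r] and psi[y/r], as a predicate on y."""
    return lambda y: psi(V) and psi(y)


def skolem(psi):
    """Implication style: (v=r or y=r) => psi, as a predicate on (r, y)."""
    return lambda r, y: not (V == r or y == r) or psi(r)


def is_zero(r):
    return r == 0


def non_zero(r):
    return r != 0


def either(r):
    return is_zero(r) or non_zero(r)


# Does the transformer of a disjunction equal the disjunction of the transformers?
subst_distributes = all(
    subst(either)(y) == (subst(is_zero)(y) or subst(non_zero)(y)) for y in VALUES
)
skolem_distributes = all(
    skolem(either)(r, y) == (skolem(is_zero)(r, y) or skolem(non_zero)(r, y))
    for r in VALUES
    for y in VALUES
)
```

On this domain the substitution style fails to distribute at `y = 0`, where the transformed disjunction is true but neither transformed disjunct is, while the implication style distributes for every valuation.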

Second, Jagadeesan et al. impose consistency, which requires that for every pomset \( P \), \( \bigwedge_e \kappa(e) \) is satisfiable. Associativity requires that we allow inconsistent preconditions. To see this, note that

\[
(\text{if}(M)\{x:=1\};\ \text{if}(!M)\{x:=1\});\ (\text{if}(M)\{y:=1\};\ \text{if}(!M)\{y:=1\})
\]

has a complete pomset that writes \( x \) and \( y \), regardless of \( M \). In order to match this in

\[
\text{if}(M)\{x:=1\};\ (\text{if}(!M)\{x:=1\};\ \text{if}(M)\{y:=1\});\ \text{if}(!M)\{y:=1\}
\]

the middle pomset must include the inconsistent actions \((M=0 \mid Wx1)\) and \((M\neq 0 \mid Wy1)\).
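
The role of the inconsistent middle pomset can be illustrated as follows. This Python sketch is an assumption-laden illustration: `M` ranges over a small assumed set of values, the dictionary `middle` holds the two preconditions named above, and the dictionary `outer` holds the complementary preconditions that the surrounding conditionals are assumed to contribute when events coalesce.

```python
M_VALUES = range(3)  # assumed range for the value of M

# Preconditions in the middle pomset: (M=0 | Wx1) and (M!=0 | Wy1).
middle = {"Wx1": lambda M: M == 0, "Wy1": lambda M: M != 0}

# Preconditions assumed to come from the outer conditionals for the same events.
outer = {"Wx1": lambda M: M != 0, "Wy1": lambda M: M == 0}

# Consistency would demand a single M satisfying every precondition at once.
middle_consistent = any(all(k(M) for k in middle.values()) for M in M_VALUES)

# Coalescing takes the disjunction of preconditions, event by event.
merged_complete = all(
    middle[e](M) or outer[e](M) for e in middle for M in M_VALUES
)
```

The middle pomset is inconsistent in the sense above, yet after coalescing with the outer conditionals every precondition becomes valid, so rejecting it would lose the complete pomset.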

Finally, we drop Jagadeesan et al.’s causal strengthening for the same reason: Consider

\[
\text{if}(M)\{r:=x\}; y:=r; \text{if}(M)\{s:=x\}
\]

Associating to the right, this program has a complete pomset containing \((Wy1)\). Associating to the left, with causal strengthening, it does not.

## 4 \( \text{PwT-MCA: POMSETS WITH PREDICATE TRANSFORMERS FOR MCA} \)

In this section, we develop a model of concurrent computation by adding \( \text{reads-from} \) to Fig. 1. To model coherence and synchronization, we add delay to the rule for sequential composition. For MCA architectures, it is sufficient to encode delay in the pomset order. The resulting model, \( \text{PwT-MCA1} \), supports optimal lowering for relaxed access on Arm8, but requires extra synchronization for acquiring reads. (Lowering is the translation of language-level operators to machine instructions. A lowering is optimal if it provides the most efficient execution possible.)

A variant, \( \text{PwT-MCA2} \), supports optimal lowering for all access modes on Arm8. To achieve this, \( \text{PwT-MCA2} \) drops the global requirement that \( \text{reads-from} \) implies pomset order (m7c). The models are the same, except for internal reads, where a thread reads its own write. We show an example at the beginning of §4.2. The lowering proofs can be found in the supplementary material. The proofs use recent alternative characterizations of Arm8 [Alglave et al. 2021].

### 4.1 \( \text{PwT-MCA1} \)

We define \( \text{PwT-MCA1} \) by extending Def. 3.4 and Fig. 1. The definition uses several relations over actions (matches, blocks, and delays) as well as a distinguished set of read actions; see §3.2.

**Definition 4.1.** The definition of \( \text{PwT-MCA1} \) extends that of \( \text{PwT} \) with a relation \( \text{rf} \) such that

- (m7) \( \text{rf} \subseteq E \times E \) is an injective relation capturing reads-from, such that
  - (m7a) if \( d \xrightarrow{\text{rf}} e \) then \( \lambda(d) \) matches \( \lambda(e) \),
  - (m7b) if \( d \xrightarrow{\text{rf}} e \) and \( \lambda(c) \) blocks \( \lambda(e) \) then either \( c \leq d \) or \( e \leq c \),
  - (m7c) if \( d \xrightarrow{\text{rf}} e \) then \( d < e \).

The definition of completeness extends Def. 3.4 as follows:

- (c7) if \( \lambda(e) \) is a read then there is some \( d \xrightarrow{\text{rf}} e \).

The semantic function extends Fig. 1 as follows:

- (s6a) if \( \lambda_1(d) \) delays \( \lambda_2(e) \) then \( d \leq e \),
- (p7) (s7) (i7) \( \text{rf} \) respects \( \text{rf}_1 \) and \( \text{rf}_2 \).
\]
In complete pomsets, reads-from (rf) must pair every read with a matching write (c7). The requirements m7a, m7b, and m7c guarantee that reads are fulfilled, as in [Jagadeesan et al. 2020, §2.7]. Parallel composition, sequential composition, and the conditional respect reads-from (p7, s7, i7).

From Def. 3.1, recall that a delays b if a \(\rightarrow_{\text{co}}\) b or a \(\rightarrow_{\text{sync}}\) b or a \(\rightarrow_{\text{sc}}\) b. S6a guarantees that sequential order is enforced between conflicting accesses of the same location (\(\rightarrow_{\text{co}}\)), into a release and out of an acquire (\(\rightarrow_{\text{sync}}\)), and between SC accesses (\(\rightarrow_{\text{sc}}\)). Combined with the fulfillment requirements (m7a, m7b, m7c), these ensure coherence, publication, subscription and other idioms. For example, consider the following:

\[
x := 0;\ x := 1;\ y^{\text{rel}} := 1 \parallel r := y^{\text{acq}};\ s := x
\]

\[
(Wx0) \rightarrow (Wx1) \rightarrow (W^{\text{rel}} y1) \rightarrow (R^{\text{acq}} y1) \rightarrow (Rx0) \rightarrow (Wx1)
\qquad (\text{PUB})
\]

The execution is disallowed due to the cycle. All of the order shown is required at top-level. The intra-thread order comes from s6a: \((Wx0) \rightarrow (Wx1)\) is required by \(\rightarrow_{\text{co}}\), while \((Wx1) \rightarrow (W^{\text{rel}} y1)\) and \((R^{\text{acq}} y1) \rightarrow (Rx0)\) are required by \(\rightarrow_{\text{sync}}\). The cross-thread order is required by fulfillment: c7 requires that all top-level reads are in the image of \(\text{rf}\). m7a ensures that \((W^{\text{rel}} y1) \xrightarrow{\text{rf}} (R^{\text{acq}} y1)\), and m7c subsequently ensures that \((W^{\text{rel}} y1) < (R^{\text{acq}} y1)\). The antidependency \((Rx0) \rightarrow (Wx1)\) is required by m7b. (Alternatively, we could have \((Wx1) \rightarrow (Wx0)\), again resulting in a cycle.)
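
The cycle in this execution can be detected mechanically. The Python sketch below is a minimal illustration: the event names are plain strings chosen for this example, the edge list transcribes the order described above, and `has_cycle` is a hypothetical helper implementing a standard depth-first search.

```python
def has_cycle(edges):
    """Return True if the directed graph given by (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True  # back edge: nxt is on the current path
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)


pub_edges = [
    ("Wx0", "Wx1"),          # conflicting writes to x (s6a)
    ("Wx1", "Wrel_y1"),      # order into a release (s6a)
    ("Wrel_y1", "Racq_y1"),  # reads-from implies order (m7c)
    ("Racq_y1", "Rx0"),      # order out of an acquire (s6a)
    ("Rx0", "Wx1"),          # antidependency from blocking (m7b)
]

pub_is_cyclic = has_cycle(pub_edges)
without_blocking = has_cycle(pub_edges[:-1])
```

Dropping the blocking edge removes the cycle in this sketch, which reflects that the antidependency is what rules the execution out.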

The semantics gives the expected results for store buffering and load buffering, as well as litmus tests involving fences and SC access. The model of coherence is weaker than C11, in order to support common subexpression elimination, and stronger than Java, in order to support local reasoning about data races. For further examples, see [Jagadeesan et al. 2020, §3.1].

Lemmas 3.5 and 3.6 hold for PwT-MCA$_1$. We discuss 3.6g further in §10.

### 4.2 PwT-MCA2

Lowering PwT-MCA$_1$ to Arm8 requires a full fence before every acquiring read.$^7$ To see why, consider the following attempted execution, where the final values of both \(x\) and \(y\) are 2.

\[
x := 2; \ r := x_{\text{acq}}; \ y := r - 1 \parallel y := 2; \ x_{\text{rel}} := 1
\]

\[
\text{Wx2} \rightarrow (\text{Racq} x_2) \rightarrow \text{Wy1} \rightarrow \text{Wy2} \rightarrow (\text{Wrel} x_1)
\]

(INTERNAL-ACQ)

The execution is allowed by Arm8, but disallowed by PWT-MCA1, due to the cycle.

Arm8 allows the execution because the read of \(x\) is internal to the thread. This aspect of Arm8 semantics is difficult to model locally. To capture this, we found it necessary to drop m7c and relax s6a, adding local constraints on rf to PAR, SEQ and IF. (For parallelism, we explicitly specify the domain of \(d\) and \(e\) in s6a'.)

Definition 4.2. The definition of PWT-MCA2 is derived from that of PWT-MCA1 by removing m7c and s6a and adding the following:

(p6a) if \(d \in E_1, e \in E_2\) and \(d \xrightarrow{\text{rf}} e\) then \(d < e\),

(p6b) if \(d \in E_1, e \in E_2\) and \(e \xrightarrow{\text{rf}} d\) then \(e < d\),

(s6a') if \(d \in E_1, e \in E_2\) and \(\lambda_1(d)\) delays \(\lambda_2(e)\) then either \(d \xrightarrow{\text{rf}} e\) or \(d \leq e\),

$^6$ We use different colors for arrows representing order:

- \(d \rightarrow e\) arises from \(\rightarrow_{\text{co}}\) (s6a), \(d \rightarrow e\) arises from \(\rightarrow_{\text{sc}}\) (s6a), \(d \rightarrow e\) arises from \(\rightarrow_{\text{sc}}\) (s6a'),

- \(d \rightarrow e\) arises from \(\rightarrow_{\text{sync}}\) or \(\rightarrow_{\text{sc}}\) (s6a), \(d \rightarrow e\) arises from blocking (m7b),

- \(d \rightarrow e\) arises from control/data/address dependency (s3, definition of \(x_1^2(d)\)).

In PWT-MCA2, it is possible for rf to contradict \(<\). In this case, we use a dotted arrow for rf: \(d \rightarrow_d e\) indicates that \(e < d\).

$^7$ Jagadeesan et al. [2020] erroneously elide the required synchronization on acquiring reads.

p6a and p6b ensure that $d \xrightarrow{\text{rf}} e$ implies $d < e$ when the actions come from different threads. However, we may have $d \xrightarrow{\text{rf}} e$ and $e < d$ within a thread, as between $(\text{W}x2)$ and $(\text{R}^{\text{acq}}x2)$ in INTERNAL-ACQ, thus allowing this execution. m7b and s6a' are sufficient to stop stale reads within a thread. For example, they prevent a read of 1 in $x := 1; x := 2; r := x$.
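The difference between the two models on INTERNAL-ACQ can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the event names are hypothetical, the non-rf edges are our reading of the displayed execution (including the coherence edge from the releasing write back to the write of 2, which follows from the final value of \(x\) being 2), and `is_acyclic` relies on the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter, CycleError

def is_acyclic(edges):
    """True if the (src, dst) pairs can be extended to a strict partial order."""
    sorter = TopologicalSorter()
    for src, dst in edges:
        sorter.add(dst, src)  # dst must come after src
    try:
        sorter.prepare()
        return True
    except CycleError:
        return False

# Thread of each event in INTERNAL-ACQ (hypothetical event names).
thread = {"Wx2": 1, "Racq_x2": 1, "Wy1": 1, "Wy2": 2, "Wrel_x1": 2}

# Reads-from: the acquiring read is fulfilled by a write in its own thread.
rf = [("Wx2", "Racq_x2")]

# Order that does not come from rf (assumed from the displayed execution).
other_order = [
    ("Racq_x2", "Wy1"),  # out of an acquire
    ("Wy1", "Wy2"),      # coherence: the final value of y is 2
    ("Wy2", "Wrel_x1"),  # into a release
    ("Wrel_x1", "Wx2"),  # coherence: the final value of x is 2
]

def mca1_order():
    # m7c: every rf pair is also ordered.
    return other_order + rf

def mca2_order():
    # p6a/p6b: only cross-thread rf pairs are forced into the order.
    return other_order + [(d, e) for d, e in rf if thread[d] != thread[e]]

print(is_acyclic(mca1_order()))  # cyclic, so disallowed
print(is_acyclic(mca2_order()))  # acyclic, so allowed
```

Because the only rf pair is internal to the first thread, dropping it from the order is exactly what breaks the cycle in this sketch.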

With the weakening of s6a, we must be careful not to allow spurious pairs to be added to the rf relation. For example, $[\![\text{if}(b)\{r := x \parallel x := 1\}\ \text{else}\ \{r := x;\ x := 1\}]\!]$ must not include a pomset with $(\text{W}x1) \xrightarrow{\text{rf}} (\text{R}x1)$ and $(\text{R}x1) < (\text{W}x1)$, taking rf from the left and $<$ from the right. The use of “respects” in i6 and i7 ensures this.

As a consequence of dropping m7c, sequential rf must be validated during pomset construction, rather than post-hoc. In §6, we show how to construct program order (po) for complete pomsets using phantom events ($\pi$). Using this construction, the following lemma gives a post-hoc verification technique for rf. Let $\pi^{-1}$ be the inverse of $\pi$.

**Lemma 4.3.** If $P \in [S]_{mca2}$ is complete, then for every $d \xrightarrow{\text{rf}} e$ either

- **external fulfillment:** $d < e$ and if $\lambda(c)$ blocks $\lambda(e)$ then either $c \leq d$ or $e \leq c$, or
- **internal fulfillment:** $(\exists d' \in \pi^{-1}(d))\ (\exists e' \in \pi^{-1}(e))$ $d' \xrightarrow{\text{po}} e'$ and $(\nexists c)$ $\lambda(c)$ blocks $\lambda(e)$ and $d' \xrightarrow{\text{po}} c \xrightarrow{\text{po}} e'$.

These mimic the external consistency requirements of Arm8 [Alglave et al. 2021].

## 5 PwT-MCA RESULTS

Prop. 6.1 of Jagadeesan et al. [2020] establishes a compositional principle for proving that programs validate formulae in past-time temporal logic. The principle is based entirely on the pomset order relation. Its proof, and all of the no-thin-air examples in [Jagadeesan et al. 2020, §6], hold equally for the models described here.

In the supplementary material, we show that PwT-MCA$_1$ supports the optimal lowering of relaxed accesses to Arm8 and that PwT-MCA$_2$ supports the optimal lowering of all accesses to Arm8. The proofs are based on two recent characterizations of Arm8 [Alglave et al. 2021]. For PwT-MCA$_1$, we use External Global Consistency. For PwT-MCA$_2$, we use External Consistency.

In the supplementary material, we also sketch a proof of sequential consistency for local-data-race-free programs. The proof uses program order, which we construct for C11 in §6. The same construction works for PwT-MCA. (This proof sketch assumes there are no RMW operations.)

The semantics validates many peephole optimizations, such as reorderings on relaxed access:

\[
\begin{align*}
[\![r := x; s := y]\!] & \supseteq [\![s := y; r := x]\!] & \text{if } r \not= s \\
[\![x := M; y := N]\!] & \supseteq [\![y := N; x := M]\!] & \text{if } x \not= y \\
[\![x := M; s := y]\!] & \supseteq [\![s := y; x := M]\!] & \text{if } x \not= y \text{ and } s \not\in \text{id}(M)
\end{align*}
\]

Here id($M$) is the set of locations and registers that occur in $M$. Using augmentation closure, the semantics also validates roach-motel reorderings [Sevčík 2008]. For example, on read/write pairs:

\[
\begin{align*}
[\![x^\mu := M; s := y]\!] & \supseteq [\![s := y; x^\mu := M]\!] & \text{if } x \not= y \text{ and } s \not\in \text{id}(M) \\
[\![x := M; s := y^\mu]\!] & \supseteq [\![s := y^\mu; x := M]\!] & \text{if } x \not= y \text{ and } s \not\in \text{id}(M)
\end{align*}
\]

Notably, the semantics does not validate read-introduction. When combined with if-introduction (§8.3), read-introduction can break temporal reasoning. This combination is allowed by speculative operational models. See §9 for a discussion.

## 6 PwT-C11: POMSETS WITH PREDICATE TRANSFORMERS FOR C11

PwT can be used to generate semantic dependencies to prohibit thin-air executions of C11, while preserving optimal lowering for relaxed access. We follow the approach of Paviotti et al. [2020],
using our semantics to generate C11 candidate executions with a dependency relation, then applying
the axioms of RC11 [Lahav et al. 2017]. The No-Thin-Air axiom of RC11 is overly restrictive,
requiring that $(rf \cup po)$ be acyclic. Instead, we require that $(rf \cup <)$ is acyclic. This is a more precise
categorization of thin-air behavior, and it allows aggressive compiler optimizations that would be
erroneously forbidden by RC11’s original No-Thin-Air axiom.

The chief difficulty is instrumenting our semantics to generate program order, for use in the
various axioms of C11. Using the obvious construction (described in the proof of Lemma 6.2),
program order $(po)$ is a pre-order, which may include cycles due to coalescing. For example:

\[
\begin{array}{c}
\text{if} (r) \{ x := 1 ; y := 1 \} \text{else} \{ y := 1 ; x := 1 \}
\end{array}
\]

We solve this by adding phantom events. The function $\pi$ maps phantom events to real events. For
this program, we have the following PwT-po. (We visualize $po$ using a dotted arrow \( \cdots \mapsto \), and $\pi$
using a double arrow \( \cdots \rightarrow \).)

\[
\begin{array}{c}
\text{Wx1} \mapsto \cdots \mapsto \text{Wy1}
\end{array}
\]

Once the pomset is completed, $r$ will be known, causing all the preconditions to be either
tautological or unsatisfiable. We can then extract program order by restricting phantom events to
have tautological preconditions (Def. 6.3). Thus, our strategy for C11 is to first construct a
complete PwT-po, then extract top-level program order, then apply the axioms of RC11. We refer to a
PwT-po that survives this filtering as a PwT-C11.

**Definition 6.1.** A PwT-po is a PwT (Def. 3.4) equipped with relations $\pi$ and $po$ such that

1. (m8) $\pi : (E \rightarrow E)$ is an idempotent function capturing merging, such that

   \[
   \text{let } R = \{ e \mid \pi(e) = e \} \text{ be real events, let } \overline{R} = (E \setminus R) \text{ be phantom events,}
   \]
   \[
   \text{let } S = \{ e \mid \forall d.\ \pi(d) = e \Rightarrow d = e \} \text{ be simple events, let } \overline{S} = (E \setminus S) \text{ be compound events,}
   \]
   \[
   \text{(m8a)} \; \lambda(e) = \lambda(\pi(e)), \quad \text{(m8b)} \text{ if } e \in \overline{S} \text{ then } \kappa(e) \equiv \bigvee_{\{c \in \overline{R} \mid \pi(c) = e\}} \kappa(c).
   \]

2. (m9) $po \subseteq (S \times S)$ is a partial order capturing program order.

A PwT-po is complete if

\[
\text{(c3) if } e \in R \text{ then } \kappa(e) \text{ is a tautology,} \quad \text{(c5) } \checkmark \text{ is a tautology.}
\]

A complete PwT-po is a PwT-C11 if it additionally satisfies the axioms of RC11.

Since $\pi$ is idempotent, we have $\pi(\pi(e)) = \pi(e)$. Equivalently, we could require $\pi(e) \in R$.

We use $\pi$ to partition events $E$ in two ways: we distinguish real events $R$ from phantom events
$\overline{R}$; we distinguish simple events $S$ from compound events $\overline{S}$. From idempotency, it follows that all
phantom events are simple ($\overline{R} \subseteq S$) and all compound events are real ($\overline{S} \subseteq R$). In addition, all
phantom events map to compound events (if $e \in \overline{R}$ then $\pi(e) \in \overline{S}$).
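The two partitions induced by $\pi$ are easy to compute for a concrete merge function. The Python sketch below is only an illustration: the function name `classify` and the four-event example are our own, and $\pi$ is represented as a dictionary from event names to event names.

```python
def classify(pi):
    """Split the events of an idempotent merge function pi (a dict) into
    real/phantom and simple/compound events, following (m8)."""
    events = set(pi)
    if any(pi[pi[e]] != pi[e] for e in events):
        raise ValueError("pi is not idempotent")
    real = {e for e in events if pi[e] == e}
    phantom = events - real
    # e is simple if the only event mapped to e (if any) is e itself.
    simple = {e for e in events
              if all(d == e for d in events if pi[d] == e)}
    compound = events - simple
    return real, phantom, simple, compound

# Hypothetical example: "a" is unmerged, while "m" results from merging
# the phantom events "m1" and "m2".
pi = {"a": "a", "m": "m", "m1": "m", "m2": "m"}
real, phantom, simple, compound = classify(pi)

print(sorted(real), sorted(phantom))
print(sorted(simple), sorted(compound))
```

On this example the phantom events are simple, the compound event is real, and every phantom event maps to a compound event, matching the three consequences of idempotency stated above.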

**Lemma 6.2.** If $P$ is a PwT then there is a PwT-po $P''$ that conservatively extends it.

**Proof.** The proof strategy is as follows: We extend the semantics of Fig. 1 with $po$. The obvious
definition gives us a preorder rather than a partial order. To get a partial order, we replay the
semantics without merging to get an unmerged pomset $P'$; the construction also produces the
map $\pi$. We then construct $P''$ as the union of $P$ and $P'$, using the dependency relation from $P$.

We extend the semantics with $po$ as follows. For pomsets with at most one event, $po$ is the
identity. For sequential composition, $po = po_1 \cup po_2 \cup E_1 \times E_2$. For parallel composition and the
conditional, $po = po_1 \cup po_2$. As noted at the beginning of this section, $po$ may contain cycles. To
find an acyclic $po'$, we replay the construction of $P$ to get $P'$. When building $P'$, we require disjoint
union in $s_1$ and $s_2$: $E' = E'_1 \uplus E'_2$. If an event is unmerged in $P$ ($e \in E_1 \cup E_2$) then we choose the same
event name for \( P' \). If an event is merged in \( P \) (\( e \in E_1 \cap E_2 \)) then we choose fresh event names, \( e'_1 \) and \( e'_2 \), and extend \( \pi \) accordingly: \( \pi(e'_1) = \pi(e'_2) = e \). In \( P' \), we take \( \leq' = po' \).

To arrive at \( P'' \), we take (1) \( E'' = E \cup E' \), (2) \( \lambda'' = \lambda \cup \lambda' \), (3a) if \( e \in E \) then \( \kappa''(e) = \kappa(e) \), (3b) if \( e \in E' \setminus E \) then \( \kappa''(e) = \kappa'(e) \), (4) \( r''D = r''(\pi^{-1}(D)) \), (5) \( \checkmark'' = \checkmark' \), (6) \( d <'' e \) exactly when \( \pi(d) < \pi(e) \), (7) \( po'' = po' \), and (8) \( \pi'' \) is the constructed merge function.
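The need for the unmerged replay can be seen on the conditional from the start of this section. The Python sketch below is a simplification under our own assumptions: a pomset is reduced to a pair of an event set and a po relation, the helper names (`event`, `seq`, `cond`, `is_antisymmetric`) and the `_then`/`_else` event names are hypothetical, and preconditions and dependency order are ignored.

```python
def event(name):
    """A single-event pomset: po is the identity."""
    return ({name}, {(name, name)})

def seq(p1, p2):
    """Sequential composition: po = po1 | po2 | E1 x E2."""
    (e1, po1), (e2, po2) = p1, p2
    return (e1 | e2, po1 | po2 | {(d, e) for d in e1 for e in e2})

def cond(p1, p2):
    """Conditional (and parallel composition): po = po1 | po2."""
    (e1, po1), (e2, po2) = p1, p2
    return (e1 | e2, po1 | po2)

def is_antisymmetric(po):
    return all(d == e or (e, d) not in po for (d, e) in po)

# if(r){x := 1; y := 1} else {y := 1; x := 1}, with events merged by name.
merged = cond(seq(event("Wx1"), event("Wy1")),
              seq(event("Wy1"), event("Wx1")))

# The same program replayed with fresh (phantom) names in each branch.
unmerged = cond(seq(event("Wx1_then"), event("Wy1_then")),
                seq(event("Wy1_else"), event("Wx1_else")))
pi = {"Wx1_then": "Wx1", "Wx1_else": "Wx1",
      "Wy1_then": "Wy1", "Wy1_else": "Wy1"}

print(is_antisymmetric(merged[1]))    # merged po is only a preorder
print(is_antisymmetric(unmerged[1]))  # unmerged po is a partial order
```

Mapping the unmerged po through `pi` gives back the merged preorder, which is the sense in which the phantom events refine the real ones in this sketch.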

**Definition 6.3.** For a PwT-po, let extract(\( P \)) be the projection of \( P \) onto the set \{ \( e \in E_1 \mid e \) is simple and \( \kappa_1(e) \) is a tautology \}.

By definition, extract(\( P \)) includes the simple events of \( P \) whose preconditions are tautologies. These are already in program order, as per item 7 of the proof. The dependency order is derived from the real events using \( \pi \), as per item 6.

The following lemma (immediate from m8b) shows that if \( P \) is complete, then extract(\( P \)) includes at least one simple event for every compound event in \( P \).

**Lemma 6.4.** If \( P \) is a complete PwT-po with compound event \( e \), then there is a phantom event \( c \in \pi^{-1}(e) \) such that \( \kappa(c) \) is a tautology.

A pomset in the image of extract is a C11 candidate execution. As an example, consider Java Causality Test Case 6 [Pugh 2004]. Taking \( w = 0 \) and \( v = 1 \), the PwT-po on the left below produces the candidate execution on the right.

![Diagram](image)

We write \([.]^{po}\) for the semantic function defined by applying the construction of Lemma 6.2 to the base semantics of Fig. 1.

The dependency calculation of \([.]^{po}\) is sufficient for C11; however, it ignores synchronization and coherence completely. For example, consider:

\[
\begin{align*}
\text{if} (r) \{ x := 1 \}; & \text{ if} (s) \{ x := 2 \}; \text{ if} (!r) \{ x := 1 \} \\
\text{if} (r) \{ x := 1 \}; & \text{ if} (s) \{ x := 2 \}; \text{ if} (!r) \{ x := 1 \}
\end{align*}
\]

Adding a pair of reads to complete the pomset, we can extract the following candidate executions.

![Diagram](image)

It is somewhat surprising that the writes are independent of both reads!

In PwT-mca, delay stops the merge in (\( \ddagger \)).

![Diagram](image)

It is possible to mimic this in PwT-C11, without introducing extra dependencies: one can filter executions post-hoc using the relation \( \leq \), defined as follows:

\[ \pi(d) \leq \pi(e) \text{ if } d \ldots^{po} e \text{ and } \lambda(d) \text{ delays } \lambda(e). \]

In (\( \ddagger \)), we have both \( d \leq e \) and \( e \leq d \). To rule out (\( \ddagger \)), it suffices to require that \( \leq \) is a partial order.
Table 1. Tool results for supported Java Causality Test Cases [Pugh 2004]. \( ^\perp \) indicates the tool failed to run for this test due to a memory overflow. Tests run on an Intel i9-9980HK with 64 GB of memory. For context, results for the MRD, MRD_{imm}, and MRD_{c11} are also included [Paviotti et al. 2020].

<table>
<thead>
<tr>
<th>Test</th>
<th>PwT-C11</th>
<th>MRD</th>
<th>MRD_{imm}</th>
<th>MRD_{c11}</th>
</tr>
</thead>
<tbody>
<tr>
<td>TC1</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>TC2</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>TC3</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>TC4</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>TC5</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>TC6</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>TC7</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>TC8</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>

Program (\( \ddagger \)) shows that the definition of semantic dependency is up for debate in C11. The International Standard Organization’s C++ concurrency subgroup acknowledges that semantic dependency (sdep) would address the Out-of-Thin-Air problem: "Prohibiting executions that have cycles in (rf \( \cup \) sdep) can therefore be expected to prohibit Out-of-Thin-Air behaviors" [McKenney et al. 2016]. PwT-C11 resolves program structure into a dependency relation, rather than a complex state, that is precise and easily adjusted. As refinements are made to C11, PwT-C11 can accommodate these and test them automatically.

## 7 PwTER: AUTOMATIC LITMUS TEST EVALUATOR

PwTER automatically and exhaustively calculates the allowed outcomes of litmus tests for the PwT, PwT-po, and PwT-C11 models, obviating the need for error-prone hand evaluation. It is built in OCaml, using Z3 [de Moura and Bjørner 2008] to judge the truth of predicates.

PwTER allows several modes of evaluation: it can evaluate the rules of Fig. 1, implementing PwT; it can generate program order according to §6, implementing PwT-po; and similar to MRD [Paviotti et al. 2020], it can construct C11-style pre-executions and filter them according to the rules of RC11 as described in §6, implementing PwT-C11. Finally, PwTER also allows us to toggle the complete check of Def. 3.4, providing an interface for understanding how fragments of code might compose by exposing preconditions and termination conditions that are not yet tautologies.

We have run PwTER over the Java Causality Tests [Pugh 2004] supported in the input syntax, and tabulated the results in Table 1. For context, we have included the results of MRD for the Java Causality tests [Paviotti et al. 2020]. Note that MRD and MRD_{c11} do not give the correct outcome on TC17–18—the reason is that local invariant reasoning in MRD is too constrained (see §3.8).

For larger test cases, the tool takes exponentially longer to compute, and for the largest tests the memory footprint is too large for even a well-equipped computer. The compositional nature of the semantics makes tool building practical, but it is not enough to make it scalable for large tests. In combination with the rules for reads and writes, the definitions of SEQ(\( P_1, P_2 \)) and IF(\( \phi, P_1, P_2 \)) have exponential complexity. This is compounded by the hidden complexity of calculating the possible merges between pomsets through union in rules s1 and i1. Significant effort has been put into throwing away spurious merges early in PwTER, so that executing the tool remains manageable for small examples. Some further optimizations may be possible within the tool to improve the situation further, such as killing “dead-end” pomsets at each sequence operator, or by doing a directed search for particular execution outcomes. PwTER is available in the supplementary material.

## 8 REFINEMENTS AND ADDITIONAL FEATURES

In the paper so far, we have assumed that registers are assigned at most once. We have done this primarily for readability. In the first subsection below, we drop this assumption, instead using substitution to rename registers. We use a set of registers indexed by event identifier: \( S_E = \{ s_e \mid e \in \mathcal{E} \} \). By assumption (§3.1), these registers do not appear in programs: \( S[N/s_e] = S \). The resulting semantics satisfies redundant read elimination.

In the remainder of this section we consider several mostly-orthogonal features: address calculation, if-introduction, and read-modify-write operations. Address calculation and if-introduction do have some interaction, and we spell out the combined semantics in §8.5.

It is worth pointing out that address calculation and if-introduction only affect the semantics of read and write. RMWs introduce new infrastructure in order to ensure atomicity while supporting Arm’s load-exclusive and store-exclusive operations. The resulting semantics satisfies redundant read elimination.

These extensions preserve all of the program transformations discussed thus far, and apply equally to the various semantics we have discussed: PwT, PwT-MCA\(_1\), PwT-MCA\(_2\), and PwT-C11. The results discussed in §5 also apply equally, with the exception of RMWs, which are excluded from the proof of LDRF-SC and from the proof of lowering to Arm8.

### 8.1 Register Recycling and Redundant Read Elimination

JMM Test Case 2 [Pugh 2004] states the following execution should be allowed “since redundant read elimination could result in simplification of \( r=s \) to true, allowing \( y:=1 \) to be moved early.”

\[
\begin{align*}
&\text{r := x; s := x; if(r=s)\{y := 1\} \parallel x := y} \\
&(\text{R} x_1 \leftarrow \text{W} y_1) \text{ and } (\text{R} y_1 \leftarrow \text{W} x_1)
\end{align*}
\]

(\text{TC2})

Under the semantics of Fig. 1, the precondition of \( e \) in the independent case is

\[
(1=r \lor x=r) \Rightarrow (1=s \lor r=s) \Rightarrow (r=s), \tag{\star}
\]

which is equivalent to \((x=r) \Rightarrow (1=s) \Rightarrow (r=s)\), which is not a tautology, and thus Fig. 1 requires order from \( d \) to \( e \) in order to complete the pomset.

This execution is allowed, however, if we rename registers using a map from event names to register names. By using this renaming, coalesced events must choose the same register name. In the above example, the precondition of \( e \) in the independent case becomes

\[
(1=s_e \lor x=s_e) \Rightarrow (1=s_e \lor s_e=s_e) \Rightarrow (s_e=s_e), \tag{\star\star}
\]

which is a tautology. In \((\star\star)\), the first read resolves the nondeterminism in both the first and the second read. Given the choice of event names, the outcome of the second read is predetermined!

In \((\star)\), the second read remains nondeterministic, even if the events are destined to coalesce.
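The contrast between \((\star)\) and \((\star\star)\) can be checked by brute force. The Python sketch below is an illustration under our own assumptions: the helper names are hypothetical, and values range over the small domain \(\{0,1,2\}\), which is enough to exhibit a counterexample to \((\star)\) but is not by itself a proof of validity over all values.

```python
from itertools import product

def implies(p, q):
    return (not p) or q

def is_tautology(formula, names, domain=range(3)):
    """Check a formula on every assignment of domain values to names."""
    return all(formula(**dict(zip(names, vals)))
               for vals in product(domain, repeat=len(names)))

def star(r, s, x):
    # (1=r or x=r) => (1=s or r=s) => (r=s)
    return implies(r == 1 or x == r, implies(s == 1 or r == s, r == s))

def star_simplified(r, s, x):
    # (x=r) => (1=s) => (r=s)
    return implies(x == r, implies(s == 1, r == s))

def star_star(se, x):
    # Both reads renamed to the shared register s_e.
    return implies(se == 1 or x == se,
                   implies(se == 1 or se == se, se == se))

print(is_tautology(star, ["r", "s", "x"]))     # not a tautology
print(is_tautology(star_star, ["se", "x"]))    # a tautology
print(star(r=0, s=1, x=0))                     # a falsifying assignment
```

The assignment \(r=0\), \(s=1\), \(x=0\) falsifies \((\star)\): the first read sees the initial value while the second read sees 1, which is exactly the nondeterminism that renaming to \(s_e\) removes.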

Test Cases 17–18 [Pugh 2004] also require coalescing of reads. Contrary to the claim, the semantics of Jagadeesan et al. validates neither redundant load elimination nor these test cases.

Definition 8.1. Let \([\cdot]\) be defined as in Fig. 1, changing r4 of READ:

(r4a) if \( e \in E \cap D \) then \( \tau^D(\psi) \equiv (\kappa(e) \Rightarrow v=s_e) \Rightarrow \psi[s_e/r] \),

(r4b) if \( e \in E \setminus D \) then \( \tau^D(\psi) \equiv (\kappa(e) \Rightarrow (v=s_e \lor x=s_e)) \Rightarrow \psi[s_e/r] \),

(r4c) if \( E = \emptyset \) then \( \tau^D(\psi) \equiv (\forall s) \psi[s/r] \).

With this semantics, it is straightforward to see that redundant load elimination is sound:

\[ [r := x^\mu; s := x^\mu] \supseteq [r := x^\mu; s := r] \]
As a further example, consider Fig. 5 of Sevčík and Aspinall [2008], referenced by Paviotti et al. [2020, §6.4]. Consider the case where the reads are merged, both seeing 1:

\[ r := y; \text{if}(r=1)\{ s := y; x := s \}\text{else}\{ x := 1 \} \]

In order to be independent of both reads, we take the precondition \( \phi \) to be:

\[ (1=r \lor y=r) \Rightarrow [r=1 \land ((1=s \lor y=s) \Rightarrow s=1)] \lor [r \neq 1] \]

Then collapsing \( r \) and \( s \) and substituting the initial value of \( y \) (say 0), we have a tautology:

\[ (1=r \lor 0=r) \Rightarrow [r=1 \land ((1=r \lor 0=r) \Rightarrow r=1)] \lor [r \neq 1] \]
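That the collapsed formula is a tautology follows from a short case analysis on \( r \); the layout below is our own presentation of that check.

\[
\begin{array}{ll}
r = 1: & \text{the first disjunct holds, since } r=1 \text{ and } ((1=r \lor 0=r) \Rightarrow r=1) \text{ are both true;} \\
r = 0: & \text{the second disjunct } r \neq 1 \text{ holds;} \\
r \notin \{0,1\}: & \text{the antecedent } (1=r \lor 0=r) \text{ is false, so the implication holds vacuously.}
\end{array}
\]

By contrast, before collapsing \( r \) and \( s \), the assignment \( r=1 \), \( s=0 \) (with \( y=0 \)) falsifies \( \phi \): the antecedent holds, \( r \neq 1 \) fails, and \( (1=s \lor y=s) \Rightarrow s=1 \) fails.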

Support for register recycling requires predicate transformers, which allow substitution, rather than simple postconditions.

### 8.2 Read-Modify-Write Operations

To support rmws, we extend the syntax:

\[
S ::= \cdots \mid r := \text{CAS}^{\mu,\nu}([L], M, N) \mid r := \text{FADD}^{\mu,\nu}([L], M) \mid r := \text{EXCHG}^{\mu,\nu}([L], M)
\]

We require that \( r \) does not occur in \( L \). Semantically, we add a relation \( \text{rmw} \subseteq E \times E \) that relates the read of a successful rmw to the succeeding write.

**Definition 8.2.** Extend the definition of a pomset as follows.

\((\text{m}10)\) \( \text{rmw} : E \rightarrow E \) is a partial function capturing read-modify-write atomicity, such that:

\[(\text{m}10a)\] if \( d \xrightarrow{\text{rmw}} e \) then \( \lambda(e) \text{ blocks } \lambda(d) \),

\[(\text{m}10b)\] if \( d \xrightarrow{\text{rmw}} e \) then \( d < e \),

\[(\text{m}10c)\] if \( \lambda(c) \text{ overlaps } \lambda(d) \) and \( d \xrightarrow{\text{rmw}} e \) then \( c < e \) implies \( c \leq d \) and \( d < c \) implies \( e \leq c \).

Extend the definition of \( \text{SEQ}, \text{IF} \) and \( \text{PAR} \) to include:

\((\text{s}10) \) \((\text{p}10)\) \( \text{rmw} = (\text{rmw}_1 \cup \text{rmw}_2) \).

Let \( \text{READ}' \) be defined as for \( \text{READ} \), adding the constraint:

(r4d) if \( (E \cap D) = \emptyset \) then \( \tau^D(\psi) \equiv \psi \).

If \( P \in \text{CAS}(r, x, M, N, \mu, \nu) \) then \( P \in \text{SEQ}(\text{READ}'(r, x, \mu), IF(r=M, WRITE(x, N, \nu), \text{SKIP})) \) and

\[(\text{u}10)\] if \( \lambda(e) \) is a write then there is a read \( \lambda(d) \) such that \( \kappa(e) \equiv \kappa(d) \) and \( d \xrightarrow{\text{rmw}} e \).

\[ [r := \text{CAS}^{\mu,\nu}(x, M, N)] = \text{CAS}(r, x, M, N, \mu, \nu) \]

\( \text{FADD} \) and \( \text{EXCHG} \) are similar. These definitions ensure atomicity and support lowering to Arm load/store exclusive operations. See Jagadeesan et al. [2020] for examples.
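The atomicity requirement m10c can be read as a check on a finished pomset. The Python sketch below is a minimal illustration under our own assumptions: `satisfies_m10c` is a hypothetical helper, "overlaps" is approximated by "accesses the same location", and the order `lt` is given explicitly as a set of pairs.

```python
def satisfies_m10c(events, loc, lt, rmw):
    """Check (m10c): no overlapping event may be ordered strictly
    between the read d and the write e of an rmw pair."""
    def le(a, b):
        return a == b or (a, b) in lt

    for d, e in rmw:
        for c in events:
            if loc[c] != loc[d]:      # assumption: overlap = same location
                continue
            if (c, e) in lt and not le(c, d):
                return False          # c < e must imply c <= d
            if (d, c) in lt and not le(e, c):
                return False          # d < c must imply e <= c
    return True

# Hypothetical events: R and W form an rmw pair on x, C is another write to x.
events = ["R", "W", "C"]
loc = {"R": "x", "W": "x", "C": "x"}
rmw = [("R", "W")]

between = {("R", "C"), ("C", "W"), ("R", "W")}  # C sits inside the rmw
after = {("R", "W"), ("W", "C"), ("R", "C")}    # C follows the rmw
before = {("C", "R"), ("R", "W"), ("C", "W")}   # C precedes the rmw

print(satisfies_m10c(events, loc, between, rmw))
print(satisfies_m10c(events, loc, after, rmw))
print(satisfies_m10c(events, loc, before, rmw))
```

Only the order that places the interfering write between the read and the write of the rmw is rejected.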

One subtlety of the definition is that we use \( \text{READ}' \) rather than \( \text{READ} \): for rmws, the independent case for a read is the same as the empty case. To see why this should be, consider the relaxed variant of the \( \text{cdRF} \) example from Lee et al. [2020], using \( \text{READ} \) rather than \( \text{READ}' \).

\[ x := 0; (r := \text{FADD}^{\text{rlx},\text{rlx}}(x, 1); \text{if}(!r)\{ \text{if}(y)\{ x := 0 \} \} \parallel r := \text{FADD}^{\text{rlx},\text{rlx}}(x, 1); \text{if}(!r)\{ y := 1 \}) \]

\[ Wx0 \quad Rx0 \xrightarrow{\text{rmw}} Wx1 \quad Ry1 \rightarrow Wx0 \quad Rx0 \xrightarrow{\text{rmw}} Wx1 \quad Wy1 \]

A write should only be visible to one \( \text{FADD} \) instruction, but here the write of 0 is visible to two! This is allowed because, using \( \text{READ} \) instead of \( \text{READ}' \), no order is required from \( \text{Rx}0 \) to \( \text{Wy}1 \) in the last thread.
To see why, consider the independent transformers of the last thread and initializer:

\[ x := 0 \]

\[ r := \text{FADD}^{\text{rlx},\text{rlx}}(x, 1) \]

\[ \text{if(!} r \text{)} \{ y := 1 \} \]

After sequencing, the precondition of \((Wy1)\) is 0 = r, which is not a tautology. This forces any top-level pomset to include dependency order from \((Rx0)\) to \((Wy1)\).

### 8.3 If-Introduction (aka Case Analysis)

In order to model sequential composition, we must allow inconsistent predicates in a single pomset, unlike PwP [Jagadeesan et al. 2020]. For example, if \(S = (x := 1)\), then the semantics of Fig. 1 does not allow:

\[ \text{if}(M) \{ x := 1 \}; S; \text{if}(!M) \{ x := 1 \} \]

However, if \(S = (\text{if}(!M) \{ x := 1 \}; \text{if}(M) \{ x := 1 \})\), then it does allow the execution. Looking at the initial program:

\[ \text{if}(M) \{ x := 1 \} \]

\[ x := 1 \]

\[ \text{if}(!M) \{ x := 1 \} \]

The difficulty is that the middle action can coalesce either with the right action, or the left, but not both. Thus, we are stuck with some non-tautological precondition. Our solution is to allow a pomset to contain many events for a single action, as long as the events have disjoint preconditions.

Def. 8.3 allows the execution, by splitting the middle command:

\[ \text{if}(M) \{ x := 1 \} \]

\[ x := 1 \]

\[ \text{if}(!M) \{ x := 1 \} \]

Coalescing events gives the desired result.

This is not simply a theoretical question; it is observable. For example, the semantics of Fig. 1 does not allow the following, since it must add order in the first thread from the read of y to one of the writes to x.

\[ r := y; \text{if}(!r) \{ x := 1 \}; x := 1; \text{if}(r) \{ x := 1 \}; z := r \]

\[ \| \text{if}(x) \{ x := 0; \text{if}(x) \{ y := 1 \} \} \]

We show the rules for write and read. The rule for fences requires similar treatment.

---

Footnote: The Coq development uses \(\vDash\) rather than \(\models\) in \(w3\) and \(R3\). Given the quantification over \(\phi\), these are equivalent.

The definition allows multiple events to represent a single action, with disjoint preconditions. The predicate transformers are derived from those defined for the conditional. \(w6\) and \(r6\) require that the predicates do not mention registers in \(S_E\).

This modification validates Lemma 3.6e, f, and d as equations.

We show how to combine address calculation and if-introduction in §8.5.

### 8.4 Address Calculation

Inevitably, address calculation complicates the definitions of \text{WRITE} and \text{READ}. In this section, we develop a flat memory model, which does not deal with provenance [Lee et al. 2018].

**Definition 8.4.** Within a pomset \(P\), let \(\kappa(x) = \bigvee \{\kappa(e) \mid e \in E \land \lambda(e) = \mathsf{W}x\}\).

If \(P \in \text{WRITE}(L, M, \mu)\) then \((\exists \ell \in \mathcal{V})\ (\exists v \in \mathcal{V})\)

(w1) \(|E| \leq 1\),

(w2) \(\lambda(e) = \mathsf{W}^\mu[\ell]v\),

(w3) \(\kappa(e) \equiv L{=}\ell \land M{=}v\),

(w4) \(\tau^D(\psi) \equiv (L{=}\ell) \Rightarrow \psi[M/[\ell]][\kappa(E)/Q_{[\ell]}]\),

(w5) \(\checkmark \equiv \kappa(E)\).

If \(P \in \text{READ}(r, L, \mu)\) then \((\exists \ell \in \mathcal{V})\ (\exists v \in \mathcal{V})\)

(r1) \(|E| \leq 1\),

(r2) \(\lambda(e) = \mathsf{R}^\mu[\ell]v\),

(r3) \(\kappa(e) \equiv L{=}\ell \land Q_{[\ell]}\),

(r4a) if \(e \in E \cap D\) then \(\tau^D(\psi) \equiv (\kappa(e) \Rightarrow v{=}s_e) \Rightarrow \psi[s_e/r]\),

(r4b) if \(e \in E \setminus D\) then \(\tau^D(\psi) \equiv (\kappa(e) \Rightarrow (v{=}s_e \lor [\ell]{=}s_e)) \Rightarrow \psi[s_e/r]\),

(r4c) if \(E = \emptyset\) then \(\tau^D(\psi) \equiv (\forall s)\, \psi[s/r]\),

(r5a) if \(\mu \sqsubseteq \text{rlx}\) then \(\checkmark \equiv \text{tt}\).

The combination of read-read independency (§3.7) and address calculation is somewhat delicate. Consider the following program, from Jagadeesan et al. [2020, §5], where initially \(x = 0\), \(y = 0\), \([0] = 0\), \([1] = 2\), and \([2] = 1\). It should only be possible to read 0, disallowing the attempted execution below:

(ADD1)

This execution would become possible, however, if we were to remove \((L{=}\ell)\) from r4, where it is included in \(\kappa\). In this case, \((\text{R}y2)\) would not necessarily be dependency ordered before \((\text{W}x1)\).

### 8.5 Combining Address Calculation and If-Introduction

Def. 8.4 is naive with respect to merging events. Consider the following example:

\[
[r] := 0;\ [0] := !r
\]

Merging, we have:

\[
\begin{align*}
\text{if}(M)\{[r] & := 0; [0] := !r\} \text{ else } \{[r] := 0; [0] := !r\} \\
\text{c} & \text{ when } r = 1 \quad \mathsf{W}[1]0  \\
\text{d} & \text{ when } r = 0 \lor r = 1 \quad \mathsf{W}[0]0  \\
\text{e} & \text{ when } r = 0 \quad \mathsf{W}[0]1 \\
\end{align*}
\]

The precondition of W[0]0 is a tautology; however, this is not possible for ([r] := 0; [0] := !r) alone. Def. 8.5 enables this execution using if-introduction. Under this semantics, we have:

\[
\begin{align*}
[r] & := 0  \\
[0] & := !r  \\
\text{c} & \text{ when } r = 1 \quad \mathsf{W}[1]0  \\
\text{d} & \text{ when } r = 0 \lor r = 1 \quad \mathsf{W}[0]0  \\
\text{e} & \text{ when } r = 0 \quad \mathsf{W}[0]1 \\
\end{align*}
\]

Sequencing and merging:

\[
\begin{align*}
[r] & := 0; [0] := !r  \\
\text{c} & \text{ when } r = 1 \quad \mathsf{W}[1]0  \\
\text{d} & \text{ when } r = 0 \lor r = 1 \quad \mathsf{W}[0]0  \\
\text{e} & \text{ when } r = 0 \quad \mathsf{W}[0]1 \\
\end{align*}
\]

The precondition of (W[0]0) is a tautology, as required.

Def. 8.5 is a mash-up of Def. 8.3 and Def. 8.4.

**Definition 8.5.** If \(P \in \text{WRITE}(L, M, \mu)\) then \((\exists \ell : E \to \mathcal{V})\ (\exists v : E \to \mathcal{V})\ (\exists \phi : E \to \Phi)\)

(w1) if \(\kappa(d) \land \kappa(e)\) is satisfiable then \(d = e\),

(w2) \(\lambda(e) = \mathsf{W}^\mu[\ell_e]v_e\),

(w3) \(\kappa(e) \equiv \phi_e \land L{=}\ell_e \land M{=}v_e\),

(w4) \(\tau^D(\psi) \equiv \bigwedge_{k \in \mathcal{V}} (L{=}k) \Rightarrow \psi[M/[k]][\kappa([k])/Q_{[k]}]\),

(w5) \(\checkmark \equiv \kappa(E)\),

(w6) \(\phi_e[N/s] = \phi_e\).

If \(P \in \text{READ}(r, L, \mu)\) then \((\exists \ell : E \to \mathcal{V})\ (\exists v : E \to \mathcal{V})\ (\exists \phi : E \to \Phi)\)

(r1) if \(\kappa(d) \land \kappa(e)\) is satisfiable then \(d = e\),

(r2) \(\lambda(e) = \mathsf{R}^\mu[\ell_e]v_e\),

(r3) \(\kappa(e) \equiv \phi_e \land L{=}\ell_e \land Q_{[\ell_e]}\),

(r4) \(\tau^D(\psi) \equiv \bigwedge_{e \in E \cap D} \phi_e \Rightarrow (\kappa(e) \Rightarrow v_e{=}s_e) \Rightarrow \psi[s_e/r]\)

\(\quad \land \bigwedge_{e \in E \setminus D} \phi_e \Rightarrow (\kappa(e) \Rightarrow (v_e{=}s_e \lor [\ell_e]{=}s_e)) \Rightarrow \psi[s_e/r]\)

\(\quad \land (\bigwedge_{e \in E} \neg \phi_e) \Rightarrow (\forall s)\, \psi[s/r]\),

(r5a) if \(\mu \sqsubseteq \text{rlx}\) then \(\checkmark \equiv \text{tt}\),

(r5b) if \(\mu \sqsupseteq \text{acq}\) then \(\checkmark \equiv \kappa(E)\),

(r6) \(\phi_e[N/s] = \phi_e\).

9 RELATED WORK

Marino et al. [2015] argue that the “silently shifting semicolon” is sufficiently problematic for programmers that concurrent languages should guarantee sequential abstraction, despite the performance penalties (see also Liu et al. [2021]). In this paper, we take the opposite approach. We have attempted to find the most intellectually tractable model that encompasses all of the messiness of relaxed memory.

There are two prior studies of relaxed memory that include precise calculation of semantic dependencies; neither gives the semantics of sequential composition in direct style. First, Paviotti et al. [2020] defined mRD, which calculates dependencies using event structures rather than logic. This strategy is more brittle than ours, leading to false positives (§3.8). Second, Jagadeesan et al. [2020] defined PwP, using logical entailment to define dependency. Although PwT is based on PwP, there are many differences. Some of these are motivated by requirements unique to PwT (see §3.9). Other differences are stylistic: for example, we use termination conditions rather than termination actions, and our formulation fixes an error in Jagadeesan et al.’s definition of parallel composition. We also fix an error in their treatment of redundant read elimination (§8.1).

Kavanagh and Brookes [2018] define a semantics using pomsets without preconditions. Instead, their model uses syntactic dependencies, thus invalidating many compiler optimizations. They also require a fence after every relaxed read on Arm8. Pichon-Pharabod and Sewell [2016] use event structures to calculate dependencies, combined with an operational semantics that incorporates program transformations. This approach seems to require whole-program analysis.

Other studies of relaxed memory can be categorized by their approach to dependency calculation. Hardware models use syntactic dependencies [Alglave et al. 2014]. Many software models do not bother with dependencies at all [Batty et al. 2011; Cox 2016; Watt et al. 2020, 2019]. Others have strong dependencies that disallow compiler optimizations and efficient implementation, typically requiring fences for every relaxed read on Arm8 [Boehm and Demsky 2014; Dolan et al. 2018; Jeffrey and Riely 2016; Lahav et al. 2017; Lamport 1979]. Many of the most prominent models are operational models based on speculative execution [Chakraborty and Vafeiadis 2019; Cho et al. 2021; Jagadeesan et al. 2010; Kang et al. 2017; Lee et al. 2020; Manson et al. 2005].

Morally, PwT fits between the strong models and the speculative ones. Looking at the details, however, PwT-MCA is incomparable to both RC11 [Lahav et al. 2017] and the promising semantics [Kang et al. 2017], to take two examples. RC11 allows non-MCA behaviors that PwT-MCA disallows. PwT-MCA has a weaker notion of coherence than the promising semantics.

Jagadeesan et al. [2020] argue that the speculative models allow too many executions, resulting in a failure of temporal reasoning and potentially jeopardizing type safety and other security properties. In a similar vein, Cho et al. [2021] argue that local DRF guarantees are violated when read-introduction is followed by if-introduction, branching on the read just introduced. These optimizations are validated by the speculative models; Cho et al. avoid the problem by adopting a sub-optimal lowering for RMWs. PwT does not suffer from this problem, since PwT does not validate read-introduction. There appears to be a genuine tension between temporal reasoning, as supported by PwT, and read-introduction, as supported by the speculative models.

Other work in relaxed memory has shown that tooling is especially useful to researchers, architects, and language specifiers, enabling them to build intuitions experimentally [Alglave et al. 2014; Batty et al. 2011; Cooksey et al. 2019; Paviotti et al. 2020]. Unfortunately, it is not obvious that tools can be built for all thin-air-free models: the calculation of Pichon-Pharabod and Sewell [2016] does not have a termination proof for an arbitrary input; the enormous state space for the operational models of Kang et al. [2017] and Chakraborty and Vafeiadis [2019] is daunting for a tool builder—and as yet no tool exists for automatically evaluating these models. We described a tool for automatically evaluating PwT in §7.

10 LIMITATIONS AND FUTURE WORK

This paper is the first to present a direct denotational semantics for sequential composition that can be efficiently compiled to modern CPUs. We defined two models: PwT-C11 solves the out-of-thin-air problem for C11, and PwT-MCA solves it for safe languages such as Java and JavaScript.

Our work has several limitations, providing opportunities for future work:

PwT-C11 can be lowered efficiently to any architecture supported by C11, but inherits the top-level axioms of RC11, compromising compositionality. PwT-MCA is as compositional as a model of concurrent imperative programming can be, but is limited to MCA architectures for optimal lowering. It would be interesting to explore the middle ground to find a fully compositional model that supports optimal lowering to all modern architectures.

As mentioned in §9, some safety guarantees may be violated when read-introduction is followed by if-introduction, branching on the read just introduced. Nonetheless, read-introduction is ubiquitous in some compilers [Lee et al. 2017]. It would be interesting to know the cost of restricting this optimization. In a similar vein, PwT-MCA1 is a simpler model than PwT-MCA2, but requires fences on acquiring reads for Arm8. It would be illuminating to find out what the performance penalty is for these fences.

We have defined the soundness of compiler optimizations in the model, rather than contextually: \( S' \) is a sound refinement of \( S \) if \( \llbracket S' \rrbracket \subseteq \llbracket S \rrbracket \). This approach has several advantages; for example, it is immediate that a sound optimization is sound in any context. It also has a disadvantage: some optimizations complicate the semantics. For example, PwT-MCA does not validate access elimination, such as store-forwarding and dead-write-removal: complete executions of \( \llbracket x := 1; r := x \rrbracket \) must include a read action, and complete executions of \( \llbracket x := 1; x := 2 \rrbracket \) must include two write actions. As another example, PwT-MCA does not validate the reverse inclusions for Lemma 3.6g: \( \mathsf{if}(r)\{x := 1\}\ \mathsf{else}\ \{x := 2\} \) has an augmented (Lemma 3.7) execution with \( (r = 0 \mid \mathsf{W}x2) \rightarrow (r \neq 0 \mid \mathsf{W}x1) \), whereas \( \mathsf{if}(r)\{x := 1\};\ \mathsf{if}(!r)\{x := 2\} \) has no such execution. We expect that these optimizations can be validated, at the cost of complicating the semantics. For access elimination, it is likely sufficient to allow events with different actions to merge. For Lemma 3.6g, it is likely sufficient to encode \textit{delay} in the logic; the problem in the execution above is that \textit{delay} introduces order even when the preconditions are disjoint.
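
The refinement check itself is just set inclusion, which a toy sketch can make concrete. The encoding below is an assumption made for illustration and is far simpler than the pomset semantics of the paper: a denotation is a set of complete executions, each execution is a frozenset of action labels, and `seq` is a hypothetical stand-in for sequential composition that ignores order and preconditions.

```python
def refines(s_prime, s):
    """S' is a sound refinement of S when [[S']] is a subset of [[S]]."""
    return s_prime <= s


def seq(p, q):
    """Toy sequential composition: combine each execution of p with each of q."""
    return {a | b for a in p for b in q}


# Access elimination: every complete execution of `x := 1; r := x` contains
# a read action, so an execution without the read is not in the denotation.
with_read = {frozenset({"W x 1", "R x 1"})}
without_read = {frozenset({"W x 1"})}
print(refines(without_read, with_read))  # False

# Refinement by inclusion is preserved by any monotone context.
# `choice` is a hypothetical program with two possible executions.
choice = {frozenset({"W x 1"}), frozenset({"W x 2"})}
narrowed = {frozenset({"W x 1"})}
context = {frozenset({"W y 1"})}
print(refines(narrowed, choice))                              # True
print(refines(seq(context, narrowed), seq(context, choice)))  # True
```

Because `seq` is monotone with respect to inclusion, a refinement established in isolation carries over to the composed programs, which mirrors the remark that a sound optimization is sound in any context.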

We have not treated loops, although we expect that the usual approach of showing continuity for all the semantic operations with respect to set inclusion would go through. Paviotti et al. [2020] use step-indexing to account for loops; perhaps this approach could be adapted.

While we have mechanized some proofs, most of our proofs are informal. In particular, we have only a pen-and-paper proof showing that PwT-MCA supports optimal lowering to Arm8. The same is true for local data-race-freedom (LDRF-SC); additionally, our proof sketch for LDRF-SC elides RMWs, which have caused complications in other models [Cho et al. 2021].

Supplementary material for this paper is available at https://weakmemory.github.io/pwt.

Acknowledgements

This paper has been greatly improved by the comments of the anonymous reviewers. James Riely was supported by the National Science Foundation under grant No. CCR-1617175. Mark Batty and Simon Cooksey were supported by the EPSRC under grant Nos. EP/V000470/1 and EP/R032971/1, and by VeTSS. Anton Podkopaev was supported by JetBrains Research.

REFERENCES


