Description
A1: int* bug(int N, atomic<int>& a) {
A2: int* p = (int*)malloc(sizeof(int));
A3:
A4: for (int i = 1; i <= N; i++) {
A5: *p = i;
A6: foo(i, a);
A7: }
A8:
A9: return p;
A10: }
The LICM pass of Clang++ (13.0.0+) optimizes the above function into the following (C++ syntax is used for brevity; see godbolt for the IR code):
B1: int* bug(int N, atomic<int>& a) {
B2: int* p = (int*)malloc(sizeof(int));
B3:
B4: for (int i = 1; i <= N; i++) {
B5: foo(i, a);
B6: }
B7:
B8: *p = N;
B9:
B10: return p;
B11: }
This is a miscompilation bug. By compiling and running a concurrent program that involves the bug function on the Armv8 processors Apple M1 and Cavium ThunderX2, we actually observed program behaviors forbidden by the original program before the optimization. The actual program we tested can be found here. (The forbidden behavior can be observed a couple of times across millions to billions of program runs. The test code includes some intricate tricks, e.g., logic that helps repeat the test many times.)
In the following, we explain a simplified version of the concurrent program we used and show why the above LICM is a miscompilation bug.
To see a buggy execution, let's consider two threads where the first thread calls the bug function.
Thread 1:
T1.1: void thread1(int N, atomic<int>& a, atomic<int*>& X) {
T1.2: int* p = bug(N, a);
T1.3: X.store(p, memory_order_relaxed);
T1.4: }
Thread 2:
T2.1: void thread2(int N, atomic<int>& a, atomic<int*>& X) {
T2.2: while (a.load(memory_order_acquire) < N);
T2.3: int *p = X.load(memory_order_relaxed);
T2.4: if (p != NULL) {
T2.5: if (*p != N) {
T2.6: cout << "bug!!!" << endl;
T2.7: }
T2.8: }
T2.9: }
Also suppose that the foo function is given as follows:
F1: void foo(int i, atomic<int>& a) {
F2: a.store(i, memory_order_release);
F3: }
Now, if we create two threads as follows, the program should never print bug!!! before the problematic LICM is applied to the bug function.
atomic<int> a;
atomic<int*> X;
int N = 1;
a.store(0, memory_order_relaxed);
X.store(NULL, memory_order_relaxed);
thread* th1 = new thread(thread1, N, ref(a), ref(X));
thread* th2 = new thread(thread2, N, ref(a), ref(X));
th1->join();
th2->join();
Indeed, once thread2 reads N from a with an acquire access mode (line T2.2), the last call to the foo function (line A6, writing N to a with a release mode) synchronizes with the acquire read. Therefore, the last write to *p (line A5, *p = N) by thread1 happens-before the read from *p (line T2.5, *p != N) by thread2, and thread2 can only read N from *p. Note that there is no data race in this program (before the LICM).
However, after the above LICM, the program becomes racy on *p, and thread2 can read a value other than N from *p. We actually observed this program printing "bug!!!" by compiling and running it on the Armv8 processors Apple M1 and ThunderX2!
Roughly, the above LICM is wrong because it moves a non-atomic write *p = i (line A5) after a release write (line A6). This is wrong even though p is never leaked before it is returned. Specifically, LLVM applies the above LICM and inlines the bug function in thread1, resulting in the following code:
...
O1: for (int i = 1; i <= N; i++) {
O2: foo(i, a);
O3: }
O4:
O5: *p = N;
O6: X.store(p, memory_order_relaxed);
...
Now, the write to *p (line O5) is no longer properly synchronized, and thread2 can read an uninitialized value from *p (line T2.5). In particular, such behavior is allowed by the Arm architecture because it can reorder the two independent stores O5 and O6. Therefore, this program can print "bug!!!", which the original program forbids.
We conclude that the pointer p in the function bug should not be treated as completely local even before it is leaked (line A9).