
BPF: Ensure __sync_fetch_and_add() always generate atomic_fetch_add insn #101428


Merged
1 commit merged into llvm:main from the fix-xadd branch on Aug 5, 2024

Conversation

yonghong-song
Contributor

Peilen Ye reported an issue ([1]) where, for __sync_fetch_and_add(...) without a return value, like
__sync_fetch_and_add(&foo, 1);
the llvm BPF backend generates a locked insn, e.g.
lock *(u32 *)(r1 + 0) += r2

If __sync_fetch_and_add(...) returns a value, like
res = __sync_fetch_and_add(&foo, 1);
the llvm BPF backend generates something like
r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)

The above generation of 'lock *(u32 *)(r1 + 0) += r2' causes a problem in the jit, since a proper barrier is not inserted.

The above discrepancy is due to commit [2], which tries to maintain backward compatibility: before commit [2], __sync_fetch_and_add(...) generated a lock insn in the BPF backend.

Based on the discussion in [1], it is now time to fix the above discrepancy so we can have proper barrier support in the jit. This patch makes sure that __sync_fetch_and_add(...) always generates atomic_fetch_add(...) insns. Now 'lock *(u32 *)(r1 + 0) += r2' can only be generated by inline asm. I also removed the whole BPFMIChecking.cpp file, whose original purpose was to detect and issue errors if XADD{W,D,W32} may return a value that is used subsequently. Since the XADD{W,D,W32} insns are now inline-asm only, such error detection is no longer needed.

[1] https://lore.kernel.org/bpf/[email protected]/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
[2] 286daaf
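
For illustration, a minimal C sketch (not taken from the patch) of the two patterns discussed above; the function and variable names are made up, and the registers in the comments are illustrative since the backend may pick different ones.

static unsigned int foo;

void add_no_ret(void)
{
  /* return value unused: before this patch the BPF backend emitted
   *   lock *(u32 *)(r1 + 0) += r2
   * after this patch it emits
   *   r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)
   */
  __sync_fetch_and_add(&foo, 1);
}

unsigned int add_with_ret(void)
{
  /* return value used: atomic_fetch_add is emitted both before and
   * after this patch */
  return __sync_fetch_and_add(&foo, 1);
}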

@yonghong-song requested review from 4ast and eddyz87 on July 31, 2024
@peilin-ye
Contributor

Thanks for the quick fix! The reporter's first name in the commit message should be "Peilin".

kernel-patches-daemon-bpf bot pushed a commit to kernel-patches/bpf that referenced this pull request Aug 1, 2024
Peilen Ye reported an issue ([1]) where, for __sync_fetch_and_add(...) without
a return value, like
  __sync_fetch_and_add(&foo, 1);
the llvm BPF backend generates a locked insn, e.g.
  lock *(u32 *)(r1 + 0) += r2

If __sync_fetch_and_add(...) returns a value, like
  res = __sync_fetch_and_add(&foo, 1);
the llvm BPF backend generates something like
  r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)

But 'lock *(u32 *)(r1 + 0) += r2' causes a problem in the jit,
since a proper barrier is not inserted based on __sync_fetch_and_add() semantics.

The above discrepancy is due to commit [2], which tries to maintain backward
compatibility: before commit [2], __sync_fetch_and_add(...) generated a lock
insn in the BPF backend.

Based on the discussion in [1], it is now time to fix the above discrepancy so we
can have proper barrier support in the jit. llvm patch [3] makes sure that
__sync_fetch_and_add(...) always generates atomic_fetch_add(...) insns. Now
'lock *(u32 *)(r1 + 0) += r2' can only be generated by inline asm. The same holds
for __sync_fetch_and_and(), __sync_fetch_and_or() and __sync_fetch_and_xor().

But the change in [3] caused an arena_atomics selftest failure.

  test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
  libbpf: prog 'and': BPF program load failed: Permission denied
  libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
  arg#0 reference type('UNKNOWN ') size cannot be determined: -22
  0: R1=ctx() R10=fp0
  ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
  0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
  2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
  4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
  ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
  6: (18) r1 = 0x100000000060           ; R1_w=scalar()
  8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
  9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
  11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
  BPF_ATOMIC stores into R1 arena is not allowed
  processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'and': failed to load: -13
  libbpf: failed to load object 'arena_atomics'
  libbpf: failed to load BPF skeleton 'arena_atomics': -13
  test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
  #3       arena_atomics:FAIL

The failure is due to [4], where atomic{64,}_fetch_{and,or,xor}() are not
allowed for arena addresses. Without llvm patch [3], the compiler generates a
'lock ...' insn and everything works fine.

This patch fixes the problem by using inline asm. Instead of the
__sync_fetch_and_{and,or,xor}() functions, inline asm with a 'lock' insn is
used, and it works with or without [3].

  [1] https://lore.kernel.org/bpf/[email protected]/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
  [2] llvm/llvm-project@286daaf
  [3] llvm/llvm-project#101428
  [4] d503a04 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")

Signed-off-by: Yonghong Song <[email protected]>
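
As a rough sketch of the inline-asm replacement described above (not the exact selftest source; the operand values come from the log above, and this assumes the BPF assembler accepts the 'lock ... &=' form analogously to the 'lock ... +=' form shown later in this thread):

/* Sketch: instead of
 *   __sync_fetch_and_and(&and64_value, 0x011ull << 32);
 * emit the non-fetching 'lock' insn directly via inline asm, so the
 * generated code does not depend on [3]. */
void arena_and_sketch(unsigned long long *and64_value)
{
  asm volatile("lock *(u64 *)(%[val] + 0) &= %[mask]"
               :
               : [val]"r"(and64_value), [mask]"r"(0x011ull << 32)
               : "memory");
}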
@eddyz87 left a comment
Contributor

The change itself looks fine to me; I have a nitpick regarding disassembler behaviour for cpu v3.

Contributor

Nitpick, the CHECKs could be simplified a bit, e.g.:

; RUN: llc < %s -march=bpfel -verify-machineinstrs -show-mc-encoding | FileCheck %s
; RUN: llc < %s -march=bpfel -verify-machineinstrs -show-mc-encoding -mcpu=v3 | FileCheck --check-prefix=CHECK-V3 %s

; CHECK-LABEL: test_load_add_32
; CHECK: r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)
; CHECK-V3: w2 = atomic_fetch_add((u32 *)(r1 + 0), w2)
; CHECK: encoding: [0xc3,0x21,0x00,0x00,0x01,0x00,0x00,0x00]
define void @test_load_add_32(ptr %p, i32 zeroext %v) {
entry:
  atomicrmw add ptr %p, i32 %v seq_cst
  ret void
}

; CHECK-LABEL: test_load_add_64
; CHECK: r2 = atomic_fetch_add((u64 *)(r1 + 0), r2)
; CHECK: encoding: [0xdb,0x21,0x00,0x00,0x01,0x00,0x00,0x00]
define void @test_load_add_64(ptr %p, i64 zeroext %v) {
entry:
  atomicrmw add ptr %p, i64 %v seq_cst
  ret void
}

On the other hand, it looks like we should change the disassembler to avoid the discrepancy:

; CHECK: r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)
; CHECK-V3: w2 = atomic_fetch_add((u32 *)(r1 + 0), w2)

The instruction encoding is the same in both cases.

Contributor Author

Thanks. Indeed, I simplified the CHECKs a little bit.
As for the difference between

; CHECK: r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)
; CHECK-V3: w2 = atomic_fetch_add((u32 *)(r1 + 0), w2)

I think it is okay, since this is the asm output from llc.

Contributor

Do you mind if I try to change this behaviour for both llc and disasm? (in a separate pr)

Contributor Author

Do you mind if I try to change this behaviour for both llc and disasm? (in a separate pr)

I don't know what code change you intend to make, but go ahead and send a patch so we can discuss it.

kernel-patches-daemon-bpf-rc bot pushed a commit to kernel-patches/bpf-rc that referenced this pull request Aug 3, 2024
Peilen Ye reported an issue ([1]) where, for __sync_fetch_and_add(...) without
a return value, like
  __sync_fetch_and_add(&foo, 1);
the llvm BPF backend generates a locked insn, e.g.
  lock *(u32 *)(r1 + 0) += r2

If __sync_fetch_and_add(...) returns a value, like
  res = __sync_fetch_and_add(&foo, 1);
the llvm BPF backend generates something like
  r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)

But 'lock *(u32 *)(r1 + 0) += r2' causes a problem in the jit,
since a proper barrier is not inserted based on __sync_fetch_and_add() semantics.

The above discrepancy is due to commit [2], which tries to maintain backward
compatibility: before commit [2], __sync_fetch_and_add(...) generated a lock
insn in the BPF backend.

Based on the discussion in [1], it is now time to fix the above discrepancy so we
can have proper barrier support in the jit. llvm patch [3] makes sure that
__sync_fetch_and_add(...) always generates atomic_fetch_add(...) insns. Now
'lock *(u32 *)(r1 + 0) += r2' can only be generated by inline asm. The same holds
for __sync_fetch_and_and(), __sync_fetch_and_or() and __sync_fetch_and_xor().

But the change in [3] caused an arena_atomics selftest failure.

  test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
  libbpf: prog 'and': BPF program load failed: Permission denied
  libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
  arg#0 reference type('UNKNOWN ') size cannot be determined: -22
  0: R1=ctx() R10=fp0
  ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
  0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
  2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
  4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
  ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
  6: (18) r1 = 0x100000000060           ; R1_w=scalar()
  8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
  9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
  11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
  BPF_ATOMIC stores into R1 arena is not allowed
  processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'and': failed to load: -13
  libbpf: failed to load object 'arena_atomics'
  libbpf: failed to load BPF skeleton 'arena_atomics': -13
  test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
  #3       arena_atomics:FAIL

The failure is due to [4], where atomic{64,}_fetch_{and,or,xor}() are not
allowed for arena addresses. Without llvm patch [3], the compiler generates a
'lock ...' insn and everything works fine.

This patch fixes the problem by using inline asm. Instead of the
__sync_fetch_and_{and,or,xor}() functions, inline asm with a 'lock' insn is
used, and it works with or without [3].
Note that the three bpf programs ('and', 'or' and 'xor') are also guarded with
__BPF_FEATURE_ADDR_SPACE_CAST to ensure compilation failure for llvm <= 18
versions. Note that for llvm <= 18, where addr_space_cast is not supported, all
arena_atomics subtests are skipped with the message below:
  test_arena_atomics:SKIP:no ENABLE_ATOMICS_TESTS or no addr_space_cast support in clang
  #3 arena_atomics:SKIP

  [1] https://lore.kernel.org/bpf/[email protected]/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
  [2] llvm/llvm-project@286daaf
  [3] llvm/llvm-project#101428
  [4] d503a04 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")

Signed-off-by: Yonghong Song <[email protected]>
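
A minimal sketch of the guard just described (program bodies omitted; only the two macro names come from the commit message, the comment is mine):

#if defined(ENABLE_ATOMICS_TESTS) && defined(__BPF_FEATURE_ADDR_SPACE_CAST)
/* the 'and', 'or' and 'xor' arena programs, using inline-asm 'lock' insns,
 * are only built when both macros are defined */
#endif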
@yonghong-song merged commit c566769 into llvm:main on Aug 5, 2024
7 checks passed
@brycekahle

brycekahle commented Aug 23, 2024

@yonghong-song this seems to be a breaking change for folks still targeting v1 or v2 instruction sets. LLVM generates an instruction that isn't available.

@yonghong-song
Contributor Author

@yonghong-song this seems to be a breaking change for folks still targeting v1 or v2 instruction sets. LLVM generates an instruction that isn't available.

@brycekahle Yes. This is a deliberate choice. See upstream discussion in
https://lore.kernel.org/bpf/[email protected]/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f

The following is a quote from Alexei in that discussion:

Right. We did it for backward compat. Older llvm was
completely wrong to generate xadd for __sync_fetch_and_add.
That was my hack from 10 years ago when xadd was all we had.
So we fixed that old llvm bug, but introduced another with all
good intentions.
Since proper atomic insns were introduced 3 years ago we should
remove this backward compat feature/bug from llvm.
The only breakage is for kernels older than 5.12.
I think that's an acceptable risk.

@brycekahle

@yonghong-song what is the best place for me to voice my concerns? We ship object files to our Datadog customers that target a wide range of kernels, going much further back than 5.12. The total customer host install count is 400,000+. We do not have the luxury of only deploying eBPF on internal infrastructure where we have total control of the kernel version used.

We also intentionally limit ourselves to v1 of the instruction set. Having the compiler generate v3 instructions, even when specifying v1, seems like a bug and a breaking change.

At the very least, there should be a non-inline assembly way for us to explicitly use the XADD instruction. Inline BPF assembly has its own challenges if you cannot use clang -target bpf, but must use clang + llc -march=bpf.

@yonghong-song
Contributor Author

@brycekahle
For the following,

We also intentionally limit ourselves to v1 of the instruction set. Having the compiler generate v3 instructions, even when specifying v1, seems like a bug and a breaking change.

Yes, the new change will generate the atomic_fetch_add insn with cpu=v1 instead of 'lock xadd'. We could move generating atomic_fetch_add to -mcpu=v3, but then a lot of bpf programs might fail. Note that atomic_fetch_add in v1 will be okay with kernels >= 5.12. As mentioned in my previous comments, this is a deliberate choice.

You can use inline asm as in
https://lore.kernel.org/bpf/[email protected]/
'clang -target bpf ...' should work; you do not need to use 'clang + llc'. Let me know if not, and I am happy to help.

cc @4ast for more comments.

@brycekahle

@yonghong-song we cannot use clang -target bpf because we have non-CO-RE kernel tracing. From https://www.kernel.org/doc/html/v5.17/bpf/bpf_devel_QA.html#q-clang-flag-for-target-bpf

Q: In some cases clang flag -target bpf is used but in other cases the default clang target, which matches the underlying architecture, is used. What is the difference and when I should use which?

A: Although LLVM IR generation and optimization try to stay architecture independent, -target still has some impact on generated code:

BPF program may recursively include header file(s) with file scope inline assembly codes. The default target can handle this well, while bpf target may fail if bpf backend assembler does not understand these assembly codes, which is true in most cases.

@eddyz87
Contributor

eddyz87 commented Aug 26, 2024

Hi @brycekahle,

Could you please elaborate a bit on the issues with inline assembly and clang + llc?
A simple test appears to work:

$ cat test3.c
void foo(unsigned long *p, unsigned long v)
{
  asm volatile("lock *(u64 *)(%[p] + 0) += %[v];" :: [p]"r"(p), [v]"r"(v));
}

$ clang -mcpu=v1 -O2 --target=bpf -S -emit-llvm -o - test3.c | llc -march=bpf -mcpu=v1 -o - -filetype=asm
	.text
	.file	"test3.c"
	.globl	foo                             # -- Begin function foo
	.p2align	3
	.type	foo,@function
foo:                                    # @foo
.Lfoo$local:
	.type	.Lfoo$local,@function
# %bb.0:
	#APP
	lock *(u64 *)(r1 + 0) += r2

	#NO_APP
	exit
.Lfunc_end0:
	.size	foo, .Lfunc_end0-foo
	.size	.Lfoo$local, .Lfunc_end0-foo
                                        # -- End function

@brycekahle

@eddyz87 You are still using clang --target=bpf, you have to use the default target.

@eddyz87
Contributor

eddyz87 commented Aug 26, 2024

@brycekahle,

@eddyz87 You are still using clang --target=bpf, you have to use the default target.

Well, that is a very strange arrangement; I'm not sure there is documentation stating that this is a supported use case. I remember several architecture-specific checks in the clang front-end. Anyway, a simple example still works:

$ clang -O2 -S -emit-llvm -o - test3.c | llc -march=bpf -mcpu=v1 -o - -filetype=asm
	.text
	.file	"test3.c"
	.globl	foo                             # -- Begin function foo
	.p2align	3
	.type	foo,@function
foo:                                    # @foo
.Lfoo$local:
	.type	.Lfoo$local,@function
	.cfi_startproc
# %bb.0:
	#APP
	lock *(u64 *)(r1 + 0) += r2

	#NO_APP
	exit
.Lfunc_end0:
	.size	foo, .Lfunc_end0-foo
	.size	.Lfoo$local, .Lfunc_end0-foo
	.cfi_endproc
                                        # -- End function

@brycekahle

It is explicitly called out in the kernel documentation I linked and quoted.

@eddyz87
Contributor

eddyz87 commented Aug 26, 2024

@brycekahle

It is explicitly called out in the kernel documentation I linked and quoted.

Oh, sorry, my bad.

But you stated that there is some issue with using inline assembly with clang + llc. What is the issue you observe?

@brycekahle

It turns out what breaks inline assembly with the default target is trying to use the bpf registers by name. Since the example you gave does not do that, it does actually compile.

I still think generating v3 instructions when v1 is explicitly asked for is a giant footgun that is going to trip up a lot of folks. Could you make this new behavior apply only when v3 (or later) is used?

@eddyz87
Contributor

eddyz87 commented Aug 27, 2024

@brycekahle,

It turns out what breaks inline assembly with the default target is trying to use the bpf registers by name. Since the example you gave does not do that, it does actually compile.

For asm body or for asm parameters?

I was able to get an error from clang when specifying register names in parameters. Clang does some limited target-specific checks for inline assembly parameters, e.g. see virtual TargetInfo::isValidGCCRegisterName, which is used for semantic checks. This is why I said that this clang --target=native + llc -march=bpf arrangement is strange and is ultimately a hack.
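
For concreteness, a sketch of my own (not from this thread) of the kind of construct that trips those front-end checks: register names inside the asm template are just text at this stage, but names used for explicit register variables or constraints are validated against the current target, so the default x86_64 front end rejects 'r1'/'r2' below as unknown register names.

void xadd_pinned(unsigned long *p, unsigned long v)
{
  /* explicit register variables: checked via TargetInfo::isValidGCCRegisterName,
   * so this fails to compile with the default x86_64 target */
  register unsigned long *rp asm("r1") = p;
  register unsigned long rv asm("r2") = v;

  asm volatile("lock *(u64 *)(r1 + 0) += r2"
               :
               : "r"(rp), "r"(rv)
               : "memory");
}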

I still think generating v3 instructions when v1 is explicitly asked for is a giant footgun that is going to trip up a lot of folks. Could you make this new behavior apply only when v3 (or later) is used?

Agree that this is a footgun. I think we should report an error when __sync_fetch_and_add is used for cpu v1.
And probably provide an intrinsic to generate xadd. However, the following code works:

$ cat test3.c 
__attribute__((always_inline))
void xadd(unsigned long *p, unsigned long v)
{
  asm volatile("lock *(u64 *)(%[p] + 0) += %[v];" :: [p]"r"(p), [v]"r"(v): "memory");
}

void foo(unsigned long *p, unsigned long v)
{
  xadd(p, v);
}

$ clang -O2 -S -emit-llvm -o - test3.c | llc -march=bpf -mcpu=v1 -o - -filetype=obj | llvm-objdump --no-show-raw-insn -d -

<stdin>:	file format elf64-bpf

Disassembly of section .text:

0000000000000000 <xadd>:
       0:	lock *(u64 *)(r1 + 0x0) += r2
       1:	exit

0000000000000010 <foo>:
       2:	lock *(u64 *)(r1 + 0x0) += r2
       3:	exit

So an intrinsic is optional.

@yonghong-song
Contributor Author

clang with the default architecture plus llc with the bpf target works in some cases but not always; e.g., the x86_64 target may generate constructs (e.g. jump tables) which the bpf backend will not support. In the early days, when CO-RE was not available, we did use clang with the default architecture and llc with the bpf target, but that is complicated and needs access to kernel internal headers. Now that CO-RE is available, the recommendation is to use 'clang --target=bpf ...' directly, which is also what we do in the bpf selftests.

@brycekahle

the x86_64 target may generate constructs (e.g. jump tables) which the bpf backend will not support

We avoid this by passing -fno-jump-tables (along with some other flags).

when CO-RE was not available

This is the use case I'm talking about.

@4ast
Member

4ast commented Aug 27, 2024

I think we should report an error when __sync_fetch_and_add is used for cpu v1.

datadog is not the only user of (void)__sync_fetch_and_add(&counter, ...);
cilium has the same pattern and they also support old kernels.
If new llvm errors out with -mcpu=v1, the users won't upgrade llvm.
So I think for -mcpu=v1 and (void)__sync_fetch_and_add(), llvm should generate the old bpf xadd insn,
since that's the only insn it can generate.
For the pattern "var = __sync_fetch_and_add()" and -mcpu=v1, llvm should error out.
I believe it already started doing this some time back.

For (void)__sync_fetch_and_add() and -mcpu=v3 we should generate the fetch_and_add() insn.

@yonghong-song
Contributor Author

I think we should report an error when __sync_fetch_and_add is used for cpu v1.

datadog is not the only user of (void)__sync_fetch_and_add(&counter, ...); cilium has the same pattern and they also support old kernels. If new llvm errors out with -mcpu=v1, the users won't upgrade llvm. So I think for -mcpu=v1 and (void)__sync_fetch_and_add(), llvm should generate the old bpf xadd insn, since that's the only insn it can generate. For the pattern "var = __sync_fetch_and_add()" and -mcpu=v1, llvm should error out. I believe it already started doing this some time back.

For (void)__sync_fetch_and_add() and -mcpu=v3 we should generate the fetch_and_add() insn.

Sounds good. I will fix this in llvm20.

yonghong-song pushed a commit to yonghong-song/llvm-project that referenced this pull request Aug 29, 2024
…ch_add insn (llvm#101428)"

This reverts commit c566769.

See the discussion in [1]. Currently, with -mcpu=v1/v2, an atomic_fetch_add() insn is generated
for (void)__sync_fetch_and_add(...). This breaks backward compatibility since
there is code that runs on old systems (< 5.12) which do not support atomic_fetch_add().

Now let's revert the previous commit ([1]) and add additional logic in the next patch
to ensure that for (void)__sync_fetch_and_add(...):
  v1/v2:  generate locked add insn
  >= v3:  generate atomic_fetch_add insn

  [1] llvm#101428
yonghong-song added a commit that referenced this pull request Aug 30, 2024
…106494)

This patch contains two parts:
- first, revert patch #101428;
- second, remove the `atomic_fetch_and_*()` to `atomic_<op>()`
  conversion (when the return value is not used), but preserve the
  `__sync_fetch_and_add()` to locked insn conversion with cpu v1/v2.
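
To make the resulting behaviour concrete, a small sketch of my own based on the description above and Alexei's earlier comment (the insn forms in the comments are the ones quoted in this thread; exact registers may differ):

static unsigned long counter;

void bump(unsigned long delta)
{
  /* return value unused:
   *   -mcpu=v1/v2: lock *(u64 *)(r1 + 0) += r2   (the old xadd insn)
   *   -mcpu=v3/v4: r2 = atomic_fetch_add((u64 *)(r1 + 0), r2)
   */
  (void)__sync_fetch_and_add(&counter, delta);
}

unsigned long bump_and_read(unsigned long delta)
{
  /* return value used: atomic_fetch_add is required, so per the discussion
   * above llvm errors out for -mcpu=v1/v2 */
  return __sync_fetch_and_add(&counter, delta);
}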
yonghong-song added a commit that referenced this pull request Sep 3, 2024
Before llvm20, (void)__sync_fetch_and_add(...) always generated locked
xadd insns. In the linux kernel upstream discussion [1], it was found that
for the arm64 architecture, the original semantics of
(void)__sync_fetch_and_add(...), i.e. __atomic_fetch_add(...), are
preferred so that the jit can emit proper native barrier insns.

With llvm commits [2] and [3], (void)__sync_fetch_and_add(...) will
generate the following insns:
  - for cpu v1/v2: locked xadd insns, to keep backward compatibility
  - for cpu v3/v4: __atomic_fetch_add() insns

To ensure proper barrier semantics for (void)__sync_fetch_and_add(...),
cpu v3/v4 is recommended.

This patch makes cpu v3 the default cpu version. For users who want to
use cpu v1, -mcpu=v1 needs to be explicitly added to the clang/llc
command line.

  [1]
https://lore.kernel.org/bpf/[email protected]/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
  [2] #101428
  [3] #106494
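
For users who need to keep targeting older kernels, a minimal example of explicitly pinning cpu v1 after this default change (file names are placeholders):

$ clang --target=bpf -mcpu=v1 -O2 -c prog.bpf.c -o prog.bpf.o
$ # or, with the clang-native + llc flow discussed earlier in this thread:
$ clang -O2 -S -emit-llvm -o - prog.bpf.c | llc -march=bpf -mcpu=v1 -filetype=obj -o prog.bpf.o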
@yonghong-song deleted the fix-xadd branch on February 8, 2025