
BPF: Ensure __sync_fetch_and_add() always generate atomic_fetch_add insn #101428


Merged
1 commit merged into llvm:main from the fix-xadd branch on Aug 5, 2024

Conversation

yonghong-song
Contributor

Peilen Ye reported an issue ([1]) where, for __sync_fetch_and_add(...) without a return value, like
__sync_fetch_and_add(&foo, 1);
the llvm BPF backend generates a locked insn, e.g.
lock *(u32 *)(r1 + 0) += r2

If __sync_fetch_and_add(...) returns a value, like
res = __sync_fetch_and_add(&foo, 1);
the llvm BPF backend generates something like
r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)

The above generation of 'lock *(u32 *)(r1 + 0) += r2' causes a problem in the jit, since a proper barrier is not inserted.

The above discrepancy is due to commit [2], which tries to maintain backward compatibility: before commit [2], __sync_fetch_and_add(...) generated a lock insn in the BPF backend.

Based on the discussion in [1], it is now time to fix the above discrepancy so we can have proper barrier support in the jit. This patch makes sure that __sync_fetch_and_add(...) always generates atomic_fetch_add(...) insns. Now 'lock *(u32 *)(r1 + 0) += r2' can only be generated by inline asm. I also removed the whole BPFMIChecking.cpp file, whose original purpose was to detect and issue errors if XADD{W,D,W32} may return a value that is used subsequently. Since the XADD{W,D,W32} insns are now inline-asm only, such error detection is no longer needed.

[1] https://lore.kernel.org/bpf/[email protected]/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
[2] 286daaf
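
For illustration, a minimal C sketch (not taken from the patch) of the two patterns discussed above; the function and variable names are made up, and the registers in the comments are illustrative since the backend may pick different ones.

static unsigned int foo;

void add_no_ret(void)
{
  /* return value unused: before this patch the BPF backend emitted
   *   lock *(u32 *)(r1 + 0) += r2
   * after this patch it emits
   *   r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)
   */
  __sync_fetch_and_add(&foo, 1);
}

unsigned int add_with_ret(void)
{
  /* return value used: atomic_fetch_add is emitted both before and
   * after this patch */
  return __sync_fetch_and_add(&foo, 1);
}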

@yonghong-song requested review from 4ast and eddyz87 on July 31, 2024
@peilin-ye
Contributor

Thanks for the quick fix! The reporter's first name in the commit message should be "Peilin".

kernel-patches-daemon-bpf bot pushed a commit to kernel-patches/bpf that referenced this pull request Aug 1, 2024
Peilen Ye reported an issue ([1]) where, for __sync_fetch_and_add(...) without
a return value, like
  __sync_fetch_and_add(&foo, 1);
the llvm BPF backend generates a locked insn, e.g.
  lock *(u32 *)(r1 + 0) += r2

If __sync_fetch_and_add(...) returns a value, like
  res = __sync_fetch_and_add(&foo, 1);
the llvm BPF backend generates something like
  r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)

But 'lock *(u32 *)(r1 + 0) += r2' causes a problem in the jit,
since a proper barrier is not inserted based on __sync_fetch_and_add() semantics.

The above discrepancy is due to commit [2], which tries to maintain backward
compatibility: before commit [2], __sync_fetch_and_add(...) generated a lock
insn in the BPF backend.

Based on the discussion in [1], it is now time to fix the above discrepancy so we
can have proper barrier support in the jit. llvm patch [3] makes sure that
__sync_fetch_and_add(...) always generates atomic_fetch_add(...) insns. Now
'lock *(u32 *)(r1 + 0) += r2' can only be generated by inline asm. The same holds
for __sync_fetch_and_and(), __sync_fetch_and_or() and __sync_fetch_and_xor().

But the change in [3] caused an arena_atomics selftest failure.

  test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
  libbpf: prog 'and': BPF program load failed: Permission denied
  libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
  arg#0 reference type('UNKNOWN ') size cannot be determined: -22
  0: R1=ctx() R10=fp0
  ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
  0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
  2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
  4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
  ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
  6: (18) r1 = 0x100000000060           ; R1_w=scalar()
  8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
  9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
  11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
  BPF_ATOMIC stores into R1 arena is not allowed
  processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'and': failed to load: -13
  libbpf: failed to load object 'arena_atomics'
  libbpf: failed to load BPF skeleton 'arena_atomics': -13
  test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
  #3       arena_atomics:FAIL

The failure is due to [4], where atomic{64,}_fetch_{and,or,xor}() are not
allowed for arena addresses. Without llvm patch [3], the compiler generates a
'lock ...' insn and everything works fine.

This patch fixes the problem by using inline asm. Instead of the
__sync_fetch_and_{and,or,xor}() functions, inline asm with a 'lock' insn is
used, and it works with or without [3].

  [1] https://lore.kernel.org/bpf/[email protected]/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
  [2] llvm/llvm-project@286daaf
  [3] llvm/llvm-project#101428
  [4] d503a04 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")

Signed-off-by: Yonghong Song <[email protected]>
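
As a rough sketch of the inline-asm replacement described above (not the exact selftest source; the operand values come from the log above, and this assumes the BPF assembler accepts the 'lock ... &=' form analogously to the 'lock ... +=' form shown later in this thread):

/* Sketch: instead of
 *   __sync_fetch_and_and(&and64_value, 0x011ull << 32);
 * emit the non-fetching 'lock' insn directly via inline asm, so the
 * generated code does not depend on [3]. */
void arena_and_sketch(unsigned long long *and64_value)
{
  asm volatile("lock *(u64 *)(%[val] + 0) &= %[mask]"
               :
               : [val]"r"(and64_value), [mask]"r"(0x011ull << 32)
               : "memory");
}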
@eddyz87 left a comment
Contributor

The change itself looks fine to me; I have a nitpick regarding disassembler behaviour for cpu v3.

Contributor

Nitpick, the CHECKs could be simplified a bit, e.g.:

; RUN: llc < %s -march=bpfel -verify-machineinstrs -show-mc-encoding | FileCheck %s
; RUN: llc < %s -march=bpfel -verify-machineinstrs -show-mc-encoding -mcpu=v3 | FileCheck --check-prefix=CHECK-V3 %s

; CHECK-LABEL: test_load_add_32
; CHECK: r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)
; CHECK-V3: w2 = atomic_fetch_add((u32 *)(r1 + 0), w2)
; CHECK: encoding: [0xc3,0x21,0x00,0x00,0x01,0x00,0x00,0x00]
define void @test_load_add_32(ptr %p, i32 zeroext %v) {
entry:
  atomicrmw add ptr %p, i32 %v seq_cst
  ret void
}

; CHECK-LABEL: test_load_add_64
; CHECK: r2 = atomic_fetch_add((u64 *)(r1 + 0), r2)
; CHECK: encoding: [0xdb,0x21,0x00,0x00,0x01,0x00,0x00,0x00]
define void @test_load_add_64(ptr %p, i64 zeroext %v) {
entry:
  atomicrmw add ptr %p, i64 %v seq_cst
  ret void
}

On the other hand, it looks like we should change the disassembler to avoid the discrepancy:

; CHECK: r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)
; CHECK-V3: w2 = atomic_fetch_add((u32 *)(r1 + 0), w2)

The instruction encoding is the same in both cases.

Contributor Author

Thanks. Indeed, I simplified the CHECKs a little bit.
As for the difference between

; CHECK: r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)
; CHECK-V3: w2 = atomic_fetch_add((u32 *)(r1 + 0), w2)

I think it is okay, since this is the asm output from llc.

Contributor

Do you mind if I try to change this behaviour for both llc and disasm? (in a separate pr)

Contributor Author

Do you mind if I try to change this behaviour for both llc and disasm? (in a separate pr)

I don't know what code change you intend to make, but go ahead and send a patch so we can discuss it.

kernel-patches-daemon-bpf-rc bot pushed a commit to kernel-patches/bpf-rc that referenced this pull request Aug 3, 2024
Peilen Ye reported an issue ([1]) where, for __sync_fetch_and_add(...) without
a return value, like
  __sync_fetch_and_add(&foo, 1);
the llvm BPF backend generates a locked insn, e.g.
  lock *(u32 *)(r1 + 0) += r2

If __sync_fetch_and_add(...) returns a value, like
  res = __sync_fetch_and_add(&foo, 1);
the llvm BPF backend generates something like
  r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)

But 'lock *(u32 *)(r1 + 0) += r2' causes a problem in the jit,
since a proper barrier is not inserted based on __sync_fetch_and_add() semantics.

The above discrepancy is due to commit [2], which tries to maintain backward
compatibility: before commit [2], __sync_fetch_and_add(...) generated a lock
insn in the BPF backend.

Based on the discussion in [1], it is now time to fix the above discrepancy so we
can have proper barrier support in the jit. llvm patch [3] makes sure that
__sync_fetch_and_add(...) always generates atomic_fetch_add(...) insns. Now
'lock *(u32 *)(r1 + 0) += r2' can only be generated by inline asm. The same holds
for __sync_fetch_and_and(), __sync_fetch_and_or() and __sync_fetch_and_xor().

But the change in [3] caused an arena_atomics selftest failure.

  test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
  libbpf: prog 'and': BPF program load failed: Permission denied
  libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
  arg#0 reference type('UNKNOWN ') size cannot be determined: -22
  0: R1=ctx() R10=fp0
  ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
  0: (18) r1 = 0xffffc90000064000       ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
  2: (61) r6 = *(u32 *)(r1 +0)          ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
  4: (77) r0 >>= 32                     ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  5: (5d) if r0 != r6 goto pc+11        ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
  ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
  6: (18) r1 = 0x100000000060           ; R1_w=scalar()
  8: (bf) r1 = addr_space_cast(r1, 0, 1)        ; R1_w=arena
  9: (18) r2 = 0x1100000000             ; R2_w=0x1100000000
  11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
  BPF_ATOMIC stores into R1 arena is not allowed
  processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'and': failed to load: -13
  libbpf: failed to load object 'arena_atomics'
  libbpf: failed to load BPF skeleton 'arena_atomics': -13
  test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
  #3       arena_atomics:FAIL

The failure is due to [4], where atomic{64,}_fetch_{and,or,xor}() are not
allowed for arena addresses. Without llvm patch [3], the compiler generates a
'lock ...' insn and everything works fine.

This patch fixes the problem by using inline asm. Instead of the
__sync_fetch_and_{and,or,xor}() functions, inline asm with a 'lock' insn is
used, and it works with or without [3].
Note that the three bpf programs ('and', 'or' and 'xor') are also guarded with
__BPF_FEATURE_ADDR_SPACE_CAST to ensure compilation failure for llvm <= 18
versions. Note that for llvm <= 18, where addr_space_cast is not supported, all
arena_atomics subtests are skipped with the message below:
  test_arena_atomics:SKIP:no ENABLE_ATOMICS_TESTS or no addr_space_cast support in clang
  #3 arena_atomics:SKIP

  [1] https://lore.kernel.org/bpf/[email protected]/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
  [2] llvm/llvm-project@286daaf
  [3] llvm/llvm-project#101428
  [4] d503a04 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")

Signed-off-by: Yonghong Song <[email protected]>
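
A minimal sketch of the guard just described (program bodies omitted; only the two macro names come from the commit message, the comment is mine):

#if defined(ENABLE_ATOMICS_TESTS) && defined(__BPF_FEATURE_ADDR_SPACE_CAST)
/* the 'and', 'or' and 'xor' arena programs, using inline-asm 'lock' insns,
 * are only built when both macros are defined */
#endif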
@yonghong-song merged commit c566769 into llvm:main on Aug 5, 2024
7 checks passed
@brycekahle

brycekahle commented Aug 23, 2024

@yonghong-song this seems to be a breaking change for folks still targeting v1 or v2 instruction sets. LLVM generates an instruction that isn't available.

@yonghong-song
Contributor Author

@yonghong-song this seems to be a breaking change for folks still targeting v1 or v2 instruction sets. LLVM generates an instruction that isn't available.

@brycekahle Yes. This is a deliberate choice. See upstream discussion in
https://lore.kernel.org/bpf/[email protected]/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f

The following is a quote from Alexei in that discussion:

Right. We did it for backward compat. Older llvm was
completely wrong to generate xadd for __sync_fetch_and_add.
That was my hack from 10 years ago when xadd was all we had.
So we fixed that old llvm bug, but introduced another with all
good intentions.
Since proper atomic insns were introduced 3 years ago we should
remove this backward compat feature/bug from llvm.
The only breakage is for kernels older than 5.12.
I think that's an acceptable risk.

@brycekahle

@yonghong-song what is the best place for me to voice my concerns? We ship object files to our Datadog customers that target a wide range of kernels, going much further back than 5.12. The total customer host install count is 400,000+. We do not have the luxury of only deploying eBPF on internal infrastructure where we have total control of the kernel version used.

We also intentionally limit ourselves to v1 of the instruction set. Having the compiler generate v3 instructions, even when specifying v1, seems like a bug and a breaking change.

At the very least, there should be a non-inline assembly way for us to explicitly use the XADD instruction. Inline BPF assembly has its own challenges if you cannot use clang -target bpf, but must use clang + llc -march=bpf.

@yonghong-song
Contributor Author

@brycekahle
For the following,

We also intentionally limit ourselves to v1 of the instruction set. Having the compiler generate v3 instructions, even when specifying v1, seems like a bug and a breaking change.

Yes, the new change will generate the atomic_fetch_add insn with cpu=v1 instead of 'lock xadd'. We could move generating atomic_fetch_add to -mcpu=v3, but then a lot of bpf programs might fail. Note that atomic_fetch_add in v1 will be okay with kernels >= 5.12. As mentioned in my previous comments, this is a deliberate choice.

You can use inline asm as in
https://lore.kernel.org/bpf/[email protected]/
'clang -target bpf ...' should work; you do not need to use 'clang + llc'. Let me know if not, and I am happy to help.

cc @4ast for more comments.

@brycekahle

@yonghong-song we cannot use clang -target bpf because we have non-CO-RE kernel tracing. From https://www.kernel.org/doc/html/v5.17/bpf/bpf_devel_QA.html#q-clang-flag-for-target-bpf

Q: In some cases clang flag -target bpf is used but in other cases the default clang target, which matches the underlying architecture, is used. What is the difference and when I should use which?

A: Although LLVM IR generation and optimization try to stay architecture independent, -target still has some impact on generated code:

BPF program may recursively include header file(s) with file scope inline assembly codes. The default target can handle this well, while bpf target may fail if bpf backend assembler does not understand these assembly codes, which is true in most cases.

@eddyz87
Contributor

eddyz87 commented Aug 26, 2024

Hi @brycekahle,

Could you please elaborate a bit on the issues with inline assembly and clang + llc?
A simple test appears to work:

$ cat test3.c
void foo(unsigned long *p, unsigned long v)
{
  asm volatile("lock *(u64 *)(%[p] + 0) += %[v];" :: [p]"r"(p), [v]"r"(v));
}

$ clang -mcpu=v1 -O2 --target=bpf -S -emit-llvm -o - test3.c | llc -march=bpf -mcpu=v1 -o - -filetype=asm
	.text
	.file	"test3.c"
	.globl	foo                             # -- Begin function foo
	.p2align	3
	.type	foo,@function
foo:                                    # @foo
.Lfoo$local:
	.type	.Lfoo$local,@function
# %bb.0:
	#APP
	lock *(u64 *)(r1 + 0) += r2

	#NO_APP
	exit
.Lfunc_end0:
	.size	foo, .Lfunc_end0-foo
	.size	.Lfoo$local, .Lfunc_end0-foo
                                        # -- End function

@brycekahle

@eddyz87 You are still using clang --target=bpf, you have to use the default target.

@eddyz87
Contributor

eddyz87 commented Aug 26, 2024

@brycekahle,

@eddyz87 You are still using clang --target=bpf, you have to use the default target.

Well, that is a very strange arrangement; I'm not sure there is documentation stating that this is a supported use case. I remember several architecture-specific checks in the clang front-end. Anyway, a simple example still works:

$ clang -O2 -S -emit-llvm -o - test3.c | llc -march=bpf -mcpu=v1 -o - -filetype=asm
	.text
	.file	"test3.c"
	.globl	foo                             # -- Begin function foo
	.p2align	3
	.type	foo,@function
foo:                                    # @foo
.Lfoo$local:
	.type	.Lfoo$local,@function
	.cfi_startproc
# %bb.0:
	#APP
	lock *(u64 *)(r1 + 0) += r2

	#NO_APP
	exit
.Lfunc_end0:
	.size	foo, .Lfunc_end0-foo
	.size	.Lfoo$local, .Lfunc_end0-foo
	.cfi_endproc
                                        # -- End function

@brycekahle

It is explicitly called out in the kernel documentation I linked and quoted.

@eddyz87
Contributor

eddyz87 commented Aug 26, 2024

@brycekahle

It is explicitly called out in the kernel documentation I linked and quoted.

Oh, sorry, my bad.

But you stated that there is some issue with using inline assembly with clang + llc. What is the issue you observe?

@brycekahle

It turns out what breaks inline assembly with the default target is trying to use the bpf registers by name. Since the example you gave does not do that, it does actually compile.

I still think generating v3 instructions when v1 is explicitly asked for is a giant footgun that is going to trip up a lot of folks. Could you make this new behavior apply only when v3 (or later) is used?

@eddyz87
Contributor

eddyz87 commented Aug 27, 2024

@brycekahle,

It turns out what breaks inline assembly with the default target is trying to use the bpf registers by name. Since the example you gave does not do that, it does actually compile.

For asm body or for asm parameters?

I was able to get an error from clang when specifying register names in parameters. Clang does some limited target-specific checks for inline assembly parameters, e.g. see virtual TargetInfo::isValidGCCRegisterName, which is used for semantic checks. This is why I said that this clang --target=native + llc -march=bpf arrangement is strange and is ultimately a hack.
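
For concreteness, a sketch of my own (not from this thread) of the kind of construct that trips those front-end checks: register names inside the asm template are just text at this stage, but names used for explicit register variables or constraints are validated against the current target, so the default x86_64 front end rejects 'r1'/'r2' below as unknown register names.

void xadd_pinned(unsigned long *p, unsigned long v)
{
  /* explicit register variables: checked via TargetInfo::isValidGCCRegisterName,
   * so this fails to compile with the default x86_64 target */
  register unsigned long *rp asm("r1") = p;
  register unsigned long rv asm("r2") = v;

  asm volatile("lock *(u64 *)(r1 + 0) += r2"
               :
               : "r"(rp), "r"(rv)
               : "memory");
}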

I still think generating v3 instructions when v1 is explicitly asked for is a giant footgun that is going to trip up a lot of folks. Could you make this new behavior apply only when v3 (or later) is used?

Agree that this is a footgun. I think we should report an error when __sync_fetch_and_add is used for cpu v1.
And probably provide an intrinsic to generate xadd. However, the following code works:

$ cat test3.c 
__attribute__((always_inline))
void xadd(unsigned long *p, unsigned long v)
{
  asm volatile("lock *(u64 *)(%[p] + 0) += %[v];" :: [p]"r"(p), [v]"r"(v): "memory");
}

void foo(unsigned long *p, unsigned long v)
{
  xadd(p, v);
}

$ clang -O2 -S -emit-llvm -o - test3.c | llc -march=bpf -mcpu=v1 -o - -filetype=obj | llvm-objdump --no-show-raw-insn -d -

<stdin>:	file format elf64-bpf

Disassembly of section .text:

0000000000000000 <xadd>:
       0:	lock *(u64 *)(r1 + 0x0) += r2
       1:	exit

0000000000000010 <foo>:
       2:	lock *(u64 *)(r1 + 0x0) += r2
       3:	exit

So an intrinsic is optional.

@yonghong-song
Contributor Author

clang with the default architecture plus llc with the bpf target works in some cases but not always; e.g., the x86_64 target may generate constructs (e.g. jump tables) which the bpf backend will not support. In the early days, when CO-RE was not available, we did use clang with the default architecture and llc with the bpf target, but that is complicated and needs access to kernel internal headers. Now that CO-RE is available, the recommendation is to use 'clang --target=bpf ...' directly, which is also what we do in the bpf selftests.

@brycekahle

the x86_64 target may generate constructs (e.g. jump tables) which the bpf backend will not support

We avoid this by passing -fno-jump-tables (along with some other flags).

when CO-RE was not available

This is the use case I'm talking about.

@4ast
Member

4ast commented Aug 27, 2024

I think we should report an error when __sync_fetch_and_add is used for cpu v1.

datadog is not the only user of (void)__sync_fetch_and_add(&counter, ...);
cilium has the same pattern and they also support old kernels.
If new llvm errors out with -mcpu=v1, the users won't upgrade llvm.
So I think for -mcpu=v1 and (void)__sync_fetch_and_add(), llvm should generate the old bpf xadd insn,
since that's the only insn it can generate.
For the pattern "var = __sync_fetch_and_add()" and -mcpu=v1, llvm should error out.
I believe it already started doing this some time back.

For (void)__sync_fetch_and_add() and -mcpu=v3 we should generate the fetch_and_add() insn.

@yonghong-song
Contributor Author

I think we should report an error when __sync_fetch_and_add is used for cpu v1.

datadog is not the only user of (void)__sync_fetch_and_add(&counter, ...); cilium has the same pattern and they also support old kernels. If new llvm errors out with -mcpu=v1, the users won't upgrade llvm. So I think for -mcpu=v1 and (void)__sync_fetch_and_add(), llvm should generate the old bpf xadd insn, since that's the only insn it can generate. For the pattern "var = __sync_fetch_and_add()" and -mcpu=v1, llvm should error out. I believe it already started doing this some time back.

For (void)__sync_fetch_and_add() and -mcpu=v3 we should generate the fetch_and_add() insn.

Sounds good. I will fix this in llvm20.

yonghong-song pushed a commit to yonghong-song/llvm-project that referenced this pull request Aug 29, 2024
…ch_add insn (llvm#101428)"

This reverts commit c566769.

See the discussion in [1]. Currently, with -mcpu=v1/v2, an atomic_fetch_add() insn is generated
for (void)__sync_fetch_and_add(...). This breaks backward compatibility since
there is code that runs on old systems (< 5.12) which do not support atomic_fetch_add().

Now let's revert the previous commit ([1]) and add additional logic in the next patch
to ensure that for (void)__sync_fetch_and_add(...):
  v1/v2:  generate locked add insn
  >= v3:  generate atomic_fetch_add insn

  [1] llvm#101428
yonghong-song added a commit that referenced this pull request Aug 30, 2024
…106494)

This patch contains two parts:
- first, revert patch #101428;
- second, remove the `atomic_fetch_and_*()` to `atomic_<op>()`
  conversion (when the return value is not used), but preserve the
  `__sync_fetch_and_add()` to locked insn conversion with cpu v1/v2.
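
To make the resulting behaviour concrete, a small sketch of my own based on the description above and Alexei's earlier comment (the insn forms in the comments are the ones quoted in this thread; exact registers may differ):

static unsigned long counter;

void bump(unsigned long delta)
{
  /* return value unused:
   *   -mcpu=v1/v2: lock *(u64 *)(r1 + 0) += r2   (the old xadd insn)
   *   -mcpu=v3/v4: r2 = atomic_fetch_add((u64 *)(r1 + 0), r2)
   */
  (void)__sync_fetch_and_add(&counter, delta);
}

unsigned long bump_and_read(unsigned long delta)
{
  /* return value used: atomic_fetch_add is required, so per the discussion
   * above llvm errors out for -mcpu=v1/v2 */
  return __sync_fetch_and_add(&counter, delta);
}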
yonghong-song added a commit that referenced this pull request Sep 3, 2024
Before llvm20, (void)__sync_fetch_and_add(...) always generated locked
xadd insns. In the linux kernel upstream discussion [1], it was found that
for the arm64 architecture, the original semantics of
(void)__sync_fetch_and_add(...), i.e. __atomic_fetch_add(...), are
preferred so that the jit can emit proper native barrier insns.

With llvm commits [2] and [3], (void)__sync_fetch_and_add(...) will
generate the following insns:
  - for cpu v1/v2: locked xadd insns, to keep backward compatibility
  - for cpu v3/v4: __atomic_fetch_add() insns

To ensure proper barrier semantics for (void)__sync_fetch_and_add(...),
cpu v3/v4 is recommended.

This patch makes cpu v3 the default cpu version. For users who want to
use cpu v1, -mcpu=v1 needs to be explicitly added to the clang/llc
command line.

  [1]
https://lore.kernel.org/bpf/[email protected]/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
  [2] #101428
  [3] #106494
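
For users who need to keep targeting older kernels, a minimal example of explicitly pinning cpu v1 after this default change (file names are placeholders):

$ clang --target=bpf -mcpu=v1 -O2 -c prog.bpf.c -o prog.bpf.o
$ # or, with the clang-native + llc flow discussed earlier in this thread:
$ clang -O2 -S -emit-llvm -o - prog.bpf.c | llc -march=bpf -mcpu=v1 -filetype=obj -o prog.bpf.o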
@yonghong-song deleted the fix-xadd branch on February 8, 2025