
[BPF] expand mem intrinsics (memcpy, memmove, memset) #97648


Status: Open. Wants to merge 1 commit into main.

Conversation

@inclyc (Member) commented Jul 3, 2024

Declare that we do not support these "RTLIB" symbols, so that the pre-ISel intrinsic lowering pass can expand them.

Also removes `warn-call.ll`, because we now generate loops for these memory intrinsics.

Fixes: #93700

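As a rough sketch of the mechanism described above (an assumed shape, not the literal diff of this PR): marking the RTLIB entries as unavailable in the BPF TargetLowering constructor leaves PreISelIntrinsicLowering no libcall to fall back on, so the memcpy/memmove/memset intrinsics must be expanded inline.

// Assumed sketch, in the spirit of this PR: inside BPFTargetLowering's
// constructor in BPFISelLowering.cpp, declare the runtime library
// routines unsupported so no call to them is ever emitted.
for (RTLIB::Libcall LC : {RTLIB::MEMCPY, RTLIB::MEMMOVE, RTLIB::MEMSET})
  setLibcallName(LC, nullptr);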
@inclyc requested review from eddyz87, 4ast and RKSimon on Jul 3, 2024
@eddyz87 (Contributor) commented Jul 5, 2024

Hi @inclyc,

I tried this patch with the BPF Linux kernel selftests, and it triggered a verification failure in async_stack_depth.bpf.o.
The reason for the failure is a change in the way `char buf[256] = {}` is translated.
Before this change it looked as follows:

0000000000000000 <timer_cb>:
; {
       0:	b7 01 00 00 00 00 00 00	r1 = 0x0
    ...
; 	volatile char buf[256] = {};
    ...
       9:	b4 02 00 00 00 00 00 00	w2 = 0x0
      10:	63 2a f8 ff 00 00 00 00	*(u32 *)(r10 - 0x8) = r2
      11:	73 2a fc ff 00 00 00 00	*(u8 *)(r10 - 0x4) = r2
      12:	73 2a b7 ff 00 00 00 00	*(u8 *)(r10 - 0x49) = r2
      13:	6b 2a b0 ff 00 00 00 00	*(u16 *)(r10 - 0x50) = r2
      14:	7b 1a a8 ff 00 00 00 00	*(u64 *)(r10 - 0x58) = r1
      15:	7b 1a a0 ff 00 00 00 00	*(u64 *)(r10 - 0x60) = r1
      16:	7b 1a 98 ff 00 00 00 00	*(u64 *)(r10 - 0x68) = r1
      17:	7b 1a 90 ff 00 00 00 00	*(u64 *)(r10 - 0x70) = r1
      18:	7b 1a 88 ff 00 00 00 00	*(u64 *)(r10 - 0x78) = r1
    ...

After this change it looks as follows:

; 	volatile char buf[256] = {};
       9:	73 1a ba ff 00 00 00 00	*(u8 *)(r10 - 0x46) = r1
      10:	b7 02 00 00 00 00 00 00	r2 = 0x0
      11:	bf a3 00 00 00 00 00 00	r3 = r10
      12:	07 03 00 00 00 ff ff ff	r3 += -0x100
      13:	0f 23 00 00 00 00 00 00	r3 += r2
      14:	73 13 00 00 00 00 00 00	*(u8 *)(r3 + 0x0) = r1
      15:	07 02 00 00 01 00 00 00	r2 += 0x1
      16:	a5 02 fa ff ba 00 00 00	if r2 < 0xba goto -0x6 <timer_cb+0x58>

Basically, the fully unrolled version was replaced by a loop.
There is some code in BPFISelLowering.cpp that specifies the limits for full unrolling:

BPFTargetLowering::BPFTargetLowering(const TargetMachine &TM,
                                     const BPFSubtarget &STI)
    : TargetLowering(TM) {
  ...
    MaxStoresPerMemset = MaxStoresPerMemsetOptSize = 0;
    MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 0;
    MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = 0;
  ...
}

Because of the way the kernel BPF verifier works, the unrolled version is preferable to the loop (the verifier traces execution, so processing a loop consumes more of the instruction budget). Is it possible to make llvm::expandMemMoveAsLoop() respect the limits set by MaxStoresPerMemset?
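To make the question concrete, here is a hypothetical shape such a check could take (MemSetInst and getMaxStoresPerMemset are existing LLVM APIs; the helper, its name, and the 8-bytes-per-store arithmetic are assumptions):

// Hypothetical helper, not existing LLVM code: decide whether a
// memset has to be expanded as a loop or fits the unroll budget.
static bool shouldExpandMemSetAsLoop(const MemSetInst &MS,
                                     const TargetLowering &TLI) {
  const auto *Len = dyn_cast<ConstantInt>(MS.getLength());
  if (!Len)
    return true; // dynamic length: a loop is the only option
  // Assume one store covers at most 8 bytes on BPF, so the unrolled
  // form needs about Len/8 stores; unroll only within the budget.
  uint64_t Stores = (Len->getZExtValue() + 7) / 8;
  return Stores > TLI.getMaxStoresPerMemset(/*OptSize=*/false);
}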

@yonghong-song, fyi.

@inclyc requested a review from yonghong-song on Jul 5, 2024
@inclyc (Member, Author) commented Jul 5, 2024

Hi @eddyz87,

Thanks for testing this PR in the kernel! ❤️

Because of the way the kernel BPF verifier works, the unrolled version is preferable to the loop (the verifier traces execution, so processing a loop consumes more of the instruction budget). Is it possible to make llvm::expandMemMoveAsLoop() respect the limits set by MaxStoresPerMemset?

I'd like to figure out how to do that. And may I kindly ask: if an unrolled `memset` is friendly to the verifier, what should we do when `memset` is called with a dynamic length? (Expand it into a loop, or report an error?)

@eddyz87 (Contributor) commented Jul 5, 2024

I'd like to figure out how to do that. And may I kindly ask: if an unrolled `memset` is friendly to the verifier, what should we do when `memset` is called with a dynamic length? (Expand it into a loop, or report an error?)

In certain cases the verifier is able to handle a dynamic length.
Suppose there are two programs like these:

# program A                    # program B
1: r1 = 0;                     1:       r1 = 0;  
2: *(u64 *)(r10 - 8) = r1;     2:       r2 = 0;
3: *(u64 *)(r10 - 16) = r1;    3:       r3 = r10;
4: *(u64 *)(r10 - 24) = r1;    4:   l1: r3 += -8;
5: *(u64 *)(r10 - 32) = r1;    5:       *(u64 *)(r3 - 0) = r1;
                               6:       r2 += 1;
                               7:       if r2 != 4 goto l1; 

In essence, the verifier traces (probably it is more correct to say that it performs abstract interpretation of) each execution path in the program, and tries to trim traces that reach equivalent states. This logic makes loops a pain point. For program A the verifier would trace the path 1,2,3,4,5, while for program B it would trace the path 1,2,3,4,5,6,7,4,5,6,7,4,5,6,7,4,5,6,7, which is significantly longer. The bound at instruction B.7 does not necessarily have to be a constant; it could be a register with a value known to the verifier (e.g., if we add an instruction B.0 with r4 = 4, the instruction at B.7 could be replaced by if r2 != r4 goto l1 and the trace would be the same).
Hence, it is preferable, but not strictly necessary, to unroll the loops.

Added: there is also a certain budget for the maximal number of instructions that the verifier will trace before giving up.

@yonghong-song (Contributor) commented:

If the user has a call to memcpy that cannot simply be unrolled, the current behavior is to issue an error. For example:

$ cat test1.c
#include <stdint.h>
#include <string.h>
typedef struct {
    unsigned char x[8];
} buf_t;
void f(buf_t *buf, uint64_t y, uint64_t z) {
    if (z > 8) z = 8;
    unsigned char *y_bytes = (unsigned char *)&y;
    memcpy(buf->x, y_bytes, z);
}

With the current compiler (LLVM 18), we get:

/* I added gnu/stubs-32.h to the current directory to make compilation pass */
$ clang -O2 -target bpf -I . test1.c -c
test1.c:6:6: error: A call to built-in function 'memcpy' is not supported.
    6 | void f(buf_t *buf, uint64_t y, uint64_t z) {
      |      ^
1 error generated.

But with this patch, I see

0000000000000000 <f>:
       0:       7b 2a f8 ff 00 00 00 00 *(u64 *)(r10 - 0x8) = r2
       1:       b7 04 00 00 08 00 00 00 r4 = 0x8
       2:       bf 32 00 00 00 00 00 00 r2 = r3
       3:       2d 34 01 00 00 00 00 00 if r4 > r3 goto +0x1 <f+0x28>
       4:       b7 02 00 00 08 00 00 00 r2 = 0x8
       5:       15 02 0b 00 00 00 00 00 if r2 == 0x0 goto +0xb <f+0x88>
       6:       b7 02 00 00 00 00 00 00 r2 = 0x0
       7:       bf 15 00 00 00 00 00 00 r5 = r1
       8:       0f 25 00 00 00 00 00 00 r5 += r2
       9:       bf a0 00 00 00 00 00 00 r0 = r10
      10:       07 00 00 00 f8 ff ff ff r0 += -0x8
      11:       0f 20 00 00 00 00 00 00 r0 += r2
      12:       71 00 00 00 00 00 00 00 r0 = *(u8 *)(r0 + 0x0)
      13:       73 05 00 00 00 00 00 00 *(u8 *)(r5 + 0x0) = r0
      14:       07 02 00 00 01 00 00 00 r2 += 0x1
      15:       3d 32 01 00 00 00 00 00 if r2 >= r3 goto +0x1 <f+0x88>
      16:       2d 24 f6 ff 00 00 00 00 if r4 > r2 goto -0xa <f+0x38>
      17:       95 00 00 00 00 00 00 00 exit

Basically, memcpy is 'inlined' by the compiler.

This is probably not what we want for test1.c, since the memcpy call comes from the user. In such cases, we would like the user to write the loop explicitly for maximum performance (e.g., load/store with u64 and handle the remainder with u8; see the sketch below).
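For illustration, a minimal sketch of such a hand-written loop (the function name is made up, and the u64 accesses assume 8-byte loads/stores are legal for the given pointers):

#include <stddef.h>
#include <stdint.h>

/* Illustrative hand-written copy: bulk in 8-byte chunks, remainder in
 * single bytes. */
static void copy_bytes(unsigned char *dst, const unsigned char *src,
                       size_t n) {
  size_t i = 0;
  for (; i + sizeof(uint64_t) <= n; i += sizeof(uint64_t))
    *(uint64_t *)(dst + i) = *(const uint64_t *)(src + i);
  for (; i < n; i++)
    dst[i] = src[i];
}

Note that without something like the DisableLIRP change discussed below, the optimizer may recognize such a loop and turn it back into a memcpy call.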

Maybe we should somehow prevent memcpy generation if it cannot be fully unrolled later (meaning no loops)? We have some backend hooks at the IR level; maybe we can leverage them?

@eddyz87 (Contributor) commented Jul 5, 2024

@yonghong-song, the memcpy is inserted by LoopIdiomRecognizePass.
It uses the TargetLibraryInfo class to check whether memcpy is available (see LoopIdiomRecognize::runOnLoop).
After some inspection I found two ways to customize TargetLibraryInfo:

  • add function attributes of the form no-builtin-... (see the explicit TargetLibraryInfo::TargetLibraryInfo constructor);
  • adapt TargetLibraryInfo.cpp:initializeLibCalls().

Given that there are many builtins that we would probably like to disable, the second option seems better.
On the other hand, I don't know what impact this would have on programs that use memcpy/memmove/memset explicitly.
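For reference, the first option can be exercised from source via clang's no_builtin attribute, which lowers to a no-builtin-memcpy function attribute (a minimal, assumed example; the function itself is made up):

/* With the attribute, TargetLibraryInfo reports memcpy as unavailable
 * for this function, so LoopIdiomRecognize will not turn the loop into
 * a memcpy call. */
__attribute__((no_builtin("memcpy")))
void copy_fwd(char *dst, const char *src, unsigned long n) {
  for (unsigned long i = 0; i < n; i++)
    dst[i] = src[i];
}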

There is also the following switch in LoopIdiomRecognize.h, used by LoopIdiomRecognizePass:

/// Options to disable Loop Idiom Recognize, which can be shared with other
/// passes.
struct DisableLIRP {
  /// When true, the entire pass is disabled.
  static bool All;

  /// When true, Memset is disabled.
  static bool Memset;

  /// When true, Memcpy is disabled.
  static bool Memcpy;
};

But this would only solve the issue for the loop idiom pass.

@yonghong-song (Contributor) commented Jul 5, 2024

Thanks @eddyz87! I think the following change is good enough for now.

$ git diff
diff --git a/llvm/lib/Target/BPF/BPFTargetMachine.cpp b/llvm/lib/Target/BPF/BPFTargetMachine.cpp
index 7d91fa8bb824..005f6affaffd 100644
--- a/llvm/lib/Target/BPF/BPFTargetMachine.cpp
+++ b/llvm/lib/Target/BPF/BPFTargetMachine.cpp
@@ -29,6 +29,7 @@
 #include "llvm/Support/FormattedStream.h"
 #include "llvm/Target/TargetOptions.h"
 #include "llvm/Transforms/Scalar.h"
+#include "llvm/Transforms/Scalar/LoopIdiomRecognize.h"
 #include "llvm/Transforms/Scalar/SimplifyCFG.h"
 #include "llvm/Transforms/Utils/SimplifyCFGOptions.h"
 #include <optional>
@@ -76,6 +77,9 @@ BPFTargetMachine::BPFTargetMachine(const Target &T, const Triple &TT,
       Subtarget(TT, std::string(CPU), std::string(FS), *this) {
   initAsmInfo();
 
+  // Add some comments here.
+  DisableLIRP::All = true;
+
   BPFMCAsmInfo *MAI =
       static_cast<BPFMCAsmInfo *>(const_cast<MCAsmInfo *>(AsmInfo.get()));
   MAI->setDwarfUsesRelocationsAcrossSections(!Subtarget.getUseDwarfRIS());

Basically, LoopIdiomRecognizePass exposes DisableLIRP::All (covering memcpy/memset) so that other passes can disable this transformation. This should prevent the original source code from generating a library function call that would later be expanded into a loop again.

Note that BPF programs do not like library functions, since they prevent the verifier from doing its work.

@inclyc (Member, Author) commented Jul 6, 2024

Hi @yonghong-song,

Do you think it would be a good option to add some mem intrinsics to the kernel? Then we could combine and generate such intrinsics, and the verifier could easily recognize them (if each is just a BPF_CALL).

@yonghong-song (Contributor) commented:

Do you think it would be a good option to add some mem intrinsics to the kernel? Then we could combine and generate such intrinsics, and the verifier could easily recognize them (if each is just a BPF_CALL).

You probably mean adding some kernel kfuncs (similar to the intrinsic functions you mentioned, e.g., memcpy/memset). Currently there is no plan for this, as the verifier is able to handle fully unrolled code or actual loops pretty well. There is no special processing here.

Introducing a kfunc needs additional verifier changes, and there is no obvious benefit. We do not have a use case for memcpy/memset kfuncs yet.


Successfully merging this pull request may close this issue: bpf target tries to emit memcpy and fails to compile (#93700)