
[BPF] expand mem intrinsics (memcpy, memmove, memset) #97648


Status: Open. Wants to merge 1 commit into main.

Conversation

@inclyc (Member) commented Jul 3, 2024

Declare that we do not support these "RTLIB" symbols, so that the pre-ISel intrinsic lowering pass can expand them.

Also removes `warn-call.ll`, because we now generate loops for these memory intrinsics.

Fixes: #93700

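As a rough sketch of the mechanism described above (an assumed shape, not the literal diff of this PR): marking the RTLIB entries as unavailable in the BPF TargetLowering constructor leaves PreISelIntrinsicLowering no libcall to fall back on, so the memcpy/memmove/memset intrinsics must be expanded inline.

// Assumed sketch, in the spirit of this PR: inside BPFTargetLowering's
// constructor in BPFISelLowering.cpp, declare the runtime library
// routines unsupported so no call to them is ever emitted.
for (RTLIB::Libcall LC : {RTLIB::MEMCPY, RTLIB::MEMMOVE, RTLIB::MEMSET})
  setLibcallName(LC, nullptr);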
@inclyc requested review from eddyz87, 4ast and RKSimon on Jul 3, 2024
@eddyz87 (Contributor) commented Jul 5, 2024

Hi @inclyc,

I tried this patch with the BPF Linux kernel selftests, and it triggered a verification failure in async_stack_depth.bpf.o.
The reason for the failure is a change in the way `char buf[256] = {}` is translated.
Before this change it looked as follows:

0000000000000000 <timer_cb>:
; {
       0:	b7 01 00 00 00 00 00 00	r1 = 0x0
    ...
; 	volatile char buf[256] = {};
    ...
       9:	b4 02 00 00 00 00 00 00	w2 = 0x0
      10:	63 2a f8 ff 00 00 00 00	*(u32 *)(r10 - 0x8) = r2
      11:	73 2a fc ff 00 00 00 00	*(u8 *)(r10 - 0x4) = r2
      12:	73 2a b7 ff 00 00 00 00	*(u8 *)(r10 - 0x49) = r2
      13:	6b 2a b0 ff 00 00 00 00	*(u16 *)(r10 - 0x50) = r2
      14:	7b 1a a8 ff 00 00 00 00	*(u64 *)(r10 - 0x58) = r1
      15:	7b 1a a0 ff 00 00 00 00	*(u64 *)(r10 - 0x60) = r1
      16:	7b 1a 98 ff 00 00 00 00	*(u64 *)(r10 - 0x68) = r1
      17:	7b 1a 90 ff 00 00 00 00	*(u64 *)(r10 - 0x70) = r1
      18:	7b 1a 88 ff 00 00 00 00	*(u64 *)(r10 - 0x78) = r1
    ...

After this change it looks as follows:

; 	volatile char buf[256] = {};
       9:	73 1a ba ff 00 00 00 00	*(u8 *)(r10 - 0x46) = r1
      10:	b7 02 00 00 00 00 00 00	r2 = 0x0
      11:	bf a3 00 00 00 00 00 00	r3 = r10
      12:	07 03 00 00 00 ff ff ff	r3 += -0x100
      13:	0f 23 00 00 00 00 00 00	r3 += r2
      14:	73 13 00 00 00 00 00 00	*(u8 *)(r3 + 0x0) = r1
      15:	07 02 00 00 01 00 00 00	r2 += 0x1
      16:	a5 02 fa ff ba 00 00 00	if r2 < 0xba goto -0x6 <timer_cb+0x58>

Basically, the fully unrolled version was replaced by a loop.
There is some code in BPFISelLowering.cpp that specifies the limits for full unrolling:

BPFTargetLowering::BPFTargetLowering(const TargetMachine &TM,
                                     const BPFSubtarget &STI)
    : TargetLowering(TM) {
  ...
    MaxStoresPerMemset = MaxStoresPerMemsetOptSize = 0;
    MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 0;
    MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = 0;
  ...
}

Because of the way the kernel BPF verifier works, the unrolled version is preferable to the loop (the verifier traces execution, so processing a loop consumes more of the instruction budget). Is it possible to make llvm::expandMemMoveAsLoop() respect the limits set by MaxStoresPerMemset?
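To make the question concrete, here is a hypothetical shape such a check could take (MemSetInst and getMaxStoresPerMemset are existing LLVM APIs; the helper, its name, and the 8-bytes-per-store arithmetic are assumptions):

// Hypothetical helper, not existing LLVM code: decide whether a
// memset has to be expanded as a loop or fits the unroll budget.
static bool shouldExpandMemSetAsLoop(const MemSetInst &MS,
                                     const TargetLowering &TLI) {
  const auto *Len = dyn_cast<ConstantInt>(MS.getLength());
  if (!Len)
    return true; // dynamic length: a loop is the only option
  // Assume one store covers at most 8 bytes on BPF, so the unrolled
  // form needs about Len/8 stores; unroll only within the budget.
  uint64_t Stores = (Len->getZExtValue() + 7) / 8;
  return Stores > TLI.getMaxStoresPerMemset(/*OptSize=*/false);
}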

@yonghong-song, fyi.

@inclyc requested a review from yonghong-song on Jul 5, 2024
@inclyc (Member, Author) commented Jul 5, 2024

Hi @eddyz87,

Thanks for testing this PR in the kernel! ❤️

Because of the way the kernel BPF verifier works, the unrolled version is preferable to the loop (the verifier traces execution, so processing a loop consumes more of the instruction budget). Is it possible to make llvm::expandMemMoveAsLoop() respect the limits set by MaxStoresPerMemset?

I'd like to figure out how to do that. And may I kindly ask: if an unrolled `memset` is friendly to the verifier, what should we do when `memset` is called with a dynamic length? (Expand it into a loop, or report an error?)

@eddyz87 (Contributor) commented Jul 5, 2024

I'd like to figure out how to do that. And may I kindly ask: if an unrolled `memset` is friendly to the verifier, what should we do when `memset` is called with a dynamic length? (Expand it into a loop, or report an error?)

In certain cases the verifier is able to handle a dynamic length.
Suppose there are two programs like these:

# program A                    # program B
1: r1 = 0;                     1:       r1 = 0;  
2: *(u64 *)(r10 - 8) = r1;     2:       r2 = 0;
3: *(u64 *)(r10 - 16) = r1;    3:       r3 = r10;
4: *(u64 *)(r10 - 24) = r1;    4:   l1: r3 += -8;
5: *(u64 *)(r10 - 32) = r1;    5:       *(u64 *)(r3 - 0) = r1;
                               6:       r2 += 1;
                               7:       if r2 != 4 goto l1; 

In essence, the verifier traces (probably it is more correct to say that it performs abstract interpretation of) each execution path in the program, and tries to trim traces that reach equivalent states. This logic makes loops a pain point. For program A the verifier would trace the path 1,2,3,4,5, while for program B it would trace the path 1,2,3,4,5,6,7,4,5,6,7,4,5,6,7,4,5,6,7, which is significantly longer. The bound at instruction B.7 does not necessarily have to be a constant; it could be a register with a value known to the verifier (e.g., if we add an instruction B.0 with r4 = 4, the instruction at B.7 could be replaced by if r2 != r4 goto l1 and the trace would be the same).
Hence, it is preferable, but not strictly necessary, to unroll the loops.

Added: there is also a certain budget for the maximal number of instructions that the verifier will trace before giving up.

@yonghong-song (Contributor) commented:

If the user has a call to memcpy that cannot simply be unrolled, the current behavior is to issue an error. For example:

$ cat test1.c
#include <stdint.h>
#include <string.h>
typedef struct {
    unsigned char x[8];
} buf_t;
void f(buf_t *buf, uint64_t y, uint64_t z) {
    if (z > 8) z = 8;
    unsigned char *y_bytes = (unsigned char *)&y;
    memcpy(buf->x, y_bytes, z);
}

With the current compiler (LLVM 18), we get:

/* I added gnu/stubs-32.h to the current directory to make compilation pass */
$ clang -O2 -target bpf -I . test1.c -c
test1.c:6:6: error: A call to built-in function 'memcpy' is not supported.
    6 | void f(buf_t *buf, uint64_t y, uint64_t z) {
      |      ^
1 error generated.

But with this patch, I see

0000000000000000 <f>:
       0:       7b 2a f8 ff 00 00 00 00 *(u64 *)(r10 - 0x8) = r2
       1:       b7 04 00 00 08 00 00 00 r4 = 0x8
       2:       bf 32 00 00 00 00 00 00 r2 = r3
       3:       2d 34 01 00 00 00 00 00 if r4 > r3 goto +0x1 <f+0x28>
       4:       b7 02 00 00 08 00 00 00 r2 = 0x8
       5:       15 02 0b 00 00 00 00 00 if r2 == 0x0 goto +0xb <f+0x88>
       6:       b7 02 00 00 00 00 00 00 r2 = 0x0
       7:       bf 15 00 00 00 00 00 00 r5 = r1
       8:       0f 25 00 00 00 00 00 00 r5 += r2
       9:       bf a0 00 00 00 00 00 00 r0 = r10
      10:       07 00 00 00 f8 ff ff ff r0 += -0x8
      11:       0f 20 00 00 00 00 00 00 r0 += r2
      12:       71 00 00 00 00 00 00 00 r0 = *(u8 *)(r0 + 0x0)
      13:       73 05 00 00 00 00 00 00 *(u8 *)(r5 + 0x0) = r0
      14:       07 02 00 00 01 00 00 00 r2 += 0x1
      15:       3d 32 01 00 00 00 00 00 if r2 >= r3 goto +0x1 <f+0x88>
      16:       2d 24 f6 ff 00 00 00 00 if r4 > r2 goto -0xa <f+0x38>
      17:       95 00 00 00 00 00 00 00 exit

Basically, memcpy is 'inlined' by the compiler.

This is probably not what we want for test1.c, since the memcpy call comes from the user. In such cases, we would like the user to write the loop explicitly for maximum performance (e.g., load/store with u64 and handle the remainder with u8; see the sketch below).
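For illustration, a minimal sketch of such a hand-written loop (the function name is made up, and the u64 accesses assume 8-byte loads/stores are legal for the given pointers):

#include <stddef.h>
#include <stdint.h>

/* Illustrative hand-written copy: bulk in 8-byte chunks, remainder in
 * single bytes. */
static void copy_bytes(unsigned char *dst, const unsigned char *src,
                       size_t n) {
  size_t i = 0;
  for (; i + sizeof(uint64_t) <= n; i += sizeof(uint64_t))
    *(uint64_t *)(dst + i) = *(const uint64_t *)(src + i);
  for (; i < n; i++)
    dst[i] = src[i];
}

Note that without something like the DisableLIRP change discussed below, the optimizer may recognize such a loop and turn it back into a memcpy call.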

Maybe we should somehow prevent memcpy generation if it cannot be fully unrolled later (meaning no loops)? We have some backend hooks at the IR level; maybe we can leverage them?

@eddyz87 (Contributor) commented Jul 5, 2024

@yonghong-song, the memcpy is inserted by LoopIdiomRecognizePass.
It uses the TargetLibraryInfo class to check whether memcpy is available (see LoopIdiomRecognize::runOnLoop).
After some inspection I found two ways to customize TargetLibraryInfo:

  • add function attributes of the form no-builtin-... (see the explicit TargetLibraryInfo::TargetLibraryInfo constructor);
  • adapt TargetLibraryInfo.cpp:initializeLibCalls().

Given that there are many builtins that we would probably like to disable, the second option seems better.
On the other hand, I don't know what impact this would have on programs that use memcpy/memmove/memset explicitly.
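For reference, the first option can be exercised from source via clang's no_builtin attribute, which lowers to a no-builtin-memcpy function attribute (a minimal, assumed example; the function itself is made up):

/* With the attribute, TargetLibraryInfo reports memcpy as unavailable
 * for this function, so LoopIdiomRecognize will not turn the loop into
 * a memcpy call. */
__attribute__((no_builtin("memcpy")))
void copy_fwd(char *dst, const char *src, unsigned long n) {
  for (unsigned long i = 0; i < n; i++)
    dst[i] = src[i];
}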

There is also the following switch in LoopIdiomRecognize.h, used by LoopIdiomRecognizePass:

/// Options to disable Loop Idiom Recognize, which can be shared with other
/// passes.
struct DisableLIRP {
  /// When true, the entire pass is disabled.
  static bool All;

  /// When true, Memset is disabled.
  static bool Memset;

  /// When true, Memcpy is disabled.
  static bool Memcpy;
};

But this would only solve the issue for the loop idiom pass.

@yonghong-song (Contributor) commented Jul 5, 2024

Thanks @eddyz87! I think the following change is good enough for now.

$ git diff
diff --git a/llvm/lib/Target/BPF/BPFTargetMachine.cpp b/llvm/lib/Target/BPF/BPFTargetMachine.cpp
index 7d91fa8bb824..005f6affaffd 100644
--- a/llvm/lib/Target/BPF/BPFTargetMachine.cpp
+++ b/llvm/lib/Target/BPF/BPFTargetMachine.cpp
@@ -29,6 +29,7 @@
 #include "llvm/Support/FormattedStream.h"
 #include "llvm/Target/TargetOptions.h"
 #include "llvm/Transforms/Scalar.h"
+#include "llvm/Transforms/Scalar/LoopIdiomRecognize.h"
 #include "llvm/Transforms/Scalar/SimplifyCFG.h"
 #include "llvm/Transforms/Utils/SimplifyCFGOptions.h"
 #include <optional>
@@ -76,6 +77,9 @@ BPFTargetMachine::BPFTargetMachine(const Target &T, const Triple &TT,
       Subtarget(TT, std::string(CPU), std::string(FS), *this) {
   initAsmInfo();
 
+  // Add some comments here.
+  DisableLIRP::All = true;
+
   BPFMCAsmInfo *MAI =
       static_cast<BPFMCAsmInfo *>(const_cast<MCAsmInfo *>(AsmInfo.get()));
   MAI->setDwarfUsesRelocationsAcrossSections(!Subtarget.getUseDwarfRIS());

Basically, LoopIdiomRecognizePass exposes DisableLIRP::All (covering memcpy/memset) so that other passes can disable this transformation. This should prevent the original source code from generating a library function call that would later be expanded into a loop again.

Note that BPF programs do not like library functions, since they prevent the verifier from doing its work.

@inclyc (Member, Author) commented Jul 6, 2024

Hi @yonghong-song,

Do you think it would be a good option to add some mem intrinsics to the kernel? Then we could combine and generate such intrinsics, and the verifier could easily recognize them (if each is just a BPF_CALL).

@yonghong-song (Contributor) commented:

Do you think it would be a good option to add some mem intrinsics to the kernel? Then we could combine and generate such intrinsics, and the verifier could easily recognize them (if each is just a BPF_CALL).

You probably mean adding some kernel kfuncs (similar to the intrinsic functions you mentioned, e.g., memcpy/memset). Currently there is no plan for this, as the verifier is able to handle fully unrolled code or actual loops pretty well. There is no special processing here.

Introducing a kfunc needs additional verifier changes, and there is no obvious benefit. We do not have a use case for memcpy/memset kfuncs yet.


Successfully merging this pull request may close this issue: bpf target tries to emit memcpy and fails to compile (#93700)