[NVPTX] Implement variadic functions using IR lowering #96015


Merged: 1 commit merged into llvm:main on Jul 12, 2024

Conversation

@jhuber6 (Contributor) commented Jun 19, 2024

Summary:
This patch implements support for variadic functions for NVPTX targets.
The implementation here mainly follows what was done to implement it for
AMDGPU in #93362.

We change the NVPTX codegen to lower all variadic arguments to functions
by-value. This creates a flattened set of arguments that the IR lowering
pass converts into a struct with the proper alignment.

The behavior of this lowering was determined by iteratively checking
what the NVCC compiler generates for its output. See examples like
https://godbolt.org/z/KavfTGY93. I have noted the main methods that
NVIDIA uses to lower variadic functions.

  1. All arguments are passed in a pointer to aggregate.
  2. The minimum alignment for a plain argument is 4 bytes.
  3. Alignment is dictated by the underlying type.
  4. Structs are flattened and do not have their alignment changed.
  5. NVPTX never passes any arguments indirectly, even very large ones.

This patch passes the tests in the libc project currently, including
support for sprintf.
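
To make rules 1 through 5 concrete, here is a small hypothetical example (not taken from the patch) of how a call would be packed under this scheme:

extern void sink(int, ...);

struct pair {
  char c; // struct members keep their layout; the struct is flattened
  long l; // into the buffer without changing its alignment (rule 4)
};

void caller(void) {
  char c = 1;             // promoted to int, stored in a 4-byte slot (rule 2)
  double d = 2.0;         // 8-byte slot at its natural 8-byte alignment (rule 3)
  struct pair p = {1, 2};
  // After IR lowering, the variadic tail is packed into a single stack
  // buffer: a 4-byte int slot, an 8-byte double slot, then the flattened
  // struct fields at their natural alignment (rules 1, 3, 4). The callee
  // receives a pointer to this buffer; even large structs are copied in
  // rather than passed by reference (rule 5).
  sink(0, c, d, p);
}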

@llvmbot added the clang, clang:codegen, libc, backend:NVPTX, and llvm:transforms labels Jun 19, 2024
@llvmbot (Member) commented Jun 19, 2024

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-backend-nvptx
@llvm/pr-subscribers-libc
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-clang-codegen

Author: Joseph Huber (jhuber6)

Patch is 49.66 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/96015.diff

9 Files Affected:

  • (modified) clang/lib/CodeGen/Targets/NVPTX.cpp (+13-3)
  • (added) clang/test/CodeGen/variadic-nvptx.c (+77)
  • (modified) libc/config/gpu/entrypoints.txt (+4-11)
  • (modified) libc/test/src/__support/CMakeLists.txt (+9-12)
  • (modified) llvm/lib/Target/NVPTX/NVPTXPassRegistry.def (+2)
  • (modified) llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp (+2)
  • (modified) llvm/lib/Transforms/IPO/ExpandVariadics.cpp (+40-4)
  • (added) llvm/test/CodeGen/NVPTX/variadics-backend.ll (+427)
  • (added) llvm/test/CodeGen/NVPTX/variadics-lowering.ll (+348)
diff --git a/clang/lib/CodeGen/Targets/NVPTX.cpp b/clang/lib/CodeGen/Targets/NVPTX.cpp
index 423485c9ca16e..1a5205eb4dabc 100644
--- a/clang/lib/CodeGen/Targets/NVPTX.cpp
+++ b/clang/lib/CodeGen/Targets/NVPTX.cpp
@@ -203,8 +203,15 @@ ABIArgInfo NVPTXABIInfo::classifyArgumentType(QualType Ty) const {
 void NVPTXABIInfo::computeInfo(CGFunctionInfo &FI) const {
   if (!getCXXABI().classifyReturnType(FI))
     FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
-  for (auto &I : FI.arguments())
-    I.info = classifyArgumentType(I.type);
+
+  unsigned ArgumentsCount = 0;
+  for (auto &I : FI.arguments()) {
+    if (FI.isVariadic() && ArgumentsCount > 0)
+      I.info = ABIArgInfo::getDirect();
+    else
+      I.info = classifyArgumentType(I.type);
+    ++ArgumentsCount;
+  }
 
   // Always honor user-specified calling convention.
   if (FI.getCallingConvention() != llvm::CallingConv::C)
@@ -215,7 +222,10 @@ void NVPTXABIInfo::computeInfo(CGFunctionInfo &FI) const {
 
 RValue NVPTXABIInfo::EmitVAArg(CodeGenFunction &CGF, Address VAListAddr,
                                QualType Ty, AggValueSlot Slot) const {
-  llvm_unreachable("NVPTX does not support varargs");
+  return emitVoidPtrVAArg(CGF, VAListAddr, Ty, /*IsIndirect=*/false,
+                          getContext().getTypeInfoInChars(Ty),
+                          CharUnits::fromQuantity(4),
+                          /*AllowHigherAlign=*/true, Slot);
 }
 
 void NVPTXTargetCodeGenInfo::setTargetAttributes(
diff --git a/clang/test/CodeGen/variadic-nvptx.c b/clang/test/CodeGen/variadic-nvptx.c
new file mode 100644
index 0000000000000..b47a5d7a2670d
--- /dev/null
+++ b/clang/test/CodeGen/variadic-nvptx.c
@@ -0,0 +1,77 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 5
+// RUN: %clang_cc1 -triple nvptx64-nvidia-cuda -emit-llvm -o - %s | FileCheck %s
+
+extern void varargs_simple(int, ...);
+
+// CHECK-LABEL: define dso_local void @foo(
+// CHECK-SAME: ) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[C:%.*]] = alloca i8, align 1
+// CHECK-NEXT:    [[S:%.*]] = alloca i16, align 2
+// CHECK-NEXT:    [[I:%.*]] = alloca i32, align 4
+// CHECK-NEXT:    [[L:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[F:%.*]] = alloca float, align 4
+// CHECK-NEXT:    [[D:%.*]] = alloca double, align 8
+// CHECK-NEXT:    [[A:%.*]] = alloca [[STRUCT_ANON:%.*]], align 4
+// CHECK-NEXT:    [[V:%.*]] = alloca <4 x i32>, align 16
+// CHECK-NEXT:    store i8 1, ptr [[C]], align 1
+// CHECK-NEXT:    store i16 1, ptr [[S]], align 2
+// CHECK-NEXT:    store i32 1, ptr [[I]], align 4
+// CHECK-NEXT:    store i64 1, ptr [[L]], align 8
+// CHECK-NEXT:    store float 1.000000e+00, ptr [[F]], align 4
+// CHECK-NEXT:    store double 1.000000e+00, ptr [[D]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load i8, ptr [[C]], align 1
+// CHECK-NEXT:    [[CONV:%.*]] = sext i8 [[TMP0]] to i32
+// CHECK-NEXT:    [[TMP1:%.*]] = load i16, ptr [[S]], align 2
+// CHECK-NEXT:    [[CONV1:%.*]] = sext i16 [[TMP1]] to i32
+// CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr [[I]], align 4
+// CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[L]], align 8
+// CHECK-NEXT:    [[TMP4:%.*]] = load float, ptr [[F]], align 4
+// CHECK-NEXT:    [[CONV2:%.*]] = fpext float [[TMP4]] to double
+// CHECK-NEXT:    [[TMP5:%.*]] = load double, ptr [[D]], align 8
+// CHECK-NEXT:    call void (i32, ...) @varargs_simple(i32 noundef 0, i32 noundef [[CONV]], i32 noundef [[CONV1]], i32 noundef [[TMP2]], i64 noundef [[TMP3]], double noundef [[CONV2]], double noundef [[TMP5]])
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i64(ptr align 4 [[A]], ptr align 4 @__const.foo.a, i64 12, i1 false)
+// CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds [[STRUCT_ANON]], ptr [[A]], i32 0, i32 0
+// CHECK-NEXT:    [[TMP7:%.*]] = load i32, ptr [[TMP6]], align 4
+// CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds [[STRUCT_ANON]], ptr [[A]], i32 0, i32 1
+// CHECK-NEXT:    [[TMP9:%.*]] = load i8, ptr [[TMP8]], align 4
+// CHECK-NEXT:    [[TMP10:%.*]] = getelementptr inbounds [[STRUCT_ANON]], ptr [[A]], i32 0, i32 2
+// CHECK-NEXT:    [[TMP11:%.*]] = load i32, ptr [[TMP10]], align 4
+// CHECK-NEXT:    call void (i32, ...) @varargs_simple(i32 noundef 0, i32 [[TMP7]], i8 [[TMP9]], i32 [[TMP11]])
+// CHECK-NEXT:    store <4 x i32> <i32 1, i32 1, i32 1, i32 1>, ptr [[V]], align 16
+// CHECK-NEXT:    [[TMP12:%.*]] = load <4 x i32>, ptr [[V]], align 16
+// CHECK-NEXT:    call void (i32, ...) @varargs_simple(i32 noundef 0, <4 x i32> noundef [[TMP12]])
+// CHECK-NEXT:    ret void
+//
+void foo() {
+  char c = '\x1';
+  short s = 1;
+  int i = 1;
+  long l = 1;
+  float f = 1.f;
+  double d = 1.;
+  varargs_simple(0, c, s, i, l, f, d);
+
+  struct {int x; char c; int y;} a = {1, '\x1', 1};
+  varargs_simple(0, a);
+
+  typedef int __attribute__((ext_vector_type(4))) int4;
+  int4 v = {1, 1, 1, 1};
+  varargs_simple(0, v);
+}
+
+typedef struct {long x; long y;} S;
+extern void varargs_complex(S, ...);
+
+// CHECK-LABEL: define dso_local void @bar(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[S:%.*]] = alloca [[STRUCT_S:%.*]], align 8
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i64(ptr align 8 [[S]], ptr align 8 @__const.bar.s, i64 16, i1 false)
+// CHECK-NEXT:    call void (ptr, ...) @varargs_complex(ptr noundef byval([[STRUCT_S]]) align 8 [[S]], i32 noundef 1, i64 noundef 1, double noundef 1.000000e+00)
+// CHECK-NEXT:    ret void
+//
+void bar() {
+  S s = {1l, 1l};
+  varargs_complex(s, 1, 1l, 1.0);
+}
diff --git a/libc/config/gpu/entrypoints.txt b/libc/config/gpu/entrypoints.txt
index 2217a696fc5d1..a8bb5b37636cc 100644
--- a/libc/config/gpu/entrypoints.txt
+++ b/libc/config/gpu/entrypoints.txt
@@ -1,13 +1,3 @@
-if(LIBC_TARGET_ARCHITECTURE_IS_AMDGPU)
-  set(extra_entrypoints
-      # stdio.h entrypoints
-      libc.src.stdio.sprintf
-      libc.src.stdio.snprintf
-      libc.src.stdio.vsprintf
-      libc.src.stdio.vsnprintf
-  )
-endif()
-
 set(TARGET_LIBC_ENTRYPOINTS
     # assert.h entrypoints
     libc.src.assert.__assert_fail
@@ -185,7 +175,10 @@ set(TARGET_LIBC_ENTRYPOINTS
     libc.src.errno.errno
 
     # stdio.h entrypoints
-    ${extra_entrypoints}
+    libc.src.stdio.sprintf
+    libc.src.stdio.snprintf
+    libc.src.stdio.vsprintf
+    libc.src.stdio.vsnprintf
     libc.src.stdio.feof
     libc.src.stdio.ferror
     libc.src.stdio.fseek
diff --git a/libc/test/src/__support/CMakeLists.txt b/libc/test/src/__support/CMakeLists.txt
index d05377eca8a81..54c9a7de6709a 100644
--- a/libc/test/src/__support/CMakeLists.txt
+++ b/libc/test/src/__support/CMakeLists.txt
@@ -86,18 +86,15 @@ add_libc_test(
     libc.src.__support.uint128
 )
 
-# NVPTX does not support varargs currently.
-if(NOT LIBC_TARGET_ARCHITECTURE_IS_NVPTX)
-  add_libc_test(
-    arg_list_test
-    SUITE
-      libc-support-tests
-    SRCS
-      arg_list_test.cpp
-    DEPENDS
-      libc.src.__support.arg_list
-  )
-endif()
+add_libc_test(
+  arg_list_test
+  SUITE
+    libc-support-tests
+  SRCS
+    arg_list_test.cpp
+  DEPENDS
+    libc.src.__support.arg_list
+)
 
 if(NOT LIBC_TARGET_ARCHITECTURE_IS_NVPTX)
   add_libc_test(
diff --git a/llvm/lib/Target/NVPTX/NVPTXPassRegistry.def b/llvm/lib/Target/NVPTX/NVPTXPassRegistry.def
index 6ff15ab6f13c4..95c243d9acc09 100644
--- a/llvm/lib/Target/NVPTX/NVPTXPassRegistry.def
+++ b/llvm/lib/Target/NVPTX/NVPTXPassRegistry.def
@@ -17,6 +17,8 @@
 #define MODULE_PASS(NAME, CREATE_PASS)
 #endif
 MODULE_PASS("generic-to-nvvm", GenericToNVVMPass())
+MODULE_PASS("expand-variadics",
+            ExpandVariadicsPass(ExpandVariadicsMode::Lowering))
 MODULE_PASS("nvptx-lower-ctor-dtor", NVPTXCtorDtorLoweringPass())
 #undef MODULE_PASS
 
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
index b60a1d747af79..6ec171579b460 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
@@ -33,6 +33,7 @@
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Target/TargetOptions.h"
 #include "llvm/TargetParser/Triple.h"
+#include "llvm/Transforms/IPO/ExpandVariadics.h"
 #include "llvm/Transforms/Scalar.h"
 #include "llvm/Transforms/Scalar/GVN.h"
 #include "llvm/Transforms/Vectorize/LoadStoreVectorizer.h"
@@ -343,6 +344,7 @@ void NVPTXPassConfig::addIRPasses() {
   }
 
   addPass(createAtomicExpandLegacyPass());
+  addPass(createExpandVariadicsPass(ExpandVariadicsMode::Lowering));
   addPass(createNVPTXCtorDtorLoweringLegacyPass());
 
   // === LSR and other generic IR passes ===
diff --git a/llvm/lib/Transforms/IPO/ExpandVariadics.cpp b/llvm/lib/Transforms/IPO/ExpandVariadics.cpp
index d340bc041ccda..8737196cf1e1a 100644
--- a/llvm/lib/Transforms/IPO/ExpandVariadics.cpp
+++ b/llvm/lib/Transforms/IPO/ExpandVariadics.cpp
@@ -456,8 +456,8 @@ bool ExpandVariadics::runOnFunction(Module &M, IRBuilder<> &Builder,
   // Replace known calls to the variadic with calls to the va_list equivalent
   for (User *U : make_early_inc_range(VariadicWrapper->users())) {
     if (CallBase *CB = dyn_cast<CallBase>(U)) {
-      Value *calledOperand = CB->getCalledOperand();
-      if (VariadicWrapper == calledOperand)
+      Value *CalledOperand = CB->getCalledOperand();
+      if (VariadicWrapper == CalledOperand)
         Changed |=
             expandCall(M, Builder, CB, VariadicWrapper->getFunctionType(),
                        FixedArityReplacement);
@@ -938,6 +938,37 @@ struct Amdgpu final : public VariadicABIInfo {
   }
 };
 
+struct NVPTX final : public VariadicABIInfo {
+
+  bool enableForTarget() override { return true; }
+
+  bool vaListPassedInSSARegister() override { return true; }
+
+  Type *vaListType(LLVMContext &Ctx) override {
+    return PointerType::getUnqual(Ctx);
+  }
+
+  Type *vaListParameterType(Module &M) override {
+    return PointerType::getUnqual(M.getContext());
+  }
+
+  Value *initializeVaList(Module &M, LLVMContext &Ctx, IRBuilder<> &Builder,
+                          AllocaInst *, Value *Buffer) override {
+    return Builder.CreateAddrSpaceCast(Buffer, vaListParameterType(M));
+  }
+
+  VAArgSlotInfo slotInfo(const DataLayout &DL, Type *Parameter) override {
+    // NVPTX doesn't apply minimum alignment to types present in structs. Types
+    // with alignment less than four should be promoted by the compiler and will
+    // get the proper minimum alignment in those cases.
+    const unsigned MinAlign = 1;
+    Align A = DL.getABITypeAlign(Parameter);
+    if (A < MinAlign)
+      A = Align(MinAlign);
+    return {A, false};
+  }
+};
+
 struct Wasm final : public VariadicABIInfo {
 
   bool enableForTarget() override {
@@ -967,8 +998,8 @@ struct Wasm final : public VariadicABIInfo {
     if (A < MinAlign)
       A = Align(MinAlign);
 
-    if (auto s = dyn_cast<StructType>(Parameter)) {
-      if (s->getNumElements() > 1) {
+    if (auto *S = dyn_cast<StructType>(Parameter)) {
+      if (S->getNumElements() > 1) {
         return {DL.getABITypeAlign(PointerType::getUnqual(Ctx)), true};
       }
     }
@@ -988,6 +1019,11 @@ std::unique_ptr<VariadicABIInfo> VariadicABIInfo::create(const Triple &T) {
     return std::make_unique<Wasm>();
   }
 
+  case Triple::nvptx:
+  case Triple::nvptx64: {
+    return std::make_unique<NVPTX>();
+  }
+
   default:
     return {};
   }
diff --git a/llvm/test/CodeGen/NVPTX/variadics-backend.ll b/llvm/test/CodeGen/NVPTX/variadics-backend.ll
new file mode 100644
index 0000000000000..0e0c89d3e0214
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/variadics-backend.ll
@@ -0,0 +1,427 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=nvptx64-- -march=nvptx64 -mcpu=sm_52 -mattr=+ptx64 < %s | FileCheck %s --check-prefix=CHECK-PTX
+; RUN: %if ptxas %{ llc < %s -mtriple=nvptx64-- -march=nvptx64 -mcpu=sm_52 -mattr=+ptx64 | %ptxas-verify %}
+
+%struct.S1 = type { i32, i8, i64 }
+%struct.S2 = type { i64, i64 }
+
+@__const.bar.s1 = private unnamed_addr constant %struct.S1 { i32 1, i8 1, i64 1 }, align 8
+@__const.qux.s = private unnamed_addr constant %struct.S2 { i64 1, i64 1 }, align 8
+
+define dso_local i32 @variadics1(i32 noundef %first, ...) {
+; CHECK-PTX-LABEL: variadics1(
+; CHECK-PTX:       {
+; CHECK-PTX-NEXT:    .reg .b32 %r<11>;
+; CHECK-PTX-NEXT:    .reg .b64 %rd<11>;
+; CHECK-PTX-NEXT:    .reg .f64 %fd<7>;
+; CHECK-PTX-EMPTY:
+; CHECK-PTX-NEXT:  // %bb.0: // %entry
+; CHECK-PTX-NEXT:    ld.param.u32 %r1, [variadics1_param_0];
+; CHECK-PTX-NEXT:    ld.param.u64 %rd1, [variadics1_param_1];
+; CHECK-PTX-NEXT:    ld.u32 %r2, [%rd1];
+; CHECK-PTX-NEXT:    add.s32 %r3, %r1, %r2;
+; CHECK-PTX-NEXT:    ld.u32 %r4, [%rd1+4];
+; CHECK-PTX-NEXT:    add.s32 %r5, %r3, %r4;
+; CHECK-PTX-NEXT:    ld.u32 %r6, [%rd1+8];
+; CHECK-PTX-NEXT:    add.s32 %r7, %r5, %r6;
+; CHECK-PTX-NEXT:    add.s64 %rd2, %rd1, 19;
+; CHECK-PTX-NEXT:    and.b64 %rd3, %rd2, -8;
+; CHECK-PTX-NEXT:    ld.u64 %rd4, [%rd3];
+; CHECK-PTX-NEXT:    cvt.u64.u32 %rd5, %r7;
+; CHECK-PTX-NEXT:    add.s64 %rd6, %rd5, %rd4;
+; CHECK-PTX-NEXT:    cvt.u32.u64 %r8, %rd6;
+; CHECK-PTX-NEXT:    add.s64 %rd7, %rd3, 15;
+; CHECK-PTX-NEXT:    and.b64 %rd8, %rd7, -8;
+; CHECK-PTX-NEXT:    ld.f64 %fd1, [%rd8];
+; CHECK-PTX-NEXT:    cvt.rn.f64.s32 %fd2, %r8;
+; CHECK-PTX-NEXT:    add.rn.f64 %fd3, %fd2, %fd1;
+; CHECK-PTX-NEXT:    cvt.rzi.s32.f64 %r9, %fd3;
+; CHECK-PTX-NEXT:    add.s64 %rd9, %rd8, 15;
+; CHECK-PTX-NEXT:    and.b64 %rd10, %rd9, -8;
+; CHECK-PTX-NEXT:    ld.f64 %fd4, [%rd10];
+; CHECK-PTX-NEXT:    cvt.rn.f64.s32 %fd5, %r9;
+; CHECK-PTX-NEXT:    add.rn.f64 %fd6, %fd5, %fd4;
+; CHECK-PTX-NEXT:    cvt.rzi.s32.f64 %r10, %fd6;
+; CHECK-PTX-NEXT:    st.param.b32 [func_retval0+0], %r10;
+; CHECK-PTX-NEXT:    ret;
+entry:
+  %vlist = alloca ptr, align 8
+  call void @llvm.va_start.p0(ptr %vlist)
+  %argp.cur = load ptr, ptr %vlist, align 8
+  %argp.next = getelementptr inbounds i8, ptr %argp.cur, i64 4
+  store ptr %argp.next, ptr %vlist, align 8
+  %0 = load i32, ptr %argp.cur, align 4
+  %add = add nsw i32 %first, %0
+  %argp.cur1 = load ptr, ptr %vlist, align 8
+  %argp.next2 = getelementptr inbounds i8, ptr %argp.cur1, i64 4
+  store ptr %argp.next2, ptr %vlist, align 8
+  %1 = load i32, ptr %argp.cur1, align 4
+  %add3 = add nsw i32 %add, %1
+  %argp.cur4 = load ptr, ptr %vlist, align 8
+  %argp.next5 = getelementptr inbounds i8, ptr %argp.cur4, i64 4
+  store ptr %argp.next5, ptr %vlist, align 8
+  %2 = load i32, ptr %argp.cur4, align 4
+  %add6 = add nsw i32 %add3, %2
+  %argp.cur7 = load ptr, ptr %vlist, align 8
+  %3 = getelementptr inbounds i8, ptr %argp.cur7, i32 7
+  %argp.cur7.aligned = call ptr @llvm.ptrmask.p0.i64(ptr %3, i64 -8)
+  %argp.next8 = getelementptr inbounds i8, ptr %argp.cur7.aligned, i64 8
+  store ptr %argp.next8, ptr %vlist, align 8
+  %4 = load i64, ptr %argp.cur7.aligned, align 8
+  %conv = sext i32 %add6 to i64
+  %add9 = add nsw i64 %conv, %4
+  %conv10 = trunc i64 %add9 to i32
+  %argp.cur11 = load ptr, ptr %vlist, align 8
+  %5 = getelementptr inbounds i8, ptr %argp.cur11, i32 7
+  %argp.cur11.aligned = call ptr @llvm.ptrmask.p0.i64(ptr %5, i64 -8)
+  %argp.next12 = getelementptr inbounds i8, ptr %argp.cur11.aligned, i64 8
+  store ptr %argp.next12, ptr %vlist, align 8
+  %6 = load double, ptr %argp.cur11.aligned, align 8
+  %conv13 = sitofp i32 %conv10 to double
+  %add14 = fadd double %conv13, %6
+  %conv15 = fptosi double %add14 to i32
+  %argp.cur16 = load ptr, ptr %vlist, align 8
+  %7 = getelementptr inbounds i8, ptr %argp.cur16, i32 7
+  %argp.cur16.aligned = call ptr @llvm.ptrmask.p0.i64(ptr %7, i64 -8)
+  %argp.next17 = getelementptr inbounds i8, ptr %argp.cur16.aligned, i64 8
+  store ptr %argp.next17, ptr %vlist, align 8
+  %8 = load double, ptr %argp.cur16.aligned, align 8
+  %conv18 = sitofp i32 %conv15 to double
+  %add19 = fadd double %conv18, %8
+  %conv20 = fptosi double %add19 to i32
+  call void @llvm.va_end.p0(ptr %vlist)
+  ret i32 %conv20
+}
+
+declare void @llvm.va_start.p0(ptr)
+
+declare ptr @llvm.ptrmask.p0.i64(ptr, i64)
+
+declare void @llvm.va_end.p0(ptr)
+
+define dso_local i32 @foo() {
+; CHECK-PTX-LABEL: foo(
+; CHECK-PTX:       {
+; CHECK-PTX-NEXT:    .local .align 8 .b8 __local_depot1[40];
+; CHECK-PTX-NEXT:    .reg .b64 %SP;
+; CHECK-PTX-NEXT:    .reg .b64 %SPL;
+; CHECK-PTX-NEXT:    .reg .b32 %r<4>;
+; CHECK-PTX-NEXT:    .reg .b64 %rd<5>;
+; CHECK-PTX-EMPTY:
+; CHECK-PTX-NEXT:  // %bb.0: // %entry
+; CHECK-PTX-NEXT:    mov.u64 %SPL, __local_depot1;
+; CHECK-PTX-NEXT:    cvta.local.u64 %SP, %SPL;
+; CHECK-PTX-NEXT:    mov.u64 %rd1, 4294967297;
+; CHECK-PTX-NEXT:    st.u64 [%SP+0], %rd1;
+; CHECK-PTX-NEXT:    mov.b32 %r1, 1;
+; CHECK-PTX-NEXT:    st.u32 [%SP+8], %r1;
+; CHECK-PTX-NEXT:    mov.u64 %rd2, 1;
+; CHECK-PTX-NEXT:    st.u64 [%SP+16], %rd2;
+; CHECK-PTX-NEXT:    mov.u64 %rd3, 4607182418800017408;
+; CHECK-PTX-NEXT:    st.u64 [%SP+24], %rd3;
+; CHECK-PTX-NEXT:    st.u64 [%SP+32], %rd3;
+; CHECK-PTX-NEXT:    add.u64 %rd4, %SP, 0;
+; CHECK-PTX-NEXT:    { // callseq 0, 0
+; CHECK-PTX-NEXT:    .param .b32 param0;
+; CHECK-PTX-NEXT:    st.param.b32 [param0+0], 1;
+; CHECK-PTX-NEXT:    .param .b64 param1;
+; CHECK-PTX-NEXT:    st.param.b64 [param1+0], %rd4;
+; CHECK-PTX-NEXT:    .param .b32 retval0;
+; CHECK-PTX-NEXT:    call.uni (retval0),
+; CHECK-PTX-NEXT:    variadics1,
+; CHECK-PTX-NEXT:    (
+; CHECK-PTX-NEXT:    param0,
+; CHECK-PTX-NEXT:    param1
+; CHECK-PTX-NEXT:    );
+; CHECK-PTX-NEXT:    ld.param.b32 %r2, [retval0+0];
+; CHECK-PTX-NEXT:    } // callseq 0
+; CHECK-PTX-NEXT:    st.param.b32 [func_retval0+0], %r2;
+; CHECK-PTX-NEXT:    ret;
+entry:
+  %conv = sext i8 1 to i32
+  %conv1 = sext i16 1 to i32
+  %conv2 = fpext float 1.000000e+00 to double
+  %call = call i32 (i32, ...) @variadics1(i32 noundef 1, i32 noundef %conv, i32 noundef %conv1, i32 noundef 1, i64 noundef 1, double noundef %conv2, double noundef 1.000000e+00)
+  ret i32 %call
+}
+
+define dso_local i32 @variadics2(i32 noundef %first, ...) {
+; CHECK-PTX-LABEL: variadics2(
+; CHECK-PTX:       {
+; CHECK-PTX-NEXT:    .local .align 2 .b8 __local_depot2[4];
+; CHECK-PTX-NEXT:    .reg .b64 %SP;
+; CHECK-PTX-NEXT:    .reg .b64 %SPL;
+; CHECK-PTX-NEXT:    .reg .b16 %rs<6>;
+; CHECK-PTX-NEXT:    .reg .b32 %r<7>;
+; CHECK-PTX-NEXT:    .reg .b64 %rd<11>;
+; CHECK-PTX-EMPTY:
+; CHECK-PTX-NEXT:  // %bb.0: // %entry
+; CHECK-PTX-NEXT:    mov.u64 %SPL, __local_depot2;
+; CHECK-PTX-NEXT:    cvta.local.u64 %SP, %SPL;
+; CHECK-PTX-NEXT:    ld.param.u32 %r1, [variadics2_param_0];
+; CHECK-PTX-NEXT:    ld.param.u64 %rd1, [variadics2_param_1];
+; CHECK-PTX-NEXT:    add.s64 %rd2, %rd1, 7;
+; CHECK-PTX-NEXT:    and.b64 %rd3, %rd2, -8;
+; CHECK-PTX-NEXT:    ld.u32 %r2, [%rd3];
+; CHECK-PTX-NEXT:    or.b64 %rd4, %rd3, 4;
+; CHECK-PTX-NEXT:    ld.s8 %r3, [%rd4];
+; CHECK-PTX-NEXT:    or.b64 %rd5, %rd3, 5;
+; CHECK-PTX-NEXT:    or.b64 %rd6, %rd3, 7;
+; CHECK-PTX-NEXT:    ld.u8 %rs1, [%rd6];
+; CHECK-PTX-NEXT:    st.u8 [%SP+2], %rs1;
+; CHECK-PTX-NEXT:    ld.u8 %rs2, [%rd5];
+; CHECK-PTX-NEXT:    or.b64 %rd7, %rd3, 6;
+; CHECK-PTX-NEXT:    ld.u8 %rs3, [%rd7];
+; CHECK-PTX-NEXT:    shl.b16 %rs4, %rs3, 8;
+; CHECK-PTX-NEXT:    or.b16 %rs5, %rs4, %rs2;
+; CHECK-PTX-NEXT:    st.u16 [%SP+0], %rs5;
+; CHECK-PTX-NEXT:    ld.u64 %rd8, [%rd3+8];
+; CHECK-PTX-NEXT:    add.s32 %r4, %r1, %r2;
+; CHECK-PTX-NEXT:    add.s32 %r5, %r4, %r3;
+; CHECK-PTX-NEXT:    cvt.u64.u32 %rd9, %r5;
+; CHECK-PTX-NEXT:    add.s64 %rd10, %rd9, %rd8;
+; CHECK-PTX-NEXT:    cvt.u32.u64 %r6, %rd10;
+; CHECK-PTX-NEXT:    st.param.b32 [func_retval0+0], %r6;
+; CHECK-PTX-NEXT:    ret;
+entry:
+  %vlist = alloca ptr, align 8
+  %s1.sroa.3 = alloca [3 x i8], align 1
+  call void @llvm.va_start.p0(ptr %vlist)
+  %argp.cur = load ptr, ptr %vlist, align 8
+  %0 = getelementptr inbounds i8, ptr %argp.cur, i32 7
+  %argp.cur.aligned = call ptr @llvm.ptrmask.p0.i64(ptr %0, i64 -8)
+  %argp.next = getelementptr inbounds i8, ptr %argp.cur.aligned, i64 16
+  store ptr %argp.next, ptr %vlist, align 8
+  %s1.sroa.0.0.copyload = load i32, ptr %argp.cur.aligned, align 8
+  %s1.sroa.2.0.argp.cur.aligned.sroa_idx = getelementptr inbounds i8, ptr %argp.cur.aligned, i64 4
+  %s1.sroa.2.0.copyload = load i8, ptr %s1.sroa.2.0.argp.cur.aligned.sroa_idx, align 4
+  %s1.sroa.3.0.argp.cur.aligned.sroa_idx = getelementptr inbounds i8, ptr %argp.cur.aligned, i64 5
+  call void @llvm.memcpy...
[truncated]

@@ -17,6 +17,8 @@
#define MODULE_PASS(NAME, CREATE_PASS)
#endif
MODULE_PASS("generic-to-nvvm", GenericToNVVMPass())
MODULE_PASS("expand-variadics",
Collaborator:

This shouldn't be necessary, I think. I don't recall whether I removed it from the amdgpu one before commit.

@jhuber6 (Contributor Author):

Couldn't remember if adding it to addIRPasses applied to all uses. I remember something like different pipeline configurations using different things. I'll try to figure it out.

@JonChesterfield (Collaborator) left a comment:

With the possible exception of some alignment handling, this looks about as I'd expect it to. Ideally we'd get some feedback from nvptx-associated people, but fixing libc is a good sign.

// NVPTX doesn't apply minimum alignment to types present in structs. Types
// with alignment less than four should be promoted by the compiler and will
// get the proper minimum alignment in those cases.
const unsigned MinAlign = 1;
Collaborator:

I don't think this does what the comment says it should. Also, some half types might be passed with 2-byte alignment, while others might be passed promoted to double.

@jhuber6 (Contributor Author):

So, the standard varargs handling will automatically promote things like shorts to ints and floats to doubles. What the comment means is that clang has already handled the size and alignment in those cases, so we need a minimum alignment of 1 in order to respect the alignment of the things clang didn't modify.
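
For example (standard C default argument promotions, shown only to illustrate the comment above; this is not code from the patch):

void callee(int, ...);

void caller(void) {
  char c = 1;    // promoted to int before the call: 4-byte slot, 4-byte align
  short s = 2;   // promoted to int as well
  float f = 1.f; // promoted to double: 8-byte slot, 8-byte align
  callee(0, c, s, f);
  // Anything clang does not promote (e.g. a struct with alignment < 4)
  // reaches the ExpandVariadics pass unchanged, so MinAlign = 1 simply
  // preserves its natural ABI alignment instead of inflating it.
}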

@jhuber6 (Contributor Author) commented Jun 19, 2024

With the possible exception of some alignment handling, this looks about as I'd expect it to. Ideally we'd get some feedback from nvptx-associated people, but fixing libc is a good sign.

Yep, I believe @Artem-B is on vacation, so hopefully @AlexMaclean can chime in. This should be ABI compatible with NVIDIA as far as I'm aware.

@jhuber6 jhuber6 force-pushed the NVPTXVarargs branch 2 times, most recently from a05b24a to 0cae8db on June 19, 2024 at 19:10
@llvmbot added the clang:frontend label Jun 19, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jun 22, 2024
Summary:
This patch implements the `printf` family of functions on the GPU using
the new variadic support. This patch adapts the old handling in the
`rpc_fprintf` placeholder, but adds an extra RPC call to get the size of
the buffer to copy. This prevents the GPU from needing to parse the
string. While it's theoretically possible for the pass to know the size
of the struct, it's prohibitively difficult to do while maintaining ABI
compatibility with NVIDIA's varargs.

Depends on llvm#96015.
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jun 25, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jun 25, 2024
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jul 1, 2024
@jhuber6 (Contributor Author) commented Jul 12, 2024

Ping, waiting on this so I can land #96369.

@jlebar (Member) commented Jul 12, 2024

@Artem-B

@Artem-B (Member) left a comment:

LGTM with a minor nit.

@jhuber6 jhuber6 merged commit 486d00e into llvm:main Jul 12, 2024
4 of 6 checks passed
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Jul 12, 2024
jhuber6 added a commit that referenced this pull request Jul 13, 2024
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
@tahonermann (Contributor) commented:

The change to NVPTXTargetInfo::getBuiltinVaListKind() in clang/lib/Basic/Targets/NVPTX.h caused a regression in Intel's downstream Clang-based compiler when compiling SYCL code with device compilation targeting NVPTX. CUDA might be similarly impacted, but I haven't verified. There is no need to revert the change; we've backed out the relevant part of the change in our downstream fork for now. But we need to figure out a proper solution for the problem as described below.

SYCL compilation uses the host standard library headers for both host and device compilation. Microsoft's standard library headers define va_list as char* as shown at line 72 below:

...\Microsoft Visual Studio\2019\Professional\VC\Tools\MSVC\14.29.30133\include\vadefs.h:
 67 #ifndef _VA_LIST_DEFINED
 68    #define _VA_LIST_DEFINED
 69    #ifdef _M_CEE_PURE
 70        typedef System::ArgIterator va_list;
 71    #else
 72        typedef char* va_list;
 73    #endif
 74 #endif
 ..
 97     void  __cdecl __va_start(va_list*, ...);
 98     void* __cdecl __va_arg(va_list*, ...);
 99     void  __cdecl __va_end(va_list*);
100 
101     #define __crt_va_start_a(ap, v) ((void)(__va_start(&ap, _ADDRESSOF(v), _SLOTSIZEOF(v), __alignof(v), _ADDRESSOF(v))))
102     #define __crt_va_arg(ap, t)     (*(t *)__va_arg(&ap, _SLOTSIZEOF(t), _APALIGN(t,ap), (t*)0))
103     #define __crt_va_end(ap)        ((void)(__va_end(&ap)))

The Clang driver interposes on Microsoft's vadefs.h header file by placing its own vadefs.h header file earlier in the header search path. This header overrides some of the macros defined by Microsoft's vadefs.h header file with ones that are intended to work with Clang's builtin variadic function support.

<llvm-project>/clang/lib/Headers/vadefs.h:
 18 #include_next <vadefs.h>
 19
 20 /* Override macros from vadefs.h with definitions that work with Clang. */
 ..
 34 /* VS 2015 switched to double underscore names, which is an improvement, but now
 35  * we have to intercept those names too.
 36  */
 37 #ifdef __crt_va_start
 38 #undef __crt_va_start
 39 #define __crt_va_start(ap, param) __builtin_va_start(ap, param)
 40 #endif
 41 #ifdef __crt_va_end
 42 #undef __crt_va_end
 43 #define __crt_va_end(ap)          __builtin_va_end(ap)
 44 #endif
 45 #ifdef __crt_va_arg
 46 #undef __crt_va_arg
 47 #define __crt_va_arg(ap, type)    __builtin_va_arg(ap, type)
 48 #endif

The result is that invocations of the __crt_va_start, __crt_va_end, and __crt_va_arg macros in Microsoft's standard library headers end up passing objects of the Microsoft-defined va_list type (aka char*) to builtins like __builtin_va_start() that expect objects of type __builtin_va_list (aka void*), which leads to compilation failures when compiling in C++ modes.
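
A minimal sketch of the failure mode, assuming the MSVC headers above and NVPTX device compilation (hypothetical repro, not taken from the report):

#include <vadefs.h>

void f(int first, ...) {
  va_list ap;                // char* per Microsoft's vadefs.h
  __crt_va_start(ap, first); // expands to __builtin_va_start(ap, first),
                             // which expects __builtin_va_list (now void*);
                             // the char*/void* mismatch is ill-formed in C++
  __crt_va_end(ap);
}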

There are at least a few options available to try to address this.

  1. Revert the change to NVPTXTargetInfo::getBuiltinVaListKind() so that TargetInfo::CharPtrBuiltinVaList is returned. I don't know what motivated that particular change, so I don't know how feasible this option is.
  2. Modify NVPTXTargetInfo::getBuiltinVaListKind() to conditionally return a value based on which standard library is being used. I'm not sure how feasible this is either. There are currently two targets, Hexagon and ARM, that conditionally return different values from their getBuiltinVaListKind() implementations. A sketch of this approach follows the list.
  3. Modify Clang's vadefs.h header file to also interpose on Microsoft's definition of va_list with a change like the following. This could still lead to problems if _VA_LIST_DEFINED is already defined, if va_list is already declared (as char*), or in code that assumes that va_list is defined as char* in Microsoft environments.
 +  #ifndef _VA_LIST_DEFINED
 +      #define _VA_LIST_DEFINED
 +      typedef __builtin_va_list va_list;
 +  #endif
 18 #include_next <vadefs.h>
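
As referenced in option 2, here is a minimal sketch, loosely modeled on how ARM and Hexagon vary their result; the isOSWindows() predicate is only an assumption about what the real condition would need to be (SYCL device compiles would likely have to consult the host/aux triple instead):

// In clang/lib/Basic/Targets/NVPTX.h (hypothetical):
BuiltinVaListKind getBuiltinVaListKind() const override {
  // Match the MSVC headers' char* va_list when the host environment is
  // Windows; keep the void* kind introduced by this patch elsewhere.
  return getTriple().isOSWindows() ? TargetInfo::CharPtrBuiltinVaList
                                   : TargetInfo::VoidPtrBuiltinVaList;
}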

Any thoughts on the above are much appreciated!

@jhuber6 (Contributor Author) commented Jul 24, 2024

The change to NVPTXTargetInfo::getBuiltinVaListKind() in clang/lib/Basic/Targets/NVPTX.h caused a regression in Intel's downstream Clang-based compiler when compiling SYCL code with device compilation targeting NVPTX. CUDA might be similarly impacted, but I haven't verified. There is no need to revert the change; we've backed out the relevant part of the change in our downstream fork for now. But we need to figure out a proper solution for the problem as described below.

Is it just the char * vs void *? We can probably change that back; in fact, I thought I did that in the patch but might've forgotten to push the update.

@tahonermann (Contributor):

Is it just the char * vs void *? We can probably change that back; in fact, I thought I did that in the patch but might've forgotten to push the update.

Yes. The local workaround we're currently using just unconditionally returns CharPtrBuiltinVaList.

@tahonermann (Contributor):

For historical reference, @jhuber6 backed out the change to NVPTXTargetInfo::getBuiltinVaListKind() in #100438 (commit fb1e077).
