Skip to content

[NVPTX] sqrtf always approximated with llvm.nvvm.sqrt.approx.f #131749

Open
@Lai-YT

Description

@Lai-YT

I've encountered an issue with the square-root (sqrtf) floating-point operation on NVPTX. It always compiles to the llvm.nvvm.sqrt.approx.f intrinsic, even when -fno-approx-func is specified. I have a hypothesis about the cause and would like to explore a potential fix. Any guidance would be appreciated! 😃


What's the Problem?

Here's the relevant code:

// cat main.cu
__device__ float f(float x) {
  return sqrtf(x);
}

The function sqrtf always compiles to the llvm.nvvm.sqrt.approx.f intrinsic, regardless of -fno-approx-func:

; clang -Wno-unknown-cuda-version -fno-approx-func main.cu -S -emit-llvm --cuda-device-only -o -
; ModuleID = 'main.cu'
source_filename = "main.cu"
target datalayout = "e-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

@.str = private unnamed_addr constant [11 x i8] c"__CUDA_FTZ\00", align 1
@.str.2 = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1

; Function Attrs: convergent noinline nounwind optnone
define dso_local noundef float @_Z1ff(float noundef %x) #0 {
entry:
  %__a.addr.i = alloca float, align 4
  %x.addr = alloca float, align 4
  store float %x, ptr %x.addr, align 4
  %0 = load float, ptr %x.addr, align 4
  store float %0, ptr %__a.addr.i, align 4
  %1 = load float, ptr %__a.addr.i, align 4
  %2 = call float @llvm.nvvm.sqrt.approx.f(float %1) #3
  ret float %2
}

; Function Attrs: convergent nounwind
declare i32 @__nvvm_reflect(ptr) #1

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare float @llvm.nvvm.sqrt.rn.ftz.f(float) #2

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare float @llvm.nvvm.sqrt.approx.ftz.f(float) #2

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare float @llvm.nvvm.sqrt.rn.f(float) #2

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare float @llvm.nvvm.sqrt.approx.f(float) #2

attributes #0 = { convergent noinline nounwind optnone "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="sm_52" "target-features"="+ptx85,+sm_52" }
attributes #1 = { convergent nounwind "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="sm_52" "target-features"="+ptx85,+sm_52" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #2 = { nocallback nofree nosync nounwind willreturn memory(none) }
attributes #3 = { nounwind }

!llvm.module.flags = !{!0, !1, !2, !3}
!llvm.ident = !{!4, !5}
!nvvmir.version = !{!6}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 12, i32 6]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 4, !"nvvm-reflect-ftz", i32 0}
!3 = !{i32 7, !"frame-pointer", i32 2}
!4 = !{!"clang version 21.0.0git ([email protected]:Lai-YT/llvm-project.git e57cd100ca297cf81854e35cccbf703edddd4aad)"}
!5 = !{!"clang version 3.8.0 (tags/RELEASE_380/final)"}
!6 = !{i32 2, i32 0}

What's the Possible Cause?

As defined in __clang_cuda_math.h, sqrtf expands to the __nv_sqrtf intrinsic:

__DEVICE__ float sqrtf(float __a) { return __nv_sqrtf(__a); }

I noticed that __nv_sqrtf is defined by NVIDIA in libdevice.bc, so I grabbed the libdevice on my machine (which is libdevice.10.bc) and disassembled the bitcode with llvm-dis:

@.str = private unnamed_addr constant [11 x i8] c"__CUDA_FTZ\00", align 1
@.str.2 = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1

...

define float @__nv_sqrtf(float %x) #0 {
  %1 = call i32 @__nvvm_reflect(ptr @.str) #6
  %2 = icmp ne i32 %1, 0
  br i1 %2, label %3, label %10

3:                                                ; preds = %0
  %4 = call i32 @__nvvm_reflect(ptr @.str.2) #6
  %5 = icmp ne i32 %4, 0
  br i1 %5, label %6, label %8

6:                                                ; preds = %3
  %7 = call float @llvm.nvvm.sqrt.rn.ftz.f(float %x) #6
  br label %__nvvm_sqrt_f.exit

8:                                                ; preds = %3
  %9 = call float @llvm.nvvm.sqrt.approx.ftz.f(float %x) #6
  br label %__nvvm_sqrt_f.exit

10:                                               ; preds = %0
  %11 = call i32 @__nvvm_reflect(ptr @.str.2) #6
  %12 = icmp ne i32 %11, 0
  br i1 %12, label %13, label %15

13:                                               ; preds = %10
  %14 = call float @llvm.nvvm.sqrt.rn.f(float %x) #6
  br label %__nvvm_sqrt_f.exit

15:                                               ; preds = %10
  %16 = call float @llvm.nvvm.sqrt.approx.f(float %x) #6
  br label %__nvvm_sqrt_f.exit

__nvvm_sqrt_f.exit:                               ; preds = %6, %8, %13, %15
  %.0 = phi float [ %7, %6 ], [ %9, %8 ], [ %14, %13 ], [ %16, %15 ]
  ret float %.0
}

The selection of intrinsic respects the value of __nvvm_reflect("__CUDA_FTZ") and __nvvm_reflect("__CUDA_PREC_SQRT"). The NVVMReflectPass is the pass that picks them up and replaces the __nvvm_reflect(...) expressions with appropriate integer values.
However, it only recognizes "__CUDA_FTZ" and "__CUDA_ARCH", leaving others with the default value 0. This is why __nvvm_reflect("__CUDA_PREC_SQRT") always evaluates to 0, leading to the selection of llvm.nvvm.sqrt.approx.f:

int ReflectVal = 0; // The default value is 0
if (ReflectArg == "__CUDA_FTZ") {
// Try to pull __CUDA_FTZ from the nvvm-reflect-ftz module flag. Our
// choice here must be kept in sync with AutoUpgrade, which uses the same
// technique to detect whether ftz is enabled.
if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
F.getParent()->getModuleFlag("nvvm-reflect-ftz")))
ReflectVal = Flag->getSExtValue();
} else if (ReflectArg == "__CUDA_ARCH") {
ReflectVal = SmVersion * 10;
}

I guess that the old libdevice doesn't rely on "__CUDA_PREC_SQRT", thus NVVMReflectPass doesn't try to handle it. (The last commit on this is 7 years ago. 😏)

Any Possible Solution?

Clang provides several options that affect the behavior of floating-point operations, and three of them are said to take effect in CUDA code. (Stated in Compiling CUDA with clang, Flags that control numerical code. Some of the flags are renamed after then.):

  • ffp-contract
  • fgpu-flush-denormals-to-zero
  • fgpu-approx-transcendentals

These flags don't seem to handle sqrt. Maybe we can add a flag, e.g., -fgpu-approx-sqrt, or respect the general -fapprox-func flag? And set a metadata just like nvvm-reflect-ftz for NVVMReflectPass to pick up. Or, look at the "approx-func-fp-math"="true" metadata, which is already set when -fapprox-func is given.

To add metadata, the following part is probably where we should place the logic:

if (LangOpts.CUDAIsDevice && getTriple().isNVPTX()) {
// Indicate whether __nvvm_reflect should be configured to flush denormal
// floating point values to 0. (This corresponds to its "__CUDA_FTZ"
// property.)
getModule().addModuleFlag(llvm::Module::Override, "nvvm-reflect-ftz",
CodeGenOpts.FP32DenormalMode.Output !=
llvm::DenormalMode::IEEE);
}


To sum up the issue, I believe the non-approximated square-root (llvm.nvvm.sqrt.rn.f) should be used by default, or unless there should be an option for the user to control with. Just like NVCC provides the -prec-sqrt flag, Clang could provide this as well.

If this is a missing puzzle, I would like to work on this; if it's already supported or not desired to be touched, please let me know!

Metadata

Metadata

Assignees

No one assigned

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions