Description
I've encountered an issue with the square-root (sqrtf
) floating-point operation on NVPTX. It always compiles to the llvm.nvvm.sqrt.approx.f
intrinsic, even when -fno-approx-func
is specified. I have a hypothesis about the cause and would like to explore a potential fix. Any guidance would be appreciated! 😃
What's the Problem?
Here's the relevant code:
// cat main.cu
__device__ float f(float x) {
return sqrtf(x);
}
The function sqrtf
always compiles to the llvm.nvvm.sqrt.approx.f
intrinsic, regardless of -fno-approx-func
:
; clang -Wno-unknown-cuda-version -fno-approx-func main.cu -S -emit-llvm --cuda-device-only -o -
; ModuleID = 'main.cu'
source_filename = "main.cu"
target datalayout = "e-p6:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
@.str = private unnamed_addr constant [11 x i8] c"__CUDA_FTZ\00", align 1
@.str.2 = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1
; Function Attrs: convergent noinline nounwind optnone
define dso_local noundef float @_Z1ff(float noundef %x) #0 {
entry:
%__a.addr.i = alloca float, align 4
%x.addr = alloca float, align 4
store float %x, ptr %x.addr, align 4
%0 = load float, ptr %x.addr, align 4
store float %0, ptr %__a.addr.i, align 4
%1 = load float, ptr %__a.addr.i, align 4
%2 = call float @llvm.nvvm.sqrt.approx.f(float %1) #3
ret float %2
}
; Function Attrs: convergent nounwind
declare i32 @__nvvm_reflect(ptr) #1
; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare float @llvm.nvvm.sqrt.rn.ftz.f(float) #2
; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare float @llvm.nvvm.sqrt.approx.ftz.f(float) #2
; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare float @llvm.nvvm.sqrt.rn.f(float) #2
; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare float @llvm.nvvm.sqrt.approx.f(float) #2
attributes #0 = { convergent noinline nounwind optnone "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="sm_52" "target-features"="+ptx85,+sm_52" }
attributes #1 = { convergent nounwind "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="sm_52" "target-features"="+ptx85,+sm_52" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #2 = { nocallback nofree nosync nounwind willreturn memory(none) }
attributes #3 = { nounwind }
!llvm.module.flags = !{!0, !1, !2, !3}
!llvm.ident = !{!4, !5}
!nvvmir.version = !{!6}
!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 12, i32 6]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 4, !"nvvm-reflect-ftz", i32 0}
!3 = !{i32 7, !"frame-pointer", i32 2}
!4 = !{!"clang version 21.0.0git ([email protected]:Lai-YT/llvm-project.git e57cd100ca297cf81854e35cccbf703edddd4aad)"}
!5 = !{!"clang version 3.8.0 (tags/RELEASE_380/final)"}
!6 = !{i32 2, i32 0}
What's the Possible Cause?
As defined in __clang_cuda_math.h
, sqrtf
expands to the __nv_sqrtf
intrinsic:
I noticed that __nv_sqrtf
is defined by NVIDIA in libdevice.bc, so I grabbed the libdevice on my machine (which is libdevice.10.bc
) and disassembled the bitcode with llvm-dis:
@.str = private unnamed_addr constant [11 x i8] c"__CUDA_FTZ\00", align 1
@.str.2 = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1
...
define float @__nv_sqrtf(float %x) #0 {
%1 = call i32 @__nvvm_reflect(ptr @.str) #6
%2 = icmp ne i32 %1, 0
br i1 %2, label %3, label %10
3: ; preds = %0
%4 = call i32 @__nvvm_reflect(ptr @.str.2) #6
%5 = icmp ne i32 %4, 0
br i1 %5, label %6, label %8
6: ; preds = %3
%7 = call float @llvm.nvvm.sqrt.rn.ftz.f(float %x) #6
br label %__nvvm_sqrt_f.exit
8: ; preds = %3
%9 = call float @llvm.nvvm.sqrt.approx.ftz.f(float %x) #6
br label %__nvvm_sqrt_f.exit
10: ; preds = %0
%11 = call i32 @__nvvm_reflect(ptr @.str.2) #6
%12 = icmp ne i32 %11, 0
br i1 %12, label %13, label %15
13: ; preds = %10
%14 = call float @llvm.nvvm.sqrt.rn.f(float %x) #6
br label %__nvvm_sqrt_f.exit
15: ; preds = %10
%16 = call float @llvm.nvvm.sqrt.approx.f(float %x) #6
br label %__nvvm_sqrt_f.exit
__nvvm_sqrt_f.exit: ; preds = %6, %8, %13, %15
%.0 = phi float [ %7, %6 ], [ %9, %8 ], [ %14, %13 ], [ %16, %15 ]
ret float %.0
}
The selection of intrinsic respects the value of __nvvm_reflect("__CUDA_FTZ")
and __nvvm_reflect("__CUDA_PREC_SQRT")
. The NVVMReflectPass is the pass that picks them up and replaces the __nvvm_reflect(...)
expressions with appropriate integer values.
However, it only recognizes "__CUDA_FTZ"
and "__CUDA_ARCH"
, leaving others with the default value 0
. This is why __nvvm_reflect("__CUDA_PREC_SQRT")
always evaluates to 0
, leading to the selection of llvm.nvvm.sqrt.approx.f
:
llvm-project/llvm/lib/Target/NVPTX/NVVMReflect.cpp
Lines 168 to 178 in 2f808dd
I guess that the old libdevice doesn't rely on "__CUDA_PREC_SQRT"
, thus NVVMReflectPass doesn't try to handle it. (The last commit on this is 7 years ago. 😏)
Any Possible Solution?
Clang provides several options that affect the behavior of floating-point operations, and three of them are said to take effect in CUDA code. (Stated in Compiling CUDA with clang, Flags that control numerical code. Some of the flags are renamed after then.):
ffp-contract
fgpu-flush-denormals-to-zero
fgpu-approx-transcendentals
These flags don't seem to handle sqrt
. Maybe we can add a flag, e.g., -fgpu-approx-sqrt
, or respect the general -fapprox-func
flag? And set a metadata just like nvvm-reflect-ftz
for NVVMReflectPass to pick up. Or, look at the "approx-func-fp-math"="true"
metadata, which is already set when -fapprox-func
is given.
To add metadata, the following part is probably where we should place the logic:
llvm-project/clang/lib/CodeGen/CodeGenModule.cpp
Lines 1289 to 1296 in 1fbfef9
To sum up the issue, I believe the non-approximated square-root (llvm.nvvm.sqrt.rn.f
) should be used by default, or unless there should be an option for the user to control with. Just like NVCC provides the -prec-sqrt
flag, Clang could provide this as well.
If this is a missing puzzle, I would like to work on this; if it's already supported or not desired to be touched, please let me know!