Description
When compiling CUDA code that is split into a header and an implementation file with relocatable device code enabled (-fgpu-rdc), I ran into a linking error:
nvlink error: Function A() not declared __global__ in all source files
nvlink fatal: merge_elf failed
The error can be reproduced with the following set-up:
// bug.cuh
__global__ void foo();
// bug.cu
__global__ void foo() {}
// main.cu
#include "bug.cuh"
int main() { foo<<<1, 1, 1>>>(); }
>clang++ bug.cu main.cu --offload-arch=sm_89 -fgpu-rdc --offload-new-driver -lcudart
@jhuber6's insight:
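The solution is for the PTX to declare the external kernel as .extern .entry instead of .extern .func. The device-only output for main.cu currently shows the latter: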
>clang++ -x cuda main.cu --offload-arch=sm_89 --offload-device-only -o - -S -fgpu-rdc
.version 8.5
.target sm_89
.address_size 64
.extern .func _Z3foov
()
;
.global .align 8 .u64 __clang_gpu_used_external[1] = {_Z3foov};
Workaround: enable -foffload-lto.
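For illustration, the workaround might look like the following. This is only a sketch: it assumes the flag is simply appended to the reproduction command from this report, and the combined invocation has not been verified here.

```shell
# Assumption: same inputs and flags as the failing command above,
# with -foffload-lto added so the device code is linked through LTO.
clang++ bug.cu main.cu --offload-arch=sm_89 -fgpu-rdc --offload-new-driver -foffload-lto -lcudart
```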