Description
Description
This issue was initially found during discussion of #64487. However, this should be handled as a separate issue, especially since the other one is closed now. This issue doesn't affect me directly, but there are users which might run into this issue, especially in mixed GPU systems.
When a user utilizes multiple architectures to offload to in his program (for example -fopenmp-targets=x86_64,nvptx64
, the OMPT interface will receive callbacks for ompt_callback_device_initialize
for each device used. While this works fine, we may encounter a case where the transferred device number for two different architectures overlap, causing issues for tools to differentiate the devices.
Users can select their device by the device number, for example
int main( void )
{
// x86_64 (four devices)
#pragma omp target device(0)
{...}
// nvptx64 (other devices)
#pragma omp target device(4)
{...}
}
This would translate to the following in the OMPT interface
Callback Init: device_num=0 type=generic-64bit
[...]
Callback Init: device_num=0 type=sm_80
Reproducer
This small code should show the issue. I have tested it with with nvptx64 + x86_64
and nvptx64 + amdgcn-amd-amdhsa
, but it should also affect other combinations
#include <assert.h>
#include <stdio.h>
#include <omp-tools.h>
void callback_ompt_device_initialize(int device_num,
const char *type,
ompt_device_t *device,
ompt_function_lookup_t lookup,
const char *documentation)
{
printf("[%s] device_num = %d | type = %s\n", __FUNCTION__, device_num, type);
}
static int
initialize_tool( ompt_function_lookup_t lookup,
int initialDeviceNum,
ompt_data_t* toolData )
{
ompt_set_callback_t set_callback =
( ompt_set_callback_t )lookup( "ompt_set_callback" );
assert( set_callback != 0 );
ompt_set_result_t registration_result = set_callback(ompt_callback_device_initialize, (ompt_callback_t) &callback_ompt_device_initialize);
assert(registration_result == ompt_set_always);
return 1;
}
static void
finalize_tool( ompt_data_t* toolData )
{
}
ompt_start_tool_result_t*
ompt_start_tool( unsigned int omp_version, /* == _OPENMP */
const char* runtime_version )
{
static ompt_start_tool_result_t tool = { &initialize_tool,
&finalize_tool,
ompt_data_none };
return &tool;
}
int main(void) {
#pragma omp target device(0)
{}
#pragma omp target device(4)
{}
}
Running the tool on a system with a single NVIDIA MX550, we can see the following output:
$ clang --version
clang version 18.0.0 (https://github.com/llvm/llvm-project.git 4b383107fa7585bb5ecd7f03cab7800b33d1585a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/software/software/LLVM/git/bin
$ clang -fopenmp -fopenmp-targets=x86_64,nvptx64 reproducer.c
$ ./a.out
[callback_ompt_device_initialize] device_num = 0 | type = generic-64bit
[callback_ompt_device_initialize] device_num = 0 | type = sm_75
Previous discussion
Here are a few related comments by @mhalk and me regarding this issue:
Let's say we would go with "added support": I added prints for
omp_get_device_num
andomp_get_num_devices
. Also forced the target regions to use device number 3 (and 11, for the two-target-version).Output for
x86_64
only:trunk/bin/clang -fopenmp -fopenmp-targets=x86_64 veccopy-ompt-target.c -o veccopy-ompt-target ./veccopy-ompt-target omp_get_device_num=4 omp_get_num_devices=4 Callback Init: device_num=3 type=unknown device=0x1485000 lookup=0x7f0bd39c8a90 doc=(nil) Callback Load: device_num:3 module_id:0 filename:(null) host_adddr:0x200368 device_addr:(nil) bytes:17112 [...] Callback Fini: device_num=3
Output for
x86_64
andamdgcn-amd-amdhsa
(8 gfx90a GPUs present):trunk/bin/clang -fopenmp -fopenmp-targets=x86_64,amdgcn-amd-amdhsa veccopy-ompt-target.c -o veccopy-ompt-target ./veccopy-ompt-target omp_get_device_num=12 omp_get_num_devices=12 Callback Init: device_num=3 type=gfx90a device=0x12ae980 lookup=0x7f441a241a90 doc=(nil) Callback Load: device_num:3 module_id:0 filename:(null) host_adddr:0x200378 device_addr:(nil) bytes:19208 [...] Callback Init: device_num=3 type=unknown device=0x12b52a0 lookup=0x7f441a241a90 doc=(nil) Callback Load: device_num:3 module_id:0 filename:(null) host_adddr:0x204f00 device_addr:(nil) bytes:17112 [...] Callback Fini: device_num=3 Callback Fini: device_num=3
So, in the sources, I intentionally used devices 3 and 11 for target regions. Which both correlate to device number 3 in their respective RTL. That is, to illustrate that there might be ambiguous output w.r.t.
Fini
callbacks.
Judging by your output for both
x86_64
and a combination ofx86_64
+amdgcn
, I would expect thatnvptx64
+amdgcn
yields the same results shown in your second example. I think that I can test this. For tool developers, the ambiguous output can be hard to work with, since we can only use thedevice_num
to identify the executing device during target callbacks. The runtime should probably dispatch the same device number users can use to define the executing device (here3
for the AMD GPU and11
for the host).I checked how the device numbers are delivered on a system with an AMD + NVIDIA GPU. We can see the same behavior:
$ clang-18 -fopenmp -fopenmp-targets=nvptx64,amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a veccopy.c $ ./a.out Num devices = 3 Device 0 Callback Init: device_num=0 type=sm_80 device=0x556d3f652a90 lookup=0x7efd6a8027b0 doc=(nil) Callback Load: device_num:0 filename:(null) host_adddr:0x556d3ed12778 device_addr:(nil) bytes:715888 [...] Device 1 Callback Init: device_num=0 type=gfx90a device=0x556d402203d0 lookup=0x7efd6a8027b0 doc=(nil) Callback Load: device_num:0 filename:(null) host_adddr:0x556d3edc1478 device_addr:(nil) bytes:25064 [...] Device 2 Callback Init: device_num=1 type=gfx90a device=0x556d40229210 lookup=0x7efd6a8027b0 doc=(nil) Callback Load: device_num:1 filename:(null) host_adddr:0x556d3edc1478 device_addr:(nil) bytes:25064 [...] Success Callback Fini: device_num=0 Callback Fini: device_num=0 Callback Fini: device_num=1
The runtime should probably dispatch the same device number users can use to define the executing device [...]
tl;dr: Agreed. Like that idea, albeit I'd have to see if this is possible. Maybe I'm not aware of some information but ATM I'm not very confident this can be (reasonably) solved on my end.
When the callbacks are dispatched, we only have the information from the corresponding RTL. To adhere to our example, that is e.g.
DeviceId=3
-- while in the OpenMP runtime this might in fact be device number 11. Now the callback is executed, that's it -- for the init callback we could (during execution of the corresponding callback!) find out the "actual" (whatever that means) device number since we have thekind
(amdgcn, nvptx, ...) and might be able to deduce further info. But other callbacks do not have this information.