Skip to content

[OMPT] Overlapping device_num when using multiple offloading architectures #65104

Closed
@Thyre

Description

@Thyre

Description

This issue was initially found during discussion of #64487. However, this should be handled as a separate issue, especially since the other one is closed now. This issue doesn't affect me directly, but there are users which might run into this issue, especially in mixed GPU systems.

When a user utilizes multiple architectures to offload to in his program (for example -fopenmp-targets=x86_64,nvptx64, the OMPT interface will receive callbacks for ompt_callback_device_initialize for each device used. While this works fine, we may encounter a case where the transferred device number for two different architectures overlap, causing issues for tools to differentiate the devices.

Users can select their device by the device number, for example

int main( void )
{
// x86_64 (four devices)
#pragma omp target device(0)
{...}
// nvptx64 (other devices)
#pragma omp target device(4)
{...}
}

This would translate to the following in the OMPT interface

Callback Init: device_num=0 type=generic-64bit
[...]
Callback Init: device_num=0 type=sm_80 

Reproducer

This small code should show the issue. I have tested it with with nvptx64 + x86_64 and nvptx64 + amdgcn-amd-amdhsa, but it should also affect other combinations

#include <assert.h>
#include <stdio.h>
#include <omp-tools.h>


void callback_ompt_device_initialize(int device_num,
                                     const char *type,
                                     ompt_device_t *device,
                                     ompt_function_lookup_t lookup,
                                     const char *documentation)
{
    printf("[%s] device_num = %d | type = %s\n", __FUNCTION__, device_num, type);
}

static int
initialize_tool( ompt_function_lookup_t lookup,
                 int                    initialDeviceNum,
                 ompt_data_t*           toolData )
{
    ompt_set_callback_t set_callback =
        ( ompt_set_callback_t )lookup( "ompt_set_callback" );
    assert( set_callback != 0 );

    ompt_set_result_t registration_result = set_callback(ompt_callback_device_initialize, (ompt_callback_t) &callback_ompt_device_initialize);
    assert(registration_result == ompt_set_always);

    return 1;
}

static void
finalize_tool( ompt_data_t* toolData )
{
}
	
ompt_start_tool_result_t*
ompt_start_tool( unsigned int omp_version, /* == _OPENMP */
                 const char*  runtime_version )
{
    static ompt_start_tool_result_t tool = { &initialize_tool,
                                             &finalize_tool,
                                             ompt_data_none };
    return &tool;
}


int main(void) {
    #pragma omp target device(0)
    {}

    #pragma omp target device(4)
    {}
}

Running the tool on a system with a single NVIDIA MX550, we can see the following output:

$ clang --version
clang version 18.0.0 (https://github.com/llvm/llvm-project.git 4b383107fa7585bb5ecd7f03cab7800b33d1585a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/software/software/LLVM/git/bin
$ clang -fopenmp -fopenmp-targets=x86_64,nvptx64 reproducer.c
$ ./a.out
[callback_ompt_device_initialize] device_num = 0 | type = generic-64bit
[callback_ompt_device_initialize] device_num = 0 | type = sm_75

Previous discussion

Here are a few related comments by @mhalk and me regarding this issue:

#64487 (comment)

Let's say we would go with "added support": I added prints for omp_get_device_num and omp_get_num_devices. Also forced the target regions to use device number 3 (and 11, for the two-target-version).

Output for x86_64 only:

trunk/bin/clang -fopenmp -fopenmp-targets=x86_64 veccopy-ompt-target.c -o veccopy-ompt-target
./veccopy-ompt-target

omp_get_device_num=4
omp_get_num_devices=4

Callback Init: device_num=3 type=unknown device=0x1485000 lookup=0x7f0bd39c8a90 doc=(nil)
Callback Load: device_num:3 module_id:0 filename:(null) host_adddr:0x200368 device_addr:(nil) bytes:17112
[...]
Callback Fini: device_num=3

Output for x86_64 and amdgcn-amd-amdhsa (8 gfx90a GPUs present):

trunk/bin/clang -fopenmp -fopenmp-targets=x86_64,amdgcn-amd-amdhsa veccopy-ompt-target.c -o veccopy-ompt-target
./veccopy-ompt-target

omp_get_device_num=12
omp_get_num_devices=12

Callback Init: device_num=3 type=gfx90a device=0x12ae980 lookup=0x7f441a241a90 doc=(nil)
Callback Load: device_num:3 module_id:0 filename:(null) host_adddr:0x200378 device_addr:(nil) bytes:19208
[...]
Callback Init: device_num=3 type=unknown device=0x12b52a0 lookup=0x7f441a241a90 doc=(nil)
Callback Load: device_num:3 module_id:0 filename:(null) host_adddr:0x204f00 device_addr:(nil) bytes:17112
[...]
Callback Fini: device_num=3
Callback Fini: device_num=3

So, in the sources, I intentionally used devices 3 and 11 for target regions. Which both correlate to device number 3 in their respective RTL. That is, to illustrate that there might be ambiguous output w.r.t. Fini callbacks.

#64487 (comment)

Judging by your output for both x86_64 and a combination of x86_64 + amdgcn, I would expect that nvptx64 + amdgcn yields the same results shown in your second example. I think that I can test this. For tool developers, the ambiguous output can be hard to work with, since we can only use the device_num to identify the executing device during target callbacks. The runtime should probably dispatch the same device number users can use to define the executing device (here 3 for the AMD GPU and 11 for the host).

I checked how the device numbers are delivered on a system with an AMD + NVIDIA GPU. We can see the same behavior:

$ clang-18 -fopenmp -fopenmp-targets=nvptx64,amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a veccopy.c
$ ./a.out
Num devices = 3
Device 0
Callback Init: device_num=0 type=sm_80 device=0x556d3f652a90 lookup=0x7efd6a8027b0 doc=(nil)
Callback Load: device_num:0 filename:(null) host_adddr:0x556d3ed12778 device_addr:(nil) bytes:715888
[...]
Device 1
Callback Init: device_num=0 type=gfx90a device=0x556d402203d0 lookup=0x7efd6a8027b0 doc=(nil)
Callback Load: device_num:0 filename:(null) host_adddr:0x556d3edc1478 device_addr:(nil) bytes:25064
[...]
Device 2
Callback Init: device_num=1 type=gfx90a device=0x556d40229210 lookup=0x7efd6a8027b0 doc=(nil)
Callback Load: device_num:1 filename:(null) host_adddr:0x556d3edc1478 device_addr:(nil) bytes:25064
[...]
Success
Callback Fini: device_num=0
Callback Fini: device_num=0
Callback Fini: device_num=1

#64487 (comment)

The runtime should probably dispatch the same device number users can use to define the executing device [...]

tl;dr: Agreed. Like that idea, albeit I'd have to see if this is possible. Maybe I'm not aware of some information but ATM I'm not very confident this can be (reasonably) solved on my end.

When the callbacks are dispatched, we only have the information from the corresponding RTL. To adhere to our example, that is e.g. DeviceId=3 -- while in the OpenMP runtime this might in fact be device number 11. Now the callback is executed, that's it -- for the init callback we could (during execution of the corresponding callback!) find out the "actual" (whatever that means) device number since we have the kind (amdgcn, nvptx, ...) and might be able to deduce further info. But other callbacks do not have this information.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions