Description
Updates according to @mregmi's and @vbedida79's comments
Summary
GPU workloads cannot access the GPU devices from the container environment unless setsebool container_use_devices on is run on the host.
Detail
GPU workload pods requesting the gpu.intel.com/i915 resource cannot run until they have access to the DRM device nodes under /dev/dri on the GPU node. This access can be granted by running setsebool container_use_devices on on the host node, but doing so manually is not feasible when a cluster has multiple GPU nodes, because the boolean has to be set on each node individually.
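For reference, below is a minimal sketch of a workload pod that requests this resource. The pod name and image are placeholders chosen for illustration, not taken from this issue:

$ oc apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload-example    # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: workload
    image: registry.example.com/clinfo:latest    # placeholder image
    command: ["clinfo"]
    resources:
      limits:
        gpu.intel.com/i915: 1    # one Intel GPU from the device plugin
EOF

A pod like this is scheduled onto a GPU node, but without the SELinux change its workload cannot open the DRM device nodes inside the container.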
Root cause
Permission to access the DRM device nodes has not been added to the container_device_t SELinux policy, so SELinux blocks access to the devices and the workload application cannot reach the GPU device node files from the container environment.
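The following commands are a quick way to confirm the state on an affected node. They are standard SELinux tooling (not taken from this issue) and are assumed to be run from a host shell on the GPU node, for example after chroot /host:

$ getsebool container_use_devices           # reports "container_use_devices --> off" on an affected node
$ ls -lZ /dev/dri                           # shows the SELinux type on the GPU device nodes
$ ausearch -m avc -ts recent | grep -i dri  # look for related AVC denials from the workload container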
Solution
- Work with container-selinux upstream to add the needed permission, and make sure a container-selinux release containing the fix is merged into an OCP release.
- Until it is merged into an OCP release, distribute the new policy through the user-container-policy project.
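As an illustration only, an interim local policy module could look roughly like the sketch below. The type names follow the wording of the root cause above (container_t, container_device_t) and should be verified against the actual labels on the node (ls -lZ /dev/dri) before building anything like this; the module name is made up for this example.

$ cat > user_container_devices.te <<'EOF'
module user_container_devices 1.0;

require {
    type container_t;
    type container_device_t;
    class chr_file { getattr open read write ioctl map };
}

# Allow containers to use device nodes labeled container_device_t
allow container_t container_device_t:chr_file { getattr open read write ioctl map };
EOF
$ checkmodule -M -m -o user_container_devices.mod user_container_devices.te
$ semodule_package -o user_container_devices.pp -m user_container_devices.mod
$ semodule -i user_container_devices.pp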
Workaround
To ensure all GPU workloads (clinfo, AI inference) work properly, complete the following steps.
- Find all nodes with an Intel Data Center GPU card using the following command:
$ oc get nodes -l intel.feature.node.kubernetes.io/gpu=true
Example output:
NAME         STATUS   ROLES    AGE   VERSION
icx-dgpu-1   Ready    worker   30d   v1.25.4+18eadca
- Navigate to the node terminal on the web console (Compute -> Nodes -> Select a node -> Terminal). Run the following commands in the terminal. Repeat this step for each remaining node with an Intel Data Center GPU card.
$ chroot /host
$ setsebool container_use_devices on
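If terminal access through the web console is not convenient, the same boolean can be set from the CLI. This is a sketch, assuming cluster-admin access; it reuses the node label from the first command above:

$ for node in $(oc get nodes -l intel.feature.node.kubernetes.io/gpu=true -o jsonpath='{.items[*].metadata.name}'); do
      oc debug node/$node -- chroot /host setsebool container_use_devices on
  done

Note that setsebool without -P does not persist the setting across a node reboot; add -P if it should survive restarts.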