P1-Blocker: GPU workloads cannot access the GPU devices from the container environment without setsebool container_use_devices on #107

Closed
@vbedida79

Description

Updated according to comments from @mregmi and @vbedida79.

Summary

GPU workloads cannot access the GPU devices from the container environment unless setsebool container_use_devices on has been run on the host.

Detail

GPU workload pods requesting the gpu.intel.com/i915 resource cannot run until they have access to /dev/drm on the GPU node.
This access can be granted by running setsebool container_use_devices on on the host node. That approach does not scale: in a cluster with multiple GPU nodes, the boolean has to be set manually on each node.

Root cause

The /dev/drm access permission has not been added to the container_device_t policy, so SELinux blocks access to /dev/drm. As a result, the workload app in the container cannot access the GPU device node files.
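
To confirm that a node is in the state described above, the current value of the boolean can be inspected on the node itself. The following is a minimal sketch, assuming it runs in a root shell on the GPU node (for example after chroot /host) and that getsebool prints its usual "name --> on|off" line. The function names sebool_is_on and report_device_boolean are hypothetical helpers, not part of any existing tool.

```shell
# Returns 0 when a getsebool output line reports the boolean as on.
sebool_is_on() {
  case "$1" in
    *"--> on") return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the state of container_use_devices on the current host.
report_device_boolean() {
  local state
  state="$(getsebool container_use_devices)" || return 1
  if sebool_is_on "$state"; then
    echo "container_use_devices is on"
  else
    echo "container_use_devices is off"
  fi
}
```

Calling report_device_boolean on a node where the boolean is off matches the failure case in this issue: containers are not allowed to use the device nodes under /dev/drm.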

Solution

  • Work with container-selinux upstream to add the needed permission, and make sure the container-selinux version containing the fix is merged into an OCP release.
  • Until the fix is merged into an OCP release, distribute the new policy through the user-container-policy project.

Workaround

To ensure all GPU workloads (clinfo, AI inference) work properly, follow these steps on the GPU nodes.

  1. Find all nodes with an Intel Data Center GPU card using the following command:
$ oc get nodes -l intel.feature.node.kubernetes.io/gpu=true

Example output:

NAME         STATUS   ROLES    AGE   VERSION
icx-dgpu-1   Ready    worker   30d   v1.25.4+18eadca
  2. Navigate to the node terminal in the web console (Compute -> Nodes -> Select a node -> Terminal) and run the following commands in the terminal. Repeat step 2 for every other node with an Intel Data Center GPU card.
$ chroot /host
$ setsebool container_use_devices on
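
For clusters with several GPU nodes, the two steps above can be sketched as a single loop run from a workstation. This is a minimal sketch, not a supported procedure: it assumes the logged-in user is allowed to run oc debug against nodes, and the function name enable_gpu_device_access and the OC and GPU_LABEL variables are hypothetical names introduced for illustration.

```shell
# Client binary to use; overridable, which is an assumption made for illustration.
OC="${OC:-oc}"
# Label from step 1 that marks nodes with an Intel Data Center GPU card.
GPU_LABEL="intel.feature.node.kubernetes.io/gpu=true"

# Runs the step 2 commands on every node that carries the GPU label.
enable_gpu_device_access() {
  local nodes node
  # "-o name" prints one "node/<name>" entry per line.
  nodes="$("$OC" get nodes -l "$GPU_LABEL" -o name)" || return 1
  for node in $nodes; do
    echo "Setting container_use_devices on ${node}"
    # oc debug starts a debug pod on the node; chroot /host enters the host filesystem.
    "$OC" debug "$node" -- chroot /host setsebool container_use_devices on || return 1
  done
}
```

Running enable_gpu_device_access after defining it applies the setting to each labeled node in turn and stops at the first failure. Note that setsebool without the -P option changes only the running policy, so the value may need to be set again after a node reboot.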

Metadata

Labels

gpu (Intel GPU)