Problem
A GPU-enabled instance cannot use its GPU. The NVIDIA driver does not appear to be running, and "nvidia-smi" returns an error such as:
$> nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
This usually happens after an operating system kernel update. The NVIDIA driver was built for the previous kernel and must be rebuilt to be compatible with the new one.
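To confirm that a kernel update is the likely cause, you can compare the running kernel with the state of the nvidia kernel module. The following is a minimal sketch, not part of the official procedure: it assumes a Linux guest with the standard uname, lsmod and modinfo tools, and the function name nvidia_module_state is made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether an nvidia kernel module exists for a given kernel.
# Assumption: standard Linux tools (uname, lsmod, modinfo) are installed.

# Print "loaded", "built-not-loaded" or "missing".
# $1 (optional): kernel release to inspect, defaults to the running kernel.
nvidia_module_state() {
    kernel="${1:-$(uname -r)}"
    if lsmod 2>/dev/null | grep -q '^nvidia '; then
        # The module is currently loaded into the running kernel.
        echo "loaded"
    elif modinfo -k "$kernel" nvidia >/dev/null 2>&1; then
        # A module was built for this kernel but is not loaded.
        echo "built-not-loaded"
    else
        # No module for this kernel: the driver probably needs a rebuild.
        echo "missing"
    fi
}

echo "Running kernel: $(uname -r)"
echo "nvidia module:  $(nvidia_module_state)"
```

If this reports "missing" shortly after a kernel update, the driver most likely has not been rebuilt for the new kernel yet, which matches the situation described above.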
Solution
Using the Morpheus web portal:
- Navigate to the instance showing the problems.
- Click ACTIONS, then Run Workflow.
- Pick "Nvidia driver refresh" and click EXECUTE.
- Morpheus will show the progress of this operation, and after a few moments, the GPUs should be available again.
Once your instance is running, you can check whether it can see the GPU with:
$> nvidia-smi
Tue Nov 17 15:20:38 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.87       Driver Version: 440.87       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID V100-4C        On   | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |    304MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
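If you want to run this check from a script rather than read the table by eye, the exit status of nvidia-smi and its query mode can be used. The sketch below is only an illustration: the function name check_gpu and the choice of queried fields are assumptions, while --query-gpu and --format=csv,noheader are standard nvidia-smi options.

```shell
#!/bin/sh
# Sketch: scriptable GPU check. Exit status 0 means the driver answered.

# Print one CSV line per GPU (name, driver version, total memory).
# $1 (optional): nvidia-smi command to run, defaults to nvidia-smi from PATH.
# Returns 2 if the command is not found, otherwise nvidia-smi's exit status.
check_gpu() {
    smi="${1:-nvidia-smi}"
    if ! command -v "$smi" >/dev/null 2>&1; then
        echo "nvidia-smi not found" >&2
        return 2
    fi
    "$smi" --query-gpu=name,driver_version,memory.total --format=csv,noheader
}

if check_gpu; then
    echo "GPU is visible"
else
    echo "GPU is not visible; the driver may need a refresh" >&2
fi
```

A non-zero result here corresponds to the error shown in the Problem section, in which case the "Nvidia driver refresh" workflow can be run again.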