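Docker cannot find the NVIDIA runtime after installing nvidia-docker2

Symptoms and background

On Ubuntu 20.04 I installed the latest versions of Docker Desktop and nvidia-docker2, and I want to build a Docker image that can run GUI applications. The image had already been tested on WSL2 without problems, so I rebuilt it on Ubuntu without changing anything. The Dockerfile and docker-compose.yml are as follows: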

FROM osrf/ros:melodic-desktop-full

SHELL ["/bin/bash", "-c"]

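# Minimal setup: source the ROS environment in every new interactive bash shell
RUN echo "source /opt/ros/melodic/setup.bash" >> ~/.bashrc
# Note: sourcing here only affects this single build step, not later layers
RUN source ~/.bashrc
# Extra pkg installation after this!

services: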
  melodic:
    build: .
    image: melodic
    command: roslaunch gazebo_ros empty_world.launch
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
    environment:
      - DISPLAY=${DISPLAY}
      - NVIDIA_DRIVER_CAPABILITIES=all
      - NVIDIA_VISIBLE_DEVICES=all
      - QT_X11_NO_MITSHM=1
    volumes:
      - /tmp/.X11-unix:/tmp/.X11-unix
      - ${PWD}/.Xauthority:/root/.Xauthority:rw
    network_mode: "host"

However, when I run the following command:

docker compose up

the error I get is:

Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
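
The hook error says that nvidia-container-cli could not load libnvidia-ml.so.1, the NVIDIA management library that ships with the driver. A quick way to see whether that library is registered with the dynamic loader on the host is sketched below. The helper name has_lib is made up for illustration, and the sketch assumes ldconfig is on the PATH (on some systems it lives in /sbin).

```shell
#!/usr/bin/env bash
# has_lib is a hypothetical helper: succeeds if the given library name
# appears in the dynamic loader cache printed by `ldconfig -p`.
has_lib() {
    ldconfig -p 2>/dev/null | grep -F "$1" >/dev/null
}

if has_lib "libnvidia-ml.so.1"; then
    echo "libnvidia-ml.so.1 is visible to the loader on this host"
else
    echo "libnvidia-ml.so.1 was not found in the loader cache"
fi
```

If the library is present on the host but the container hook still cannot load it, one possible explanation (an assumption, not a confirmed diagnosis) is that the request is handled by a Docker engine that does not run directly on the host.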

My approach and what I have tried

My guess was that the runtime might not be set up properly, so I ran the following command:

docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu:22.04 nvidia-smi

The error I got was:

Error response from daemon: Unknown runtime specified nvidia
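
The "Unknown runtime specified nvidia" message means that the daemon which received the request has no runtime registered under that name. As a rough sketch, the active Docker context and the runtimes known to its daemon can be listed as follows. has_nvidia_runtime is a hypothetical helper name, and the sketch assumes the docker CLI is installed.

```shell
#!/usr/bin/env bash
# has_nvidia_runtime is a hypothetical helper: succeeds if the daemon the
# CLI is currently talking to reports a runtime named "nvidia".
has_nvidia_runtime() {
    docker info --format '{{json .Runtimes}}' 2>/dev/null | grep -F '"nvidia"' >/dev/null
}

if command -v docker >/dev/null 2>&1; then
    # The context marked with * is the one the CLI currently uses.
    docker context ls || true
    if has_nvidia_runtime; then
        echo "nvidia runtime is registered with the active daemon"
    else
        echo "nvidia runtime is NOT registered with the active daemon"
    fi
fi
```

Docker Desktop on Linux uses its own context (typically named desktop-linux) with an engine running inside a virtual machine, so a runtime registered for the host's native engine may not be visible there. Running docker context use default targets the native engine, assuming one is installed.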

Then I added the runtime using the method given in the official documentation:

sudo dockerd --add-runtime=nvidia=/usr/bin/nvidia-container-runtime

But the output was:

INFO[2022-07-11T10:30:18.583217896-05:00] Starting up                                  
INFO[2022-07-11T10:30:18.584301607-05:00] detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf 
INFO[2022-07-11T10:30:18.585538257-05:00] parsed scheme: "unix"                         module=grpc
INFO[2022-07-11T10:30:18.585571148-05:00] scheme "unix" not registered, fallback to default scheme  module=grpc
INFO[2022-07-11T10:30:18.585618515-05:00] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock   0 }]  }  module=grpc
INFO[2022-07-11T10:30:18.585635177-05:00] ClientConn switching balancer to "pick_first"  module=grpc
INFO[2022-07-11T10:30:18.586921837-05:00] parsed scheme: "unix"                         module=grpc
INFO[2022-07-11T10:30:18.586945960-05:00] scheme "unix" not registered, fallback to default scheme  module=grpc
INFO[2022-07-11T10:30:18.586973034-05:00] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock   0 }]  }  module=grpc
INFO[2022-07-11T10:30:18.586984481-05:00] ClientConn switching balancer to "pick_first"  module=grpc
INFO[2022-07-11T10:30:18.595208030-05:00] [graphdriver] using prior storage driver: overlay2 
failed to start daemon: error while opening volume store metadata database: timeout
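
The last line of the log is a timeout while opening the volume store metadata database. One common cause, offered here as an assumption rather than a confirmed diagnosis, is that another dockerd instance (for example the one managed by systemd) is already running and holds the lock on that database. A minimal check, with dockerd_running as a made-up helper name:

```shell
#!/usr/bin/env bash
# dockerd_running is a hypothetical helper: succeeds if a process named
# exactly "dockerd" already exists.
dockerd_running() {
    pgrep -x dockerd >/dev/null 2>&1
}

if dockerd_running; then
    echo "a dockerd process is already running"
else
    echo "no dockerd process found"
fi
```

If a daemon is already running, starting a second one by hand against the same data directory would be expected to fail regardless of the runtime flag.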

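Now I suspect the cause might be that my computer has two NVIDIA RTX cards.
P.S. The nvidia-smi command works.

The result I want

Ultimately I still want to be able to use this Docker image on my computer, but the runtime problem has held me up for a long time, so I would appreciate any help. Thanks in advance!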

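If you have installed nvidia-docker2, you should not register the runtime again. According to the official documentation at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#adding-the-nvidia-runtime, nvidia-docker2 already registers it, so it is best not to add it manually.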
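
To confirm that the package really registered the runtime, the daemon configuration can be inspected. The sketch below assumes the default configuration path /etc/docker/daemon.json and that the entry points at nvidia-container-runtime. daemon_json_has_nvidia is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# daemon_json_has_nvidia is a hypothetical helper: succeeds if the given
# daemon configuration file mentions nvidia-container-runtime.
# Defaults to /etc/docker/daemon.json (assumed default location).
daemon_json_has_nvidia() {
    local cfg="${1:-/etc/docker/daemon.json}"
    [ -r "$cfg" ] && grep -F 'nvidia-container-runtime' "$cfg" >/dev/null
}

if daemon_json_has_nvidia; then
    echo "nvidia runtime entry found in daemon.json"
else
    echo "no nvidia runtime entry found (or file not readable)"
fi
```

If the entry is there but the daemon still does not know the runtime, restarting the service with sudo systemctl restart docker is usually the next thing to try.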

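Were the user permissions set correctly when you ran docker compose? First try the command below and see how it responds, to check whether the runtime is the problem: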

sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi