Multistage docker build: stat reports that an NVIDIA file does not exist, but it does

Question

I am trying to merge two Docker images.

This is my Dockerfile:

FROM nvidia/cuda:10.0-devel-ubuntu18.04 AS cuda10
FROM osrf/ros:foxy-desktop

COPY --from=cuda10 /usr/local/cuda-10.0 /usr/local/cuda-10.0
RUN cd /usr/local && ln -s cuda-10.0 cuda

COPY --from=cuda10 \
   /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 \
   /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129 \
   /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/libcuda.so.410.129 \
   /usr/lib/x86_64-linux-gnu/libcuda.so.460.32.03 \
   /usr/lib/x86_64-linux-gnu/

The build fails:

$ docker build . -t nvidia-ros:osrf
Step 5/7 : COPY --from=cuda10 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129 /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.32.03 /usr/lib/x86_64-linux-gnu/libcuda.so.410.129 /usr/lib/x86_64-linux-gnu/libcuda.so.460.32.03 /usr/lib/x86_64-linux-gnu/
COPY failed: stat usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03: file does not exist

But the files do exist:

$ docker run -it --rm --gpus all nvidia/cuda:10.0-devel-ubuntu18.04
root@fc9c1d8ccdc2:/# ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.*
lrwxrwxrwx 1 root root       37 Jan 30 14:13 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.460.32.03
-rw-r--r-- 1 root root 12129448 Aug 20  2019 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129
-rw-r--r-- 1 root root 10516984 Dec 27 18:55 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03

Tags: docker, nvidia, nvidia-docker

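Answer

TL;DR: This file is mounted by the NVIDIA runtime (see the docs), so it is not present at build time. You need a few environment variables set, either in the image or at container start, so that the NVIDIA runtime mounts the driver libraries into the container. See the Dockerfile at the end for an example.

To investigate this, I first ran this command: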

docker run --rm --entrypoint="" -it nvidia/cuda:10.0-devel-ubuntu18.04 \
stat /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03

and got the same error:

stat: cannot stat '/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03': No such file or directory

So I went into the directory and looked around with ls:

root@8c34c353bcbb:/usr/lib/x86_64-linux-gnu# ls libnvidia-ptxjitcompiler.so
ls: cannot access 'libnvidia-ptxjitcompiler.so': No such file or directory

root@8c34c353bcbb:/usr/lib/x86_64-linux-gnu# ls libn
libnccl.so         libnccl_static.a   libnpth.so.0       libnsl.so          libnss_files.so    libnss_nisplus.so  
libnccl.so.2       libnettle.so.6     libnpth.so.0.1.1   libnss_compat.so   libnss_hesiod.so   
libnccl.so.2.6.4   libnettle.so.6.4   libnsl.a           libnss_dns.so      libnss_nis.so      

The libnvidia-* files are missing.

Then I used the command you shared, this time with the NVIDIA runtime:

docker run -it --rm --runtime nvidia nvidia/cuda:10.0-devel-ubuntu18.04

root@4a1602f3d5c0:/# ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.*
lrwxrwxrwx 1 root root       34 Jan 30 14:48 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.450.66
-rw-r--r-- 1 root root 12129448 Aug 20  2019 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129
-rwxr-xr-x 1 root root  9947144 Sep 28 10:57 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66

The file is there, but in a different version, one that matches my NVIDIA driver version:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
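
The comparison above (listing the library directory once without and once with the NVIDIA runtime) can also be done mechanically. The following is a minimal sketch, not part of the original investigation: the helper name runtime_only_files and the two listing files are hypothetical, and it assumes one file name per line.

```shell
#!/bin/sh
# Hypothetical helper: print the entries that appear only in the second listing.
#   $1: file names listed WITHOUT the NVIDIA runtime
#   $2: file names listed WITH the NVIDIA runtime
runtime_only_files() {
    without_sorted=$(mktemp) && with_sorted=$(mktemp) || return 1
    # comm needs both inputs sorted in the same collation order.
    LC_ALL=C sort -u "$1" > "$without_sorted"
    LC_ALL=C sort -u "$2" > "$with_sorted"
    # -13 hides lines unique to the first file and lines common to both,
    # which leaves only the lines unique to the second file.
    LC_ALL=C comm -13 "$without_sorted" "$with_sorted"
    rm -f "$without_sorted" "$with_sorted"
}

# Possible usage (assumes Docker and the NVIDIA runtime are installed):
#   docker run --rm --entrypoint="" nvidia/cuda:10.0-devel-ubuntu18.04 \
#       ls /usr/lib/x86_64-linux-gnu > without.txt
#   docker run --rm --entrypoint="" --runtime nvidia nvidia/cuda:10.0-devel-ubuntu18.04 \
#       ls /usr/lib/x86_64-linux-gnu > with.txt
#   runtime_only_files without.txt with.txt
```

Anything this prints is supplied by the runtime at container start, so it cannot be the source of a COPY --from at build time.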

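So it seems to me that this file only exists when you start the container with the NVIDIA runtime. I searched for this and found documentation confirming it: the docs state that you need to run the container with several environment variables set in order to have the driver libraries mounted. So I ran env in the official NVIDIA container and copied every variable with the NVIDIA_ prefix into the Dockerfile: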

FROM nvidia/cuda:10.0-devel-ubuntu18.04 AS cuda10
FROM osrf/ros:foxy-desktop

COPY --from=cuda10 /usr/local/cuda-10.0 /usr/local/cuda-10.0
RUN cd /usr/local && ln -s cuda-10.0 cuda

ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV NVIDIA_REQUIRE_CUDA="cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411"
ENV NVIDIA_VISIBLE_DEVICES=all
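
The copying step described above (running env in the official image and keeping the NVIDIA_-prefixed variables) could be scripted roughly as follows. This is only a sketch: the function name nvidia_env_to_dockerfile is hypothetical, and it assumes every variable fits on one line. Values are wrapped in double quotes so that a value containing spaces stays a single ENV value.

```shell
#!/bin/sh
# Hypothetical helper: read `env` output on stdin and print Dockerfile ENV
# lines for the variables whose names start with NVIDIA_.
nvidia_env_to_dockerfile() {
    # Escape backslashes, double quotes and dollar signs first, then wrap
    # each value in double quotes.
    grep '^NVIDIA_' \
        | sed -e 's/\\/\\\\/g' \
              -e 's/"/\\"/g' \
              -e 's/\$/\\$/g' \
              -e 's/^\([^=]*\)=\(.*\)$/ENV \1="\2"/'
}

# Possible usage (assumes Docker is installed):
#   docker run --rm --entrypoint="" nvidia/cuda:10.0-devel-ubuntu18.04 env \
#       | nvidia_env_to_dockerfile
```

Review the generated lines before pasting them into a Dockerfile, since not every variable of the base image necessarily applies to the merged image.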

Running the new image with the NVIDIA runtime, I found that the file is mounted:

docker run --runtime nvidia --rm -it afae756457a9

root@7ebdef701231:/# stat /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
  File: /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
  Size: 9947144         Blocks: 19432      IO Block: 4096   regular file
Device: 801h/2049d      Inode: 131438      Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-01-30 14:48:05.765015216 +0000
Modify: 2020-09-28 10:57:18.067125173 +0000
Change: 2020-09-28 10:57:18.067125173 +0000
 Birth: -
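
As the TL;DR mentions, the variables can also be supplied at container start instead of being baked into the image. Below is a minimal sketch of that alternative. The wrapper name run_with_nvidia_env and the DRY_RUN switch are hypothetical, and the variable values are simply the ones used in the Dockerfile above.

```shell
#!/bin/sh
# Hypothetical wrapper: start an image with the NVIDIA runtime and pass the
# NVIDIA_* variables with -e at container start.
#   $1:   image name or ID
#   rest: optional command to run inside the container
# Set DRY_RUN=1 to print the docker command instead of executing it.
run_with_nvidia_env() {
    image=$1
    shift
    # Rebuild the positional parameters as the full docker command line.
    set -- docker run --rm -it --runtime nvidia \
        -e NVIDIA_VISIBLE_DEVICES=all \
        -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
        "$image" "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Possible usage (assumes the image was built as nvidia-ros:osrf):
#   run_with_nvidia_env nvidia-ros:osrf nvidia-smi
```

Setting the variables in the image keeps the run command short, while passing them at start leaves the choice of devices and capabilities to whoever launches the container.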
