docker - Multistage docker build: stat reports that an NVIDIA file does not exist, but it does
Problem description
I am trying to combine two Docker images.
Here is my Dockerfile:
FROM nvidia/cuda:10.0-devel-ubuntu18.04 AS cuda10
FROM osrf/ros:foxy-desktop
COPY --from=cuda10 /usr/local/cuda-10.0 /usr/local/cuda-10.0
RUN cd /usr/local && ln -s cuda-10.0 cuda
COPY --from=cuda10 \
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 \
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129 \
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/libcuda.so.410.129 \
/usr/lib/x86_64-linux-gnu/libcuda.so.460.32.03 \
/usr/lib/x86_64-linux-gnu/
The build fails:
$ docker build . -t nvidia-ros:osrf
Step 5/7 : COPY --from=cuda10 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129 /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.32.03 /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.32.03 /usr/lib/x86_64-linux-gnu/libcuda.so.410.129 /usr/lib/x86_64-linux-gnu/libcuda.so.460.32.03 /usr/lib/x86_64-linux-gnu/
COPY failed: stat usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03: file does not exist
But the files do exist:
$ docker run -it --rm --gpus all nvidia/cuda:10.0-devel-ubuntu18.04
root@fc9c1d8ccdc2:/# ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.*
lrwxrwxrwx 1 root root 37 Jan 30 14:13 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.460.32.03
-rw-r--r-- 1 root root 12129448 Aug 20 2019 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129
-rw-r--r-- 1 root root 10516984 Dec 27 18:55 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03
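Solution
TL;DR: this file is mounted by the NVIDIA runtime (see the docs), so it is not present at build time. You need a few environment variables set in the image, or passed at container start, so that the NVIDIA runtime mounts the driver libraries into the container. See the Dockerfile at the end for an example.
To investigate, I first ran this command: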
docker run --rm --entrypoint="" -it nvidia/cuda:10.0-devel-ubuntu18.04 \
stat /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03
and got the same error:
stat: cannot stat '/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03': No such file or directory
So I went into the directory and looked around with ls:
root@8c34c353bcbb:/usr/lib/x86_64-linux-gnu# ls libnvidia-ptxjitcompiler.so
ls: cannot access 'libnvidia-ptxjitcompiler.so': No such file or directory
root@8c34c353bcbb:/usr/lib/x86_64-linux-gnu# ls libn
libnccl.so libnccl_static.a libnpth.so.0 libnsl.so libnss_files.so libnss_nisplus.so
libnccl.so.2 libnettle.so.6 libnpth.so.0.1.1 libnss_compat.so libnss_hesiod.so
libnccl.so.2.6.4 libnettle.so.6.4 libnsl.a libnss_dns.so libnss_nis.so
The file is missing.
Then I ran the command you shared:
docker run -it --rm --runtime nvidia nvidia/cuda:10.0-devel-ubuntu18.04
root@4a1602f3d5c0:/# ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.*
lrwxrwxrwx 1 root root 34 Jan 30 14:48 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.450.66
-rw-r--r-- 1 root root 12129448 Aug 20 2019 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129
-rwxr-xr-x 1 root root 9947144 Sep 28 10:57 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
The file is there, but with a different version, one that matches my NVIDIA driver version:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66 Driver Version: 450.66 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
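Because the version suffix follows the host driver rather than the image, it can help to list which versions are actually present instead of hard-coding one. This is a minimal sketch, assuming a POSIX shell inside the container; list_driver_lib_versions is a hypothetical helper name, and the default directory is the one from the question.

```shell
# Hypothetical helper: print the version suffix of every regular
# libnvidia-ptxjitcompiler.so.* file in a directory (symlinks are skipped).
list_driver_lib_versions() {
    dir="${1:-/usr/lib/x86_64-linux-gnu}"
    for f in "$dir"/libnvidia-ptxjitcompiler.so.*; do
        # Skip an unexpanded glob and symlinks such as .so.1
        [ -f "$f" ] && [ ! -L "$f" ] || continue
        basename "$f" | sed 's/^libnvidia-ptxjitcompiler\.so\.//'
    done
}
```

Run without the NVIDIA runtime, this would be expected to list only the versions baked into the image; run with it, the host driver's version should appear as well.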
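So it seems to me that this file exists only when you start the container with the NVIDIA runtime. I searched for this and found confirmation in the documentation, which states that the container must be run with several environment variables set for the driver libraries to be mounted. So I ran the env command in the official NVIDIA container and copied every variable with the NVIDIA_ prefix into the Dockerfile: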
FROM nvidia/cuda:10.0-devel-ubuntu18.04 AS cuda10
FROM osrf/ros:foxy-desktop
COPY --from=cuda10 /usr/local/cuda-10.0 /usr/local/cuda-10.0
RUN cd /usr/local && ln -s cuda-10.0 cuda
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV NVIDIA_REQUIRE_CUDA="cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411"
ENV NVIDIA_VISIBLE_DEVICES=all
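Copying the NVIDIA_ variables by hand can be done mechanically. The sketch below is one possible way, not necessarily how the lines above were produced: nvidia_env_to_dockerfile is a hypothetical helper that reads env output on stdin and prints one ENV line per NVIDIA_ variable. It quotes every value, since in the ENV key=value form an unquoted space starts a new variable.

```shell
# Hypothetical helper: read KEY=VALUE lines (the format printed by `env`)
# on stdin and print a Dockerfile ENV line for each NVIDIA_ variable.
# Assumption: values contain no double quotes, dollar signs, or newlines.
nvidia_env_to_dockerfile() {
    grep '^NVIDIA_' | while IFS= read -r line; do
        key=${line%%=*}     # text before the first '='
        value=${line#*=}    # text after the first '=', may contain '=' and spaces
        printf 'ENV %s="%s"\n' "$key" "$value"
    done
}
```

A possible invocation, reusing the image and the --entrypoint="" override from earlier: docker run --rm --entrypoint="" nvidia/cuda:10.0-devel-ubuntu18.04 env | nvidia_env_to_dockerfile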
Running the new image with the NVIDIA runtime, I can see that the file is mounted:
docker run --runtime nvidia --rm -it afae756457a9
root@7ebdef701231:/# stat /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
File: /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
Size: 9947144 Blocks: 19432 IO Block: 4096 regular file
Device: 801h/2049d Inode: 131438 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-01-30 14:48:05.765015216 +0000
Modify: 2020-09-28 10:57:18.067125173 +0000
Change: 2020-09-28 10:57:18.067125173 +0000
Birth: -
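The TL;DR also mentions passing the variables at container start instead of baking them into the image. A minimal sketch of that variant follows; run_with_nvidia is a hypothetical wrapper name, the DOCKER override exists only to preview the command, and NVIDIA_REQUIRE_CUDA is left out here for brevity.

```shell
# Hypothetical wrapper: start an image with the NVIDIA runtime and pass the
# NVIDIA_ variables at container start rather than via ENV in the image.
# Set DOCKER=echo to print the arguments instead of running docker.
run_with_nvidia() {
    image="$1"
    shift
    "${DOCKER:-docker}" run --rm -it --runtime nvidia \
        -e NVIDIA_VISIBLE_DEVICES=all \
        -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
        "$image" "$@"
}
```

For example, run_with_nvidia nvidia-ros:osrf would start the image tagged in the question's build command; whether the libraries are then mounted still depends on the NVIDIA runtime being installed on the host.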