java - Spark not utilizing GPU taskResourceAssignments Map(gpu -> [0]
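Problem description
I can see tasks being assigned to the GPU, but GPU utilization stays at 0%. How do I get the job to actually use the GPU? I am running both the master and one worker on a single GPU server in standalone mode.
spark-submit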
spark-submit \
--master spark://<ip>:7077 \
--conf spark.executor.resource.gpu.discoveryScript=/opt/getGpusResources.sh \
--conf spark.worker.resource.gpu.discoveryScript=/opt/getGpusResources.sh \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.worker.resource.gpu.amount=1 \
--class com.spark.Class \
app.jar
Logs
21/03/30 23:19:25 INFO DAGScheduler: Submitting 10 missing tasks from ShuffleMapStage 251 (MapPartitionsRDD[306] at collect at ClusteringMetrics.scala:102) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
21/03/30 23:19:25 INFO TaskSchedulerImpl: Adding task set 251.0 with 10 tasks resource profile 0
21/03/30 23:19:25 INFO TaskSetManager: Starting task 0.0 in stage 251.0 (TID 2178) (<ip>, executor 0, partition 0, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:25 INFO BlockManagerInfo: Added broadcast_319_piece0 in memory on <ip>:34559 (size: 17.3 KiB, free: 4.0 GiB)
21/03/30 23:19:25 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 83 to <ip>:34520
21/03/30 23:19:25 INFO BlockManagerInfo: Added broadcast_316_piece0 in memory on <ip>:34559 (size: 547.0 B, free: 4.0 GiB)
21/03/30 23:19:25 INFO TaskSetManager: Starting task 1.0 in stage 251.0 (TID 2179) (<ip>, executor 0, partition 1, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:25 INFO TaskSetManager: Finished task 0.0 in stage 251.0 (TID 2178) in 225 ms on <ip> (executor 0) (1/10)
21/03/30 23:19:25 INFO TaskSetManager: Starting task 2.0 in stage 251.0 (TID 2180) (<ip>, executor 0, partition 2, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:25 INFO TaskSetManager: Finished task 1.0 in stage 251.0 (TID 2179) in 181 ms on <ip> (executor 0) (2/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 3.0 in stage 251.0 (TID 2181) (<ip>, executor 0, partition 3, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 2.0 in stage 251.0 (TID 2180) in 226 ms on <ip> (executor 0) (3/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 4.0 in stage 251.0 (TID 2182) (<ip>, executor 0, partition 4, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 3.0 in stage 251.0 (TID 2181) in 187 ms on <ip> (executor 0) (4/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 5.0 in stage 251.0 (TID 2183) (<ip>, executor 0, partition 5, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 4.0 in stage 251.0 (TID 2182) in 180 ms on <ip> (executor 0) (5/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 6.0 in stage 251.0 (TID 2184) (<ip>, executor 0, partition 6, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 5.0 in stage 251.0 (TID 2183) in 179 ms on <ip> (executor 0) (6/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 7.0 in stage 251.0 (TID 2185) (<ip>, executor 0, partition 7, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 6.0 in stage 251.0 (TID 2184) in 179 ms on <ip> (executor 0) (7/10)
21/03/30 23:19:27 INFO TaskSetManager: Starting task 8.0 in stage 251.0 (TID 2186) (<ip>, executor 0, partition 8, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:27 INFO TaskSetManager: Finished task 7.0 in stage 251.0 (TID 2185) in 216 ms on <ip> (executor 0) (8/10)
21/03/30 23:19:27 INFO TaskSetManager: Starting task 9.0 in stage 251.0 (TID 2187) (<ip>, executor 0, partition 9, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:27 INFO TaskSetManager: Finished task 8.0 in stage 251.0 (TID 2186) in 179 ms on <ip> (executor 0) (9/10)
21/03/30 23:19:27 INFO TaskSetManager: Finished task 9.0 in stage 251.0 (TID 2187) in 179 ms on <ip> (executor 0) (10/10)
21/03/30 23:19:27 INFO TaskSchedulerImpl: Removed TaskSet 251.0, whose tasks have all completed, from pool
21/03/30 23:19:27 INFO DAGScheduler: ShuffleMapStage 251 (collect at ClusteringMetrics.scala:102) finished in 1.934 s
21/03/30 23:19:27 INFO DAGScheduler: looking for newly runnable stages
Specs
I am using an AWS EC2 G4dn machine.
GPU: TU104GL [Tesla T4]
15109MiB
Driver Version: 460.32.03
CUDA Version: 11.2
1 worker: 1 core, 7GB of memory.
Solution
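One likely explanation, offered as an assumption rather than a confirmed diagnosis: the resource settings in the spark-submit command only make Spark schedule a GPU address for each task, which is what `taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])` in the log reports. Spark does not run the task's code on the GPU by itself. The code inside the task has to call a GPU library, or a GPU plugin such as the RAPIDS Accelerator for Apache Spark has to be enabled. The minimal sketch below shows how a task might turn its assigned addresses into a device index. The class `GpuAddressSketch` and the helper `firstDeviceIndex` are hypothetical names used for illustration, and the Spark call appears only in a comment so the example runs without Spark on the classpath.

```java
import java.util.Arrays;

/**
 * Sketch: pick a device index from the GPU addresses Spark assigned to a task.
 * Spark only schedules the GPU; the task code must pass this index to
 * whatever GPU library it uses.
 */
final class GpuAddressSketch {

    /** Hypothetical helper: returns the first assigned address as an int device index. */
    static int firstDeviceIndex(String[] addresses) {
        if (addresses == null || addresses.length == 0) {
            throw new IllegalStateException("No GPU address was assigned to this task");
        }
        return Integer.parseInt(addresses[0].trim());
    }

    public static void main(String[] args) {
        // Inside a running Spark 3.x task, the addresses would come from:
        //   TaskContext.get().resourcesJMap().get("gpu").addresses()
        // Here the value seen in the log ("addresses: 0") is hard-coded.
        String[] assigned = {"0"};
        int device = firstDeviceIndex(assigned);
        System.out.println("Assigned GPU addresses: " + Arrays.toString(assigned));
        System.out.println("Device index to pass to the GPU library: " + device);
    }
}
```

If the job only uses standard Spark operations (the log shows a `collect` in `ClusteringMetrics.scala`), there is no GPU code for that index to be passed to, so 0% utilization would be expected under this assumption.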