google-kubernetes-engine - On GKE, dcgm-exporter cannot report pod/container GPU usage, only node-level GPU usage
Problem description
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0-rc.3"
  namespace: zsx
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0-rc.3"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0-rc.3"
      name: "dcgm-exporter"
    spec:
      hostNetwork: true
      nodeSelector:
        zsx: test
      containers:
      - image: "nvidia/dcgm-exporter:2.1.8-2.4.0-rc.3-ubuntu20.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        - name: "DCGM_EXPORTER_DEVICES_STR"
          value: "g"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          privileged: true
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "nvidia-install-dir-host"
          mountPath: "/usr/local/nvidia"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "nvidia-install-dir-host"
        hostPath:
          path: "/var/paas/nvidia"
Here is my metrics output. The pod and container labels are empty on every series (as is namespace):
DCGM_FI_DEV_SM_CLOCK{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 585
DCGM_FI_DEV_MEM_CLOCK{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 5000
DCGM_FI_DEV_GPU_TEMP{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 52
DCGM_FI_DEV_POWER_USAGE{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 29.830000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 51661862752
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_ENC_UTIL{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_DEC_UTIL{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_XID_ERRORS{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_FB_FREE{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 12291
DCGM_FI_DEV_FB_USED{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 2787
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_SM_CLOCK{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 585
DCGM_FI_DEV_MEM_CLOCK{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 5000
DCGM_FI_DEV_GPU_TEMP{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 51
DCGM_FI_DEV_POWER_USAGE{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 28.627000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 48304851551
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_ENC_UTIL{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_DEC_UTIL{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_XID_ERRORS{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_FB_FREE{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 13355
DCGM_FI_DEV_FB_USED{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 1723
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
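To check quickly which GPUs are missing pod attribution in output like the above, a small script can parse the exposition text and group the series whose `pod` label is empty. This is only a diagnostic sketch: the helper names `parse_sample` and `unattributed_gpus` are made up for illustration, and it assumes the text has been saved from the exporter's metrics endpoint (port 9400 in the manifest above) in the same `NAME{labels} VALUE` form shown here.

```python
import re
from collections import defaultdict

# One sample line looks like: NAME{label="value",...} VALUE
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (metric_name, labels, value), or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def unattributed_gpus(metrics_text):
    """Map each GPU index to the metric names whose `pod` label is empty."""
    missing = defaultdict(list)
    for line in metrics_text.splitlines():
        sample = parse_sample(line)
        if sample is None:
            continue
        name, labels, _ = sample
        if not labels.get("pod"):
            missing[labels.get("gpu", "?")].append(name)
    return dict(missing)


if __name__ == "__main__":
    # Two lines copied from the output in the question.
    sample_text = (
        'DCGM_FI_DEV_GPU_TEMP{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",'
        'device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 52\n'
        'DCGM_FI_DEV_GPU_TEMP{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",'
        'device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 51\n'
    )
    for gpu, names in sorted(unattributed_gpus(sample_text).items()):
        print(f"gpu {gpu}: {len(names)} series without a pod label")
```

If every GPU shows up in the result, including GPUs that are known to be allocated to running pods, the exporter is not matching devices to the kubelet's pod-resources data at all, rather than missing a single workload.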
Solution