On GKE, dcgm-exporter cannot report per-pod/container GPU usage, only node-level GPU usage

Problem description

I deployed dcgm-exporter on GKE with the following DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0-rc.3"
  namespace: zsx
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0-rc.3"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0-rc.3"
      name: "dcgm-exporter"
    spec:
      hostNetwork: true
      nodeSelector:
        zsx: test
      containers:
      - image: "nvidia/dcgm-exporter:2.1.8-2.4.0-rc.3-ubuntu20.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        - name: "DCGM_EXPORTER_DEVICES_STR"
          value: "g"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          privileged: true
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
              add: ["SYS_ADMIN"]
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "nvidia-install-dir-host"
          mountPath: "/usr/local/nvidia"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "nvidia-install-dir-host"
        hostPath:
          path: "/var/paas/nvidia"

Here is my metric output; the pod and container labels are empty:

DCGM_FI_DEV_SM_CLOCK{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 585
DCGM_FI_DEV_MEM_CLOCK{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 5000
DCGM_FI_DEV_GPU_TEMP{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 52
DCGM_FI_DEV_POWER_USAGE{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 29.830000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 51661862752
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_ENC_UTIL{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_DEC_UTIL{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_XID_ERRORS{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_FB_FREE{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 12291
DCGM_FI_DEV_FB_USED{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 2787
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="2",UUID="GPU-ad70e6d3-3531-8243-756b-e17e5a1126bb",device="nvidia2",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0

DCGM_FI_DEV_SM_CLOCK{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 585
DCGM_FI_DEV_MEM_CLOCK{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 5000
DCGM_FI_DEV_GPU_TEMP{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 51
DCGM_FI_DEV_POWER_USAGE{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 28.627000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 48304851551
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_ENC_UTIL{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_DEC_UTIL{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_XID_ERRORS{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_FB_FREE{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 13355
DCGM_FI_DEV_FB_USED{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 1723
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="3",UUID="GPU-490eac7c-abb0-94f6-2e23-630071e02313",device="nvidia3",Hostname="mep-gpu-92973",container="",namespace="",pod=""} 0

Tags: google-kubernetes-engine, nvidia

Solution
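
One setting worth checking in this situation, assuming the deployed dcgm-exporter build supports the DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE variable: by default the exporter matches the device IDs returned by the PodResources API against GPU UUIDs, and if the node's device plugin reports device names (for example "nvidia0") instead, no pod is ever matched and the labels stay empty. A sketch of the container env section of the DaemonSet above with the match key switched to device names (this is an assumption about the cause, not a confirmed fix for this cluster):

        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        # Assumption: this variable is available in the image in use; it selects
        # whether pods are matched to GPUs by UUID or by device name.
        - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
          value: "device-name"

After changing the DaemonSet env, the dcgm-exporter pods need to be recreated for the new value to take effect.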

