Kubernetes Pods stuck in Pending state without any reason given

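Problem description

We are using client-go to create Kubernetes Jobs and Deployments. Today, in one of our clusters (Kubernetes v1.18.19), I ran into the following strange problem.

Pods belonging to a Kubernetes Job always get stuck in the Pending state, with no reason given. kubectl describe pod shows no events. Creating the same Job from the host machine (via kubectl) works normally, and the pod eventually starts running.

What surprises me is that creating Deployments works fine and their pods eventually run. Only Kubernetes Jobs are affected. Why? How can I fix this, and what can I check? I have spent hours on this without making any progress.

kubeconfig of the client: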

Mount from host machine, path: /root/.kube/config

kubectl describe job shows:

Name:           unittest
Namespace:      default
Selector:       controller-uid=f3cec901-c0f4-4098-86d7-f9a7d1fe6cd1
Labels:         job-id=unittest
Annotations:    <none>
Parallelism:    1
Completions:    1
Start Time:     Sat, 19 Jun 2021 00:20:12 +0800
Pods Statuses:  1 Running / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=f3cec901-c0f4-4098-86d7-f9a7d1fe6cd1
           job-name=unittest
  Containers:
   unittest:
    Image:      ubuntu:18.04
    Port:       <none>
    Host Port:  <none>
    Command:
      echo hello
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  21m   job-controller  Created pod: unittest-tt5b2

kubectl describe on the target pod shows:

Name:           unittest-tt5b2
Namespace:      default
Priority:       0
Node:           <none>
Labels:         controller-uid=f3cec901-c0f4-4098-86d7-f9a7d1fe6cd1
               job-name=unittest
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  Job/unittest
Containers:
 unittest:
   Image:      ubuntu:18.04
   Port:       <none>
   Host Port:  <none>
   Command:
     echo hello
   Environment:  <none>
   Mounts:
     /var/run/secrets/kubernetes.io/serviceaccount from default-token-72g27 (ro)
Volumes:
 default-token-72g27:
   Type:        Secret (a volume populated by a Secret)
   SecretName:  default-token-72g27
   Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none> 

kubectl get events shows:

55m         Normal    ScalingReplicaSet   deployment/job-scheduler              Scaled up replica set job-scheduler-76b7465d74 to 1
19m         Normal    ScalingReplicaSet   deployment/job-scheduler              Scaled up replica set job-scheduler-74f8896f48 to 1
58m         Normal    SuccessfulCreate    job/unittest                          Created pod: unittest-pp665
49m         Normal    SuccessfulCreate    job/unittest                          Created pod: unittest-xm6ck
17m         Normal    SuccessfulCreate    job/unittest                          Created pod: unittest-tt5b2

Tags: kubernetes, ubuntu-18.04, kubectl, kubernetes-pod, kubernetes-jobs

Solution


I solved the issue.

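We use a custom scheduler for NPU devices and the default scheduler for GPU devices. For GPU devices, the scheduler name is "default-scheduler", not "default". I was passing "default" as the scheduler name for these Jobs, which left their pods stuck in Pending. Since no scheduler named "default" is running, nothing ever picks up the pod, which is also why Node stays <none> and no scheduling events are recorded.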
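
To catch this kind of mistake before the Job reaches the cluster, the scheduler name can be checked on the client side. The following is only a minimal sketch using the Go standard library, not the client-go code from the question: the names knownSchedulers, validateSchedulerName and jobManifest are made up for illustration, and "npu-scheduler" is a hypothetical stand-in for the custom NPU scheduler, whose real name is not given above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// knownSchedulers lists the scheduler names assumed to be running in the cluster.
// "default-scheduler" is the name used by the stock kube-scheduler.
// "npu-scheduler" is a hypothetical placeholder for the custom NPU scheduler.
var knownSchedulers = map[string]bool{
	"default-scheduler": true,
	"npu-scheduler":     true, // assumption, replace with the real name
}

// validateSchedulerName rejects names that no known scheduler would claim.
// An empty name is accepted because the API server fills in
// "default-scheduler" when spec.schedulerName is left unset.
func validateSchedulerName(name string) error {
	if name == "" || knownSchedulers[name] {
		return nil
	}
	return fmt.Errorf("unknown schedulerName %q: the pod would stay Pending with no events", name)
}

// jobManifest builds a minimal batch/v1 Job as JSON, setting
// spec.template.spec.schedulerName only after it has been validated.
func jobManifest(name, image, scheduler string, command []string) ([]byte, error) {
	if err := validateSchedulerName(scheduler); err != nil {
		return nil, err
	}
	podSpec := map[string]interface{}{
		"restartPolicy": "Never", // Jobs require Never or OnFailure
		"containers": []map[string]interface{}{
			{"name": name, "image": image, "command": command},
		},
	}
	if scheduler != "" {
		podSpec["schedulerName"] = scheduler
	}
	job := map[string]interface{}{
		"apiVersion": "batch/v1",
		"kind":       "Job",
		"metadata":   map[string]interface{}{"name": name},
		"spec": map[string]interface{}{
			"template": map[string]interface{}{"spec": podSpec},
		},
	}
	return json.MarshalIndent(job, "", "  ")
}
```

With this guard, calling jobManifest("unittest", "ubuntu:18.04", "default", ...) returns an error instead of producing a Job whose pod nobody schedules, while "default-scheduler" or an empty name yields a manifest. When using client-go directly, the equivalent field is SchedulerName on the pod template's PodSpec.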

