apache-spark - Spark on Kubernetes: spark-local-dir error: already exists / must be unique
Problem description
I am struggling to understand the Spark documentation in order to set up the local directories correctly.
Setup:
I run Spark 3.1.2 on Kubernetes via the Spark Operator. The number of executor pods varies with the job size and the resources available on the cluster. A typical situation: I start a job with 20 requested executors, but 3 pods remain pending and the job finishes with 17 executors.
Underlying problem:
I run into the error "node has insufficient resources: ephemeral-storage", because large amounts of data spill into the default local directories, which Kubernetes creates on the nodes via emptyDir. This is a known issue that should be solved by pointing the local-dir to a mounted persistent volume.
I tried two approaches, but neither works:
Approach 1:
Following the documentation at https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage I added the following options to the spark config:
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "tmp-spark-spill"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "csi-rbd-sc"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "3000Gi"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": ="/spill-data"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"
The full YAML looks like this:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: job1
namespace: spark
spec:
serviceAccount: spark
type: Python
pythonVersion: "3"
mode: cluster
image: "xxx/spark-py:app-3.1.2"
imagePullPolicy: Always
mainApplicationFile: local:///opt/spark/work-dir/nfs/06_dwh_core/jobs/job1/main.py
sparkVersion: "3.0.0"
restartPolicy:
type: OnFailure
onFailureRetries: 0
onFailureRetryInterval: 10
onSubmissionFailureRetries: 0
onSubmissionFailureRetryInterval: 20
sparkConf:
"spark.default.parallelism": "400"
"spark.sql.shuffle.partitions": "400"
"spark.serializer": "org.apache.spark.serializer.KryoSerializer"
"spark.sql.debug.maxToStringFields": "1000"
"spark.ui.port": "4045"
"spark.driver.maxResultSize": "0"
"spark.kryoserializer.buffer.max": "512"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "tmp-spark-spill"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "csi-rbd-sc"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "3000Gi"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": ="/spill-data"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"
driver:
cores: 1
memory: "20G"
labels:
version: 3.1.2
serviceAccount: spark
volumeMounts:
- name: nfs
mountPath: /opt/spark/work-dir/nfs
executor:
cores: 20
instances: 20
memory: "150G"
labels:
version: 3.0.0
volumeMounts:
- name: nfs
mountPath: /opt/spark/work-dir/nfs
volumes:
- name: nfs
nfs:
server: xxx
path: /xxx
readOnly: false
Issue 1:
This fails with an error saying the PVC already exists, and effectively only one executor is created:
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/spark-poc/persistentvolumeclaims. Message: persistentvolumeclaims "tmp-spark-spill" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=persistentvolumeclaims, name=tmp-spark-spill, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=persistentvolumeclaims "tmp-spark-spill" already exists, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
Do I have to define this local-dir claim for every executor individually? Something like:
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "tmp-spark-spill"
.
.
.
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-2.options.claimName": "tmp-spark-spill"
.
.
.
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-3.options.claimName": "tmp-spark-spill"
.
.
.
But how would I do that dynamically when the number of executors keeps changing? Shouldn't it be derived automatically from the executor configuration?
Approach 2:
I created a PVC myself, mounted it as a volume, and set the local dir via the spark config parameter:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-spark-spill
  namespace: spark-poc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3000Gi
  storageClassName: csi-rbd-sc
  volumeMode: Filesystem
Then I mounted it on the executors like this:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: job1
namespace: spark
spec:
serviceAccount: spark
type: Python
pythonVersion: "3"
mode: cluster
image: "xxx/spark-py:app-3.1.2"
imagePullPolicy: Always
mainApplicationFile: local:///opt/spark/work-dir/nfs/06_dwh_core/jobs/job1/main.py
sparkVersion: "3.0.0"
restartPolicy:
type: OnFailure
onFailureRetries: 0
onFailureRetryInterval: 10
onSubmissionFailureRetries: 0
onSubmissionFailureRetryInterval: 20
sparkConf:
"spark.default.parallelism": "400"
"spark.sql.shuffle.partitions": "400"
"spark.serializer": "org.apache.spark.serializer.KryoSerializer"
"spark.sql.debug.maxToStringFields": "1000"
"spark.ui.port": "4045"
"spark.driver.maxResultSize": "0"
"spark.kryoserializer.buffer.max": "512"
"spark.local.dir": "/spill"
driver:
cores: 1
memory: "20G"
labels:
version: 3.1.2
serviceAccount: spark
volumeMounts:
- name: nfs
mountPath: /opt/spark/work-dir/nfs
executor:
cores: 20
instances: 20
memory: "150G"
labels:
version: 3.0.0
volumeMounts:
- name: nfs
mountPath: /opt/spark/work-dir/nfs
- name: pvc-spark-spill
mountPath: /spill
volumes:
- name: nfs
nfs:
server: xxx
path: /xxx
readOnly: false
- name: pvc-spark-spill
persistentVolumeClaim:
claimName: pvc-spark-spill
Issue 2:
This approach fails with the message that /spill must be unique:
Message: Pod "job1-driver" is invalid: spec.containers[0].volumeMounts[7].mountPath: Invalid value: "/spill": must be unique.
Summary and question
It seems that every executor needs its own PVC, or at least its own folder on a PVC, to spill its data. But how do I configure that correctly? I could not work it out from the documentation.
Thanks for your help, Alex
Solution
Spark should be able to create the PVCs dynamically by setting claimName to OnDemand. Attaching multiple pods to the same PVC causes problems on the Kubernetes side: a ReadWriteOnce claim can only be mounted by a single node at a time.
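Applied to the config from approach 1, this amounts to replacing the fixed claim name with the special value OnDemand, so that Spark creates (and cleans up) one claim per executor pod. A sketch, keeping the storage class from the question; the 200Gi sizeLimit is an assumption and now applies per executor, not to one shared volume:

"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "csi-rbd-sc"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "200Gi"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/spill-data"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"

Because the volume name starts with spark-local-dir-, Spark treats the mount as local scratch space automatically; spark.local.dir does not need to be set.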
You can also look at an NFS share, which works outside of Kubernetes-managed volumes. Example: https://www.datamechanics.co/blog-post/apache-spark-3-1-release-spark-on-kubernetes-is-now-ga
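Since an NFS server is already in use for the application code above, a spill directory could also be mounted as an nfs volume directly from the Spark config, which Spark on Kubernetes supports as a volume type since 3.1. A sketch, where the server name and export path are placeholders to be replaced with your own:

"spark.kubernetes.executor.volumes.nfs.spark-local-dir-1.options.server": "nfs.example.com"
"spark.kubernetes.executor.volumes.nfs.spark-local-dir-1.options.path": "/spark-spill"
"spark.kubernetes.executor.volumes.nfs.spark-local-dir-1.mount.path": "/spill-data"
"spark.kubernetes.executor.volumes.nfs.spark-local-dir-1.mount.readOnly": "false"

Unlike a ReadWriteOnce PVC, many pods can read and write the same NFS export concurrently, so no per-executor claim is needed; the trade-off is that spills go over the network instead of to node-local disk.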