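apache-spark - Spark pod restarts every hour in Kubernetes
Question
I have deployed a Spark application in cluster mode on Kubernetes. The Spark application pod restarts almost every hour. The driver log contains these messages just before each restart: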
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: The executor with id 1 was deleted by a user or the framework.
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: The executor with id 2 was deleted by a user or the framework.
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, x.x.x.x, 44879, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 2 (epoch 1)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, y.y.y.y, 46191, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 1)
20/07/11 13:34:02 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
20/07/11 13:34:16 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
The executor log shows:
20/07/11 15:55:01 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/07/11 15:55:01 INFO MemoryStore: MemoryStore cleared
20/07/11 15:55:01 INFO BlockManager: BlockManager stopped
20/07/11 15:55:01 INFO ShutdownHookManager: Shutdown hook called
How can I find out what is causing the executors to be deleted?
Deployment:
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 0 max surge
Pod Template:
  Labels:  app=test
           chart=test-2.0.0
           heritage=Tiller
           product=testp
           release=test
           service=test-spark
  Containers:
   test-spark:
    Image:      test-spark:2df66df06c
    Port:       <none>
    Host Port:  <none>
    Command:
      /spark/bin/start-spark.sh
    Args:
      while true; do sleep 30; done;
    Limits:
      memory:  4Gi
    Requests:
      memory:  4Gi
    Liveness:  exec [/spark/bin/liveness-probe.sh] delay=300s timeout=1s period=30s #success=1 #failure=10
    Environment:
      JVM_ARGS:                             -Xms256m -Xmx1g
      KUBERNETES_MASTER:                    https://kubernetes.default.svc
      KUBERNETES_NAMESPACE:                 test-spark
      IMAGE_PULL_POLICY:                    Always
      DRIVER_CPU:                           1
      DRIVER_MEMORY:                        2048m
      EXECUTOR_CPU:                         1
      EXECUTOR_MEMORY:                      2048m
      EXECUTOR_INSTANCES:                   2
      KAFKA_ADVERTISED_HOST_NAME:           kafka.default:9092
      ENRICH_KAFKA_ENRICHED_EVENTS_TOPICS:  test-events
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   test-spark-5c5997b459 (1/1 replicas created)
Events:          <none>
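Answer
I did some quick research on running Spark on Kubernetes, and it seems that Spark, by design, terminates executor pods once they have finished running the Spark application. Quoting the official Spark documentation:
"When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in "completed" state in the Kubernetes API until it's eventually garbage collected or manually cleaned up."
So I believe the restarts are nothing to worry about, as long as your Spark instance can still start executor pods when it needs them.
Reference: https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#how-it-works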
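To check whether the driver really does replace its executors after each loss, it can help to pull the relevant events out of the driver log. The following is a minimal sketch, not part of the original answer. It assumes the log uses the `yy/MM/dd HH:mm:ss LEVEL Class: message` layout seen in the question, and the names `LOST_EXECUTOR`, `REQUESTED`, `parse_ts` and `summarize` are made up for illustration.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the question's logs, e.g. "20/07/11 13:34:02".
TS = r"(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})"

# "Lost executor <id> on <host>: <reason>" lines from TaskSchedulerImpl.
LOST_EXECUTOR = re.compile(
    "^" + TS + r" ERROR TaskSchedulerImpl: "
    r"Lost executor (?P<id>\d+) on (?P<host>\S+): (?P<reason>.*)$"
)

# "Going to request <n> executors from Kubernetes." lines from ExecutorPodsAllocator.
REQUESTED = re.compile(
    "^" + TS + r" INFO ExecutorPodsAllocator: "
    r"Going to request (?P<n>\d+) executors from Kubernetes\.$"
)


def parse_ts(text):
    """Parse a yy/MM/dd HH:mm:ss timestamp (assumed layout, see above)."""
    return datetime.strptime(text, "%y/%m/%d %H:%M:%S")


def summarize(lines):
    """Return (lost, requests) extracted from driver log lines.

    lost     -> list of (timestamp, executor_id, reason)
    requests -> list of (timestamp, number_of_executors_requested)
    """
    lost, requests = [], []
    for raw in lines:
        line = raw.strip()
        m = LOST_EXECUTOR.match(line)
        if m:
            lost.append((parse_ts(m["ts"]), int(m["id"]), m["reason"]))
            continue
        m = REQUESTED.match(line)
        if m:
            requests.append((parse_ts(m["ts"]), int(m["n"])))
    return lost, requests


if __name__ == "__main__":
    import sys

    # Usage: python summarize_driver_log.py < driver.log
    lost, requests = summarize(sys.stdin)
    for ts, executor_id, reason in lost:
        print(f"{ts}  lost executor {executor_id}: {reason}")
    for ts, n in requests:
        print(f"{ts}  requested {n} executor(s)")
```

Run against the driver log from the question, this would list the two lost executors at 13:34:02 and the two follow-up requests for new executors. The reason string ("deleted by a user or the framework") and the gap between a loss and the next request are the parts worth comparing across restarts.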