Marathon on Mesos failed to update resources of container

Problem description

I want to run Docker containers on Marathon. (Mesos and Marathon themselves both run inside Docker.)

When I run the image directly with a docker run command, everything works fine.

But when the image is launched through Marathon, it is killed after one minute (60 seconds); Marathon then recreates the container, which is killed again a minute later, and so on.
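For context, the app definition submitted to Marathon looks roughly like this. The resource values match the master log further down; the Marathon address and the image name are placeholders, not my actual setup:

curl -X POST http://marathon:8080/v2/apps \
  -H 'Content-Type: application/json' \
  -d '{
        "id": "/jping",
        "cpus": 0.5,
        "mem": 128,
        "disk": 128,
        "instances": 1,
        "container": {
          "type": "DOCKER",
          "docker": {
            "image": "<jping-image>",
            "network": "BRIDGE"
          }
        }
      }'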

The strangest part is that I found the following log line on the mesos-slave:

Failed to update resources for container f891ffca-39c3-4b70-adae-2520864c42b2 of executor 'jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86' running task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/27321/cgroup: No such file or directory

I have researched this problem on the internet, but most reported cases were resolved by increasing memory. That approach does not work for me, even after giving the task all the memory available.
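For example, raising the app's memory through the Marathon REST API (host and value here are illustrative) changed nothing:

curl -X PUT http://marathon:8080/v2/apps/jping \
  -H 'Content-Type: application/json' \
  -d '{"mem": 4096}'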

mesos-slave log

I0703 06:05:25.992172    18 slave.cpp:5283] Handling status update TASK_FAILED (Status UUID: 4f55a18e-37ea-48fc-8f2d-2228f95a7097) for task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003 from executor(1)@mesos1:35786
E0703 06:05:26.070716    14 slave.cpp:5614] Failed to update resources for container f891ffca-39c3-4b70-adae-2520864c42b2 of executor 'jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86' running task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/27321/cgroup: No such file or directory
I0703 06:05:26.070905    19 docker.cpp:2331] Destroying container f891ffca-39c3-4b70-adae-2520864c42b2 in RUNNING state
I0703 06:05:26.070940    19 docker.cpp:2336] Sending SIGTERM to executor with pid: 792
I0703 06:05:26.070894    12 task_status_update_manager.cpp:328] Received task status update TASK_FAILED (Status UUID: 4f55a18e-37ea-48fc-8f2d-2228f95a7097) for task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003
I0703 06:05:26.070981    12 task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update TASK_FAILED (Status UUID: 4f55a18e-37ea-48fc-8f2d-2228f95a7097) for task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003
I0703 06:05:26.071130    12 slave.cpp:5775] Forwarding the update TASK_FAILED (Status UUID: 4f55a18e-37ea-48fc-8f2d-2228f95a7097) for task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003 to master@mesos1:5060
I0703 06:05:26.071277    12 slave.cpp:5684] Sending acknowledgement for status update TASK_FAILED (Status UUID: 4f55a18e-37ea-48fc-8f2d-2228f95a7097) for task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003 to executor(1)@mesos1:35786
I0703 06:05:26.073609    19 docker.cpp:2381] Running docker stop on container f891ffca-39c3-4b70-adae-2520864c42b2
I0703 06:05:26.076584    17 slave.cpp:5907] Got exited event for executor(1)@mesos1:35786
I0703 06:05:26.082994    12 task_status_update_manager.cpp:401] Received task status update acknowledgement (UUID: 4f55a18e-37ea-48fc-8f2d-2228f95a7097) for task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003
I0703 06:05:26.083410    12 task_status_update_manager.cpp:842] Checkpointing ACK for task status update TASK_FAILED (Status UUID: 4f55a18e-37ea-48fc-8f2d-2228f95a7097) for task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003
I0703 06:05:26.171222    14 docker.cpp:2560] Executor for container f891ffca-39c3-4b70-adae-2520864c42b2 has exited
I0703 06:05:26.172829    12 slave.cpp:6305] Executor 'jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86' of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003 terminated with signal Terminated
I0703 06:05:26.172868    12 slave.cpp:6403] Cleaning up executor 'jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86' of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003 at executor(1)@mesos1:35786
I0703 06:05:26.173218    18 gc.cpp:90] Scheduling '/var/tmp/mesos/slaves/82f5f7aa-772c-48b8-b9e9-5675fe0b7fa9-S0/frameworks/6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003/executors/jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86/runs/f891ffca-39c3-4b70-adae-2520864c42b2' for gc 6.99999799573037days in the future

mesos-master log

I0703 06:05:26.071590    15 master.cpp:7962] Status update TASK_FAILED (Status UUID: 4f55a18e-37ea-48fc-8f2d-2228f95a7097) for task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003 from agent 82f5f7aa-772c-48b8-b9e9-5675fe0b7fa9-S0 at slave(1)@mesos1:5051 (mesos1)
I0703 06:05:26.071923    15 master.cpp:8018] Forwarding status update TASK_FAILED (Status UUID: 4f55a18e-37ea-48fc-8f2d-2228f95a7097) for task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003
I0703 06:05:26.072099    15 master.cpp:10278] Updating the state of task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003 (latest state: TASK_FAILED, status update state: TASK_FAILED)
I0703 06:05:26.080749    15 master.cpp:5623] Processing REVIVE call for framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003 (marathon) at scheduler-13cb0ef0-e5fb-40ac-aa5e-d8d7284e409b@mesos1:44408
I0703 06:05:26.080828    15 hierarchical.cpp:1339] Revived offers for roles { * } of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003
I0703 06:05:26.081554    13 master.cpp:8870] Sending 1 offers to framework 2b59b774-1033-4f63-b403-fce174f8155b-0004 (Spark Cluster) at scheduler-48bf92fe-e69d-4af8-8d7e-e22cc5177d02@mesos3:46594
I0703 06:05:26.081902    13 master.cpp:8870] Sending 2 offers to framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003 (marathon) at scheduler-13cb0ef0-e5fb-40ac-aa5e-d8d7284e409b@mesos1:44408
I0703 06:05:26.082159    12 http.cpp:1185] HTTP GET for /master/state?jsonp=angular.callbacks._3w from 10.1.21.12:65271 with User-Agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
I0703 06:05:26.082294    12 master.cpp:5877] Processing ACKNOWLEDGE call 4f55a18e-37ea-48fc-8f2d-2228f95a7097 for task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003 (marathon) at scheduler-13cb0ef0-e5fb-40ac-aa5e-d8d7284e409b@mesos1:44408 on agent 82f5f7aa-772c-48b8-b9e9-5675fe0b7fa9-S0
I0703 06:05:26.082341    12 master.cpp:10382] Removing task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86 with resources cpus(allocated: *):0.5; mem(allocated: *):128; disk(allocated: *):128; ports(allocated: *):[50690-50690] of framework 6f16c868-e43d-4d49-aa57-2dee2bbd782d-0003 on agent 82f5f7aa-772c-48b8-b9e9-5675fe0b7fa9-S0 at slave(1)@mesos1:5051 (mesos1)

Marathon log

[2018-07-03 06:05:26,073] INFO  Received status update for task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86: TASK_FAILED (Failed to get exit status of container) (mesosphere.marathon.MarathonScheduler:Thread-97)
[2018-07-03 06:05:26,075] INFO  all tasks of instance [jping.marathon-eff7501c-7e86-11e8-aea3-22ffdeeedb86] are terminal, requesting to expunge (mesosphere.marathon.core.instance.update.InstanceUpdater$:marathon-akka.actor.default-dispatcher-16)
[2018-07-03 06:05:26,079] INFO  Removed app [/jping] from tracker (mesosphere.marathon.core.task.tracker.InstanceTracker$InstancesBySpec:marathon-akka.actor.default-dispatcher-16)
[2018-07-03 06:05:26,080] INFO  Increasing delay. Task launch delay for [/jping - 2018-07-03T03:49:41.174Z] is set to 2 seconds 313 milliseconds (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-16)
[2018-07-03 06:05:26,080] INFO  receiveInstanceUpdate: instance [jping.marathon-eff7501c-7e86-11e8-aea3-22ffdeeedb86] was deleted (Failed) (mesosphere.marathon.core.launchqueue.impl.TaskLauncherActor:marathon-akka.actor.default-dispatcher-19)
[2018-07-03 06:05:26,080] INFO  Received reviveOffers notification: ReviveOffers$ (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-15)
[2018-07-03 06:05:26,080] INFO  => revive offers NOW, canceling any scheduled revives (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-15)
[2018-07-03 06:05:26,080] INFO  initiating a scale check for runSpec [/jping] due to [instance [jping.marathon-eff7501c-7e86-11e8-aea3-22ffdeeedb86]] Failed (mesosphere.marathon.core.task.update.impl.steps.ScaleAppUpdateStepImpl:marathon-akka.actor.default-dispatcher-16)
[2018-07-03 06:05:26,080] INFO  2 further revives still needed. Repeating reviveOffers according to --revive_offers_repetitions 3 (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-15)
[2018-07-03 06:05:26,080] INFO  => Schedule next revive at 2018-07-03T06:05:31.080Z in 5000 milliseconds, adhering to --min_revive_offers_interval 5000 (ms) (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-15)
[2018-07-03 06:05:26,080] INFO  Acknowledge status update for task jping.eff7501c-7e86-11e8-aea3-22ffdeeedb86: TASK_FAILED (Failed to get exit status of container) (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl:scala-execution-context-global-150)
[2018-07-03 06:05:26,081] INFO  Need to scale /jping from 0 up to 1 instances (mesosphere.marathon.SchedulerActions:scheduler-actions-thread-0)
[2018-07-03 06:05:26,081] INFO  Queueing 1 new instances for /jping to the already 0 queued ones (mesosphere.marathon.SchedulerActions:scheduler-actions-thread-0)
[2018-07-03 06:05:26,081] INFO  add 1 instances to 0 instances to launch 

Marathon task failure message (the screenshot showed a different task ID, but the message and the problem are the same)

Is there a similar issue I can refer to?

Marathon version: 1.6.352

Mesos version: 1.5.1


Solution

I finally found the problem.

The Marathon container task was killed because the PIDs of the processes Docker created for the task could not be resolved: the mesos-slave ran inside its own Docker container, and therefore in its own PID namespace, so the host PID reported for the task (27321 in the log above) did not exist in the /proc the agent could read. Failing the cgroup lookup, the agent destroyed the container. This caused my problem.
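The mismatch is easy to see when the agent runs in a container, here assumed to be named mesos-slave; only the host can resolve the task's PID:

# On the host, the PID reported for the task/executor resolves normally:
cat /proc/27321/cgroup
# Inside the agent's container, which has its own PID namespace, the
# same read fails exactly as in the agent log above:
docker exec mesos-slave cat /proc/27321/cgroup
# cat: /proc/27321/cgroup: No such file or directory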

After I added --pid=host to the docker run command that launches the mesos-slave, the problem was solved.
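A sketch of the corrected agent launch: only --pid=host is the actual fix from this answer; the image name, mounts, and agent flags are illustrative and will differ per setup (the work_dir below matches the path in the agent log):

docker run -d --name mesos-slave \
  --pid=host \
  --net=host \
  --privileged \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /sys:/sys \
  -v /var/tmp/mesos:/var/tmp/mesos \
  <mesos-agent-image> \
  --master=zk://<zk-host>:2181/mesos \
  --containerizers=docker,mesos \
  --work_dir=/var/tmp/mesos

With --pid=host the agent shares the host's PID namespace, so the PIDs of Docker-launched executors exist in its /proc and the cgroup lookup succeeds.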

I found the solution here.

Tags: docker, containers, mesos, marathon
