How to fix "Cannot rerun task because there are no rerun attempts left" in a MATLAB parfor job on an AWS cluster

Problem description

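I launched a batch job on an AWS cluster, and unfortunately it ended with an error after about 2 hours. Before submitting the job, I ran it on my local cluster with a reduced number of loop iterations, and it ran fine. The error message is: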

Task with properties:

ID: 1
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:14
Running Duration: 0 days 1h 42m 50s

Error: All workers aborted during execution of the parfor loop.
Error Stack: parallel_function (line 607)
generic_adaptation (line 75)
Warnings: List warnings
Task with properties:

ID: 2
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 3
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:23
Running Duration: 0 days 1h 42m 42s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 4
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 5
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 6
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:23
Running Duration: 0 days 1h 42m 40s

Error: Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
The worker MATLAB exited or was stopped during task evaluation. MATLAB ended with exit status 9.
Warnings:
Task with properties:

ID: 7
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 8
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 9
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 10
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 11
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 12
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 13
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:23
Running Duration: 0 days 1h 42m 42s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 14
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 15
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:

ID: 16
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s

Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:

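Since everything runs on the local cluster on my PC, I suspect the code itself is fine and the cause of the error lies elsewhere (perhaps the connection to the AWS EC2 cluster, or an internal error on the cluster?).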

Tags: matlab, amazon-web-services, amazon-ec2

Solution
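
No confirmed fix is recorded here, so what follows is only a hedged way to narrow the problem down. Most of the tasks in the output above were cancelled as a consequence of task 2 failing. Only tasks 1, 2 and 6 report their own error, and task 6 says the worker MATLAB ended with exit status 9. On Linux that status often means the process was killed from outside, for example by the operating system when a machine runs out of memory. That would be consistent with the job working locally with fewer loop iterations, but it is an assumption, not something the output proves. The sketch below filters the task list down to the tasks that failed on their own. It assumes the default cluster profile points at the AWS cluster and that the failed job still exists there with ID 5 (taken from "Parent: Job 5").

```matlab
% Sketch: print only the tasks that failed on their own, skipping tasks that
% were merely cancelled because another task terminated abnormally.
% Assumptions: the default profile is the AWS cluster, and the job ID is 5.
c = parcluster();                  % open the default cluster profile
job = findJob(c, 'ID', 5);         % look up the failed job by its ID

if isempty(job)
    error('Job 5 was not found on this cluster.');
end

% Prefix of the follow-on cancellation message seen in the output above.
cancelledPrefix = 'The parallel job was cancelled because';

tasks = job.Tasks;
for k = 1:numel(tasks)
    msg = tasks(k).ErrorMessage;
    % Skip tasks with no error and tasks that were only cancelled.
    if isempty(msg) || strncmp(msg, cancelledPrefix, numel(cancelledPrefix))
        continue
    end
    fprintf('Task %d: %s\n', tasks(k).ID, msg);
end
```

If memory does turn out to be the cause, the usual things to try would be fewer workers per instance or an instance type with more memory.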

