matlab - How to fix "Cannot rerun task because there are no rerun attempts left" in a MATLAB parfor job on an AWS cluster
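Problem description
I launched a batch job on an AWS cluster, and unfortunately it ended with an error after about 2 hours. Before submitting the job, I ran it on my local cluster with fewer loop iterations, and it ran fine. The error message is: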
Task with properties:
ID: 1
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:14
Running Duration: 0 days 1h 42m 50s
Error: All workers aborted during execution of the parfor loop.
Error Stack: parallel_function (line 607)
generic_adaptation (line 75)
Warnings: List warnings
Task with properties:
ID: 2
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 3
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:23
Running Duration: 0 days 1h 42m 42s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 4
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 5
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 6
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:23
Running Duration: 0 days 1h 42m 40s
Error: Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
The worker MATLAB exited or was stopped during task evaluation. MATLAB ended with exit status 9.
Warnings:
Task with properties:
ID: 7
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 8
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 9
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 10
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 11
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 12
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 13
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:23
Running Duration: 0 days 1h 42m 42s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 14
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 15
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
Task with properties:
ID: 16
State: finished
Function: @parallel.internal.cluster.executeScript
Parent: Job 5
StartDateTime: 11-Apr-2019 10:45:22
Running Duration: 0 days 1h 42m 43s
Error: The parallel job was cancelled because the task with ID 2 terminated abnormally for the following reason:
Cannot rerun task because there are no rerun attempts left (The task has no rerun attempts left.).
Original cancel message:
Unexpected error in PostJobEvaluate - MATLAB will now exit and restart.
Warnings:
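Since everything runs on the local cluster on my PC, I suspect the code itself is fine and the cause of the error lies elsewhere (perhaps the connection to the AWS EC2 cluster, or an internal error on the cluster?).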
Solution