python - "The actor ImplicitFunc is too large" error
Problem description
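I recently updated Ray to 1.7. Everything worked with the previous version, but now I get the error The actor ImplicitFunc is too large. I use tune.with_parameters() to pass my dataset to the train function. I also measured the size of every argument I pass to tune.run(); the largest is the training set, at 13 MB. I found the code for measuring the size on discuss.ray.io: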
import pickle
pickled = pickle.dumps(my_object)
length_mib = len(pickled) // (1024 * 1024)
print("Length mb: {}".format(length_mib))
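Note that the snippet above uses integer division, so anything below 1 MiB is reported as 0 and fractions are dropped. A minimal sketch of a float-valued variant follows; the helper name pickled_size_mib and the sample objects are hypothetical, and because Ray serializes functions and actor classes with cloudpickle, numbers from plain pickle should be read as an approximation.

```python
import pickle


def pickled_size_mib(obj):
    """Return the pickled size of obj in MiB as a float (hypothetical helper)."""
    return len(pickle.dumps(obj)) / (1024 * 1024)


# Hypothetical stand-ins for the arguments passed to tune.run().
small = bytes(512 * 1024)        # about 0.5 MiB, would print as 0 with //
large = bytes(2 * 1024 * 1024)   # about 2 MiB

for name, obj in {"small": small, "large": large}.items():
    print("{}: {:.2f} MiB".format(name, pickled_size_mib(obj)))
```

Measuring each argument separately this way can miss objects that are captured implicitly, which is why the bound method itself is worth measuring too.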
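I also removed the entire function body, but the problem persists.
The only suggested fix I found is to use tune.with_parameters(), but the error still occurs.
Here is part of my code: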
def train(self, config, data):
    print("Train")
    net = None
    if self.df:
        net = Net(k1=config["k1"], k2=config["k2"], out1=config["out1"], out2=config["out2"], L1=config["l1"])
    else:
        net = Net()
    net.to(self.device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)
    trainloader = torch.utils.data.DataLoader(
        data[0],
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)
    valloader = torch.utils.data.DataLoader(
        data[1],
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)
    # Trains the network
    with tune.checkpoint_dir(epoch) as checkpoint_dir:
        path = os.path.join(checkpoint_dir, "checkpoint")
        torch.save((net.state_dict(), optimizer.state_dict()), path)
    tune.report(loss=(val_loss / val_steps), accuracy=correct / total)  # eval(self.part.init_data["val"]["label"].to_numpy(), predicted_labels.astype(int))["F1"])
def main(self, num_samples=50, max_num_epochs=20, gpus_per_trial=1):
    config = None
    print("Main")
    config = {
        "l1": tune.sample_from(lambda _: 2**np.random.randint(2, 10)),
        "lr": tune.loguniform(1e-4, 1e-1),
        "k1": tune.choice([4, 5]),
        "k2": tune.choice([4, 5]),
        "out1": tune.choice([16, 32, 64, 128]),
        "out2": tune.choice([16, 32, 64, 128]),
        "batch_size": tune.choice([16, 32, 50, 64, 128]),
        "epoch": tune.choice([5, 10, 15, 20, 25, 30, 40, 50, 75, 100])
    }
    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)
    result = tune.run(
        tune.with_parameters(self.train, data=(self.train_data, self.val)),
        resources_per_trial={"cpu": 4, "gpu": 1},
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=ExperimentTerminationReporter(),
        verbose=1)
And the full log:
2021-10-29 18:01:03,649 INFO services.py:1250 -- View the Ray dashboard at http://127.0.0.1:8265
2021-10-29 18:01:04,916 WARNING function_runner.py:558 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
2021-10-29 18:01:11,619 ERROR ray_trial_executor.py:599 -- Trial train_6c120_00000: Unexpected error starting runner.
Traceback (most recent call last):
File "/home/.../anaconda3/envs/raytune/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 590, in start_trial
return self._start_trial(trial, checkpoint, train=train)
File "/home/.../anaconda3/envs/raytune/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 465, in _start_trial
runner = self._setup_remote_runner(trial)
File "/home/.../anaconda3/envs/raytune/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 382, in _setup_remote_runner
return full_actor_class.remote(**kwargs)
File "/home/.../anaconda3/envs/raytune/lib/python3.8/site-packages/ray/actor.py", line 480, in remote
return actor_cls._remote(
File "/home/.../anaconda3/envs/raytune/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 371, in _invocation_actor_class_remote_span
return method(self, args, kwargs, *_args, **_kwargs)
File "/home/.../anaconda3/envs/raytune/lib/python3.8/site-packages/ray/actor.py", line 713, in _remote
worker.function_actor_manager.export_actor_class(
File "/home/.../anaconda3/envs/raytune/lib/python3.8/site-packages/ray/_private/function_manager.py", line 383, in export_actor_class
check_oversized_function(actor_class_info["class"],
File "/home/.../anaconda3/envs/raytune/lib/python3.8/site-packages/ray/_private/utils.py", line 641, in check_oversized_function
raise ValueError(error)
ValueError: The actor ImplicitFunc is too large (177 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB). Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use ray.put() to put large objects in the Ray object store.
Also, with Ray 1.6.0 I get the warning The actor ImplicitFunc is very large (88 MiB), but it works.
Important update: I found that (self.train, (self.train_data, self.val)) is 146 MB when pickled. Using tune.with_parameters() still does not solve anything. Any help is much appreciated.
Solution
I just found the problem. My setup lives inside an object, so passing self to train() was what "overloaded" the system.
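One way to read this: a bound method such as self.train keeps a reference to its instance in __self__, so serializing the method also serializes everything the instance holds. The sketch below is an illustration under assumptions, not the author's code: the Experiment class, its attribute sizes, and train_fn are hypothetical, and plain pickle on the instance's attribute dict is used as a rough proxy for what Ray would have to ship.

```python
import pickle


def pickled_size_mib(obj):
    """Return the pickled size of obj in MiB as a float (hypothetical helper)."""
    return len(pickle.dumps(obj)) / (1024 * 1024)


class Experiment:
    """Hypothetical stand-in for the object that owns train()."""

    def __init__(self):
        self.train_data = bytes(4 * 1024 * 1024)    # about 4 MiB
        self.val = bytes(1 * 1024 * 1024)           # about 1 MiB
        self.other_state = bytes(16 * 1024 * 1024)  # unrelated state, about 16 MiB

    def train(self, config, data):
        pass


def train_fn(config, data):
    """Module-level alternative that holds no reference to any instance."""
    pass


exp = Experiment()
bound = exp.train

# The bound method points back at the whole instance.
assert bound.__self__ is exp

# Size of everything reachable through self vs. the datasets alone.
state_mib = pickled_size_mib(vars(bound.__self__))
data_mib = pickled_size_mib((exp.train_data, exp.val))
print("instance state: {:.1f} MiB".format(state_mib))
print("datasets only:  {:.1f} MiB".format(data_mib))

# A plain function has no __self__, so nothing extra is attached to it.
has_self = hasattr(train_fn, "__self__")
print("train_fn carries an instance:", has_self)
```

Under that reading, a possible fix is to move the training code into a module-level function (train_fn here) and pass it to tune.with_parameters() with only the datasets as data, so the instance never travels with the trainable. The answer does not say whether this is exactly what the author ended up doing.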