python - Simultaneous reads of the same PyTorch torchvision.datasets object
Question
Consider the following piece of code, which fetches a data set for training from torchvision.datasets and creates a DataLoader for it.
import torch
from torchvision import datasets, transforms
training_set_mnist = datasets.MNIST('./mnist_data', train=True, download=True)
train_loader_mnist = torch.utils.data.DataLoader(training_set_mnist, batch_size=128,
                                                 shuffle=True)
Assume that several Python processes have access to the folder ./mnist_data and execute the above piece of code simultaneously; in my case, each process is a different machine on a cluster and the data set is stored in an NFS location accessible to everyone. You may also assume that the data is already downloaded to this folder, so download=True should have no effect. Moreover, each process may use a different seed, as set by torch.manual_seed().
I would like to know whether this scenario is allowed in PyTorch. My main concern is whether the above code can change the data folders or files in ./mnist_data such that, if run by multiple processes, it could lead to unexpected behavior or other issues. Also, given that shuffle=True, I would expect that if two or more processes create the DataLoader, each of them will get a different shuffling of the data, assuming the seeds are different. Is this true?
Answer
My main concern is whether the above code can change the data folders or files in ./mnist_data such that, if run by multiple processes, it could lead to unexpected behavior or other issues.
You will be fine, as the processes are only reading the data, not modifying it (in the case of MNIST, the tensors holding the data are loaded into RAM). Note that the processes do not share memory addresses, so the tensor with the data will be loaded multiple times, once per process (which shouldn't be a big problem in the case of MNIST).
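To illustrate the per-process side, here is a minimal sketch using a small in-memory TensorDataset as a hypothetical stand-in for MNIST; the seeded torch.Generator passed to the DataLoader is what each process would configure differently. Nothing on disk is written; each process only reads and keeps its own copy of the data in RAM.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for MNIST: a tiny read-only dataset in RAM.
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# Each process can pass its own seeded generator to the DataLoader,
# so the shuffling differs across processes.
def make_loader(seed):
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(dataset, batch_size=4, shuffle=True, generator=g)

loader = make_loader(42)
all_labels = []
for xb, yb in loader:
    all_labels.extend(yb.tolist())  # a real training step would go here

# Every sample appears exactly once per epoch, just in a shuffled order.
print(sorted(all_labels) == list(range(10)))
```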
Also, given that shuffle=True, I would expect that if two or more processes try to create the DataLoader, each of them will get a different shuffling of the data, assuming the seeds are different.
shuffle=True has nothing to do with the data itself. What it does is take the __len__() of the provided dataset, build the range [0, __len__()), shuffle that range, and use the shuffled indices to index the dataset's __getitem__. Check out the Sampler section of the torch.utils.data documentation for more info.
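The mechanism above can be sketched in a few lines; this mimics what the default RandomSampler does (shuffle a range of indices, then index the dataset), using a plain list as a stand-in dataset:

```python
import torch

# Stand-in dataset: __getitem__ is just list indexing here.
data = [f"sample_{i}" for i in range(10)]

def epoch_order(seed):
    # What shuffle=True does under the hood: permute [0, len(dataset))
    # with a seeded generator, then use the permuted indices.
    g = torch.Generator()
    g.manual_seed(seed)
    perm = torch.randperm(len(data), generator=g)
    return [data[i] for i in perm]

# The same seed reproduces the same order; in general, different seeds
# give different orders, which is why differently seeded processes see
# different shufflings of the same underlying data.
print(epoch_order(0) == epoch_order(0))
```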