python-3.x - Python code slows down when wrapped in a function
Problem description
I am reading and processing a file (using the same bit of code) at two drastically different speeds: 1. scripted (50K+ iterations per second) and 2. wrapped in a function (~300 iterations per second). I really cannot figure out why the reading time differs so enormously.
Module structure (unused and irrelevant files omitted; the code is at the end):
| experiments/
|--| experiment_runner.py
|
| module/
|--| shared/
|--|--| dataloaders.py
|--|--| data.py
In data.py we have the method that actually loads the file (load; the class wrapping it inherits from torch.utils.data.Dataset). In dataloaders.py I prepare the arguments to be passed to load, wrapped in one function per dataset I am using. That is then passed to a loader function, which handles splitting the dataset and so on.
experiment_runner.py is where the speed difference shows up. If I use the dataset function from dataloaders.py, loading runs at roughly 300 iterations/second. If I copy the code out of that function and put it directly into experiment_runner.py, still using the loader function from dataloaders.py (so, not wrapped in a per-dataset function), loading runs at roughly 50,000 iterations/second. I am at a complete loss as to why wrapping the code in a function changes its speed so drastically.
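(The iteration counts above come from tqdm's progress bar. One way to narrow a discrepancy like this down, not part of the original setup, is to profile both call paths and compare where the time actually goes. A minimal sketch, using only the standard library; profile_call is a hypothetical helper name:)

```python
import cProfile
import pstats

def profile_call(fn, *args, **kwargs):
    """Run fn once under cProfile and print the ten most expensive
    entries by cumulative time, then return fn's result."""
    prof = cProfile.Profile()
    result = prof.runcall(fn, *args, **kwargs)
    pstats.Stats(prof).sort_stats('cumulative').print_stats(10)
    return result
```

Profiling the wrapped and the unwrapped loading code this way and comparing the two tables should show which calls dominate in the slow path.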
Now the actual code:
data.py:
def load(self, dataset: str = 'train', skip_header = True, **kwargs) -> None:
    fp = open(self.data_files[dataset])
    if skip_header:
        next(fp)
    data = []

    for line in tqdm(self.reader(fp), desc = f'loading {self.name} ({dataset})'):
        data_line, datapoint = {}, base.Datapoint()

        for field in self.train_fields:
            idx = field.index if self.ftype in ['CSV', 'TSV'] else field.cname
            data_line[field.name] = self.process_doc(line[idx].rstrip())
            data_line['original'] = line[idx].rstrip()

        for field in self.label_fields:
            idx = field.index if self.ftype in ['CSV', 'TSV'] else field.cname
            if self.label_preprocessor:
                data_line[field.name] = self.label_preprocessor(line[idx].rstrip())
            else:
                data_line[field.name] = line[idx].rstrip()

        for key, val in data_line.items():
            setattr(datapoint, key, val)
        data.append(datapoint)
    fp.close()

    if self.length is None:
        # Get the max length
        lens = []
        for doc in data:
            for f in self.train_fields:
                lens.append(len([tok for tok in getattr(doc, getattr(f, 'name'))]))
        self.length = max(lens)

    if dataset == 'train':
        self.data = data
    elif dataset == 'dev':
        self.dev = data
    elif dataset == 'test':
        self.test = data
dataloaders.py:
def loader(args: dict, **kwargs):
    """Loads the dataset.

    :args (dict): Dict containing arguments to load dataset.
    :returns: Loaded and split dataset.
    """
    dataset = GeneralDataset(**args)
    dataset.load('train', **kwargs)

    if (args['dev'], args['test']) == (None, None):  # Only train set is given.
        dataset.split(dataset.data, [0.8, 0.1, 0.1], **kwargs)
    elif args['dev'] is not None and args['test'] is None:  # Dev set is given, test is not.
        dataset.load('dev', **kwargs)
        dataset.split(dataset.data, [0.8], **kwargs)
    elif args['dev'] is None and args['test'] is not None:  # Test is given, dev is not.
        dataset.split(dataset.data, [0.8], **kwargs)
        dataset.dev_set = dataset.test
        dataset.load('test', **kwargs)
    else:  # Both dev and test sets are given.
        dataset.load('dev', **kwargs)
        dataset.load('test', **kwargs)

    return dataset

def binarize(label: str) -> str:
    if label in ['0', '1']:
        return 'pos'
    else:
        return 'neg'

def datal(path: str, cleaners: base.Callable, preprocessor: base.Callable = None):
    args = {'data_dir': path,
            'ftype': 'csv',
            'fields': None,
            'train': 'dataset.csv', 'dev': None, 'test': None,
            'train_labels': None, 'dev_labels': None, 'test_labels': None,
            'sep': ',',
            'tokenizer': lambda x: x.split(),
            'preprocessor': preprocessor,
            'transformations': None,
            'length': None,
            'label_preprocessor': binarize,
            'name': 'First dataset.'
            }

    ignore = base.Field('ignore', train = False, label = False, ignore = True)
    d_text = base.Field('text', train = True, label = False, ignore = False, ix = 6, cname = 'text')
    d_label = base.Field('label', train = False, label = True, cname = 'label', ignore = False, ix = 5)

    args['fields'] = [ignore, ignore, ignore, ignore, ignore, d_label, d_text]

    return loader(args)
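(To make the branching in loader concrete: with the args that datal builds, dev=None and test=None, only the first branch runs, so the single train file is split 80/10/10. The dispatch reduced to a pure function, hypothetical and for illustration only:)

```python
def pick_branch(dev, test):
    # Mirrors the if/elif chain in loader: which combination of dev/test
    # files is supplied decides which load/split calls are made.
    if (dev, test) == (None, None):
        return 'split train 80/10/10'
    elif dev is not None and test is None:
        return 'load dev; split train 80/20'
    elif dev is None and test is not None:
        return 'split train 80/20; reuse split as dev; load test'
    else:
        return 'load dev and test'
```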
And, for the purposes of this question, experiment_runner.py:
from module.dataloaders import datal, loader

dataset = datal()  # Slow: 300-ish iterations/second

# Fast version: 50000 iter/second
def binarize(label: str) -> str:
    if label in ['0', '1']:
        return 'pos'
    else:
        return 'neg'

args = {'data_dir': path,
        'ftype': 'csv',
        'fields': None,
        'train': 'dataset.csv', 'dev': None, 'test': None,
        'train_labels': None, 'dev_labels': None, 'test_labels': None,
        'sep': ',',
        'tokenizer': lambda x: x.split(),
        'preprocessor': preprocessor,
        'transformations': None,
        'length': None,
        'label_preprocessor': binarize,
        'name': 'First dataset.'
        }

ignore = base.Field('ignore', train = False, label = False, ignore = True)
d_text = base.Field('text', train = True, label = False, ignore = False, ix = 6, cname = 'text')
d_label = base.Field('label', train = False, label = True, cname = 'label', ignore = False, ix = 5)

args['fields'] = [ignore, ignore, ignore, ignore, ignore, d_label, d_text]

dataset = loader(args)
Ideally, I would prefer to keep the dataset functions (e.g. datal) wrapped to keep the logic separated, but with this speed decrease that is not feasible.
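(For context, not from the question itself: in CPython, wrapping code in a function normally makes it slightly faster, because names inside a function compile to array-indexed fast-local lookups, whereas module-level names go through a dictionary lookup on every access. That is what makes the observed slowdown so surprising. A rough sketch of the usual direction of the effect, using only the standard library:)

```python
import time

# The same loop compiled at module level (names are dict-based) versus
# inside a function (names are fast locals). On CPython the function
# version is typically noticeably faster.
code = "s = 0\nfor i in range(1_000_000):\n    s += i"

ns = {}
t0 = time.perf_counter()
exec(code, ns)  # module-style compilation: s and i use dict lookups
global_time = time.perf_counter() - t0

def summed():
    s = 0
    for i in range(1_000_000):
        s += i
    return s

t0 = time.perf_counter()
total = summed()
local_time = time.perf_counter() - t0

print(f'module level: {global_time:.3f}s  function: {local_time:.3f}s')
```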