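python - Pandas - read_csv in chunks with overlap between chunks
Problem description
Problem statement
How can I read a csv file in chunks with pandas so that consecutive chunks overlap?
For example, suppose the list indexes represents the index of some dataframe I want to read in.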
indexes = [0,1,2,3,4,5,6,7,8,9]
read_csv(filename, chunksize=None):
indexes = [0,1,2,3,4,5,6,7,8,9] # read in all indexes at once
read_csv(filename, chunksize=5):
indexes = [[0,1,2,3,4], [5,6,7,8,9]] # iteratively read in mutually exclusive index sets
read_csv(filename, chunksize=5, overlap=2):
indexes = [[0,1,2,3,4], [3,4,5,6,7], [6,7,8,9]] # iteratively read in index sets with overlap size 2
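The windowing described above can be sketched in plain Python, independent of pandas. The function name `overlapping_windows` and its parameters are hypothetical (pandas has no `overlap` argument); the sketch only assumes that a window is dropped when it would contain no rows beyond those already seen.

```python
def overlapping_windows(indexes, chunksize, overlap):
    """Split `indexes` into windows of `chunksize` items that share `overlap` items.

    Hypothetical helper for illustration; not part of pandas.
    """
    if chunksize <= 0 or not 0 <= overlap < chunksize:
        raise ValueError("need chunksize > 0 and 0 <= overlap < chunksize")
    n = len(indexes)
    if n == 0:
        return []
    step = chunksize - overlap  # how far each window start advances
    # Stop before a window that would hold only rows already seen.
    stop = max(n - overlap, 1)
    return [indexes[start:start + chunksize] for start in range(0, stop, step)]


print(overlapping_windows(list(range(10)), chunksize=5, overlap=2))
# [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9]]
```

With `overlap=0` this reduces to the mutually exclusive chunks of the second case.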
Working solution
I have a hacky solution using skiprows and nrows, but it gets progressively slower as it works through the csv file.
indexes = [*range(10)]
chunksize = 5
overlap_count = 2
row_count = len(indexes) # this I can work out before reading the whole file in rather cheaply
# final chunk here may be janky, assume it works for now (it's more about the logic)
chunked_indexes = [(i, i + chunksize) for i in range(0, row_count, chunksize - overlap_count)]
for chunk in chunked_indexes:
    skiprows = [*range(chunk[0], chunk[1])]
    pd.read_csv(filename, skiprows=skiprows, nrows=chunksize)
Does anyone have any insight into this problem, or an improved solution?
Solution
I think you should pass a number to skiprows rather than a list. Try:
for i in range(0, row_count - overlap_count, chunksize - overlap_count):
    print(pd.read_csv('test.csv',
                      skiprows=i + 1,   # +1 because the first row is the header
                      nrows=chunksize,
                      index_col=0,      # this is how I saved my csv
                      header=None)      # you may need to read the header beforehand
          .index)
Int64Index([0, 1, 2, 3, 4], dtype='int64', name=0)
Int64Index([3, 4, 5, 6, 7], dtype='int64', name=0)
Int64Index([6, 7, 8, 9], dtype='int64', name=0)
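The skiprows approach reopens the file for every chunk and has to pass over all skipped lines again, which is probably why it slows down further into the file. A possible alternative, not taken from the answer above, is to read the file once with pandas' own chunksize iterator and carry the last rows of each window over into the next. This is a minimal sketch: `read_csv_overlapping` and its `overlap` parameter are hypothetical names, and the sample data is made up for the demonstration.

```python
import io

import pandas as pd


def read_csv_overlapping(source, chunksize, overlap, **read_csv_kwargs):
    """Yield DataFrames of up to `chunksize` rows that share `overlap` rows.

    Hypothetical helper, not a pandas API. The file is read once, in pieces
    of `chunksize - overlap` rows, and windows are assembled from a buffer.
    """
    if chunksize <= 0 or not 0 <= overlap < chunksize:
        raise ValueError("need chunksize > 0 and 0 <= overlap < chunksize")
    step = chunksize - overlap
    buffer = None
    yielded_any = False
    for piece in pd.read_csv(source, chunksize=step, **read_csv_kwargs):
        buffer = piece if buffer is None else pd.concat([buffer, piece])
        while len(buffer) >= chunksize:
            yield buffer.iloc[:chunksize]
            yielded_any = True
            buffer = buffer.iloc[step:]  # keep the trailing `overlap` rows
    # Emit the remainder only if it holds rows that were not yielded before.
    already_seen = overlap if yielded_any else 0
    if buffer is not None and len(buffer) > already_seen:
        yield buffer


# Made-up sample data: ten rows with index 0..9.
csv_text = "idx,val\n" + "".join(f"{i},{i * 10}\n" for i in range(10))
for chunk in read_csv_overlapping(io.StringIO(csv_text), chunksize=5, overlap=2, index_col=0):
    print(chunk.index.tolist())
# [0, 1, 2, 3, 4]
# [3, 4, 5, 6, 7]
# [6, 7, 8, 9]
```

Because the header is parsed once by read_csv, there is no need for header=None or the +1 offset here. The trade-off is that the buffer briefly holds up to chunksize + step rows in memory.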