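python - How can I read 10 records at a time from a csv in python or pyspark?
Problem description
I have a csv file with 100,000 rows. I want to read 10 rows at a time, process each row and save it to its own file, then sleep for 5 seconds before the next batch. I am trying islice, but it only reads the first 10 rows and stops. I want the program to run until EOF. In case it helps, I am using jupyter, python2 and pyspark.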
from itertools import islice
from time import sleep

with open("per-vehicle-records-2020-01-31.csv") as f:
    while True:
        # Take the next 10 lines; an empty list means EOF was reached
        next_n_lines = list(islice(f, 10))
        if not next_n_lines:
            break
        print(next_n_lines)
        sleep(5)
This does not separate the rows. It combines the 10 lines into a single list:
['"cosit","year","month","day","hour","minute","second","millisecond","minuteofday","lane","lanename","straddlelane","straddlelanename","class","classname","length","headway","gap","speed","weight","temperature","duration","validitycode","numberofaxles","axleweights","axlespacings"\n', '"000000000997","2020","1","31","1","30","2","0","90","1","Test1","0","","5","HGV_RIG","11.4","2.88","3.24","70.0","0.0","0.0","0","0","0","",""\n', '"000000000997","2020","1","31","1","30","3","0","90","2","Test2","0","","2","CAR","5.2","3.17","2.92","71.0","0.0","0.0","0","0","0","",""\n', '"000000000997","2020","1","31","1","30","5","0","90","1","Test1","0","","2","CAR","5.1","2.85","2.51","70.0","0.0","0.0","0","0","0","",""\n', '"000000000997","2020","1","31","1","30","6","0","90","2","Test2","0","","2","CAR","5.1","3.0","2.94","69.0","0.0","0.0","0","0","0","",""\n', '"000000000997","2020","1","31","1","30","9","0","90","1","Test1","0","","5","HGV_RIG","11.5","3.45","3.74","70.0","0.0","0.0","0","0","0","",""\n', '"000000000997","2020","1","31","1","30","10","0","90","2","Test2","0","","2","CAR","5.4","3.32","3.43","71.0","0.0","0.0","0","0","0","",""\n', '"000000000997","2020","1","31","1","30","13","0","90","2","Test2","0","","2","CAR","5.3","3.19","3.23","71.0","0.0","0.0","0","0","0","",""\n', '"000000000997","2020","1","31","1","30","13","0","90","1","Test1","0","","2","CAR","5.2","3.45","3.21","70.0","0.0","0.0","0","0","0","",""\n', '"000000000997","2020","1","31","1","30","16","0","90","1","Test1","0","","5","HGV_RIG","11.0","2.9","3.13","69.0","0.0","0.0","0","0","0","",""\n']
Solution
This should work:
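import time

import pandas as pd

path_data = 'per-vehicle-records-2020-01-31.csv'
# chunksize=10 makes read_csv return an iterator of DataFrames of up to 10 rows each.
# The sample rows in the question are comma-separated, so the default sep=',' is used.
reader = pd.read_csv(path_data, chunksize=10)
for df in reader:
    # Each df is one chunk; do not call next(reader) here, or every other chunk is skipped
    print(df)
    time.sleep(5)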
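With chunksize=10, read_csv returns an iterator that yields DataFrames of up to 10 rows each; the for loop walks through those chunks and sleeps 5 seconds between iterations.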
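If pandas is not available, the same batching can be sketched with the standard library alone. The example below is a minimal sketch, not a definitive implementation: it assumes Python 3, and the function name process_in_batches, the out_dir argument and the row_000001.csv naming scheme are hypothetical choices made for illustration. It parses each line into fields with csv.reader (so the rows are no longer raw strings in one list), writes every row to its own file together with the header, and pauses after each batch of 10.

```python
import csv
import os
import time
from itertools import islice


def process_in_batches(csv_path, out_dir, batch_size=10, pause_seconds=5):
    """Read csv_path batch_size rows at a time and write each row to its own file.

    Returns the number of data rows written. The name, the arguments and the
    output file naming are illustrative assumptions, not part of the question.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = 0
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)  # first line holds the column names
        if header is None:
            return written  # empty file, nothing to do
        while True:
            # islice pulls at most batch_size parsed rows from the reader
            batch = list(islice(reader, batch_size))
            if not batch:
                break  # an empty batch means EOF was reached
            for row in batch:
                written += 1
                out_path = os.path.join(out_dir, "row_{:06d}.csv".format(written))
                with open(out_path, "w", newline="") as out:
                    writer = csv.writer(out)
                    writer.writerow(header)
                    writer.writerow(row)
            time.sleep(pause_seconds)  # pause before reading the next batch
    return written
```

A call such as process_in_batches("per-vehicle-records-2020-01-31.csv", "rows") would then walk the whole file to EOF, 10 rows at a time. Passing pause_seconds=0 is handy for a quick dry run.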