python - Export dask groups to csv
Problem description
I have a single large file: 40,955,924 lines and more than 13 GB. I need to split it into individual files based on a single field. If I were using a pd.DataFrame, I would use this:
    for k, v in df.groupby(['id']):
        v.to_csv(k, sep='\t', header=True, index=False)
However, with a dask dataframe I get the error KeyError: 'Column not found: 0'.
There is a solution to this specific error in "Iterate over GroupBy object in dask", but it requires using pandas to store a copy of the dataframe, which I cannot do. Any help on splitting this file up would be greatly appreciated.
Solution
You want to use apply() for this:
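    def do_to_csv(df):
        # df.name holds the group key; it is used as the output file name
        df.to_csv(df.name, sep='\t', header=True, index=False)
        return df

    # Computing on the result is what triggers the writes
    df.groupby(['id']).apply(do_to_csv, meta=df._meta).size.compute()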
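Notes:
- the group key is stored in the dataframe's name attribute (df.name), which is why it can be used as the file name
- we return the dataframe and supply a meta; this is not strictly necessary, but you need to compute on something, and it is convenient to know exactly what that thing is
- the final output is the size of the returned result; for a DataFrame, size counts elements (rows times columns) rather than rows, so divide by the number of columns to get the number of rows written.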
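If dask is not available, the same split can be sketched with the standard library alone. This is a swapped-in technique, not the groupby/apply approach above: it streams the file once, row by row, so memory use stays flat regardless of file size. The function name split_by_field, the ".tsv" output suffix, and the assumption that the input is tab-separated with a header row are all illustrative and not taken from the question. It also assumes each key is a safe file name and that the number of distinct ids stays below the operating system's open-file limit.

```python
import csv
from contextlib import ExitStack
from pathlib import Path


def split_by_field(src, out_dir, field="id", sep="\t"):
    """Stream src once, writing one delimited file per distinct value of field.

    Hypothetical helper: returns a dict mapping each key to the number of
    data rows written for it.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    writers = {}  # key -> csv.DictWriter for that key's output file
    counts = {}   # key -> number of data rows written
    # ExitStack closes every per-key output file when the block exits.
    with ExitStack() as stack, open(src, newline="") as fh:
        reader = csv.DictReader(fh, delimiter=sep)
        for row in reader:
            key = row[field]
            if key not in writers:
                # First time this key is seen: open its file and write the header.
                out = stack.enter_context(
                    open(out_dir / f"{key}.tsv", "w", newline="")
                )
                writer = csv.DictWriter(
                    out,
                    fieldnames=reader.fieldnames,
                    delimiter=sep,
                    lineterminator="\n",
                )
                writer.writeheader()
                writers[key] = writer
                counts[key] = 0
            writers[key].writerow(row)
            counts[key] += 1
    return counts
```

Calling split_by_field("big_file.tsv", "out") (both paths are placeholders) would leave one file per id under out/ and return the per-id row counts, which gives a direct check on the number of rows written.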