python - Converting a Dask Bag of Pandas DataFrames to a single Dask DataFrame
Question
Summary
Short version
How do I go from a Dask Bag of Pandas DataFrames to a single Dask DataFrame?
Long version
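I have a number of files that none of dask.dataframe's read functions (e.g. dd.read_csv or dd.read_parquet) can read. I do have my own function that reads them in as Pandas DataFrames (it only handles one file at a time, much like pd.read_csv). I would like to put all of these individual Pandas DataFrames into one large Dask DataFrame.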
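Minimal working example
Here is some sample CSV data (my data is not actually in CSV format; I use it here to keep the example simple). To create a minimal working example, save it as a CSV file, make a few copies, and then use the code below.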
"gender","race/ethnicity","parental level of education","lunch","test preparation course","math score","reading score","writing score"
"female","group B","bachelor's degree","standard","none","72","72","74"
"female","group C","some college","standard","completed","69","90","88"
"female","group B","master's degree","standard","none","90","95","93"
"male","group A","associate's degree","free/reduced","none","47","57","44"
"male","group C","some college","standard","none","76","78","75"
from glob import glob
import pandas as pd
import dask.bag as db
files = glob('/path/to/your/csvs/*.csv')
bag = db.from_sequence(files).map(pd.read_csv)
What I have tried so far
import pandas as pd
import dask.bag as db
import dask.dataframe as dd
# Create a Dask bag of pandas dataframes
bag = db.from_sequence(list_of_files).map(my_reader_function)
df = bag.map(lambda x: x.to_records()).to_dataframe() # this doesn't work
df = bag.map(lambda x: x.to_dict(orient = <any option>)).to_dataframe() # neither does this
# This gets me really close. It's a bag of Dask DataFrames.
# But I can't figure out how to concatenate them together
df = bag.map(dd.from_pandas, npartitions = 1)
df = dd.from_delayed(bag) # returns an error
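Solution
I recommend using dask.delayed together with dask.dataframe; there is a good example that does exactly what you want. The idea is to wrap your reader function in dask.delayed, call it once per file, and pass the resulting list of delayed objects to dd.from_delayed, which returns a single Dask DataFrame with one partition per file.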