How to keep a Dask DataFrame after a groupby compute()

Problem description

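I have a Dask DataFrame. After a groupby, I found that I cannot work with the grouping columns until I call compute(). But calling compute() turns the Dask DataFrame into a pandas DataFrame, so the advantage of Dask is lost. I want to keep a Dask DataFrame throughout. Details below: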

import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({"name":["Jack","Jack","Willom","Willom","James","James","Morgan"],
                   "fix_num":[50,50,70,70,90,90,100],
                   "score1":[50,60,70,80,90,40,60],
                   "score2":[90,50,30,40,100,80,80]})
ddf = dd.from_pandas(df, npartitions=1)

ddf.compute()
     name  fix_num  score1  score2
0    Jack       50      50      90
1    Jack       50      60      50
2  Willom       70      70      30
3  Willom       70      80      40
4   James       90      90     100
5   James       90      40      80
6  Morgan      100      60      80
def _element_coment(t):
    a = t["score1"].sum()
    b = t["score2"].sum()
    return pd.Series((a, b), index=['sum_1', 'sum_2'])


grp = ddf.groupby(['name','fix_num'])\
         .apply(_element_coment,meta={'sum_1':int, 'sum_2':int})\
         .reset_index()

judg = grp.fix_num <= grp.sum_2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 3387, in __getattr__
    raise AttributeError("'DataFrame' object has no attribute %r" % key)
AttributeError: 'DataFrame' object has no attribute 'fix_num'
grp.columns        # fix_num is missing from the columns
Index(['index', 'sum_1', 'sum_2'], dtype='object')
grp_2 = grp.compute()    
grp_2
     name  fix_num  sum_1  sum_2
0    Jack       50    110    140
1   James       90    130    180
2  Morgan      100     60     80
3  Willom       70    150     70
# grp_2 has fix_num in its columns, but grp_2 is a pandas DataFrame
judg = grp_2.fix_num <= grp_2.sum_2
grp_2.dtypes
name       object
fix_num     int64
sum_1       int64
sum_2       int64
dtype: object

So how can I keep working with a Dask DataFrame?

Tags: pandas, group-by, dask

Solution

