How do I use the groupby function on a DataFrame in Python?

Problem description

Input DataFrame:

      Last Updated  Downloads             Category  
0             2018      10000       ART_AND_DESIGN  
1             2018     500000       ART_AND_DESIGN  
2             2018    5000000       ART_AND_DESIGN  
3             2018   50000000       ART_AND_DESIGN  
4             2018     100000       ART_AND_DESIGN  
           ...        ...                  ...  
10838         2017       1000              MEDICAL  
10839         2015       1000  BOOKS_AND_REFERENCE  
10840         2018   10000000            LIFESTYLE  

The problem statement is: "For the years 2016, 2017, and 2018, which app categories have the most and the fewest downloads?"

To solve this, I used:

df1 = df_year_d.groupby(['Last Updated','Category']).sum()
print(df1)
                                    Downloads
Last Updated Category                        
2010         FAMILY                    100000
2011         BOOKS_AND_REFERENCE      1000000
             BUSINESS                    1000
             FAMILY                     50000
             GAME                    10100000
             LIBRARIES_AND_DEMO       1000000
             LIFESTYLE                 100000
             TOOLS                    5156100
2012         BUSINESS                   10000
             COMMUNICATION               1000
             FAMILY                    711210
             FINANCE                   100000
             GAME                     1050000
             HEALTH_AND_FITNESS       1100000
             LIBRARIES_AND_DEMO      10000000
             MEDICAL                   120000
             PHOTOGRAPHY               500000
             PRODUCTIVITY              100000
             SHOPPING                  100000
             TOOLS                     200000
2013         BOOKS_AND_REFERENCE         2000
             BUSINESS                   10300
             COMMUNICATION             151000
             EDUCATION                  50000
             FAMILY                  50338310
             FINANCE                    60100
             GAME                    40265250
             HEALTH_AND_FITNESS         10000
             HOUSE_AND_HOME            100000
             LIBRARIES_AND_DEMO       6000000
                                      ...
2018         BOOKS_AND_REFERENCE   1880913110
             BUSINESS               975227003
             COMICS                  55201050
             COMMUNICATION        32548874886
             DATING                 262259557
             EDUCATION              842800000
             ENTERTAINMENT         2836150000
             EVENTS                  15410330
             FAMILY                9020112207
             FINANCE                872763824
             FOOD_AND_DRINK         271663081
             GAME                 33052192901
             HEALTH_AND_FITNESS    1568697276
             HOUSE_AND_HOME         161847101
             LIBRARIES_AND_DEMO      16283100
             LIFESTYLE              468085968
             MAPS_AND_NAVIGATION    702264990
             MEDICAL                 50556517
             NEWS_AND_MAGAZINES    7491323670
             PARENTING               31140010
             PERSONALIZATION       2130701875
             PHOTOGRAPHY           9402062515
             PRODUCTIVITY         13963101723
             SHOPPING              3243802640
             SOCIAL               13924137461
             SPORTS                1540744703
             TOOLS                10633528879
             TRAVEL_AND_LOCAL      6846181981
             VIDEO_PLAYERS         5928936510
             WEATHER                407227020

[188 rows x 1 columns]

Now I need the Category with the maximum and the minimum Downloads for each of the three years 2016, 2017, and 2018. Please suggest an efficient way to do this in Python.

Tags: python, pandas, analytics

Solution


First filter with Series.isin and boolean indexing so that only the necessary rows are processed (fewer rows to aggregate means better performance).
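As a minimal sketch of that filtering step, the snippet below builds a small hypothetical DataFrame (the rows are made up for illustration; only the column names come from the question) and keeps the rows whose year is in the list. The names `years`, `mask`, and `filtered` are illustrative.

```python
import pandas as pd

# Hypothetical toy data using the question's column names.
df_year_d = pd.DataFrame({
    "Last Updated": [2015, 2016, 2016, 2017, 2018, 2018],
    "Downloads": [1000, 10, 10000, 500, 100, 5000],
    "Category": ["MEDICAL", "BUSINESS", "FINANCE", "TOOLS", "GAME", "SOCIAL"],
})

years = [2016, 2017, 2018]

# Series.isin returns a boolean Series: True where the year is in `years`.
mask = df_year_d["Last Updated"].isin(years)

# Boolean indexing keeps only the rows where the mask is True.
filtered = df_year_d[mask]

print(filtered)
```

The 2015 row is dropped before any grouping happens, so the later aggregation only touches the three years of interest.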

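Because you want the Category for each year, first aggregate with sum using as_index=False, so that the result is a DataFrame in which Last Updated and Category stay regular columns. Then use DataFrameGroupBy.idxmin and DataFrameGroupBy.idxmax to get the index labels of the minimum and maximum values in each group, reshape those two columns into a single Series with DataFrame.stack, and select the matching rows with DataFrame.loc: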

# Keep only the years of interest
df1 = df_year_d[df_year_d['Last Updated'].isin([2016,2017,2018])]
# Total downloads per (year, category); group the filtered frame, not df_year_d
df1 = df1.groupby(['Last Updated','Category'], as_index=False).sum()

# Index labels of the min and max Downloads per year, then select those rows
df1 = df1.loc[df1.groupby('Last Updated')['Downloads'].agg(['idxmin','idxmax']).stack()]


print(df1)
    Last Updated             Category  Downloads
7           2016             BUSINESS         10
6           2016  BOOKS_AND_REFERENCE      10000
14          2017         PRODUCTIVITY         10
9           2017       ART_AND_DESIGN    2660000
16          2018    AUTO_AND_VEHICLES        100
15          2018       ART_AND_DESIGN   84345000
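To make the intermediate steps visible, here is a self-contained sketch of the idxmin/idxmax approach on hypothetical toy data (the rows are invented; only the column names come from the question). The names `totals`, `idx`, and `result` are illustrative, not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data; SOCIAL appears twice in 2018 so that sum() matters.
df_year_d = pd.DataFrame({
    "Last Updated": [2015, 2016, 2016, 2016, 2017, 2017, 2018, 2018, 2018, 2018],
    "Category": ["MEDICAL", "BUSINESS", "FINANCE", "GAME", "TOOLS",
                 "GAME", "GAME", "SOCIAL", "SOCIAL", "TOOLS"],
    "Downloads": [1000, 10, 500, 10000, 300, 2000, 100, 5000, 1000, 70],
})

# Step 1: filter, then total the downloads per (year, category).
subset = df_year_d[df_year_d["Last Updated"].isin([2016, 2017, 2018])]
totals = subset.groupby(["Last Updated", "Category"], as_index=False)["Downloads"].sum()

# Step 2: per year, the row labels of the smallest and largest totals.
idx = totals.groupby("Last Updated")["Downloads"].agg(["idxmin", "idxmax"])

# Step 3: stack() flattens the two label columns into one Series,
# which .loc then uses to pull the matching rows from `totals`.
result = totals.loc[idx.stack()]

print(idx)
print(result)
```

Each year contributes two rows to `result`: the category with the minimum total first, then the category with the maximum total.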

Another idea is to sort with DataFrame.sort_values and use GroupBy.nth to take the first and last rows of each group:

# Keep only the years of interest
df1 = df_year_d[df_year_d['Last Updated'].isin([2016,2017,2018])]
# Group the filtered frame, not df_year_d
df1 = df1.groupby(['Last Updated','Category'], as_index=False).sum()

df1 = (df1.sort_values(['Last Updated','Downloads'])
          .groupby('Last Updated', as_index=False)
          .nth([0,-1]))
print(df1)
    Last Updated           Category  Downloads
7           2016           BUSINESS         10
8           2016            FINANCE      10000
14          2017       PRODUCTIVITY         10
9           2017     ART_AND_DESIGN    2660000
16          2018  AUTO_AND_VEHICLES        100
15          2018     ART_AND_DESIGN   84345000
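The sort-then-nth variant can be sketched the same way on hypothetical toy data (invented rows, the question's column names). The names `totals` and `extremes` are illustrative.

```python
import pandas as pd

# Hypothetical toy data using the question's column names.
df_year_d = pd.DataFrame({
    "Last Updated": [2015, 2016, 2016, 2016, 2017, 2017, 2018, 2018, 2018],
    "Category": ["MEDICAL", "BUSINESS", "FINANCE", "GAME", "TOOLS",
                 "GAME", "GAME", "SOCIAL", "TOOLS"],
    "Downloads": [1000, 10, 500, 10000, 300, 2000, 100, 5000, 70],
})

# Filter to the three years, then total downloads per (year, category).
subset = df_year_d[df_year_d["Last Updated"].isin([2016, 2017, 2018])]
totals = subset.groupby(["Last Updated", "Category"], as_index=False)["Downloads"].sum()

# Sort so that, within each year, the smallest total comes first and the
# largest comes last; nth([0, -1]) then keeps exactly those two rows.
extremes = (totals.sort_values(["Last Updated", "Downloads"])
                  .groupby("Last Updated", as_index=False)
                  .nth([0, -1]))

print(extremes)
```

When several categories tie for the minimum or maximum, idxmin and idxmax return the first matching row, while nth keeps whichever tied row the sort places first or last. The two approaches can therefore report different categories for the same value, as with the 2016 maximum of 10000 in the two outputs above.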
