python - How do I use the groupby function on a DataFrame in Python
Problem description
Input DataFrame:
Last Updated Downloads Category
0 2018 10000 ART_AND_DESIGN
1 2018 500000 ART_AND_DESIGN
2 2018 5000000 ART_AND_DESIGN
3 2018 50000000 ART_AND_DESIGN
4 2018 100000 ART_AND_DESIGN
... ... ...
10838 2017 1000 MEDICAL
10839 2015 1000 BOOKS_AND_REFERENCE
10840 2018 10000000 LIFESTYLE
The problem statement is: "For the years 2016, 2017, and 2018, which app categories have the most and the fewest downloads?"
To solve this, I used:
df1 = df_year_d.groupby(['Last Updated','Category']).sum()
print(df1)
Downloads
Last Updated Category
2010 FAMILY 100000
2011 BOOKS_AND_REFERENCE 1000000
BUSINESS 1000
FAMILY 50000
GAME 10100000
LIBRARIES_AND_DEMO 1000000
LIFESTYLE 100000
TOOLS 5156100
2012 BUSINESS 10000
COMMUNICATION 1000
FAMILY 711210
FINANCE 100000
GAME 1050000
HEALTH_AND_FITNESS 1100000
LIBRARIES_AND_DEMO 10000000
MEDICAL 120000
PHOTOGRAPHY 500000
PRODUCTIVITY 100000
SHOPPING 100000
TOOLS 200000
2013 BOOKS_AND_REFERENCE 2000
BUSINESS 10300
COMMUNICATION 151000
EDUCATION 50000
FAMILY 50338310
FINANCE 60100
GAME 40265250
HEALTH_AND_FITNESS 10000
HOUSE_AND_HOME 100000
LIBRARIES_AND_DEMO 6000000
...
2018 BOOKS_AND_REFERENCE 1880913110
BUSINESS 975227003
COMICS 55201050
COMMUNICATION 32548874886
DATING 262259557
EDUCATION 842800000
ENTERTAINMENT 2836150000
EVENTS 15410330
FAMILY 9020112207
FINANCE 872763824
FOOD_AND_DRINK 271663081
GAME 33052192901
HEALTH_AND_FITNESS 1568697276
HOUSE_AND_HOME 161847101
LIBRARIES_AND_DEMO 16283100
LIFESTYLE 468085968
MAPS_AND_NAVIGATION 702264990
MEDICAL 50556517
NEWS_AND_MAGAZINES 7491323670
PARENTING 31140010
PERSONALIZATION 2130701875
PHOTOGRAPHY 9402062515
PRODUCTIVITY 13963101723
SHOPPING 3243802640
SOCIAL 13924137461
SPORTS 1540744703
TOOLS 10633528879
TRAVEL_AND_LOCAL 6846181981
VIDEO_PLAYERS 5928936510
WEATHER 407227020
[188 rows x 1 columns]
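Now I need the Category with the maximum and the minimum Downloads for each of the three years 2016, 2017, and 2018. Please suggest an efficient way to solve this query in Python.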
Solution
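First filter with Series.isin and boolean indexing, so that only the necessary rows are processed (fewer rows to process means better performance).
Because you want the Category for each year, first aggregate with sum and as_index=False, so the result is a DataFrame in which Last Updated and Category remain ordinary columns. Then use DataFrameGroupBy.idxmin and DataFrameGroupBy.idxmax to get the index labels of the minimum and maximum Downloads in each group. DataFrame.stack flattens those two columns of labels into a single Series, which can then be passed to DataFrame.loc to select the matching rows: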
# Keep only the three years of interest
df1 = df_year_d[df_year_d['Last Updated'].isin([2016,2017,2018])]
# Aggregate the filtered frame (df1, not df_year_d), otherwise the filter is discarded
df1 = df1.groupby(['Last Updated','Category'], as_index=False).sum()
# Index labels of the min and max Downloads per year, flattened by stack() for loc
df1 = df1.loc[df1.groupby('Last Updated')['Downloads'].agg(['idxmin','idxmax']).stack()]
print(df1)
Last Updated Category Downloads
7 2016 BUSINESS 10
6 2016 BOOKS_AND_REFERENCE 10000
14 2017 PRODUCTIVITY 10
9 2017 ART_AND_DESIGN 2660000
16 2018 AUTO_AND_VEHICLES 100
15 2018 ART_AND_DESIGN 84345000
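The steps above can be tried on a small made-up frame. This is only a sketch: the rows and category names below are invented for illustration and are not taken from the question's data, and the variable names agg, labels and result are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the same three columns as the question
df_year_d = pd.DataFrame({
    "Last Updated": [2015, 2016, 2016, 2016, 2017, 2017, 2017, 2018, 2018, 2018],
    "Downloads": [1000, 10, 500, 10000, 100, 5000, 200, 100, 1000, 50000],
    "Category": ["BOOKS", "BUSINESS", "TOOLS", "GAME", "TOOLS",
                 "GAME", "TOOLS", "MEDICAL", "GAME", "GAME"],
})

# 1. Keep only the years of interest
df1 = df_year_d[df_year_d["Last Updated"].isin([2016, 2017, 2018])]

# 2. Total downloads per (year, category), keys kept as ordinary columns
agg = df1.groupby(["Last Updated", "Category"], as_index=False).sum()

# 3. Index labels of the smallest and largest total within each year
labels = agg.groupby("Last Updated")["Downloads"].agg(["idxmin", "idxmax"]).stack()

# 4. Select those rows from the aggregated frame
result = agg.loc[labels]
print(result)
```

Note that idxmin and idxmax return a single label per group, so when several categories tie for the minimum or maximum only the first one found is reported.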
Another idea is to sort with DataFrame.sort_values and use GroupBy.nth to take the first and last row of each group:
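# Keep only the three years of interest
df1 = df_year_d[df_year_d['Last Updated'].isin([2016,2017,2018])]
# Aggregate the filtered frame (df1, not df_year_d), otherwise the filter is discarded
df1 = df1.groupby(['Last Updated','Category'], as_index=False).sum()
# After sorting, the first row of each year is the minimum and the last is the maximum
df1 = (df1.sort_values(['Last Updated','Downloads'])
          .groupby('Last Updated', as_index=False)
          .nth([0,-1]))
print(df1)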
Last Updated Category Downloads
7 2016 BUSINESS 10
8 2016 FINANCE 10000
14 2017 PRODUCTIVITY 10
9 2017 ART_AND_DESIGN 2660000
16 2018 AUTO_AND_VEHICLES 100
15 2018 ART_AND_DESIGN 84345000
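The sort-based variant can be sketched the same way on a small made-up frame. Again, the data is invented for illustration only, and the names agg and result are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the same three columns as the question
df_year_d = pd.DataFrame({
    "Last Updated": [2015, 2016, 2016, 2016, 2017, 2017, 2017, 2018, 2018, 2018],
    "Downloads": [1000, 10, 500, 10000, 100, 5000, 200, 100, 1000, 50000],
    "Category": ["BOOKS", "BUSINESS", "TOOLS", "GAME", "TOOLS",
                 "GAME", "TOOLS", "MEDICAL", "GAME", "GAME"],
})

# Filter to the years of interest, then total downloads per (year, category)
df1 = df_year_d[df_year_d["Last Updated"].isin([2016, 2017, 2018])]
agg = df1.groupby(["Last Updated", "Category"], as_index=False).sum()

# Sort so that, within each year, the smallest total comes first and the
# largest comes last, then take positions 0 and -1 of every group
result = (agg.sort_values(["Last Updated", "Downloads"])
             .groupby("Last Updated", as_index=False)
             .nth([0, -1]))
print(result)
```

With ties, which of the tied categories lands in the first or last position depends on the sort order, so this variant and the idxmin/idxmax variant may report different categories for the same tied value.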