pandas - How do I group rows together based on a condition in a list?
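Problem description
I want to combine rows into one when they have matching values in certain columns, but only when the value is in a given list. For example,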
team_sports = ['football', 'basketball']
view of df
country sport age
USA football 21
USA football 28
USA golf 20
USA golf 44
China football 30
China basketball 22
China basketball 41
wanted outcome
country sport age
USA football 21,28
USA golf 20
USA golf 44
China football 30
China basketball 22,41
The attempt I made was,
team_sports = ['football', 'basketball']
for i in df['Sport']:
    if i in team_sports:
        group_df = df.groupby(['Country', 'Sport'])['Age'].apply(list).reset_index()
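This takes forever to run; the dataset I am working with has 100,000 rows.
Any help is greatly appreciated, thanks.
Solution
A more straightforward approach is to split the DataFrame on whether sport is in team_sports (using isin), apply groupby aggregate to the matching rows, then concat the two parts back together: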
team_sports = ['football', 'basketball']
m = df['sport'].isin(team_sports)
cols = ['country', 'sport']
group_df = pd.concat([
    # Group those that do match condition
    df[m].groupby(cols, as_index=False)['age'].agg(list),
    # Leave those that don't match condition as is
    df[~m]
], ignore_index=True).sort_values(cols)
*sort_values is optional; it brings the rows for each country and sport back together.
group_df:
country sport age
0 China basketball [22, 41]
1 China football [30]
2 USA football [21, 28]
3 USA golf 20
4 USA golf 44
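The wanted outcome in the question shows comma-separated ages (21,28) rather than Python lists. A minimal sketch of one way to get that, assuming string ages are acceptable in the result; the names ages and joined_df are illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'USA', 'USA', 'USA', 'China', 'China', 'China'],
    'sport': ['football', 'football', 'golf', 'golf', 'football',
              'basketball', 'basketball'],
    'age': [21, 28, 20, 44, 30, 22, 41],
})
team_sports = ['football', 'basketball']

m = df['sport'].isin(team_sports)
cols = ['country', 'sport']

# Convert ages to strings first so grouped and ungrouped rows share one dtype
ages = df.assign(age=df['age'].astype(str))

joined_df = pd.concat([
    # Team sports: one row per (country, sport), ages joined with commas
    ages[m].groupby(cols, sort=False)['age'].agg(','.join).reset_index(),
    # Everything else: keep one row per original row
    ages[~m],
], ignore_index=True)

print(joined_df)
```

Because the age column now holds strings, convert back with astype(int) on the ungrouped rows if numeric operations are needed later.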
A less straightforward approach is to use isin + cumsum to create a new grouping level based on whether the value is in the team sports list:
team_sports = ['football', 'basketball']
group_df = (
    df.groupby(
        ['country', 'sport',
         (~df['sport'].sort_values().isin(team_sports)).cumsum().sort_index()],
        as_index=False,
        sort=False
    )['age'].agg(list)
)
group_df:
country sport age
0 USA football [21, 28]
1 USA golf [20]
2 USA golf [44]
3 China football [30]
4 China basketball [22, 41]
How the groups are created:
team_sports = ['football', 'basketball']
print(pd.DataFrame({
    'country': df['country'],
    'sport': df['sport'],
    'not_in_team_sports': (~df['sport'].sort_values()
                           .isin(team_sports)).cumsum().sort_index()
}))
country sport not_in_team_sports
0 USA football 0
1 USA football 0
2 USA golf 1 # golf 1
3 USA golf 2 # golf 2 (not in the same group)
4 China football 0
5 China basketball 0
6 China basketball 0
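The key column above relies on how cumsum behaves on a boolean Series: each True adds one to a running total, and each False simply inherits the current total. A small sketch with a made-up mask (the values are illustrative, not taken from df):

```python
import pandas as pd

# True marks a row whose sport is NOT in the list
mask = pd.Series([False, False, True, True, False])

# Each True bumps the counter, so every True row gets its own number;
# a False row copies whatever the counter currently is
running = mask.cumsum()
print(running.tolist())  # [0, 0, 1, 2, 2]
```

Note that the last False row ends up with 2 rather than 0, so it would no longer share a key with the first two False rows.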
*sort_values is necessary here so that the sport groups are not interrupted by sports that are not in the list.
df = pd.DataFrame({
    'country': ['USA', 'USA', 'USA'],
    'sport': ['football', 'golf', 'football'],
    'age': [21, 28, 20]
})
team_sports = ['football', 'basketball']
print(pd.DataFrame({
    'country': df['country'],
    'sport': df['sport'],
    'not_sorted': (~df['sport'].isin(team_sports)).cumsum(),
    'sorted': (~df['sport'].sort_values()
               .isin(team_sports)).cumsum().sort_index()
}))
country sport not_sorted sorted
0 USA football 0 0
1 USA golf 1 1
2 USA football 1 0 # football 1 (separate group if not sorted)
Sorting ensures the football rows stay together, so this does not happen.
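If the sorting step feels fragile, the same grouping can be sketched without it. This is a swapped-in variant, not the answer's method: it assumes a hypothetical helper column, here called solo_id, that is 0 for every team-sport row and a unique positive number for every other row, so the key no longer depends on row order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'USA', 'USA', 'USA', 'China', 'China', 'China'],
    'sport': ['football', 'football', 'golf', 'golf', 'football',
              'basketball', 'basketball'],
    'age': [21, 28, 20, 44, 30, 22, 41],
})
team_sports = ['football', 'basketball']

m = df['sport'].isin(team_sports)

# solo_id (hypothetical name): 0 for team sports, unique id for the rest
keyed = df.assign(solo_id=np.where(m, 0, np.arange(1, len(df) + 1)))

group_df = (
    keyed.groupby(['country', 'sport', 'solo_id'], sort=False)['age']
    .agg(list)                 # collect the ages of each group into a list
    .reset_index()
    .drop(columns='solo_id')   # the helper key is no longer needed
)
print(group_df)
```

Team-sport rows share solo_id 0 and therefore merge per country and sport, while every other row has its own id and stays separate.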
Setup:
import pandas as pd
df = pd.DataFrame({
    'country': ['USA', 'USA', 'USA', 'USA', 'China', 'China', 'China'],
    'sport': ['football', 'football', 'golf', 'golf', 'football', 'basketball',
              'basketball'],
    'age': [21, 28, 20, 44, 30, 22, 41]
})