How to group rows together based on a condition in a list? (pandas)

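Problem description

I want to combine rows into one when they have matching values in certain columns, but only when the value is in a given list. For example,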

team_sports = ['football', 'basketball']

view of df

country    sport      age
USA       football    21
USA       football    28
USA       golf        20
USA       golf        44
China     football    30
China     basketball  22
China     basketball  41

wanted outcome
country    sport      age
USA       football    21,28
USA       golf        20
USA       golf        44
China     football    30
China     basketball  22,41

The attempt I made was,

team_sports = ['football', 'basketball']

for i in df['Sport']:
  if i in team_sports:
     group_df= df.groupby(['Country', 'Sport'])['Age'].apply(list).reset_index() 

This takes forever to run; the dataset I'm working with has 100,000 rows.

Any help is much appreciated, thanks.

Tags: pandas, database, group-by

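Solution

The more straightforward approach is to split the rows with isin on team_sports, groupby and aggregate only the matching rows, then concat the two parts back together: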

team_sports = ['football', 'basketball']

m = df['sport'].isin(team_sports)
cols = ['country', 'sport']
group_df = pd.concat([
    # Group those that do match condition
    df[m].groupby(cols, as_index=False)['age'].agg(list),
    # Leave those that don't match condition as is
    df[~m]
], ignore_index=True).sort_values(cols)

*sort_values is optional; it brings the rows for each country and sport back together.

group_df

  country       sport       age
0   China  basketball  [22, 41]
1   China    football      [30]
2     USA    football  [21, 28]
3     USA        golf        20
4     USA        golf        44
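
The wanted outcome in the question shows the ages as comma-separated text ("21,28") rather than as lists. Below is a minimal sketch of that variation on the approach above, assuming the same df and team_sports as in the setup. The names ages_as_text and joined_df are made up for this illustration, and the row order of the result is not guaranteed to match the question's table.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'USA', 'USA', 'USA', 'China', 'China', 'China'],
    'sport': ['football', 'football', 'golf', 'golf', 'football',
              'basketball', 'basketball'],
    'age': [21, 28, 20, 44, 30, 22, 41]
})
team_sports = ['football', 'basketball']

m = df['sport'].isin(team_sports)
cols = ['country', 'sport']

# Cast ages to strings first so grouped and ungrouped rows share one dtype
# (ages_as_text is a hypothetical helper name, not part of the original answer)
ages_as_text = df.assign(age=df['age'].astype(str))

joined_df = pd.concat([
    # Team sports: one row per (country, sport), ages joined with commas
    ages_as_text[m].groupby(cols, as_index=False, sort=False)['age'].agg(','.join),
    # Everything else: left as individual rows
    ages_as_text[~m]
], ignore_index=True)

print(joined_df)
```

Casting to strings before grouping keeps the age column uniform; mixing lists and plain integers in one column, as in the output above, tends to make later processing harder.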

A less straightforward approach is to use isin + cumsum to create a new grouping level based on whether the value is in the team sports list:

team_sports = ['football', 'basketball']

group_df = (
    df.groupby(
        ['country', 'sport',
         (~df['sport'].sort_values().isin(team_sports)).cumsum().sort_index()],
        as_index=False,
        sort=False
    )['age'].agg(list)
)

group_df

  country       sport       age
0     USA    football  [21, 28]
1     USA        golf      [20]
2     USA        golf      [44]
3   China    football      [30]
4   China  basketball  [22, 41]

How the groups are created:

team_sports = ['football', 'basketball']

print(pd.DataFrame({
    'country': df['country'],
    'sport': df['sport'],
    'not_in_team_sports': (~df['sport'].sort_values()
                           .isin(team_sports)).cumsum().sort_index()
}))
  country       sport  not_in_team_sports
0     USA    football                   0
1     USA    football                   0
2     USA        golf                   1  # golf 1
3     USA        golf                   2  # golf 2 (not in the same group)
4   China    football                   0
5   China  basketball                   0
6   China  basketball                   0

*sort_values is needed here so that a sport's group is not interrupted by sports that are not in the list. For example:

df = pd.DataFrame({
    'country': ['USA', 'USA', 'USA'],
    'sport': ['football', 'golf', 'football'],
    'age': [21, 28, 20]
})
team_sports = ['football', 'basketball']

print(pd.DataFrame({
    'country': df['country'],
    'sport': df['sport'],
    'not_sorted': (~df['sport'].isin(team_sports)).cumsum(),
    'sorted': (~df['sport'].sort_values()
                           .isin(team_sports)).cumsum().sort_index()
}))
  country     sport  not_sorted  sorted
0     USA  football           0       0
1     USA      golf           1       1
2     USA  football           1       0  # football 1 (separate group if not sorted)

Sorting keeps the football rows next to each other, so they share one key value and this does not happen.
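
One way to check this claim is to count the groups each key produces on the three-row example above. This is only a sketch: count_groups is a hypothetical helper written for this illustration, and the key is renamed to 'key' so that its name does not clash with the 'sport' column.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'USA', 'USA'],
    'sport': ['football', 'golf', 'football'],
    'age': [21, 28, 20]
})
team_sports = ['football', 'basketball']

# Key built without sorting: the golf row bumps the counter between
# the two football rows, so they end up with different key values
unsorted_key = (~df['sport'].isin(team_sports)).cumsum()

# Key built on the sorted column, then restored to the original row order
sorted_key = (~df['sport'].sort_values()
              .isin(team_sports)).cumsum().sort_index()


def count_groups(key):
    """Number of groups when grouping by country, sport and the extra key."""
    return df.groupby(['country', 'sport', key.rename('key')]).ngroups


print(count_groups(unsorted_key))  # 3: football is split into two groups
print(count_groups(sorted_key))    # 2: one football group, one golf group
```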


Setup:

import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'USA', 'USA', 'USA', 'China', 'China', 'China'],
    'sport': ['football', 'football', 'golf', 'golf', 'football', 'basketball',
              'basketball'],
    'age': [21, 28, 20, 44, 30, 22, 41]
})
