Remove rows associated with only one value in another column (pandas)

Problem description

Imagine I have a DataFrame like this:

item     name     gender
banana   tom      male
banana   kate     female
apple    kate     female
kiwi     jim      male
apple    tom      male
banana   kimmy    female
kiwi     kate     female
banana   tom      male

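Is there a way to remove the rows where the person is associated with (bought) fewer than 2 items? I also don't want to drop duplicates. So my desired output is: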

item     name     gender
banana   tom      male
banana   kate     female
apple    kate     female
apple    tom      male
kiwi     kate     female
banana   tom      male 

Tags: python, pandas, dataframe

Solution


@sammywemmy's solution: df.loc[df.groupby('name').item.transform('size').ge(2)]

  1. groupby groups rows that share the same name together
# Get Each Group
print(df.groupby('name').apply(lambda s: s.reset_index()))
         index    item   name  gender
name                                 
jim   0      3    kiwi    jim    male
kate  0      1  banana   kate  female
      1      2   apple   kate  female
      2      6    kiwi   kate  female
kimmy 0      5  banana  kimmy  female
tom   0      0  banana    tom    male
      1      4   apple    tom    male
      2      7  banana    tom    male
  2. transform returns one value per row: the size (row count) of the group that row belongs to
# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['item'].transform('size')
print(df)
     item   name  gender  group_size
0  banana    tom    male           3
1  banana   kate  female           3
2   apple   kate  female           3
3    kiwi    jim    male           1
4   apple    tom    male           3
5  banana  kimmy  female           1
6    kiwi   kate  female           3
7  banana    tom    male           3

Because 'size' only counts rows, any column can be selected here and the result is the same:

# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['gender'].transform('size')
print(df)
     item   name  gender  group_size
0  banana    tom    male           3
1  banana   kate  female           3
2   apple   kate  female           3
3    kiwi    jim    male           1
4   apple    tom    male           3
5  banana  kimmy  female           1
6    kiwi   kate  female           3
7  banana    tom    male           3

Note that every row now ends with the size of its group. tom has 3 instances, so every row where name == tom has 3 in group_size.
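
To make the difference between aggregating and transforming concrete, here is a minimal sketch using the question's sample data. The variable names per_group and per_row are illustrative only and do not appear in the original answer.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'item': ['banana', 'banana', 'apple', 'kiwi',
             'apple', 'banana', 'kiwi', 'banana'],
    'name': ['tom', 'kate', 'kate', 'jim',
             'tom', 'kimmy', 'kate', 'tom'],
    'gender': ['male', 'female', 'female', 'male',
               'male', 'female', 'female', 'male'],
})

# size() aggregates: one value per group, indexed by name.
per_group = df.groupby('name').size()

# transform('size') broadcasts: one value per original row,
# aligned with df's own index, so it can be used as a mask source.
per_row = df.groupby('name')['item'].transform('size')

print(per_group)
print(per_row.tolist())  # [3, 3, 3, 1, 3, 1, 3, 3]
```

The aggregated result has four entries (one per name), while the transformed result has eight (one per row), which is why only the latter lines up with df for filtering.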

  3. ge ("greater than or equal to") compares each group size with 2 and produces a boolean mask
# Add Condition To determine if the row should be kept or not
df['should_keep'] = df.groupby('name')['item'].transform('size').ge(2)
print(df)
     item   name  gender  group_size  should_keep
0  banana    tom    male           3         True
1  banana   kate  female           3         True
2   apple   kate  female           3         True
3    kiwi    jim    male           1        False
4   apple    tom    male           3         True
5  banana  kimmy  female           1        False
6    kiwi   kate  female           3         True
7  banana    tom    male           3         True
  4. loc uses the boolean mask to select the desired rows
print(df.groupby('name')['item'].transform('size').ge(2))
0     True
1     True
2     True
3    False
4     True
5    False
6     True
7     True
Name: item, dtype: bool

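loc keeps every index whose mask value is True and excludes every index whose mask value is False. (Indexes 3 and 5 are False, so they are not included.)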
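
As a cross-check on the four steps above, the same rows can probably also be obtained with groupby(...).filter, which takes a function that returns True or False per group. This is an alternative sketch, not part of @sammywemmy's answer; the names mask, via_mask, and via_filter are illustrative.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'item': ['banana', 'banana', 'apple', 'kiwi',
             'apple', 'banana', 'kiwi', 'banana'],
    'name': ['tom', 'kate', 'kate', 'jim',
             'tom', 'kimmy', 'kate', 'tom'],
    'gender': ['male', 'female', 'female', 'male',
               'male', 'female', 'female', 'male'],
})

# Approach from the answer: build a row-aligned boolean mask, then select.
mask = df.groupby('name')['item'].transform('size').ge(2)
via_mask = df.loc[mask]

# Alternative: keep whole groups for which the function returns True.
via_filter = df.groupby('name').filter(lambda g: len(g) >= 2)

print(via_mask.index.tolist())    # [0, 1, 2, 4, 6, 7]
print(via_filter.index.tolist())  # [0, 1, 2, 4, 6, 7]
```

The transform-based mask avoids calling a Python function once per group, so it is usually the better choice on large frames; filter can be easier to read when the condition is more involved.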


All together:

import pandas as pd

df = pd.DataFrame({'item': {0: 'banana', 1: 'banana', 2: 'apple',
                            3: 'kiwi', 4: 'apple', 5: 'banana',
                            6: 'kiwi', 7: 'banana'},
                   'name': {0: 'tom', 1: 'kate', 2: 'kate',
                            3: 'jim', 4: 'tom', 5: 'kimmy',
                            6: 'kate', 7: 'tom'},
                   'gender': {0: 'male', 1: 'female',
                              2: 'female', 3: 'male',
                              4: 'male', 5: 'female',
                              6: 'female', 7: 'male'}})

print(df.loc[df.groupby('name')['name'].transform('size').ge(2)])
     item  name  gender
0  banana   tom    male
1  banana  kate  female
2   apple  kate  female
4   apple   tom    male
6    kiwi  kate  female
7  banana   tom    male
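
One caveat: 'size' counts rows, so a person who bought the same item twice also passes the check. That matches the desired output in the question, but if "fewer than 2 items" is meant as fewer than 2 distinct items, swapping in 'nunique' may be closer to the intent. The sketch below uses hypothetical data (the buyer ann is not in the original question) to show where the two readings differ.

```python
import pandas as pd

# Hypothetical data: 'ann' buys the same item twice, 'jim' buys once.
df = pd.DataFrame({
    'item': ['banana', 'apple', 'banana', 'banana', 'kiwi'],
    'name': ['tom', 'tom', 'ann', 'ann', 'jim'],
})

# Row count per person: ann has 2 rows, so she is kept.
by_rows = df.loc[df.groupby('name')['item'].transform('size').ge(2)]

# Distinct items per person: ann has only 1 distinct item, so she is dropped.
by_distinct = df.loc[df.groupby('name')['item'].transform('nunique').ge(2)]

print(by_rows['name'].tolist())      # ['tom', 'tom', 'ann', 'ann']
print(by_distinct['name'].tolist())  # ['tom', 'tom']
```

Neither variant drops duplicate rows for the people who are kept. On the question's own data both versions return the same six rows.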
