python - Remove rows whose value is associated with only one row in another column (pandas)
Question
Imagine I have a dataframe like this:
item name gender
banana tom male
banana kate female
apple kate female
kiwi jim male
apple tom male
banana kimmy female
kiwi kate female
banana tom male
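Is there a way to remove the rows where the person is associated with (has bought) fewer than 2 items? I do not want to drop duplicates either. So my desired output looks like this: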
item name gender
banana tom male
banana kate female
apple kate female
apple tom male
kiwi kate female
banana tom male
Answer
The solution from @sammywemmy:
df.loc[df.groupby('name').item.transform('size').ge(2)]
- groupby groups the rows that share the same name
# Get Each Group
print(df.groupby('name').apply(lambda s: s.reset_index()))
index item name gender
name
jim 0 3 kiwi jim male
kate 0 1 banana kate female
1 2 apple kate female
2 6 kiwi kate female
kimmy 0 5 banana kimmy female
tom 0 0 banana tom male
1 4 apple tom male
2 7 banana tom male
- transform puts a value on every row that represents the size of its group (the number of rows)
# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['item'].transform('size')
print(df)
item name gender group_size
0 banana tom male 3
1 banana kate female 3
2 apple kate female 3
3 kiwi jim male 1
4 apple tom male 3
5 banana kimmy female 1
6 kiwi kate female 3
7 banana tom male 3
In this case, any column can be used for this, because size only counts rows:
# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['gender'].transform('size')
print(df)
item name gender group_size
0 banana tom male 3
1 banana kate female 3
2 apple kate female 3
3 kiwi jim male 1
4 apple tom male 3
5 banana kimmy female 1
6 kiwi kate female 3
7 banana tom male 3
Note that every row now ends with the size of its group. tom has 3 instances, so every row where name == tom has 3 in group_size.
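The same per-row counts can probably also be obtained without groupby, by mapping each name to its frequency. This is a sketch of an alternative, not part of the original answer; the names sizes_per_row and mapped are assumptions made for illustration.

```python
import pandas as pd

# Rebuild the dataframe from the question
df = pd.DataFrame({
    'item': ['banana', 'banana', 'apple', 'kiwi',
             'apple', 'banana', 'kiwi', 'banana'],
    'name': ['tom', 'kate', 'kate', 'jim',
             'tom', 'kimmy', 'kate', 'tom'],
    'gender': ['male', 'female', 'female', 'male',
               'male', 'female', 'female', 'male'],
})

# Approach from the answer: broadcast the group size to every row
sizes_per_row = df.groupby('name')['item'].transform('size')

# Alternative: count each name once, then look the count up per row
mapped = df['name'].map(df['name'].value_counts())

# Both Series share df's index and hold the same counts
print((sizes_per_row == mapped).all())
```

Both results are aligned with df's index, which is what makes them usable for row-wise filtering.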
- ge applies the relational operator "greater than or equal to" and turns the sizes into a boolean index
# Add Condition To determine if the row should be kept or not
df['should_keep'] = df.groupby('name')['item'].transform('size').ge(2)
print(df)
item name gender group_size should_keep
0 banana tom male 3 True
1 banana kate female 3 True
2 apple kate female 3 True
3 kiwi jim male 1 False
4 apple tom male 3 True
5 banana kimmy female 1 False
6 kiwi kate female 3 True
7 banana tom male 3 True
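As a side note, ge(2) is the method form of the >= operator, so either spelling should give the same mask. A minimal sketch, with the variable names with_method and with_operator assumed for illustration:

```python
import pandas as pd

# Rebuild the dataframe from the question
df = pd.DataFrame({
    'item': ['banana', 'banana', 'apple', 'kiwi',
             'apple', 'banana', 'kiwi', 'banana'],
    'name': ['tom', 'kate', 'kate', 'jim',
             'tom', 'kimmy', 'kate', 'tom'],
    'gender': ['male', 'female', 'female', 'male',
               'male', 'female', 'female', 'male'],
})

sizes = df.groupby('name')['item'].transform('size')

# Method form, convenient at the end of a method chain
with_method = sizes.ge(2)

# Operator form, needs no extra parentheses when used on a variable
with_operator = sizes >= 2

# The two boolean Series are identical
print(with_method.equals(with_operator))
```

The method form is mostly a matter of style: it lets the comparison stay inside one chained expression.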
- loc uses the boolean index to select the desired rows
print(df.groupby('name')['item'].transform('size').ge(2))
0 True
1 True
2 True
3 False
4 True
5 False
6 True
7 True
Name: item, dtype: bool
loc includes every index that is True and excludes every index that is False. (Indexes 3 and 5 are False, so they are not included.)
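The selection step can be sketched with the mask stored in a variable, which also makes it easy to inspect the rows that get dropped. The names mask, kept, and dropped are assumptions made for illustration.

```python
import pandas as pd

# Rebuild the dataframe from the question
df = pd.DataFrame({
    'item': ['banana', 'banana', 'apple', 'kiwi',
             'apple', 'banana', 'kiwi', 'banana'],
    'name': ['tom', 'kate', 'kate', 'jim',
             'tom', 'kimmy', 'kate', 'tom'],
    'gender': ['male', 'female', 'female', 'male',
               'male', 'female', 'female', 'male'],
})

# True for rows whose name appears at least twice
mask = df.groupby('name')['item'].transform('size').ge(2)

kept = df.loc[mask]       # rows where the mask is True
dropped = df.loc[~mask]   # inverted mask: rows that are filtered out

print(kept.index.tolist())
print(dropped['name'].tolist())
```

The kept rows retain their original index labels (0, 1, 2, 4, 6, 7); calling kept.reset_index(drop=True) would renumber them if a continuous index is wanted.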
All together:
import pandas as pd
df = pd.DataFrame({'item': {0: 'banana', 1: 'banana', 2: 'apple',
3: 'kiwi', 4: 'apple', 5: 'banana',
6: 'kiwi', 7: 'banana'},
'name': {0: 'tom', 1: 'kate', 2: 'kate',
3: 'jim', 4: 'tom', 5: 'kimmy',
6: 'kate', 7: 'tom'},
'gender': {0: 'male', 1: 'female',
2: 'female', 3: 'male',
4: 'male', 5: 'female',
6: 'female', 7: 'male'}})
print(df.loc[df.groupby('name')['name'].transform('size').ge(2)])
item name gender
0 banana tom male
1 banana kate female
2 apple kate female
4 apple tom male
6 kiwi kate female
7 banana tom male
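Two variations may be worth knowing; neither comes from the original answer. groupby(...).filter expresses the same row-count rule with a function per group, and transform('nunique') counts distinct items instead of rows. On the question's data all of these agree, so the sketch below appends a hypothetical buyer, ann, who bought the same item twice, purely to show where the two rules differ.

```python
import pandas as pd

# Rebuild the dataframe from the question
df = pd.DataFrame({
    'item': ['banana', 'banana', 'apple', 'kiwi',
             'apple', 'banana', 'kiwi', 'banana'],
    'name': ['tom', 'kate', 'kate', 'jim',
             'tom', 'kimmy', 'kate', 'tom'],
    'gender': ['male', 'female', 'female', 'male',
               'male', 'female', 'female', 'male'],
})

# Hypothetical extra rows (not in the question): ann buys banana twice
extra = pd.DataFrame({'item': ['banana', 'banana'],
                      'name': ['ann', 'ann'],
                      'gender': ['female', 'female']})
df2 = pd.concat([df, extra], ignore_index=True)

# Rule 1: keep names with at least 2 rows (same rule as the answer)
by_rows = df2.groupby('name').filter(lambda g: len(g) >= 2)

# Rule 2: keep names with at least 2 distinct items
distinct = df2.groupby('name')['item'].transform('nunique')
by_distinct = df2.loc[distinct.ge(2)]

print(sorted(by_rows['name'].unique()))
print(sorted(by_distinct['name'].unique()))
```

ann passes the row-count rule but not the distinct-item rule, so which one fits depends on whether "fewer than 2 items" means purchases or different products. For large frames the transform-based mask is generally expected to be faster than filter with a Python lambda.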