python - Pandas: remove a column value if a row exists based on another column's value
Problem description
I have a DataFrame like this:
+----+--------------+-----------+---------------------------------------------------+-----------+
| | Filename | Result | IssueType | isBad |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 0 | E0CCG5S237-0 | Bad | NaN | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 1 | E0CCG5S237-0 | Bad | OCR_Text Misrecognition | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 2 | E0CCG5S237-1 | Good | NaN | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 3 | E0CCG5S238-0 | Tolerable | MA_Form field elements (checkbox, line element... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 4 | E0CCG5S238-0 | Tolerable | NaN | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 5 | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 6 | E0CCG5S239-0 | Tolerable | Extra Spaces | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 7 | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 8 | E0CCG5S239-0 | Tolerable | CS_Font Incosistency | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 9 | E0CCG5S242-0 | Bad | ML-OrphanContent | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 10 | E0CCG5S242-0 | Bad | Extra Spaces | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
I want to group the rows by Filename and Result, so I wrote this query:
subj_score_df = subj_score_df.fillna('').groupby(['Filename', 'Result'])['IssueType'].apply('\n'.join).reset_index()
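However, I want to remove the IssueType value (set it to NaN) if the isBad column is one of ('No', 'Tolerable') and at least one other row with the same Filename has the value 'Bad' in the isBad column.
If there is no row where isBad is 'Bad', IssueType is left unchanged.
(For example, row #10 here will have IssueType set to NaN because row #9 has the same Filename but isBad = Yes.)
The output DataFrame afterwards: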
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| | Filename | Result | IssueType | isBad | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 0 | E0CCG5S237-0 | Bad | NaN | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 1 | E0CCG5S237-0 | Bad | OCR_Text Misrecognition | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 2 | E0CCG5S237-1 | Good | NaN | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 3 | E0CCG5S238-0 | Tolerable | NaN | NaN | #4's isBad is Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 4 | E0CCG5S238-0 | Tolerable | NaN | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 5 | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable | All are tolerable so no change |
+----+--------------+-----------+---------------------------------------------------+-----------+ |
| 6 | E0CCG5S239-0 | Tolerable | Extra Spaces | Tolerable | |
+----+--------------+-----------+---------------------------------------------------+-----------+ |
| 7 | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV | Tolerable | |
+----+--------------+-----------+---------------------------------------------------+-----------+ |
| 8 | E0CCG5S239-0 | Tolerable | CS_Font Incosistency | Tolerable | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 9 | E0CCG5S242-0 | Bad | ML-OrphanContent | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 10 | E0CCG5S242-0 | Bad | NaN | Tolerable | #9's isBad is Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
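Is there a way to do this?
Solution
I think you first need a mask that compares isBad using Series.eq, combined with GroupBy.transform and DataFrameGroupBy.any, so that every row of a Filename containing at least one 'Yes' is flagged: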
mask = df['isBad'].eq('Yes').groupby(df['Filename']).transform('any')
Or use Series.isin with the Filename values of the rows where isBad matches the condition:
mask = df['Filename'].isin(df.loc[df['isBad'].eq('Yes'), 'Filename'])
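As a quick check, the two ways of building the mask can be compared on a small frame. This is only a sketch: the sample rows are reduced from the question's table, and the names mask_transform, mask_isin and bad_files are introduced here for illustration.

```python
import pandas as pd

# Reduced sample based on the question's table (only the columns the mask needs)
df = pd.DataFrame({
    "Filename": ["E0CCG5S238-0", "E0CCG5S238-0",
                 "E0CCG5S239-0", "E0CCG5S239-0",
                 "E0CCG5S242-0", "E0CCG5S242-0"],
    "isBad": ["Tolerable", "Yes", "Tolerable", "Tolerable", "Yes", "Tolerable"],
})

# Variant 1: flag every row whose Filename group contains at least one 'Yes'
mask_transform = df["isBad"].eq("Yes").groupby(df["Filename"]).transform("any")

# Variant 2: collect the Filenames that have a 'Yes' row, then test membership
bad_files = df.loc[df["isBad"].eq("Yes"), "Filename"]
mask_isin = df["Filename"].isin(bad_files)

print(mask_transform.tolist())
print(mask_isin.tolist())
```

Both variants should flag the same rows; which one reads better is mostly a matter of taste.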
Finally, set the missing values with Series.mask using a chained condition, only for rows matching Tolerable:
df['IssueType'] = df['IssueType'].mask(mask & df['isBad'].eq('Tolerable'))
print(df)
Filename Result IssueType isBad
0 E0CCG5S2370 Bad NaN Yes
1 E0CCG5S2370 Bad OCR_Text Misrecognition Yes
2 E0CCG5S2371 Good NaN Yes
3 E0CCG5S2380 Tolerable NaN Tolerable
4 E0CCG5S2380 Tolerable NaN Yes
5 E0CCG5S2390 Tolerable MA_Superscript,subscript. Tolerable
6 E0CCG5S2390 Tolerable Extra Spaces Tolerable
7 E0CCG5S2390 Tolerable MA_Link missing from the DV Tolerable
8 E0CCG5S2390 Tolerable CS_Font Incosistency Tolerable
9 E0CCG5S2420 Bad MLOrphanContent Yes
10 E0CCG5S2420 Bad NaN Tolerable
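For reference, the steps can be put together into one runnable sketch. It makes some assumptions that are not in the answer: the sample is a reduced version of the question's table, the names has_bad, to_clear and grouped are illustrative, and the second condition uses Series.isin so that both 'No' and 'Tolerable' from the question are covered. The last step reuses the groupby query from the question.

```python
import numpy as np
import pandas as pd

# Reduced sample based on the question's table
df = pd.DataFrame({
    "Filename": ["E0CCG5S237-0", "E0CCG5S237-0",
                 "E0CCG5S239-0", "E0CCG5S239-0",
                 "E0CCG5S242-0", "E0CCG5S242-0"],
    "Result": ["Bad", "Bad", "Tolerable", "Tolerable", "Bad", "Bad"],
    "IssueType": [np.nan, "OCR_Text Misrecognition",
                  "Extra Spaces", "CS_Font Incosistency",
                  "ML-OrphanContent", "Extra Spaces"],
    "isBad": ["Yes", "Yes", "Tolerable", "Tolerable", "Yes", "Tolerable"],
})

# True for every row whose Filename has at least one isBad == 'Yes' row
has_bad = df["isBad"].eq("Yes").groupby(df["Filename"]).transform("any")

# Assumption: clear IssueType for both 'No' and 'Tolerable', as the question asks
to_clear = has_bad & df["isBad"].isin(["No", "Tolerable"])
df["IssueType"] = df["IssueType"].mask(to_clear)

# Grouping step from the question: join the remaining IssueType values per file
grouped = (
    df.fillna("")
      .groupby(["Filename", "Result"])["IssueType"]
      .apply("\n".join)
      .reset_index()
)
print(df)
print(grouped)
```

Because the question's query fills NaN with an empty string before joining, cleared rows contribute an empty line to the joined text. Dropping those rows before grouping would avoid that, at the cost of losing groups whose IssueType values are all missing.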