Pandas: remove a column value if a row exists based on another column's value

Problem description

I have a dataframe like this:

+----+--------------+-----------+---------------------------------------------------+-----------+
|    | Filename     | Result    | IssueType                                         | isBad     |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 0  | E0CCG5S237-0 | Bad       | NaN                                               | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 1  | E0CCG5S237-0 | Bad       | OCR_Text Misrecognition                           | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 2  | E0CCG5S237-1 | Good      | NaN                                               | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 3  | E0CCG5S238-0 | Tolerable | MA_Form field elements (checkbox, line element... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 4  | E0CCG5S238-0 | Tolerable | NaN                                               | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 5  | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 6  | E0CCG5S239-0 | Tolerable | Extra Spaces                                      | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 7  | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV                       | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 8  | E0CCG5S239-0 | Tolerable | CS_Font Incosistency                              | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 9  | E0CCG5S242-0 | Bad       | ML-OrphanContent                                  | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 10 | E0CCG5S242-0 | Bad       | Extra Spaces                                      | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+

I want to group the rows by Filename and Result, and for that I wrote this query:
subj_score_df = subj_score_df.fillna('').groupby(['Filename', 'Result'])['IssueType'].apply('\n'.join).reset_index()
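To make the grouping step concrete, here is a minimal sketch on a three-row sample abbreviated from the table above. The names `sample` and `grouped` are illustrative and not from the original post. `fillna('')` replaces missing `IssueType` values with empty strings so that `'\n'.join` only ever sees strings.

```python
import numpy as np
import pandas as pd

# Three rows abbreviated from the question's table (illustrative sample).
sample = pd.DataFrame({
    "Filename": ["E0CCG5S237-0", "E0CCG5S237-0", "E0CCG5S237-1"],
    "Result": ["Bad", "Bad", "Good"],
    "IssueType": [np.nan, "OCR_Text Misrecognition", np.nan],
})

# Same chain as in the question: one row per (Filename, Result) pair,
# with the IssueType strings of that pair joined by newlines.
grouped = (
    sample.fillna("")
    .groupby(["Filename", "Result"])["IssueType"]
    .apply("\n".join)
    .reset_index()
)
print(grouped)
```

Because NaN becomes an empty string, a group whose first issue was missing ends up with a leading newline in the joined text.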

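But I want to remove the IssueType value (set it to NaN) if the isBad column is one of ('No', 'Tolerable') and there is at least one other row with the same Filename whose isBad column has the value 'Yes'.

If no row with that Filename has isBad equal to 'Yes', IssueType is left unchanged.

(For example, row #10 here gets IssueType NaN because #9 has the same Filename but has isBad = Yes.)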

The output dataframe afterwards:

+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
|    | Filename     | Result    | IssueType                                         | isBad     |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 0  | E0CCG5S237-0 | Bad       | NaN                                               | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 1  | E0CCG5S237-0 | Bad       | OCR_Text Misrecognition                           | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 2  | E0CCG5S237-1 | Good      | NaN                                               | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 3  | E0CCG5S238-0 | Tolerable | NaN                                               | Tolerable | #4's isBad is Yes                |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 4  | E0CCG5S238-0 | Tolerable | NaN                                               | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 5  | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable | All are tolerable so no   change |
+----+--------------+-----------+---------------------------------------------------+-----------+                                  |
| 6  | E0CCG5S239-0 | Tolerable | Extra Spaces                                      | Tolerable |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+                                  |
| 7  | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV                       | Tolerable |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+                                  |
| 8  | E0CCG5S239-0 | Tolerable | CS_Font Incosistency                              | Tolerable |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 9  | E0CCG5S242-0 | Bad       | ML-OrphanContent                                  | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 10 | E0CCG5S242-0 | Bad       | NaN                                               | Tolerable | #9's isBad is Yes                |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+

Is there a way to do this?

Tags: python, python-3.x, pandas

Solution


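I think you first need a mask built by comparing isBad with Series.eq, combined with GroupBy.transform and DataFrameGroupBy.any, so that every row of a Filename is flagged when any row of that Filename has isBad equal to 'Yes':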

mask = df['isBad'].eq('Yes').groupby(df['Filename']).transform('any')
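As a small illustration of what this mask contains, the sketch below runs the same expression on four rows taken from the question's table. The intermediate name `is_yes` is added here for readability only.

```python
import pandas as pd

# Two files from the question: one contains a 'Yes' row, the other does not.
df = pd.DataFrame({
    "Filename": ["E0CCG5S238-0", "E0CCG5S238-0", "E0CCG5S239-0", "E0CCG5S239-0"],
    "isBad": ["Tolerable", "Yes", "Tolerable", "Tolerable"],
})

# Row-level test: is this particular row marked 'Yes'?
is_yes = df["isBad"].eq("Yes")

# Broadcast "any row in this file is 'Yes'" back to every row of the file.
mask = is_yes.groupby(df["Filename"]).transform("any")
print(mask.tolist())  # [True, True, False, False]
```

`transform` keeps the original index and length, which is why the result can be combined directly with other row-level conditions.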

Or use Series.isin with the Filename values of the rows where isBad matches the condition:

mask = df['Filename'].isin(df.loc[df['isBad'].eq('Yes'), 'Filename'])
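The two ways of building the mask should flag the same rows. The following sketch checks this on a three-row sample from the question; `bad_files`, `mask_isin`, and `mask_transform` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Filename": ["E0CCG5S242-0", "E0CCG5S242-0", "E0CCG5S239-0"],
    "isBad": ["Yes", "Tolerable", "Tolerable"],
})

# Filenames that have at least one row with isBad == 'Yes'.
bad_files = df.loc[df["isBad"].eq("Yes"), "Filename"]

# Variant 1: membership test against those filenames.
mask_isin = df["Filename"].isin(bad_files)

# Variant 2: per-group "any" broadcast back to each row.
mask_transform = df["isBad"].eq("Yes").groupby(df["Filename"]).transform("any")

print((mask_isin == mask_transform).all())  # True
```

The `isin` form avoids a groupby, which some readers find easier to follow. Either form works as the first operand of the final condition.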

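Finally, set the missing values with Series.mask, chaining the mask with a second condition so that only rows whose isBad is Tolerable are affected: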

df['IssueType'] = df['IssueType'].mask(mask & df['isBad'].eq('Tolerable'))
print (df)
       Filename     Result                    IssueType      isBad
0   E0CCG5S2370        Bad                          NaN        Yes
1   E0CCG5S2370        Bad      OCR_Text Misrecognition        Yes
2   E0CCG5S2371       Good                          NaN        Yes
3   E0CCG5S2380  Tolerable                          NaN  Tolerable
4   E0CCG5S2380  Tolerable                          NaN        Yes
5   E0CCG5S2390  Tolerable    MA_Superscript,subscript.  Tolerable
6   E0CCG5S2390  Tolerable                 Extra Spaces  Tolerable
7   E0CCG5S2390  Tolerable  MA_Link missing from the DV  Tolerable
8   E0CCG5S2390  Tolerable         CS_Font Incosistency  Tolerable
9   E0CCG5S2420        Bad              MLOrphanContent        Yes
10  E0CCG5S2420        Bad                          NaN  Tolerable
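The question asks for both 'No' and 'Tolerable' to be blanked, whereas the condition above matches only 'Tolerable'. A possible end-to-end sketch that covers both values and then applies the grouping from the question is shown below. This is an assumption about the intended behaviour, not part of the original answer: the names `has_yes` and `minor` are illustrative, and the join skips empty strings so that blanked rows do not leave stray newlines.

```python
import pandas as pd

df = pd.DataFrame({
    "Filename": ["E0CCG5S239-0", "E0CCG5S239-0", "E0CCG5S242-0", "E0CCG5S242-0"],
    "Result": ["Tolerable", "Tolerable", "Bad", "Bad"],
    "IssueType": ["Extra Spaces", "CS_Font Incosistency",
                  "ML-OrphanContent", "Extra Spaces"],
    "isBad": ["Tolerable", "Tolerable", "Yes", "Tolerable"],
})

# True for every row whose file contains at least one 'Yes' row.
has_yes = df["isBad"].eq("Yes").groupby(df["Filename"]).transform("any")

# Rows whose own isBad is one of the values named in the question (assumption).
minor = df["isBad"].isin(["No", "Tolerable"])

# Series.mask sets NaN where the condition is True.
df["IssueType"] = df["IssueType"].mask(has_yes & minor)

# Group as in the question, skipping blanked (empty) entries when joining.
grouped = (
    df.fillna({"IssueType": ""})
    .groupby(["Filename", "Result"])["IssueType"]
    .apply(lambda s: "\n".join(v for v in s if v))
    .reset_index()
)
print(grouped)
```

On this sample only the last row is blanked, because E0CCG5S239-0 has no 'Yes' row and its issues are therefore kept.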
