duplicated, drop_duplicates malfunction

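Problem description

This is a follow-up to a question about merging two files with protein data.

When I import a data frame with the biopandas package, I cannot remove my duplicates with duplicated/drop_duplicates. My data frame is large: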

# df:

col1  col2  col3  col4  col5  col6    col7     col8    col9
0     ATOM  N     SER   15    17.203  0.286    72.985  4pxz
1     ATOM  CA    SER   15    16.713  1.342    73.869  4pxz
2     ATOM  C     SER   15    17.885  2.188    74.412  4pxz
3     ATOM  O     SER   15    18.028  3.351    74.013  4pxz
4     ATOM  CB    SER   15    15.889  0.750    75.014  4pxz
...   ...   ...   ...   ...   ...     ...      ...     ...
3     ATOM  CD    ARG   93    12.319  8.102    61.886  hatp
4     ATOM  NE    ARG   93    11.978  6.754    61.425  hatp
5     ATOM  CZ    ARG   93    11.731  5.714    62.217  hatp
6     ATOM  NH2   ARG   93    11.430  4.535    61.694  hatp
7     ATOM  NH1   ARG   93    11.793  5.843    63.538  hatp

3148 rows × 8 columns

I want to inspect the duplicated rows using:

df2 = df[df.duplicated(['col3','col4','col5'])] # show me duplicates containing identical type(col3), abbreviation(col4) and number(col5).

I got:

col1  col2  col3  col4  col5  col6    col7     col8    col9
2132  ATOM  CA    HIS   1063  38.442  -16.479  -5.209  4pxz
2136  ATOM  CB    HIS   1063  37.502  -15.555  -6.008  4pxz
2138  ATOM  CG    HIS   1063  38.007  -15.211  -7.378  4pxz
2140  ATOM  ND1   HIS   1063  38.342  -16.194  -8.293  4pxz
2142  ATOM  CD2   HIS   1063  38.213  -14.000  -7.943  4pxz
2144  ATOM  CE1   HIS   1063  38.749  -15.553  -9.375  4pxz
2146  ATOM  NE2   HIS   1063  38.688  -14.231  -9.213  4pxz
0     ATOM  CA    ARG   93    11.357  9.429    58.493  hatp
1     ATOM  CB    ARG   93    12.236  9.564    59.757  hatp
2     ATOM  CG    ARG   93    11.569  9.166    61.087  hatp
3     ATOM  CD    ARG   93    12.319  8.102    61.886  hatp
4     ATOM  NE    ARG   93    11.978  6.754    61.425  hatp
5     ATOM  CZ    ARG   93    11.731  5.714    62.217  hatp
6     ATOM  NH2   ARG   93    11.430  4.535    61.694  hatp
7     ATOM  NH1   ARG   93    11.793  5.843    63.538  hatp

Expected output:

col1  col2  col3  col4  col5  col6    col7     col8    col9
606   ATOM  CA    ARG   93    11.357  9.429    58.493  4pxz
609   ATOM  CB    ARG   93    12.236  9.564    59.757  4pxz
610   ATOM  CG    ARG   93    13.088  8.333    60.120  4pxz
611   ATOM  CD    ARG   93    13.985  7.822    58.995  4pxz
612   ATOM  NE    ARG   93    14.503  6.485    59.295  4pxz
613   ATOM  CZ    ARG   93    15.012  5.642    58.400  4pxz
614   ATOM  NH1   ARG   93    15.074  5.979    57.116  4pxz
615   ATOM  NH2   ARG   93    15.455  4.453    58.780  4pxz
0     ATOM  CA    ARG   93    11.357  9.429    58.493  hatp
1     ATOM  CB    ARG   93    12.236  9.564    59.757  hatp
2     ATOM  CG    ARG   93    11.569  9.166    61.087  hatp
3     ATOM  CD    ARG   93    12.319  8.102    61.886  hatp
4     ATOM  NE    ARG   93    11.978  6.754    61.425  hatp
5     ATOM  CZ    ARG   93    11.731  5.714    62.217  hatp
6     ATOM  NH2   ARG   93    11.430  4.535    61.694  hatp
7     ATOM  NH1   ARG   93    11.793  5.843    63.538  hatp

As you can see, it does not follow the instructions I gave to the duplicated() method (drop_duplicates behaves in exactly the same way). Instead I have to use:

df2 = df[df['col5'] == 93]

What is going wrong?

Tags: python, pandas

Solution


Isn't the command df.duplicated?

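Also make sure to pass the option keep=False. With the default keep='first', the first occurrence in each group of matching rows is not flagged as a duplicate, which is why the 4pxz rows for ARG 93 are missing from your output.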
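
To illustrate the effect of keep, here is a minimal sketch on a small made-up frame. The column names (atom, residue, number, source) and the values are assumptions chosen for illustration; they are not the columns biopandas produces or the data from the question.

```python
import pandas as pd

# Hypothetical miniature of the question's frame (names and values are invented).
df = pd.DataFrame(
    {
        "atom": ["CA", "CA", "CB", "CA", "CB"],
        "residue": ["SER", "ARG", "ARG", "ARG", "ARG"],
        "number": [15, 93, 93, 93, 93],
        "source": ["4pxz", "4pxz", "4pxz", "hatp", "hatp"],
    }
)

keys = ["atom", "residue", "number"]

# Default keep='first': the first row of each matching group is NOT flagged,
# so only the later copies (here the hatp rows) are returned.
later_copies = df[df.duplicated(subset=keys)]

# keep=False: every row that has a match on `keys` is flagged,
# so both the 4pxz and the hatp copies are returned.
all_copies = df[df.duplicated(subset=keys, keep=False)]

# drop_duplicates accepts the same option: keep=False drops every row that
# has a match, leaving only rows whose key combination is unique.
unique_only = df.drop_duplicates(subset=keys, keep=False)

print(later_copies)
print(all_copies)
print(unique_only)
```

In this sketch, all_copies corresponds to the kind of result shown under "Expected output" in the question, while later_copies corresponds to what the default call returned.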

