首页 > 解决方案 > Pandas remove reversed duplicates across two columns

问题描述

An example DataFrame:

df = pd.DataFrame({'node_a': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
                   'node_b': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
                   'value':  [  2,   8,   1,   8,   7,   3,   1,   3,   2]})

    node_a  node_b  value
0   X       X       2
1   X       Y       8
2   X       Z       1
3   Y       X       8
4   Y       Y       7
5   Y       Z       3
6   Z       X       1
7   Z       Y       3
8   Z       Z       2

I need to remove reversed duplicates, e.g. keep node_a = 'X', node_b = 'Y' but remove node_a = 'Y', node_b = 'X'.

Desired output:

    node_a  node_b  value
0   X       X       2
1   X       Y       8
2   X       Z       1
4   Y       Y       7
5   Y       Z       3
8   Z       Z       2

Please note I need a general solution not specific to this actual data.

标签: pythonpandasdataframe

解决方案


Let's use np.sort along axis=1 to sort node_a and node_b and assign these sorted columns to the dataframe then use drop_duplicates on the dataframe to drop the duplicate entries in dataframe based on these assigned columns:

df[['x', 'y']] = np.sort(df[['node_a', 'node_b']], axis=1)
out = df.drop_duplicates(['x', 'y']).drop(['x', 'y'], 1)

Result:

print(out)
  node_a node_b  value
0      X      X      2
1      X      Y      8
2      X      Z      1
4      Y      Y      7
5      Y      Z      3
8      Z      Z      2

推荐阅读