How to find non-one-to-one combinations in a pandas DataFrame

Problem description

I have the following DataFrame:

    bootstrap cluster_main  cluster_b  distance
1           0    Cluster 0  Cluster 1  0.002016
15          0    Cluster 0  Cluster 3  0.001282
4           0    Cluster 1  Cluster 0  0.000772
10          0    Cluster 2  Cluster 2  0.000990
26          1    Cluster 0  Cluster 2  0.001034
16          1    Cluster 2  Cluster 0  0.000159
31          1    Cluster 3  Cluster 3  0.000889
21          1    Cluster 1  Cluster 1  0.000961
35          2    Cluster 0  Cluster 3  0.099427
36          2    Cluster 1  Cluster 0  0.067036
43          2    Cluster 2  Cluster 3  0.102834
45          2    Cluster 3  Cluster 1  0.069814

I want to find the bootstrap values for which there is no one-to-one match between cluster_main and cluster_b.

In the example above, the output should be 2 and 0: in bootstrap 2, Cluster 3 in column cluster_b is "matched" twice, and the same happens for Cluster 0 in column cluster_main in bootstrap 0.

Tags: python, pandas

Solution


I believe you need:

#compare sorted values
f = lambda x: sorted(x['cluster_main']) == sorted(x['cluster_b'])
#alternative: compare sets
#f = lambda x: set(x['cluster_main']) == set(x['cluster_b'])
m = df.groupby('bootstrap').apply(f)
print (m)
bootstrap
0    False
1     True
2    False
dtype: bool

out = m.index[~m]
print (out)
Int64Index([0, 2], dtype='int64', name='bootstrap')
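
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same sorted-values comparison. It rebuilds the question's data without the original row index and without the distance column, and the names labels_match and non_one_to_one are made up for illustration. Selecting the two label columns before apply is my own choice rather than part of the answer above.

```python
import pandas as pd

# Data copied from the question; row index and distance column are left out.
df = pd.DataFrame({
    "bootstrap":    [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "cluster_main": ["Cluster 0", "Cluster 0", "Cluster 1", "Cluster 2",
                     "Cluster 0", "Cluster 2", "Cluster 3", "Cluster 1",
                     "Cluster 0", "Cluster 1", "Cluster 2", "Cluster 3"],
    "cluster_b":    ["Cluster 1", "Cluster 3", "Cluster 0", "Cluster 2",
                     "Cluster 2", "Cluster 0", "Cluster 3", "Cluster 1",
                     "Cluster 3", "Cluster 0", "Cluster 3", "Cluster 1"],
})

def labels_match(group):
    """Return True when both columns hold the same labels with the same counts."""
    return sorted(group["cluster_main"]) == sorted(group["cluster_b"])

# Select only the label columns so the grouping column is not passed to apply.
# astype(bool) is a defensive step so that ~m below is a logical NOT.
m = (
    df.groupby("bootstrap")[["cluster_main", "cluster_b"]]
    .apply(labels_match)
    .astype(bool)
)

# Keep the bootstrap values whose groups failed the comparison.
non_one_to_one = m.index[~m].tolist()
print(non_one_to_one)  # [0, 2]
```

Calling tolist() on the filtered index gives a plain Python list, which may be easier to pass around than the Index object.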

EDIT: I realized that the first solution was equivalent to comparing sets, so it was removed.

With slightly modified data you can see the difference between comparing sorted values and comparing sets:

print (df)
    bootstrap cluster_main  cluster_b  distance
1           0    Cluster 0  Cluster 1  0.002016
15          0    Cluster 0  Cluster 1  0.001282
4           0    Cluster 1  Cluster 0  0.000772
10          0    Cluster 2  Cluster 2  0.000990
26          1    Cluster 2  Cluster 0  0.001034
16          1    Cluster 0  Cluster 2  0.000159
31          1    Cluster 3  Cluster 3  0.000889
21          1    Cluster 1  Cluster 1  0.000961
35          2    Cluster 0  Cluster 0  0.099427
36          2    Cluster 2  Cluster 2  0.067036
43          2    Cluster 2  Cluster 3  0.102834
45          2    Cluster 3  Cluster 2  0.069814


#compared sorted values
f = lambda x: sorted(x['cluster_main']) == sorted(x['cluster_b'])
m = df.groupby('bootstrap').apply(f)
print (m)
bootstrap
0    False
1     True
2     True
dtype: bool

f = lambda x: set(x['cluster_main']) == set(x['cluster_b'])
m = df.groupby('bootstrap').apply(f)
print (m)
bootstrap
0    True
1    True
2    True
dtype: bool
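
The reason the two checks disagree for bootstrap 0 can be seen without pandas. Comparing sorted lists takes into account how often each label occurs, while comparing sets collapses duplicates. The sketch below uses the bootstrap 0 labels from the modified data above. The Counter comparison is only an illustration I added to make the "same labels, same counts" idea explicit.

```python
from collections import Counter

# Labels of bootstrap 0 in the modified example.
cluster_main = ["Cluster 0", "Cluster 0", "Cluster 1", "Cluster 2"]
cluster_b = ["Cluster 1", "Cluster 1", "Cluster 0", "Cluster 2"]

# Sorted lists differ because the duplicated label is not the same one.
same_sorted = sorted(cluster_main) == sorted(cluster_b)

# Sets are equal because duplicates collapse to a single element.
same_set = set(cluster_main) == set(cluster_b)

# Counter compares label counts directly and agrees with the sorted check.
same_counts = Counter(cluster_main) == Counter(cluster_b)

print(same_sorted, same_set, same_counts)  # False True False
```

So the set comparison can miss groups where a label is duplicated on each side, which is why the sorted comparison is the safer of the two here.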
