首页 > 解决方案 > 在一列中查找重复项,在另一列中查找非重复项

问题描述

我正在努力研究如何获取数据集并输出在一个列中找到重复信息而在另一列中找到非重复项的结果。如果说第 0 列和第 2 列是完全重复的,我不关心数据集,只有当第 0 列的条目在第 2 列中有多个值时。如果是这样,我想要所有与第 0 列匹配的行数。

我首先使用 concat 将数据集缩小到具有重复项的行。我现在的问题是尝试仅获取第 2 列不同的行。

我的示例数据集是:

Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF0723AFE8,device1
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF862FAF74,device2
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFF2A8AA38,device3
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFD2C0A2C6,device4
"22334",,Prod_P,Device,"22334",Prod_P,,,,SEPFFFFCF87AB31,device5
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8

在这个集合中,我想要最后三行的结果是“33333”,因为它们在第 2 列中有不止一种类型的值。“11111”只匹配 Prod_P,所以我不在乎。

import pandas as pd
ignorelist = []
inputfile = "pandas-problem-data.txt"
data = pd.read_csv(inputfile)
data.columns = data.columns.str.replace(' ','_')
data = pd.concat(g for _, g in data.groupby("Pattern_or_URI") if len(g) > 1)
data = data.loc[(data["Pattern_Usage"]=="Device"), ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"]]
new_rows = []
tempdup = pd.DataFrame()
for i, row in data.iterrows():
    if row["Pattern_or_URI"] in ignorelist:
        continue
    ignorelist.append(row["Pattern_or_URI"])
    # testdup = pd.concat(h for _, h in (data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])]).groupby("Partition") if len(h) > 1)
    # print(data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])])
    newrow = data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])]

如果我取消注释我尝试使用相同的 concat 来查找具有 "Partition" > 1 的条目的行,我会收到一个错误ValueError: No objects to concatenate。我知道它通过第一个迭代,打印语句未注释。

有没有更简单或更好的方法来做到这一点?我是熊猫的新手,一直在想可能有一种方法可以找到我还没有想出来的方法。谢谢你。

期望的输出:

Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8

标签: pythonpandas

解决方案


我认为说您正在寻找重复项有点误导。这确实是一个分组问题。

您希望在系列中找到Pattern or URI与多个唯一值相对应的相同值组Partition


transform+nunique

s = df.groupby('Pattern or URI')['Partition'].transform('nunique').gt(1)
df.loc[s]

   Pattern or URI  Route Filter Clause Partition Pattern Usage  Owning Object Owning Object Partition  Cluster ID  Catalog Name  Route String      Device Name Device Description
5           33333                  NaN    Prod_P        Device          33333                  Prod_P         NaN           NaN           NaN  SEPFFFFCF87AAEA            device6
6           33333                  NaN   Dummy_P        Device          33333                 Dummy_P         NaN           NaN           NaN  SEPFFFF18FF65A0            device7
7           33333                  NaN    Prod_P        Device          33333                  Prod_P         NaN           NaN           NaN  SEPFFFFCFCCAABB            device8

推荐阅读