Identifying the smallest subset of columns that uniquely identifies the rows of a pandas DataFrame

Problem description

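Given a pd.DataFrame with several categorical columns, what is the most efficient way to identify the subset of columns that uniquely identifies its rows (assuming such a subset exists)?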

In many cases a unique index may already exist, such as an "ID" column.

Otherwise, several columns have to be combined to form a unique identifier, such as the columns ['Name', 'Level'].
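
As a quick illustration of the two cases described above, the sketch below checks whether one column, or a combination of columns, uniquely identifies the rows. The data is invented for illustration (the original images are not available), and `is_unique_key` is a hypothetical helper, not a pandas function. It relies only on `DataFrame.duplicated` with its `subset` argument.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Name": ["A", "A", "B", "B"],
        "Level": ["x", "y", "x", "y"],
    }
)


def is_unique_key(frame, cols):
    """Return True when no two rows share the same values in `cols`."""
    return not frame.duplicated(subset=cols).any()


print(is_unique_key(df, ["ID"]))             # True
print(is_unique_key(df, ["Name"]))           # False
print(is_unique_key(df, ["Name", "Level"]))  # True
```

In this made-up frame, "ID" alone is enough, while "Name" only becomes an identifier once it is paired with "Level".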

Tags: python, pandas, dataframe, uniqueidentifier

Solution


You can try this:

from itertools import combinations

import pandas as pd

# Toy dataframe
df = pd.DataFrame(
    {
        "name": ["Georgia", "Georgia", "Florida"],
        "level 1": ["value", "other value", "other value"],
        "level 2": ["Sub-country", "Sub-country", "Country"],
        "level 3": ["Sub-country", "Country", "Sub-country"],
        "value 1": ["a", "b", "c"],
        "value 2": ["d", "e", "f"],
    }
)
print(df)
# Output:
#       name      level 1      level 2      level 3 value 1 value 2
# 0  Georgia        value  Sub-country  Sub-country       a       d
# 1  Georgia  other value  Sub-country      Country       b       e
# 2  Florida  other value      Country  Sub-country       c       f

# Setup
unique_identifiers = []
target_cols = ["name", "level 1", "level 2", "level 3"]

# Enumerate every combination of the target columns, from single columns
# up to the full set (the + 1 makes range() include the full set)
combined_cols = [
    list(cols)
    for i in range(1, len(target_cols) + 1)
    for cols in combinations(target_cols, i)
]

# Identify unique identifiers: a combination qualifies when no row
# is duplicated on those columns
for cols in combined_cols:
    if not df.duplicated(subset=cols).any():
        unique_identifiers.append(cols)

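# Keep only the smallest unique identifiers (there may be more than one).
# Compare by length: min() on lists alone would compare them lexicographically.
if unique_identifiers:
    smallest = min(len(item) for item in unique_identifiers)
    unique_identifiers = [item for item in unique_identifiers if len(item) == smallest]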

print(unique_identifiers)
# Outputs
[
    ["name", "level 1"],
    ["name", "level 3"],
    ["level 1", "level 2"],
    ["level 1", "level 3"],
    ["level 2", "level 3"],
]
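
The loop above tests every combination before filtering by length. One possible variant, sketched here on the assumption that only the smallest identifiers are wanted, tries subsets in order of increasing size and stops at the first size that yields a match. `smallest_unique_identifiers` is a hypothetical helper name, and the toy data is repeated so the block runs on its own.

```python
from itertools import combinations

import pandas as pd


def smallest_unique_identifiers(frame, candidate_cols):
    """Return every smallest column subset on which no row is duplicated.

    Subsets are tried by increasing size (1, 2, ...), and the search
    stops at the first size that produces at least one identifier.
    """
    for size in range(1, len(candidate_cols) + 1):
        found = [
            list(cols)
            for cols in combinations(candidate_cols, size)
            if not frame.duplicated(subset=list(cols)).any()
        ]
        if found:
            return found
    # Even all candidate columns together contain duplicated rows
    return []


# Same toy dataframe as above
df = pd.DataFrame(
    {
        "name": ["Georgia", "Georgia", "Florida"],
        "level 1": ["value", "other value", "other value"],
        "level 2": ["Sub-country", "Sub-country", "Country"],
        "level 3": ["Sub-country", "Country", "Sub-country"],
        "value 1": ["a", "b", "c"],
        "value 2": ["d", "e", "f"],
    }
)

result = smallest_unique_identifiers(df, ["name", "level 1", "level 2", "level 3"])
print(result)
```

On the toy data this returns the same five two-column pairs listed above, without testing any three- or four-column combination. The worst case still grows exponentially with the number of candidate columns, so this only helps when a small identifier exists.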
