Fuzzy-match a DataFrame column against a list of values and add the matched values to new columns

Problem description

I have two DataFrames:

df = pd.DataFrame(
    {
        "company name": [
            "apple inc",
            "google llc",
            "netflix, inc",
            "facebook",
            "shopify incorporated",
            "amazon.com",
        ],
        "employees": [25000, 50000, 25000, 45000, 15000, 50000],
    }
)

ticker_lookup = pd.DataFrame(
    {
        "standard company name": [
            "EBay",
            "Etsy",
            "Apple",
            "Google",
            "Netflix",
            "Facebook",
            "Shopify",
            "Amazon",
        ],
        "ticker": ["ebay", "etsy", "aapl", "googl", "nflx", "fb", "shop", "amzn"],
    }
)

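I want to fuzzy-match the values in df['company name'] against the values in ticker_lookup['standard company name'] so that I can bring the ticker into df. I also want the standard company name.

I chose RapidFuzz for this task, though I imagine FuzzyWuzzy would work as well. Using RapidFuzz, I wrote the following code: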

import pandas as pd
from rapidfuzz.fuzz import ratio
from rapidfuzz.process import extractOne

# Get list of lookup values from ticker_lookup, and create empty list
# to append results
lookup_list = list(ticker_lookup["standard company name"])
matched_values = []

# For each value in 'company name', call rapidfuzz's extractOne
# function to perform the fuzzy matching.
# This results in a list of tuples, each containing: matched value,
# similarity score, and position of the match in the list
for i in list(df["company name"]):
    matched_values.append(extractOne(i, lookup_list))

# Store results in a DataFrame
matched_df = pd.DataFrame(
    matched_values,
    columns=["standard company name", "similarity score", "index in list"],
)

# Concat results with original DataFrame
result = pd.concat([df, matched_df], axis=1)

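Although this approach works, my actual DataFrame contains over 100K records and the lookup list has about 3K entries, so the process takes a long time. I am wondering whether there is a cleaner way to do this, so that I can perform the fuzzy matching and append the results directly to df within the for loop.

Any suggestions would be greatly appreciated!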

Tags: python, pandas, fuzzywuzzy, fuzzy-logic, rapidfuzz

Solution

