Check the Levenshtein distance from each row of a Pandas DataFrame to the other rows?

Problem description

I have two dataframes:

import pandas as pd

df1 = pd.DataFrame({'text': ['hello world', 'world hello'], 'id': [11,31]})
df2 = pd.DataFrame({'test': ['hello', 'world'], 'id': [13,11]})

I want to compute the Levenshtein distance of each text row in df1 against df2 and, if the score is >= 0.9, delete that record from df1.

What I have tried:

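import Levenshtein
from tqdm import tqdm

def check_levenshtein_distance(df1, df2):
    # Collect the index labels of df1 rows that closely match any df2 entry
    score = []
    with tqdm(total=df1.shape[0]) as pbar:
        for index, row in df1.iterrows():
            for index1, row1 in df2.iterrows():
                # df2 stores its strings in the 'test' column
                dis = Levenshtein.ratio(str(row['text']), str(row1['test']))
                if dis >= 0.9:
                    score.append(index)
                    break  # one match is enough; avoids duplicate indices
            pbar.update(1)
    return score

data_d = check_levenshtein_distance(df1, df2)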

After that:

df1 = df1.drop(df1.index[data_d])

Is there a better and faster way to do this in pure pandas?

Tags: python, python-3.x, pandas, text, levenshtein-distance

Solution


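Since you have indicated that the previous solution leads to out-of-memory problems (which is not surprising, since we are generating all possible combinations), I have another suggestion. It will be a bit slower, but it does not create all possible combinations, so it uses less memory. I do want to urge you to reconsider whether a dataframe is the best approach. When working with large amounts of text, a dataframe is often not the best solution...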

import pandas
import Levenshtein

df1 = pandas.DataFrame({"text": ["hello world", "world hello"], "id": [11, 31]})
df2 = pandas.DataFrame({"test": ["hello", "world", "hello word"], "id": [13, 11, 12]})

# Make sure the types of the columns are correct
df1["text"] = df1["text"].astype(str)
df2["test"] = df2["test"].astype(str)


def filter_rows(row: pandas.Series) -> pandas.Series:

    # By default, the row doesn't need to be removed
    row["remove"] = False

    # Loop over the texts in the other dataframe
    for text in df2["test"].values:

        # Check the distance
        if Levenshtein.ratio(row["text"], text) >= 0.9:

            # Indicate that this row needs to be removed
            row["remove"] = True

            # Return the row, so don't look any further!
            return row

    # If we didn't return yet, just return the default
    return row


# Apply the function (this will create a new column called "remove", indicating if a row should be removed)
df1 = df1.apply(filter_rows, axis=1)

# Remove the rows that have the remove indication, and drop the column
df1 = df1.loc[~df1["remove"]].drop(columns=["remove"])
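
As a rough illustration of the remark above that a dataframe may not be the best container, here is a minimal sketch of the same filter over plain Python lists, using only the standard library. It assumes that `Levenshtein.ratio` behaves like a normalized indel similarity, i.e. `2 * LCS / (len(a) + len(b))`, which matches my understanding of the package but is worth verifying against the installed version. The names `indel_ratio` and `filter_texts` are hypothetical helpers, not part of any library.

```python
from typing import List


def indel_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1], computed as 2 * LCS(a, b) / (len(a) + len(b)).

    Assumption: this mirrors what Levenshtein.ratio returns.
    """
    if not a and not b:
        return 1.0
    # Row-by-row dynamic programming for the longest common subsequence length
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return 2 * prev[-1] / (len(a) + len(b))


def filter_texts(texts: List[str], references: List[str], threshold: float = 0.9) -> List[str]:
    """Keep the texts that have no reference scoring at or above the threshold."""
    # any() stops at the first match, like the early return in filter_rows
    return [
        text
        for text in texts
        if not any(indel_ratio(text, ref) >= threshold for ref in references)
    ]


texts = ["hello world", "world hello"]
references = ["hello", "world", "hello word"]
kept = filter_texts(texts, references)
```

With the sample data, "hello world" scores 20/21 against "hello word" and is dropped, so only "world hello" remains. A pure-Python scorer like this will be much slower than the C implementation in the Levenshtein package; the point is only to show the structure of the filter without a dataframe.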

Previous answer:

Try it this way:

import pandas
import Levenshtein

df1 = pandas.DataFrame({"text": ["hello world", "world hello"], "id": [11, 31]})
df2 = pandas.DataFrame({"test": ["hello", "world", "hello word"], "id": [13, 11, 12]})

# Create all possible combinations by joining the dataframes on a fictional key
df1["key"] = 0
df2["key"] = 0
df = df1.merge(df2, on="key").drop(columns=["key"])

# Calculate the distances for all possible combinations
df["distance"] = df.apply(lambda row: Levenshtein.ratio(str(row["text"]), str(row["test"])), axis=1)

# Use the distances as a filter: keep only the df1 rows that have no match scoring 0.9 or higher
df1 = df1.loc[~df1["id"].isin(df.loc[df["distance"] >= 0.9, "id_x"])].drop(columns=["key"])
