python - Check the Levenshtein distance of each row in a Pandas DataFrame against other rows?
Problem description
I have two data frames:
df1 = pd.DataFrame({'text': ['hello world', 'world hello'], 'id': [11,31]})
df2 = pd.DataFrame({'test': ['hello', 'world'], 'id': [13,11]})
I want to compute the Levenshtein score of each text row in df1 against df2, and if the score is >= 0.9, remove that record from df1.
What I tried:
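import Levenshtein
from tqdm import tqdm

def check_levenshtein_distance(df1, df2):
    score = []
    with tqdm(total=df1.shape[0]) as pbar:
        for index, row in df1.iterrows():
            for index1, row1 in df2.iterrows():
                # df2 stores its strings in the 'test' column, not 'text'
                dis = Levenshtein.ratio(str(row['text']), str(row1['test']))
                if dis >= 0.9:
                    score.append(index)
                    break  # one match is enough, so stop scanning df2
            pbar.update(1)  # advance the bar once per df1 row
    return score  # the original returned the undefined name `check`

data_d = check_levenshtein_distance(df1, df2)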
Afterwards:
df1 = df1.drop(df1.index[data_d])
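Is there a better and faster way to do this in pure pandas?
Solution
Since you pointed out that the previous solution runs out of memory (which is not surprising, because it generates every possible combination), I have another suggestion. It will be a bit slower, but it does not build all combinations at once, so it uses far less memory. I do want to urge you to reconsider whether a data frame is the right tool here. When working with large amounts of text, a data frame is often not the best solution...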
import pandas
import Levenshtein
df1 = pandas.DataFrame({"text": ["hello world", "world hello"], "id": [11, 31]})
df2 = pandas.DataFrame({"test": ["hello", "world", "hello word"], "id": [13, 11, 12]})
# Make sure the types of the columns are correct
df1["text"] = df1["text"].astype(str)
df2["test"] = df2["test"].astype(str)
def filter_rows(row: pandas.Series) -> pandas.Series:
    # By default, the row doesn't need to be removed
    row["remove"] = False
    # Loop over the texts in the other dataframe
    for text in df2["test"].values:
        # Check the similarity score
        if Levenshtein.ratio(row["text"], text) >= 0.9:
            # Indicate that this row needs to be removed
            row["remove"] = True
            # Return the row, so don't look any further!
            return row
    # If we didn't return yet, just return the default
    return row
# Apply the function (this will create a new column called "remove", indicating if a row should be removed)
df1 = df1.apply(filter_rows, axis=1)
# Remove the rows that have the remove indication, and drop the column
df1 = df1.loc[~df1["remove"]].drop(columns=["remove"])
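The same idea can also be written as a boolean mask, which avoids adding and then dropping a temporary column. The following is only a sketch: the names `targets`, `is_near_duplicate`, `mask` and `result` are made up for illustration, and the `except ImportError` branch is a pure-Python stand-in that computes a normalized indel similarity, 2 * LCS / (len(a) + len(b)), which is my understanding of what `Levenshtein.ratio` returns. Treat that equivalence as an assumption and prefer the real package when it is installed.

```python
import pandas as pd

try:
    from Levenshtein import ratio
except ImportError:
    # Assumed stand-in: normalized indel similarity via longest common subsequence
    def ratio(a: str, b: str) -> float:
        if not a and not b:
            return 1.0
        prev = [0] * (len(b) + 1)
        for ca in a:
            cur = [0]
            for j, cb in enumerate(b, 1):
                cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
            prev = cur
        return 2 * prev[-1] / (len(a) + len(b))

df1 = pd.DataFrame({"text": ["hello world", "world hello"], "id": [11, 31]})
df2 = pd.DataFrame({"test": ["hello", "world", "hello word"], "id": [13, 11, 12]})

# Compare against each distinct string only once
targets = df2["test"].astype(str).unique()

def is_near_duplicate(text: str, threshold: float = 0.9) -> bool:
    # any() stops at the first string that reaches the threshold
    return any(ratio(text, target) >= threshold for target in targets)

# True for every df1 row that is too similar to some df2 string
mask = df1["text"].astype(str).map(is_near_duplicate)
result = df1.loc[~mask]
print(result)
```

With the sample data, "hello world" matches "hello word" closely enough to be dropped, so only the row with id 31 should remain. Memory use stays at one comparison at a time, as in the `apply` version above.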
Previous answer:
Try it this way:
import pandas
import Levenshtein
df1 = pandas.DataFrame({"text": ["hello world", "world hello"], "id": [11, 31]})
df2 = pandas.DataFrame({"test": ["hello", "world", "hello word"], "id": [13, 11, 12]})
# Create all possible combinations by joining the dataframes on a fictional key
df1["key"] = 0
df2["key"] = 0
df = df1.merge(df2, on="key").drop(columns=["key"])
# Calculate the distances for all possible combinations
df["distance"] = df.apply(lambda row: Levenshtein.ratio(str(row["text"]), str(row["test"])), axis=1)
# Drop every df1 row that has at least one combination scoring >= 0.9 (and drop the helper key)
df1 = df1.loc[~df1["id"].isin(df.loc[df["distance"] >= 0.9, "id_x"])].drop(columns=["key"])
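If the full cross join does not fit in memory, a middle ground is to build the combinations one slice of df1 at a time. This is a sketch rather than a tuned implementation: the function name `drop_similar_chunked`, its `chunk_size` parameter and the helper column `_pos` are hypothetical, `merge(how="cross")` needs pandas 1.2 or newer, and the `except ImportError` branch is an assumed pure-Python stand-in (normalized indel similarity) for `Levenshtein.ratio`.

```python
import pandas as pd

try:
    from Levenshtein import ratio
except ImportError:
    # Assumed stand-in: normalized indel similarity via longest common subsequence
    def ratio(a: str, b: str) -> float:
        if not a and not b:
            return 1.0
        prev = [0] * (len(b) + 1)
        for ca in a:
            cur = [0]
            for j, cb in enumerate(b, 1):
                cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
            prev = cur
        return 2 * prev[-1] / (len(a) + len(b))

def drop_similar_chunked(df1, df2, threshold=0.9, chunk_size=1000):
    """Return df1 without rows whose text scores >= threshold against any df2 text."""
    right = df2[["test"]].astype(str).drop_duplicates()
    if right.empty:
        return df1.copy()
    kept = []
    for start in range(0, len(df1), chunk_size):
        chunk = df1.iloc[start:start + chunk_size]
        left = chunk[["text"]].astype(str)
        left["_pos"] = range(len(chunk))  # position of each row inside the chunk
        # Only chunk_size * len(right) combinations exist at any one time
        pairs = left.merge(right, how="cross")
        pairs["score"] = [ratio(a, b) for a, b in zip(pairs["text"], pairs["test"])]
        # groupby sorts by _pos, so the result lines up with the chunk's row order
        hit = pairs.groupby("_pos")["score"].max() >= threshold
        kept.append(chunk[~hit.to_numpy()])
    return pd.concat(kept) if kept else df1.copy()

df1 = pd.DataFrame({"text": ["hello world", "world hello"], "id": [11, 31]})
df2 = pd.DataFrame({"test": ["hello", "world", "hello word"], "id": [13, 11, 12]})

result = drop_similar_chunked(df1, df2, chunk_size=1)
print(result)
```

A larger `chunk_size` means fewer merges but a bigger intermediate frame, so the value is a trade-off to tune against the available memory.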