How to compute the Levenshtein distance for each tokenized word and return the count of distances equal to 0 in a column

Problem description

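I have two sets of text: descriptions stored in a DataFrame, and a list of words. I need to compute the Levenshtein distance between every word in each description and every word in the list, and return the number of comparisons whose Levenshtein distance equals 0.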

import pandas as pd


definitions=['very','similarity','seem','scott','hello','names']
# initialize list of lists 
data = [['hello my name is Scott'], ['I went to the mall yesterday'], ['This seems very similar']] 

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Descriptions']) 

# print dataframe. 
df 

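The new column should hold, for each row, the number of words in the description whose Levenshtein distance to a word in the dictionary is 0:

df['lev_count_0'] = <per-row count of words whose Levenshtein distance to a dictionary word is 0>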

For example, for the first word of the first row:

edit_distance("hello","very") # This will be equal to 4
edit_distance("hello","similarity") # this will be equal to 9
edit_distance("hello","seem") # This will be equal to 4
edit_distance("hello","scott") # This will be equal to 5
edit_distance("hello","hello")# This will be equal to 0
edit_distance("hello","names") # this will be equal to 5

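So for the first row of df['lev_count_0'] the result should be 1, because comparing every word in the description against the definitions list produces exactly one distance of 0.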

Description               | lev_count_0
hello my name is Scott    |      1
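
The expected value in the table can be checked without any third-party library. The following is a minimal sketch, assuming a hand-written `levenshtein` helper and a `count_zero_distance` function (both hypothetical names, not part of the original post) as a stand-in for `nltk.edit_distance`. It counts the zero-distance pairs for the first description against the question's `definitions` list.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute, each cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


definitions = ['very', 'similarity', 'seem', 'scott', 'hello', 'names']


def count_zero_distance(description, words):
    """Count (token, word) pairs whose edit distance is 0."""
    return sum(1
               for token in description.split(" ")
               for word in words
               if levenshtein(token, word) == 0)


first_row = count_zero_distance('hello my name is Scott', definitions)
print(first_row)
```

Note that the comparison is case-sensitive, so "Scott" and "scott" are at distance 1 and do not contribute to the count.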


Tags: python, python-3.x, string, nlp, levenshtein-distance

Solution


My solution

from nltk import edit_distance
import pandas as pd


data = [['hello my name is Scott'], ['I went to the mall yesterday'], ['This seems very similar']] 

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Descriptions']) 

dictionary=['Hello', 'my']


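def lev_dist(colum):
    # Count (word, dictionary entry) pairs whose edit distance is 0
    count = 0
    dataset = colum.split(" ")  # tokenize the description on single spaces
    for word in dataset:
        for dic in dictionary:
            result = edit_distance(word, dic)
            if result == 0:
                count += 1
    return count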




df['count_lev_0'] = df.Descriptions.apply(lev_dist)
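
An edit distance of 0 means the two strings are identical, so the same count can be obtained with a plain equality check and no distance computation. The sketch below assumes a hypothetical helper `count_exact_matches` with an optional `ignore_case` flag. The flag is an addition for illustration only: the code in the answer is case-sensitive, which is why 'Hello' in its dictionary does not match 'hello' in the first description.

```python
dictionary = ['Hello', 'my']


def count_exact_matches(description, words, ignore_case=False):
    """Count (token, word) pairs that are identical, i.e. at edit distance 0."""
    if ignore_case:
        # Assumption: lowercasing both sides is an acceptable normalization
        description = description.lower()
        words = [w.lower() for w in words]
    return sum(token == word
               for token in description.split(" ")
               for word in words)


case_sensitive = count_exact_matches('hello my name is Scott', dictionary)
case_insensitive = count_exact_matches('hello my name is Scott', dictionary,
                                       ignore_case=True)
print(case_sensitive, case_insensitive)
```

With the DataFrame from the answer, this could be applied as df['count_lev_0'] = df['Descriptions'].apply(lambda s: count_exact_matches(s, dictionary)).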

