Computing the cosine similarity of two directories of files in Python

Question

I have two directories of files. One contains human transcripts and the other contains IBM Watson transcripts. The two directories hold the same number of files, and both were transcribed from the same phone call recordings.

I am computing cosine similarity between matching files using spaCy's .similarity, and printing or storing the result together with the names of the compared files. Besides a for loop, I have also tried iterating with a function, but I cannot find a way to iterate over both directories, compare the two files at matching indexes, and print the result.

Here is my current code:

# iterate through files in both directories
for human_file, api_file in os.listdir(human_directory), os.listdir(api_directory):
    # set the documents to be compared and parse them through the small spacy nlp model
    human_model = nlp_small(open(human_file).read())
    api_model = nlp_small(open(api_file).read())
    
    # print similarity score with the names of the compared files
    print("Similarity using small model:", human_file, api_file, human_model.similarity(api_model))

I have gotten it to iterate over a single directory and confirmed it produces the expected output by printing the file names, but it does not work with two directories. I have also tried something like this:

# define directories
human_directory = os.listdir("./00_data/Human Transcripts")
api_directory = os.listdir("./00_data/Watson Scripts")

# function for cosine similarity of files in two directories using small model
def nlp_small(human_directory, api_directory):
    for i in (0, (len(human_directory) - 1)):
        print(human_directory[i], api_directory[i])

nlp_small(human_directory, api_directory)

This returns:

human_10.txt watson_10.csv
human_9.txt watson_9.csv

But that is only two of the files, not all 17.

Any pointers on iterating over matching indexes across two directories would be much appreciated.

Edit: thanks to @kevinjiang, here is the working code block:

import os
import spacy

# load the small English model once
nlp_small = spacy.load("en_core_web_sm")

# set the directories containing transcripts
human_directory = os.path.join(os.getcwd(), "00_data", "Human Transcripts")
api_directory = os.path.join(os.getcwd(), "00_data", "Watson Scripts")

# iterate through files in both directories
for human_file, api_file in zip(os.listdir(human_directory), os.listdir(api_directory)):
    # set the documents to be compared and parse them through the small spacy nlp model
    human_model = nlp_small(open(os.path.join(human_directory, human_file)).read())
    api_model = nlp_small(open(os.path.join(api_directory, api_file)).read())

    # print similarity score with the names of the compared files
    print("Similarity using small model:", human_file, api_file, human_model.similarity(api_model))

Here is most of the output (I still need to fix a UTF-16 character in one of the files, which stops the loop):

Similarity using small model: human_10.txt watson_10.csv 0.9274665883462793
Similarity using small model: human_11.txt watson_11.csv 0.9348740684005554
Similarity using small model: human_12.txt watson_12.csv 0.9362025469343344
Similarity using small model: human_13.txt watson_13.csv 0.9557355330988958
Similarity using small model: human_14.txt watson_14.csv 0.9088701120190216
Similarity using small model: human_15.txt watson_15.csv 0.9479464053189846
Similarity using small model: human_16.txt watson_16.csv 0.9599724037676819
Similarity using small model: human_17.txt watson_17.csv 0.9367605599306302
Similarity using small model: human_18.txt watson_18.csv 0.8760760037870665
Similarity using small model: human_2.txt watson_2.csv 0.9184563762823503
Similarity using small model: human_3.txt watson_3.csv 0.9287452822270265
Similarity using small model: human_4.txt watson_4.csv 0.9415664367046419
Similarity using small model: human_5.txt watson_5.csv 0.9158895909429551
Similarity using small model: human_6.txt watson_6.csv 0.935313240861153

Once I have fixed the character encoding error, I will wrap this in a function so I can call the large or small model on any two directories, for the remaining APIs I have to test.
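The planned wrapper could look something like the sketch below. `score_transcripts` is a hypothetical name, and the loaded model is passed in as an argument so the same function works for `en_core_web_sm` or `en_core_web_lg`. The explicit `encoding` and `errors` arguments to `open()` are one way to keep a stray badly-encoded character from stopping the loop, and pairing by `sorted()` name assumes that matching files sort identically in both directories:

```python
import os

def score_transcripts(nlp, human_directory, api_directory):
    """Compare matching transcripts from two directories with a loaded spaCy model.

    `nlp` is any loaded model (e.g. spacy.load("en_core_web_sm")).
    Files are paired by sorted name, so human_2.txt lines up with
    watson_2.csv regardless of os.listdir() order.
    """
    results = []
    human_files = sorted(os.listdir(human_directory))
    api_files = sorted(os.listdir(api_directory))
    for human_file, api_file in zip(human_files, api_files):
        # errors="replace" substitutes undecodable bytes instead of raising
        with open(os.path.join(human_directory, human_file),
                  encoding="utf-8", errors="replace") as f:
            human_doc = nlp(f.read())
        with open(os.path.join(api_directory, api_file),
                  encoding="utf-8", errors="replace") as f:
            api_doc = nlp(f.read())
        results.append((human_file, api_file, human_doc.similarity(api_doc)))
    return results
```

Called as `score_transcripts(nlp_small, human_directory, api_directory)` it returns a list of `(human_file, api_file, score)` tuples instead of printing, which makes it easier to store the scores for each API.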

Tags: python, nlp, spacy, cosine-similarity

Solution

Two small errors are preventing you from looping through.

First, in your original loop I think you will get some kind of `too many values to unpack` error. To loop over two iterables simultaneously, use zip(), so it should look like

for human_file, api_file in zip(os.listdir(human_directory), os.listdir(api_directory)):

Second, in the for loop of your other example you only visit index 0 and index (len(human_directory) - 1), because (0, len(human_directory) - 1) is a two-element tuple, not a range. Instead, use for i in range(len(human_directory)):, which should let you loop over everything.
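The difference between the two-element tuple and a full index range can be seen with plain lists (the file names below are just stand-ins for your directory listings):

```python
human_files = ["human_10.txt", "human_11.txt", "human_2.txt"]
api_files = ["watson_10.csv", "watson_11.csv", "watson_2.csv"]

# (0, len(...) - 1) is a two-element tuple, so only the first
# and last entries are visited:
partial = [human_files[i] for i in (0, len(human_files) - 1)]
print(partial)  # ['human_10.txt', 'human_2.txt']

# range() covers every index...
full = [human_files[i] for i in range(len(human_files))]

# ...and zip() pairs the two lists directly, with no indexing at all:
pairs = list(zip(human_files, api_files))
print(pairs[0])  # ('human_10.txt', 'watson_10.csv')
```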

