Intersection of a Python dictionary and a text file

Problem description

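I'm working through an NLP exercise and would like some help understanding the best way to get the result. I have two text files: one is a word list (a vocabulary, say) and the other is an article. I need to count how often each word from the word list occurs in the input article.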

I'm trying to do this step by step so that I improve my skills.

I've imported the text and tokenized/split the words in both files, and now I'm putting the words from the article into a dictionary.

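My next step (I assume) is to find the intersection of the dictionary and the word-list text file, and return the frequency of each word-list entry that appears in my article.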

wordlist = terms.split()
splittext = input_article.split()
freq = {}
for term in splittext:
    if term in freq:
        freq[term] += 1
    else: freq[term] = 1
#print(freq)

result = {i for i in wordlist if i in freq.keys()}
print(result)

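The code above is what I have so far, but the last line is where I'm stuck. I have all the words from the article in a dictionary... now I want to return the frequency of each vocabulary entry in the input article.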

Any tips on how to achieve this?

Tags: python, nltk, text-processing

Solution


As far as I understand the question, this should work:

text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum"

key = "? Lorem Ipsum more was not the with 123 test notin desktop"

freq = {}     # word -> number of occurrences in the text
matches = {}  # vocabulary word -> frequency, only for words found in the text
words = text.split(" ")
keys = key.split(" ")

# Count every word in the text
for word in words:
    if word in freq:
        freq[word] += 1
    else:
        freq[word] = 1

# Look up each vocabulary word and keep only those present in the text
for k in keys:
    if k in freq:
        print("Key: {} freq: {}".format(k, freq[k]))
        matches[k] = freq[k]

print(matches)

Output (the final dictionary):

{'Lorem': 4, 'Ipsum': 4, 'more': 1, 'was': 1, 'not': 1, 'the': 6, 'with': 2, 'desktop': 1}
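
The same idea can be written more compactly with collections.Counter from the standard library. This is only a minimal sketch, not the original answer's method: the function name vocab_frequencies and the sample strings are hypothetical, and it assumes that plain whitespace splitting is good enough for tokenizing. Unlike the loop above, it reports 0 for vocabulary words that never appear in the article.

```python
from collections import Counter

def vocab_frequencies(article, vocabulary):
    """Return {vocabulary word: number of occurrences in the article}."""
    # Counter maps each whitespace-separated token to its count
    counts = Counter(article.split())
    # A Counter returns 0 for missing keys, so absent words map to 0
    return {word: counts[word] for word in vocabulary.split()}

# Hypothetical sample data, for illustration only
article = "the cat sat on the mat with the other cat"
vocabulary = "cat the dog"
print(vocab_frequencies(article, vocabulary))
# {'cat': 2, 'the': 3, 'dog': 0}
```

To keep only the words that actually occur, add a filter to the comprehension, for example "if word in counts". Note that splitting on whitespace leaves punctuation attached to words, so "industry." and "industry" are counted as different tokens. If that matters, strip punctuation and normalize case before counting.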
