How to use nltk to count the frequency of words in a text

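Question

I have a Python script that reads text and applies preprocessing functions to it for analysis.
The problem is that I want to count word frequencies, but the script crashes with the following error.

File "F:\AIenv\textAnalysis\setup.py", line 208, in tag_and_save
    file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
TypeError: tuple indices must be integers or slices, not str

I am trying to compute the frequencies and then write them to a text file.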

def get_freq(tagged):
    freqs = FreqDist(tagged)
    for word, freq in freqs.items():
        print(word, freq)
    result = word,freq
    return result

def tag_and_save(tagger,text,path):
    clt = clean_text(text)
    tagged_data = tagger.tag(clt)

    freq_tagged_data = get_freq(tagged_data)
    file = open(path,"w",encoding = "UTF8")
    for word,tag in tagged_data:
        file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
    file.close()

I expect output like this:

('*****/DTNN') 3


Based on the answer, I changed the function get_freq() to:

def get_freq(tagged):
    freq_dist = {}
    freqs = FreqDist(tagged)
    freq_dist = [(word, freq) for word ,freq in freqs.items()]
    return freq_dist

But now it shows the following error:

File "F:\AIenv\textAnalysis\setup.py", line 217, in tag_and_save
    file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
TypeError: list indices must be integers or slices, not str

How can I fix this error?

Tags: python, nlp, nltk, word-frequency

Answer

Maybe this will help.

import nltk
text = "An an valley indeed so no wonder future nature vanity. Debating all she mistaken indulged believed provided declared. He many kept on draw lain song as same. Whether at dearest certain spirits is entered in to. Rich fine bred real use too many good. She compliment unaffected expression favourable any. Unknown chiefly showing to conduct no."
tokens = [t for t in text.split()]
freqs = nltk.FreqDist(tokens)
blah_list = [(k, v) for k, v in freqs.items()]
print(blah_list)

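This snippet counts word frequencies: FreqDist maps each token to the number of times it occurs, and the list comprehension turns that mapping into a list of (token, count) pairs.

Edit: the code is now working.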
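For the code in the question, both errors seem to come from the same place: get_freq() returns a tuple (first version) or a list (second version), and neither can be indexed with a word string. A minimal sketch of one possible fix is below. It assumes the tagger returns a list of (word, tag) tuples, as the loop in the question suggests. The names write_tagged and sample are hypothetical and only for illustration. Since FreqDist is a subclass of collections.Counter, the sketch falls back to Counter if nltk is not installed.

```python
import io
from collections import Counter

try:
    from nltk import FreqDist  # FreqDist subclasses collections.Counter
except ImportError:
    FreqDist = Counter  # fallback so the sketch also runs without nltk


def get_freq(tagged):
    # Return the whole mapping (not a tuple or a list) so it can be
    # looked up by key later.
    return FreqDist(tagged)


def write_tagged(tagged_data, out):
    # Hypothetical helper standing in for the writing part of tag_and_save().
    freqs = get_freq(tagged_data)
    # The keys are (word, tag) tuples, because that is what was counted,
    # so freqs[word] alone would not find anything.
    for (word, tag), count in freqs.items():
        out.write(f"{word}/{tag} (frequency={count})\n")


# Hypothetical tagger output, standing in for tagger.tag(clean_text(text)).
sample = [("the", "DT"), ("cat", "NN"), ("the", "DT"), ("the", "DT")]
buffer = io.StringIO()
write_tagged(sample, buffer)
result = buffer.getvalue()
print(result)
```

Iterating over freqs.items() writes each distinct (word, tag) pair once. To keep one line per token as in the original loop, iterate over tagged_data instead and look up freqs[(word, tag)].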

