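python - Python Text Summarizer - Maintaining Sentence Order
Problem description
I'm teaching myself Python and have finished a basic text summarizer. I'm almost happy with the summarized text, but I want to polish the final product a bit more.
The code performs some standard text processing correctly (tokenization, stopword removal, etc.). It then scores each sentence based on weighted word frequency. I'm using the heapq.nlargest() method to return the top 7 sentences, which I think do a good job on my sample text.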
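For readers who want the scoring idea in isolation, here is a minimal sketch of weighted word frequency. It assumes the words are already lower-cased and stopword-free; the helper names weighted_frequencies and score_sentence and the toy word list are made up for illustration and are not part of the original code.

```python
from collections import Counter

def weighted_frequencies(words):
    """Map each word to its count divided by the highest count."""
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: n / top for word, n in counts.items()}

def score_sentence(sentence_words, weights):
    """Sum the weights of the words in one sentence; unknown words add 0."""
    return sum(weights.get(word, 0.0) for word in sentence_words)

# Toy input standing in for the cleaned token list (hypothetical data).
words = ["bank", "inquiry", "bank", "customers", "bank", "inquiry"]
weights = weighted_frequencies(words)
print(weights)
print(score_sentence(["bank", "customers"], weights))
```

The most frequent word always gets weight 1.0, so a sentence's score grows with both its length and how many common words it contains.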
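The problem I'm facing is that the top 7 sentences are returned ordered from highest score to lowest score. I understand why that happens. I would rather keep the sentences in the same order as in the original text. I've included the relevant code and hope someone can point me toward a solution.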
import heapq
import nltk

#tokens, stoplist and sentence_list come from earlier preprocessing steps (not shown)

#remove all stopwords from text, build clean list of lower case words
clean_data = []
for word in tokens:
    if str(word).lower() not in stoplist:
        clean_data.append(word.lower())

#build dictionary of all words with frequency counts: {key:value = word:count}
word_frequencies = {}
for word in clean_data:
    if word not in word_frequencies.keys():
        word_frequencies[word] = 1
    else:
        word_frequencies[word] += 1
#print(word_frequencies.items())

#update the dictionary with a weighted frequency
maximum_frequency = max(word_frequencies.values())
#print(maximum_frequency)
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)
#print(word_frequencies.items())

#iterate through each sentence and combine the weighted score of the underlying word
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
#print(sentence_scores.items())

summary_sentences = heapq.nlargest(7, sentence_scores, key = sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
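To make the ordering behaviour concrete, here is a small self-contained sketch. The sentences and scores are invented placeholders; the dictionary's insertion order stands in for the order of the sentences in the article.

```python
import heapq

# Hypothetical scores; insertion order mirrors the order in the source text.
sentence_scores = {
    "First sentence.": 0.4,
    "Second sentence.": 0.7,
    "Third sentence.": 0.1,
    "Fourth sentence.": 0.9,
}

# nlargest returns the keys ranked by score, highest first.
top = heapq.nlargest(2, sentence_scores, key=sentence_scores.get)
print(top)  # ['Fourth sentence.', 'Second sentence.']
```

"Fourth sentence." comes out ahead of "Second sentence." because heapq.nlargest() orders its result by the key, not by position in the input.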
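I'm testing with the following article: https://www.bbc.com/news/world-australia-45674716
The bank customers who lost everything He also criticised regulators over misconduct by banks and financial firms. It also received more than 9,300 allegations of misconduct by banks, financial advisers, pension funds and insurance companies.
As an example of the desired output: the third sentence above, "A royal commission this year, the country's highest form of public inquiry, has exposed widespread wrongdoing in the industry." actually appears before "Australia bank inquiry: 'They didn't care who they hurt'" in the original article, and I would like the output to keep that sentence order.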
Solution
Got it working. Leaving this here in case anyone else is curious:
from operator import itemgetter

#iterate through each sentence and combine the weighted score of the underlying word
#each value is a list: [score, position of the sentence in the original text]
sentence_scores = {}
cnt = 0
for sent in sentence_list:
    score = 0
    #only sentences with fewer than 30 space-separated words are scored; longer ones keep a score of 0
    if len(sent.split(' ')) < 30:
        for word in nltk.word_tokenize(sent.lower()):
            if word in word_frequencies:
                score += word_frequencies[word]
    sentence_scores[sent] = [score, cnt]
    cnt = cnt + 1

#Sort the dictionary by score in descending order and keep the top 7 sentences
top7 = dict(sorted(sentence_scores.items(), key=itemgetter(1), reverse = True)[0:7])
#print(top7)

#Re-sort the top 7 [score, position] pairs by position in ascending order
def Sort(sub_li):
    return sorted(sub_li, key = lambda item: item[1])

sentence_summary = Sort(top7.values())

#Put them in 1 string variable, separated by spaces
summary = ""
for value in sentence_summary:
    for key in top7:
        if top7[key] == value:
            summary = summary + key + ' '
print(summary.strip())
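A more compact variant of the same idea, offered as a sketch rather than a replacement: rank sentence positions instead of sentence strings, then sort the winning positions. The function name summarize_in_order and the parallel lists sentences and scores are assumptions made for this example; they do not appear in the code above.

```python
import heapq

def summarize_in_order(sentences, scores, n=7):
    """Pick the n best-scoring sentences, returned in original order."""
    # Rank positions rather than strings so repeated sentences stay distinct.
    best = heapq.nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    # Sorting the positions restores the order of the source text.
    return [sentences[i] for i in sorted(best)]

# Hypothetical data: one score per sentence, in document order.
sentences = ["A.", "B.", "C.", "D.", "E."]
scores = [0.2, 0.9, 0.1, 0.8, 0.5]
print(' '.join(summarize_in_order(sentences, scores, n=3)))  # B. D. E.
```

Because the positions are carried through the ranking step, no lookup from value back to key is needed afterwards.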