python - Is there a way to find the weightage of a sentence with TF-IDF in Python?
Question
I have a list:
x=["hello there","hello world","my name is john"]
I have vectorized it with TF-IDF.
This is the TF-IDF output:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["hello there", "hello world", "my name is john"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X.toarray()

array([[0.60534851, 0.        , 0.        , 0.        , 0.        ,
        0.79596054, 0.        ],
       [0.60534851, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.79596054],
       [0.        , 0.5       , 0.5       , 0.5       , 0.5       ,
        0.        , 0.        ]])
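For readers wondering where these numbers come from, here is a minimal pure-Python sketch that reproduces the matrix above. It assumes the TfidfVectorizer defaults (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row) and uses plain whitespace splitting in place of scikit-learn's tokenizer, which is enough for this small corpus. The names tfidf_row, vocab and matrix are made up for illustration.

```python
import math
from collections import Counter

corpus = ["hello there", "hello world", "my name is john"]

# Assumption: lowercasing plus whitespace splitting is enough for this corpus.
docs = [doc.lower().split() for doc in corpus]
vocab = sorted({word for doc in docs for word in doc})  # column order of the matrix
n_docs = len(docs)

# Document frequency: the number of documents that contain each word.
df = Counter(word for doc in docs for word in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}


def tfidf_row(tokens):
    """Raw count times IDF for every vocabulary word, then L2-normalized."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


matrix = [tfidf_row(doc) for doc in docs]

# Print only the non-zero weights of each sentence, labelled by word.
for sentence, row in zip(corpus, matrix):
    weights = {word: round(v, 4) for word, v in zip(vocab, row) if v}
    print(sentence, weights)
```

Each column corresponds to one word of the sorted vocabulary (hello, is, john, my, name, there, world), so every value in the matrix is the weight of a single word within a single sentence.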
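Can we find the weight of each sentence (compared against all documents)?
If so, how?
Answer
I believe that with TF-IDF you can only compute the weight of an individual word within a sentence (or document), which means you cannot use it directly to compute the weight of a sentence relative to other sentences or documents.
However, from this page I learned how TF-IDF works. You can "abuse" the functions they provide by adapting them to what you specifically need. Allow me to demonstrate: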
import math

corpus = ["hello there", "hello world"]

# Read the document that the sentences are scored against.
with open("your_document.txt", "r") as file:
    text = file.read()

def computeTF(sentences, document):
    # Count how often each whole sentence occurs in the document.
    tf = {s: document.count(s) for s in sentences}
    filelen = len(document.split())
    for s in sentences:
        # Since we're counting a whole sentence (containing >= 1 words), each
        # occurrence of s counts as a single word, so the document length shrinks
        # by (words in s - 1) per occurrence.
        filelen -= tf[s] * (len(s.split()) - 1)
    # The adjusted document length is only known after the loop above, so the
    # actual frequencies are computed in a second pass.
    return {s: (tf[s] / filelen if filelen > 0 else 0.0) for s in sentences}

def computeIDF(tf, sentences):
    # With a single document there is no real document frequency, so this is a
    # stand-in: every sentence that occurs gets log10(number of sentences) and
    # every sentence that does not occur gets 0.
    N = len(tf)
    return {s: (math.log10(N) if tf[s] > 0 else 0) for s in sentences}

tf = computeTF(corpus, text)
idf = computeIDF(tf, corpus)

for s in corpus:
    print("Sentence: {}, TF: {}, IDF: {}, TF-IDF: {}".format(s, tf[s], idf[s], tf[s] * idf[s]))
This code example only looks at a single text file, but you can easily extend it to look at multiple text files.
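One way that extension might look is sketched below. It is not part of the original answer: the documents are hypothetical in-memory strings standing in for the contents of several files, and the helper names sentence_tf and sentence_idf are made up for illustration. With several documents the IDF can be a real inverse document frequency, log10(N / df), where df is the number of documents that contain the sentence.

```python
import math

sentences = ["hello there", "hello world"]

# Hypothetical documents standing in for the contents of several text files.
documents = {
    "doc_a": "hello there and hello world",
    "doc_b": "hello world once more",
    "doc_c": "nothing relevant in this one",
}


def sentence_tf(sentence, document):
    """Occurrences of the sentence divided by the document length,
    counting every occurrence of the sentence as a single token."""
    # Caveat: str.count matches substrings, so "hello there" would also
    # match inside "hello thereafter".
    occurrences = document.count(sentence)
    n_tokens = len(document.split()) - occurrences * (len(sentence.split()) - 1)
    return occurrences / n_tokens if n_tokens > 0 else 0.0


def sentence_idf(sentence, docs):
    """log10(N / df), or 0 when no document contains the sentence."""
    df = sum(1 for text in docs.values() if sentence in text)
    return math.log10(len(docs) / df) if df else 0.0


# TF-IDF score of every sentence in every document.
scores = {
    (name, s): sentence_tf(s, text) * sentence_idf(s, documents)
    for name, text in documents.items()
    for s in sentences
}

for (name, s), score in scores.items():
    print(f"{name} | {s!r}: {score:.4f}")
```

Unlike computeTF above, this sketch adjusts the document length for one sentence at a time rather than for all sentences together, which keeps each score independent of which other sentences are being looked up.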