python - How to implement Tf-idf without using Python libraries?
Problem description
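Sklearn makes a few tweaks in its implementation of the TFIDF vectorizer, so to replicate its exact results you need to add the following to your custom tfidf vectorizer implementation:
Sklearn's vocabulary, over which the idf values are computed, is sorted in alphabetical order.
Sklearn's formula for idf differs from the standard textbook formula. Here the constant "1" is added to both the numerator and the denominator of the idf, as if an extra document had been seen that contains every term in the collection exactly once, which prevents division by zero. IDF(t) = 1 + log_e((1 + total number of documents in the collection) / (1 + number of documents containing term t)).
Sklearn applies L2 normalization to its output matrix.
The final output of the sklearn tfidf vectorizer is a sparse matrix.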
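As a small worked example of the smoothed idf formula above, the following sketch computes IDF(t) for a few terms using only the standard library. The helper name `smooth_idf` is made up for illustration, and whitespace tokenization is an assumption (sklearn's own tokenizer is more involved).

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def smooth_idf(term, documents):
    """Smoothed idf: 1 + ln((1 + N) / (1 + df(term))). Hypothetical helper."""
    n_docs = len(documents)
    # Document frequency: how many documents contain the term at least once.
    df = sum(1 for doc in documents if term in doc.split())
    return 1 + math.log((1 + n_docs) / (1 + df))

for term in ('this', 'document', 'first', 'second'):
    print(term, round(smooth_idf(term, corpus), 4))
```

A term that appears in every document ('this') gets an idf of exactly 1.0, and the value grows as the term becomes rarer.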
I tried to implement this without using the library, but ran into an error that I could not debug.
Code:
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
def fit(dataset):
    unique_words = set()  # start with an empty set
    # check whether the input is a list
    if isinstance(dataset, (list)):
        for document in dataset:  # for each document in the dataset
            for word in document.split(" "):  # split() converts a string into a list of words
                if len(word) < 2:
                    continue
                unique_words.add(word)
        unique_words = sorted(list(unique_words))
        vocab = {j:i for i,j in enumerate(unique_words)}
        return vocab
    else:
        print("you need to pass a list of sentences")
vocab=fit(corpus)
print(vocab)
output:{'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}
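import math

def idf(unique_words):
    # returns a dict {word: smoothed idf}, computed over the global corpus
    idf_dict={}
    N=len(corpus)
    for i in unique_words:
        count=0  # number of documents that contain the word
        for row in corpus:
            if i in row.split():
                count+=1
        idf_dict[i]=float(1+math.log((N+1)/(count+1)))
    return idf_dict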
from collections import Counter
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

def transform(dataset,vocab):
    rows = []
    columns = []
    values = []
    if isinstance(dataset, (list,)):
        for idx, row in enumerate(dataset):  # for each document in the dataset
            # returns a dict where the key is the word and the value is its frequency {word:frequency}
            word_freq = dict(Counter(row.split()))
            # for every unique word in the document
            for word, freq in word_freq.items():
                if len(word) < 2:
                    continue
                # check whether the word is in the vocabulary built in fit()
                # dict.get() returns the value; if the key doesn't exist it returns -1
                col_index = vocab.get(word, -1)  # retrieving the dimension number of a word
                # if the word exists
                if col_index !=-1:
                    # store the index of the document
                    rows.append(idx)
                    # store the dimension of the word
                    columns.append(col_index)
                    td = freq/float(len(rows))  # the number of times a word occurred in a document
                    idf_ = 1+math.log((1+len(dataset))/float(1+idf(word)))
                    values.append((td) * (idf_))
        return normalize(csr_matrix( ((values),(row,columns)), shape=(len(dataset),len(vocab))),norm='l2' )
    else:
        print("you need to pass list of strings")
print(transform(corpus,vocab))
Error:
TypeError Traceback (most recent call last)
<ipython-input-20-8da73617fb69> in <module>()
----> 1 print(transform(corpus,vocab))
22 td = freq/float(len(rows)) # the number of times a word occured in a document
23 a=idf(word)
---> 24 idf_ = 1+math.log((1+len(dataset))/float(1+idf(word)))
25 values.append((td) * (idf_))
26
TypeError: unsupported operand type(s) for +: 'int' and 'dict_values'
Solution
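idf(word) -> dict
The function idf returns a dictionary, so its result cannot be added to an int. It loops over the words it is given and counts them against the global corpus, so call it once, early in the function, with the vocabulary, and then look up the word you need. The values in that dictionary are already the smoothed idf, so the formula does not need to be applied a second time.
tmp_dict = idf(vocab)
...
idf_ = tmp_dict[word]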
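Putting the pieces together, here is a minimal sketch of the whole pipeline using only the standard library. It is not the asker's exact code: this `fit` returns the idf values alongside the vocabulary, `transform` takes them as a third argument, and the result is a dense list of lists rather than a sparse matrix; all of these are choices made for illustration. It also sidesteps two other details in the question's `transform`: `freq/float(len(rows))` divides by the number of entries collected so far rather than by the document length, and the `csr_matrix` call passes `row` where `rows` is meant.

```python
import math
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def fit(dataset):
    """Return (vocab, idf_values): alphabetical vocabulary and smoothed idf per word."""
    tokenized = [doc.split() for doc in dataset]
    # Keep words of length >= 2, as in the question, sorted alphabetically.
    words = sorted({w for doc in tokenized for w in doc if len(w) >= 2})
    vocab = {word: index for index, word in enumerate(words)}
    n_docs = len(dataset)
    idf_values = {}
    for word in vocab:
        df = sum(1 for doc in tokenized if word in doc)  # document frequency
        idf_values[word] = 1 + math.log((1 + n_docs) / (1 + df))
    return vocab, idf_values

def transform(dataset, vocab, idf_values):
    """Return one dense, L2-normalized tf-idf row (list of floats) per document."""
    matrix = []
    for doc in dataset:
        row = [0.0] * len(vocab)
        for word, freq in Counter(doc.split()).items():
            if word in vocab:
                # Raw count as tf; dividing by the document length would
                # cancel out in the L2 normalization below.
                row[vocab[word]] = freq * idf_values[word]
        norm = math.sqrt(sum(v * v for v in row))
        if norm > 0:
            row = [v / norm for v in row]
        matrix.append(row)
    return matrix

vocab, idf_values = fit(corpus)
tfidf = transform(corpus, vocab, idf_values)
print([round(v, 4) for v in tfidf[0]])
```

For the first document this gives roughly 0.4698 for 'document', 0.5803 for 'first' and 0.3841 for each of 'is', 'the' and 'this', with zeros elsewhere. If a sparse result is needed, the non-zero entries could be collected as (row, column, value) triples instead of filling dense rows.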