How to implement TF-IDF without using a Python library?

Problem description

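Sklearn makes a few tweaks in its implementation of the TF-IDF vectorizer, so to replicate its exact results you need to add the following to your custom TF-IDF vectorizer implementation:

  1. Sklearn sorts its vocabulary, and therefore the idf values generated for it, in alphabetical order.

  2. Sklearn's formula for idf differs from the standard textbook formula. Here the constant "1" is added to the numerator and denominator of the idf, as if an extra document had been seen that contains every term in the collection exactly once, which prevents division by zero. IDF(t) = 1 + log_e((1 + total number of documents in the collection) / (1 + number of documents containing term t)).

  3. Sklearn applies L2 normalization to its output matrix.

  4. The final output of the sklearn TF-IDF vectorizer is a sparse matrix.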
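
As a quick check of points 2 and 3, the smoothed idf and the L2 normalization can be written with only the standard library. This is a minimal sketch; the function names smooth_idf and l2_normalize are illustrative and are not part of sklearn.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed idf: 1 + ln((1 + N) / (1 + df))."""
    return 1.0 + math.log((1 + n_docs) / (1 + doc_freq))

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)

# A term found in all 4 documents of a 4-document corpus gets idf 1.0,
# because ln(5 / 5) = 0.
print(smooth_idf(4, 4))
# A rarer term (2 of 4 documents) gets a larger weight: 1 + ln(5 / 3).
print(smooth_idf(4, 2))
# After normalization the squared entries sum to 1.
print(l2_normalize([3.0, 4.0]))
```

The "+1" outside the logarithm means a term that occurs in every document still keeps a non-zero weight instead of being dropped.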

I tried to implement this without using the library, but I ran into an error that I could not debug.

Code:

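import math
from collections import Counter

from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

corpus = [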
         'this is the first document',
         'this document is the second document',
         'and this is the third one',
         'is this the first document',
         ]
  

def fit(dataset):    
    unique_words = set() # start with an empty set
    # only a list of documents is accepted
    if isinstance(dataset, list):
        for document in dataset: # for each document in the dataset
            for word in document.split(" "): # split() turns the string into a list of words
                if len(word) < 2: # skip one-character tokens
                    continue
                unique_words.add(word)
        unique_words = sorted(list(unique_words))
        vocab = {j:i for i,j in enumerate(unique_words)}
        
        return vocab
    else:
        print("you need to pass a list of sentences")

vocab=fit(corpus)
print(vocab)
# output: {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}

def idf(unique_words):
    idf_dict={}
    N=len(corpus)
    for i in unique_words:
        count=0
        for row in corpus:
            if i in row.split():
                count+=1

        idf_dict[i]=float(1+math.log((N+1)/(count+1)))

    return idf_dict

def transform(dataset,vocab):
    rows = []
    columns = []
    values = []
    if isinstance(dataset, (list,)):
        for idx, row in enumerate(dataset): # for each document in the dataset
            # Counter gives a {word: frequency} mapping for this document
            word_freq = dict(Counter(row.split()))
            for word, freq in word_freq.items():  # for each unique word in the document
                if len(word) < 2:
                    continue
                # look the word up in the vocabulary built by fit();
                # dict.get() returns -1 here if the key doesn't exist
                col_index = vocab.get(word, -1) # column index of the word
                # if the word exists
                if col_index !=-1:
                    # we are storing the index of the document
                    rows.append(idx)
                    # we are storing the dimensions of the word
                    columns.append(col_index)
                    td = freq/float(len(rows)) # the number of times a word occured in a document
                    idf_ = 1+math.log((1+len(dataset))/float(1+idf(word)))
                    values.append((td) * (idf_))
                    
        return normalize(csr_matrix( ((values),(row,columns)), shape=(len(dataset),len(vocab))),norm='l2' )
    else:
        print("you need to pass list of strings")

print(transform(corpus,vocab))

Error:

 TypeError                                 Traceback (most recent call last)
    <ipython-input-20-8da73617fb69> in <module>()
    ----> 1 print(transform(corpus,vocab))
    
    
         22                     td = freq/float(len(rows)) # the number of times a word occured in a document
         23                     a=idf(word)
    ---> 24                     idf_ = 1+math.log((1+len(dataset))/float(1+idf(word)))
         25                     values.append((td) * (idf_))
         26 
    
    TypeError: unsupported operand type(s) for +: 'int' and 'dict_values'
     

Tags: python, python-3.x, dictionary, machine-learning, nlp

Solution


idf(word) -> dict

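The function idf returns a dictionary, not a single number, so 1 + idf(word) raises the TypeError. idf iterates over the whole vocabulary (and reads the global corpus), so call it once early in transform and then look up the word you need. The value stored for each word is already the smoothed idf, so it can be used directly.

tmp_dict = idf(vocab)

...
idf_ = tmp_dict[word]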
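
Putting the fix together, here is a minimal pure-Python sketch of the whole pipeline. It rests on assumptions that go beyond the answer above: the term frequency is the raw count (as in sklearn's default settings), the idf table is computed once during fitting, and the sparse result is returned as a list of {column: value} dicts rather than a scipy csr_matrix. The names tokenize, fit_idf, and transform_tfidf are illustrative.

```python
import math
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def tokenize(document):
    # Keep tokens of at least 2 characters, as in the question's fit().
    return [w for w in document.split() if len(w) >= 2]

def fit_idf(dataset):
    """Return (vocab, idf_table); vocab maps word -> column index, alphabetically."""
    doc_freq = Counter()
    for document in dataset:
        doc_freq.update(set(tokenize(document)))  # count each word once per document
    vocab = {word: i for i, word in enumerate(sorted(doc_freq))}
    n_docs = len(dataset)
    idf_table = {w: 1.0 + math.log((1 + n_docs) / (1 + df))
                 for w, df in doc_freq.items()}
    return vocab, idf_table

def transform_tfidf(dataset, vocab, idf_table):
    """Return one {column: tfidf} dict per document, L2-normalized."""
    rows = []
    for document in dataset:
        # Raw counts of in-vocabulary words (assumption: tf is the raw count).
        counts = Counter(w for w in tokenize(document) if w in vocab)
        weights = {vocab[w]: tf * idf_table[w] for w, tf in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        rows.append({c: v / norm for c, v in weights.items()} if norm else weights)
    return rows

vocab, idf_table = fit_idf(corpus)
tfidf = transform_tfidf(corpus, vocab, idf_table)
for col, value in sorted(tfidf[0].items()):
    print(col, round(value, 4))
```

For the first document the non-zero entries fall in the columns for 'document', 'first', 'is', 'the', and 'this'. Storing only those entries is the same idea as the sparse matrix in point 4, without depending on scipy.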
