How to use the WIDF algorithm to process a dataset document (CSV)

Problem description

I have a problem with my program. I built a system in Python that classifies documents (CSV) using the WIDF algorithm.

This is the WIDF algorithm:

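class WIdf:
    """Weighted Inverse Document Frequency (WIDF) for a single query term."""

    def __init__(self):
        self.total_tf = 0
        self.total_weight = 0
        self.document = []
        self.query = ''
        self.corpus = {}

    def transform(self, q, document):
        """Count how often the query term occurs in each document."""
        self.query = q
        self.document = document
        # Reset the state so that repeated calls do not accumulate counts.
        self.total_tf = 0
        self.corpus = {}
        for index, item in enumerate(self.document):
            # split() without arguments also copes with repeated whitespace.
            words = item.split()
            tf = 0
            for word in words:
                if self.query.lower() == word.lower():
                    tf += 1
            self.total_tf += tf
            self.corpus[index] = {"tf": tf}
        return self

    def weight(self):
        """Set weight = tf / total_tf for every document."""
        for key, value in self.corpus.items():
            # Guard against division by zero when the term never occurs.
            weight = value['tf'] / self.total_tf if self.total_tf else 0.0
            self.corpus[key]['weight'] = weight
            self.total_weight += weight

    def get_weight(self):
        self.total_weight = 0
        self.weight()
        return self.corpus

    def weight_average(self):  # my own addition
        self.total_weight = 0
        self.weight()
        # Avoid dividing by zero for an empty document list.
        return self.total_weight / len(self.document) if self.document else 0.0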
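
To make the weighting concrete, here is a minimal standalone sketch of the same calculation on a tiny made-up corpus. The function name `widf_weights` and the three sample documents are assumptions for illustration only; the formula is the one the class uses, weight = tf / total_tf, with tokens obtained by splitting on whitespace and compared case-insensitively.

```python
from collections import Counter


def widf_weights(term, documents):
    """Return one WIDF weight per document for `term` (hypothetical helper).

    weight(d) = tf(d, term) / sum of tf(i, term) over all documents i
    """
    needle = term.lower()
    # Term frequency: exact, case-insensitive matches among whitespace tokens.
    tfs = [Counter(doc.lower().split())[needle] for doc in documents]
    total_tf = sum(tfs)
    if total_tf == 0:
        # The term never occurs, so every weight is set to 0 here.
        return [0.0 for _ in documents]
    return [tf / total_tf for tf in tfs]


# Made-up sample documents, not the real dataset.
docs = ["cintaNN hatiNN cintaNN", "hatiNN akuVB", "cintaNN akuVB"]
weights = widf_weights("cintaNN", docs)
print(weights)
```

The term counts here are 2, 0 and 1, so the weights come out as 2/3, 0 and 1/3. Whenever the term occurs at least once the weights sum to 1, which means the average weight is simply 1 divided by the number of documents.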

This is the program that processes the text dataset:

import pprint
from widf import WIdf

print("1")
texts = ['hatiNN buahNN anugerahNN cintaNN buahNN deritaVB pendamNN hasratNN cobaVB kenalVB bedaJJ takNEG kanMD mungkinMD satuCD jauhJJ dasarNN hatiNN semuaCD sulitJJ akhirNN cintaNN takNEG mampuJJ rubahNN sifatNN bosanNN sikapNN slaluNN abaiNN semuaCD buatIN diriNN cintaNN takNEG kanMD akhirNN hubungNN cintaNN sangatRB untungNN hidupNN',
        'akuVB takNEG mampuJJ sakitNN akuVB takNEG sanggupNN akuVB takNEG mampuJJ sakitNN akuVB takNEG sanggupNN takNEG mungkinMD cintaNN hatiNN tlahNN milikNN takNEG mungkinMD milikNN sepenuhJJ hatiNN akuVB setiaJJ akuVB hargaNN tulusJJ cintaNN milikNN takNEG mungkinMD cintaNN hatiNN tlahNN milikNN takNEG mungkinMD milikNN sepenuhJJ hatiNN akuVB setiaJJ takNEG mungkinMD cintaNN hatiNN tlahNN milikNN takNEG mungkinMD milikNN sepenuhJJ hatiNN akuVB setiaJJ akuVB setiaJJ',
]  # this is the dataset, which I want to convert to a document (CSV)

q = 'cintaNN'  # the word whose weight is looked up

print('')
print('Pembobotan W-IDF')
widf = WIdf().transform(q=q, document=texts)
print("Bobot rata-rata: " + str(widf.weight_average()))
pprint.pprint(widf.get_weight())
print("+---------------------------------+")
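# NOTE: `tfidf` and `model` are never defined in this snippet, so the lines
# below raise a NameError as posted. They are left commented out until a
# fitted vectorizer and classifier are available.
# text_features = tfidf.transform(texts)
# predictions = model.predict(text_features)
# for text, predicted in zip(texts, predictions):
#     print('"{}"'.format(text))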

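The program looks up the weight of a word in the sentences of a dataset. What I want to do here is take the dataset, which is currently hard-coded as text, and convert and process it as a document (CSV).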

Tags: python, machine-learning, text-processing

Solution
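
One possible approach, offered as a sketch rather than a definitive answer, is to store each text as one row of a CSV file with the standard `csv` module and then read the rows back as the document list. The file name `dataset.csv`, the column name `text`, and the helper names below are assumptions, not taken from the question. The weighting is re-implemented inline so the example runs on its own; it returns the same `{"tf": ..., "weight": ...}` shape that `get_weight()` produces.

```python
import csv
import os
import tempfile


def save_texts_to_csv(path, texts):
    """Write one document per row under a single 'text' column (assumed layout)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text"])
        for text in texts:
            writer.writerow([text])


def load_texts_from_csv(path, column="text"):
    """Read the documents back as a list of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[column] for row in csv.DictReader(handle)]


def widf_from_csv(path, query, column="text"):
    """Compute tf and WIDF weight per CSV row for a single query term."""
    documents = load_texts_from_csv(path, column)
    needle = query.lower()
    tfs = [sum(1 for w in doc.split() if w.lower() == needle) for doc in documents]
    total_tf = sum(tfs)
    # A weight of 0 is used when the term never occurs, to avoid dividing by zero.
    return {
        index: {"tf": tf, "weight": tf / total_tf if total_tf else 0.0}
        for index, tf in enumerate(tfs)
    }


# Made-up sample rows; replace them with the real `texts` list.
sample = ["cintaNN hatiNN cintaNN", "hatiNN akuVB cintaNN"]
csv_path = os.path.join(tempfile.mkdtemp(), "dataset.csv")  # hypothetical file name
save_texts_to_csv(csv_path, sample)
result = widf_from_csv(csv_path, "cintaNN")
print(result)
```

If the existing class is preferred, the list returned by `load_texts_from_csv` can be passed straight to `WIdf().transform(q=q, document=...)`, since that method only expects a list of strings.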

