Bag-of-words model output doesn't make sense

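Problem description

I built a bag-of-words model, and when I print it, the output doesn't make much sense. This is the code I use to initialize the bag of words: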

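from sklearn.feature_extraction.text import CountVectorizer

# creating the bag of words model
headline_bow = CountVectorizer()
headline_bow.fit(x)
a = headline_bow.transform(x)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
b = headline_bow.get_feature_names_out()
print(a)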

This is a sample of the output from the bag-of-words model:

  (0, 837)  1
  (0, 1496) 1
  (0, 1952) 1
  (0, 2610) 1 

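As I understand it, "(0, 837) 1" means that in the first list passed to the model, the 837th word of that list appears once. That doesn't make sense, because when I print x[0], I get this: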

Four ways Bob Corker skewered Donald Trump

There clearly aren't 837 words here, so I'm confused about what is happening.

This is a sample of what x looks like (a bunch of news headlines):

['Four ways Bob Corker skewered Donald Trump'
 "Linklater's war veteran comedy speaks to modern America, says star"
 'Trump’s Fight With Corker Jeopardizes His Legislative Agenda' ...
 'Ron Paul on Trump, Anarchism & the AltRight'
 'China to accept overseas trial data in bid to speed up drug approvals'
 'Vice President Mike Pence Leaves NFL Game Because of Anti-American Protests']

This is the rest of my code:

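import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

data = pd.read_csv("/Users/amanpuranik/Desktop/fake-news-detection/data.csv")
data = data[["Headline", "Label"]]

x = np.array(data["Headline"])
print(x[0])
y = np.array(data["Label"])

# tokenization of the data here
headline_vector = []

for headline in x:
    headline_vector.append(word_tokenize(headline))

print(headline_vector)

# use a separate name so the nltk stopwords module is not shadowed
stop_words = set(stopwords.words("english"))

# removing stopwords at this part
# note: the nltk stopword list is lowercase, so capitalized words such as "The" pass through
filtered = [[word for word in sentence if word not in stop_words]
            for sentence in headline_vector]
# print(filtered)

# the snippet does not show where stem() comes from;
# NLTK's PorterStemmer is assumed here as a stand-in
stem = PorterStemmer().stem
stemmed2 = [[stem(word) for word in headline] for headline in filtered]
# print(stemmed2)

# lowercase
lower = [[word.lower() for word in headline] for headline in stemmed2]  # start here

# convert lower into a list of strings
lower_sentences = [" ".join(words) for words in lower]

# organising
articles = []

for headline in lower:
    articles.append(headline)

# creating the bag of words model
headline_bow = CountVectorizer()
headline_bow.fit(lower_sentences)
a = headline_bow.transform(lower_sentences)
print(a)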

Tags: python, machine-learning, model, nlp

Solution
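
The printed output is the sparse-matrix form of the document-term matrix, so each entry reads as (row, column) count. The row is the headline, and the column is an index into the vocabulary that CountVectorizer learned from all the headlines together, not a position inside that one headline. So "(0, 837) 1" would mean that vocabulary word number 837 occurs once in headline 0. Below is a minimal sketch of how the column indices can be mapped back to words, assuming only three of the sample headlines from the question; the names `headlines`, `index_to_word`, and `decoded` are illustrative and not part of the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three headlines taken from the sample data in the question.
headlines = [
    "Four ways Bob Corker skewered Donald Trump",
    "Ron Paul on Trump, Anarchism & the AltRight",
    "China to accept overseas trial data in bid to speed up drug approvals",
]

vectorizer = CountVectorizer()
# Sparse matrix: one row per headline, one column per vocabulary word.
matrix = vectorizer.fit_transform(headlines)

# vocabulary_ maps each word to its column index; invert it for lookups by column.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# Decode the first row: which vocabulary words occur in headline 0, and how often.
first_row = matrix[0].toarray().ravel()
decoded = {
    index_to_word[column]: int(count)
    for column, count in enumerate(first_row)
    if count > 0
}

print(matrix.shape)  # (number of headlines, vocabulary size)
print(decoded)
```

With the full dataset the vocabulary has thousands of entries, which is why column indices such as 837 or 2610 show up even for a seven-word headline.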

