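python - Bag-of-words model doesn't make sense
Problem description
I built a bag-of-words model, and when I print it the output doesn't make much sense to me. This is the code I used to initialize the bag of words: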
#creating the bag of words model
headline_bow = CountVectorizer()
headline_bow.fit(x)
a = headline_bow.transform(x)
b = headline_bow.get_feature_names()
print(a)
Here is a sample of the output from the bag-of-words model:
(0, 837) 1
(0, 1496) 1
(0, 1952) 1
(0, 2610) 1
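As I understand it, "(0, 837) 1" means that in the first list passed to the model, the 837th word in that list appears once. That doesn't make sense, because when I print x[0] I get this: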
Four ways Bob Corker skewered Donald Trump
There are clearly not 837 words here, so I'm confused about what is going on.
Here is an example of what x is (a bunch of news headlines):
['Four ways Bob Corker skewered Donald Trump'
"Linklater's war veteran comedy speaks to modern America, says star"
'Trump’s Fight With Corker Jeopardizes His Legislative Agenda' ...
'Ron Paul on Trump, Anarchism & the AltRight'
'China to accept overseas trial data in bid to speed up drug approvals'
'Vice President Mike Pence Leaves NFL Game Because of Anti-American Protests']
Here is the rest of my code:
data = pd.read_csv("/Users/amanpuranik/Desktop/fake-news-detection/data.csv")
data = data[['Headline', "Label"]]
x = np.array(data['Headline'])
print(x[0])
y = np.array(data["Label"])
# tokenization of the data here
headline_vector = []
for headline in x:
    headline_vector.append(word_tokenize(headline))
print(headline_vector)
stopwords = set(stopwords.words('english'))
# removing stopwords at this part
filtered = [[word for word in sentence if word not in stopwords]
            for sentence in headline_vector]
#print(filtered)
stemmed2 = [[stem(word) for word in headline] for headline in filtered]
#print(stemmed2)
# lowercase
lower = [[word.lower() for word in headline] for headline in stemmed2] #start here
# convert lower into a list of strings
lower_sentences = [" ".join(x) for x in lower]
# organising
articles = []
for headline in lower:
    articles.append(headline)
#creating the bag of words model
headline_bow = CountVectorizer()
headline_bow.fit(lower_sentences)
a = headline_bow.transform(lower_sentences)
print(a)
Solution
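The printed lines are the text form of a SciPy sparse matrix: each line reads (row, column) value. In "(0, 837) 1", row 0 is the first headline, the value 1 is how many times a word occurs in it, and 837 is that word's column in the vocabulary learned from all the headlines passed to fit, not its position inside the headline. That is why the column number can be far larger than the number of words in x[0]. With the fitted vectorizer from the question, headline_bow.vocabulary_ should give the word-to-column mapping, and headline_bow.get_feature_names_out() (scikit-learn 1.0 and later; older releases used get_feature_names()) the words ordered by column.

The following is a minimal standard-library sketch of the same idea, not scikit-learn's actual implementation. The tokenize helper and the three-headline corpus are assumptions made for illustration; they only approximate the default CountVectorizer behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary).

```python
import re
from collections import Counter

# Three headlines from the question, used as a tiny stand-in corpus.
headlines = [
    "Four ways Bob Corker skewered Donald Trump",
    "Ron Paul on Trump, Anarchism & the AltRight",
    "China to accept overseas trial data in bid to speed up drug approvals",
]


def tokenize(text):
    # Hypothetical helper: lowercase, keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# One vocabulary for the WHOLE corpus: each distinct word gets a column index.
all_words = sorted({word for h in headlines for word in tokenize(h)})
vocabulary = {word: col for col, word in enumerate(all_words)}
index_to_word = {col: word for word, col in vocabulary.items()}

# Sparse representation: (row, column) -> count, like the printed output.
entries = {}
for row, headline in enumerate(headlines):
    for word, count in Counter(tokenize(headline)).items():
        entries[(row, vocabulary[word])] = count

# Print each entry together with the word its column stands for.
for (row, col), count in sorted(entries.items()):
    print((row, col), count, "->", index_to_word[col])
```

In this sketch the first headline has only seven words, yet the column for "trump" is larger than seven, because columns are numbered across every word in the corpus. The same lookup applied to the real vectorizer would show which word column 837 stands for.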