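text - word2vec: how do I classify text with an SVM?
Problem description
I have a CSV file with two columns: class and text_data. I first extract bigrams and trigrams, then try to classify the data with an SVM. It fails with "TypeError: sequence item 0: expected a bytes-like object, str found". I am using Gensim 4.0.0. Any help is much appreciated. Code: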
# all packages imported
df_covid = pd.read_csv('allCurated853_4.csv', encoding="utf8")
df_covid['label'] = df_covid['class'].map({
    'covidshutdown': 0,
    'manufactoring': 1,
    'corporate': 2,
    'environmental': 3,
    'infrastructure': 4,
    'other': 5
})
X = df_covid['text_data']
y = df_covid['label']
corpus = X
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1])
                 for i in range(0, len(lst_words), 1)]
    lst_corpus.append(lst_grams)
bigrams_detector = gensim.models.phrases.Phrases(lst_corpus,
    delimiter=" ".encode(), min_count=5, threshold=10)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus],
    delimiter=" ".encode(), min_count=5, threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)
cv = gensim.models.word2vec.Word2Vec(lst_corpus, size=300, window=8,
    min_count=1, sg=1, iter=30)
X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(cv, y,
    test_size=0.20, random_state=42)
clf = svm.SVC(kernel='linear').fit(X_trainCv, y_trainCv)
y_pred = clf.predict(X_testCv)
# classification_report expects the true labels first, then the predictions
print(classification_report(y_testCv, y_pred))
The dataset is available for download here.
If I drop the bigrams and trigrams and change the Word2Vec model to this:
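The TypeError in the first attempt is consistent with a type mismatch when tokens are joined: the corpus tokens are `str`, while `" ".encode()` produces a `bytes` delimiter. As far as I know, Gensim 4 expects the `Phrases` delimiter to be a `str`, so passing `delimiter=" "` is the likely fix. The plain-Python sketch below reproduces the same message without Gensim; `join_phrase` is a hypothetical helper for illustration, not part of Gensim.

```python
def join_phrase(tokens, delimiter):
    """Join the tokens of a detected phrase with the given delimiter."""
    return delimiter.join(tokens)


tokens = ["supply", "chain"]

# A str delimiter matches the str tokens, so the join succeeds.
joined = join_phrase(tokens, " ")

# A bytes delimiter (what " ".encode() returns) cannot join str tokens.
try:
    join_phrase(tokens, " ".encode())
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(joined)
print(error_message)
```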
cv=gensim.models.Word2Vec(lst_corpus,vector_size=100,window=5,min_count=5,workers=4)
A new error message appears:
Traceback (most recent call last):
  File "D:\Dropbox\AAA\50-2\word2Vec_trails\word2Vec_triGrams_34445.py", line 65, in <module>
    X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(cv, y, test_size=0.20, random_state=42)
  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2172, in train_test_split
    arrays = indexable(*arrays)
  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 299, in indexable
    check_consistent_length(*result)
  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in check_consistent_length
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in <listcomp>
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 202, in _num_samples
    raise TypeError("Singleton array %r cannot be considered"
TypeError: Singleton array array(<gensim.models.word2vec.Word2Vec object at 0x000001A41DF59820>,
      dtype=object) cannot be considered a valid collection.
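This traceback arises because `train_test_split` receives the `Word2Vec` model object itself, which is not a collection of samples. A classifier needs one fixed-length feature vector per document. One common option (an assumption here, not something the question prescribes) is to average the word vectors of each document. The sketch below uses a toy dictionary, `toy_vectors`, as a stand-in for the trained model's word-vector lookup; `document_vector` and all values are hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained model.
toy_vectors = {
    "factory": np.array([1.0, 0.0, 0.0]),
    "closed": np.array([0.0, 1.0, 0.0]),
    "road": np.array([0.0, 0.0, 1.0]),
}
VECTOR_SIZE = 3


def document_vector(tokens, vectors, size):
    """Average the vectors of the known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)


docs = [["factory", "closed"], ["road", "closed", "today"], ["unknown"]]

# One row per document, so the matrix lines up with one label per document.
X = np.vstack([document_vector(d, toy_vectors, VECTOR_SIZE) for d in docs])
print(X.shape)
```

A matrix built this way has as many rows as there are documents, which is what `train_test_split` checks against the length of `y`.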
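I made the attempt below to use the word2vec vectors the way one would use a count vectorizer or TF-IDF. It produces output, but not the correct output. I think I need to build a list of vectors. Any help is appreciated. Here is the code: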
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df_covid = pd.read_csv('allCurated853_4.csv', encoding="utf8")
df_covid['label'] = df_covid['class'].map({'covidshutdowncsv': 0, 'manufactoringcsv': 1, 'corporatecsv': 2, 'environmentalcsv': 3, 'infrastructurecsv': 4, 'other': 5})
X = df_covid['message']
y = df_covid['label']
X = X.to_string()
ls = []
rows = X.splitlines(True)
print('size of rows:', len(rows))  # size of rows: 852
for i in rows:
    ls.append(i.split(' '))
print('total words:', len(ls))  # total words: 852
# Gensim 3.x API: `size`, `wv.vocab`, and `model[word]` were renamed or removed in Gensim 4
model = Word2Vec(ls, min_count=1, size=4)
words = list(model.wv.vocab)
print('words in vocabulary:', len(words))  # words in vocabulary: 3110
print(words)
words = words[0:852]  # problem
vectors = []
for word in words:
    vectors.append(model[word].tolist())
data = np.array(vectors)
print('vectors of words:', len(data))  # vectors of words: 852
X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(data, y, test_size=0.20, random_state=42)
clf_covid = svm.SVC(kernel='linear').fit(X_trainCv, y_trainCv)
clf_covid.score(X_testCv, y_testCv)
# score: 0.50299
Solution