word2vec: how to classify text with an SVM?

Problem description

I have a CSV file with two columns: class and text_data. I first extract bigrams and trigrams and then try to classify my data with an SVM, but it raises "TypeError: sequence item 0: expected a bytes-like object, found str". I am using Gensim 4.0.0. Any help is much appreciated. Code:

# all packages imported
df_covid = pd.read_csv('allCurated853_4.csv', encoding="utf8")
df_covid['label'] = df_covid['class'].map({
    'covidshutdown': 0,
    'manufactoring':1,
    'corporate':2,
    'environmental':3,
    'infrastructure':4,
    'other':5
})
X = df_covid['text_data']
y = df_covid['label']
corpus=X
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1])
                for i in range(0, len(lst_words), 1)]
    lst_corpus.append(lst_grams)

bigrams_detector = gensim.models.phrases.Phrases(lst_corpus,
            delimiter=" ".encode(), min_count=5, threshold=10)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus],
            delimiter=" ".encode(), min_count=5, threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)

cv = gensim.models.word2vec.Word2Vec(lst_corpus, size=300, window=8,
            min_count=1, sg=1, iter=30)

X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(cv, y, 
            test_size=0.20, random_state=42)
clf = svm.SVC(kernel='linear').fit(X_trainCv,y_trainCv)
y_pred = clf.predict(X_testCv)
print(classification_report(X_testCv, y_pred))

Full error message: [screenshot in the original post]

Here is the dataset: [download link in the original post]
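One likely source of the TypeError above: in Gensim 4.x, Phrases expects delimiter as a plain str (default "_"), not bytes, so passing " ".encode() mixes bytes and str when phrases are joined. Gensim 4.x also renamed the Word2Vec parameters size to vector_size and iter to epochs. A minimal sketch of the same steps under that assumption (lst_corpus is the token list built above):

import gensim

# Gensim 4.x: delimiter is a str, no .encode() needed
bigrams_detector = gensim.models.phrases.Phrases(lst_corpus,
            delimiter=" ", min_count=5, threshold=10)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus],
            delimiter=" ", min_count=5, threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)

# Gensim 4.x: size -> vector_size, iter -> epochs
cv = gensim.models.word2vec.Word2Vec(lst_corpus, vector_size=300, window=8,
            min_count=1, sg=1, epochs=30)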

If I do not use bigrams and trigrams and change the word2vec model to this:

cv=gensim.models.Word2Vec(lst_corpus,vector_size=100,window=5,min_count=5,workers=4)

a new error message appears:

Traceback (most recent call last):

File "D:\Dropbox\AAA\50-2\word2Vec_trails\word2Vec_triGrams_34445.py", line 65, in <module>
X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(cv, y, test_size=0.20, random_state=42)

File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2172, in train_test_split
arrays = indexable(*arrays)

File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 299, in indexable
check_consistent_length(*result)

File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in check_consistent_length
lengths = [_num_samples(X) for X in arrays if X is not None]

File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in <listcomp>
lengths = [_num_samples(X) for X in arrays if X is not None]

  File "C:\Users\makta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 202, in _num_samples
raise TypeError("Singleton array %r cannot be considered"

TypeError: Singleton array array(<gensim.models.word2vec.Word2Vec object at 0x000001A41DF59820>,
  dtype=object) cannot be considered a valid collection.
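The traceback shows the underlying problem: the Word2Vec model object itself (cv) is passed to train_test_split, which expects indexable collections with one row per sample and the same length as y. One common workaround is to represent every document by the average of its word vectors, so X becomes an (n_documents, n_features) matrix. A minimal sketch under that assumption (document_vector is a hypothetical helper; cv and lst_corpus are from the code above, assuming Gensim 4.x):

import numpy as np

def document_vector(tokens, model):
    # average the vectors of the words the model knows; zeros if none are known
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        return np.zeros(model.wv.vector_size)
    return np.mean(vecs, axis=0)

X_feats = np.vstack([document_vector(doc, cv) for doc in lst_corpus])
print(X_feats.shape)   # (number_of_documents, vector_size), same length as y

X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(
            X_feats, y, test_size=0.20, random_state=42)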

I made the attempt below to use the word2vec vectors the way one would use a count vectorizer or TF-IDF. It produces output, but not the correct output. I think I need to build a list of vectors. Help is appreciated. Here is the code:

from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
df = pd.read_csv('allCurated853_4.csv', encoding="utf8")
df['label'] = df['class'].map({'covidshutdowncsv': 0,  'manufactoringcsv':1, 'corporatecsv':2, 'environmentalcsv':3,'infrastructurecsv':4, 'other':5})
X = df['message']
y = df['label']
X = X.to_string()
ls = []
rows = X.splitlines(True)
print('size of rows:', len(rows))   # size of rows: 852
for i in rows:
    ls.append(i.split(' '))
print('total words:', len(ls))    # total words: 852
model = Word2Vec(ls, min_count=1, size = 4)
words = list(model.wv.vocab)
print('words in vocabolary :',len(words)) # words in vocabolary:3110
print(words)
words=words[0:852]    # problem
vectors = []
for word in words:
    vectors.append(model[word].tolist())
data = np.array(vectors)    
print('vectors of words:', len(data))   # vectors of words: 852     
X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(data, y, test_size=0.20, random_state=42)
clf_covid = svm.SVC(kernel='linear').fit(X_trainCv,y_trainCv)
clf_covid.score(X_testCv,y_testCv)
# score: 0.50299
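The "# problem" comment above marks the core issue: words[0:852] takes the first 852 vocabulary words, so each row of data is the vector of a single word, not of a document, and the features are not aligned with the labels in y. A minimal sketch of the fix, reusing the averaging idea from the earlier sketch on the tokenised documents in ls (names are from the snippet above; model.wv works in both Gensim 3.x and 4.x):

import numpy as np

# one averaged vector per document, so row i of data matches label i of y
data = np.vstack([
    np.mean([model.wv[w] for w in doc if w in model.wv], axis=0)
    if any(w in model.wv for w in doc) else np.zeros(model.wv.vector_size)
    for doc in ls
])
X_trainCv, X_testCv, y_trainCv, y_testCv = train_test_split(
            data, y, test_size=0.20, random_state=42)
clf_covid = svm.SVC(kernel='linear').fit(X_trainCv, y_trainCv)
print(classification_report(y_testCv, clf_covid.predict(X_testCv)))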

Tags: text-classification, svm, word2vec, word-embedding

Solution

