python - Label not x is present in all training examples
Problem description
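Hello, I have run into a problem while trying to predict tags/labels in my project. I am following a tutorial (with my own data) to predict the complaint types in a complaints register, where one complaint can have several types (e.g. 1 complaint --> many types: Warranty, Refund, Airconditioning).
DF -> number of columns -> 4 (original), 2 (cleaned: genre_new and clean_plot). Column names -> ID, Plot, Title, Genre, genre_new, clean_plot
I used this tutorial: https://www.analyticsvidhya.com/blog/2019/04/predicting-movie-genres-nlp-multi-label-classification/ . It predicts genres for movies, where one movie can have several genres.
I also found a solution for the warning UserWarning: Label not :NUMBER: is present in all training examples
Issue: the problem may be that some labels appear in only a few documents. When you split the dataset into training and test sets to validate your model, some labels can end up missing from the training data.
Error: the label warning, and 0 predictions
However, I am not sure how to write this workaround to fit my code, as I am not a coder. Please help.
Please refer to my Google Drive link: https://drive.google.com/drive/folders/10yLOVWZPgl1shVwwM5qDy7iyMCm7cS9A?usp=sharing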
Solution
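Before the full script, it may help to confirm which labels are actually missing from the training split. As far as I understand scikit-learn's OneVsRestClassifier, "Label not N is present in all training examples" means that column N of y_train is all zeros, so the classifier for that label never sees a positive example. The sketch below uses made-up tags instead of the spreadsheet; the helper labels_missing_from_training and the *_demo names are hypothetical and not part of the original code.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer


def labels_missing_from_training(y_train, classes):
    """Return the label names whose column in y_train contains only zeros."""
    counts = np.asarray(y_train).sum(axis=0)
    return [label for label, n in zip(classes, counts) if n == 0]


# Toy data (assumption): "Refund" appears in a single row only.
tags = [
    ["Warranty"],
    ["Warranty", "Airconditioning"],
    ["Airconditioning"],
    ["Refund"],
]
mlb_demo = MultiLabelBinarizer()
y_demo = mlb_demo.fit_transform(tags)

# Pretend the random split sent the last row to the test set.
y_train_demo = y_demo[:3]

missing = labels_missing_from_training(y_train_demo, mlb_demo.classes_)
print(missing)
```

Any label reported here either needs more examples, or has to be dropped or merged before fitting; otherwise the model can only ever predict 0 for it.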
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.model_selection import train_test_split
mlb = MultiLabelBinarizer()
vect = CountVectorizer()
tfidf = TfidfTransformer()
lr = LogisticRegression()
clf = OneVsRestClassifier(lr)
df = pd.read_excel("Building Compliants in 2018 for training(1).xls")
df['Genre'] = df['Genre'].apply(lambda x: x.split(','))
y = mlb.fit_transform(df['Genre'])
train_data_vect = vect.fit_transform(df['Plot'])
train_data_tfidf = tfidf.fit_transform(train_data_vect)
x_train, x_test, y_train, y_test = train_test_split(train_data_tfidf, y, test_size=0.25)
clf.fit(x_train, y_train)  # train the model on the training data
print(clf.score(x_test, y_test))  # check the score on the test data
# output:
# 0.3333333333333333
# now predict, taking the first element of the Plot column
text = df['Plot'][0]
vect_transform = vect.transform([text])
tfidf_transform = tfidf.transform(vect_transform)
clf.predict(tfidf_transform)
#array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]])
mlb.inverse_transform(clf.predict(tfidf_transform))
# output:
# [(' Warranty', 'Airconditioning')]
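A note on the 0.33 score printed above: for multi-label targets, clf.score reports subset accuracy, so a row only counts as correct when every label in it matches. A per-label metric such as micro-averaged F1 is often easier to read for this kind of problem. The arrays below are made-up values used only to show the difference; they are not results from the complaint data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical indicator matrices: 3 samples, 3 labels.
y_true_demo = np.array([[1, 0, 1],
                        [0, 1, 0],
                        [1, 1, 0]])
y_pred_demo = np.array([[1, 0, 0],   # one label missed in this row
                        [0, 1, 0],
                        [1, 1, 0]])

# Exact-match (subset) accuracy: the first row counts as fully wrong.
subset_acc = accuracy_score(y_true_demo, y_pred_demo)

# Micro-averaged F1 pools true/false positives over all labels.
micro_f1 = f1_score(y_true_demo, y_pred_demo, average="micro")

print(subset_acc, micro_f1)
```

With the real data this would be f1_score(y_test, clf.predict(x_test), average="micro").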
def infer_tags(q):
    q = clean_text(q)
    q = remove_stopwords(q)
    q_vec = tfidf.transform([q])
    q_pred = clf.predict(q_vec)
    #print(q)
    # note: inverse_transform has to be called on the fitted instance (mlb), not on the class
    return MultiLabelBinarizer.inverse_transform(q_pred)
for i in range(100):
    # fails: x_test is a SciPy sparse matrix, which has no .sample() (see the traceback below)
    k = x_test.sample(i).index[2]
    #print("Trader: ", Tag['Title'][k])
    print("Trader: ", Tag['Title'][k], "\nPredicted genre: ",infer_tags(x_test[k]))
    print("Actual genre: ",Tag['Genre'][k], "\n")
# output:
Traceback (most recent call last):
  File "<ipython-input-70-28cc8e8a7204>", line 11, in <module>
    k = x_test.sample(i).index[2]
  File "C:\Users\LAUJ3\Documents\Python Project\env\lib\site-packages\scipy\sparse\base.py", line 688, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: sample not found
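The traceback comes from calling .sample() on x_test. After train_test_split on the TF-IDF output, x_test is a SciPy sparse matrix, not a pandas object, so it has no .sample() and no raw text to feed back into infer_tags. One way around this is to split row positions instead of the matrix, so the original text stays reachable. The sketch below is a minimal version of that idea on made-up rows (df_demo and its contents are assumptions). It leaves out clean_text and remove_stopwords, which come from the tutorial, and fits the vectorizer on the training rows only, which differs from the script above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows standing in for the complaint spreadsheet.
df_demo = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Plot": [
        "aircon leaking water warranty expired",
        "request refund for faulty aircon",
        "warranty claim for cracked wall",
        "refund deposit not returned",
        "aircon not cold",
        "warranty repair delayed",
        "refund and warranty dispute",
        "aircon noisy refund requested",
    ],
    "Genre": [
        "Airconditioning, Warranty", "Refund, Airconditioning", "Warranty",
        "Refund", "Airconditioning", "Warranty",
        "Refund, Warranty", "Airconditioning, Refund",
    ],
})
# strip() avoids labels such as ' Warranty' with a leading space.
df_demo["Genre"] = df_demo["Genre"].apply(lambda s: [t.strip() for t in s.split(",")])

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df_demo["Genre"])

# Split row positions rather than the sparse matrix.
train_idx, test_idx = train_test_split(
    np.arange(len(df_demo)), test_size=0.25, random_state=0
)

vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = OneVsRestClassifier(LogisticRegression())

x_train = tfidf.fit_transform(vect.fit_transform(df_demo["Plot"].iloc[train_idx]))
clf.fit(x_train, y[train_idx])


def infer_tags(text):
    """Vectorize raw text and map the prediction back to label names."""
    features = tfidf.transform(vect.transform([text]))
    # Use the fitted mlb instance, not the MultiLabelBinarizer class.
    return mlb.inverse_transform(clf.predict(features))


for k in test_idx:
    row = df_demo.iloc[k]
    print("Title:", row["Title"])
    print("Predicted genre:", infer_tags(row["Plot"]))
    print("Actual genre:", row["Genre"], "\n")
```

On the real data, the same pattern would apply with df in place of df_demo. An empty tuple in the prediction means no label passed the classifier's threshold for that row.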