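python - Feature extraction from multiple text columns for a classification problem
Question
What is the correct way to extract features from multiple text columns and then apply a classification algorithm to them? Please tell me if I am doing something wrong.
Example dataset
Independent variables: Description1, Description2, State, NumericCol1, NumericCol2
Dependent variable: TargetCategory
Code: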
########### Feature Extraction for Text Data #####################
from sklearn.feature_extraction.text import TfidfVectorizer

######### Description1 (any text representation could be used here: CountVectorizer, TF-IDF, word2vec, BERT, etc.)
tfidf = TfidfVectorizer(max_features=500,
                        ngram_range=(1, 3),
                        stop_words="english")
X_Description1 = tfidf.fit_transform(df["Description1"].tolist())
######### Description2 (any text representation could be used here: CountVectorizer, TF-IDF, word2vec, BERT, etc.)
# A second, separately fitted vectorizer, so each text column gets its own vocabulary
tfidf = TfidfVectorizer(max_features=500,
                        ngram_range=(1, 3),
                        stop_words="english")
X_Description2 = tfidf.fit_transform(df["Description2"].tolist())
######### State (has 100 unique values, which is why BinaryEncoder is used)
import category_encoders as ce
binary_encoder = ce.BinaryEncoder(cols=['state'], return_df=True)
X_state = binary_encoder.fit_transform(df["state"])
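######### Combine all feature blocks column-wise into one sparse matrix
from scipy import sparse
X = sparse.hstack((X_Description1,
                   X_Description2,
                   X_state.to_numpy(),  # BinaryEncoder returned a DataFrame
                   df[["NumericCol1", "NumericCol2"]].to_numpy())).tocsr()
y = df['TargetCategory']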
##### Train/Test Split ########
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=111)
##### Create Model ######
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Baseline random-forest model
# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
rfc = RandomForestClassifier(criterion='gini', n_estimators=1000, verbose=1, n_jobs=-1,
                             class_weight='balanced', max_features='sqrt')
rfcg = rfc.fit(X_train, y_train)  # fit on training data; fit() returns the estimator itself
####### Prediction ##########
predictions = rfcg.predict(X_test)
print('Baseline: Accuracy: ', round(accuracy_score(y_test, predictions)*100, 2))
print('\n Classification Report:\n', classification_report(y_test,predictions))
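Answer
The way to use multiple columns as input in scikit-learn is ColumnTransformer, which applies a separate transformer to each column (or group of columns) and concatenates the results into a single feature matrix.
The scikit-learn documentation includes an example of using it with heterogeneous data sources.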
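As a rough illustration of how the question's code might be restructured around ColumnTransformer, here is a minimal sketch. Several things in it are assumptions rather than part of the original post: the toy DataFrame is invented so the example runs on its own, OneHotEncoder stands in for the question's BinaryEncoder to avoid the extra category_encoders dependency, and n_estimators is reduced to keep it fast. Wrapping everything in a Pipeline means the vectorizers are fitted on the training split only, unlike the question's code, which fits them on the full dataset before splitting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data invented for illustration only; column names follow the question's code.
df = pd.DataFrame({
    "Description1": ["red apple pie", "green apple tart",
                     "fast sports car", "slow family car"] * 5,
    "Description2": ["sweet dessert", "sour dessert",
                     "two door coupe", "five door wagon"] * 5,
    "state": ["NY", "CA", "TX", "NY"] * 5,
    "NumericCol1": [1.0, 2.0, 30.0, 40.0] * 5,
    "NumericCol2": [0.1, 0.2, 0.9, 0.8] * 5,
    "TargetCategory": ["food", "food", "vehicle", "vehicle"] * 5,
})

X = df.drop(columns="TargetCategory")
y = df["TargetCategory"]

def make_tfidf():
    # Same settings as in the question; one fresh vectorizer per text column.
    return TfidfVectorizer(max_features=500, ngram_range=(1, 3), stop_words="english")

preprocess = ColumnTransformer(
    transformers=[
        # Text columns are selected with a plain string (not a list) so that the
        # vectorizer receives a 1-D sequence of documents.
        ("desc1", make_tfidf(), "Description1"),
        ("desc2", make_tfidf(), "Description2"),
        # Assumption: OneHotEncoder replaces the question's BinaryEncoder here.
        ("state", OneHotEncoder(handle_unknown="ignore"), ["state"]),
        # Numeric columns are passed through unchanged.
        ("num", "passthrough", ["NumericCol1", "NumericCol2"]),
    ]
)

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                   random_state=111)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=111, stratify=y
)

# fit() learns the vocabularies and categories from the training rows only.
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, predictions) * 100, 2))
```

Because the preprocessing lives inside the pipeline, the same object can be handed to cross_val_score or GridSearchCV, and each fold refits the vectorizers on its own training portion.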