Creating a feature data frame after applying variance reduction

Problem description

I am building a classification model using the following data frame of 120,000 records (a sample of 5 records is shown):

[image: sample of 5 records from the data frame]

I have built the classification model:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import VarianceThreshold

model = MultinomialNB()

# Hold out 25% of the records for testing, stratified by class label
X_train, X_test, y_train, y_test = train_test_split(
    df2['descrp_clean'], df2['group_names'],
    test_size=0.25, random_state=0, stratify=df2['group_names'])

# For each record, calculate tf-idf over 1- to 3-grams seen in at least 3 documents
tfidf = TfidfVectorizer(min_df=3, ngram_range=(1, 3))

# X_train: (1) fit tf-idf and (2) reduce dimensionality by dropping low-variance features
x_train_tfidf = tfidf.fit_transform(X_train)
VT_reduce = VarianceThreshold(threshold=0.000005)
x_train_tfidf_reduced = VT_reduce.fit_transform(x_train_tfidf)

# Estimate the Naive Bayes model on the reduced training matrix
clf = model.fit(x_train_tfidf_reduced, y_train)

# X_test: apply the already fitted tf-idf and variance threshold (transform only)
x_test_tfidf = tfidf.transform(X_test)
x_test_tfidf_reduced = VT_reduce.transform(x_test_tfidf)

# Predict using the fitted model
y_pred = model.predict(x_test_tfidf_reduced)

# Compare actual to predicted results (accuracy as a percentage)
model.score(x_test_tfidf_reduced, y_test) * 100

Before applying the variance threshold, I can create a data frame that shows the word tokens:

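# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
X_train_tokens = tfidf.get_feature_names_out()
x_train_df = pd.DataFrame(X_train_tokens)
x_train_df.tail(5)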

[image: last 5 rows of x_train_df]

After variance reduction, the number of features drops to 21,758:

[image: output showing 21,758 features after variance reduction]

Question: after applying variance reduction, how can I create a data frame like x_train_df that shows my 21,758 remaining features?

Tags: python, matrix, sparse-matrix, variance

Solution
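One possible approach, offered as a sketch and not as the accepted answer: a fitted VarianceThreshold exposes get_support(), which returns a boolean mask over the input columns, and that mask can be used to index the array of tf-idf feature names. The toy corpus, the threshold of 0.08, and the names all_tokens, kept_tokens and x_train_df_reduced below are assumptions made for illustration; in the question's code the same steps would apply to tfidf and VT_reduce after fitting on X_train.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold

# Toy corpus standing in for df2['descrp_clean'] (hypothetical data)
docs = [
    "red apple pie",
    "green apple tart",
    "red cherry pie",
    "green pear tart",
]

tfidf = TfidfVectorizer()
x_tfidf = tfidf.fit_transform(docs)

# Threshold picked only so that this toy example drops some columns
VT_reduce = VarianceThreshold(threshold=0.08)
x_reduced = VT_reduce.fit_transform(x_tfidf)

# scikit-learn >= 1.0 provides get_feature_names_out(); older releases use get_feature_names()
if hasattr(tfidf, "get_feature_names_out"):
    all_tokens = np.asarray(tfidf.get_feature_names_out())
else:
    all_tokens = np.asarray(tfidf.get_feature_names())

# Boolean mask over the original columns: True where the feature was retained
mask = VT_reduce.get_support()
kept_tokens = all_tokens[mask]

# One row per retained feature, with the variance computed during fitting
x_train_df_reduced = pd.DataFrame({
    "token": kept_tokens,
    "variance": VT_reduce.variances_[mask],
})
print(x_train_df_reduced.tail(5))
```

The number of rows in x_train_df_reduced should match the number of columns in the reduced matrix, so on the real data it would be expected to equal the 21,758 features reported above.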

