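python - Interpreting regression results
Problem description
I built an ordinal regression model (this is my first time running a regression, so please bear with me), and now I need to evaluate it. What is the best way to do that? (I am using the mord API for ordinal regression.)
These are the tasks I have to complete:
3) Build a regression model that predicts each product's rating from attributes corresponding to some very common words used in the reviews (how many words to use is left to you as a decision). So, for each product, you will have a long(ish) vector of attributes based on how many times each word appears in the reviews of that product. Your target variable is the rating. You will be judged on the process of building the model (regularization, subset selection, validation set, etc.) rather than on the accuracy of the results.
4) Using the vectors from question 3, perform dimensionality reduction (PCA or NMF). Can you conclude how many components you can keep? Experiment with this parameter and justify your final conclusion.
Here is my code: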
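One common way to approach the component question in task 4 is to look at the cumulative explained variance of a full PCA and keep the smallest number of components that reaches a chosen threshold. The sketch below is only an illustration under stated assumptions: the data is synthetic (it is not the review dataset), the 0.95 threshold is an arbitrary choice, and the names `cumulative` and `n_keep` are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the word-count matrix: 200 "products" x 30 "words",
# generated from 5 latent factors plus a little noise (illustration only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 30))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 30))

# Standardize, then fit a full PCA so every component's variance share is available.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Cumulative share of the variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative share reaches the threshold (an assumed value).
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(f"components needed for {threshold:.0%} of the variance: {n_keep}")
```

Passing a float between 0 and 1 as `n_components`, as in `PCA(.95)`, makes scikit-learn apply the same kind of threshold internally. Plotting `cumulative` against the component index is a simple way to justify the chosen value.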
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import textblob
import nltk
from pandas import ExcelWriter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from textblob import Word
from collections import Counter
import seaborn as sns
import mord as m
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
%matplotlib inline
df = ...  # placeholder: load the DataFrame from the dataset link
# Clean up Rating: while cleaning by hand I saw data outside of the [0,5] range that needs to be corrected. A histogram would have revealed this too, but since I spotted it while going through the data, plotting it seemed an unnecessary step.
df.loc[df.Rating > 5, 'Rating'] = np.NaN
df.loc[df.Rating < 1, 'Rating'] = np.NaN
# Convert weights to same measure (pounds). Most of the weights I inspected seem wrong...
for i in df.index:
    cell = df.loc[i, 'weight']
    # Numeric cells (including 0) need no conversion; only parse strings.
    if isinstance(cell, str):
        number = ''.join(ch for ch in cell if ch.isdigit() or ch == '.')
        if not number:
            continue  # no digits to parse; leave the cell as it is
        num = float(number)
        if 'ounces' in cell:
            df.loc[i, 'weight'] = num * 0.0625  # ounces to pounds conversion
        else:
            df.loc[i, 'weight'] = num  # keep only the number (without the unit)
df["Review"] = df["Title"] + ' - ' + df["Text"]
df.drop(['Title', 'Text'], axis=1, inplace=True)
df.columns = ['Brand', 'Name', 'NumsHelpful', 'Rating', 'Weight(Pounds)', 'Review']
df['Weight(Pounds)'] = pd.to_numeric(df['Weight(Pounds)'], errors='coerce')
df['Brand'] = df['Brand'].astype(str)
df['Review'] = df['Review'].astype(str)
df['Name'] = df['Name'].astype(str)
d = {
    'Brand': 'first',
    'NumsHelpful': 'mean',
    'Rating': 'mean',
    'Weight(Pounds)': 'first',
    'Review': '/'.join,
}
df = df.groupby('Name').agg(d).reset_index()
df.Rating = df.Rating.round()
df.NumsHelpful = df.NumsHelpful.round()
df['Review2'] = df['Review'].apply(lambda x: " ".join(x.lower() for x in x.split()))
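# Strip punctuation; regex=True is required because newer pandas versions treat the pattern as a literal string by default
df['Review2'] = df['Review2'].str.replace(r'[^\w\s]', '', regex=True)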
stop = stopwords.words('english')
df['Review2'] = df['Review2'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
freq = pd.Series(' '.join(df['Review2']).split()).value_counts()[:20]
common = ['wine', 'mix', 'taste', 'drink', 'one', 'price', 'product', 'flavour', 'would', 'bitters', 'bottle', 'buy','really', 'make']
df['Review2'] = df['Review2'].apply(lambda x: " ".join(x for x in x.split() if x not in common))
freq = pd.Series(' '.join(df['Review2']).split()).value_counts()[-10:]
freq = list(freq.index)
df['Review2'] = df['Review2'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
df['words'] = df.Review2.str.strip().str.split(r'[\W_]+')
df['Review2'] = df['words'].apply(lambda x: " ".join([Word(word).lemmatize('v') for word in x]))
df['Review2'].str.split(expand=True).stack().value_counts()
# Create word matrix
bow = df.Review2.str.split().apply(pd.Series.value_counts)
rating = df['Rating']
bow = bow.join(rating)
# Remove some columns and rows
bow = bow.loc[(bow['Rating'].notna()), ~(bow.sum(0) < 80)]
# Divide into train and test sets
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop(columns='Rating')
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)
# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# scores -> array([0.75438596, 0.73684211, 0.66071429, 0.53571429, 0.60714286])
# avg_score -> Accuracy: 0.66 (+/- 0.16)
# Do PCA (dimensionality reduction)
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(x_train)
# Apply transform to both the training set and the test set.
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
# Keep the smallest number of components that explains 95% of the variance
pca = PCA(.95)
pca.fit(x_train)
x_train = pca.transform(x_train)
x_test = pca.transform(x_test)
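regr.fit(x_train, y_train)
# Cross-validate on the PCA-transformed training data; passing the raw `bow` here would ignore the scaling and PCA steps above
scores = cross_val_score(regr, x_train, y_train, cv=10, scoring='accuracy')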
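What do you think of the code above?
Any insights are much appreciated!
Edit:
Here is the link to the dataset
Here is the link to the google.doc containing the source code (Python)
Solution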
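One possible way to evaluate a model like this, offered as a sketch rather than as the accepted answer: put scaling, PCA and the regressor in a single scikit-learn pipeline so that every cross-validation fold refits the preprocessing on its own training part, score with mean absolute error (which, unlike accuracy, penalizes a prediction of 1 for a true 5 more than a prediction of 4), and compare against a trivial baseline. The data below is synthetic, and `Ridge` is used as a stand-in assumption so the example runs without mord; since `mord.OrdinalRidge` follows the scikit-learn estimator interface, it should be able to take its place.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustration only): 300 products, 40 word counts,
# and a 1-5 rating driven by the first three words.
rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(300, 40)).astype(float)
signal = X[:, 0] - X[:, 1] + 0.5 * X[:, 2]
y = np.clip(np.round(3 + (signal - signal.mean()) / signal.std()), 1, 5)

# Scaling and PCA live inside the pipeline, so each CV fold refits them
# on that fold's training part only (no leakage from the held-out part).
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), Ridge(alpha=1.0))

# Trivial baseline: always predict the median rating of the training part.
baseline = DummyRegressor(strategy="median")

# scikit-learn reports negated MAE so that larger is better; flip the sign back.
model_mae = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()
baseline_mae = -cross_val_score(baseline, X, y, cv=5, scoring="neg_mean_absolute_error").mean()

print(f"model MAE:    {model_mae:.3f}")
print(f"baseline MAE: {baseline_mae:.3f}")
```

A model is only worth keeping if it beats the baseline by a clear margin. The same pipeline can be rerun with different `n_components` or `alpha` values to compare settings on equal terms.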