Interpreting regression results

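Problem description

I built an ordinal regression model (this is my first time running a regression, so please bear with me), and now I need to evaluate it. What is the best way to do that? (I am using the mord API for ordinal regression.)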

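These are the tasks I need to complete:

3) Build a regression model that predicts each product's rating from attributes corresponding to some very common words used in the reviews (how many words to use is left to you as a decision). So, for each product, you will have a long(ish) vector of attributes based on how many times each word appears in that product's reviews. Your target variable is the rating. You will be judged on the process of building the model (regularization, subset selection, validation set, etc.) rather than on the accuracy of the results.

4) Using the vectors from question 3, perform dimensionality reduction (PCA or NMF). Can you conclude how many components you can keep? Experiment with this parameter and justify your final conclusion.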
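
For task 4, one common way to justify the number of components is to look at the cumulative share of variance they explain. The following is a minimal sketch under stated assumptions: the matrix is synthetic (three hidden factors plus noise) and stands in for the real word-count vectors, and the 95% threshold and all variable names are illustrative choices, not part of the assignment.

```python
import numpy as np

# Synthetic stand-in for the word-count matrix: 100 products, 20 word columns
# driven by 3 hidden factors plus a little noise (an assumption for illustration).
rng = np.random.default_rng(0)
factors = rng.normal(size=(100, 3))
loadings = rng.normal(size=(3, 20))
X = factors @ loadings + 0.1 * rng.normal(size=(100, 20))

# PCA via SVD of the centered matrix: the squared singular values are
# proportional to the variance captured by each component.
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

for k in (1, 2, 3, 5, 10):
    print(f"{k:>2} components: {cumulative[k - 1]:.3f} of the variance")
print(f"components needed for {threshold:.0%}: {n_keep}")
```

Repeating this for a few thresholds, and checking how the downstream model's validation score changes with the number of components, gives an argument for the final choice rather than a single fixed cutoff.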

Here is my code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import textblob
import nltk
from pandas import ExcelWriter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from textblob import Word
from collections import Counter
import seaborn as sns
import mord as m
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

%matplotlib inline

df = # import dataframe from link

# Clean up Rating: while cleaning by hand I saw values outside the [1, 5] range, which need to be corrected.
# A histogram would also have revealed this, but since I spotted it while going through the data I skip the plot.
df.loc[df.Rating > 5, 'Rating'] = np.nan
df.loc[df.Rating < 1, 'Rating'] = np.nan

# Convert weights to same measure (pounds). Most of the weights I inspected seem wrong...

def to_pounds(cell):
    # Cells that are already numeric (including 0 and NaN) are left unchanged
    if isinstance(cell, (int, float)):
        return cell
    # Keep only digits and the decimal point, dropping the unit text
    num = float(''.join(ch for ch in cell if ch.isdigit() or ch == '.'))
    # Ounces to pounds conversion; any other unit is kept as the bare number
    return num * 0.0625 if 'ounces' in cell else num

# apply() visits every row, including the last one, which the index-based loop skipped
df['weight'] = df['weight'].apply(to_pounds)

df['Review'] = df['Title'] + ' - ' + df['Text']
df = df.drop(columns=['Title', 'Text'])
df.columns = ['Brand', 'Name', 'NumsHelpful', 'Rating', 'Weight(Pounds)', 'Review']
df['Weight(Pounds)'] = pd.to_numeric(df['Weight(Pounds)'], errors='coerce')
df['Brand'] = df['Brand'].astype(str)
df['Review'] = df['Review'].astype(str)
df['Name'] = df['Name'].astype(str)

d = {'Brand':'first', 
     'NumsHelpful':'mean', 
     'Rating':'mean',
     'Weight(Pounds)':'first',
     'Review':'/'.join, 
    }
df = df.groupby('Name').agg(d).reset_index()

df.Rating = df.Rating.round()
df.NumsHelpful = df.NumsHelpful.round()

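# Lowercase everything and collapse repeated whitespace
df['Review2'] = df['Review'].apply(lambda text: " ".join(word.lower() for word in text.split()))

# Strip punctuation; regex=True is needed because pandas 2.0 and later treat the pattern as a literal string by default
df['Review2'] = df['Review2'].str.replace(r'[^\w\s]', '', regex=True)

# Remove English stop words
stop = stopwords.words('english')
df['Review2'] = df['Review2'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))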

# Inspect the 20 most frequent words
freq = pd.Series(' '.join(df['Review2']).split()).value_counts()[:20]

# Remove a hand-picked list of very common words
common = ['wine', 'mix', 'taste', 'drink', 'one', 'price', 'product', 'flavour', 'would', 'bitters', 'bottle', 'buy','really', 'make']
df['Review2'] = df['Review2'].apply(lambda x: " ".join(x for x in x.split() if x not in common))

# Remove the 10 least frequent words
freq = pd.Series(' '.join(df['Review2']).split()).value_counts()[-10:]

freq = list(freq.index)
df['Review2'] = df['Review2'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))

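# Tokenize on non-word characters and underscores (raw string avoids invalid-escape warnings)
df['words'] = df.Review2.str.strip().str.split(r'[\W_]+')

# Lemmatize each token as a verb
df['Review2'] = df['words'].apply(lambda x: " ".join([Word(word).lemmatize('v') for word in x]))

# Word frequencies after cleaning
df['Review2'].str.split(expand=True).stack().value_counts()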
# Create word matrix: one row per product, one column per word, values are counts
bow = df.Review2.str.split().apply(pd.Series.value_counts)
# Attach the target so rows can be filtered on it
bow = bow.join(df['Rating'])

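# Drop rows without a rating and word columns with fewer than 80 total occurrences
# (the Rating column is part of this filter and survives only because its sum is at least 80)
bow = bow.loc[bow['Rating'].notna(), bow.sum(0) >= 80]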

# Divide into train and test sets
bow = bow.fillna(0)
rating = bow['Rating']
# The axis must be given by keyword; the positional form drop('Rating', 1) fails in pandas 2.0 and later
bow = bow.drop(columns='Rating')
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)

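# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
# cross_val_score clones and refits the model on each fold of the full data set,
# so these scores do not depend on the fit above or on the train/test split
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# scores -> array([0.75438596, 0.73684211, 0.66071429, 0.53571429, 0.60714286])
# avg_score -> Accuracy: 0.66 (+/- 0.16)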

# Do PCA (dimensionality reduction)
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(x_train)
# Apply transform to both the training set and the test set.
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
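# Keep as many components as needed to explain 95% of the variance
pca = PCA(.95)
pca.fit(x_train)
x_train = pca.transform(x_train)
x_test = pca.transform(x_test)
regr.fit(x_train, y_train)
# Score on the PCA-transformed training data; passing the raw `bow` matrix here
# would bypass the scaling and PCA steps entirely
scores = cross_val_score(regr, x_train, y_train, cv=10, scoring='accuracy')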

What do you think of the code above?

Any insights are much appreciated!


Edit:

Here is the link to the dataset

Here is the link to the Google Doc containing the source code (Python)

Tags: python, plot, regression

Solution
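
One possible way to evaluate a model like this, offered as a sketch under assumptions rather than a definitive answer. Accuracy counts a prediction of 4 for a true 5 the same as a prediction of 1, so it can help to report mean absolute error (MAE) on the rating scale alongside it. It also helps to put scaling and PCA inside a pipeline so that cross-validation refits them on each training fold. In the code below the data is synthetic, and `RoundedRidge` is a hypothetical stand-in written for this example (ridge regression with rounded, clipped predictions); it is not the `mord.OrdinalRidge` class, which could be dropped into the same pipeline instead.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class RoundedRidge(Ridge):
    """Hypothetical stand-in for an ordinal model: ridge regression whose
    predictions are rounded and clipped to the 1-5 rating scale."""

    def predict(self, X):
        return np.clip(np.round(super().predict(X)), 1, 5)


# Synthetic word-count matrix and ratings, used only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 30)).astype(float)
signal = X[:, :5].sum(axis=1) - X[:, 5:10].sum(axis=1)
y = np.clip(np.round(3 + 0.5 * signal + rng.normal(0, 0.5, 200)), 1, 5)

# Scaling and PCA are refit on the training part of every fold,
# so nothing from the held-out fold leaks into the preprocessing.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), RoundedRidge(alpha=1.0))
cv = KFold(n_splits=5, shuffle=True, random_state=0)

accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
# scikit-learn reports MAE as a negative score, so flip the sign.
mae = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")

print(f"accuracy: {accuracy.mean():.2f} (+/- {accuracy.std():.2f})")
print(f"MAE:      {mae.mean():.2f} (+/- {mae.std():.2f})")
```

Since the assignment grades the process rather than the score, the same pipeline can be rerun with different `alpha` values or numbers of components, comparing the cross-validated MAE of each setting and keeping the test split untouched until the end.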

