ValueError: Found input variables with inconsistent numbers of samples: [1, 700]

Problem description

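I'm running a regression on the Titanic survivor prediction data provided by Kaggle, using scikit-learn's LogisticRegression. I'm trying to predict the Survived column, but I keep getting this error, even after reshaping Y.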

from sklearn.linear_model import LogisticRegression
from csv import reader
import numpy as np

file = open('train.csv', "r")
lines = reader(file)
X = list(lines)
#Deleting unnecessary features
X=np.delete(X, (0), axis=0)
X=np.delete(X, (0), axis=1)
X=np.delete(X, (2), axis=1)
X=np.delete(X, (3), axis=1)
X=np.delete(X, (5), axis=1)
X=np.delete(X, (5), axis=1)
X=np.delete(X, (5), axis=1)
X=np.delete(X, (5), axis=1)
#Converting males to 1 and females to 0
for i in range(891):
   if X[i][2]== 'male':
       X[i][2]=1
   else:
       X[i][2]=0
Y=X.T[0]
#Converting strings to float
X1 = X.astype(np.float) 
Y1 = Y.astype(np.float)
Xw=X1.reshape(-1,1)
split = 700
train,test = Xw[:split,:],Xw[split:,:]
Ytrain,Ytest = Y1[:split],Y1[:split]
logisticRegr = LogisticRegression()
logisticRegr.fit(train.T, Ytrain)
logisticRegr.predict(test[0].T.reshape(1,-1))
score = logisticRegr.score(test.T, Ytest)

Tags: python-3.x, numpy, machine-learning, scikit-learn, linear-regression

Solution


I strongly recommend getting familiar with pandas, a library for data handling. You could try the following:

# import
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

import pandas as pd
df = pd.read_csv('train.csv')

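# Encode sex as a number; in Kaggle's train.csv the column is named 'Sex'
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Keep only numeric feature columns and leave the target out of the features
# (these four match the columns the question keeps after its np.delete calls)
X = df[['Pclass', 'Sex', 'SibSp', 'Parch']]
y = df['Survived']

# Split rows into 700 training samples and the rest for testing
trainX, testX, trainY, testY = train_test_split(X, y, train_size=700, stratify=y)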

logisticRegr = LogisticRegression()
logisticRegr.fit(trainX, trainY)

preds = logisticRegr.predict(testX)
score = metrics.accuracy_score(testY, preds)
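
For anyone wondering where the [1, 700] in the error comes from, here is a minimal sketch of the shape problem in the question's NumPy code. It uses a synthetic array instead of train.csv, which is an assumption made for illustration: 891 rows, with column 0 as the Survived label and columns 1 to 4 as numeric features, matching what is left after the np.delete calls. Calling reshape(-1, 1) flattens the whole table into a single column, so slicing 700 entries and transposing gives one sample with 700 features, while Ytrain still holds 700 labels.

```python
import numpy as np

# Synthetic stand-in for the cleaned Titanic array (assumption, not real data):
# 891 rows, column 0 = Survived label, columns 1-4 = numeric features.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(891, 5)).astype(float)

split = 700

# What the question's code does: flatten everything into one column,
# take the first 700 values, then transpose.
flat = data.reshape(-1, 1)        # shape (4455, 1)
bad_train = flat[:split, :].T     # shape (1, 700): 1 sample, 700 features
labels = data[:split, 0]          # shape (700,): 700 labels
print(bad_train.shape, labels.shape)

# A row-wise split keeps one row per passenger and keeps the label
# column out of the feature matrix.
X = data[:, 1:]
y = data[:, 0]
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

With shapes like these, fit(X_train, y_train) receives the same number of samples on both sides and no transpose is needed. Two other things in the question's code are worth checking: Ytest is taken from Y1[:split], which reuses the training labels (Y1[split:] would give the held-out ones), and np.float was removed in NumPy 1.24, so the built-in float is the safer choice for astype.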
