Getting an empty plot when plotting cost vs. epochs for a multiple linear regression model

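Problem description

I am learning machine learning and trying to implement multiple linear regression on a car price dataset to predict the prices of cars.

This is my dataset

Link to my Jupyter notebook code

Here is my code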

 In [2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:

temp = pd.read_csv('car_price.csv')

In [4]:

temp.columns

Out[4]:

Index(['name', 'year', 'selling_price', 'km_driven', 'fuel', 'seller_type',
       'transmission', 'owner', 'mileage', 'engine', 'max_power', 'torque',
       'seats'],
      dtype='object')

In [5]:

data = temp[['year', 'selling_price', 'km_driven', 'fuel', 'seller_type',
       'transmission', 'owner', 'mileage', 'engine', 'max_power', 'torque',
       'seats'
            ]]
In [6]:

data['Current_Year'] = 2020

In [7]:

data.head()

Out[7]:

year    selling_price   km_driven   fuel    seller_type transmission    owner   mileage engine  max_power   torque  seats   Current_Year
0   2014    450000  145500  Diesel  Individual  Manual  First Owner 23.4 kmpl   1248 CC 74 bhp  190Nm@ 2000rpm  5.0 2020
1   2014    370000  120000  Diesel  Individual  Manual  Second Owner    21.14 kmpl  1498 CC 103.52 bhp  250Nm@ 1500-2500rpm 5.0 2020
2   2006    158000  140000  Petrol  Individual  Manual  Third Owner 17.7 kmpl   1497 CC 78 bhp  12.7@ 2,700(kgm@ rpm)   5.0 2020
3   2010    225000  127000  Diesel  Individual  Manual  First Owner 23.0 kmpl   1396 CC 90 bhp  22.4 kgm at 1750-2750rpm    5.0 2020
4   2007    130000  120000  Petrol  Individual  Manual  First Owner 16.1 kmpl   1298 CC 88.2 bhp    11.5@ 4,500(kgm@ rpm)   5.0 2020
In [8]: 

data['# Years'] = data['Current_Year'] - data['year']

In [9]:

to_drop = ['Current_Year','year','torque','max_power','seller_type','owner']
data.drop(to_drop, inplace = True, axis = 1)

In [10]:

data.head()

Out[10]:
selling_price   km_driven   fuel    transmission    mileage engine  seats   # Years
0   450000  145500  Diesel  Manual  23.4 kmpl   1248 CC 5.0 6
1   370000  120000  Diesel  Manual  21.14 kmpl  1498 CC 5.0 6
2   158000  140000  Petrol  Manual  17.7 kmpl   1497 CC 5.0 14
3   225000  127000  Diesel  Manual  23.0 kmpl   1396 CC 5.0 10
4   130000  120000  Petrol  Manual  16.1 kmpl   1298 CC 5.0 13

In [11]:
data['engine'] = data['engine'].str.replace(r'[^\d.]', '', regex=True).astype(float)

In [12]:
data['mileage'] = data['mileage'].str.replace(r'[^\d.]', '', regex=True).astype(float)

In [13]:
data.head()

Out[13]:
selling_price   km_driven   fuel    transmission    mileage engine  seats   # Years
0   450000  145500  Diesel  Manual  23.40   1248.0  5.0 6
1   370000  120000  Diesel  Manual  21.14   1498.0  5.0 6
2   158000  140000  Petrol  Manual  17.70   1497.0  5.0 14
3   225000  127000  Diesel  Manual  23.00   1396.0  5.0 10
4   130000  120000  Petrol  Manual  16.10   1298.0  5.0 13

In [14]:
data.replace(to_replace = ['Diesel','Petrol','LPG','CNG'],value=[1,2,3,4],inplace = True)

In [15]:
data.head()

Out[15]:
selling_price   km_driven   fuel    transmission    mileage engine  seats   # Years
0   450000  145500  1   Manual  23.40   1248.0  5.0 6
1   370000  120000  1   Manual  21.14   1498.0  5.0 6
2   158000  140000  2   Manual  17.70   1497.0  5.0 14
3   225000  127000  1   Manual  23.00   1396.0  5.0 10
4   130000  120000  2   Manual  16.10   1298.0  5.0 13

In [16]:
data.replace(to_replace = ['Manual','Automatic'],value=[1.0,2.0],inplace = True)

In [17]:
data.head()

Out[17]:
selling_price   km_driven   fuel    transmission    mileage engine  seats   # Years
0   450000  145500  1   1.0 23.40   1248.0  5.0 6
1   370000  120000  1   1.0 21.14   1498.0  5.0 6
2   158000  140000  2   1.0 17.70   1497.0  5.0 14
3   225000  127000  1   1.0 23.00   1396.0  5.0 10
4   130000  120000  2   1.0 16.10   1298.0  5.0 13

In [21]:
data = (data - data.mean())/data.std()



X = data.iloc[:,1:8]

ones = np.ones([X.shape[0],1])
X = np.concatenate((ones,X),axis=1)

y = data.iloc[:,0:1].values 
theta = np.zeros([1,8])

print(X)





def computeCost(X,y,theta):
    tobesummed = np.power(((X @ theta.T)-y),2)
    return np.sum(tobesummed)/(2 * len(X))

def gradientDescent(X,y,theta,iters,alpha):
    cost = np.zeros(iters)
    for i in range(iters):
        theta = theta - (alpha/len(X)) * np.sum(X * (X @ theta.T - y), axis=0)
        cost[i] = computeCost(X, y, theta)
        
    
    return theta,cost


alpha = 0.01
iters = 1000

g,cost = gradientDescent(X,y,theta,iters,alpha)
print(g)

finalCost = computeCost(X,y,g)
print(finalCost)


 
fig, ax = plt.subplots()  
ax.plot(np.arange(iters), cost, 'r')  
ax.set_xlabel('Iterations')  
ax.set_ylabel('Cost')  
ax.set_title('Error vs. Training Epoch')

[[ 1.          1.33828022 -0.86972865 ... -0.41797619 -0.43426926
  -0.04846121]
 [ 1.          0.88735626 -0.86972865 ...  0.07813794 -0.43426926
  -0.04846121]
 [ 1.          1.24102211  0.95315801 ...  0.07615349 -0.43426926
   1.92965648]
 ...
 [ 1.          0.88735626 -0.86972865 ... -0.41797619 -0.43426926
   1.18786235]
 [ 1.         -0.79255652 -0.86972865 ... -0.12427662 -0.43426926
   0.1988035 ]
 [ 1.         -0.79255652 -0.86972865 ... -0.12427662 -0.43426926
   0.1988035 ]]

[[nan nan nan nan nan nan nan nan]]
nan

Out[21]:
Text(0.5, 1.0, 'Error vs. Training Epoch')


When I plot cost vs. epochs I get an empty plot, and when I print the cost values I get nan.

I can't figure out where I am going wrong.

Tags: python, pandas, numpy, machine-learning, linear-regression

Solution

Your data contains 663 null values in total, which is why theta and the cost come out as nan:

data.isnull().values.sum()
663
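To see why a handful of missing values is enough to break everything, here is a minimal sketch on a made-up three-column frame (the values are illustrative, not taken from the real dataset). It assumes the missing entries sit in feature columns such as mileage. pandas skips NaN when computing the mean and standard deviation, so standardizing leaves the NaN in place, and a single NaN in X then poisons any sum taken over all rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the car data; the NaN mimics a missing 'mileage'.
toy = pd.DataFrame({
    "selling_price": [450000.0, 370000.0, 158000.0, 225000.0],
    "mileage": [23.4, np.nan, 17.7, 23.0],
    "engine": [1248.0, 1498.0, 1497.0, 1396.0],
})

# Per-column count of missing values shows where the gaps are.
nan_per_column = toy.isnull().sum()
print(nan_per_column)

# mean() and std() skip NaN by default, so the NaN survives standardization.
scaled = (toy - toy.mean()) / toy.std()
print(scaled.isnull().values.sum())

# Even with theta all zeros, NaN * 0 is NaN, so the row with the gap
# yields a NaN prediction and the sum over all rows becomes NaN.
X = scaled[["mileage", "engine"]].values
theta = np.zeros((1, 2))
predictions = X @ theta.T
total = np.sum(predictions)
print(total)
```

Calling data.isnull().sum() on the real frame would show which columns hold the 663 missing values.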

I applied mean imputation to replace all the NaNs.

data = data.fillna(data.mean())
data.isnull().values.sum()
0
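A small sketch of what mean imputation does, again on a made-up frame rather than the real data. The point to note is the ordering: the fill has to happen before X and y are extracted (and before standardizing), otherwise the arrays still carry the NaNs. Whether mean imputation is the best choice for this dataset is a separate question; dropping the incomplete rows with dropna() is another common option.

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps in two feature columns.
toy = pd.DataFrame({
    "selling_price": [450000.0, 370000.0, 158000.0, 225000.0],
    "mileage": [23.4, np.nan, 17.7, 23.0],
    "seats": [5.0, 5.0, np.nan, 7.0],
})

# Mean imputation: each NaN is replaced by the mean of its own column,
# computed from the values that are present.
filled = toy.fillna(toy.mean())
print(filled.isnull().values.sum())

# Build X and y only after imputing, so no NaN reaches the arrays.
X = filled[["mileage", "seats"]].values
y = filled[["selling_price"]].values
```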

Then I ran the rest of the code:

alpha = 0.01
iters = 1000
g,cost = gradientDescent(X,y,theta,iters,alpha)
print(g)
[[ 3.37617073e-16 -1.03407822e-01 -7.06228959e-02  3.63774873e-01
   2.51480688e-03  4.51868433e-01 -2.02133329e-01 -2.67955488e-01]]
finalCost = computeCost(X,y,g)
print(finalCost)
0.21914950571622366
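As a sanity check on the update rule itself, the sketch below runs the same cost function and gradient step on synthetic, NaN-free data (not the car dataset) and compares the result with NumPy's least-squares solver. The snake_case function names and the synthetic coefficients are assumptions made for this illustration. With clean input the cost decreases steadily and theta lands close to the least-squares solution, which suggests the original loop was fine and only the input was at fault.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features with a known linear relationship plus a little noise.
n = 200
features = rng.normal(size=(n, 2))
y = (0.5 * features[:, [0]] - 0.3 * features[:, [1]]
     + 0.01 * rng.normal(size=(n, 1)))

# Prepend the bias column of ones, as in the question.
X = np.concatenate((np.ones((n, 1)), features), axis=1)
theta = np.zeros((1, 3))

def compute_cost(X, y, theta):
    # Half the mean squared error.
    return np.sum(np.power(X @ theta.T - y, 2)) / (2 * len(X))

def gradient_descent(X, y, theta, iters, alpha):
    cost = np.zeros(iters)
    for i in range(iters):
        # Same batch update as in the question.
        theta = theta - (alpha / len(X)) * np.sum(X * (X @ theta.T - y), axis=0)
        cost[i] = compute_cost(X, y, theta)
    return theta, cost

g, cost = gradient_descent(X, y, theta, iters=1000, alpha=0.01)

# Closed-form least-squares fit for comparison.
lstsq_theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(g.ravel())
print(lstsq_theta.ravel())
```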

Plot:

fig, ax = plt.subplots()  
ax.plot(np.arange(iters), cost, 'r')  
ax.set_xlabel('Iterations')  
ax.set_ylabel('Cost')  
ax.set_title('Error vs. Training Epoch')

Text(0.5, 1.0, 'Error vs. Training Epoch')
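The empty plot in the question follows from the same cause: matplotlib does not draw points whose value is NaN, so a cost history made entirely of NaN produces axes with no line. A small guard like the hypothetical helper below (the name check_cost_history is made up for this sketch) can flag the problem before plotting.

```python
import numpy as np

def check_cost_history(cost):
    """Return the number of non-finite (NaN or inf) entries in a cost history."""
    cost = np.asarray(cost, dtype=float)
    return int(np.count_nonzero(~np.isfinite(cost)))

# A history like the one in the question: every entry is NaN.
bad_history = np.full(5, np.nan)
# A healthy history: finite and decreasing.
good_history = np.array([0.5, 0.4, 0.3])

print(check_cost_history(bad_history))
print(check_cost_history(good_history))
```

If the count is nonzero, it is worth inspecting X and y with np.isnan before looking at the plotting code.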

Note: I don't think you should standardize the entire DataFrame like this:

data = (data - data.mean())/data.std()

The target variable selling_price should be excluded from scaling.
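One way to do that, sketched on a made-up frame: standardize only the feature columns and leave selling_price in its original units. The variable names target and feature_cols are assumptions for this example. With an unscaled target the cost values become much larger in magnitude, so the learning rate and iteration count may need retuning, but the predictions come out directly as prices.

```python
import numpy as np
import pandas as pd

# Illustrative, NaN-free frame.
toy = pd.DataFrame({
    "selling_price": [450000.0, 370000.0, 158000.0, 225000.0],
    "km_driven": [145500.0, 120000.0, 140000.0, 127000.0],
    "mileage": [23.4, 21.14, 17.7, 23.0],
})

target = "selling_price"
feature_cols = [c for c in toy.columns if c != target]

# Standardize the features only; the target column is left untouched.
scaled = toy.copy()
scaled[feature_cols] = (
    (toy[feature_cols] - toy[feature_cols].mean()) / toy[feature_cols].std()
)

# Bias column plus scaled features, and the target in original units.
X = np.concatenate(
    (np.ones((len(scaled), 1)), scaled[feature_cols].values), axis=1
)
y = scaled[[target]].values
```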

