Learning a linear model of the form y = a*x + b in Python

Question

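I'm currently using Python to learn how to do linear regression with the Fortune 500 dataset. So far I have cleaned the dataset by removing the N.A. entries. However, now that I've reached Question D, I'm not sure how to build the model. Based on the instructions, I assume I should use Revenue (in millions) for x, but I don't know what else should go into X. How do I proceed and build this model?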

Part B: Clean the dataset by removing the records (rows) whose profit is N.A., and study the relationship between revenue and profit.

dfCleanX = df[ df['Profit (in millions)']!='N.A.']
dfCleanX.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25131 entries, 0 to 25499
Data columns (total 5 columns):
Year                     25131 non-null int64
Rank                     25131 non-null int64
Revenue (in millions)    25131 non-null float64
Profit (in millions)     25131 non-null object
Company                  25131 non-null object
dtypes: float64(1), int64(2), object(2)
memory usage: 1.2+ MB

dfClean = dfCleanX.astype({'Profit (in millions)': 'float64'})

print(dfClean.values.shape )

dfClean.info()

(25131, 5)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25131 entries, 0 to 25499
Data columns (total 5 columns):
Year                     25131 non-null int64
Rank                     25131 non-null int64
Revenue (in millions)    25131 non-null float64
Profit (in millions)     25131 non-null float64
Company                  25131 non-null object
dtypes: float64(2), int64(2), object(1)
memory usage: 1.2+ MB

dfClean.plot.scatter(x='Revenue (in millions)', y='Profit (in millions)')

<matplotlib.axes._subplots.AxesSubplot at 0x23e0222a3c8>

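Part C: In this part we focus only on the "positive profit" cases. We want to examine the relationship between revenue (i.e. x) and profit (i.e. y) in order to build a linear model y = a*x + b.

Visualize y against x, where y and x are profit (> 0) and revenue.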

positiveProfitMask = dfClean['Profit (in millions)'] > 0
dfClean[ positiveProfitMask  ].plot.scatter(
    x='Revenue (in millions)', 
    y='Profit (in millions)'
    )

<matplotlib.axes._subplots.AxesSubplot at 0x23e023b8358>

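Question D: Focus only on the "positive profit" cases. Fill in the missing code in the cell below to

  1. learn a linear model of the form y = a*x + b that models the relationship between revenue (i.e. x) and positive profit (i.e. y),
  2. use the model to find the predicted profit for these cases, and
  3. plot the predictions together with the data to see how well the model fits the data.
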
from sklearn.linear_model import LinearRegression

x = dfClean[(Revenues (in millions) )][??? ]
y = dfClean[( Profits (in millions) )][??? ]

model = LinearRegression(fit_intercept=True)
model.fit(positiveProfitMask  , y)

print( "model.coef_ =", model.coef_ )
print( "model.intercept_ =", model.intercept_ )
print( "Linear model about y(profit) and x(revenue): y=",  
       model.coef_, '* x +', model.intercept_)
yfit = model.predict(???  )

plt.scatter(x, y)
plt.plot(x, yfit, 'r');

Tags: python, scikit-learn, linear-regression

Answer


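If the only line left to fill in is yfit = model.predict(???), you just need to pass it the feature matrix X to see what the model predicts for those values. scikit-learn expects X to be 2-D, with shape (n_samples, n_features), so a single revenue column has to be passed as one column rather than as a flat vector. Since you only want positive profits, filter out the other rows first, and fit the model on that same filtered X rather than on the boolean mask.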
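To make the shape requirement concrete, here is a minimal sketch on made-up numbers (the arrays below are illustrative and are not taken from the Fortune 500 data). It reshapes a flat vector into a single-column matrix before fitting and predicting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that lies exactly on y = 2*x + 1 (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0])   # 1-D, shape (4,)
y = 2.0 * x + 1.0

# LinearRegression wants a 2-D feature matrix: one row per sample,
# one column per feature. reshape(-1, 1) turns the vector into one column.
X = x.reshape(-1, 1)                  # shape (4, 1)

model = LinearRegression(fit_intercept=True).fit(X, y)

# predict() takes the same 2-D layout, even for a single new value.
pred = model.predict(np.array([[5.0]]))
print(model.coef_, model.intercept_, pred)
```

Passing the flat x directly to fit() raises a ValueError about the expected 2-D array, which is a common stumbling block with single-feature models.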

Here's how you can do it in pandas:

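 # keep only the rows with positive profit
 cleaned_df = dfClean[dfClean['Profit (in millions)'] > 0]
 y = cleaned_df['Profit (in millions)'].values
 # double brackets keep X 2-D, shape (n_samples, 1)
 X = cleaned_df[['Revenue (in millions)']].values

 model.fit(X, y)   # fit on the filtered data, not on the mask
 yfit = model.predict(X)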
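For reference, the whole of Question D can be sketched end to end as follows. This is only an illustration under stated assumptions: the DataFrame toy is synthetic stand-in data (its numbers are invented, not the real Fortune 500 values), and only the column names are borrowed from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the cleaned dataset (made-up numbers).
rng = np.random.default_rng(0)
revenue = rng.uniform(100, 10_000, size=200)
profit = 0.05 * revenue + 20 + rng.normal(0, 30, size=200)
profit[:20] = -np.abs(profit[:20])    # force some non-positive profits
toy = pd.DataFrame({
    'Revenue (in millions)': revenue,
    'Profit (in millions)': profit,
})

# Step 0: restrict both x and y to the positive-profit rows.
mask = toy['Profit (in millions)'] > 0
x = toy.loc[mask, ['Revenue (in millions)']]   # list of columns -> 2-D
y = toy.loc[mask, 'Profit (in millions)']

# Step 1: learn y = a*x + b.
model = LinearRegression(fit_intercept=True)
model.fit(x, y)

# Step 2: predicted profit for the same cases.
yfit = model.predict(x)

a = model.coef_[0]
b = model.intercept_
print(f"y = {a:.4f} * x + {b:.4f}")
```

With x, y and yfit defined this way, the plotting lines from the question's cell (plt.scatter(x, y) followed by plt.plot(x, yfit, 'r')) cover step 3, provided matplotlib.pyplot has been imported as plt.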
