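python - How do I put the F-statistic and P-value into a table?
Question
How can I reduce this code to a single for loop and create a table showing the F-statistic and P-value for each feature?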
print(scipystats.f_oneway(df_data.loc[df_data["SaleCondition"] == 'Normal'].SalePrice,
                          df_data.loc[df_data["SaleCondition"] == 'Abnorml'].SalePrice,
                          df_data.loc[df_data["SaleCondition"] == 'Partial'].SalePrice,
                          df_data.loc[df_data["SaleCondition"] == 'AdjLand'].SalePrice,
                          df_data.loc[df_data["SaleCondition"] == 'Alloca'].SalePrice,
                          df_data.loc[df_data["SaleCondition"] == 'Family'].SalePrice))
>>> F_onewayResult(statistic=45.57842830969571, pvalue=7.988268404991176e-44)
print(scipystats.f_oneway(df_data.loc[df_data["Fence"] == 'MnPrv'].SalePrice,
                          df_data.loc[df_data["Fence"] == 'GdWo'].SalePrice,
                          df_data.loc[df_data["Fence"] == 'GdPrv'].SalePrice,
                          df_data.loc[df_data["Fence"] == 'MnWw'].SalePrice))
>>> F_onewayResult(statistic=4.948158647146986, pvalue=0.002312645635631918)
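How can I create a table and extract the F-statistic and P-value into the corresponding columns, with the variables sorted so that the highest F-statistic comes first (that is, in descending order)?
Edited - which result is more accurate?
Results from my approach: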
F-statistics P-value
ExterQual 443.334831 1.439551e-204
KitchenQual 407.806352 3.032213e-192
BsmtQual 392.913506 9.610615e-186
GarageFinish 250.962467 1.199117e-93
MasVnrType 111.672380 4.793331e-65
Foundation 100.253851 5.791895e-91
CentralAir 98.305344 1.809506e-22
HeatingQC 88.394462 2.667062e-67
Neighborhood 71.784865 1.558600e-225
GarageType 71.522123 1.247154e-66
BsmtExposure 70.887984 1.022671e-42
BsmtFinType1 67.602175 1.807731e-63
SaleCondition 45.578428 7.988268e-44
MSZoning 43.840282 8.817634e-35
PavedDrive 42.024179 1.803569e-18
LotShape 40.132852 6.447524e-25
Alley 35.562060 4.899826e-08
SaleType 28.863054 5.039767e-42
FireplaceQu 24.398929 5.016300e-19
Electrical 23.067673 1.663249e-18
HouseStyle 19.595001 3.376777e-25
Exterior1st 18.611743 2.586089e-43
RoofStyle 17.805497 3.653523e-17
Exterior2nd 17.500840 4.842186e-43
BsmtCond 14.030600 5.136901e-09
BldgType 13.011077 2.056736e-10
LandContour 12.850188 2.742217e-08
GarageQual 9.570389 1.240803e-07
GarageCond 9.541161 1.309714e-07
ExterCond 8.798714 5.106681e-07
LotConfig 7.809954 3.163167e-06
RoofMatl 6.727305 7.231445e-08
Condition1 6.118017 8.904549e-08
Fence 4.948159 2.312646e-03
Heating 4.259819 7.534721e-04
Functional 4.057875 4.841697e-04
BsmtFinType2 2.702450 1.941009e-02
Street 2.459290 1.170486e-01
MiscFeature 2.157324 1.047276e-01
Condition2 2.073899 4.342566e-02
LandSlope 1.958817 1.413964e-01
PoolQC 1.627469 3.039853e-01
Utilities 0.298804 5.847168e-01
MSSubClass NaN NaN
MoSold NaN NaN
YrSold NaN NaN
Results from @kitman0804's approach:
def anova(data, x, y):
    # One group of y values per distinct level of x (unique() keeps NaN as a level)
    x_val = data[x].unique()
    # Note: y is read from df_data here, while the group masks come from data
    fstat = scipy.stats.f_oneway(*[df_data[y][data[x].isin([x_v])] for x_v in x_val])
    tbl = pd.DataFrame({'F-statistics': [fstat.statistic], 'P-value': [fstat.pvalue]})
    tbl.index = [x]
    return tbl

f2_table = pd.concat([anova(categorical_data, x, 'SalePrice') for x in categorical_data.columns])
F-statistics P-value
ExterQual 443.334831 1.439551e-204
KitchenQual 407.806352 3.032213e-192
BsmtQual 316.148635 8.158548e-196
GarageFinish 213.867028 6.228747e-115
FireplaceQu 121.075121 2.971217e-107
Foundation 100.253851 5.791895e-91
CentralAir 98.305344 1.809506e-22
HeatingQC 88.394462 2.667062e-67
MasVnrType 84.672201 1.054025e-64
GarageType 80.379992 6.117026e-87
Neighborhood 71.784865 1.558600e-225
BsmtFinType1 64.688200 2.386358e-71
BsmtExposure 63.939761 7.557758e-50
SaleCondition 45.578428 7.988268e-44
MSZoning 43.840282 8.817634e-35
PavedDrive 42.024179 1.803569e-18
LotShape 40.132852 6.447524e-25
MSSubClass 33.732076 8.662166e-79
SaleType 28.863054 5.039767e-42
GarageQual 25.776093 5.388762e-25
GarageCond 25.750153 5.711746e-25
BsmtCond 19.708139 8.195794e-16
HouseStyle 19.595001 3.376777e-25
Exterior1st 18.611743 2.586089e-43
Electrical 18.460192 8.226925e-18
RoofStyle 17.805497 3.653523e-17
Exterior2nd 17.500840 4.842186e-43
Alley 15.176614 2.996380e-07
Fence 13.433276 9.379977e-11
BldgType 13.011077 2.056736e-10
LandContour 12.850188 2.742217e-08
PoolQC 10.509853 7.700989e-07
ExterCond 8.798714 5.106681e-07
LotConfig 7.809954 3.163167e-06
BsmtFinType2 7.565378 5.225649e-08
RoofMatl 6.727305 7.231445e-08
Condition1 6.118017 8.904549e-08
Heating 4.259819 7.534721e-04
Functional 4.057875 4.841697e-04
MiscFeature 2.593622 3.500367e-02
Street 2.459290 1.170486e-01
Condition2 2.073899 4.342566e-02
LandSlope 1.958817 1.413964e-01
MoSold 0.957865 4.833523e-01
YrSold 0.645525 6.300888e-01
Utilities 0.298804 5.847168e-01
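One plausible reason the two tables disagree on some features, offered as an assumption rather than something the post states outright: the rows that differ may be columns containing missing values, and the two approaches may treat those rows differently (the `drop_nan` option in the answer below points the same way). The sketch below uses a made-up toy frame and a hypothetical helper `f_by_group` to show how the F-statistic changes when missing categories are counted as their own group versus left out.

```python
import numpy as np
import pandas as pd
import scipy.stats

# Toy data (invented for illustration): two real levels plus missing values
toy = pd.DataFrame({
    'Fence': ['MnPrv', 'MnPrv', 'GdWo', 'GdWo', np.nan, np.nan],
    'SalePrice': [100.0, 120.0, 150.0, 170.0, 300.0, 320.0],
})

def f_by_group(keys, values):
    """One-way ANOVA of values split by keys; rows with a NaN key are skipped by groupby."""
    groups = [g.to_numpy() for _, g in values.groupby(keys)]
    return scipy.stats.f_oneway(*groups)

# Missing categories left out of the test
without_nan = f_by_group(toy['Fence'], toy['SalePrice'])

# Missing categories treated as an extra level of their own
with_nan = f_by_group(toy['Fence'].fillna('Missing'), toy['SalePrice'])

print(without_nan.statistic, with_nan.statistic)
```

Neither number is "more accurate" in itself; they answer different questions, so the choice depends on whether a missing category is meaningful for the feature.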
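Solution
The F-statistic and P-value are stored in the attributes statistic and pvalue of the result object (<class 'scipy.stats.stats.F_onewayResult'>).
You can extract those values and build a table from them. Here is a simple example.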
import scipy.stats
import pandas as pd
tillamook = [0.0571, 0.0813, 0.0831, 0.0976, 0.0817, 0.0859, 0.0735, 0.0659, 0.0923, 0.0836]
newport = [0.0873, 0.0662, 0.0672, 0.0819, 0.0749, 0.0649, 0.0835, 0.0725]
petersburg = [0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105]
magadan = [0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764, 0.0689]
tvarminne = [0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045]
fstat = scipy.stats.f_oneway(tillamook, newport, petersburg, magadan, tvarminne)
tbl = pd.DataFrame({'F-statistics': [fstat.statistic], 'P-value': [fstat.pvalue]})
tbl.index = ['OverallQual']
print(tbl)
# F-statistics P-value
# OverallQual 7.121019 0.000281
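As a small aside, the result object also behaves like a tuple, so the two values can be unpacked directly instead of being read through attributes. This is a minimal sketch with made-up groups; the variable names are illustrative only.

```python
import scipy.stats

# Three small, made-up groups
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]
group_c = [7.0, 8.0, 9.0]

result = scipy.stats.f_oneway(group_a, group_b, group_c)

# Tuple-style unpacking gives the same numbers as result.statistic and result.pvalue
f_value, p_value = result
print(f_value, p_value)
```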
If you need to run several F-tests, you can use a function and a for loop. Here is an example:
df = pd.DataFrame({'a': [0,0,0,1,1,1,2,2,2], 'b': [0,1,1,0,0,1,1,0,0], 'outcome': [1,2,3,4,5,6,7,8,9]})

def anova(data, x, y, drop_nan=True):
    # Unique values in the column
    if drop_nan:
        x_val = data[x].dropna().unique()
    else:
        x_val = data[x].unique()
    # F-test
    fstat = scipy.stats.f_oneway(*[data[y][data[x].isin([x_v])] for x_v in x_val])
    # Tabulate the results
    tbl = pd.DataFrame({'F-statistics': [fstat.statistic], 'P-value': [fstat.pvalue]})
    tbl.index = ['{:}~{:}'.format(y, x)]
    return tbl

f_table = pd.concat([anova(df, x, 'outcome') for x in ['a', 'b']])
print(f_table)
#            F-statistics   P-value
# outcome~a     27.000000  0.001000
# outcome~b      0.216495  0.655852
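The question also asks for the table to be ordered by F-statistic, largest first. A minimal sketch of that last step, assuming the same toy data as in the example above: it collects one row per feature in a plain for loop and then calls `sort_values` with `ascending=False`. The names `rows` and `table` are illustrative, and splitting the groups with `groupby` is a substitution for the `isin`-based split used in the function above (by default `groupby` skips rows whose key is NaN).

```python
import pandas as pd
import scipy.stats

df = pd.DataFrame({
    'a': [0, 0, 0, 1, 1, 1, 2, 2, 2],
    'b': [0, 1, 1, 0, 0, 1, 1, 0, 0],
    'outcome': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

rows = {}
for col in ['b', 'a']:
    # One array of outcome values per level of the feature
    groups = [g.to_numpy() for _, g in df.groupby(col)['outcome']]
    res = scipy.stats.f_oneway(*groups)
    rows[col] = {'F-statistics': res.statistic, 'P-value': res.pvalue}

# One row per feature, then sort so the largest F-statistic comes first
table = pd.DataFrame.from_dict(rows, orient='index')
table = table.sort_values('F-statistics', ascending=False)
print(table)
```

For a real dataset, the list `['b', 'a']` would be replaced by the categorical column names, for example the columns of the frame holding the features.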