python - 如何用条件列均值填充数据框的空/nan 单元格
问题描述
我正在尝试使用该特定列的平均值填充(熊猫)数据框的空值/空值。
数据如下所示:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019
.
.
我正在尝试用行业 == 'Construction' 的收入列的平均值填充该空单元格。
为了得到我们的数值平均值,我做了:
df.groupby(['Industry'], as_index = False).mean()
我正在尝试做这样的事情来就地填充那个空单元格:
(df[df['Industry'] == "Construction"]['Revenue']).fillna("$21212121.01", inplace = True)
..但它不工作。谁能告诉我如何实现它!非常感谢。
预期输出:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013 $21212121.01
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019 $21212121.01
.
.
解决方案
尽管用作平均值的数字不同,但我们提供了两种类型的平均值:正常平均值和根据包含 NaN 的案例数计算的平均值。
df['Revenue'] = df['Revenue'].replace({'\$':'', ',':''}, regex=True)
df['Revenue'] = df['Revenue'].astype(float)
df_mean = df.groupby(['Industry'], as_index = False)['Revenue'].mean()
df_mean
Industry Revenue
0 Construction 4.358071e+06
1 Financial Services 8.858420e+06
2 IT Services 1.175702e+07
df_mean_nan = df.groupby(['Industry'], as_index = False)['Revenue'].agg({'Sum':np.sum, 'Size':np.size})
df_mean_nan['Mean_nan'] = df_mean_nan['Sum'] / df_mean_nan['Size']
df_mean_nan
Industry Sum Size Mean_nan
0 Construction 13074212.0 5.0 2614842.4
1 Financial Services 17716840.0 2.0 8858420.0
2 IT Services 11757018.0 1.0 11757018.0
考虑到 NaN 数量的平均值
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean_nan.loc[df_mean_nan['Industry'] == 'Construction',['Mean_nan']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5387469.0
1 2 Rednimdox Construction 2013 2614842.4
2 3 Lamtone IT Services 2009 11757018.0
3 4 Stripfind Financial Services 2010 12329371.0
4 5 Openjocon Construction 2013 4273207.0
5 6 Villadox Construction 2012 1097353.0
6 7 Sumzoomit Construction 2010 7703652.0
7 8 Abcddd Construction 2019 2614842.4
正常平均值:(不包括 NaN)
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean.loc[df_mean['Industry'] == 'Construction',['Revenue']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5.387469e+06
1 2 Rednimdox Construction 2013 4.358071e+06
2 3 Lamtone IT Services 2009 1.175702e+07
3 4 Stripfind Financial Services 2010 1.232937e+07
4 5 Openjocon Construction 2013 4.273207e+06
5 6 Villadox Construction 2012 1.097353e+06
6 7 Sumzoomit Construction 2010 7.703652e+06
7 8 Abcddd Construction 2019 4.358071e+06
推荐阅读
- javascript - position:sticky 在 IE 11 上不能与 display:flex 一起使用
- android - 向下滚动时在响应菜单中移动的触摸/单击区域
- jenkins - Spring Boot - 詹金斯 CICD
- php - 我想按类别在两个不同的列中显示记录
- javascript - 我不明白 if{} 语句
- optaplanner - VRP中的无限车辆
- dart - Flutter root Column 小部件扩展到整个屏幕,为什么?
- chef-habitat - Rails 应用程序的栖息地中未指定来源
- outlook - 如何使用 SAS 检索全局地址列表
- python - PySide2 拖放功能不起作用