Pandas: *efficiently* copy valid values to other rows by group

Problem description

“Premature optimization is the root of all evil (but it's nice once you have an ugly solution)” D. Knuth

Given this dataset:

from io import StringIO
import pandas as pd

csv = StringIO("""country,year,surface,ground,tot_water,enviro,depend
Yemen,2012,2,1.5,2.1,0.55,0
Yemen,2013,,,,,
Yemen,2014,2,1.5,2.1,,0
Yemen,2015,,,,,
Yemen,2016,,,,,
Yemen,2017,,,,0.55,
Zambia,1995,,,,,
Zambia,1996,,,,,
Zambia,1997,104.8,47,104.8,31.48,23.47""")

df = pd.read_csv(csv)
df

Out[0]:
  country  year  surface  ground  tot_water  enviro  depend
0   Yemen  2012      2.0     1.5        2.1    0.55    0.00
1   Yemen  2013      NaN     NaN        NaN     NaN     NaN
2   Yemen  2014      2.0     1.5        2.1     NaN    0.00
3   Yemen  2015      NaN     NaN        NaN     NaN     NaN
4   Yemen  2016      NaN     NaN        NaN     NaN     NaN
5   Yemen  2017      NaN     NaN        NaN    0.55     NaN
6  Zambia  1995      NaN     NaN        NaN     NaN     NaN
7  Zambia  1996      NaN     NaN        NaN     NaN     NaN
8  Zambia  1997    104.8    47.0      104.8   31.48   23.47

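I want to take the valid values in the columns ['surface', 'ground', 'tot_water', 'enviro'] and copy them to every row of the same country. I have a working solution, but it could use some optimization.

Pseudocode:

  1. Loop over each country and filter the DataFrame down to that subset
  2. Find the index of the first valid value in each column
  3. Set every value of that column within the subset equal to that valid value

My solution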

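vars_ = ['surface', 'ground', 'tot_water', 'enviro']
# for each country
for country in df.country.unique():
    # boolean mask selecting this country's rows
    filter_ = df.country == country
    # and each column to fill
    for var in vars_:
        # index label of the first non-NaN value in this country's column
        valid_ix = df.loc[filter_, var].first_valid_index()
        # broadcast that value to all of this country's rows
        df.loc[filter_, var] = df.loc[valid_ix, var]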

df

Out[]:
  country  year  surface  ground  tot_water  enviro  depend
0   Yemen  2012      2.0     1.5        2.1    0.55    0.00
1   Yemen  2013      2.0     1.5        2.1    0.55     NaN
2   Yemen  2014      2.0     1.5        2.1    0.55    0.00
3   Yemen  2015      2.0     1.5        2.1    0.55     NaN
4   Yemen  2016      2.0     1.5        2.1    0.55     NaN
5   Yemen  2017      2.0     1.5        2.1    0.55     NaN
6  Zambia  1995    104.8    47.0      104.8   31.48     NaN
7  Zambia  1996    104.8    47.0      104.8   31.48     NaN
8  Zambia  1997    104.8    47.0      104.8   31.48   23.47

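There must be a more efficient way. On a moderately sized dataset this takes quite a long time, and the for loops are ugly. Any suggestions or help would be greatly appreciated!

Tags: python, pandas, dataframe, nan

Solution

You can group the data by country and use bfill and ffill: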

df.groupby('country').bfill().ffill()

  country  year  surface  ground  tot_water  enviro  depend
0   Yemen  2012      2.0     1.5        2.1    0.55    0.00
1   Yemen  2013      2.0     1.5        2.1    0.55    0.00
2   Yemen  2014      2.0     1.5        2.1    0.55    0.00
3   Yemen  2015      2.0     1.5        2.1    0.55    0.00
4   Yemen  2016      2.0     1.5        2.1    0.55    0.00
5   Yemen  2017      2.0     1.5        2.1    0.55    0.00
6  Zambia  1995    104.8    47.0      104.8   31.48   23.47
7  Zambia  1996    104.8    47.0      104.8   31.48   23.47
8  Zambia  1997    104.8    47.0      104.8   31.48   23.47
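
Note that the one-liner above also fills depend, which the question left untouched, and that its trailing ffill() is not grouped, so a country with no valid value at all in a column could pick up the previous country's value. Depending on the pandas version, the country column may also be missing from the result of groupby().bfill(). The following is a minimal sketch of two variants that stay within each group and only touch the requested columns; the names filled and filled_bf are made up for this illustration. Using transform('first') here is a swapped-in technique, not the answer's method: it broadcasts each group's first non-null value, which matches the question's pseudocode.

```python
from io import StringIO

import pandas as pd

csv = StringIO("""country,year,surface,ground,tot_water,enviro,depend
Yemen,2012,2,1.5,2.1,0.55,0
Yemen,2013,,,,,
Yemen,2014,2,1.5,2.1,,0
Yemen,2015,,,,,
Yemen,2016,,,,,
Yemen,2017,,,,0.55,
Zambia,1995,,,,,
Zambia,1996,,,,,
Zambia,1997,104.8,47,104.8,31.48,23.47""")
df = pd.read_csv(csv)

vars_ = ['surface', 'ground', 'tot_water', 'enviro']

# Variant 1 (assumption: "first valid value per country" is what is wanted):
# GroupBy 'first' skips NaN, and transform broadcasts it back to every row.
filled = df.copy()
filled[vars_] = df.groupby('country')[vars_].transform('first')

# Variant 2: keep the bfill/ffill idea, but run both fills inside each group
# so values can never leak from one country into the next.
filled_bf = df.copy()
filled_bf[vars_] = df.groupby('country')[vars_].transform(
    lambda s: s.bfill().ffill()
)

print(filled)
```

On this dataset both variants give the same result, because every valid value within a country and column is identical. They would differ if a column held several distinct valid values: variant 1 always uses the first one, while variant 2 fills each gap from the next valid value below it and falls back to the last valid value only for trailing gaps.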
