首页 > 解决方案 > 填充 DataFrame 中其他行中父级的空白

问题描述

假设我有这个显示速度限制的数据集。这个想法是每个地区或城市都可以应用自己的规则,或“继承”其父实体的规则。

+-------------+---------------------------+---------------------+-----------+
| country     | region                    | city                | max_speed |
+-------------+---------------------------+---------------------+-----------+
| France      |                           |                     | 50        |
+-------------+---------------------------+---------------------+-----------+
| France      | Bretagne                  |                     | 70        |
+-------------+---------------------------+---------------------+-----------+
| France      | Bretagne                  | Saint-Grégoire      |           |
+-------------+---------------------------+---------------------+-----------+
| France      | Bretagne                  | Saint-Malo          | 30        |
+-------------+---------------------------+---------------------+-----------+
| France      | Île-de-France             |                     |           |
+-------------+---------------------------+---------------------+-----------+
| France      | Île-de-France             | Saint-Cloud         |           |
+-------------+---------------------------+---------------------+-----------+
| France      | Île-de-France             | Vélizy-Villacoublay | 50        |
+-------------+---------------------------+---------------------+-----------+
| Germany     |                           |                     | 70        |
+-------------+---------------------------+---------------------+-----------+
| Germany     | Bayern                    |                     |           |
+-------------+---------------------------+---------------------+-----------+
| Germany     | Bayern                    | Nürnberg            |           |
+-------------+---------------------------+---------------------+-----------+
| Netherlands |                           |                     | 90        |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland      |                     |           |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie   Gelderland    | Harderwijk          |           |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland   |                     | 70        |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland   | Haarlem             |           |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland   | Hoorn               | 30        |
+-------------+---------------------------+---------------------+-----------+

每当max_speed缺少值时,都应将其推断为父值。例如,Saint-Grégoire的限速是布列塔尼的限速,而哈德韦克纽伦堡则适用国家规则(即分别为 90 和 70)。

因此,鉴于此DataFrame

data = {'country': ['France', 'France', 'France', 'France', 'France', 'France', 'France', 'Germany', 'Germany', 'Germany', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands'],
'region': [None, 'Bretagne', 'Bretagne', 'Bretagne', 'Île-de-France', 'Île-de-France', 'Île-de-France', None, 'Bayern', 'Bayern', None, 'Provincie Gelderland', 'Provincie Gelderland', 'Provincie Noord-Holland', 'Provincie Noord-Holland', 'Provincie Noord-Holland'],
'city': [None, None, 'Saint-Grégoire', 'Saint-Malo', None, 'Saint-Cloud', 'Vélizy-Villacoublay', None, None, 'Nürnberg', None, None, 'Harderwijk', None, 'Haarlem', 'Hoorn'],
'max_speed': [50, 70, None, 30, None, None, 50, 70, None, None, 90, None, None, 70, None, 30]}

speed_limits = pd.DataFrame(data)

如何填写缺失值max_speed以获得:

+-------------+-------------------------+---------------------+-----------+
| country     | region                  | city                | max_speed |
+-------------+-------------------------+---------------------+-----------+
| France      |                         |                     |        50 |
+-------------+-------------------------+---------------------+-----------+
| France      | Bretagne                |                     |        70 |
+-------------+-------------------------+---------------------+-----------+
| France      | Bretagne                | Saint-Grégoire      |        70 |
+-------------+-------------------------+---------------------+-----------+
| France      | Bretagne                | Saint-Malo          |        30 |
+-------------+-------------------------+---------------------+-----------+
| France      | Île-de-France           |                     |        50 |
+-------------+-------------------------+---------------------+-----------+
| France      | Île-de-France           | Saint-Cloud         |        50 |
+-------------+-------------------------+---------------------+-----------+
| France      | Île-de-France           | Vélizy-Villacoublay |        50 |
+-------------+-------------------------+---------------------+-----------+
| Germany     |                         |                     |        70 |
+-------------+-------------------------+---------------------+-----------+
| Germany     | Bayern                  |                     |        70 |
+-------------+-------------------------+---------------------+-----------+
| Germany     | Bayern                  | Nürnberg            |        70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands |                         |                     |        90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland    |                     |        90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland    | Harderwijk          |        90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland |                     |        70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Haarlem             |        70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Hoorn               |        30 |
+-------------+-------------------------+---------------------+-----------+

我一直在尝试创建一个函数以应用于每一行 where max_speed==np.NaN,检索其父级(在确定缺失值是否适用于一个地区或一个城市之后)并返回它的max_speed值,但是,除了不是很成功地做到这一点,我什至不确定这是最聪明的方法。

任何想法?

标签: pythonpandas

解决方案


利用杠杆ffill()完成工作。首先垂直传播国家和地区限速并设置仅限城市限速列。然后从左到右传播速度限制以获得继承的最大速度限制。

创建一个工作数据框:

wf = speed_limits.copy()

复制和传播国家速度限制:

wf['cntry_spd'] = pd.Series(np.where(wf['region'], np.nan, wf['max_speed'])).ffill()

复制区域速度限制并在区域内传播:

wf['reg_spd'] = np.where(~wf['region'].isna() & wf['city'].isna(), wf['max_speed'], np.nan)
wf['reg_spd'] = wf.groupby(['country','region'])['reg_spd'].ffill() 

创建仅限城市的限速列:

wf['city_spd'] = np.where(~wf['city'].isna(), wf['max_speed'], np.nan)

通过在, ,列中从左到右向前填充 NA 来设置max_speed列,继承尚未设置的速度限制:speed_limits DFcntry_spdreg_spdcity_spd

speed_limits['max_speed'] = wf[['cntry_spd','reg_spd','city_spd']].ffill(axis=1)['city_spd']

结果:

        country                   region                 city  max_speed
0        France                     None                 None       50.0
1        France                 Bretagne                 None       70.0
2        France                 Bretagne       Saint-Grégoire       70.0
3        France                 Bretagne           Saint-Malo       30.0
4        France            Île-de-France                 None       50.0
5        France            Île-de-France          Saint-Cloud       50.0
6        France            Île-de-France  Vélizy-Villacoublay       50.0
7       Germany                     None                 None       70.0
8       Germany                   Bayern                 None       70.0
9       Germany                   Bayern             Nürnberg       70.0
10  Netherlands                     None                 None       90.0
11  Netherlands     Provincie Gelderland                 None       90.0
12  Netherlands     Provincie Gelderland           Harderwijk       90.0
13  Netherlands  Provincie Noord-Holland                 None       70.0
14  Netherlands  Provincie Noord-Holland              Haarlem       70.0
15  Netherlands  Provincie Noord-Holland                Hoorn       30.0

推荐阅读