python - 填充 DataFrame 中其他行中父级的空白
问题描述
假设我有这个显示速度限制的数据集。这个想法是每个地区或城市都可以应用自己的规则,或“继承”其父实体的规则。
+-------------+---------------------------+---------------------+-----------+
| country | region | city | max_speed |
+-------------+---------------------------+---------------------+-----------+
| France | | | 50 |
+-------------+---------------------------+---------------------+-----------+
| France | Bretagne | | 70 |
+-------------+---------------------------+---------------------+-----------+
| France | Bretagne | Saint-Grégoire | |
+-------------+---------------------------+---------------------+-----------+
| France | Bretagne | Saint-Malo | 30 |
+-------------+---------------------------+---------------------+-----------+
| France | Île-de-France | | |
+-------------+---------------------------+---------------------+-----------+
| France | Île-de-France | Saint-Cloud | |
+-------------+---------------------------+---------------------+-----------+
| France | Île-de-France | Vélizy-Villacoublay | 50 |
+-------------+---------------------------+---------------------+-----------+
| Germany | | | 70 |
+-------------+---------------------------+---------------------+-----------+
| Germany | Bayern | | |
+-------------+---------------------------+---------------------+-----------+
| Germany | Bayern | Nürnberg | |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | | | 90 |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland | | |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland | Harderwijk | |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | | 70 |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Haarlem | |
+-------------+---------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Hoorn | 30 |
+-------------+---------------------------+---------------------+-----------+
每当max_speed
缺少值时,都应将其推断为父值。例如,Saint-Grégoire的限速是布列塔尼的限速,而哈德韦克和纽伦堡则适用国家规则(即分别为 90 和 70)。
因此,鉴于此DataFrame
:
data = {'country': ['France', 'France', 'France', 'France', 'France', 'France', 'France', 'Germany', 'Germany', 'Germany', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands', 'Netherlands'],
'region': [None, 'Bretagne', 'Bretagne', 'Bretagne', 'Île-de-France', 'Île-de-France', 'Île-de-France', None, 'Bayern', 'Bayern', None, 'Provincie Gelderland', 'Provincie Gelderland', 'Provincie Noord-Holland', 'Provincie Noord-Holland', 'Provincie Noord-Holland'],
'city': [None, None, 'Saint-Grégoire', 'Saint-Malo', None, 'Saint-Cloud', 'Vélizy-Villacoublay', None, None, 'Nürnberg', None, None, 'Harderwijk', None, 'Haarlem', 'Hoorn'],
'max_speed': [50, 70, None, 30, None, None, 50, 70, None, None, 90, None, None, 70, None, 30]}
speed_limits = pd.DataFrame(data)
如何填写缺失值max_speed
以获得:
+-------------+-------------------------+---------------------+-----------+
| country | region | city | max_speed |
+-------------+-------------------------+---------------------+-----------+
| France | | | 50 |
+-------------+-------------------------+---------------------+-----------+
| France | Bretagne | | 70 |
+-------------+-------------------------+---------------------+-----------+
| France | Bretagne | Saint-Grégoire | 70 |
+-------------+-------------------------+---------------------+-----------+
| France | Bretagne | Saint-Malo | 30 |
+-------------+-------------------------+---------------------+-----------+
| France | Île-de-France | | 50 |
+-------------+-------------------------+---------------------+-----------+
| France | Île-de-France | Saint-Cloud | 50 |
+-------------+-------------------------+---------------------+-----------+
| France | Île-de-France | Vélizy-Villacoublay | 50 |
+-------------+-------------------------+---------------------+-----------+
| Germany | | | 70 |
+-------------+-------------------------+---------------------+-----------+
| Germany | Bayern | | 70 |
+-------------+-------------------------+---------------------+-----------+
| Germany | Bayern | Nürnberg | 70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | | | 90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland | | 90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Gelderland | Harderwijk | 90 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | | 70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Haarlem | 70 |
+-------------+-------------------------+---------------------+-----------+
| Netherlands | Provincie Noord-Holland | Hoorn | 30 |
+-------------+-------------------------+---------------------+-----------+
我一直在尝试创建一个函数以应用于每一行 where max_speed==np.NaN
,检索其父级(在确定缺失值是否适用于一个地区或一个城市之后)并返回它的max_speed
值,但是,除了不是很成功地做到这一点,我什至不确定这是最聪明的方法。
任何想法?
解决方案
利用杠杆ffill()
完成工作。首先垂直传播国家和地区限速并设置仅限城市限速列。然后从左到右传播速度限制以获得继承的最大速度限制。
创建一个工作数据框:
wf = speed_limits.copy()
复制和传播国家速度限制:
wf['cntry_spd'] = pd.Series(np.where(wf['region'], np.nan, wf['max_speed'])).ffill()
复制区域速度限制并在区域内传播:
wf['reg_spd'] = np.where(~wf['region'].isna() & wf['city'].isna(), wf['max_speed'], np.nan)
wf['reg_spd'] = wf.groupby(['country','region'])['reg_spd'].ffill()
创建仅限城市的限速列:
wf['city_spd'] = np.where(~wf['city'].isna(), wf['max_speed'], np.nan)
通过在, ,列中从左到右向前填充 NA 来设置max_speed
列,继承尚未设置的速度限制:speed_limits DF
cntry_spd
reg_spd
city_spd
speed_limits['max_speed'] = wf[['cntry_spd','reg_spd','city_spd']].ffill(axis=1)['city_spd']
结果:
country region city max_speed
0 France None None 50.0
1 France Bretagne None 70.0
2 France Bretagne Saint-Grégoire 70.0
3 France Bretagne Saint-Malo 30.0
4 France Île-de-France None 50.0
5 France Île-de-France Saint-Cloud 50.0
6 France Île-de-France Vélizy-Villacoublay 50.0
7 Germany None None 70.0
8 Germany Bayern None 70.0
9 Germany Bayern Nürnberg 70.0
10 Netherlands None None 90.0
11 Netherlands Provincie Gelderland None 90.0
12 Netherlands Provincie Gelderland Harderwijk 90.0
13 Netherlands Provincie Noord-Holland None 70.0
14 Netherlands Provincie Noord-Holland Haarlem 70.0
15 Netherlands Provincie Noord-Holland Hoorn 30.0
推荐阅读
- openid-connect - 应用清单中包含 groupMemberClaims 的 Azure AD 失败
- multithreading - 如何确保在整个子请求中传递相同的令牌
- python - 对于冗余操作,Python 不是那么“聪明”
- android - 编译 'com.firebaseui:firebase-ui-auth:0.4.0' gradle 构建错误
- r - 如何在R中做一个三路频率表?
- spring - 获取请求中的 401 状态
- performance - 简单阵列处理循环的 AVX 512 与 AVX2 性能对比
- batch-file - 将文件从上一个目录移动到当前目录
- mariadb - 如何“重置”mariadb GLOBAL STATS?
- r - 将 RSI 函数应用于数据帧列表