python - 如何过滤多个列的值或通配符?
问题描述
我有一个带有模型信息(df_model,~700 行,~50 类型)的数据框和一个带有组件信息(df_components,~400.000 行)的数据框。
df_model:模型类型列表,每个模型类型都有几行带有特定组件:
type mainclass subclass param1 param2 param3 param4
0 E2 C/C C IT NaN GR NaN
1 E2 W W IT NaN NaN AC-8
2 E2 P 13.3 NaN NaN NaN NaN
3 P1 C/C C IT NaN TU NaN
4 P1 GR G NAI XSM NaN NaN
df_component:可用组件列表 + 每个组件的交付时间:
mainclass subclass param1 param2 param3 param4 component_name delivery_time
0 C/C C IT X GR NaN CM1 8
1 C/C C IT NaN GR RAD CM2 9
2 W W IT NaN NaN AC-8 CM3 3
3 P 13.3 NaN NaN NaN NaN CM4 13
4 P 15.3 NaN NaN NaN NaN CM5 20
5 B BR NaN KRL NaN NaN CM6 5
我想编写一个函数,该函数返回 - 对于特定模型类型 - model_components 中每个匹配组件的交付时间 + 最短交付时间。
例如:对于模型类型“E2”,“C/CC IT NaN GR NaN”有 2 个匹配组件。一个的交货时间是 8,其中一个是 9。8 < 9 -> 9 被删除,因为我们希望每个模型组件的交货时间最好。
问题:如果模型组件的列中有“NaN”,则它是通配符 = 相应组件列中的任何内容都应该匹配(如果有帮助,我可以将通配符的 np.nan 更改为其他任何内容)。
例如 df_model 的第一行:
type mainclass subclass param1 param2 param3 param4
0 E2 C/C C IT NaN GR NaN
列“param2”=“NaN”所以从 df_component 这 2 行匹配,因为列“param2”无关紧要:
mainclass subclass param1 param2 param3 param4 component_name delivery_time
0 C/C C IT X GR NaN CM1 8
1 C/C C IT NaN GR RAD CM2 9
因此,对于模型类型“E2”,最终结果应该是一个 dict/JSON,其中包含:
{
"leadtime_min": 13,
"CM1": 8,
"CM3": 3,
"CM4": 13,
}
到目前为止,我的函数方法:
- 选择给定的模型类型
- 遍历每个模型组件,搜索匹配的组件
- 获取每个组件的最短交付时间(如果有多个匹配的组件。如果多个匹配的组件具有相同的交付时间,则它们都需要添加到结果 JSON 中)
我的函数中的查询代码返回所有行,而不仅仅是匹配的行,我需要解决这个问题。
到目前为止我的代码:
df_model = pd.DataFrame(
{
"type": ["E2", "E2", "E2", "P1", "P1"],
"mainclass": ["C/C", "W", "P", "C/C", "GR"],
"subclass": ["C", "W", "13.3", "C", "G"],
"param1": ["IT", "IT", np.nan, "IT", "NAI"],
"param2": [np.nan, np.nan, np.nan, np.nan, "XSM"],
"param3": ["GR", np.nan, np.nan, "TU", np.nan],
"param4": [np.nan, "AC-8", np.nan, np.nan, np.nan],
}
)
df_components = pd.DataFrame(
{
"mainclass": [
"C/C",
"C/C",
"W",
"P",
"P",
"B",
],
"subclass": ["C", "C", "W", "13.3", "15.3", "BR"],
"param1": ["IT", "IT", "IT", np.nan, np.nan, np.nan],
"param2": ["X", np.nan, np.nan, np.nan, np.nan, "KRL"],
"param3": ["GR", "GR", np.nan, np.nan, np.nan, np.nan],
"param4": [np.nan, "RAD", "AC-8", np.nan, np.nan, np.nan],
"component_name": ["CM1", "CM2", "CM3", "CM4", "CM5", "CM6"],
"delivery_time": [8, 9, 3, 13, 20, 5],
}
)
def get_delivery_time_best(model_type: str, model: pd.DataFrame, component: pd.DataFrame)
model = model.query("type == @model_type")
fitting_components = {"delivery_time_min": sys.maxsize}
delivery_times_all = []
for model_component in df_model.itertuples():
model_mainclass = model_component.mainclass
model_subclass = model_component.subclass
model_param1 = model_component.param1
model_param2 = model_component.param2
model_param3 = model_component.param3
model_param4 = model_component.param4
checked_components = components.query(
"""
mainclass == @model_mainclass and \
subclass == @model_subclass \
and ( param1 == @model_param1 | @model_param1 != @model_param1 ) \
and ( param2 == @model_param2 | @model_param2 != @model_param2 ) \
and ( param3 == @model_param3 | @model_param3 != @model_param3 ) \
and ( param4 == @model_param4 | @model_param4 != @model_param4 ) \
"""
)
component_descriptions = checked_components"component_name" ].tolist()
delivery_times = checked_components["delivery_time"].tolist()
delivery_time_min = min(delivery_times )
delivery_times_all.append(delivery_time_min )
for component_description, delivery_time in zip(component_descriptions, delivery_times ):
if delivery_time == delivery_time_min :
fitting_components[component_description] = delivery_time
fitting_components["delivery_time_min "] = max(delivery_times_all )
print(fitting_components)
get_delivery_time_best(model_type="E2", model=df_model, components=df_components)
但这会返回 df_components 中的所有行,而不是 2 个匹配的行?
解决方案
Merge
和boolean indexing
merged = df_components.merge(df_model, on=['mainclass', 'subclass'], suffixes=['', '_r'])
l = merged.filter(regex=r'^param\d+$')
r = merged[l.columns + '_r']
m = ((l.values == r.values) | r.isna().values).all(1)
merged[m].drop(r.columns, 1)
解释
df_components
将数据框与df_model
列合并'mainclass', 'subclass'
并在右侧 df 添加后缀_r
以唯一标识右侧数据框中的列
>>> merged
mainclass subclass param1 param2 param3 param4 param1_r param2_r param3_r param4_r
0 1a 1b vendor1 spec1 id1 NaN vendor1 NaN id1 NaN
1 1a 1b vendor1 NaN id1 rout3 vendor1 NaN id1 NaN
2 1a 1b vendor1 NaN id2 NaN vendor1 NaN id1 NaN
3 1a 1b NaN NaN NaN NaN vendor1 NaN id1 NaN
4 1a 1b vendor2 NaN id1 NaN vendor1 NaN id1 NaN
5 1a 1b vendor2 NaN NaN NaN vendor1 NaN id1 NaN
从左右数据帧中过滤类似列的参数
>>> l
param1 param2 param3 param4
0 vendor1 spec1 id1 NaN
1 vendor1 NaN id1 rout3
2 vendor1 NaN id2 NaN
3 NaN NaN NaN NaN
4 vendor2 NaN id1 NaN
5 vendor2 NaN NaN NaN
>>> r
param1_r param2_r param3_r param4_r
0 vendor1 NaN id1 NaN
1 vendor1 NaN id1 NaN
2 vendor1 NaN id1 NaN
3 vendor1 NaN id1 NaN
4 vendor1 NaN id1 NaN
5 vendor1 NaN id1 NaN
通过将左侧数据帧中的类似参数列与右侧数据帧中的相应参数类似列进行比较来创建布尔掩码,类似地通过检查右侧数据帧的NaN
值来创建另一个布尔掩码
>>> l.values == r.values
array([[ True, False, True, False],
[ True, False, True, False],
[ True, False, False, False],
[False, False, False, False],
[False, False, True, False],
[False, False, False, False]])
>>> r.isna().values
array([[False, True, False, True],
[False, True, False, True],
[False, True, False, True],
[False, True, False, True],
[False, True, False, True],
[False, True, False, True]])
取上述掩码的逻辑或,并减少生成的all
掩码axis=1
>>> m
array([ True, True, False, False, False, False])
df_components
现在在掩码的帮助下过滤行m
>>> merged[m].drop(r.columns, 1)
mainclass subclass param1 param2 param3 param4
0 1a 1b vendor1 spec1 id1 NaN
1 1a 1b vendor1 NaN id1 rout3
推荐阅读
- spring-boot - 在 Spring Boot 项目中停用 Kafka 消息发送和接收以进行测试
- node.js - 如何连接部署在 Heroku 上的套接字 io
- python - 如何验证日期顺序 sqlite、pytelegrambotapi 中的所有字段
- c# - 使用 Photon 进行多人游戏,但我正在控制另一个角色
- flutter - 无法启动 Dart CLI 隔离 (null)。扑
- kotlin - Box 不会在 Compose Desktop 中捕获关键事件
- java - OpenTelemetry java - 行李传播
- c - 在C中返回两个字符串组合的函数
- json - 如果在 Kotlin(Android 工作室)中没有密钥,如何制作 JSON 数据的数据类?
- flutter - Flutter : blockprovider.of() 用上下文调用blockprovider.of() 用上下文调用