首页 > 解决方案 > 如何过滤多个列的值或通配符?

问题描述

我有一个带有模型信息(df_model,~700 行,~50 类型)的数据框和一个带有组件信息(df_components,~400.000 行)的数据框。

df_model:模型类型列表,每个模型类型都有几行带有特定组件:

    type    mainclass   subclass    param1  param2  param3  param4
0   E2      C/C         C           IT      NaN     GR      NaN
1   E2      W           W           IT      NaN     NaN     AC-8
2   E2      P           13.3        NaN     NaN     NaN     NaN
3   P1      C/C         C           IT      NaN     TU      NaN
4   P1      GR          G           NAI     XSM     NaN     NaN

df_component:可用组件列表 + 每个组件的交付时间:

    mainclass   subclass    param1  param2  param3  param4  component_name  delivery_time
0   C/C         C           IT      X       GR      NaN     CM1             8
1   C/C         C           IT      NaN     GR      RAD     CM2             9
2   W           W           IT      NaN     NaN     AC-8    CM3             3
3   P           13.3        NaN     NaN     NaN     NaN     CM4             13
4   P           15.3        NaN     NaN     NaN     NaN     CM5             20
5   B           BR          NaN     KRL     NaN     NaN     CM6             5

我想编写一个函数,该函数返回 - 对于特定模型类型 - model_components 中每个匹配组件的交付时间 + 最短交付时间。

例如:对于模型类型“E2”,“C/CC IT NaN GR NaN”有 2 个匹配组件。一个的交货时间是 8,其中一个是 9。8 < 9 -> 9 被删除,因为我们希望每个模型组件的交货时间最好。

问题:如果模型组件的列中有“NaN”,则它是通配符 = 相应组件列中的任何内容都应该匹配(如果有帮助,我可以将通配符的 np.nan 更改为其他任何内容)。

例如 df_model 的第一行:

    type    mainclass   subclass    param1  param2  param3  param4
0   E2      C/C         C           IT      NaN     GR      NaN

列“param2”=“NaN”所以从 df_component 这 2 行匹配,因为列“param2”无关紧要:

    mainclass   subclass    param1  param2  param3  param4  component_name  delivery_time
0   C/C         C           IT      X       GR      NaN     CM1             8
1   C/C         C           IT      NaN     GR      RAD     CM2             9

因此,对于模型类型“E2”,最终结果应该是一个 dict/JSON,其中包含:

{
    "leadtime_min": 13,
    "CM1": 8,
    "CM3": 3,
    "CM4": 13,
}

到目前为止,我的函数方法:

  1. 选择给定的模型类型
  2. 遍历每个模型组件,搜索匹配的组件
  3. 获取每个组件的最短交付时间(如果有多个匹配的组件。如果多个匹配的组件具有相同的交付时间,则它们都需要添加到结果 JSON 中)

我的函数中的查询代码返回所有行,而不仅仅是匹配的行,我需要解决这个问题。

到目前为止我的代码:

df_model = pd.DataFrame(
    {
        "type": ["E2", "E2", "E2", "P1", "P1"],
        "mainclass": ["C/C", "W", "P", "C/C", "GR"],
        "subclass": ["C", "W", "13.3", "C", "G"],
        "param1": ["IT", "IT", np.nan, "IT", "NAI"],
        "param2": [np.nan, np.nan, np.nan, np.nan, "XSM"],
        "param3": ["GR", np.nan, np.nan, "TU", np.nan],
        "param4": [np.nan, "AC-8", np.nan, np.nan, np.nan],
    }
)

df_components = pd.DataFrame(
    {
        "mainclass": [
            "C/C",
            "C/C",
            "W",
            "P",
            "P",
            "B",
        ],
        "subclass": ["C", "C", "W", "13.3", "15.3", "BR"],
        "param1": ["IT", "IT", "IT", np.nan, np.nan, np.nan],
        "param2": ["X", np.nan, np.nan, np.nan, np.nan, "KRL"],
        "param3": ["GR", "GR", np.nan, np.nan, np.nan, np.nan],
        "param4": [np.nan, "RAD", "AC-8", np.nan, np.nan, np.nan],
        "component_name": ["CM1", "CM2", "CM3", "CM4", "CM5", "CM6"],
        "delivery_time": [8, 9, 3, 13, 20, 5],
    }
)



def get_delivery_time_best(model_type: str, model: pd.DataFrame, component: pd.DataFrame)

    model = model.query("type == @model_type")
    fitting_components = {"delivery_time_min": sys.maxsize}
    delivery_times_all = []

    for model_component in df_model.itertuples():
        model_mainclass = model_component.mainclass
        model_subclass = model_component.subclass
        model_param1 = model_component.param1
        model_param2 = model_component.param2
        model_param3 = model_component.param3
        model_param4 = model_component.param4

        checked_components = components.query(
            """
            mainclass == @model_mainclass and \
            subclass == @model_subclass \
            and ( param1 == @model_param1 | @model_param1 != @model_param1 ) \
            and ( param2 == @model_param2 | @model_param2 != @model_param2 ) \
            and ( param3 == @model_param3 | @model_param3 != @model_param3 ) \
            and ( param4 == @model_param4 | @model_param4 != @model_param4 ) \
            """
        )

        component_descriptions = checked_components"component_name"    ].tolist()
        delivery_times = checked_components["delivery_time"].tolist()

        delivery_time_min = min(delivery_times )
        delivery_times_all.append(delivery_time_min )

        for component_description, delivery_time in zip(component_descriptions, delivery_times ):
            if delivery_time == delivery_time_min :
                fitting_components[component_description] = delivery_time 

    fitting_components["delivery_time_min "] = max(delivery_times_all )
    print(fitting_components)

get_delivery_time_best(model_type="E2", model=df_model, components=df_components)

但这会返回 df_components 中的所有行,而不是 2 个匹配的行?

标签: pythonpandasdataframe

解决方案


Mergeboolean indexing

merged = df_components.merge(df_model, on=['mainclass', 'subclass'], suffixes=['', '_r'])

l = merged.filter(regex=r'^param\d+$')
r = merged[l.columns + '_r']

m = ((l.values == r.values) | r.isna().values).all(1)
merged[m].drop(r.columns, 1)

解释

df_components将数据框与df_model列合并'mainclass', 'subclass'并在右侧 df 添加后缀_r以唯一标识右侧数据框中的列

>>> merged

  mainclass subclass   param1 param2 param3 param4 param1_r  param2_r param3_r  param4_r
0        1a       1b  vendor1  spec1    id1    NaN  vendor1       NaN      id1       NaN
1        1a       1b  vendor1    NaN    id1  rout3  vendor1       NaN      id1       NaN
2        1a       1b  vendor1    NaN    id2    NaN  vendor1       NaN      id1       NaN
3        1a       1b      NaN    NaN    NaN    NaN  vendor1       NaN      id1       NaN
4        1a       1b  vendor2    NaN    id1    NaN  vendor1       NaN      id1       NaN
5        1a       1b  vendor2    NaN    NaN    NaN  vendor1       NaN      id1       NaN

从左右数据帧中过滤类似列的参数

>>> l

    param1 param2 param3 param4
0  vendor1  spec1    id1    NaN
1  vendor1    NaN    id1  rout3
2  vendor1    NaN    id2    NaN
3      NaN    NaN    NaN    NaN
4  vendor2    NaN    id1    NaN
5  vendor2    NaN    NaN    NaN


>>> r

  param1_r  param2_r param3_r  param4_r
0  vendor1       NaN      id1       NaN
1  vendor1       NaN      id1       NaN
2  vendor1       NaN      id1       NaN
3  vendor1       NaN      id1       NaN
4  vendor1       NaN      id1       NaN
5  vendor1       NaN      id1       NaN

通过将左侧数据帧中的类似参数列与右侧数据帧中的相应参数类似列进行比较来创建布尔掩码,类似地通过检查右侧数据帧的NaN值来创建另一个布尔掩码

>>> l.values == r.values

array([[ True, False,  True, False],
       [ True, False,  True, False],
       [ True, False, False, False],
       [False, False, False, False],
       [False, False,  True, False],
       [False, False, False, False]])

>>> r.isna().values

array([[False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True]])

取上述掩码的逻辑或,并减少生成的all掩码axis=1

>>> m

array([ True,  True, False, False, False, False])

df_components现在在掩码的帮助下过滤行m

>>> merged[m].drop(r.columns, 1)

  mainclass subclass   param1 param2 param3 param4
0        1a       1b  vendor1  spec1    id1    NaN
1        1a       1b  vendor1    NaN    id1  rout3

推荐阅读