Querying a data frame in Python/Pandas when columns are optional or missing

Problem description

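I am developing a Python/Pandas script that compares the contents of two data frames.

Both data frames contain an arbitrary combination of columns from a fixed list, for example: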

"Case Name", "MAC", "Machine Name", "OS", "Exec Time", "RSS"

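Some combinations of these columns are used as a unique key, but some of the key columns may sometimes be missing. To avoid extra complexity, both data frames always contain (and lack) the same columns.

So I want to retrieve a row from one data frame, given a key that I obtained from the other data frame. I am sure the key matches exactly one row in each data frame, so that is not a problem here.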

For example, in this case the pair "Case Name" + "MAC" is my key:

Data frame 1:

"Case Name" | "MAC"             |"Machine Name" | "OS"    | "Exec Time" | "RSS"
------------+-------------------+---------------+---------+-------------+------
      Case1 | FB:E8:99:88:AC:DE |        Linux1 |   Linux |          60 |  1000

Data frame 2:

"Case Name" | "MAC"             |"Machine Name" | "OS"    | "Exec Time" | "RSS"
------------+-------------------+---------------+---------+-------------+------
      Case1 | FB:E8:99:88:AC:DE |      Windows1 | Windows |          80 |   500

Based on these data frames, I want to produce another one like this:

"Case Name" | "MAC"             | "Machine Name 1" | "Machine Name 2" | "OS 1"    | "OS 2"    | "Exec Time 1" | "Exec Time 2" | "RSS 1" | "RSS 2" 
------------+-------------------+------------------+------------------+-----------+-----------+---------------+---------------+---------+--------
      Case1 | FB:E8:99:88:AC:DE |           Linux1 |         Windows1 |     Linux |   Windows |            60 |            80 |    1000 |     500 

However, in some cases some of these "key" columns may be missing, and the data frames then look like this:

Data frame 1:

"Case Name" | "Machine Name" | "OS"    | "Exec Time" | "RSS"
------------+----------------+---------+-------------+------
      Case1 |         Linux1 |   Linux |          60 |  1000 

Data frame 2:

"Case Name" | "Machine Name" | "OS"    | "Exec Time" | "RSS"
------------+----------------+---------+-------------+------
      Case1 |       Windows1 | Windows |          80 |   500 

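As you can see, the "MAC" column is missing. In this case I am sure (so this is not a problem either) that "Case Name" alone is a good enough unique key.

So, to build the combined data frame, I tried something like this: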

for index1, data1 in dataFrame1.iterrows():
    caseName       = data1['Case Name']
    try:
        macAddr        = data1['MAC']
    except:
        macAddr        = None
        
    # Let's see if pd.isnull() works fine when no MAC column exists
    if pd.isnull(macAddr):
        print("No MAC column data detected")
    else:
        print("MAC column data detected")
        
    # The rest of the data from the dataFrame1
    machineName1  = data1['Machine Name']
    os1           = data1['OS']
    # etc., etc.

    #then try to locate the equivalent data in the other data frame:
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName) & (pd.isnull(macAddr) | (dataFrame2['MAC'] == macAddr)), ['Machine Name', 'OS', 'Exec Time', 'RSS']]
    
    machineName2  = data2['Machine Name']
    os2           = data2['OS']
    # etc., etc.

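Coming from a C background (and being a Python beginner), I expected evaluation of the expression to stop as soon as one condition is True, here pd.isnull(macAddr), so that the part that is certain to raise an error, (dataFrame2['MAC'] == macAddr), is never executed when the column is missing. From what I have read, that is what should happen, but it does not seem to in my case. When I run the script, it returns: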

caseName  = testCase
No MAC column data detected -> So pd.isnull() works fine!!!
Traceback (most recent call last):
  File "~/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'MAC'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "~/compare_dataframes.py", line 135, in <module>
    main()
  File "~/compare_dataframes.py", line 79, in main
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName) & (pd.isnull(macAddr) | (dataFrame2['MAC'] == macAddr)), ['Machine Name', 'OS', 'Exec Time', 'RSS']]
  File "~/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3458, in __getitem__
    indexer = self.columns.get_loc(key)
  File "~/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'MAC'

Now, I could change this into a series of nested if conditions:

if pd.isnull(macAddr):
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName), ['Machine Name', 'OS', 'Exec Time','RSS']]
else:
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName) & (dataFrame2['MAC'] == macAddr), ['Machine Name', 'OS', 'Exec Time','RSS']]

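But this is impractical, because it grows into 2^n if branches. And what happens if I add a new column in the future?

So, my question is: what is wrong with this condition? I added as many parentheses as I could, but it made no difference.

I am using Python 3.8 and Pandas 1.3.4.

Thank you very much for your help.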

Tags: python, pandas, dataframe

Solution
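
On the question of what is wrong with the condition: | is the element-wise (bitwise) operator, not the logical or. Python evaluates both operands of | before applying it, so dataFrame2['MAC'] is looked up even when pd.isnull(macAddr) is True, and the lookup raises KeyError. One way to avoid both the error and the 2^n if branches is to build the mask only from the key columns that are actually present. The sketch below is a minimal illustration under that assumption; KEY_COLUMNS and find_matching_rows are hypothetical names, not part of the original script.

```python
from functools import reduce
import operator

import pandas as pd

# Hypothetical list of candidate key columns; adjust it to your data.
KEY_COLUMNS = ["Case Name", "MAC"]


def find_matching_rows(row, other, key_columns=KEY_COLUMNS):
    """Return the rows of `other` matching `row` on every key column present in both."""
    present = [col for col in key_columns
               if col in other.columns and col in row.index]
    if not present:
        raise ValueError("none of the key columns are present")
    # One boolean Series per present key column, combined with element-wise AND.
    conditions = [other[col] == row[col] for col in present]
    mask = reduce(operator.and_, conditions)
    return other.loc[mask]


# Sample data taken from the question, without the "MAC" column.
df1 = pd.DataFrame({"Case Name": ["Case1"], "Machine Name": ["Linux1"],
                    "OS": ["Linux"], "Exec Time": [60], "RSS": [1000]})
df2 = pd.DataFrame({"Case Name": ["Case1"], "Machine Name": ["Windows1"],
                    "OS": ["Windows"], "Exec Time": [80], "RSS": [500]})

for _, row in df1.iterrows():
    match = find_matching_rows(row, df2)
    print(match["Machine Name"].iloc[0])
```

Adding a new key column in the future then only means extending KEY_COLUMNS; the lookup itself does not change.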


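Try it this way: take the first data frame and merge it with the second one. Example:

Merged_df1 = df1.merge(df2, how="outer", on=["Case Name"])
Merged_df2 = df1.merge(df2, how="outer", on=["MAC"])

Append these two data frames:

# DataFrame.append was removed in pandas 2.0; pd.concat works in all versions
appended_df = pd.concat([Merged_df1, Merged_df2])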

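Then remove the duplicates:

appended_df = appended_df.drop_duplicates(subset=["Case Name", "MAC", "Machine Name", "OS", "Exec Time", "RSS"])

Note: subset must list the column names that actually exist in appended_df. Columns present in both frames but not used as merge keys get the _x and _y suffixes by default.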
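
As a variation on the merge idea above, the key can be computed from whichever key columns exist, so a single merge covers both the "Case Name" + "MAC" case and the "Case Name" only case. This is a minimal sketch, assuming (as the question states) that both data frames contain the same columns; KEY_COLUMNS and combine are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical list of candidate key columns; adjust it to your data.
KEY_COLUMNS = ["Case Name", "MAC"]


def combine(df1, df2, key_columns=KEY_COLUMNS):
    """Merge df1 and df2 on the key columns present in both frames."""
    keys = [c for c in key_columns if c in df1.columns and c in df2.columns]
    if not keys:
        raise ValueError("none of the key columns are present")
    # Non-key columns shared by both frames get the " 1" / " 2" suffixes.
    merged = df1.merge(df2, how="outer", on=keys, suffixes=(" 1", " 2"))
    # Interleave the columns: "Machine Name 1", "Machine Name 2", "OS 1", ...
    value_columns = [c for c in df1.columns if c not in keys]
    ordered = keys + [f"{c}{s}" for c in value_columns for s in (" 1", " 2")]
    return merged[ordered]


# Sample data taken from the question.
df1 = pd.DataFrame({"Case Name": ["Case1"], "MAC": ["FB:E8:99:88:AC:DE"],
                    "Machine Name": ["Linux1"], "OS": ["Linux"],
                    "Exec Time": [60], "RSS": [1000]})
df2 = pd.DataFrame({"Case Name": ["Case1"], "MAC": ["FB:E8:99:88:AC:DE"],
                    "Machine Name": ["Windows1"], "OS": ["Windows"],
                    "Exec Time": [80], "RSS": [500]})

result = combine(df1, df2)
print(result)

# The same call works when the "MAC" column is missing from both frames.
result_no_mac = combine(df1.drop(columns="MAC"), df2.drop(columns="MAC"))
print(result_no_mac)
```

Because the merge matches rows on the key directly, no row-by-row loop and no duplicate removal are needed in this variant.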

