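python - Querying a DataFrame in Python/Pandas when columns are optional or missing
Problem description
I am writing a Python/Pandas script that compares the contents of two data frames.
Both data frames contain an arbitrary combination of columns taken from a fixed list, for example: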
"Case Name", "MAC", "Machine Name", "OS", "Exec Time", "RSS"
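Certain combinations of columns serve as a unique key, but some of those columns may sometimes be missing. To avoid extra complexity, both data frames always contain (and lack) the same columns.
So I want to retrieve a row from one data frame given a key obtained from the other one (I am sure the key matches a single row in each data frame, so that is not a concern here either).
For example, in this case the pair "Case Name" + "MAC" is my key:
- DataFrame 1: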
"Case Name" | "MAC" |"Machine Name" | "OS" | "Exec Time" | "RSS"
------------+-------------------+---------------+---------+-------------+------
Case1 | FB:E8:99:88:AC:DE | Linux1 | Linux | 60 | 1000
- DataFrame 2:
"Case Name" | "MAC" |"Machine Name" | "OS" | "Exec Time" | "RSS"
------------+-------------------+---------------+---------+-------------+------
Case1 | FB:E8:99:88:AC:DE | Windows1 | Windows | 80 | 500
From these data frames I want to generate another one like this:
"Case Name" | "MAC" | "Machine Name 1" | "Machine Name 2" | "OS 1" | "OS 2" | "Exec Time 1" | "Exec Time 2" | "RSS 1" | "RSS 2"
------------+-------------------+------------------+------------------+-----------+-----------+---------------+---------------+---------+--------
Case1 | FB:E8:99:88:AC:DE | Linux1 | Windows1 | Linux | Windows | 60 | 80 | 1000 | 500
However, in some cases some of these "key" columns may be missing, and the data frames then look like this:
- DataFrame 1:
"Case Name" | "Machine Name" | "OS" | "Exec Time" | "RSS"
------------+----------------+---------+-------------+------
Case1 | Linux1 | Linux | 60 | 1000
- DataFrame 2:
"Case Name" | "Machine Name" | "OS" | "Exec Time" | "RSS"
------------+----------------+---------+-------------+------
Case1 | Windows1 | Windows | 80 | 500
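As you can see, the "MAC" column is missing. In that case I am sure (again, this is not a concern) that "Case Name" alone is a good enough unique key.
So, to build the combined data frame, I tried something like this: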
for index1, data1 in dataFrame1.iterrows():
    caseName = data1['Case Name']
    try:
        macAddr = data1['MAC']
    except KeyError:
        macAddr = None
    # Let's see if pd.isnull() works fine when no MAC column exists
    if pd.isnull(macAddr):
        print("No MAC column data detected")
    else:
        print("MAC column data detected")
    # The rest of the data from dataFrame1
    machineName1 = data1['Machine Name']
    os1 = data1['OS']
    # etc., etc.
    # Then try to locate the equivalent data in the other data frame:
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName) & (pd.isnull(macAddr) | (dataFrame2['MAC'] == macAddr)), ['Machine Name', 'OS', 'Exec Time', 'RSS']]
    machineName2 = data2['Machine Name']
    os2 = data2['OS']
    # etc., etc.
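Coming from a C background (and being a Python beginner), I expected evaluation of the expression to stop as soon as a True condition was reached, in this case pd.isnull(macAddr), so that the part that is certain to raise an error, (dataFrame2['MAC'] == macAddr), would never run when the column is missing. According to this, I would expect that behavior; however, it does not seem to happen in my case. When I run the script, it returns: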
caseName = testCase
No MAC column data detected -> So pd.isnull() works fine!!!
Traceback (most recent call last):
File "~/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'MAC'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "~/compare_dataframes.py", line 135, in <module>
main()
File "~/compare_dataframes.py", line 79, in main
data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName) & (pd.isnull(macAddr) | (dataFrame2['MAC'] == macAddr)), ['Machine Name', 'OS', 'Exec Time', 'RSS']]
File "~/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3458, in __getitem__
indexer = self.columns.get_loc(key)
File "~/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 'MAC'
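Now, I could change it into a series of nested if conditions:
if pd.isnull(macAddr):
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName), ['Machine Name', 'OS', 'Exec Time', 'RSS']]
else:
    data2 = dataFrame2.loc[(dataFrame2['Case Name'] == caseName) & (dataFrame2['MAC'] == macAddr), ['Machine Name', 'OS', 'Exec Time', 'RSS']]
But that is impractical, because with n optional key columns it grows into 2^n if branches. And what happens if I add a new column in the future?
So my question is: what is wrong with this condition? I added as many parentheses as I could, to no effect.
I am using Python 3.8 and Pandas 1.3.4.
Thank you very much for your help.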
Solution
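On why the condition in the question fails: unlike C's `||` or Python's `or`, the `|` operator does not short-circuit. Python evaluates both operands before applying it, so dataFrame2['MAC'] is looked up (and raises KeyError) even when pd.isnull(macAddr) is already True. The following is a minimal sketch of that behavior; the one-row frame `df2` and the name `mac_addr` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame without the "MAC" column, mirroring the question.
df2 = pd.DataFrame({"Case Name": ["Case1"], "OS": ["Windows"]})
mac_addr = None

# `or` short-circuits: the left side is True, so df2["MAC"] is never evaluated.
short_circuit_ok = pd.isnull(mac_addr) or df2["MAC"] == mac_addr

# `|` evaluates both operands first, so the missing column raises KeyError.
try:
    pd.isnull(mac_addr) | (df2["MAC"] == mac_addr)
    eager_failed = False
except KeyError:
    eager_failed = True

print(short_circuit_ok, eager_failed)
```

Note that `or` only works here because pd.isnull(mac_addr) is a single scalar. It cannot replace `|` or `&` between two Series, since pandas refuses to reduce a Series to one truth value.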
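Try building the first merged data frame this way, then combine it with the second.
Example:
Merged_df1 = df1.merge(df2, how="outer", on=["Case Name"])
Merged_df2 = df1.merge(df2, how="outer", on=["MAC"])
Append the two data frames:
# DataFrame.append works on Pandas 1.3.4; it was removed in Pandas 2.0,
# where pd.concat([Merged_df1, Merged_df2]) is the equivalent.
appended_df = Merged_df1.append(Merged_df2)
Then drop the duplicates:
appended_df = appended_df.drop_duplicates(subset=["Case Name", "MAC", "Machine Name", "OS", "Exec Time", "RSS"])
Note: in subset, list the column names as they actually appear in appended_df. After the merges, non-key columns shared by both frames carry the _x/_y suffixes that Pandas adds by default.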
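Because the key columns themselves may be absent, one possible variation is to work out which candidate key columns both frames actually contain and merge on exactly those. This avoids both the KeyError and the 2^n if branches. The sketch below is only an illustration under assumptions: the helper name `combine`, the `KEY_CANDIDATES` list, and the " 1"/" 2" suffixes are chosen here to match the output layout shown in the question, and are not part of the original answer.

```python
import pandas as pd

# Candidate key columns (assumed from the question); any of them may be absent.
KEY_CANDIDATES = ["Case Name", "MAC"]

def combine(df1, df2, key_candidates=KEY_CANDIDATES):
    """Merge two frames on whichever candidate key columns both contain."""
    keys = [c for c in key_candidates if c in df1.columns and c in df2.columns]
    if not keys:
        raise ValueError("no key column is present in both frames")
    # Non-key columns shared by both frames get the " 1" / " 2" suffixes.
    return df1.merge(df2, how="outer", on=keys, suffixes=(" 1", " 2"))

# Sample data from the question, for the case where "MAC" is missing.
df1 = pd.DataFrame({"Case Name": ["Case1"], "Machine Name": ["Linux1"],
                    "OS": ["Linux"], "Exec Time": [60], "RSS": [1000]})
df2 = pd.DataFrame({"Case Name": ["Case1"], "Machine Name": ["Windows1"],
                    "OS": ["Windows"], "Exec Time": [80], "RSS": [500]})

combined = combine(df1, df2)
print(combined.columns.tolist())
```

Adding a new optional key column later would then only mean extending `KEY_CANDIDATES`, with no new branches.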