How to search efficiently between two data frames in Python pandas?

Problem description

I have two data frames (in pandas).

df1:

logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10

df2:

id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B

I want a new data frame like new_df:

logged_at, item, value, id
2021-01-03 20:01:23, A, 4, 2
2021-01-03 20:01:24, A, 5, 2
2021-01-03 20:01:25, B, 4, 3
2021-01-03 20:01:26, B, 7, 3
2021-01-03 20:01:27, A, 10, 2

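What I want is to append the id from df2 to df1 as a new column.

The condition is that the logged_at time in df1 falls between start_time and end_time in df2.

df1 has more than 900,000 rows and df2 has more than 100,000.

Appending the id to df1 one row at a time takes too long.

Is there an efficient way to do this?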

Tags: python, python-3.x, pandas, dataframe, dask

Solution


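A simple merge on item gives the result you want for the sample data. Note that it matches on item only and does not check the time range.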

import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)

df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)

new_df = df1.merge(df2.loc[:,["id","item"]], on="item")

Output

           logged_at item  value  id
 2021-01-03 20:01:23    A      4   2
 2021-01-03 20:01:24    A      5   2
 2021-01-03 20:01:27    A     10   2
 2021-01-03 20:01:25    B      4   3
 2021-01-03 20:01:26    B      7   3
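
The merge above can be extended to enforce the time condition from the question. The following is a minimal sketch, not the answer author's code: it merges on item, then keeps only the rows where logged_at lies between start_time and end_time (inclusive). The names merged and in_range are illustrative. It assumes each item has only a few intervals in df2; with many intervals per item the intermediate frame can grow very large.

```python
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)
df1["logged_at"] = pd.to_datetime(df1["logged_at"])

df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)
df2["start_time"] = pd.to_datetime(df2["start_time"])
df2["end_time"] = pd.to_datetime(df2["end_time"])

# Pair every log row with every interval of the same item.
merged = df1.merge(df2, on="item", how="left")

# Keep only the pairs whose timestamp falls inside the interval (bounds inclusive).
in_range = merged["logged_at"].between(merged["start_time"], merged["end_time"])
new_df = merged.loc[in_range, ["logged_at", "item", "value", "id"]].reset_index(drop=True)

print(new_df)
```

Rows of df1 that fall outside every interval are dropped by the filter; if they should be kept with a missing id, the filtered result would need to be merged back onto df1.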

pandasql

This does what you specified. However, your sample data for df2 looks wrong, because it yields two rows for each row of df1.

from pandasql import sqldf
import pandas as pd
import io

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)
df1["logged_at"] = pd.to_datetime(df1["logged_at"])

df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)
df2["start_time"] = pd.to_datetime(df2["start_time"])
df2["end_time"] = pd.to_datetime(df2["end_time"])

pysqldf = lambda q: sqldf(q, globals())
pysqldf("""
select df1.*, df2.*
from df1 
left join df2 on df1.logged_at >= df2.start_time and df1.logged_at <= df2.end_time""")

Output

                 logged_at item  value  id                  start_time                    end_time item
 2021-01-03 20:01:23.000000    A      4   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:23.000000    A      4   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:24.000000    A      5   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:24.000000    A      5   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:25.000000    B      4   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:25.000000    B      4   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:26.000000    B      7   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:26.000000    B      7   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:27.000000    A     10   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:27.000000    A     10   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
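
For data of the size mentioned in the question, a range join that compares every row pair may be slow. One possible alternative, sketched here as an assumption rather than as part of the original answer, is pd.merge_asof: for each row of df1 it picks the most recent interval of the same item whose start_time is at or before logged_at, and a second step discards matches whose end_time has already passed. This sketch assumes that the intervals of a given item in df2 do not overlap; if they do, only the interval with the latest start is considered. The names left, right and matched are illustrative.

```python
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)
df1["logged_at"] = pd.to_datetime(df1["logged_at"])

df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)
df2["start_time"] = pd.to_datetime(df2["start_time"])
df2["end_time"] = pd.to_datetime(df2["end_time"])

# merge_asof requires both frames to be sorted on their time keys.
left = df1.sort_values("logged_at")
right = df2.sort_values("start_time")

# For each log row, take the latest interval of the same item
# that started at or before logged_at.
matched = pd.merge_asof(
    left, right,
    left_on="logged_at", right_on="start_time",
    by="item", direction="backward",
)

# Blank out the id where the matched interval ended before the log timestamp.
matched["id"] = matched["id"].where(matched["logged_at"] <= matched["end_time"])

new_df = matched[["logged_at", "item", "value", "id"]]
print(new_df)
```

Because merge_asof works on sorted keys instead of comparing all row pairs, it tends to scale better than a full range join, but the non-overlap assumption should be checked against the real df2 first.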
