Finding the closest previous value in different columns of a Pandas DataFrame

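Problem description

Given a value in one column, I am trying to find the most recent row of a Pandas DataFrame in which that value appears in either of two other columns, and then indicate 1 if it is found in that column and 0 otherwise.

The DataFrame index is not sorted.

Data: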

df = pd.DataFrame({
  'datetime': [
      '2020-11-16 01:39:06.22021017', '2020-11-16 01:39:06.22021020', '2020-11-16 01:39:06.22021022',
      '2020-11-16 01:39:06.22021031', '2020-11-16 01:39:06.22021033', '2020-11-16 01:39:06.22021036'],
  'type': ['Quote', 'Trade', 'Trade', 'Quote', 'Quote', 'Trade'],
  'price': ['NaN', 7026.5, 7026.5, np.NaN, np.NaN, 7024.0], 
  'ask_price': [7026.5, 7026.5, 7026.0, 7026.5, 7026.0, 7026.5], 
  'bid_price': [7024.0, 7024.5, 7024.5, 7024.0, 7024.5, 7024.5]})

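What I need:

When type == 'Trade', I need to look back through bid_price and ask_price and find the first value that matches the price column. In the same row as the trade, I want two separate columns indicating whether the price was found in the most recent bid_price or ask_price value.

Expected output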

df = pd.DataFrame({
  'datetime': [
      '2020-11-16 01:39:06.22021017', '2020-11-16 01:39:06.22021020', '2020-11-16 01:39:06.22021022',
      '2020-11-16 01:39:06.22021033', '2020-11-16 01:39:06.22021034', '2020-11-16 01:39:06.22021033'],
  'type': ['Quote', 'Trade', 'Trade', 'Quote', 'Quote', 'Trade'],
  'price': ['NaN', 7026.5, 7026.5, np.NaN, np.NaN, 7024.0], 
  'ask_price': [7026.5, 7026.5, 7026.0, 7026.5, 7026.0, 7026.5], 
  'bid_price': [7024.0, 7024.5, 7024.5, 7024.0, 7024.5, 7024.5],
  'is_bid_trade': [0, 0, 0, 0, 0, 1],
  'is_ask_trade': [1, 1, 0, 0, 0, 0]})

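You can see that the first trade matches the quote in the previous row, in the ask_price column. The final trade matches bid_price, but the match is two rows before the trade.

I have already tried (with help from SO) but have not found a solution here.

Unfortunately, the datetime column is not 100% accurate, so sorting chronologically cannot be relied upon. I also tried using df.index.get_loc() to find the smallest index, but I am not sure how to apply it to a search across two columns.

Any help is greatly appreciated.

Tags: python, pandas, dataframe, search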

Solution


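Here you go. Note that in your input dataset I changed the string 'NaN' to np.nan for consistency, and I think your output dataset has a misplaced 1: it is inconsistent about whether the 1 belongs on the row where the trade occurred or on the previous row. Even so, I think this works for the data you provided. See the comments in the code. The code writes the 1 to the trade row; if it should go on the matching quote row instead, switch to the commented-out lines that use dfindex.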
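The 'NaN' replacement mentioned above matters because a string 'NaN' is not treated as missing by notnull(), so it would be picked up as a trade price. As a minimal sketch, assuming the real data arrives with such strings mixed into the price column, pd.to_numeric with errors='coerce' is one way to clean it. The names raw and price are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the question's price column, with the string 'NaN' in it
raw = pd.Series(['NaN', 7026.5, 7026.5, np.nan, np.nan, 7024.0], name='price')

# The string counts as a non-null value, so a notnull() filter would keep it
non_null_before = int(raw.notnull().sum())

# Coerce everything to floats; values that cannot be parsed become NaN
price = pd.to_numeric(raw, errors='coerce')
non_null_after = int(price.notnull().sum())

print(non_null_before, non_null_after)
```

After the conversion only the three real trade prices remain non-null, which is what the matching code relies on.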

import numpy as np
import pandas as pd

df = pd.DataFrame({
  'datetime': [
      '2020-11-16 01:39:06.22021017', '2020-11-16 01:39:06.22021020', '2020-11-16 01:39:06.22021022',
      '2020-11-16 01:39:06.22021031', '2020-11-16 01:39:06.22021033', '2020-11-16 01:39:06.22021036'],
  'type': ['Quote', 'Trade', 'Trade', 'Quote', 'Quote', 'Trade'],
  'price': [np.nan, 7026.5, 7026.5, np.nan, np.nan, 7024.0],
  'ask_price': [7026.5, 7026.5, 7026.0, 7026.5, 7026.0, 7026.5],
  'bid_price': [7024.0, 7024.5, 7024.5, 7024.0, 7024.5, 7024.5]})
# you don't have to sort, but reset the index so the row labels run 0..n-1
df.reset_index(drop=True, inplace=True)

# collect the indices where Trade occurred
trade_indices = df.loc[df['type'] == 'Trade'].index.tolist()
# collect the corresponding trade prices (from the same Trade rows, so both lists stay aligned)
prices = df.loc[df['type'] == 'Trade', 'price'].tolist()
# create a list of tuples matching each trade row with its price
test_tuples = list(zip(trade_indices, prices))
print(test_tuples)
dfo = df.copy() # create an output dataframe, leaving the input df as-is
# create your new columns with zeroes
dfo['is_bid_trade'] = 0
dfo['is_ask_trade'] = 0

# iterate over the tuples; for each trade, take the rows from 0 up to (not including) the trade row,
# keep those where the price appears in either the ask or bid price column, then take the last one.
# tail(1) is the most recent matching row before the trade
for (tradei, price) in test_tuples:
    print(tradei, price)
    prior = df[0:tradei]
    dftemp = prior[(prior[['ask_price', 'bid_price']] == price).any(axis=1)].tail(1)
    if dftemp.empty:
        continue  # no earlier row matches this price; leave both flags at 0
    dfindex = dftemp.index[0]  # row of the matching quote
    # test if the match is in ask or bid, then write to dfo
    if dftemp['ask_price'].iat[0] == price:
        #dfo.at[dfindex, 'is_ask_trade'] = 1
        dfo.at[tradei, 'is_ask_trade'] = 1
    else:
        #dfo.at[dfindex, 'is_bid_trade'] = 1
        dfo.at[tradei, 'is_bid_trade'] = 1

Output:

In [4]: dfo
Out[4]:
datetime                     type  price  ask_price bid_price is_bid_trade is_ask_trade
2020-11-16 01:39:06.22021017 Quote NaN    7026.5    7024.0    0    0
2020-11-16 01:39:06.22021020 Trade 7026.5 7026.5    7024.5    0    1
2020-11-16 01:39:06.22021022 Trade 7026.5 7026.0    7024.5    0    1
2020-11-16 01:39:06.22021031 Quote NaN    7026.5    7024.0    0    0
2020-11-16 01:39:06.22021033 Quote NaN    7026.0    7024.5    0    0
2020-11-16 01:39:06.22021036 Trade 7024.0 7026.5    7024.5    1    0
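
For larger frames, slicing the DataFrame once per trade can get slow. The following is a rough single-pass sketch of the same rule, not a drop-in replacement: it only looks at earlier rows, prefers the most recent match, and checks ask_price first when both columns match on the same row. The function name flag_trades and the dictionaries last_ask and last_bid are hypothetical names introduced for illustration, and the sketch assumes exact float equality is acceptable for comparing prices.

```python
import numpy as np
import pandas as pd

def flag_trades(df):
    """Return a copy of df with is_bid_trade / is_ask_trade columns (hypothetical helper)."""
    out = df.reset_index(drop=True).copy()
    types = out['type'].to_numpy()
    prices = out['price'].to_numpy()
    asks = out['ask_price'].to_numpy()
    bids = out['bid_price'].to_numpy()
    is_bid = np.zeros(len(out), dtype=int)
    is_ask = np.zeros(len(out), dtype=int)
    last_ask = {}  # price -> most recent row position where it was the ask
    last_bid = {}  # price -> most recent row position where it was the bid
    for i in range(len(out)):
        if types[i] == 'Trade' and not pd.isna(prices[i]):
            a = last_ask.get(prices[i], -1)
            b = last_bid.get(prices[i], -1)
            if a >= 0 or b >= 0:
                if a >= b:  # a tie on the same row goes to ask
                    is_ask[i] = 1
                else:
                    is_bid[i] = 1
        # Record this row's quotes only after handling the trade,
        # so that a trade never matches its own row
        last_ask[asks[i]] = i
        last_bid[bids[i]] = i
    out['is_bid_trade'] = is_bid
    out['is_ask_trade'] = is_ask
    return out

df = pd.DataFrame({
    'type': ['Quote', 'Trade', 'Trade', 'Quote', 'Quote', 'Trade'],
    'price': [np.nan, 7026.5, 7026.5, np.nan, np.nan, 7024.0],
    'ask_price': [7026.5, 7026.5, 7026.0, 7026.5, 7026.0, 7026.5],
    'bid_price': [7024.0, 7024.5, 7024.5, 7024.0, 7024.5, 7024.5]})
result = flag_trades(df)
print(result[['type', 'price', 'is_bid_trade', 'is_ask_trade']])
```

On the sample data this flags the first two trades as ask trades and the last one as a bid trade. The dictionaries trade memory for speed, which may or may not suit data with very many distinct prices.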
