首页 > 解决方案 > 使用 pandas merge_asof() 识别范围关系

问题描述

给定以下两个表示范围的数据框:

df1 =

  start   end
0   200   300
1   600   900
2   950  1050

df2 =

  start   end
0   350   550
1   650   800
2   900  1100

它们可以这样表示:

df1  [200 300]            [600    900] [950 1050]
df2            [350  550]   [650 800] [900   1100]

我的任务是确定范围df1df2范围之间的四种不同类型的关系:

  1. df2的子集df1
    • df2 [650 800]的子集df1 [600 900]
  2. df2的超集df1
    • df2 [900 1100]的超集df1 [950 1050]
  3. df2之后df1(最近邻,不包括子集/超集)
    • df2 [350 550]df1 [200 300]
    • df2 [900 1100]df1 [600 900]
  4. df2之前df1(最近的邻居,不包括子集/超集)
    • df2 [350 550]df1 [600 900]
    • df2 [650 800]df1 [950 1050]

我正在尝试使用merge_asof()这个答案中学到的东西,但由于超集/子集关系添加的复杂性,它不起作用,例如:

# Create "before" condition
df_before = pd.merge_asof(
    df2.rename(columns={col:f'before_{col}' for col in df2.columns}).sort_values('before_end'),
    df1.assign(before_end=lambda x: x['end']).sort_values('before_end'),
    on='before_end',
    direction='forward'
).query('end > before_end')

print(df_before)

输出:

  before_start  before_end  start    end
0          350         550  600.0  900.0
1          650         800  600.0  900.0

目标输出:

  before_start  before_end  start     end
0          350         550  600.0   900.0
1          650         800  950.0  1050.0

问题是

pd.merge_asof(
    df2.rename(columns={col:f'before_{col}' for col in df2.columns}).sort_values('before_end'),
    df1.assign(before_end=lambda x: x['end']).sort_values('before_end'),
    on='before_end',
    direction='forward'
)

df1.end在 800 之后找到最接近的df2 [650 800],即df1 [600 900]

  before_start  before_end  start    end
0          350         550  600.0  900.0
1          650         800  600.0  900.0
2          900        1100    NaN    NaN

是否可以merge_asof()根据特定条件查找最接近的值,例如“df1.end仅当df1.start该范围大于 800(在本例中为 950)时才查找最近的值”?有了这种复杂程度,也许还有另一个更适合这项任务的功能?

笔记:

df1 =

  start   end  subset_start  subset_end  superset_start  superset_end  before_start  before_end  after_start  after_end
0   200   300           NaN         NaN             NaN           NaN           NaN         NaN        350.0      550.0
1   600   900         650.0       800.0             NaN           NaN         350.0       550.0        900.0     1100.0
2   950  1050           NaN         NaN           900.0        1100.0         650.0       800.0          NaN        NaN

标签: pythonpandasdataframe

解决方案


可以pd.merge_asof用来查找之前和之后的选项。

before_df = pd.merge_asof(df1, df2, left_on='start', right_on='end', suffixes=['', '_before'])
before_df
#    start   end  start_before  end_before
# 0    200   300           NaN         NaN
# 1    600   900         350.0       550.0
# 2    950  1050         650.0       800.0

after_df = pd.merge_asof(df2, df1, left_on='start', right_on='end', suffixes=['_after', ''])
#    start_after  end_after  start  end
# 0          350        550    200  300
# 1          650        800    200  300
# 2          900       1100    600  900

但是要使其工作或子集和超集计算并不容易。对于那些人,我会争取这种可以一次性工作的算法。

def range_intersect(lh_ranges, rh_ranges): 
    all_ranges = sorted(
        [(b, e, 'lh') for b, e in lh_ranges] +
        [(b, e, 'rh') for b, e in rh_ranges]
    ) 

    res = [] 
    max_b, max_e = None, None 
    for b, e, which in all_ranges: 
        if which == 'rh': 
            if max_e is None or e > max_e: 
                max_b, max_e = b, e 
        elif max_e is not None and e <= max_e: 
            res.append((b, e, max_b, max_e)) 

    return res

这会发现 的元素是lh中元素的子集rh。要查找超集,可以反向运行。为简单起见,它采用范围列表而不是DataFrames。转换很简单。

lh = df1.to_dict('split')['data']
rh = df2.to_dict('split')['data']

lh
# [[200, 300], [600, 900], [950, 1050]]

rh                                                                                                                                                                                                                                  
# [[350, 550], [650, 800], [900, 1100]]

在那之后,DataFrame你想要的结果只是几个合并。

# Compute supersets, then run in reverse to get the subsets.
superset_df = pd.DataFrame(range_intersect(lh, rh), columns=['start', 'end', 'start_superset', 'end_superset'])
subset_df = pd.DataFrame(range_intersect(rh, lh), columns=['start_subset', 'end_subset', 'start', 'end'])

# Merge all the results together.
result = df1.merge(subset_df, how='left').merge(superset_df, how='left').merge(before_df, how='left').merge(after_df, how='left')

# The reversed operations, after and subset, can have many matches in df1.
result.drop_duplicates(['start', 'end'])
#    start   end  start_subset  end_subset  start_superset  end_superset  start_before  end_before  start_after  end_after
# 0    200   300           NaN         NaN             NaN           NaN           NaN         NaN        350.0      550.0
# 2    600   900         650.0       800.0             NaN           NaN         350.0       550.0        900.0     1100.0
# 3    950  1050           NaN         NaN           900.0        1100.0         650.0       800.0          NaN        NaN

推荐阅读