Pandas DataFrame: merge by intersection of spans?

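Problem description

I want to merge two DataFrames based on the overlap of their spans, where each span is a pair (s, e): s is the span start and e is the span end. I have some very ugly code that does this, but I would like to know whether there is a good way to do it. Here is an example: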

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'s':[0,10,20,33,424,5345],
                'e':[3,17,30,39,1000,10987],
                'data1':[1,2,3,4,5,6]})

df2 = pd.DataFrame({'s':[1,45,0],
                    'e':[50,46,90],
                    'data2':[1,2,3]})

def overlap(a1,a2,b1,b2):
    if type(b1) == list or type(b1)==np.ndarray:
        assert(len(b1)==len(b2))
        return np.asarray([overlap(a1,a2,b1[k],b2[k]) for k in range(len(b1))])
    else:
        return max((a2-a1)+(b2-b1)+min(a1,b1)-max(b2,a2)+1,0)

overlaps = [overlap(df1['s'].iloc[i],df1['e'].iloc[i],df2['s'].values,df2['e'].values)>0 
        for i in range(len(df1))]
df1['data2']=[df2['data2'][o].tolist() for o in overlaps]

The output is:

      s      e  data1   data2
0     0      3      1  [1, 3]
1    10     17      2  [1, 3]
2    20     30      3  [1, 3]
3    33     39      4  [1, 3]
4   424   1000      5      []
5  5345  10987      6      []

Edit: Also, in my particular case the df1 spans are guaranteed to be non-overlapping and consecutive (i.e. s[i] > s[i-1], e[i] > s[i], e[i] < s[i+1]).
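A minimal sketch of how that guarantee could be checked, assuming the spans live in columns s and e as in the example above. The helper name spans_are_ordered_and_disjoint is hypothetical and not part of pandas:

```python
import numpy as np
import pandas as pd


def spans_are_ordered_and_disjoint(df):
    """Return True if e[i] > s[i] for every row and e[i] < s[i+1] for consecutive rows."""
    s = df['s'].to_numpy()
    e = df['e'].to_numpy()
    each_span_is_valid = np.all(e > s)
    # e[i] < s[i+1] together with e[i] > s[i] also implies s[i+1] > s[i].
    no_span_reaches_the_next = np.all(e[:-1] < s[1:])
    return bool(each_span_is_valid and no_span_reaches_the_next)


example = pd.DataFrame({'s': [0, 10, 20, 33, 424, 5345],
                        'e': [3, 17, 30, 39, 1000, 10987]})
print(spans_are_ordered_and_disjoint(example))  # prints True
```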

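Edit 2: You can generate an arbitrary amount of almost-valid fake data (here we do not guarantee that the spans in the first df are non-overlapping):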

N=int(1e3)
sdf1=np.random.randint(0, high=10*N, size=(N,))
sdf1.sort()
edf1=sdf1+np.random.randint(1, high=10, size=(N,))
data1=range(N)

sdf2=np.random.randint(0, high=10*N, size=(N,))
edf2=sdf2+np.random.randint(1, high=10, size=(N,))
data2=range(N)


df1 = pd.DataFrame({'s':sdf1,
                    'e':edf1,
                    'data1':data1})

df2 = pd.DataFrame({'s':sdf2,
                    'e':edf2,
                    'data2':data2})

Tags: python, pandas, merge

Solution


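When working with pandas DataFrames, you should generally avoid using for loops to process rows/columns and use apply, transform, and other pandas functions instead. For example, to get the overlaps you can do the following: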

def has_overlap(a1, a2, b1, b2):
    '''Return True if the closed spans [a1, a2] and [b1, b2] overlap, otherwise False.'''
    return (a2-a1)+(b2-b1)+min(a1,b1)-max(b2,a2)+1 > 0

def find_overlap(row1):
    '''Return the data2 values of the df2 rows that overlap the given row of df1, as a list.'''
    # Keep the boolean mask local so that df2 is not modified as a side effect
    mask = df2.apply(lambda row2: has_overlap(row1.s, row1.e, row2.s, row2.e), axis=1)
    return df2.loc[mask, 'data2'].tolist()

df1['data2'] = df1.apply(find_overlap, axis=1)
print('df1:\n{}'.format(df1))
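
Note that a row-wise apply still calls has_overlap once per pair of rows in Python. As a possible alternative, not part of the answer above, the comparison can be done for all pairs at once with NumPy broadcasting. This is only a sketch under two assumptions: the endpoints are integers and spans are closed, so that "overlap > 0" in the formula above is equivalent to "each span starts no later than the other ends". The names s1, e1, s2, e2, overlap_mask and values2 are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'s': [0, 10, 20, 33, 424, 5345],
                    'e': [3, 17, 30, 39, 1000, 10987],
                    'data1': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'s': [1, 45, 0],
                    'e': [50, 46, 90],
                    'data2': [1, 2, 3]})

# df1 endpoints as column vectors (shape N x 1), df2 endpoints as 1-D arrays (shape M).
s1 = df1['s'].to_numpy()[:, None]
e1 = df1['e'].to_numpy()[:, None]
s2 = df2['s'].to_numpy()
e2 = df2['e'].to_numpy()

# Two closed spans overlap when each one starts no later than the other ends.
overlap_mask = (s1 <= e2) & (s2 <= e1)  # boolean array of shape (N, M)

# For each df1 row, collect the data2 values of the overlapping df2 rows.
values2 = df2['data2'].to_numpy()
df1['data2'] = [values2[row].tolist() for row in overlap_mask]
print(df1)
```

The mask holds N x M booleans, so for very large inputs memory use grows with the product of the two lengths; in that case processing df1 in chunks may be preferable.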
