NumPy: how to left join arrays with duplicates

Problem description

To use Cython, I need to convert df1.merge(df2, how='left') (using Pandas) to plain NumPy, but I found that numpy.lib.recfunctions.join_by(key, r1, r2, jointype='leftouter') does not support duplicate values in key. Is there any way to work around this?

Tags: python, pandas, numpy, cython

Solution


Here is a pure NumPy left join that can handle duplicate keys:

import numpy as np

def join_by_left(key, r1, r2, mask=True):
    # figure out the dtype of the result array
    descr1 = r1.dtype.descr
    descr2 = [d for d in r2.dtype.descr if d[0] not in r1.dtype.names]
    descrm = descr1 + descr2 

    # figure out the fields we'll need from each array
    f1 = [d[0] for d in descr1]
    f2 = [d[0] for d in descr2]

    # cache the number of columns in f1
    ncol1 = len(f1)

    # get a dict of the rows of r2 grouped by key
    rows2 = {}
    for row2 in r2:
        rows2.setdefault(row2[key], []).append(row2)

    # figure out how many rows will be in the result
    nrowm = 0
    for k1 in r1[key]:
        if k1 in rows2:
            nrowm += len(rows2[k1])
        else:
            nrowm += 1

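    # allocate the return array, zero-filled so that unmatched fields
    # read as 0 (np.recarray alone would leave them uninitialized)
    _ret = np.zeros(nrowm, dtype=descrm).view(np.recarray)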
    if mask:
        ret = np.ma.array(_ret, mask=True)
    else:
        ret = _ret

    # merge the data into the return array
    i = 0
    for row1 in r1:
        if row1[key] in rows2:
            for row2 in rows2[row1[key]]:
                ret[i] = tuple(row1[f1]) + tuple(row2[f2])
                i += 1
        else:
            for j in range(ncol1):
                ret[i][j] = row1[j]
            i += 1

    return ret

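Basically, it uses a plain dict to do the actual join. Like numpy.lib.recfunctions.join_by, this function also returns a masked array. When a key is missing from the right array, the corresponding values are masked in the returned array. If you would rather have a record array (where all missing data is set to 0), pass mask=False when calling join_by_left.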
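To make the dict-based grouping easier to see on its own, here is a minimal sketch that computes only the row-index pairs of a left join and then assembles the result with fancy indexing. The helper name left_join_indices, the sample arrays, the field names k, a and b, and the -1 marker for "no match" are all assumptions made for illustration. They are not part of the answer above or of the NumPy API.

```python
import numpy as np

def left_join_indices(keys1, keys2):
    """Return (idx1, idx2) row-index pairs for a left join of keys1 on keys2.

    Hypothetical helper: idx2 is -1 where a left key has no match on the right.
    """
    # group the row positions of the right keys by key value
    groups = {}
    for j, k in enumerate(keys2.tolist()):
        groups.setdefault(k, []).append(j)

    # emit one pair per matching right row, or a single (i, -1) pair if none
    idx1, idx2 = [], []
    for i, k in enumerate(keys1.tolist()):
        for j in groups.get(k, [-1]):
            idx1.append(i)
            idx2.append(j)
    return np.array(idx1, dtype=np.intp), np.array(idx2, dtype=np.intp)

# sample data (illustrative only); key 2 is duplicated on the right
r1 = np.array([(1, 10.0), (2, 20.0), (3, 30.0)],
              dtype=[('k', 'i8'), ('a', 'f8')])
r2 = np.array([(2, 'x'), (2, 'y'), (4, 'z')],
              dtype=[('k', 'i8'), ('b', 'U1')])

idx1, idx2 = left_join_indices(r1['k'], r2['k'])

# assemble the joined rows; unmatched right-hand fields keep their zero value
out = np.zeros(len(idx1), dtype=[('k', 'i8'), ('a', 'f8'), ('b', 'U1')])
out['k'] = r1['k'][idx1]
out['a'] = r1['a'][idx1]
found = idx2 >= 0
out['b'][found] = r2['b'][idx2[found]]
```

With this sample data the left row with key 2 appears twice in out, once per matching right row, while keys 1 and 3 appear once with an empty b. The boolean array found plays the role of the mask in join_by_left: ~found marks the rows whose right-hand fields would be masked.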

