首页 > 解决方案 > 有没有更有效的方法来根据列的内容从 PyArrow 表中选择行?

问题描述

我问了一个相关的问题,关于根据列的内容从 PyArrow 表中选择行的更惯用的方法。@joris 的答案看起来很棒。但看起来像在 PyArrow 中使用行掩码纯粹选择行在稀疏选择时存在性能问题。

有没有更有效的方法来做到这一点,但纯粹留在 PyArrow 中而不在 PyArrow 和 numpy 之间来回切换?

测试用例:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

# Example table for data schema:
# Alternating rows with index 0 and 1
irow = np.arange(2**20)
dt = 17
df0 = pd.DataFrame({'timestamp': np.array((irow//2)*dt, dtype=np.int64),
                   'index':     np.array(irow%2, dtype=np.int16),
                   'value':     np.array(irow*0, dtype=np.int32)},
                   columns=['timestamp','index','value'])
ii = df0['index'] == 0
df0.loc[ii,'value'] = irow[ii]//2
ii = df0['index'] == 1
df0.loc[ii,'value'] = (np.sin(df0.loc[ii,'timestamp']*0.01)*10000).astype(np.int32)

# Insert rows with index 2 every 16 timestamps
irow = np.arange(10000)
subsample = 16
df1 = pd.DataFrame({'timestamp': np.array(irow*dt*subsample, dtype=np.int64),
                    'index':     np.full_like(irow, 2, dtype=np.int16),
                    'value':     np.array(irow*irow, dtype=np.int32)},
                    columns=['timestamp','index','value'],
                    index=irow*subsample*2+1.5)
df2=pd.concat([df0,df1]).sort_index()
df2.index = pd.RangeIndex(len(df2))
print(df2)
table2 = pa.Table.from_pandas(df2)

# which prints:

         timestamp  index   value
0                0      0       0
1                0      1       0
2                0      2       0
3               17      0       1
4               17      1    1691
...            ...    ...     ...
1058571    8912845      1    9945
1058572    8912862      0  524286
1058573    8912862      1    9978
1058574    8912879      0  524287
1058575    8912879      1    9723

[1058576 rows x 3 columns]

验证 index=2 的稀疏内容:

print(df2[df2['index']==2])

# which prints

        timestamp  index     value
2               0      2         0
35            272      2         1
68            544      2         4
101           816      2         9
134          1088      2        16
...           ...    ...       ...
329837    2718640      2  99900025
329870    2718912      2  99920016
329903    2719184      2  99940009
329936    2719456      2  99960004
329969    2719728      2  99980001

[10000 rows x 3 columns]

和基准测试:

import time

# My method, which sloshes back and forth between PyArrow and numpy
def select_by_index_np(table, ival):
    value_index = table.column('index').to_numpy()
    row_indices = np.nonzero(value_index==ival)[0]
    return table.take(pa.array(row_indices))

# Stay in PyArrow: see https://stackoverflow.com/a/64579502/44330
def select_by_index(table, ival):
    value_index = table.column('index')
    index_type = value_index.type.to_pandas_dtype()
    mask = pc.equal(value_index, index_type(ival))
    return table.filter(mask)

def run_timing_test(table, ival, select_algorithm, nrep=100):
    t1 = time.time_ns()
    for _ in range(nrep):
        tsel = select_algorithm(table, ival)
    t2 = time.time_ns()
    print('%.0fus %20s(%s) -> %s' % 
          ((t2-t1)/1000/nrep, 
           select_algorithm.__name__, 
           ival, 
           tsel.column('value').to_numpy()))

run_timing_test(table2, 0, select_by_index)
run_timing_test(table2, 0, select_by_index_np)
run_timing_test(table2, 1, select_by_index)
run_timing_test(table2, 1, select_by_index_np)
run_timing_test(table2, 2, select_by_index)
run_timing_test(table2, 2, select_by_index_np)

# which prints

7639us      select_by_index(0) -> [     0      1      2 ... 524285 524286 524287]
7780us   select_by_index_np(0) -> [     0      1      2 ... 524285 524286 524287]
7789us      select_by_index(1) -> [   0 1691 3334 ... 9945 9978 9723]
8204us   select_by_index_np(1) -> [   0 1691 3334 ... 9945 9978 9723]
3840us      select_by_index(2) -> [       0        1        4 ... 99940009 99960004 99980001]
1611us   select_by_index_np(2) -> [       0        1        4 ... 99940009 99960004 99980001]

The two methods are comparable when selected rows are a substantial portion of the table, but when they are very small, select_by_index_npwhich marshals to numpy, determines indices where the rows of the mask are True, and marshals back to PyArrow, is faster!

有没有一种有效的方法来做到这一点,但留在 PyArrow 中?(我没有看到任何 pyarrow.compute 相当于numpy.nonzero

标签: pythonpyarrow

解决方案


推荐阅读