python - 有没有更有效的方法来根据列的内容从 PyArrow 表中选择行?
问题描述
我问了一个相关的问题,关于根据列的内容从 PyArrow 表中选择行的更惯用的方法。@joris 的答案看起来很棒。但看起来像在 PyArrow 中使用行掩码纯粹选择行在稀疏选择时存在性能问题。
有没有更有效的方法来做到这一点,但纯粹留在 PyArrow 中而不在 PyArrow 和 numpy 之间来回切换?
测试用例:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np
# Example table for data schema:
# Alternating rows with index 0 and 1
irow = np.arange(2**20)
dt = 17
df0 = pd.DataFrame({'timestamp': np.array((irow//2)*dt, dtype=np.int64),
'index': np.array(irow%2, dtype=np.int16),
'value': np.array(irow*0, dtype=np.int32)},
columns=['timestamp','index','value'])
ii = df0['index'] == 0
df0.loc[ii,'value'] = irow[ii]//2
ii = df0['index'] == 1
df0.loc[ii,'value'] = (np.sin(df0.loc[ii,'timestamp']*0.01)*10000).astype(np.int32)
# Insert rows with index 2 every 16 timestamps
irow = np.arange(10000)
subsample = 16
df1 = pd.DataFrame({'timestamp': np.array(irow*dt*subsample, dtype=np.int64),
'index': np.full_like(irow, 2, dtype=np.int16),
'value': np.array(irow*irow, dtype=np.int32)},
columns=['timestamp','index','value'],
index=irow*subsample*2+1.5)
df2=pd.concat([df0,df1]).sort_index()
df2.index = pd.RangeIndex(len(df2))
print(df2)
table2 = pa.Table.from_pandas(df2)
# which prints:
timestamp index value
0 0 0 0
1 0 1 0
2 0 2 0
3 17 0 1
4 17 1 1691
... ... ... ...
1058571 8912845 1 9945
1058572 8912862 0 524286
1058573 8912862 1 9978
1058574 8912879 0 524287
1058575 8912879 1 9723
[1058576 rows x 3 columns]
验证 index=2 的稀疏内容:
print(df2[df2['index']==2])
# which prints
timestamp index value
2 0 2 0
35 272 2 1
68 544 2 4
101 816 2 9
134 1088 2 16
... ... ... ...
329837 2718640 2 99900025
329870 2718912 2 99920016
329903 2719184 2 99940009
329936 2719456 2 99960004
329969 2719728 2 99980001
[10000 rows x 3 columns]
和基准测试:
import time
# My method, which sloshes back and forth between PyArrow and numpy
def select_by_index_np(table, ival):
value_index = table.column('index').to_numpy()
row_indices = np.nonzero(value_index==ival)[0]
return table.take(pa.array(row_indices))
# Stay in PyArrow: see https://stackoverflow.com/a/64579502/44330
def select_by_index(table, ival):
value_index = table.column('index')
index_type = value_index.type.to_pandas_dtype()
mask = pc.equal(value_index, index_type(ival))
return table.filter(mask)
def run_timing_test(table, ival, select_algorithm, nrep=100):
t1 = time.time_ns()
for _ in range(nrep):
tsel = select_algorithm(table, ival)
t2 = time.time_ns()
print('%.0fus %20s(%s) -> %s' %
((t2-t1)/1000/nrep,
select_algorithm.__name__,
ival,
tsel.column('value').to_numpy()))
run_timing_test(table2, 0, select_by_index)
run_timing_test(table2, 0, select_by_index_np)
run_timing_test(table2, 1, select_by_index)
run_timing_test(table2, 1, select_by_index_np)
run_timing_test(table2, 2, select_by_index)
run_timing_test(table2, 2, select_by_index_np)
# which prints
7639us select_by_index(0) -> [ 0 1 2 ... 524285 524286 524287]
7780us select_by_index_np(0) -> [ 0 1 2 ... 524285 524286 524287]
7789us select_by_index(1) -> [ 0 1691 3334 ... 9945 9978 9723]
8204us select_by_index_np(1) -> [ 0 1691 3334 ... 9945 9978 9723]
3840us select_by_index(2) -> [ 0 1 4 ... 99940009 99960004 99980001]
1611us select_by_index_np(2) -> [ 0 1 4 ... 99940009 99960004 99980001]
The two methods are comparable when selected rows are a substantial portion of the table, but when they are very small, select_by_index_np
which marshals to numpy, determines indices where the rows of the mask are True, and marshals back to PyArrow, is faster!
有没有一种有效的方法来做到这一点,但留在 PyArrow 中?(我没有看到任何 pyarrow.compute 相当于numpy.nonzero
)
解决方案
推荐阅读
- vue.js - Vue.js 不同的符号
- typescript - 如何在打字稿中键入此对象?
- python - Pygamezero 代码,我的代码在我没有任何代码的情况下工作,这是为什么呢?:
- terminal - grep 命令未读取此 adb 命令的命令输出
- swagger-ui - Swagger UI 进入 dropwizard
- embedded-linux - qemu-system-arm 难以启动 zImage
- javascript - app.post 不适用于根路由以外的任何路由
- flutter - 带有 SingleChildScrollView、Column、TabBar 和 TabBarView 的 Scaffold 中的高度问题
- sql - 从列值 varchar SQL 中提取数据类型
- kubernetes - prometheus:计算时间范围内时间序列的数量