Schema inconsistency when reading Parquet exported from Vertica

Problem description

I noticed strange behavior when exporting data from Vertica and later trying to read it back as Parquet in Python. Say I want to dump a table to Parquet:

EXPORT TO PARQUET (directory = '/data/table_name') over (partition by event_date) 
AS select * from table;

This gives me the following directory structure:

/data/table_name
 - event_date=2019-01-01
 - event_date=2019-01-02
 - event_date=2019-01-03
...

Then I try to read it with pyarrow:

import pyarrow.parquet as pq
df = pq.read_table('/data/table_name')

But I get a schema inconsistency error:

ValueError: Schema in partition[event_date=0] ./event_date=2019-01-01/84087de6-node0001-139759025940222.parquet was different.
user_id: string
event_id: int64
event_name: string
install_date: int32
event_date: int32
site_id: string

vs

user_id: string
event_id: int64
event_name: string
install_date: int32
site_id: string

How come?

P.S. If I read each directory separately, it works fine.

df1 = pq.read_table('/data/table_name/event_date=2019-01-01')
df2 = pq.read_table('/data/table_name/event_date=2019-01-02')
df3 = pq.read_table('/data/table_name/event_date=2019-01-03')

df1.schema == df2.schema == df3.schema
> True

Tags: python, export, parquet, vertica, pyarrow

Solution

You need to exclude the partition column (event_date) from the export query:

EXPORT TO PARQUET (directory = '/data/table_name') over (partition by event_date) 
AS SELECT user_id,
          event_id,
          event_name,
          install_date,
          site_id
FROM table;
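
Once the partition column is no longer written into the data files, a reader such as pyarrow recovers event_date from the Hive-style directory names (event_date=2019-01-01 and so on). If re-exporting is not an option, the per-directory reads that already work in the question can be combined by hand. The following is a minimal sketch, assuming pyarrow is installed and that every subdirectory follows the key=value layout shown above. The helper names parse_partition_dir and read_partitioned are hypothetical and not part of pyarrow, and the partition value is kept as a plain string.

```python
import os


def parse_partition_dir(dirname):
    """Split a Hive-style directory name such as 'event_date=2019-01-01'
    into (column, value). Return None if the name is not key=value."""
    key, sep, value = dirname.partition("=")
    if not sep or not key:
        return None
    return key, value


def read_partitioned(root):
    """Read each partition directory under `root` separately and
    concatenate the results (hypothetical helper, not a pyarrow API)."""
    # Imported lazily so the path helper above works without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    tables = []
    for name in sorted(os.listdir(root)):
        parsed = parse_partition_dir(name)
        path = os.path.join(root, name)
        if parsed is None or not os.path.isdir(path):
            continue  # skip anything that is not a partition directory
        column, value = parsed
        table = pq.read_table(path)
        # Drop the copy stored inside the files, if present, so that
        # every partition ends up with the same schema.
        if column in table.column_names:
            table = table.drop([column])
        # Re-add the partition column from the directory name.
        values = pa.array([value] * table.num_rows, type=pa.string())
        tables.append(table.append_column(column, values))
    # concat_tables expects at least one table with matching schemas.
    return pa.concat_tables(tables)
```

With the layout from the question, this would be called as read_partitioned('/data/table_name'). Excluding the column in the export, as shown above, remains the simpler fix because a plain pq.read_table on the root directory then works.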
