首页 > 解决方案 > Impala 无法从 Parquet 文件中读取无日期时间戳

问题描述

Impala v2.11.0+ (CDH v5.11.1+) 无法从 Parquet 文件中读取只有时间值的时间戳。

create table TT2(t timestamp) STORED AS PARQUET;
insert into TT2 (t) values ("10:00:00");
select * from TT2;
+------+
| t    |
+------+
| NULL |
+------+
WARNINGS: Parquet file 'hdfs://localhost:20500/test-warehouse/tt2/714d741212df3180-cd4e670800000000_226739479_data.0.parq' column 't' contains an out of range timestamp. The valid date range is 1400-01-01..9999-12-31.`

https://issues.apache.org/jira/browse/IMPALA-5942

即使 select 语句返回 null,元存储管理器也会显示该列具有值

4714-12-30 10:00:00.0

我正在寻找的是另一种查询此数据以获取时间值的方法,而不是手动查找所有无日期时间戳列并将其转换为字符串。

我试过了

SELECT cast(t as string) FROM TT2

SELECT date_part('hour', t) FROM TT2

SELECT from_timestamp(tms, "HH:mm:ss") FROM TT2

SELECT extract(tms, "hour") FROM TT2

SELECT extract(cast(tms as string), "hour") FROM TT2

标签: timetimestampparquetimpalacloudera-cdh

解决方案


I believe that since you are interested in only the time part so I would suggest to simply replace the date part with some acceptable date and then impala provides functions to extact time from timestamp (all your further queries on top of data can make use of it), so I think one possible solution is

1)Create a temporary table where timestamp column is string

2)Now you will get the value like "4714-12-30 10:00:00.0"

3)now , you have to do things, a) use split_apart to extract 10:00:00.0 b) concat the extracted part with "2018-11-12" so now you can get values in the format "2018-11-12 10:00:00.0"

4)Now the result of previous stepcan be casted as a timestamp

5)Use the select [column names col1,2,3,..],(cast (concat("2018-11-12",split_apart(col,delim,index)))as timestamp) to insert data into original table from temporary table


推荐阅读