Impala | KUDU Show PARTITION BY HASH. Where are my rows?

Problem description

I want to test CREATE TABLE with PARTITION BY HASH in Kudu.

This is my CREATE statement:

CREATE TABLE customers (
  state STRING,
  name STRING,
  purchase_count INT,
  PRIMARY KEY (state, name)
)
PARTITION BY HASH (state) PARTITIONS 2
STORED AS KUDU
TBLPROPERTIES (
  'kudu.master_addresses' = '127.0.0.1',
  'kudu.num_tablet_replicas' = '1'
);

Some inserts...

insert into customers values ('madrid', 'pili', 8);
insert into customers values ('barcelona', 'silvia', 8);
insert into customers values ('galicia', 'susi', 8);

Avoiding issues...

COMPUTE STATS customers;
Query: COMPUTE STATS customers
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 3 column(s). |
+-----------------------------------------+

And then...

show partitions customers;
Query: show partitions customers
+--------+-----------+----------+----------------+------------+
| # Rows | Start Key | Stop Key | Leader Replica | # Replicas |
+--------+-----------+----------+----------------+------------+
| -1     |           | 00000001 | hidra:7050     | 1          |
| -1     | 00000001  |          | hidra:7050     | 1          |
+--------+-----------+----------+----------------+------------+
Fetched 2 row(s) in 2.31s

Where are my rows? What does the "-1" mean?

Is there any way to see whether the row distribution is working properly?

Tags: impala, kudu, apache-kudu

Solution


Based on further reading of the white paper https://kudu.apache.org/kudu.pdf: the COMPUTE STATS statement applies to partitioned tables stored in HDFS, not to Kudu tables (Kudu does not use HDFS files internally), so the "-1" simply means Impala has no per-tablet row counts for a Kudu table. Impala's modular architecture allows a single query to transparently join data from multiple different storage components. For example, a text log file on HDFS can be joined against a large dimension table stored in Kudu.
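As a sketch of such a cross-storage join (the `web_logs` HDFS-backed text table and its columns are hypothetical, invented here for illustration):

```sql
-- Hypothetical text table on HDFS, joined in one Impala query against
-- the Kudu-backed customers table from this question.
SELECT c.state, COUNT(*) AS views
FROM web_logs l                               -- text files on HDFS (hypothetical)
JOIN customers c ON l.customer_name = c.name  -- stored in Kudu
GROUP BY c.state;
```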

For queries involving Kudu tables, Impala can delegate much of the work of filtering the result set to Kudu, avoiding some of the I/O that a full table scan over HDFS data files would require. This kind of optimization is especially effective for partitioned Kudu tables, where the query's WHERE clause references one or more primary key columns that are also used as partition key columns.
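For example, since `state` is both a primary key column and the hash-partition key of the `customers` table above, a query like the following lets Impala push the filter down to Kudu, which then only has to scan the one tablet that the hash of 'madrid' maps to:

```sql
-- The WHERE clause references the hash-partition key column,
-- so Kudu can skip every tablet except the one holding 'madrid'.
SELECT name, purchase_count
FROM customers
WHERE state = 'madrid';
```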

