大表数据过滤查询很慢

一、问题描述

查询的语句类似如下：

select * from table_name where xxx='yyy' limit 10;

当前的hive表存储格式是orc格式，执行引擎是tez，并行度也已经调整到几十了，但是在执行这个sql的时候，发现一直卡住，执行不成功。

二、问题现象 and 分析：

现象：当前的查询卡住。
分析：
1、查看hiveserver2.log文件，观察当前的sql运行情况，发现当前处理sql的线程，一直在读取数据文件，如下：

看这个样子，当前的sql自己就已经在scan数据了，完全没有走mr任务，完全是本地就直接读取了,相当于全表扫描，这种不慢才怪了。

2、通过jstack查看线程的执行过程

3、通过explain分析执行计划

三、问题解决

通过调整如下的参数：

hive.fetch.task.conversion

Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incur RS – ReduceSinkOperator, requiring a MapReduce task), lateral views and joins.

Supported values are none, minimal and more.
0. none: Disable hive.fetch.task.conversion
1. minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only
2. more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)

这个配置会尝试将query转换为一个fetch任务；

默认为more，将其改为none再执行上边的sql，就会提交到yarn上执行;

hive.fetch.task.conversion.threshold

Input threshold (in bytes) for applying hive.fetch.task.conversion. If target table is native, input length is calculated by summation of file lengths. If it's not native, the storage handler for the table can optionally implement the org.apache.hadoop.hive.ql.metadata.InputEstimator interface. A negative threshold means hive.fetch.task.conversion is applied without any input length threshold.
默认为1073741824 (1 GB)

本文借鉴：
| https://www.cnblogs.com/barneywill/p/10109217.html

大表数据过滤查询很慢

一、问题描述

二、问题现象 and 分析：

三、问题解决

推荐阅读