首页 > 解决方案 > 为什么 Hive 在预期的某些情况下不使用 MapReduce?

问题描述

我创建了一个 AWS EMR 集群,通过 SSH 连接到主节点,启动了 Hive,然后从 AWS S3 存储桶中的数据创建了一个外部表。但是在一些我希望需要执行一些映射器或减速器作业的查询中,它不会那样做。例如,对于以下查询,我希望执行一些映射器作业,因为我们要过滤到两列:

 SELECT item, store FROM tt3 LIMIT 10;

但它没有,并迅速返回结果。 explain命令确认:

Stage-0   Fetch Operator
    limit:10
    Limit [LIM_2]
      Number of rows:10
      Select Operator [SEL_1]
        Output:["_col0","_col1"]
        TableScan [TS_0]
          Output:["item","store"]

它在查询中按预期select count(*) from tt3;工作并首先运行 MapReduce 作业。

的输出EXPLAIN COUNT(*) FROM tt3;

Vertex dependency in root stage
Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 2
      File Output Operator [FS_6]
        Group By Operator [GBY_4] (rows=1 width=8)
          Output:["_col0"],aggregations:["count(VALUE._col0)"]
        <-Map 1 [CUSTOM_SIMPLE_EDGE]
          PARTITION_ONLY_SHUFFLE [RS_3]
            Group By Operator [GBY_2] (rows=1 width=8)
              Output:["_col0"],aggregations:["count()"]
              Select Operator [SEL_1] (rows=1 width=211312928)
                TableScan [TS_0] (rows=1 width=211312928)
                  default@tt3,tt3,Tbl:COMPLETE,Col:COMPLETE

标签: hivehiveqlamazon-emr

解决方案


这是 Hive 的预期行为。

在 hive 中,如果您执行简单的查询,例如select * from table将不会运行 map reduce 作业,因为我们只是从 HDFS 转储数据。

Hive# select * from foo;
+---------+-----------+----------+--+
| foo.id  | foo.name  | foo.age  |
+---------+-----------+----------+--+
| 1       | a         | 10       |
| 2       | a         | 10       |
| 3       | b         | 10       |
| 4       | c         | 20       |
+---------+-----------+----------+--+
4 rows selected (0.116 seconds)

当您进行聚合时,reducer阶段将与map阶段一起执行。

Hive# select count(*) from table group by name;
INFO  : Map 1: 0/1      Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 1/1      Reducer 2: 0/1
INFO  : Map 1: 1/1      Reducer 2: 0(+1)/1
INFO  : Map 1: 1/1      Reducer 2: 1/1
+------+--+
| _c0  |
+------+--+
| 2    |
| 1    |
| 1    |
+------+--+
3 rows selected (13.709 seconds)

我们可以通过向上面的查询添加 order by 子句来添加另一个 reducer 阶段

Hive# select count(*) cnt from foo group by name order by cnt;
INFO  : Map 1: 0/1      Reducer 2: 0/2  Reducer 3: 0/1
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2  Reducer 3: 0/1
INFO  : Map 1: 1/1      Reducer 2: 0/1  Reducer 3: 0/1
INFO  : Map 1: 1/1      Reducer 2: 0(+1)/1      Reducer 3: 0/1
INFO  : Map 1: 1/1      Reducer 2: 1/1  Reducer 3: 0(+1)/1
INFO  : Map 1: 1/1      Reducer 2: 1/1  Reducer 3: 1/1
+------+--+
| cnt  |
+------+--+
| 1    |
| 1    |
| 2    |
+------+--+

你可以看到2 个 reducer 阶段已经完成,因为在聚合之后我们正在对结果进行排序

Map1 phase:- Loads the data from HDFS.

Reduer2:- Will does aggregation

Reducer 3:- after aggregation it will order the results to ascending order.

如果您确实对上述查询进行了解释

Hive# explain select count(*) cnt from foo group by name order by cnt;
 Vertex dependency in root stage     
 Reducer 2 <- Map 1 (SIMPLE_EDGE)    
 Reducer 3 <- Reducer 2 (SIMPLE_EDGE)

当 Hive 使用 Map/Reduce 辅助作业时,请参阅链接以熟悉。


推荐阅读