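sql - Hive: Query running for hours
Problem description
I am trying to run the following Hive query on an Azure HDInsight cluster, but it takes an extremely long time to complete. I applied some Hive settings, but they did not help. Details below: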
Table
CREATE TABLE DB_MYDB.TABLE1(
MSTR_KEY STRING,
SDNT_ID STRING,
CLSS_CD STRING,
BRNCH_CD STRING,
SECT_CD STRING,
GRP_CD STRING,
GRP_NM STRING,
SUBJ_DES STRING,
GRP_DESC STRING,
DTL_DESC STRING,
ACTV_FLAG STRING,
CMP_NM STRING)
STORED AS ORC
TBLPROPERTIES ('ORC.COMPRESS'='SNAPPY');
Hive Query
INSERT OVERWRITE TABLE DB_MYDB.TABLE1
SELECT
CURR.MSTR_KEY,
CURR.SDNT_ID,
CURR.CLSS_CD,
CURR.BRNCH_CD,
CURR.SECT_CD,
CURR.GRP_CD,
CURR.GRP_NM,
CURR.SUBJ_DES,
CURR.GRP_DESC,
CURR.DTL_DESC,
'Y',
CURR.CMP_NM
FROM DB_MYDB.TABLE2 CURR
LEFT OUTER JOIN DB_MYDB.TABLE3 PREV
ON (CURR.SDNT_ID=PREV.SDNT_ID
AND CURR.CLSS_CD=PREV.CLSS_CD
AND CURR.BRNCH_CD=PREV.BRNCH_CD
AND CURR.SECT_CD=PREV.SECT_CD
AND CURR.GRP_CD=PREV.GRP_CD
AND CURR.GRP_NM=PREV.GRP_NM)
WHERE PREV.SDNT_ID IS NULL;
But the query runs for hours. Here is the progress output:
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 46 46 0 0 0 0
Map 3 .......... SUCCEEDED 169 169 0 0 0 0
Reducer 2 .... RUNNING 1009 825 184 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/03 [======================>>----] 84% ELAPSED TIME: 13622.73 s
--------------------------------------------------------------------------------
I did set some Hive properties:
SET hive.execution.engine=tez;
SET hive.tez.container.size=10240;
SET tez.am.resource.memory.mb=10240;
SET tez.task.resource.memory.mb=10240;
SET hive.auto.convert.join.noconditionaltask.size=3470;
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.vectorized.execution.reduce.groupby.enabled=true;
SET hive.cbo.enable=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
SET hive.compute.query.using.stats=true;
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.tezfiles = true;
SET hive.merge.size.per.task=268435456;
SET hive.merge.smallfiles.avgsize=16777216;
SET hive.merge.orcfile.stripe.level=true;
Records in Tables:
DB_MYDB.TABLE2= 337319653
DB_MYDB.TABLE3= 1946526625
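These settings do not seem to have any effect on the query. Can anyone help me:
- understand why this query does not complete and runs for an indefinite amount of time?
- optimize it so that it runs faster and actually finishes?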
Using the versions:
Hadoop 2.7.3.2.6.5.3033-1
Hive 1.2.1000.2.6.5.3033-1
Azure HDInsight 3.6
Attempt_1:
As @leftjoin suggested, I tried set hive.exec.reducers.bytes.per.reducer=32000000; This worked up to the second-to-last step of the Hive script, but the last step failed with Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch failures and insufficient progress!
Last query:
INSERT OVERWRITE TABLE DB_MYDB.TABLE3
SELECT
CURR_FULL.MSTR_KEY,
CURR_FULL.SDNT_ID,
CURR_FULL.CLSS_CD,
CURR_FULL.BRNCH_CD,
CURR_FULL.GRP_CD,
CURR_FULL.CHNL_CD,
CURR_FULL.GRP_NM,
CURR_FULL.GRP_DESC,
CURR_FULL.SUBJ_DES,
CURR_FULL.DTL_DESC,
(CASE WHEN CURR_FULL.SDNT_ID = SND_DELTA.SDNT_ID THEN 'Y' ELSE
CURR_FULL.SDNT_ID_FLAG END) AS SDNT_ID_FLAG,
CURR_FULL.CMP_NM
FROM
DB_MYDB.TABLE2 CURR_FULL
LEFT OUTER JOIN DB_MYDB.TABLE1 SND_DELTA
ON (CURR_FULL.SDNT_ID = SND_DELTA.SDNT_ID);
-----------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
-----------------------------------------------------------------
Map 1 ......... RUNNING 1066 1060 6 0 0 0
Map 4 .......... SUCCEEDED 3 3 0 0 0 0
Reducer 2 RUNNING 1009 0 22 987 0 0
Reducer 3 INITED 1 0 0 1 0 0
-----------------------------------------------------------------
VERTICES: 01/04 [================>>--] 99% ELAPSED TIME: 18187.78 s
Error:
Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=8, pendingInputs=1058, fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false
Solution
If you do not have indexes on your fk columns, you should definitely add them. Here is my suggestion:
create index idx_TABLE2 on table DB_MYDB.TABLE2 (SDNT_ID,CLSS_CD,BRNCH_CD,SECT_CD,GRP_CD,GRP_NM) AS 'COMPACT' WITH DEFERRED REBUILD;
create index idx_TABLE3 on table DB_MYDB.TABLE3(SDNT_ID,CLSS_CD,BRNCH_CD,SECT_CD,GRP_CD,GRP_NM) AS 'COMPACT' WITH DEFERRED REBUILD;
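Because both indexes are created WITH DEFERRED REBUILD, they stay empty until they are rebuilt explicitly. A minimal sketch of that follow-up step, assuming the index and table names from the statements above and a Hive version that still supports indexes (such as the Hive 1.2.1 in the question):

```sql
-- Populate the deferred indexes; until this runs they contain no data.
ALTER INDEX idx_TABLE2 ON DB_MYDB.TABLE2 REBUILD;
ALTER INDEX idx_TABLE3 ON DB_MYDB.TABLE3 REBUILD;

-- Optional check: list the indexes defined on each table.
SHOW FORMATTED INDEX ON DB_MYDB.TABLE2;
SHOW FORMATTED INDEX ON DB_MYDB.TABLE3;
```

The rebuild scans each base table as a job of its own, so on tables of this size it may take a while, and it has to be repeated after the base table is overwritten.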
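Note that indexes have been removed from Hive as of version 3.0. Alternatively, you can use materialized views (supported from Hive 2.3.0 onward), which can give you the same performance.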
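Independently of the indexes, it may be worth checking whether the join key is heavily duplicated, since the last query in the question joins on SDNT_ID alone and its reducers barely progress. The following is only a diagnostic sketch using the table and column names from the question; the LIMIT of 20 is an arbitrary choice, and whether duplication is the actual cause here is an assumption to verify:

```sql
-- Count rows per join key in the delta table. Many rows per SDNT_ID would
-- multiply the output of the final LEFT OUTER JOIN and overload a few reducers.
SELECT SDNT_ID, COUNT(*) AS row_cnt
FROM DB_MYDB.TABLE1
GROUP BY SDNT_ID
ORDER BY row_cnt DESC
LIMIT 20;
```

If a handful of keys dominate the counts, deduplicating TABLE1 on SDNT_ID before the join is one option to consider.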