hadoop - Performance problems with small files on Hive
Question
I was reading an article about how small files degrade the performance of Hive queries: https://community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2017/11/07/working-with-small-files-in-hadoop-part-1
I understand the first part, about overloading the NameNode.
However, the map-reduce behavior he describes does not seem to happen, with either map-reduce or Tez as the execution engine:
"When a MapReduce job launches, it schedules one map task per block of data being processed."
I do not see a mapper task created per file. A possible reason is that he is referring to version 1 of map-reduce, and a lot has changed since then.
Hive version: Hive 1.2.1000.2.6.4.0-91
My table:
create table temp.emp_orc_small_files (id int, name string, salary int)
stored as orcfile;
Data: the following loop creates 100 small files, each containing only a few KB of data.
for i in {1..100}; do hive -e "insert into temp.emp_orc_small_files values(${i}, 'test_${i}', `shuf -i 1000-5000 -n 1`);";done
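To confirm how many HDFS files actually back the table after the loop, you can query Hive's virtual column INPUT__FILE__NAME (available since Hive 0.8) instead of browsing the warehouse directory:

```sql
-- Each insert above writes its own ORC file, so this should return 100.
select count(distinct INPUT__FILE__NAME) from temp.emp_orc_small_files;
```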
However, I see only one mapper task and one reducer task created for the following query:
[root@sandbox-hdp ~]# hive -e "select max(salary) from temp.emp_orc_small_files"
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
Logging initialized using configuration in file:/etc/hive/2.6.4.0-91/0/hive-log4j.properties
Query ID = root_20180911200039_9e1361cb-0a5d-45a3-9c98-4aead46905ac
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1536258296893_0257)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 7.36 s
--------------------------------------------------------------------------------
OK
4989
Time taken: 13.643 seconds, Fetched: 1 row(s)
The result is the same with map-reduce:
hive> set hive.execution.engine=mr;
hive> select max(salary) from temp.emp_orc_small_files;
Query ID = root_20180911200545_c4f63cc6-0ab8-4bed-80fe-b4cb545018f2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1536258296893_0259, Tracking URL = http://sandbox-hdp.hortonworks.com:8088/proxy/application_1536258296893_0259/
Kill Command = /usr/hdp/2.6.4.0-91/hadoop/bin/hadoop job -kill job_1536258296893_0259
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-09-11 20:05:57,213 Stage-1 map = 0%, reduce = 0%
2018-09-11 20:06:04,727 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.37 sec
2018-09-11 20:06:12,189 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.36 sec
MapReduce Total cumulative CPU time: 7 seconds 360 msec
Ended Job = job_1536258296893_0259
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.36 sec HDFS Read: 66478 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 360 msec
OK
4989
Solution
This happens because the following configuration is in effect:
hive.hadoop.supports.splittable.combineinputformat
From the documentation:
"Whether to combine small input files so that fewer mappers are spawned."
So essentially Hive can infer that the input is a set of small files, each smaller than the block size, and combine them to reduce the number of mappers required.
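If you want to observe one mapper per file, you can disable split combining for a test session. The settings below are standard Hive/Tez knobs, but their exact effect can vary by version and execution engine, so treat this as an experiment rather than a tuning recommendation:

```sql
-- MR engine: use the non-combining input format so each file gets its own split
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.hadoop.supports.splittable.combineinputformat=false;

-- Tez engine: split grouping is controlled by Tez, not the MR input format;
-- shrinking the grouping sizes (in bytes) forces roughly one task per file
set tez.grouping.min-size=1;
set tez.grouping.max-size=1;
```

In normal operation you want combining left on; scheduling a container per tiny file is exactly the overhead the article warns about.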