sql - Hive record count drops
Problem description
I created a Hive table from a CSV:
CREATE TABLE RECORD_CSV(
completed_on string, distance_travelled double,
end_location_lat double, end_location_long double,
started_on string, driver_rating double,
rider_rating double, start_zip_code int,
end_zip_code int, charity_id int,
requested_car_category string, free_credit_used double,
surge_factor double, start_location_long double,
start_location_lat double, color string,
make string, model string, year int,
rating double, Date string, PRCP double,
TMAX double, TMIN double, AWND double,
GustSpeed2 double, Fog double, HeavyFog double,
Thunder double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
When I run
SELECT COUNT(*) FROM RECORD_CSV;
it returns
OK
911057
Time taken: 21.403 seconds, Fetched: 1 row(s)
When I create another table partitioned by the field color using the commands below, the row count drops.
CREATE TABLE RECORD_CSV_BYCOLOR(completed_on string, distance_travelled double,
end_location_lat double ,end_location_long double,
started_on string ,driver_rating double ,rider_rating double ,
start_zip_code int ,end_zip_code int ,charity_id int,
requested_car_category string,free_credit_used double,
surge_factor double,start_location_long double,start_location_lat double ,
make string ,model string ,year int ,rating double,Date string,PRCP double,
TMAX double,TMIN double,AWND double,GustSpeed2 double,
Fog double,HeavyFog double,Thunder double
)
PARTITIONED BY (color string)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ','
STORED AS TEXTFILE;
INSERT OVERWRITE table RECORD_CSV_BYCOLOR PARTITION(color)
select completed_on, distance_travelled,end_location_lat,
end_location_long, started_on, driver_rating, rider_rating,
start_zip_code, end_zip_code, charity_id, requested_car_category,
free_credit_used, surge_factor, start_location_long, start_location_lat,
make, model, year, rating, Date, PRCP, TMAX, TMIN, AWND, GustSpeed2,
Fog, HeavyFog, Thunder, color FROM RECORD_CSV;
When I run SELECT COUNT(*) FROM RECORD_CSV_BYCOLOR;
I see that the record count has dropped:
OK
693991
Time taken: 21.552 seconds, Fetched: 1 row(s)
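One quick check at this point is to list the partitions that the dynamic-partition insert actually created and compare them with the distinct color values in the source table. This is a minimal sketch using the tables above; it only gathers information and does not by itself explain the drop.

```sql
-- List the partitions created by the dynamic-partition insert.
SHOW PARTITIONS RECORD_CSV_BYCOLOR;

-- Count the distinct color values in the source table for comparison.
SELECT COUNT(DISTINCT color) FROM RECORD_CSV;
```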
Below is the breakdown by color, using GROUP BY, for table RECORD_CSV:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 7.11 sec HDFS Read: 165793766 HDFS Write: 349 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 110 msec
OK
Silver 634
Black 204004
Bronze 214
Burgundy 1587
GREEN 195
Gold 6346
Gray 644
Maroon 847
Silver 170241
Silver 147
Tan 1066
Teal 913
White 152919
White 404
Yellow/Gold 20540
Blue 90
Brown 18594
Gray 134155
Navy Blue 48
Red 80352
WHITE 52
Yellow 448
Black 361
Blue 81999
Dark Blue 199
Dark Grey 18
Green 15396
Grey 12503
Magenta 324
Orange 5817
Time taken: 25.186 seconds, Fetched: 30 row(s)
and below is the same breakdown for RECORD_CSV_BYCOLOR:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.48 sec HDFS Read: 30648230 HDFS Write: 281 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 480 msec
OK
Silver 634
Black 361
Blue 90
Bronze 214
Brown 18594
Burgundy 1587
Dark Blue 199
Dark Grey 18
GREEN 195
Gold 6346
Gray 644
Green 15396
Grey 12503
Magenta 324
Maroon 847
Navy Blue 48
Orange 5817
Red 80352
Silver 147
Tan 1066
Teal 913
WHITE 52
White 404
Yellow 448
Yellow/Gold 20540
Time taken: 20.937 seconds, Fetched: 25 row(s)
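The GROUP BY on the source table gives counts for the same color twice, and the target table keeps only the row with the smallest count. That seems to be where the difference comes from, but why does this happen? What should I change in my code?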
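Values that print identically but fall into separate groups often differ in characters the console does not show, such as leading or trailing whitespace. That is an assumption about this data, not something the output above proves. A sketch of a query that would make such differences visible, using the built-in length() and hex() functions:

```sql
-- Show each distinct color with visible delimiters, its length,
-- and its raw bytes, so values that look identical on screen
-- can be told apart.
SELECT color,
       concat('[', color, ']') AS bracketed,
       length(color)           AS len,
       hex(color)              AS raw_hex,
       COUNT(*)                AS cnt
FROM RECORD_CSV
GROUP BY color
ORDER BY color;
```

Two rows whose bracketed or raw_hex values differ while the plain color looks the same would point to hidden characters in the CSV field.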
Solution
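If the duplicated groups turn out to differ only by surrounding whitespace (an assumption; the output in the question does not confirm it), one thing to try is normalizing the partition value with trim() in the insert, so that every variant of a color lands in the same partition. A minimal sketch reusing the original statement; the two SET lines are the usual dynamic-partition settings and may already be enabled in the session:

```sql
-- Assumed session settings for a fully dynamic partition insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Same insert as in the question, but with the partition column trimmed.
INSERT OVERWRITE TABLE RECORD_CSV_BYCOLOR PARTITION(color)
SELECT completed_on, distance_travelled, end_location_lat,
       end_location_long, started_on, driver_rating, rider_rating,
       start_zip_code, end_zip_code, charity_id, requested_car_category,
       free_credit_used, surge_factor, start_location_long, start_location_lat,
       make, model, year, rating, Date, PRCP, TMAX, TMIN, AWND, GustSpeed2,
       Fog, HeavyFog, Thunder,
       trim(color) AS color  -- strip leading/trailing spaces from the partition value
FROM RECORD_CSV;
```

Afterwards, SELECT COUNT(*) FROM RECORD_CSV_BYCOLOR; can be compared with the source count of 911057. Note that trim() does not change case, so GREEN and Green remain distinct values.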