Hive(四)倾斜表、桶表、表约束、字段检查

最近看了一遍hive的文档，本文是为了记录文档中将来会可用东西，并非最全的《文档》，望谅解

一：hive表

1.倾斜表

倾斜表其实就是将数据按照固定值，重新划分文件。（默认是按照keyhash值划分文件）

    create table if not exists table_test
    (
    aa string
    ) partitioned by (date string,hour string)
    skewed by (aa2) on (('s1'), ('s2'),('s3'))
    stored as orc
    row format delimited ;

以上的语句就是把数据aa字段值为s1,s2,s3的数据不要划分在同一个文件中

    alter table table_name not skewed; 更改为不倾斜表

2.临时表

    //不支持分区   不支持索引   当前会话有效   适合SQL中多次使用
    create temporary table if not exists table_test   
    (
               aa string
    ) partitioned by (date string,hour string)
    row format delimited fields terminated by '\001';

3.事务表

     //不支持分区   不支持索引
     create transactional table if not exists table_test   
    (
               aa string
    ) partitioned by (date string,hour string)
    row format delimited fields terminated by '\001';//orc支持

    显示所有事务：show transactions;

    删除事务：abort transactions 0000007 0000008 0000010 0000015;

4.分区表、常用命令

分区表其实就是将数据按照一个字段放进去不同的文件夹，减少数据扫描。
（分区表是文件夹名字就是字段名字，分文件夹）

    create table if not exists table_test
    (
             aa string
    ) partitioned by (date string,hour string)
    row format delimited fields terminated by '\001'
    location '/hive/table/table_test';

以上的语句就是建一个内表同时又是分区表，一行是一条记录，字段之间按照\001分割，数据最终存放的位置是/hive/table/table_test。

    多分区插入：alter table page_view table_testpartition (date ='2008-08-08', hour='10') location '/path/table_test/2008-08-08/10/'  (date ='2008-08-08', hour='11') location '/path/table_test/2008-08-08/11/' ;

    动态分区插入set hive.exec.dynamic.partition.mode=nonstrict;insert overwrite table page_view partition(dt='2008-06-08', country) select * from table_test

    交换分区（表之间移动数据）：alter table table_test exchange partition (dt='20200512') with table table_test_copy;

    修复分区：msck repair table table_test sync partitions ;（add drop partation）  4.x以后

    分区周期删除："discover.partitions"="true";"partition.retention.period"="7d" ;//删除七天以后的数据  日期的分区表  4.x以后

    手动删除分区：alter table table_test drop if exists partition (dt='20220308')

    归档分区：alter table table_test archive partition (dt='20220308');

    小文件合并：alter table table_test unarchive partition  (dt='20220308');

    显示分区：show partitions table_test  where hr >= 10 and ds='2010-03-03' order by hr desc limit 10;

             show table extended in ods like 'tablename' partition (dt='20220310');

    更改文件存储格式：alter table table_test (dt='20220308') set fileformat  rcfile;   元数据

    更改文件路径：alter table table_test (dt='20220308') set location  "";

    更改文件不被删除：alter table table_name enable|disable no_drop;

    更改表上线，下线：alter table table_name enable|disable offline; 

    更改列名称：alter table table_name change a a1 int;

    更改列位置：alter table test_change change a1 a2 string after b;  放到b后面

    更改位置第一：alter table test_change change c c1 int first;

    替换列：alter table test_change replace columns (a int, b int);  原表 a b c

    更改分区字段：set hive.exec.dynamic.partition = true; alter table test_change partition (ds, hr) change column dec_column_name dec_column_name decimal(38,18); 替换所有分区

    数据导出文件：insert overwrite local directory '/tmp/pv_gender_sum' stored as parquet select * from table_test;

    删除表：drop table if exists table_test; 

    清除数据：truncate table table_test partition(dt='20200512'); 外表不能操作

    更改表属性：alter table table_test set tblproperties ('comment' = new_comment); 

    显示表属性：show tblproperties tablename;

    更改列属性：alter table table_test change a1 a1 int comment 'this is column a1';

    删除表属性：alter table table_name unset serdeproperties ('field.delim');

    内部表转外部表：alter table tablename set tblproperties ('external' = 'true');   alter table table_name set tblproperties('external'='false');

    更改分区属性：alter table table_test set partition(dt='20200512') set serdeproperties ('field.delim'='\t'); 

    重命名分区：alter table table_test partition (dt='20200512') rename to partition (dt='20200512_copy');

配置

    hive.exec.dynamic.partition=true; //需要设置true为启用动态分区插入
    hive.exec.dynamic.partition.mode=strict; //模式下，strict用户必须至少指定一个静态分区，以防用户不小心覆盖了所有分区
    nonstrict模式下所有分区都允许是动态的

    hive.exec.max.dynamic.partitions.pernode=100; //每个 mapper/reducer 节点允许创建的最大动态分区数
    hive.exec.max.dynamic.partitions=; //总共允许创建的最大动态分区数
    hive.exec.max.created.files=; //作业中所有映射器/缩减器创建的 hdfs 文件的最大数量
    hive.error.on.empty.partition=false  ; //如果动态分区插入产生空结果是否抛出异常

5.桶表

桶表其实就是将数据按照一个字段放进去不同文件，减少数据扫描。（将date取hash，对结果对4取余确定文件，个人理解哈哈）

    create table if not exists table_test
    (
             aa string
    ) clustered by(date) into 4 buckets

    row format delimited fields terminated by '\001'
    location '/hive/table/table_test';

以上的语句就是建一个内表同时又是桶表，一行是一条记录，字段之间按照\001分割，数据最终存放的位置是/hive/table/table_test。

装载数据：

    insert into table_test
    select aa,date from table_aa;

    或

    load data inpath '/hive/date/table_test' overwrite into table table_test;(该操作是不会将结果分文件的，其实没有起到桶表的作用)
    set hive.enforce.bucketing = true;

7.表压缩

    表压缩的好处不言而喻，节约空间减少传输，以下是最优的选择。

    create table if not exists table_test
    (
                aa string
    ) partitioned by (date string,hour string)
    row format delimited fields terminated by '\001' stored as orc
    location '/hive/table/table_test'
    tblproperties ('orc.compress'='SNAPPY');
　　//stored as后可跟文件存储格式（TEXTFILE、SEQUENCEFILE、RCFILE、PARQUET）　　
　　// ('orc.compress'='')后可跟压缩方式(ZLIB、SNAPPY、NONE) 选项都是支持的

装载数据：

    insert into table_test

    select aa,date from table_aa;

    或

    load data inpath '/hive/date/table_test' overwrite into table table_test;(该操作是不会将结果分文件的，其实没有起到桶表的作用)

8.内表

    内表其实就是将拷贝到Hive的目录下，表和数据关联，表删除数据删除。

    create table if not exists table_test
    (
        aa string
    )
    row format delimited fields terminated by '\001'
    location '/hive/table/table_test';

以上的语句就是建一个内表，一行是一条记录，字段之间按照\001分割，数据最终存放的位置是/hive/table/table_test。

装载数据：

    insert into table_test

    select aa from table_aa;

    或

    load data inpath '/hive/date/table_test' overwrite into table table_test;(overwrite是覆盖数据，可去掉)

9.外表

内表相对来说是不安全的：因为表删除后数据就丢了。所以还有外表，将数据和表分开（外表drop后重新，什么都没有变会将数据load两份）。

    create external table if not exists table_test
    (
             aa string
    ) 
    row format delimited fields terminated by '\001'
    location '/hive/table/table_test';

以上的语句就是建一个外表，一行是一条记录，字段之间按照\001分割，数据最终存放的位置是/hive/table/table_test。

装载数据：

    load data inpath '/hive/date/table_test' overwrite into table table_test;(overwrite是覆盖数据，可去掉)

    drop table ;只删除元数据

    acid/transactional，物化视图功能，查询结果缓存 仅适用于内表

    archive/unarchive/truncate/merge/concatenate 仅适用于内表

二：hive表约束、字段检查

1.约束

启用约束:
enable( validate) :启用约束,创建索引,对已有及新加入的数据执行约束
enable novalidate :启用约束,创建索引,仅对新加入的数据强制执行约束,而不管表中的现有数据
禁用约束:
disable( novalidate):关闭约束,删除索引,可以对约束列的数据进行修改等操作
disable validate :关闭约束,删除索引,不能对表进行插入/更新/删除等操作

    主键约束：
    create table pk(
        id1 integer,
        id2 integer,
        primary key(id1, id2) disable novalidate

    );

    外键约束：
    create table fk(
        id1 integer,
        id2 integer,
        constraint c1 foreign key(id1, id2) references pk(id2, id1) disable novalidate
    );

2.字段检查

    create table constraints1(                              //字段检查
        id1 integer unique disable novalidate,
        id2 integer not null,
        usr string default current_user(),
        price double check (price > 0 and price <= 1000)
    );
    create table constraints2(
        id1 integer,
        id2 integer,
        constraint c1_unique unique(id1) disable novalidate
    );
    create table constraints3(                              //字段检查
        id1 integer,
        id2 integer,
        constraint c1_check check(id1 + id2 > 0)
    );


    alter table table_name add constraint constraint_name primary key (column, ...) disable novalidate;
    alter table table_name add constraint constraint_name foreign key (column, ...) references table_name(column, ...) disable novalidate rely;
    alter table table_name add constraint constraint_name unique (column, ...) disable novalidate;
    alter table table_name [partition partition_spec] change [column] col_old_name col_new_name column_type [comment col_comment] [first|after column_name] [cascade|restrict];
    cascade|restrict 子句在hive 1.1.0中可用。alter table add|replace columns with cascade 命令更改表元数据的列，并将相同的更改级联到所有分区元数据。restrict 是默认值，将列更改仅限于表元数据。
    alter table table_name change column column_name column_name data_type constraint constraint_name not null enable;
    alter table table_name change column column_name column_name data_type constraint constraint_name default default_value enable;
    alter table table_name change column column_name column_name data_type constraint constraint_name check check_expression enable;
    alter table table_name drop constraint constraint_name;