Flume Notes

Custom Flume interceptor: implement the Interceptor interface
Custom Flume source: extend AbstractSource
Custom Flume sink: extend AbstractSink

Azkaban: a job scheduling tool; ordinary usage is enough.
It covers job scheduling, timed execution, and dependencies between jobs.

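Sqoop: a data import/export tool
import: bring data from a relational database into the big data platform
export: push data from the big data platform back to a relational database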

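Import MySQL data into HDFS, specifying the field delimiter and the target directory; -m sets how many map tasks run the import.
For 100 GB of data, 10 to 30 map tasks is a reasonable range; the import should finish within about half an hour.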

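Incremental import has three options: --check-column, --incremental, and --last-value.
In practice it is usually done with a --where condition or with --query instead.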

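In real work, every table usually maintains three fields: create_time, update_time, is_deleted
In real work, deletes are almost always soft deletes.
update_time makes it possible to pick out the rows updated or inserted each day.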

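If a record changes, one person ends up with several rows in the warehouse. What then?
And what about deleted rows? Treat them as updates (a soft delete only changes is_deleted and update_time).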
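One possible way to handle both questions is to keep only the newest version of each record and then drop the soft-deleted ones. The sketch below is plain Python over an in-memory list, not the warehouse implementation; the row layout (id, name, update_time, is_deleted) and the function name latest_snapshot are assumptions made for illustration.

```python
from datetime import datetime


def latest_snapshot(rows):
    """Keep the newest version of each id, then drop soft-deleted rows."""
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        # A row replaces the stored one only if its update_time is newer.
        if current is None or row["update_time"] > current["update_time"]:
            latest[row["id"]] = row
    return sorted(
        (r for r in latest.values() if not r["is_deleted"]),
        key=lambda r: r["id"],
    )


# Hypothetical sample: id 1 was updated, id 2 was soft-deleted on the second day.
rows = [
    {"id": 1, "name": "zhangsan", "update_time": datetime(2013, 9, 17), "is_deleted": False},
    {"id": 1, "name": "zhangsan2", "update_time": datetime(2013, 9, 18), "is_deleted": False},
    {"id": 2, "name": "lisi", "update_time": datetime(2013, 9, 17), "is_deleted": False},
    {"id": 2, "name": "lisi", "update_time": datetime(2013, 9, 18), "is_deleted": True},
]
snapshot = latest_snapshot(rows)
```

In Hive the same idea is usually written with row_number() over (partition by id order by update_time desc) and a filter on the first row.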

DataX: another data import/export tool

Running Linux commands remotely from Java code: sshxcute.jar

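Clickstream log analysis: mainly the analysis of nginx log data
Clickstream log models:
Raw structured data table:
pageView model: focuses on each individual page view
visit model: focuses on each session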

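Common website analysis metrics:
IP: the number of distinct IP addresses that visited today.
PV: page views, the total number of pages viewed; every page load counts once.
UV: unique visitors, the number of distinct people who visited, identified by cookie.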

Basic metrics: number of visits, time on site, time on page

Composite metrics: pages per visitor (PV / distinct visitors), bounce rate, exit rate
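The metrics above can be illustrated on a handful of records. This is a minimal sketch in plain Python; the record fields (ip, cookie, page) and the function name site_metrics are assumptions for the example, not the layout of the real log tables.

```python
def site_metrics(records):
    """Compute PV, UV, IP and pages per visitor from page-view records."""
    pv = len(records)                          # every page view counts once
    uv = len({r["cookie"] for r in records})   # distinct visitors, by cookie
    ip = len({r["ip"] for r in records})       # distinct IP addresses
    return {
        "pv": pv,
        "uv": uv,
        "ip": ip,
        "pages_per_visitor": pv / uv if uv else 0.0,
    }


# Hypothetical sample: two visitors share one IP address.
records = [
    {"ip": "10.0.0.1", "cookie": "u1", "page": "/"},
    {"ip": "10.0.0.1", "cookie": "u1", "page": "/a"},
    {"ip": "10.0.0.1", "cookie": "u2", "page": "/"},
    {"ip": "10.0.0.2", "cookie": "u3", "page": "/b"},
]
metrics = site_metrics(records)
```

The sample shows why IP and UV differ: several cookies can sit behind one address.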

Source analysis: which channel the users came from
Page analysis: analysis of the visits the site received

Offline log analysis processing pipeline:
Log collection: Flume, with source: Taildir Source, channel: memory channel, sink: HDFS Sink
Preprocessing: MapReduce
Loading: load the data into Hive tables
Analysis: HQL statements
Report display: ECharts

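Dimensional modeling basics: dimensional modeling is a common technique for analysis in a data warehouse

Fact table: its main job is to faithfully record events that have happened; a fact is always something that has already occurred
Dimension table: looks at the recorded events from different angles, so the same facts give different views
Example: yesterday I had a coffee at Starbucks and spent 200 yuan.
Time: yesterday
Place: Starbucks
Amount: 200 yuan

"From the front it is a ridge, from the side a peak; far, near, high, low, no two views are alike" (Su Shi)

Fact tables and dimension tables have different emphases: a fact table captures the event as a whole, a dimension table captures one aspect of it

Three approaches to dimensional modeling:
Star schema: shaped like a star in the sky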

ods_weblog_origin has a field time_local in the format yyyy-MM-dd HH:mm:ss

Task: count the PVs between 06:00 and 07:00

select count(1) from ods_weblog_origin where substring(time_local,12,2) = '06';

The same for 07:00 to 08:00:
select count(1) from ods_weblog_origin where substring(time_local,12,2) = '07';

Task: PVs for every hour. If the table had an hour column this would simply be:
select hour,count(1) from ods_weblog_origin group by hour;

So split the time field into parts
2013-09-18 06:49:18 ==> year: 2013 month: 09 day: 18 hour: 06

select substring(time_local,12,2) as hour,count(1) from ods_weblog_origin group by substring(time_local,12,2);

Data redundancy is allowed in a data warehouse.
The time field is split with substring into year, month, day, hour
http_referer is split with parse_url_tuple into host, path, query, queryId
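The two splits can be tried outside Hive. The sketch below uses plain Python string slicing and urllib.parse.urlsplit as stand-ins for Hive's substring and parse_url_tuple; the function names split_time and split_referer are made up for the example.

```python
from urllib.parse import urlsplit


def split_time(time_local):
    """Split 'yyyy-MM-dd HH:mm:ss' into its parts by position."""
    return {
        "year": time_local[0:4],
        "month": time_local[5:7],
        "day": time_local[8:10],
        # Same characters as Hive's substring(time_local, 12, 2), which is 1-based.
        "hour": time_local[11:13],
    }


def split_referer(http_referer):
    """Split a referrer URL into host, path and query; '-' means no referrer."""
    cleaned = http_referer.strip('"')
    if cleaned in ("", "-"):
        return {"host": None, "path": None, "query": None}
    parts = urlsplit(cleaned)
    return {
        "host": parts.hostname,
        "path": parts.path or None,
        "query": parts.query or None,
    }


time_parts = split_time("2013-09-18 06:49:18")
referer_parts = split_referer(
    '"http://cos.name/category/software/packages/?username=zhangsan"'
)
```

Storing the split fields next to the original column is the kind of redundancy the warehouse accepts in exchange for simpler group by queries.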

http_referer: the URL of the previous page, i.e. where the visitor came from

"http://cos.name/category/software/packages/?username=zhangsan" ==> jd.com

"http://baidu.com/category/software/packages/?username=zhangsan" ==> jd.com

"http://google.com/category/software/packages/?username=zhangsan" ==> jd.com

"http://360.com/category/software/packages/?username=zhangsan" ==> jd.com

Count how much traffic comes from each site:

select hosts,count(1) from ods_weblog_origin group by hosts

+---------------------------+---------------------------------+---------------------------------+--------------------------------+-----------------------------------------------+----------------------------+-------------------------------------+------------------------------------------------+----------------------------------------------------+-----------------------------+--------------------------+-------------------------------+---------------------------+------------------------------+--+
| t_ods_tmp_referurl.valid | t_ods_tmp_referurl.remote_addr | t_ods_tmp_referurl.remote_user | t_ods_tmp_referurl.time_local | t_ods_tmp_referurl.request | t_ods_tmp_referurl.status | t_ods_tmp_referurl.body_bytes_sent | t_ods_tmp_referurl.http_referer | t_ods_tmp_referurl.http_user_agent | t_ods_tmp_referurl.datestr | t_ods_tmp_referurl.host | t_ods_tmp_referurl.path | t_ods_tmp_referurl.query | t_ods_tmp_referurl.query_id |
+---------------------------+---------------------------------+---------------------------------+--------------------------------+-----------------------------------------------+----------------------------+-------------------------------------+------------------------------------------------+----------------------------------------------------+-----------------------------+--------------------------+-------------------------------+---------------------------+------------------------------+--+
| false | 194.237.142.21 | - | 2013-09-18 06:49:18 | /wp-content/uploads/2013/07/rstudio-git3.png | 304 | 0 | "-" | "Mozilla/4.0(compatible;)" | 20130918 | NULL | NULL | NULL | NULL |
| false | 163.177.71.12 | - | 2013-09-18 06:49:33 | / | 200 | 20 | "-" | "DNSPod-Monitor/1.0" | 20130918 | NULL | NULL | NULL | NULL |
| false | 163.177.71.12 | - | 2013-09-18 06:49:36 | / | 200 | 20 | "-" | "DNSPod-Monitor/1.0" | 20130918 | NULL | NULL | NULL | NULL |
| false | 101.226.68.137 | - | 2013-09-18 06:49:42 | / | 200 | 20 | "-" | "DNSPod-Monitor/1.0" | 20130918 | NULL | NULL | NULL | NULL |
| false | 101.226.68.137 | - | 2013-09-18 06:49:45 | / | 200 | 20 | "-" | "DNSPod-Monitor/1.0" | 20130918 | NULL | NULL | NULL | NULL |
| false | 60.208.6.156 | - | 2013-09-18 06:49:48 | /wp-content/uploads/2013/07/rcassandra.png | 200 | 185524 | "http://cos.name/category/software/packages/" | "Mozilla/5.0(WindowsNT6.1)AppleWebKit/537.36(KHTML,likeGecko)Chrome/29.0.1547.66Safari/537.36" | 20130918 | cos.name | /category/software/packages/ | NULL | NULL |
| false | 222.68.172.190 | - | 2013-09-18 06:49:57 | /images/my.jpg | 200 | 19939 | "http://www.angularjs.cn/A00n" | "Mozilla/5.0(WindowsNT6.1)AppleWebKit/537.36(KHTML,likeGecko)Chrome/29.0.1547.66Safari/537.36" | 20130918 | www.angularjs.cn | /A00n | NULL | NULL |
| false | 183.195.232.138 | - | 2013-09-18 06:50:16 | / | 200 | 20 | "-" | "DNSPod-Monitor/1.0" | 20130918 | NULL | NULL | NULL | NULL |
| false | 183.195.232.138 | - | 2013-09-18 06:50:16 | / | 200 | 20 | "-" | "DNSPod-Monitor/1.0" | 20130918 | NULL | NULL | NULL | NULL |
| false | 66.249.66.84 | - | 2013-09-18 06:50:28 | /page/6/ | 200 | 27777 | "-" | "Mozilla/5.0(compatible;Googlebot/2.1;+http://www.google.com/bot.html)" | 20130918 | NULL | NULL | NULL | NULL |
+---------------------------+---------------------------------+---------------------------------+--------------------------------+-----------------------------------------------+----------------------------+-------------------------------------+------------------------------------------------+----------------------------------------------------+-----------------------------+--------------------------+-------------------------------+---------------------------+------------------------------+--+

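PVs by hour

PVs by referrer
1. Count the PVs produced by each referrer URL in each hour
Words such as "each" and "per" in a requirement mark the fields to group by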

select month,day,hour,ref_host,count(1) from ods_weblog_detail group by month,day,hour,ref_host limit 10;

2. Count the PVs produced by each referrer host in each hour, sorted

select ref_host,month,day,hour,count(1) as total_count from ods_weblog_detail group by ref_host,month,day,hour
order by total_count desc ;

05 google.com google.com baidu.com 360.com 360.com
06 google.com 360.com baidu.com baidu.com

05 google.com 2
05 360.com 2
05 baidu.com 1

06 baidu.com 2
06 google.com 1
06 360.com 1

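-- Requirement: along the time dimension, find the top N (here top 2) referrer hosts by PVs for each hour of a day

A first attempt with max(count(1)) is not valid, because aggregate functions cannot be nested.
Adding limit 2 does not help either: limit applies to the whole result, but two rows are needed from every group (10 groups should give 20 rows).
Number the rows inside each hour with row_number() and keep the first two:
select month,day,hour,ref_host,pvs from (
select month,day,hour,ref_host,pvs,row_number() over (partition by month,day,hour order by pvs desc) as rn from (
select month,day,hour,ref_host,count(1) as pvs from ods_weblog_detail group by month,day,hour,ref_host
) t1
) t2 where rn <= 2;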
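The grouped top 2 on the small sample for hours 05 and 06 can be worked through in plain Python. This is only a sketch of the logic, not Hive code; the function name top_n_per_hour is made up, and ties are broken by host name purely to keep the result deterministic (Hive gives no such guarantee).

```python
from collections import Counter, defaultdict


def top_n_per_hour(visits, n):
    """visits: list of (hour, ref_host). Returns {hour: [(host, pv), ...]}."""
    counts = Counter(visits)  # (hour, host) -> pv, the "group by" step
    by_hour = defaultdict(list)
    for (hour, host), pv in counts.items():
        by_hour[hour].append((host, pv))
    result = {}
    for hour, hosts in by_hour.items():
        # Order inside each group by pv descending, then host name for ties.
        ranked = sorted(hosts, key=lambda item: (-item[1], item[0]))
        result[hour] = ranked[:n]  # keep n rows per group, not n rows overall
    return result


visits = [
    ("05", "google.com"), ("05", "google.com"), ("05", "baidu.com"),
    ("05", "360.com"), ("05", "360.com"),
    ("06", "google.com"), ("06", "360.com"),
    ("06", "baidu.com"), ("06", "baidu.com"),
]
top2 = top_n_per_hour(visits, 2)
```

Two groups give four rows, which is the "n rows per group" behaviour that a plain limit cannot produce.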

Grouped top N in Hive
https://www.cnblogs.com/wujin/p/6051768.html
id name sal
1 a 10
2 a 12
3 b 13
4 b 12
5 a 14
6 a 15
7 a 13
8 b 11
9 a 16
10 b 17
11 a 14

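Task: for each user, find the three largest tip amounts

Grouped top N uses window functions: row_number() over (...), dense_rank() over (...), rank() over (...)
The columns after sal below are row_number, dense_rank, rank
(for group b the three values are identical, so only one is shown)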

9 a 16 1 1 1
6 a 15 2 2 2
11 a 14 3 3 3
5 a 14 4 3 3
7 a 13 5 4 5
2 a 12 6 5 6
1 a 10 7 6 7

10 b 17 1
3 b 13 2
4 b 12 3
8 b 11 4

select id,
name,
sal,
rank()over(partition by name order by sal desc ) rp,
dense_rank() over(partition by name order by sal desc ) drp,
row_number()over(partition by name order by sal desc) rmp
from f_test;

rp drp rmp
10 b 17 1 1 1
3 b 13 2 2 2
4 b 12 3 3 3
8 b 11 4 4 4

9 a 16 1 1 1
6 a 15 2 2 2
11 a 14 3 3 3
5 a 14 3 3 4
7 a 13 5 4 5
2 a 12 6 5 6
1 a 10 7 6 7
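To see why the three functions differ only on ties, the f_test sample can be ranked by hand in plain Python. This is a sketch of the semantics, not how Hive evaluates window functions; the function name rank_rows is made up, and ties are ordered by id descending only so that the output matches the listing above (Hive leaves the order of tied rows unspecified).

```python
def rank_rows(rows):
    """rows: list of (id, name, sal).

    Returns (id, name, sal, rank, dense_rank, row_number) per row,
    partitioned by name and ordered by sal descending.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[1], []).append(row)
    out = []
    for name in sorted(groups):
        ordered = sorted(groups[name], key=lambda r: (-r[2], -r[0]))
        rank = dense = 0
        prev_sal = None
        for row_number, (id_, _, sal) in enumerate(ordered, start=1):
            if sal != prev_sal:
                rank = row_number  # rank jumps to the row position after a tie
                dense += 1         # dense_rank grows by one per distinct value
                prev_sal = sal
            out.append((id_, name, sal, rank, dense, row_number))
    return out


rows = [
    (1, "a", 10), (2, "a", 12), (3, "b", 13), (4, "b", 12), (5, "a", 14),
    (6, "a", 15), (7, "a", 13), (8, "b", 11), (9, "a", 16), (10, "b", 17),
    (11, "a", 14),
]
ranked = rank_rows(rows)
# Top three per name, using row_number so that exactly three rows come back.
top3 = [r for r in ranked if r[5] <= 3]
```

Filtering on row_number returns exactly three rows per group; filtering on rank or dense_rank can return more when values tie.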

Hive functions worth knowing: rows to columns, columns to rows, grouped top N, explode, reflect

-- Requirement: the average number of pages requested per visitor today.
-- total page requests / number of distinct visitors

Page analysis: analysis of the visits the site received
1. PVs for each page

request is the requested URL; each URL corresponds to one page
select request,count(1) from ods_weblog_detail group by request

2. Top 10 visited pages in the 20130918 partition
select request,count(1) as total_count from ods_weblog_detail where datestr = '20130918' group by request having request is not null order by total_count desc limit 10;

3. Top 10 most popular pages for each day

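Visitor analysis: analysis focused on users

Visits are counted by session
New vs. old visitors: whether the visitor has been to the site before
Returning visitors: came back several times
Single-visit visitors: visited only once

1. Requirement: count unique visitors and the PVs they produce, along the time dimension
Unique visitor: each distinct user who visits. Visitors are told apart by cookie; the query below uses remote_addr as a stand-in.

PVs per unique visitor per hour
select month,day,hour,remote_addr,count(1) from ods_weblog_detail group by month,day,hour,remote_addr;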

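Visit analysis:
-- Returning / single-visit visitors
Find which users are returning visitors, i.e. visited several times within a day
In the visit table:
sessionId remote_addr
1 192.168.52.100
2 192.168.52.100

visit table

select remote_addr,count(1) as total_count from visit group by remote_addr having total_count > 1

All of today's returning visitors and their visit counts:

select remote_addr,count(1) as total_count from visit where datestr = '20130918' group by remote_addr having total_count > 1

-- Average visit frequency
How many visits one person makes on average
total number of visits / number of distinct visitors

select count(1)/count(distinct remote_addr) from visit

-- Average page views per visitor
How many pages one person views on average

select sum(pageVisits)/count(distinct remote_addr) from visit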
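The three visit queries boil down to counting sessions per address. A minimal sketch in plain Python, assuming each visit row is (sessionId, remote_addr, pageVisits); the third sample row and the function name visitor_stats are made up for illustration.

```python
from collections import Counter


def visitor_stats(visits):
    """visits: list of (session_id, remote_addr, page_visits), one row per session."""
    sessions_per_visitor = Counter(addr for _, addr, _ in visits)
    visitors = len(sessions_per_visitor)  # count(distinct remote_addr)
    if visitors == 0:
        return {"returning": {}, "visits_per_visitor": 0.0, "pages_per_visitor": 0.0}
    return {
        # group by remote_addr having count(1) > 1
        "returning": {a: n for a, n in sessions_per_visitor.items() if n > 1},
        # count(1) / count(distinct remote_addr)
        "visits_per_visitor": len(visits) / visitors,
        # sum(pageVisits) / count(distinct remote_addr)
        "pages_per_visitor": sum(p for _, _, p in visits) / visitors,
    }


visits = [
    (1, "192.168.52.100", 3),
    (2, "192.168.52.100", 5),
    (3, "192.168.52.110", 2),  # hypothetical second visitor
]
stats = visitor_stats(visits)
```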

Requirement 1: total tips each user received in each month
select username,month,sum(salary) from t_salary_detail group by username,month;

+-----------+----------+------+--+
| username | month | _c2 |
+-----------+----------+------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| A | 2015-03 | 16 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
| B | 2015-03 | 17 |
+-----------+----------+------+--+

Requirement 2: cumulative tips received by each user, month by month

+-----------+----------+---------+-------------+
| username | month | salary | cumulative |
+-----------+----------+---------+-------------+
| A | 2015-01 | 33 | 33 |
| A | 2015-02 | 10 | 43 |
| A | 2015-03 | 16 | 59 |
| B | 2015-01 | 30 | 30 |
| B | 2015-02 | 15 | 45 |
| B | 2015-03 | 17 | 62 |
+-----------+----------+---------+-------------+

select * from (
select username,month,sum(salary) as salary from t_salary_detail group by username,month
) tempTable1 inner join (select username,month,sum(salary) as salary from t_salary_detail group by username,month)
tempTable2 on tempTable1.username = tempTable2.username where tempTable2.month <= tempTable1.month;

+----------------------+-------------------+--------------------+----------------------+-------------------+--------------------+--+
| temptable1.username | temptable1.month | temptable1.salary | temptable2.username | temptable2.month | temptable2.salary |
+----------------------+-------------------+--------------------+----------------------+-------------------+--------------------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |

| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |

| A | 2015-03 | 16 | A | 2015-01 | 33 |
| A | 2015-03 | 16 | A | 2015-02 | 10 |
| A | 2015-03 | 16 | A | 2015-03 | 16 |

| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
| B | 2015-03 | 17 | B | 2015-01 | 30 |
| B | 2015-03 | 17 | B | 2015-02 | 15 |
| B | 2015-03 | 17 | B | 2015-03 | 17 |
+----------------------+-------------------+--------------------+----------------------+-------------------+--------------------+--+

select tempTable1.username,tempTable1.month,sum(tempTable2.salary) from (
select username,month,sum(salary) as salary from t_salary_detail group by username,month
) tempTable1 inner join (select username,month,sum(salary) as salary from t_salary_detail group by username,month)
tempTable2 on tempTable1.username = tempTable2.username where tempTable2.month <= tempTable1.month group by tempTable1.username,tempTable1.month;

+----------------------+-------------------+------+--+
| temptable1.username | temptable1.month | _c2 |
+----------------------+-------------------+------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 43 |
| A | 2015-03 | 59 |
| B | 2015-01 | 30 |
| B | 2015-02 | 45 |
| B | 2015-03 | 62 |
+----------------------+-------------------+------+--+
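The self-join approach can be checked outside Hive with Python's built-in sqlite3 module. This is a sketch, assuming the monthly totals from requirement one are already stored in a table (here called monthly_tips, a name made up for the example); SQLite is only a stand-in for Hive, but the join and group by have the same shape.

```python
import sqlite3

# Monthly totals per user, taken from the result of requirement one.
monthly = [
    ("A", "2015-01", 33), ("A", "2015-02", 10), ("A", "2015-03", 16),
    ("B", "2015-01", 30), ("B", "2015-02", 15), ("B", "2015-03", 17),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table monthly_tips (username text, month text, salary integer)")
conn.executemany("insert into monthly_tips values (?, ?, ?)", monthly)

# Join each month to itself and to every earlier month of the same user,
# then sum the joined side to get the running total.
cumulative = conn.execute(
    """
    select t1.username, t1.month, sum(t2.salary) as total
    from monthly_tips t1
    inner join monthly_tips t2 on t1.username = t2.username
    where t2.month <= t1.month
    group by t1.username, t1.month
    order by t1.username, t1.month
    """
).fetchall()
conn.close()
```

The month strings compare correctly because the yyyy-MM format sorts the same way lexically and chronologically.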

Cascading (cumulative) sum in Hive

dw_oute_numbs
+---------------------+----------------------+--+
| dw_oute_numbs.step | dw_oute_numbs.numbs |
+---------------------+----------------------+--+
| step1 | 1029 |
| step2 | 1029 |
| step3 | 1028 |
| step4 | 1018 |
+---------------------+----------------------+--+

Conversion rate of each step relative to the first step
select a.numbs/b.numbs from dw_oute_numbs a inner join dw_oute_numbs b where b.step = 'step1';

+---------+----------+---------+----------+--+
| a.step | a.numbs | b.step | b.numbs |
+---------+----------+---------+----------+--+
| step1 | 1029 | step1 | 1029 |
| step2 | 1029 | step1 | 1029 |
| step3 | 1028 | step1 | 1029 |
| step4 | 1018 | step1 | 1029 |
+---------+----------+---------+----------+--+

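Conversion rate of each step relative to the previous step
Start from the self join without a condition, which yields every pair of steps:
select * from dw_oute_numbs a inner join dw_oute_numbs b ;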

| step1 | 1029 | step2 | 1029 |
| step2 | 1029 | step2 | 1029 |
| step3 | 1028 | step2 | 1029 |
| step4 | 1018 | step2 | 1029 |

| step1 | 1029 | step3 | 1028 |
| step2 | 1029 | step3 | 1028 |
| step3 | 1028 | step3 | 1028 |
| step4 | 1018 | step3 | 1028 |

| step1 | 1029 | step4 | 1018 |
| step2 | 1029 | step4 | 1018 |
| step3 | 1028 | step4 | 1018 |
| step4 | 1018 | step4 | 1018 |
+---------+----------+---------+----------+--+

step2 -1 = step1

select b.step,b.numbs/a.numbs from dw_oute_numbs a inner join dw_oute_numbs b on cast(substring(b.step,5,1) as int) - 1 = cast(substring(a.step,5,1) as int);
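Both conversion rates can be computed directly from the four step counts in dw_oute_numbs. A sketch in plain Python rather than Hive; the function name conversion_rates is made up, and it assumes the steps arrive already sorted in funnel order.

```python
def conversion_rates(steps):
    """steps: ordered list of (step_name, count).

    Returns (step_name, rate_vs_first_step, rate_vs_previous_step) per step.
    """
    first = steps[0][1]
    rates = []
    for i, (name, count) in enumerate(steps):
        # The first step has no predecessor, so it is compared with itself.
        previous = steps[i - 1][1] if i > 0 else count
        rates.append((name, count / first, count / previous))
    return rates


steps = [("step1", 1029), ("step2", 1029), ("step3", 1028), ("step4", 1018)]
rates = conversion_rates(steps)
```

Looking one position back in the list plays the role of the join condition that matches each step number with the step number minus one.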

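Hive functions used so far:
parse_url_tuple
substring
concat
cast
sum
count
Date and time functions deserve extra attention

Grouping functions:
Cascading sum:
More functions: see the Hive documentation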

In real work, keep in mind that tables in the dw layer are almost always stored as ORC or Parquet

Data export:
/export/servers/sqoop-1.4.6-cdh5.14.0/bin/sqoop export --connect jdbc:mysql://192.168.29.22:3306/weblog --username root --password 123456 --m 1 --export-dir /user/hive/warehouse/weblog.db/dw_pvs_everyday --table dw_pvs_everyday --input-fields-terminated-by '\001'

/export/servers/sqoop-1.4.6-cdh5.14.0/bin/sqoop export --connect jdbc:mysql://192.168.29.22:3306/weblog --username root --password 123456 --m 1 --export-dir /user/hive/warehouse/weblog.db/dw_pvs_everyhour_oneday/datestr=20130918 --table dw_pvs_everyhour_oneday --input-fields-terminated-by '\001'

/export/servers/sqoop-1.4.6-cdh5.14.0/bin/sqoop export --connect jdbc:mysql://192.168.29.22:3306/weblog --username root --password 123456 --m 1 --export-dir /user/hive/warehouse/weblog.db/dw_pvs_referer_everyhour/datestr=20130918 --table dw_pvs_referer_everyhour --input-fields-terminated-by '\001'

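Workflow scheduling:
Flume data collection: no scheduling needed
The three MR programs (data cleaning, the pageView model, the visit model) need to run on a schedule
ods-layer tables: partitioned tables; each day's data is loaded into its own partition, so no drop table is needed, but the load has to run on a schedule
dw-layer result tables: need a drop or truncate every day, run on a schedule
Result export: needs to run on a schedule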

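Course summary:

Building the visit model involves the three MR programs above
Create the tables in Hive and load the data: weblog, pageView, visit
Split the weblog table's time field and its http_referer field

Analysis by module
Page analysis,
traffic analysis and so on; grouped top N functions; cascading sum (a table joined to itself)

Result export ==> Sqoop export
Scheduled jobs ==> Azkaban
Report display: integration of the three major frameworks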
