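Contents
- 1. Requirements for the chat data analysis case
- 2. Implementing the requirements on a Hive data warehouse
- 2.1 Creating the database
- 2.2 Creating the table
- 2.3 Loading the data
- 2.4 ETL data cleaning
- 2.5 Computing the requirement metrics (all straightforward)
- 3. Building visual reports with FineBI
- 3.1 Introduction to FineBI
- 3.2 Configuring data in FineBI
- 3.3 Building the visual report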
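1. Requirements for the chat data analysis case
MapReduce is slow, so Hive is used instead.
Background: with a large number of users online, analyzing the chat data makes it possible to build user profiles, provide better service, run high-ROI platform operations and promotion, and give the company's decisions accurate data support.
Goal: use Hadoop and Hive to compute statistics on the chat data and build a chat data analysis report.
Requirements:
- Count today's total number of messages
- Count today's messages per hour, plus the number of sending and receiving users per hour
- Count today's messages sent per region
- Count today's number of sending users and receiving users
- Find today's Top 10 users by messages sent
- Find today's Top 10 users by messages received
- Compute the distribution of senders' phone models
- Compute the distribution of senders' device operating systems
Raw data: one day (24 hours) of user chat data exported from the business system, as TSV files. Column delimiter: tab (\t).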
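2. Implementing the requirements on a Hive data warehouse
In Notepad, turning on the option that shows all characters reveals which delimiter the file uses.
Open DataGrip, create a Hive project with Hive as the language, and connect it to the Hive server.
In DataGrip: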
2.1 Creating the database
--------------1. Create the database-------------------
--drop the database if it already exists
drop database if exists db_msg cascade;
--create the database
create database db_msg;
--switch to the database
use db_msg;
2.2 Creating the table
--------------2. Create the table-------------------
--drop the table if it already exists
drop table if exists db_msg.tb_msg_source;
--create the table
create table db_msg.tb_msg_source(
    msg_time string comment "message send time",
    sender_name string comment "sender nickname",
    sender_account string comment "sender account",
    sender_sex string comment "sender gender",
    sender_ip string comment "sender IP address",
    sender_os string comment "sender operating system",
    sender_phonetype string comment "sender phone model",
    sender_network string comment "sender network type",
    sender_gps string comment "sender GPS location",
    receiver_name string comment "receiver nickname",
    receiver_ip string comment "receiver IP address",
    receiver_account string comment "receiver account",
    receiver_os string comment "receiver operating system",
    receiver_phonetype string comment "receiver phone model",
    receiver_network string comment "receiver network type",
    receiver_gps string comment "receiver GPS location",
    receiver_sex string comment "receiver gender",
    msg_type string comment "message type",
    distance string comment "distance between the two parties",
    message string comment "message content"
)
--use tab as the field delimiter
row format delimited fields terminated by '\t';
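To confirm that the table was created with the expected columns and delimiter, the definition can be inspected before any data is loaded. A minimal sketch using Hive's standard describe commands; this check is an addition and not part of the original walkthrough:

```sql
-- list the columns with their types and comments
describe db_msg.tb_msg_source;

-- show storage details as well, including the field delimiter
describe formatted db_msg.tb_msg_source;
```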
2.3 Loading the data
--------------3. Load the data-------------------
--upload the data files to the local file system of the node1 server (the machine where HiveServer2 runs)
--shell: mkdir -p /root/hivedata
--load the data into the table
load data local inpath '/root/hivedata/data1.tsv' into table db_msg.tb_msg_source;
load data local inpath '/root/hivedata/data2.tsv' into table db_msg.tb_msg_source;
--query the table to verify that the data files were mapped correctly
select * from tb_msg_source limit 10;
--count the rows
select count(*) as cnt from tb_msg_source;
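2.4 ETL data cleaning
After loading, check whether the loaded data is valid and clean it where needed (ETL).
Problems and solutions:
- Some records have an empty sender_gps field. How should they be handled? Filter with where length(sender_gps) > 0, which keeps only the non-empty records.
- Only the hour is needed from the time field. Use substr(field, 12, 2) to extract the two-digit hour.
- Longitude and latitude are stored together in one GPS field but are needed separately. Use split(field, ',') to split the field on the comma.
- How can the ETL result be saved into a new Hive table? Use the CTAS syntax:
create table … as select …, which creates the table structure and fills in the data in one step.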
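The string functions listed above can be tried on literal values before touching the real table. The sketch below assumes that msg_time has the form yyyy-MM-dd HH:mm:ss and that sender_gps holds "longitude,latitude"; the sample values are made up for illustration and are not taken from the data set.

```sql
-- sample values are hypothetical; Hive string positions start at 1
select
    substr('2021-11-01 07:33:15', 1, 10) as dayinfo,     -- '2021-11-01'
    substr('2021-11-01 07:33:15', 12, 2) as hourinfo,    -- '07'
    split('116.30,39.90', ',')[0]        as sender_lng,  -- '116.30'
    split('116.30,39.90', ',')[1]        as sender_lat;  -- '39.90'
```

Note that a length of 2 is needed for the hour, since a length of 1 would return only its first digit.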
--ETL implementation
--drop the table if it already exists
drop table if exists db_msg.tb_msg_etl;
--save the result of the select statement into a new table
create table db_msg.tb_msg_etl as
select
    *,
    substr(msg_time,1,10) as dayinfo, --extract the day (positions start at 1)
    substr(msg_time,12,2) as hourinfo, --extract the hour
    split(sender_gps,",")[0] as sender_lng, --extract the longitude
    split(sender_gps,",")[1] as sender_lat --extract the latitude
from db_msg.tb_msg_source
--keep only rows whose sender_gps is not empty
where length(sender_gps) > 0;
The table is large, so remember to add limit 10 when querying it.
--verify the ETL result
select
    msg_time,dayinfo,hourinfo,sender_gps,sender_lng,sender_lat
from db_msg.tb_msg_etl
limit 10;
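It can also help to see how many rows the filter removed. The sketch below counts the source rows whose sender_gps is empty or NULL, using only the tables defined above; the aliases total_cnt and empty_gps_cnt are illustrative.

```sql
select
    count(*)                              as total_cnt,     -- rows before cleaning
    sum(if(length(sender_gps) > 0, 0, 1)) as empty_gps_cnt  -- rows the ETL filter drops
from db_msg.tb_msg_source;
```

total_cnt minus empty_gps_cnt should equal the row count of db_msg.tb_msg_etl.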
2.5 Computing the requirement metrics (all straightforward)
Requirement 1: count today's total number of messages
Group by day, then count.
create table if not exists tb_rs_total_msg_cnt
comment "total messages today"
as
select
    dayinfo,
    count(*) as total_msg_cnt
from db_msg.tb_msg_etl
group by dayinfo;
--verify the result
select * from tb_rs_total_msg_cnt;
Requirement 2: count today's messages per hour, plus the number of sending and receiving users per hour
Group by day and hour, then count.
create table if not exists tb_rs_hour_msg_cnt
comment "hourly message volume trend"
as
select
    dayinfo,
    hourinfo,
    count(*) as total_msg_cnt,
    count(distinct sender_account) as sender_usr_cnt,
    count(distinct receiver_account) as receiver_usr_cnt
from db_msg.tb_msg_etl
group by dayinfo,hourinfo;
--verify the result
select * from tb_rs_hour_msg_cnt;
Requirement 3: count today's messages sent per region
Group by day and by GPS location.
Every column in the select list must either appear in the group by clause or be wrapped in an aggregate function, which is why the longitude and latitude columns are also part of the grouping.
The cast function converts the longitude and latitude from string to the numeric type double:
cast(sender_lng as double)
create table if not exists tb_rs_loc_cnt
comment "messages sent today per region"
as
select
    dayinfo,
    sender_gps,
    cast(sender_lng as double) as longitude,
    cast(sender_lat as double) as latitude,
    count(*) as total_msg_cnt
from db_msg.tb_msg_etl
group by dayinfo,sender_gps,sender_lng,sender_lat;
--verify the result
select * from tb_rs_loc_cnt;
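Because cast returns NULL when a string is not a valid number, it may be worth checking whether any coordinates fail to convert. A sketch with a hypothetical literal first, then a count against the ETL table; the aliases are illustrative.

```sql
-- literal values are hypothetical
select
    cast('116.30' as double) as longitude,     -- numeric value 116.3
    cast('abc' as double)    as not_a_number;  -- NULL, the string is not numeric

-- count rows whose coordinates cannot be converted
select count(*) as bad_gps_cnt
from db_msg.tb_msg_etl
where cast(sender_lng as double) is null
   or cast(sender_lat as double) is null;
```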
Requirement 4: count today's number of sending users and receiving users
Group by day and count distinct users.
create table if not exists tb_rs_usr_cnt
comment "number of sending users and receiving users today"
as
select
    dayinfo,
    count(distinct sender_account) as sender_usr_cnt,
    count(distinct receiver_account) as receiver_usr_cnt
from db_msg.tb_msg_etl
group by dayinfo;
--verify the result
select * from tb_rs_usr_cnt;
Requirement 5: find today's Top 10 users by messages sent
Group by day and user, count, sort in descending order, then limit 10.
create table if not exists tb_rs_susr_top10
comment "Top 10 users by number of messages sent"
as
select
    dayinfo,
    sender_name as username,
    count(*) as sender_msg_cnt
from db_msg.tb_msg_etl
group by dayinfo,sender_name
order by sender_msg_cnt desc
limit 10;
--verify the result
select * from tb_rs_susr_top10;
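Applying limit 10 to the grouped result returns the ten largest counts across the whole table. That matches the requirement here because the data set covers a single day. If the table held several days, a per-day ranking would be needed instead. A sketch using the row_number window function, with the aliases daily, ranked and rn chosen for illustration:

```sql
select dayinfo, username, sender_msg_cnt
from (
    select
        dayinfo,
        username,
        sender_msg_cnt,
        -- rank senders within each day, highest count first
        row_number() over (partition by dayinfo order by sender_msg_cnt desc) as rn
    from (
        -- messages sent per day and sender
        select dayinfo, sender_name as username, count(*) as sender_msg_cnt
        from db_msg.tb_msg_etl
        group by dayinfo, sender_name
    ) daily
) ranked
where rn <= 10;
```

The same pattern would apply to receivers by swapping in receiver_name.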
Requirement 6: find today's Top 10 users by messages received
Group by day and user, count, sort in descending order, then limit 10.
create table if not exists tb_rs_rusr_top10
comment "Top 10 users by number of messages received"
as
select
    dayinfo,
    receiver_name as username,
    count(*) as receiver_msg_cnt
from db_msg.tb_msg_etl
group by dayinfo,receiver_name
order by receiver_msg_cnt desc
limit 10;
--verify the result
select * from tb_rs_rusr_top10;
Requirement 7: compute the distribution of senders' phone models
Group by day and phone model, then count distinct users.
create table if not exists tb_rs_sender_phone
comment "distribution of senders' phone models"
as
select
    dayinfo,
    sender_phonetype,
    count(distinct sender_account) as cnt
from tb_msg_etl
group by dayinfo,sender_phonetype;
--verify the result
select * from tb_rs_sender_phone;
Requirement 8: compute the distribution of senders' device operating systems
create table if not exists tb_rs_sender_os
comment "distribution of senders' operating systems"
as
select
    dayinfo,
    sender_os,
    count(distinct sender_account) as cnt
from tb_msg_etl
group by dayinfo,sender_os;
--verify the result
select * from tb_rs_sender_os;
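A distribution is often easier to read as percentages than as raw counts. The sketch below derives each operating system's share from the result table above with a window sum; the pct alias is illustrative, and the same pattern would work for tb_rs_sender_phone.

```sql
select
    dayinfo,
    sender_os,
    cnt,
    -- share of this OS among all per-OS user counts of the same day
    round(cnt * 100.0 / sum(cnt) over (partition by dayinfo), 2) as pct
from db_msg.tb_rs_sender_os;
```

A user who sent messages from more than one operating system is counted once under each, so the percentages describe shares of these per-OS counts and not of unique users.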
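3. Building visual reports with FineBI
This is the visualization stage.
3.1 Introduction to FineBI
FineBI: https://www.finebi.com/
FineBI features: multi-user collaboration, drag-and-drop with no code required, suitable for a wide range of analysis scenarios, support for many chart types, and support for big data.
It is already downloaded and installed.
3.2 Configuring data in FineBI
Connect the data in Hive to the BI tool.
Official documentation for integrating FineBI with Hive: https://help.fanruan.com/finebi/doc-view-301.html
The driver configuration and plugin installation are already done, so FineBI can connect to the Hive data directly.
Data configuration steps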
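Before picking tables in FineBI, it may help to list the result tables on the Hive side so that their names are at hand. A sketch assuming the tb_rs prefix used for the result tables in section 2.5:

```sql
-- list the result tables created for the eight requirements
show tables in db_msg like 'tb_rs*';
```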
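3.3 Building the visual report
The report is built with drag-and-drop operations in FineBI.
Final result:
Summary: a very simple case, but it walks through the entire data analysis workflow.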