做網(wǎng)站的策劃需要做什么seo免費(fèi)軟件
旅游集市數(shù)倉建設(shè)
小白如何從0到1成為大數(shù)據(jù)工程師
目錄
旅游集市數(shù)倉建設(shè)
1.上傳數(shù)據(jù)
2.可能用到的UDF函數(shù)
3.創(chuàng)建所需數(shù)據(jù)庫及表
1)ODS層
①ods_oidd
②ods_wcdr
③ods_ddr
④ods_dpi
2)DWD層
①dwd_res_regn_mergelocation_msk_d
②dwm_staypoint_msk_d
③dws_province_tourist_msk_d
④dws_city_tourist_msk_d
⑤dws_county_tourist_msk_d
3)DIM層
①dim_usertag_msk_m
4)ADS層
1)需求矩陣
2)根據(jù)區(qū)縣游客表計算如下指標(biāo)
1.上傳數(shù)據(jù)
cd /usr/local/soft/mkdir ctyun/cd ctyun/pwd
2.可能用到的UDF函數(shù)
cd /usr/local/soft/mkdir jars/cd jars/pwd
添加資源并注冊函數(shù)
add jars /usr/local/soft/jars/jtxy_hdfs-1.0-SNAPSHOT.jar;create temporary function get_points as 'ctyun.udf.getPointsUDF';create temporary function dateBetweenUDF as 'ctyun.udf.dateBetweenUDF';create temporary function calLength as 'ctyun.udf.calLength';create temporary function get_city_or_prov_id as 'ctyun.udf.getCityIdOrProvID';
3.創(chuàng)建所需數(shù)據(jù)庫及表
create database ods;use ods;
1)ODS層
①ods_oidd
OIDD是采集A接口的信令數(shù)據(jù),包括手機(jī)在發(fā)生業(yè)務(wù)時的位置信息。OIDD信令類型數(shù)據(jù)分為三大 類,呼叫記錄、短信記錄和用戶位置更新記錄。
CREATE EXTERNAL TABLE IF NOT EXISTS ods.ods_oidd(mdn string comment '手機(jī)號碼'
,start_time string comment '業(yè)務(wù)開始時間'
,county_id string comment '區(qū)縣編碼'
,longi string comment '經(jīng)度'
,lati string comment '緯度'
,bsid string comment '基站標(biāo)識'
,grid_id string comment '網(wǎng)格號'
,biz_type string comment '業(yè)務(wù)類型'
,event_type string comment '事件類型'
,data_source string comment '數(shù)據(jù)源'
)
comment 'oidd位置數(shù)據(jù)表'PARTITIONED BY (day_id string comment '天分區(qū)'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/data/tour/ods/ods_oidd'; // 添加分區(qū)
alter table ods.ods_oidd add partition(day_id=20180503);// 加載數(shù)據(jù)
load data local inpath '/usr/local/soft/ctyun/ods_oidd/day_id=20180503/*' into table ods_oidd partition(day_id=20180503);
//查看數(shù)據(jù)select * from ods.ods_oidd limit 10;
dfs -mkdir -p /data/tour/ods/ods_oidd;dfs -ls /data/tour/ods/ods_oidd;dfs -ls /data/tour/ods/;dfs -rmr /data/tour/ods/ods_oidd;
②ods_wcdr
WCDR采集網(wǎng)絡(luò)中ABIS接口的數(shù)據(jù),基于業(yè)務(wù)發(fā)生過程中三個扇區(qū)的測量信息,通過三角定位法 確定用戶的位置信息。
CREATE EXTERNAL TABLE IF NOT EXISTS ods.ods_wcdr (mdn string comment '手機(jī)號碼'
,start_time string comment '業(yè)務(wù)開始時間'
,county_id string comment '區(qū)縣編碼'
,longi string comment '經(jīng)度'
,lati string comment '緯度'
,bsid string comment '基站標(biāo)識'
,grid_id string comment '網(wǎng)格號'
,biz_type string comment '業(yè)務(wù)類型'
,event_type string comment '事件類型'
,data_source string comment '數(shù)據(jù)源'
)
comment 'wcdr位置數(shù)據(jù)表'PARTITIONED BY (day_id string comment '天分區(qū)'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/data/tour/ods/ods_wcdr'; // 添加分區(qū)
alter table ods.ods_wcdr add partition(day_id=20180503);// 加載數(shù)據(jù)
load data local inpath '/usr/local/soft/ctyun/ods_wcdr/day_id=20180503/*' into
table ods_wcdr partition(day_id=20180503);
//查看數(shù)據(jù)select * from ods.ods_wcdr limit 10;
③ods_ddr
當(dāng)前DDR中只有移動數(shù)據(jù)詳單可以提取基站標(biāo)識,其他語音,短信,增值等業(yè)務(wù)沒有位置信息, 不做為數(shù)據(jù)融合的基礎(chǔ)數(shù)據(jù)。
CREATE EXTERNAL TABLE IF NOT EXISTS ods.ods_ddr(mdn string comment '手機(jī)號碼'
,start_time string comment '業(yè)務(wù)開始時間'
,county_id string comment '區(qū)縣編碼'
,longi string comment '經(jīng)度'
,lati string comment '緯度'
,bsid string comment '基站標(biāo)識'
,grid_id string comment '網(wǎng)格號'
,biz_type string comment '業(yè)務(wù)類型'
,event_type string comment '事件類型'
,data_source string comment '數(shù)據(jù)源'
)
comment 'ddr位置數(shù)據(jù)表'PARTITIONED BY (day_id string comment '天分區(qū)'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/data/tour/ods/ods_ddr'; // 添加分區(qū)
alter table ods.ods_ddr add partition(day_id=20180503);// 加載數(shù)據(jù)
load data local inpath '/usr/local/soft/ctyun/ods_ddr/day_id=20180503/*' into
table ods_ddr partition(day_id=20180503);
// 查詢數(shù)據(jù)select * from ods.ods_ddr limit 10;
④ods_dpi
移動DPI數(shù)據(jù)采集用戶移動用戶數(shù)據(jù)上網(wǎng)時移動核心網(wǎng)和PDSN之間接口的數(shù)據(jù)。
CREATE EXTERNAL TABLE IF NOT EXISTS ods.ods_dpi(mdn string comment '手機(jī)號碼'
,start_time string comment '業(yè)務(wù)開始時間'
,county_id string comment '區(qū)縣編碼'
,longi string comment '經(jīng)度'
,lati string comment '緯度'
,bsid string comment '基站標(biāo)識'
,grid_id string comment '網(wǎng)格號'
,biz_type string comment '業(yè)務(wù)類型'
,event_type string comment '事件類型'
,data_source string comment '數(shù)據(jù)源'
)
comment 'dpi位置數(shù)據(jù)表'PARTITIONED BY (day_id string comment '天分區(qū)'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/data/tour/ods/ods_dpi'; // 添加分區(qū)
alter table ods.ods_dpi add partition(day_id=20180503);// 加載數(shù)據(jù)
load data local inpath '/usr/local/soft/ctyun/ods_dpi/day_id=20180503/*' into
table ods_dpi partition(day_id=20180503);
// 查詢數(shù)據(jù)select * from ods.ods_dpi limit 10;
2)DWD層
create database dwd;use dwd;
①dwd_res_regn_mergelocation_msk_d
在ODS層中,由于數(shù)據(jù)來源不同,原始位置數(shù)據(jù)被分成了好幾張表加載到了我們的ODS層。 為了方便大家的使用,我們在DWD層做了一張位置數(shù)據(jù)融合表,在這里,我們將oidd、wcdr、 ddr、dpi位置數(shù)據(jù)匯聚到一張表里面,統(tǒng)一字段名,提升數(shù)據(jù)質(zhì)量,這樣就有了一張可供大家方 便使用的明細(xì)表了。
CREATE EXTERNAL TABLE IF NOT EXISTS dwd.dwd_res_regn_mergelocation_msk_d (mdn string comment '手機(jī)號碼'
,start_time string comment '業(yè)務(wù)開始時間'
,county_id string comment '區(qū)縣編碼'
,longi string comment '經(jīng)度'
,lati string comment '緯度'
,bsid string comment '基站標(biāo)識'
,grid_id string comment '網(wǎng)格號'
,biz_type string comment '業(yè)務(wù)類型'
,event_type string comment '事件類型'
,data_source string comment '數(shù)據(jù)源'
)
comment '位置數(shù)據(jù)融合表'PARTITIONED BY (day_id string comment '天分區(qū)'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS ORCFilelocation '/data/tour/dwd/dwd_res_regn_mergelocation_msk_d'; // 添加分區(qū)
alter table dwd.dwd_res_regn_mergelocation_msk_d add partition(day_id=20180503);
手動下載數(shù)據(jù)
// hive 直接運(yùn)行速度太慢,可用手動load/put文件方式
// 手動load
// 注意 上面的建表語句文件存儲格式修改為了ORCFile 所以不能直接使用下面的load方法load data local inpath '/usr/local/soft/ctyun/dwd_merge/part-00000*' into
table dwd.dwd_res_regn_mergelocation_msk_d partition(day_id=20180503);// union allinsert into table dwd.dwd_res_regn_mergelocation_msk_d
partition(day_id="20180503")select mdn ,start_time ,county_id ,longi ,lati ,bsid ,grid_id ,biz_type ,event_type ,data_source
from ods.ods_oiddwhere day_id = "20180503"union allselect mdn ,start_time ,county_id ,longi ,lati ,bsid ,grid_id ,biz_type ,event_type ,data_source
from ods.ods_wcdrwhere day_id = "20180503"union allselect mdn ,start_time ,county_id ,longi ,lati ,bsid ,grid_id ,biz_type ,event_type ,data_source
from ods.ods_dpiwhere day_id = "20180503"union allselect mdn ,start_time ,county_id ,longi ,lati ,bsid ,grid_id
,biz_type
,event_type
,data_source
from ods.ods_ddrwhere day_id = "20180503";
②dwm_staypoint_msk_d
計算一個人在一個網(wǎng)格內(nèi)的停留時間,按手機(jī)號,網(wǎng)格id,區(qū)縣id分組
1、對所有時間進(jìn)行排序
2、取第一個點的開始時間和最后一個點的結(jié)束時間
create database dwm;use dwm;
CREATE EXTERNAL TABLE IF NOT EXISTS dwm.dwm_staypoint_msk_d (mdn string comment '用戶手機(jī)號碼'
,longi string comment '網(wǎng)格中心點經(jīng)度'
,lati string comment '網(wǎng)格中心點緯度'
,grid_id string comment '停留點所在電信內(nèi)部網(wǎng)格號'
,county_id string comment '停留點區(qū)縣'
,duration string comment '機(jī)主在停留點停留的時間長度(分鐘),lTime-eTime'
,grid_first_time string comment '網(wǎng)格第一個記錄位置點時間(秒級)'
,grid_last_time string comment '網(wǎng)格最后一個記錄位置點時間(秒級)'
)
comment '停留點表'PARTITIONED BY (day_id string comment '天分區(qū)'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILElocation '/data/tour/dwm/dwm_staypoint_msk_d';
通過grid_id 網(wǎng)格id 獲取 網(wǎng)格中心點經(jīng)緯度 longi、lati
該SQL執(zhí)行會出現(xiàn)問題: 執(zhí)行流程一直處于 0% Map 0% reduce
/**
insert into table dwm.dwm_staypoint_msk_d partition(day_id=20180503)
select t1.mdn,get_points(grid_id)[0] as longi,get_points(grid_id)[1] as lati,t1.grid_id,t1.county_id,dateBetweenUDF(t1.grid_first_time,t1.grid_last_time) as duration,t1.grid_first_time,t1.grid_last_time
from (select mdn,grid_id,county_id,min(split(start_time,',')[0]) as grid_first_time,max(split(start_time,',')[1]) as grid_last_timefrom dwd.dwd_res_regn_mergelocation_msk_dwhere day_id="20180503"group by mdn, grid_id, county_id
)t1;
*/優(yōu)化后的SQL:WITH split_table as (
SELECT
mdn
,grid_id
,county_id
,split(start_time,',')[1] as grid_first_time
,split(start_time,',')[0] as grid_last_time
FROM dwd.dwd_res_regn_mergelocation_msk_d
where day_id="20180503"
)
, max_min_table as (
SELECT
mdn
,grid_id
,county_id
,Max(grid_first_time) OVER(PARTITION BY mdn,grid_id,county_id) as grid_first_time
,MIN(grid_last_time) OVER(PARTITION BY mdn,grid_id,county_id) as grid_last_time
FROM split_table
)insert into table dwm.dwm_staypoint_msk_d partition(day_id=20180503)
SELECT
t1.mdn
,get_points(t1.grid_id)[0] as longi
,get_points(t1.grid_id)[1] as lati
,t1.grid_id
,t1.county_id
,dateBetweenUDF(t1.grid_first_time,t1.grid_last_time) as duration
,t1.grid_first_time
,t1.grid_last_time
FROM (
SELECT
mdn
,grid_id
,county_id
,grid_first_time
,grid_last_time
FROM max_min_table
group by
mdn
,grid_id
,county_id
,grid_first_time
,grid_last_time
) t1
③dws_province_tourist_msk_d
游客定義 出行距離大于300km 常住地在用戶畫像表中 在省內(nèi)停留時間大于3個小時
create database dws;use dws;
CREATE EXTERNAL TABLE IF NOT EXISTS dws.dws_province_tourist_msk_d (mdn string comment '手機(jī)號大寫MD5加密'
,source_county_id string comment '游客來源區(qū)縣'
,d_province_id string comment '旅游目的地省代碼'
,d_stay_time double comment '游客在該省停留的時間長度(小時)'
,d_max_distance double comment '游客本次出游距離'
)
comment '旅游應(yīng)用專題數(shù)據(jù)省級別-天'PARTITIONED BY (day_id string comment '日分區(qū)'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS PARQUETlocation '/data/tour/dws/dws_province_tourist_msk_d';
停留點表dwm_staypoint_msk_d與用戶畫像維表dim_usertag_msk_m 通過mdn關(guān)聯(lián),使用 get_city_or_prov_id(county_id,"province")方法,傳入county_id,返回province_id,然后按 mdn、province_id、resi_county_id分組,使用calLength(grid_id, resi_grid_id) 傳入網(wǎng)格id、居 住地網(wǎng)格id,算出出行距離,并計算每個用戶到每個省的累計出行時間,然后取出 累計時間最大 值超過3小時(180分鐘),出行距離大于300km的用戶
④dws_city_tourist_msk_d
出行距離大于100km 在市內(nèi)停留時間大于3個小時
CREATE EXTERNAL TABLE IF NOT EXISTS dws.dws_city_tourist_msk_d (mdn string comment '手機(jī)號大寫MD5加密'
,source_county_id string comment '游客來源區(qū)縣'
,d_city_id string comment '旅游目的地市代碼'
,d_stay_time double comment '游客在該省市停留的時間長度(小時)'
,d_max_distance double comment '游客本次出游距離'
)
comment '旅游應(yīng)用專題數(shù)據(jù)城市級別-天'PARTITIONED BY (day_id string comment '日分區(qū)'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS PARQUETlocation '/data/tour/dws/dws_city_tourist_msk_d';
停留點表dwm_staypoint_msk_d與用戶畫像維表dim_usertag_msk_m 通過mdn關(guān)聯(lián),使用 get_city_or_prov_id(county_id,"city")方法,傳入county_id,返回city_id,然后按mdn、city_id、 resi_county_id分組,使用calLength(grid_id, resi_grid_id) 傳入網(wǎng)格id、居住地網(wǎng)格id,算出出行 距離,并計算每個用戶到每個市的累計出行時間,然后取出 累計時間最大值超過3小時(180分 鐘),出行距離大于100km的用戶
⑤dws_county_tourist_msk_d
出行距離大于10km 在縣內(nèi)停留時間大于3個小時
CREATE EXTERNAL TABLE IF NOT EXISTS dws.dws_county_tourist_msk_d (mdn string comment '手機(jī)號大寫MD5加密' ,source_county_id string comment '游客來源區(qū)縣' ,d_county_id string comment '旅游目的地縣代碼' ,d_stay_time double comment '游客在該縣停留的時間長度(小時)' ,d_max_distance double comment '游客本次出游距離'
)
comment '旅游應(yīng)用專題數(shù)據(jù)縣級別-天'PARTITIONED BY (day_id string comment '日分區(qū)'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUETlocation '/data/tour/dws/dws_county_tourist_msk_d';
停留點表dwm_staypoint_msk_d與用戶畫像維表dim_usertag_msk_m 通過mdn關(guān)聯(lián),按mdn、 county_id、resi_county_id分組,使用calLength(grid_id, resi_grid_id) 傳入網(wǎng)格id、居住地id, 算出出行距離,并計算每個用戶到每個縣的累計出行時間,然后取出 累計時間最大值超過3小時 (180分鐘),出行距離大于10km的用戶
運(yùn)行SQL時報錯 我們要創(chuàng)建一下UDF函數(shù)
add jars /usr/local/soft/jars/jtxy_hdfs-1.0-SNAPSHOT.jar;create temporary function get_points as 'ctyun.udf.getPointsUDF';create temporary function dateBetweenUDF as 'ctyun.udf.dateBetweenUDF';create temporary function calLength as 'ctyun.udf.calLength';create temporary function get_city_or_prov_id as 'ctyun.udf.getCityIdOrProvID';
insert into table dws.dws_county_tourist_msk_d partition(day_id="20180503")select ttt1.mdn,ttt1.source_county_id,ttt1.d_county_id,ttt1.d_stay_time,ttt1.d_max_distancefrom(select mdn,resi_county_id as source_county_id,county_id as d_county_id,sum(duration) as d_stay_time,max(calLength(tt1.grid_id,tt1.resi_grid_id)) as d_max_distancefrom(select t1.mdn,t1.grid_id,t1.county_id,t1.duration,t2.resi_county_id,t2.resi_grid_idfrom (select *from dwm.dwm_staypoint_msk_d
where day_id='20180503') t1 join(select *from dim.dim_usertag_msk_mwhere month_id='201805') t2 on t1.mdn = t2.mdn) tt1 group by tt1.mdn,tt1.county_id,tt1.resi_county_id)ttt1 where d_stay_time > 180 and d_max_distance > 10000;
3)DIM層
create database dim;use dim;
①dim_usertag_msk_m
CREATE EXTERNAL TABLE IF NOT EXISTS dim.dim_usertag_msk_m (mdn string comment '手機(jī)號大寫MD5加密'
,name string comment '姓名'
,gender string comment '性別,1男2女'
,age string comment '年齡'
,id_number string comment '證件號碼'
,number_attr string comment '號碼歸屬地'
,trmnl_brand string comment '終端品牌'
,trmnl_price string comment '終端價格',packg string comment '套餐'
,conpot string comment '消費(fèi)潛力'
,resi_grid_id string comment '常住地網(wǎng)格'
,resi_county_id string comment '常住地區(qū)縣'
)
comment '用戶畫像表'PARTITIONED BY (month_id string comment '月分區(qū)'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS PARQUETlocation '/data/tour/dim/dim_usertag_msk_m'; // 添加分區(qū)
alter table dim.dim_usertag_msk_m add partition(month_id=201805);// 加載數(shù)據(jù)
load data local inpath
'/usr/local/soft/ctyun/dim_usertag_msk_m/month_id=201805/*' into table
dim.dim_usertag_msk_m partition(month_id=201805);
// 查詢數(shù)據(jù)select * from dim_usertag_msk_m limit 10;
4)ADS層
根據(jù)需求建設(shè)
1)需求矩陣
2)根據(jù)區(qū)縣游客表計算如下指標(biāo)
客流量按天 [區(qū)縣id,客流量]
select t1.d_county_id,count(*) as d_county_cnt
from (select d_county_idfrom dws.dws_county_tourist_msk_dwhere t1.day_id="20180503"
) t1 group by t1.d_county_id;
性別按天 [區(qū)縣id,性別,客流量]
select t1.d_county_id,t2.gender,count(*) as d_county_gender_cnt
from(select mdn,d_county_idfrom dws.dws_county_tourist_msk_dwhere day_id="20180503"
) t1 left join (select mdn,genderfrom dim.dim_usertag_msk_mwhere month_id=20180503
) t2 on t1.mdn = t2.mdn
group by t1.d_county_id,t2.gender;
年齡按天 [區(qū)縣id,年齡,客流量]
常住地按天 [區(qū)縣id,常住地市,客流量]
歸屬地按天 [區(qū)縣id,歸屬地市,客流量]
select t1.d_county_id,t2.number_attr,count(*) as d_county_number_attr_cnt
from(select mdn,d_county_idfrom dws.dws_county_tourist_msk_dwhere day_id="20180503"
) t1 left join (select mdn,number_attrfrom dim.dim_usertag_msk_mwhere month_id=20180503
) t2 on t1.mdn = t2.mdn
group by t1.d_county_id,t2.number_attr;
終端型號按天 [區(qū)縣id,終端型號,客流量]
消費(fèi)等級按天 [區(qū)縣id,消費(fèi)等級,客流量]
停留時長按天 [區(qū)縣id,停留時長,客流量]