Table of Contents
Spark Load: Importing Hive Data
I. Importing Data from a Non-Partitioned Hive Table with Spark Load
1. Prepare the data to load into the Hive table on the node3 Hive client
2. Start Hive, create the Hive table on the Hive client, and load the data
3. Create a Hive external table in Doris
4. Create the Doris table
5. Create the Spark Load import job
6. Check the Spark Load job
7. Check the result in Doris
II. Importing Data from a Partitioned Hive Table with Spark Load
1. Prepare the data to load into the Hive table on the node3 Hive client
2. Create the partitioned Hive table and load the data
3. Create the partitioned Doris table
4. Create the Spark Load import job
5. Check the Spark Load job
6. Check the result in Doris
Spark Load: Importing Hive Data
I. Importing Data from a Non-Partitioned Hive Table with Spark Load
1. Prepare the data to load into the Hive table on the node3 Hive client
hive_data1.txt:
1,zs,18,100
2,ls,19,101
3,ww,20,102
4,ml,21,103
5,tq,22,104
2. Start Hive, create the Hive table on the Hive client, and load the data
#Configure the Hive server-side $HIVE_HOME/conf/hive-site.xml
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
Note: this setting disables metastore schema version verification, which avoids errors when Doris reads the Hive external table later.
#Start the Hive metastore on node1
[root@node1 ~]# hive --service metastore &
#On node3, open the Hive client, create the table, and load the data
create table hive_tbl (id int, name string, age int, score int) row format delimited fields terminated by ',';
load data local inpath '/root/hive_data1.txt' into table hive_tbl;
#Check the data in the Hive table
hive> select * from hive_tbl;
1 zs 18 100
2 ls 19 101
3 ww 20 102
4 ml 21 103
5 tq 22 104
3. Create a Hive external table in Doris
When using Spark Load to import data from a non-partitioned Hive table into Doris, first create a Hive external table in Doris, then use Spark Load to load the data from that external table into a Doris table.
#Create the Hive external table in Doris
CREATE EXTERNAL TABLE example_db.hive_doris_tbl
(
id INT,
name varchar(255),
age INT,
score INT
)
ENGINE=hive
properties
(
"dfs.nameservices"="mycluster",
"dfs.ha.namenodes.mycluster"="node1,node2",
"dfs.namenode.rpc-address.mycluster.node1"="node1:8020",
"dfs.namenode.rpc-address.mycluster.node2"="node2:8020",
"dfs.client.failover.proxy.provider.mycluster" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
"database" = "default",
"table" = "hive_tbl",
"hive.metastore.uris" = "thrift://node1:9083"
);
Note:
- Creating a Hive external table in Doris does not store any data in Doris. Querying the external table reads the data from the table's path on HDFS on the fly, and rows newly inserted into the Hive table are also visible when the external table is queried from Doris (see the short example after the query result below).
- If the Hive table is partitioned, simply treat its partition columns as ordinary columns when creating the external table in Doris.
Querying the Hive external table above returns:
mysql> select * from hive_doris_tbl;
+------+------+------+-------+
| id | name | age | score |
+------+------+------+-------+
| 1 | zs | 18 | 100 |
| 2 | ls | 19 | 101 |
| 3 | ww | 20 | 102 |
| 4 | ml | 21 | 103 |
| 5 | tq | 22 | 104 |
+------+------+------+-------+
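As a quick check of the first note above, a row appended on the Hive side shows up immediately when the external table is queried from Doris. The row used here (id 6) is hypothetical and not part of the sample data set:
#In the Hive client, append a hypothetical row
hive> insert into table hive_tbl values (6,'xx',23,105);
#Back in the Doris MySQL client, the new row is visible through the external table without any reload
mysql> select * from hive_doris_tbl where id = 6;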
4. Create the Doris table
#Create the Doris table
create table spark_load_t2(
id int,
name varchar(255),
age int,
score double
)
ENGINE = olap
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(`id`) BUCKETS 8;
5. Create the Spark Load import job
When a Spark Load job runs, it is converted under the hood into a Spark job that performs the import, and that job needs to connect to Hive. Make sure the hive-site.xml configuration file is present under $SPARK_HOME/conf/ on the Spark client nodes node1-node3 so that Hive can be found. Connecting to Hive also requires the MySQL connector, so make sure mysql-connector-java-5.1.47.jar is present under $HADOOP_HOME/share/hadoop/yarn/lib on every Yarn NodeManager node.
#Distribute the Hive client's hive-site.xml to $SPARK_HOME/conf on the Spark client nodes (node1-node3)
[root@node3 ~]# scp /software/hive-3.1.3/conf/hive-site.xml node1:/software/spark-2.3.1/conf/
[root@node3 ~]# scp /software/hive-3.1.3/conf/hive-site.xml node2:/software/spark-2.3.1/conf/
[root@node3 ~]# cp /software/hive-3.1.3/conf/hive-site.xml /software/spark-2.3.1/conf/
#Distribute mysql-connector-java-5.1.47.jar to $HADOOP_HOME/share/hadoop/yarn/lib on every NodeManager node
[root@node3 ~]# cp /software/hive-3.1.3/lib/mysql-connector-java-5.1.47.jar /software/hadoop-3.3.3/share/hadoop/yarn/lib/
[root@node3 ~]# scp /software/hive-3.1.3/lib/mysql-connector-java-5.1.47.jar node4:/software/hadoop-3.3.3/share/hadoop/yarn/lib/
[root@node3 ~]# scp /software/hive-3.1.3/lib/mysql-connector-java-5.1.47.jar node5:/software/hadoop-3.3.3/share/hadoop/yarn/lib/
Write the Spark Load job as follows:
LOAD LABEL example_db.label2
(
DATA FROM TABLE hive_doris_tbl
INTO TABLE spark_load_t2
)
WITH RESOURCE 'spark1'
(
"spark.executor.memory" = "1g",
"spark.shuffle.compress" = "true"
)
PROPERTIES
(
"timeout" = "3600"
);
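The WITH RESOURCE 'spark1' clause refers to a Spark resource that must already exist in Doris. If it has not been set up yet, it can be created along the following lines; the master, ResourceManager address, defaultFS, working_dir, and broker name below are assumptions based on the cluster used in this article and must be adapted to your environment:
-- Sketch of a Spark resource for Spark Load (values are assumptions, adjust to your cluster)
CREATE EXTERNAL RESOURCE "spark1"
PROPERTIES
(
"type" = "spark",
"spark.master" = "yarn",
"spark.submit.deployMode" = "cluster",
"spark.hadoop.yarn.resourcemanager.address" = "node1:8032",
"spark.hadoop.fs.defaultFS" = "hdfs://node1:8020",
"working_dir" = "hdfs://node1:8020/tmp/doris",
"broker" = "broker_name"
);
The broker value must match a broker process already registered with the Doris cluster.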
6. Check the Spark Load job
Log in to the Yarn Web UI to see how the corresponding application ran.
Run the following statement to check the status of the Spark Load job:
mysql> show load order by createtime desc limit 1\G;
*************************** 1. row ***************************
         JobId: 37128
         Label: label2
         State: FINISHED
      Progress: ETL:100%; LOAD:100%
          Type: SPARK
       EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=0
      TaskInfo: cluster:spark1; timeout(s):3600; max_filter_ratio:0.0
      ErrorMsg: NULL
    CreateTime: 2023-03-10 18:13:19
  EtlStartTime: 2023-03-10 18:13:34
 EtlFinishTime: 2023-03-10 18:15:27
 LoadStartTime: 2023-03-10 18:15:27
LoadFinishTime: 2023-03-10 18:15:30
           URL: http://node1:8088/proxy/application_1678424784452_0007/
    JobDetails: {"Unfinished backends":{"0-0":[]},"ScannedRows":0,"TaskNumber":1,"LoadBytes":0,"All backends":{"0-0":[-1]},"FileNumber":0,"FileSize":0}
 TransactionId: 24081
  ErrorTablets: {}
1 row in set (0.00 sec)
7. Check the result in Doris
mysql> select * from spark_load_t2;
+------+------+------+-------+
| id | name | age | score |
+------+------+------+-------+
| 5 | tq | 22 | 104 |
| 4 | ml | 21 | 103 |
| 1 | zs | 18 | 100 |
| 3 | ww | 20 | 102 |
| 2 | ls | 19 | 101 |
+------+------+------+-------+
II. Importing Data from a Partitioned Hive Table with Spark Load
To import a partitioned Hive table into a matching partitioned Doris table, the Hive-external-table approach above cannot be used, because the Hive partition column is just an ordinary column in the external table. Instead, Spark Load reads the partitioned Hive table's directories on HDFS directly and loads the data into the partitioned Doris table.
1. Prepare the data to load into the Hive table on the node3 Hive client
hive_data2.txt:
1,zs,18,100,2023-03-01
2,ls,19,200,2023-03-01
3,ww,20,300,2023-03-02
4,ml,21,400,2023-03-02
5,tq,22,500,2023-03-02
2. Create the partitioned Hive table and load the data
#On node3, open the Hive client, create the table, and load the data
create table hive_tbl2 (id int, name string, age int, score int) partitioned by (dt string) row format delimited fields terminated by ',';
load data local inpath '/root/hive_data2.txt' into table hive_tbl2;
#Check the data in the Hive table
hive> select * from hive_tbl2;
OK
1 zs 18 100 2023-03-01
2 ls 19 200 2023-03-01
3 ww 20 300 2023-03-02
4 ml 21 400 2023-03-02
5 tq 22 500 2023-03-02
hive> show partitions hive_tbl2;
OK
dt=2023-03-01
dt=2023-03-02
Once hive_tbl2 has been created and loaded, its data is laid out in HDFS in one directory per partition, as shown below.
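The listing below sketches that layout; the warehouse root matches the HDFS path used in the LOAD statement later, and the -C flag prints paths only:
[root@node3 ~]# hdfs dfs -ls -C /user/hive/warehouse/hive_tbl2
/user/hive/warehouse/hive_tbl2/dt=2023-03-01
/user/hive/warehouse/hive_tbl2/dt=2023-03-02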
3. Create the partitioned Doris table
create table spark_load_t3(
dt date,
id int,
name varchar(255),
age int,
score double
)
ENGINE = olap
DUPLICATE KEY(dt,id)
PARTITION BY RANGE(`dt`)
(
PARTITION `p1` VALUES [("2023-03-01"),("2023-03-02")),
PARTITION `p2` VALUES [("2023-03-02"),("2023-03-03"))
)
DISTRIBUTED BY HASH(`id`) BUCKETS 8;
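With this DDL, rows with dt = 2023-03-01 fall into partition p1 ([2023-03-01, 2023-03-02)) and rows with dt = 2023-03-02 fall into p2 ([2023-03-02, 2023-03-03)). The partition layout can be double-checked from the Doris MySQL client (output omitted here):
mysql> show partitions from spark_load_t3;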
4. Create the Spark Load import job
The same prerequisites as in section one apply: the Spark job created for this load connects to Hive, so hive-site.xml must be present under $SPARK_HOME/conf/ on the Spark client nodes node1-node3, and mysql-connector-java-5.1.47.jar must be present under $HADOOP_HOME/share/hadoop/yarn/lib on every Yarn NodeManager node (see the scp/cp commands in section one, step 5).
Write the Spark Load job as follows:
LOAD LABEL example_db.label3
(
DATA INFILE("hdfs://node1:8020/user/hive/warehouse/hive_tbl2/dt=2023-03-02/*")
INTO TABLE spark_load_t3
COLUMNS TERMINATED BY ","
FORMAT AS "csv"
(id,name,age,score)
COLUMNS FROM PATH AS (dt)
SET
(
dt=dt,
id=id,
name=name,
age=age
)
)
WITH RESOURCE 'spark1'
(
"spark.executor.memory" = "1g",
"spark.shuffle.compress" = "true"
)
PROPERTIES
(
"timeout" = "3600"
);
Note:
- The HDFS path above does not support HA mode; the Active NameNode must be specified manually (the check after these notes shows one way to find it).
- The partition directories in the HDFS path must be written out explicitly; a * wildcard cannot stand in for them, unlike Broker Load.
- A known issue in the version tested here: when DATA INFILE lists multiple paths, sometimes only the data from the first path gets imported.
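One way to find the Active NameNode mentioned in the first note is the standard HDFS HA admin command, using the NameNode IDs configured earlier for the mycluster nameservice (node1 and node2); the output shown here is illustrative:
[root@node1 ~]# hdfs haadmin -getServiceState node1
active
[root@node1 ~]# hdfs haadmin -getServiceState node2
standby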
5. Check the Spark Load job
Run the following statement to check the status of the Spark Load job:
mysql> show load order by createtime desc limit 1\G;
*************************** 1. row ***************************
         JobId: 39432
         Label: label3
         State: FINISHED
      Progress: ETL:100%; LOAD:100%
          Type: SPARK
       EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=3
      TaskInfo: cluster:spark1; timeout(s):3600; max_filter_ratio:0.0
      ErrorMsg: NULL
    CreateTime: 2023-03-10 20:11:19
  EtlStartTime: 2023-03-10 20:11:36
 EtlFinishTime: 2023-03-10 20:12:21
 LoadStartTime: 2023-03-10 20:12:21
LoadFinishTime: 2023-03-10 20:12:22
           URL: http://node1:8088/proxy/application_1678443952851_0026/
    JobDetails: {"Unfinished backends":{"0-0":[]},"ScannedRows":3,"TaskNumber":1,"LoadBytes":0,"All backends":{"0-0":[-1]},"FileNumber":2,"FileSize":60}
 TransactionId: 25529
  ErrorTablets: {}
1 row in set (0.02 sec)
6. Check the result in Doris
mysql> select * from spark_load_t3;
+------------+------+------+------+-------+
| dt | id | name | age | score |
+------------+------+------+------+-------+
| 2023-03-02 | 3 | ww | 20 | 300 |
| 2023-03-02 | 4 | ml | 21 | 400 |
| 2023-03-02 | 5 | tq | 22 | 500 |
+------------+------+------+------+-------+
- 📢 Blog home page: https://lansonli.blog.csdn.net
- 📢 Likes 👍, bookmarks, and comments 📝 are welcome; if you spot a mistake, corrections are appreciated!
- 📢 Written by Lansonli and first published on the CSDN blog 🙉
- 📢 While you pause to rest, others are still running; make the most of your time and keep pushing toward a better life