Table of Contents
- 1. Generating data with Python
- 1.1 Code walkthrough
- 1.2 Reference code
- 2. Data migration
- 2.1 Uploading from the local machine to the server
- 2.2 Checking the source data format
- 2.3 Checking the size and uploading to HDFS
- 3. Creating the table with beeline
- 3.1 Creating the test table and loading the test data
- 3.2 Table-creation session output
- 4. Handling the CSV header row
- 4.1 Creating a new table
- 4.2 Inserting into the new table while filtering out the header row
1. Generating data with Python
1.1 Code walkthrough
This Python script generates simulated personal-information records and saves them as CSV files.
- Import the required modules: csv for reading and writing CSV files, random for generating random numbers, and faker for generating fake data.
- Define the basic generation settings: file_base_path (base path of the output CSV files), rows_per_file (number of rows in each CSV file), num_rows (total number of rows to generate), and fake (a faker.Faker() instance that produces the fake data).
- Define the lookup dictionaries: nationalities (nationality codes and their countries), regions (region codes and their region names), and source_codes (a list of source codes).
- Use a counter, row_counter, to track how many rows have been generated.
- Loop to produce multiple CSV files, each containing rows_per_file rows of data.
- Within each file, generate random personal-information records and write them to the CSV file.
- Print a progress message after every 10,000 rows generated.
- After all the data has been generated, print the total row count.
In short, the script uses the Faker library to generate simulated personal information, split across CSV files of a fixed size, with the fields Rowkey, Name, Age, Email, Address, IDNumber, PhoneNumber, Nationality, Region, SourceCode.
1.2 Reference code
import csv
import random

import faker

# Base path for the output files
file_base_path = './output/personal_info_extended'
# Rows per file
rows_per_file = 10000
# Total number of rows
num_rows = 10000000

# Create a Faker instance
fake = faker.Faker()

# Lookup dictionaries
nationalities = {
    1: 'US', 2: 'CA', 3: 'UK', 4: 'AU', 5: 'FR', 6: 'DE', 7: 'JP',
}
regions = {
    1: 'North', 2: 'South', 3: 'East', 4: 'West', 5: 'Central',
}
source_codes = ['A123', 'B456', 'C789', 'D101', 'E202']

# Counter used to track how many rows have been generated
row_counter = 0

# Generate the data files in a loop
for file_number in range(1, num_rows // rows_per_file + 1):
    file_path = f"{file_base_path}_{file_number}.csv"
    # Open the CSV file for writing
    with open(file_path, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)
        # Write the header row (only before any data rows exist,
        # i.e. only in the very first file)
        if row_counter == 0:
            csv_writer.writerow(['Rowkey', 'Name', 'Age', 'Email', 'Address', 'IDNumber',
                                 'PhoneNumber', 'Nationality', 'Region', 'SourceCode'])
        # Generate and write the requested number of simulated records
        for _ in range(rows_per_file):
            name = fake.name()
            age = random.randint(18, 99)
            email = fake.email()
            # Replace newlines in the address so each record stays on one line
            address = fake.address().replace('\n', ' ')
            id_number = fake.ssn()
            phone_number = fake.phone_number()
            nationality_code = random.randint(1, len(nationalities))
            nationality = nationalities[nationality_code]
            region_code = random.randint(1, len(regions))
            region = regions[region_code]
            source_code = random.choice(source_codes)
            data_row = [row_counter + 1, name, age, email, address, id_number,
                        phone_number, nationality, region, source_code]
            csv_writer.writerow(data_row)
            row_counter += 1
    print(f'Generated {row_counter} rows so far')

print(f'Finished generating {num_rows} rows of simulated personal information')
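The header-handling subtlety in the script (the row_counter == 0 check means only the very first file gets a header row) can be illustrated with the standard library alone. This is a minimal sketch, not the script itself: write_split_csv is an illustrative helper, and placeholder rows stand in for the Faker-generated records.

```python
import csv
import os
import tempfile

def write_split_csv(out_dir, header, rows, rows_per_file):
    """Split rows into files of rows_per_file lines each; only the first
    file gets a header, mirroring the row_counter == 0 check above."""
    paths = []
    for file_number, start in enumerate(range(0, len(rows), rows_per_file), start=1):
        path = os.path.join(out_dir, f"personal_info_extended_{file_number}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            if start == 0:  # no rows written yet: header goes only in the first file
                writer.writerow(header)
            writer.writerows(rows[start:start + rows_per_file])
        paths.append(path)
    return paths

# Tiny demo with placeholder rows instead of Faker data.
with tempfile.TemporaryDirectory() as d:
    rows = [[i, f"name{i}"] for i in range(1, 6)]
    paths = write_split_csv(d, ["Rowkey", "Name"], rows, rows_per_file=2)
    counts = [sum(1 for _ in open(p)) for p in paths]
print(counts)  # → [3, 2, 1]
```

The first file is one line longer than its data because of the header; every later file contains data only. This is exactly why the header cleanup in section 4 has to deal with a single header row rather than one per file.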
2. Data migration
2.1 Uploading from the local machine to the server
[root@hadoop10 personInfo]# pwd
/opt/data/personInfo
[root@hadoop10 personInfo]# ls -l| wc -l
215
[root@hadoop10 personInfo]# wc -l *
...
  10000 personal_info_extended_98.csv
  10000 personal_info_extended_99.csv
  10000 personal_info_extended_9.csv
2131609 總用量
The output shows that the 215 generated CSV files have been uploaded to the /opt/data/personInfo directory (note that ls -l | wc -l also counts the summary line printed by ls).
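The wc -l total can also be reproduced programmatically when no shell is at hand. A minimal sketch (total_lines is a hypothetical helper; the demo uses throwaway files rather than the real 215 CSVs):

```python
import glob
import os
import tempfile

def total_lines(pattern):
    """Sum line counts across all files matching pattern
    (the grand total that `wc -l *` reports)."""
    return sum(sum(1 for _ in open(p)) for p in glob.glob(pattern))

# Demo with two throwaway files instead of the real 215 CSVs.
with tempfile.TemporaryDirectory() as d:
    for i, n in enumerate((3, 4), start=1):
        with open(os.path.join(d, f"part_{i}.csv"), "w") as f:
            f.write("x\n" * n)
    total = total_lines(os.path.join(d, "*.csv"))
print(total)  # → 7
```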
2.2 Checking the source data format
[root@hadoop10 personInfo]# head personal_info_extended_1.csv
Rowkey,Name,Age,Email,Address,IDNumber,PhoneNumber,Nationality,Region,SourceCode
1,Hayley Jimenez,58,garrisonalicia@harris.com,"92845 Davis Circles Apt. 198 East Jerryshire, NV 35424",657-35-2900,(141)053-9917,DE,North,C789
2,Amy Johnson,23,samuelrivera@hall.com,"119 Manning Rapids Suite 557 New Randyburgh, MN 58113",477-76-9570,+1-250-531-6115,UK,North,D101
3,Sara Harper,31,gsandoval@hotmail.com,"98447 Robinson Dale Garzatown, ME 35917",254-77-4980,7958192189,AU,East,A123
4,Alicia Wang,53,kellyreed@evans.com,"531 Lucas Vista New Laura, MO 62148",606-19-1971,001-295-093-9174x819,DE,West,C789
5,Lauren Rodriguez,71,rebeccasaunders@yahoo.com,"060 Gomez Ports Suite 355 Lake Aarontown, CO 38284",186-61-7463,8458236624,DE,East,E202
6,Juan Harris,98,davidsonjohn@hines.com,"50325 Alvarez Forge Apt. 800 New Ericchester, AL 16131",529-53-1492,+1-302-675-5810,CA,East,B456
7,Stephanie Price,90,sroberts@becker.com,"9668 Erik Inlet Port Joshua, MO 62524",303-11-9577,628.011.4670,UK,East,C789
8,Nicole Parker,61,tmcneil@rose-rodriguez.com,"485 Elliott Branch Scottshire, NJ 03885",473-55-5636,001-625-925-3712x952,FR,West,A123
9,Joel Young,54,john03@hotmail.com,"9413 Houston Flats Apt. 095 West Peggy, MD 56240",547-31-2815,920.606.0727x27740,JP,Central,E202
Using the head command to inspect the beginning of the file, we can see the header row of field names; those column names are what we will use to write the CREATE TABLE statement.
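Since the header line supplies the column names, the DDL used later can be derived from it mechanically. A minimal sketch, assuming every column is typed STRING as in section 3.1 (create_table_from_header is an illustrative helper, not part of the original workflow):

```python
import csv
import io

def create_table_from_header(header_line, table_name):
    """Build a Hive DDL string from a CSV header line,
    typing every column as STRING."""
    columns = next(csv.reader(io.StringIO(header_line)))
    cols_sql = ",\n    ".join(f"{c} STRING" for c in columns)
    return (
        f"CREATE TABLE {table_name} (\n    {cols_sql}\n)\n"
        "ROW FORMAT DELIMITED\n"
        "FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE;"
    )

header = "Rowkey,Name,Age,Email,Address,IDNumber,PhoneNumber,Nationality,Region,SourceCode"
ddl = create_table_from_header(header, "personal_info")
print(ddl)
```

The generated statement matches the one executed in beeline below; for real schemas you would of course pick proper types per column instead of STRING everywhere.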
2.3 Checking the size and uploading to HDFS
[root@hadoop10 data]# du -h
282M ./personInfo
282M .
[root@hadoop10 data]# hdfs dfs -put /opt/data/personInfo /testdir/
[root@hadoop10 data]# hdfs dfs -du -h /testdir/
281.4 M 281.4 M /testdir/personInfo
The data occupies 282 MB on the local Linux filesystem; after the upload to the /testdir/ directory on the HDFS cluster it occupies 281.4 MB.
3. Creating the table with beeline
3.1 Creating the test table and loading the test data
CREATE TABLE personal_info (
    Rowkey STRING,
    Name STRING,
    Age STRING,
    Email STRING,
    Address STRING,
    IDNumber STRING,
    PhoneNumber STRING,
    Nationality STRING,
    Region STRING,
    SourceCode STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/testdir/personInfo/*.csv' INTO TABLE personal_info;
If every CSV file starts with the same header row, append TBLPROPERTIES ("skip.header.line.count"="1") to the end of the CREATE TABLE statement so that the first line of each file is skipped.
In this case, because the files were generated by the Python script above, only the first CSV file has a header row and the rest do not, so we will handle that single header line separately later.
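An alternative to filtering the header out in SQL afterwards would be to strip the first line of that one file before uploading it to HDFS. A hedged sketch of the idea (strip_first_line is a hypothetical helper; the demo runs on a throwaway file, not the real personal_info_extended_1.csv):

```python
import os
import tempfile

def strip_first_line(path):
    """Remove the first line of a file in place and return the removed line."""
    with open(path, "r", newline="") as f:
        first = f.readline()
        rest = f.read()
    with open(path, "w", newline="") as f:
        f.write(rest)
    return first.rstrip("\r\n")

# Demo on a throwaway file standing in for the first generated CSV.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as f:
    f.write("Rowkey,Name\n1,Hayley Jimenez\n")
removed = strip_first_line(path)
with open(path) as f:
    remaining = f.read()
os.remove(path)
print(removed)  # → Rowkey,Name
```

Note this rewrites the whole file, which is cheap for a 10,000-row CSV but would matter at larger sizes; the SQL approach used below trades that for a full table rewrite inside Hive instead.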
3.2 Table-creation session output
0: jdbc:hive2://hadoop10:10000> CREATE TABLE personal_info (
. . . . . . . . . . . . . . . > Rowkey STRING,
. . . . . . . . . . . . . . . > Name STRING,
. . . . . . . . . . . . . . . > Age STRING,
. . . . . . . . . . . . . . . > Email STRING,
. . . . . . . . . . . . . . . > Address STRING,
. . . . . . . . . . . . . . . > IDNumber STRING,
. . . . . . . . . . . . . . . > PhoneNumber STRING,
. . . . . . . . . . . . . . . > Nationality STRING,
. . . . . . . . . . . . . . . > Region STRING,
. . . . . . . . . . . . . . . > SourceCode STRING
. . . . . . . . . . . . . . . > )
. . . . . . . . . . . . . . . > ROW FORMAT DELIMITED
. . . . . . . . . . . . . . . > FIELDS TERMINATED BY ','
. . . . . . . . . . . . . . . > STORED AS TEXTFILE;
No rows affected (0.147 seconds)
0: jdbc:hive2://hadoop10:10000> LOAD DATA INPATH '/testdir/personInfo/*.csv' INTO TABLE personal_info;
No rows affected (2.053 seconds)
0: jdbc:hive2://hadoop10:10000> select * from personal_info limit 5;
+-----------------------+---------------------+--------------------+----------------------------+------------------------------------------------+-------------------------+----------------------------+----------------------------+-----------------------+---------------------------+
| personal_info.rowkey | personal_info.name | personal_info.age | personal_info.email | personal_info.address | personal_info.idnumber | personal_info.phonenumber | personal_info.nationality | personal_info.region | personal_info.sourcecode |
+-----------------------+---------------------+--------------------+----------------------------+------------------------------------------------+-------------------------+----------------------------+----------------------------+-----------------------+---------------------------+
| Rowkey | Name | Age | Email | Address | IDNumber | PhoneNumber | Nationality | Region | SourceCode |
| 1 | Hayley Jimenez | 58 | garrisonalicia@harris.com | "92845 Davis Circles Apt. 198 East Jerryshire | NV 35424" | 657-35-2900 | (141)053-9917 | DE | North |
| 2 | Amy Johnson | 23 | samuelrivera@hall.com | "119 Manning Rapids Suite 557 New Randyburgh | MN 58113" | 477-76-9570 | +1-250-531-6115 | UK | North |
| 3 | Sara Harper | 31 | gsandoval@hotmail.com | "98447 Robinson Dale Garzatown | ME 35917" | 254-77-4980 | 7958192189 | AU | East |
| 4 | Alicia Wang | 53 | kellyreed@evans.com | "531 Lucas Vista New Laura | MO 62148" | 606-19-1971 | 001-295-093-9174x819 | DE | West |
+-----------------------+---------------------+--------------------+----------------------------+------------------------------------------------+-------------------------+----------------------------+----------------------------+-----------------------+---------------------------+
5 rows selected (0.52 seconds)
4. Handling the CSV header row
4.1 Creating a new table
The idea is to select all the data out of the old table, insert it into a new table while filtering out the header row, and then drop the old table. If you use this approach in production, weigh the cluster's compute and storage capacity first.
CREATE TABLE pinfo (
    Rowkey STRING,
    Name STRING,
    Age STRING,
    Email STRING,
    Address STRING,
    IDNumber STRING,
    PhoneNumber STRING,
    Nationality STRING,
    Region STRING,
    SourceCode STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Count the rows in the old table:
0: jdbc:hive2://hadoop10:10000> select count(*) from personal_info;
+----------+
| _c0 |
+----------+
| 2131609 |
+----------+
1 row selected (45.762 seconds)
4.2 Inserting into the new table while filtering out the header row
INSERT OVERWRITE TABLE pinfo
SELECT
    t.Rowkey,
    t.Name,
    t.Age,
    t.Email,
    t.Address,
    t.IDNumber,
    t.PhoneNumber,
    t.Nationality,
    t.Region,
    t.SourceCode
FROM (
    SELECT
        Rowkey,
        Name,
        Age,
        Email,
        Address,
        IDNumber,
        PhoneNumber,
        Nationality,
        Region,
        SourceCode
    FROM personal_info
) t
WHERE t.Name != 'Name';
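For comparison, the effect of the WHERE t.Name != 'Name' filter can be mimicked locally in Python. This is a sketch on sample data only; it assumes Name is the second field, as in the generated files:

```python
import csv
import io

def drop_header_rows(csv_text):
    """Drop any row whose Name field is literally 'Name',
    mirroring the WHERE clause in the Hive query."""
    rows = csv.reader(io.StringIO(csv_text))
    return [r for r in rows if r[1] != "Name"]

sample = "Rowkey,Name,Age\n1,Hayley Jimenez,58\n2,Amy Johnson,23\n"
filtered = drop_header_rows(sample)
print(len(filtered))  # → 2
```

Like the Hive query, this would also drop any real person literally named "Name"; with generated data that risk is acceptable, but a stricter filter could match the whole header row instead of a single field.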
0: jdbc:hive2://hadoop10:10000> select * from pinfo limit 5;
+---------------+-------------------+------------+----------------------------+------------------------------------------------+-----------------+--------------------+-----------------------+---------------+-------------------+
| pinfo.rowkey | pinfo.name | pinfo.age | pinfo.email | pinfo.address | pinfo.idnumber | pinfo.phonenumber | pinfo.nationality | pinfo.region | pinfo.sourcecode |
+---------------+-------------------+------------+----------------------------+------------------------------------------------+-----------------+--------------------+-----------------------+---------------+-------------------+
| 1 | Hayley Jimenez | 58 | garrisonalicia@harris.com | "92845 Davis Circles Apt. 198 East Jerryshire | NV 35424" | 657-35-2900 | (141)053-9917 | DE | North |
| 2 | Amy Johnson | 23 | samuelrivera@hall.com | "119 Manning Rapids Suite 557 New Randyburgh | MN 58113" | 477-76-9570 | +1-250-531-6115 | UK | North |
| 3 | Sara Harper | 31 | gsandoval@hotmail.com | "98447 Robinson Dale Garzatown | ME 35917" | 254-77-4980 | 7958192189 | AU | East |
| 4 | Alicia Wang | 53 | kellyreed@evans.com | "531 Lucas Vista New Laura | MO 62148" | 606-19-1971 | 001-295-093-9174x819 | DE | West |
| 5 | Lauren Rodriguez | 71 | rebeccasaunders@yahoo.com | "060 Gomez Ports Suite 355 Lake Aarontown | CO 38284" | 186-61-7463 | 8458236624 | DE | East |
+---------------+-------------------+------------+----------------------------+------------------------------------------------+-----------------+--------------------+-----------------------+---------------+-------------------+
5 rows selected (0.365 seconds)
0: jdbc:hive2://hadoop10:10000>
You can watch the insert's progress in the YARN web UI.
Once the new table holds exactly one row fewer than the old one, the header row has been filtered out and the insert is complete.
0: jdbc:hive2://hadoop10:10000> select count(*) from pinfo;
+----------+
| _c0 |
+----------+
| 2131608 |
+----------+
1 row selected (0.291 seconds)