Table of Contents
- Preparation
- Dropping rows with 3 or more missing values
- Dropping rows where star rating, review count, or score is empty
- Dropping invalid data
- hotel_data.csv
通過(guò)編寫(xiě)Spark程序清洗酒店數(shù)據(jù)里的缺失數(shù)據(jù)、非法數(shù)據(jù)、重復(fù)數(shù)據(jù)
準(zhǔn)備工作
- 搭建 hadoop 偽分布或 hadoop 完全分布
- 上傳 hotal_data.csv 文件到 hadoop
- idea 配置好 scala 環(huán)境
Dropping rows with 3 or more missing values
- Read /hotel_data.csv
- Drop every row with 3 or more missing values and print the number of removed rows
- Save the cleaned data to /hotelsparktask1
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Demo01 {
  def main(args: Array[String]): Unit = {
    // System.setProperty("HADOOP_USER_NAME", "root") // work around insufficient HDFS write permissions
    val config: SparkConf = new SparkConf().setMaster("local[1]").setAppName("1")
    val sc = new SparkContext(config)
    val hdfsUrl = "hdfs://192.168.226.129:9000"
    val filePath: String = hdfsUrl + "/file3_1/hotel_data.csv"
    val data: RDD[Array[String]] = sc.textFile(filePath).map(_.split(",")).cache()
    val total: Long = data.count()
    // A field is missing when it holds the literal string "NULL".
    // Dropping rows with >= 3 missing values means keeping rows with fewer than 3.
    val dataDrop: RDD[Array[String]] = data.filter(_.count(_.equals("NULL")) < 3)
    println("Number of rows removed: " + (total - dataDrop.count()))
    dataDrop.map(_.mkString(",")).saveAsTextFile(hdfsUrl + "/hotelsparktask1")
    sc.stop()
  }
}
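The core of this task is the missing-value predicate, which can be exercised without a cluster. A minimal standalone sketch; the sample rows are hypothetical, not taken from hotel_data.csv, and a field counts as missing when it holds the literal string "NULL":

```scala
object MissingValueDemo {
  // Count the fields of a row that hold the literal string "NULL"
  def missingCount(row: Array[String]): Int = row.count(_ == "NULL")

  // Keep a row only if it has fewer than 3 missing fields,
  // i.e. drop rows with 3 or more missing values
  def keep(row: Array[String]): Boolean = missingCount(row) < 3

  def main(args: Array[String]): Unit = {
    // Hypothetical rows for illustration
    val ok  = Array("Hotel A", "Beijing", "NULL", "4.5", "120")
    val bad = Array("Hotel B", "NULL", "NULL", "NULL", "88")
    println(keep(ok))  // only 1 missing field, so it is kept
    println(keep(bad)) // 3 missing fields, so it is dropped
  }
}
```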
Dropping rows where star rating, review count, or score is empty
- Read /hotel_data.csv
- Drop every row where any of the fields {star rating, review count, score} is empty and print the number of removed rows
- Save to /hotelsparktask2
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Demo02 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val config: SparkConf = new SparkConf().setMaster("local[1]").setAppName("2")
    val sc = new SparkContext(config)
    val hdfsUrl = "hdfs://192.168.226.129:9000"
    val filePath: String = hdfsUrl + "/file3_1/hotel_data.csv"
    val data: RDD[Array[String]] = sc.textFile(filePath).map(_.split(",")).cache()
    val total: Long = data.count()
    // Column 6 = star rating, column 10 = score, column 11 = review count;
    // keep a row only if none of the three fields is "NULL"
    val dataDrop: RDD[Array[String]] = data.filter { arr: Array[String] =>
      !(arr(6).equals("NULL") || arr(10).equals("NULL") || arr(11).equals("NULL"))
    }
    println("Number of rows removed: " + (total - dataDrop.count()))
    dataDrop.map(_.mkString(",")).saveAsTextFile(hdfsUrl + "/hotelsparktask2")
    sc.stop()
  }
}
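The required-fields check is a pure function of one row, so it is easy to test off-cluster. A minimal sketch, assuming the column layout used in the program (index 6 = star rating, index 10 = score, index 11 = review count); the 12-column sample rows are made up for illustration:

```scala
object RequiredFieldsDemo {
  // Column indices assumed from the Spark program
  val StarIdx = 6
  val ScoreIdx = 10
  val ReviewIdx = 11

  // A row passes only if none of the three required fields is "NULL"
  def hasRequiredFields(arr: Array[String]): Boolean =
    !(arr(StarIdx) == "NULL" || arr(ScoreIdx) == "NULL" || arr(ReviewIdx) == "NULL")

  def main(args: Array[String]): Unit = {
    // Hypothetical 12-column rows
    val full    = Array.fill(12)("x")
    val noScore = full.updated(ScoreIdx, "NULL")
    println(hasRequiredFields(full))    // all required fields present
    println(hasRequiredFields(noScore)) // score missing, row is dropped
  }
}
```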
Dropping invalid data
- Read /hotelsparktask1 produced by the first task
- Remove rows whose score or star-rating field is invalid: a valid score is a real number in [0, 5], and a valid star-rating field contains one of NULL, 二星 (two-star), 三星 (three-star), 四星 (four-star), or 五星 (five-star)
- Remove duplicate rows
- Print the number of rows removed for invalid scores, invalid star ratings, and duplicates, respectively
- Save to /hotelsparktask3
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Demo03 {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root") // work around insufficient HDFS write permissions
    val config: SparkConf = new SparkConf().setMaster("local[1]").setAppName("3")
    val sc = new SparkContext(config)
    val hdfsUrl = "hdfs://192.168.226.129:9000"
    val filePath: String = hdfsUrl + "/hotelsparktask1"
    val lines: RDD[String] = sc.textFile(filePath).cache()
    val data: RDD[Array[String]] = lines.map(_.split(","))
    val total: Long = data.count()
    // A valid score (column 10) is a real number in [0, 5];
    // non-numeric values throw and are treated as invalid
    val dataDrop: RDD[Array[String]] = data.filter { arr: Array[String] =>
      try {
        (arr(10).toDouble >= 0) && (arr(10).toDouble <= 5)
      } catch {
        case _: Exception => false
      }
    }
    // A valid star-rating field (column 6) contains one of these labels
    val lab = Array("NULL", "一星", "二星", "三星", "四星", "五星")
    val dataDrop1: RDD[Array[String]] = data.filter { arr: Array[String] =>
      lab.exists(arr(6).contains)
    }
    // Deduplicate on the raw line text
    val dataDrop2: RDD[String] = lines.distinct
    println("Rows removed for invalid scores: " + (total - dataDrop.count()))
    println("Rows removed for invalid star ratings: " + (total - dataDrop1.count()))
    println("Duplicate rows removed: " + (total - dataDrop2.count()))
    // Apply all three filters before saving
    val wordsRdd: RDD[Array[String]] = lines.distinct
      .map(_.split(","))
      .filter { arr: Array[String] =>
        try {
          (arr(10).toDouble >= 0) && (arr(10).toDouble <= 5)
        } catch {
          case _: Exception => false
        }
      }
      .filter { arr: Array[String] => lab.exists(arr(6).contains) }
    wordsRdd.map(_.mkString(",")).saveAsTextFile(hdfsUrl + "/hotelsparktask3")
    sc.stop()
  }
}
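The two validity predicates of this task can be isolated and tested without Spark. A standalone sketch; the label list mirrors the `lab` array in the program, which also admits 一星 (one-star) even though the task statement only lists 二星 through 五星, and the sample field values are hypothetical:

```scala
object ValidityDemo {
  // A valid score is a real number in [0, 5]
  def validScore(s: String): Boolean =
    try {
      val d = s.toDouble
      d >= 0 && d <= 5
    } catch {
      case _: NumberFormatException => false
    }

  // A valid star-rating field contains one of these labels
  val starLabels = Array("NULL", "一星", "二星", "三星", "四星", "五星")
  def validStar(s: String): Boolean = starLabels.exists(s.contains)

  def main(args: Array[String]): Unit = {
    println(validScore("4.5"))  // in range, valid
    println(validScore("5.1"))  // out of range, invalid
    println(validScore("abc"))  // not a number, invalid
    println(validStar("五星"))   // contains a known label, valid
    println(validStar("豪華型")) // no known label, invalid
  }
}
```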
hotel_data.csv
下載數(shù)據(jù):https://download.csdn.net/download/weixin_44018458/87437211