

news 2025/7/2 19:00:45


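🌟Welcome to my blog: exploring the limitless possibilities of technology!

🌟Blog overview (table of contents)

Preface

This project walks through a basic data analysis workflow: data collection (web scraping), data cleaning, data storage, and front-end/back-end data visualization.
Recommended reading order: data collection -> data cleaning -> database storage -> Flask-based front-end/back-end interaction. If you have questions, leave a comment and I will answer when I have time~
Thanks for reading, liking, and following.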

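Development environment

System: Windows 10 Home (Chinese edition).
Languages: Python (3.9), MySQL.
Python libraries required: pymysql, pandas, numpy, time, datetime, requests, etree (from lxml), jieba, re, json, decimal, flask (time, datetime, re, json, and decimal ship with Python; install any of the others that are missing with pip).
Editors: Jupyter Notebook, PyCharm, SQLyog.
(If the code below does not run completely in Jupyter, run it in PyCharm instead.)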

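File description

The project contains four .ipynb files. Their functions are described below (leave a message if you need the .py versions):

Data collection: scrapes job data for the keyword "data mining" from 51job (前程無憂) and Liepin (獵聘). 270 pages were scraped from 51job, yielding more than 10,000 records; only 400-odd records were scraped from Liepin, mainly job-requirement text. All scraped data is saved to CSV files.
Data cleaning: cleans the scraped data, including removing duplicates and missing values, recoding variables, creating feature fields, and segmenting text.
Database storage: stores all cleaned data in MySQL. For the text data, extract_tags from jieba.analyse is used to obtain keywords and their weights, which makes it easy to draw word clouds.
Flask-based front-end/back-end interaction: builds the web visualization system with Flask, a small and lightweight Python framework. The static folder holds the css and js files; most of the js is Baidu's open-source ECharts. A custom controller.js uses ajax to call the routes defined in Flask and asynchronously refreshes the data in main.html under templates.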

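Tech stack

Python scraping: (requests and XPath)
Data cleaning: a detailed look at the project's preprocessing steps, including removing duplicates and missing values, recoding variables, creating feature fields, and preprocessing text data (pandas, numpy)
Database knowledge: select, insert, and other operations (create, read, update, delete with pymysql).
Front-end/back-end knowledge: (HTML, jQuery, JavaScript, Ajax).
Flask knowledge: a lightweight web framework that implements the front-end/back-end interaction in Python. (Flask)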

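I. Data collection (web scraping)

1. Scraping 51job (前程無憂)

The hardest part of 51job's anti-scraping measures is probably the detail page reached by clicking into a listing. That page is protected by a slider CAPTCHA: any attempt to scrape it with Python code is detected, and Selenium automation is detected as well.

For that reason, the data mining job requirements from Liepin are used in place of those from 51job.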

import requests
import re
import json
import time
import pandas as pd
import numpy as np
from lxml import etree

Enter a job title and a page count to scrape the corresponding pages

job_name = input('Enter the job title to search for: ')
page = input('Enter the number of pages to download: ')

Browser disguise

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36 Edg/93.0.961.47'
}
# Parameters submitted with every page request, to lower the risk of an IP ban
params = {'lang': 'c','postchannel': '0000','workyear': '99','cotype': '99','degreefrom': '99','jobterm': '99','companysize': '99','ord_field': '0','dibiaoid': '0'
}
href, update, job, company, salary, area, company_type, company_field, attribute = [], [], [], [], [], [], [], [], []
# content collects the job-description text scraped from the detail pages further down
content = []

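To avoid an IP ban, the code below uses a Redis-based IP proxy pool to obtain a random IP, so that every request to the server goes out from a different IP (setting up this ip_pool is fairly tedious, so the details are omitted here).
If you do not want to use proxy IPs, enable the time.sleep call further down instead and remove the proxies argument from the requests.

proxypool_url = 'http://127.0.0.1:5555/random'
# Fetch a random proxy from the ip_pool and return it in the format requests expects
def get_random_proxy():
    proxy = requests.get(proxypool_url).text.strip()
    # requests selects the proxy by URL scheme, so https pages need an 'https' entry as well
    proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
    return proxies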
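The choice described above, between routing requests through the proxy pool and simply pausing between them, can be isolated in one small helper. The following is a minimal sketch rather than the author's code: `build_request_kwargs` and its `proxy_address` parameter are hypothetical names, the address is a placeholder, and the proxy pool is assumed to return plain `host:port` text, as `get_random_proxy` above implies.

```python
def build_request_kwargs(proxy_address=None):
    """Build the optional keyword arguments for session.get.

    proxy_address is a hypothetical 'host:port' string, such as the text
    returned by the local proxy pool; None or '' means "no proxy".
    """
    if not proxy_address:
        return {}
    address = proxy_address.strip()
    # requests selects the proxy by URL scheme, so both keys are set
    return {'proxies': {'http': 'http://' + address,
                        'https': 'http://' + address}}


# Placeholder address, for illustration only
print(build_request_kwargs(' 10.0.0.1:8080\n'))
print(build_request_kwargs())
```

With a helper like this, a call such as `session.get(url, headers = headers, **build_request_kwargs(proxy))` works whether or not a proxy is configured, so switching the proxy off does not require editing every request.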

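One benefit of using a session is that it stores the cookies from every request. Note that with a session, the headers usually only need to contain the user-agent.

session = requests.Session()
# Check that a request to the site goes through
session.get('https://www.51job.com/', headers = headers, proxies = get_random_proxy())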

Scrape all the data on each page

for i in range(1, int(page) + 1):
    url = f'https://search.51job.com/list/000000,000000,0000,00,9,99,{job_name},2,{i}.html'
    response = session.get(url, headers = headers, params = params, proxies = get_random_proxy())
    # Use a regular expression to extract the job data embedded in the HTML
    ss = '{' + re.findall(r'window\.__SEARCH_RESULT__ = {(.*)}', response.text)[0] + '}'
    # Parse it as JSON so the data can be read field by field
    s = json.loads(ss)
    data = s['engine_jds']
    for info in data:
        href.append(info['job_href'])
        update.append(info['issuedate'])
        job.append(info['job_name'])
        company.append(info['company_name'])
        salary.append(info['providesalary_text'])
        area.append(info['workarea_text'])
        company_type.append(info['companytype_text'])
        company_field.append(info['companyind_text'])
        attribute.append(' '.join(info['attribute_text']))
    # time.sleep(np.random.randint(1, 2))
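The extraction step in the loop above (find the `window.__SEARCH_RESULT__` assignment, then parse it as JSON) can be tried offline. This is a minimal sketch under stated assumptions: `sample_html` is a made-up fragment that only mimics the structure the loop relies on, and `extract_search_result` is a hypothetical helper name. It returns None when the marker is missing, which would be the case if the site served a verification page instead of results.

```python
import json
import re

# Made-up page fragment that mimics the structure the scraping loop relies on
sample_html = '''<script type="text/javascript">
window.__SEARCH_RESULT__ = {"engine_jds": [{"job_name": "Data Mining Engineer", "attribute_text": ["Shanghai", "3 years", "Bachelor"]}]}
</script>'''


def extract_search_result(html_text):
    """Return the embedded search-result object as a dict, or None if absent."""
    # Capture from the first '{' to the last '}' on the assignment line
    match = re.search(r'window\.__SEARCH_RESULT__\s*=\s*(\{.*\})', html_text)
    if match is None:
        return None
    return json.loads(match.group(1))


result = extract_search_result(sample_html)
for info in result['engine_jds']:
    print(info['job_name'], '|', ' '.join(info['attribute_text']))
```

Capturing the braces inside the group avoids gluing `'{'` and `'}'` back on by hand, and checking for None avoids the IndexError that `re.findall(...)[0]` raises on a page without the marker.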

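Iterate over every link and scrape the job responsibilities

Some detail pages ask for a slider verification when opened. This is probably caused by scraping too frequently, so the scraper has to wait a while before fetching more data. If you do not want to switch IPs, you can pause with the time module instead.

for job_href in href:
    # Pass the headers explicitly: the session was not given default headers
    job_response = session.get(job_href, headers = headers)
    job_response.encoding = 'gbk'
    job_html = etree.HTML(job_response.text)
    content.append(' '.join(job_html.xpath('/html/body/div[3]/div[2]/div[3]/div[1]/div//p/text()')[1:]))
    # randint excludes the upper bound, so this pauses for 1 or 2 seconds
    time.sleep(np.random.randint(1, 3))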
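Waiting longer after each blocked attempt, as suggested above, can be sketched with a randomized, growing delay. The sketch below is an illustration, not the author's method: `backoff_delay` and `fetch_with_pauses` are hypothetical names, and how to recognize a verification page is site-specific, so that check is passed in as the `is_blocked` callable. The demo uses fake page texts and records the delays instead of sleeping.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Return a randomized wait in seconds that grows with the attempt number."""
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(base, upper) if upper > base else base


def fetch_with_pauses(fetch, is_blocked, max_attempts=4, sleep=time.sleep):
    """Call fetch() until is_blocked(result) is False or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch()
        if not is_blocked(result):
            return result
        # Blocked: wait before the next attempt, longer each time
        sleep(backoff_delay(attempt))
    return None


# Demo with fake responses: two "verification" pages, then real content
pages = iter(['verify', 'verify', '<p>job text</p>'])
waits = []
page_text = fetch_with_pauses(lambda: next(pages),
                              lambda text: text == 'verify',
                              sleep=waits.append)
print(page_text, len(waits))
```

In real use, `fetch` would wrap the `session.get` call for one link and `sleep` would stay at its default. Note that `np.random.randint(low, high)` excludes `high`, so `randint(1, 2)` always returns 1; `random.uniform` gives a less regular rhythm.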

Save the data to a DataFrame

# Columns: job link, posting date, job title, company name, company type, company industry, salary, location, other info
df = pd.DataFrame({'崗位鏈接': href, '發布時間': update, '崗位名稱': job, '公司名稱': company, '公司類型': company_type, '公司領域': company_field, '薪水': salary, '地域': area, '其他信息': attribute})
df.head()

Check how many records were scraped

len(job)

Save the data to a CSV file

df.to_csv('./51job_data_mining.csv', encoding = 'gb18030', index = False)
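Because the file is written with `encoding = 'gb18030'`, it has to be read back with the same encoding. The round trip can be checked with the standard library alone. This is a small sketch with made-up rows and a temporary file; the column names and values are illustrative only.

```python
import csv
import os
import tempfile

# Made-up rows: a header line and one record
rows = [['崗位名稱', '薪水'], ['數據挖掘工程師', '1.5-2萬/月']]
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# Write the rows as gb18030, the encoding used for the CSV files above
with open(path, 'w', encoding='gb18030', newline='') as f:
    csv.writer(f).writerows(rows)

# Read them back with the same encoding
with open(path, encoding='gb18030', newline='') as f:
    restored = list(csv.reader(f))

print(restored == rows)
```

The pandas equivalent when loading the scraped file later is `pd.read_csv('./51job_data_mining.csv', encoding = 'gb18030')`.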

2. Scraping Liepin (獵聘)

Browser disguise and related parameters

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36 Edg/93.0.961.47'
}
job, salary, area, edu, exp, company, href, content = [], [], [], [], [], [], [], []
session = requests.Session()
session.get('https://www.liepin.com/zhaopin/', headers = headers)

Enter a job title and a page count to scrape the corresponding pages

job_name = input('Enter the job title to search for: ')
page = input('Enter the number of pages to download: ')

Iterate over the data on every page

for i in range(int(page)):
    url = f'https://www.liepin.com/zhaopin/?key={job_name}&curPage={i}'
    # randint excludes the upper bound, so this always pauses for 1 second
    time.sleep(np.random.randint(1, 2))
    response = session.get(url, headers = headers)
    html = etree.HTML(response.text)
    for j in range(1, 41):
        job.append(html.xpath(f'//ul[@class="sojob-list"]/li[{j}]/div/div[1]/h3/@title')[0])
        # The title attribute packs salary, location, education, and experience, separated by '_'
        info = html.xpath(f'//ul[@class="sojob-list"]/li[{j}]/div/div[1]/p[1]/@title')[0]
        ss = info.split('_')
        salary.append(ss[0])
        area.append(ss[1])
        edu.append(ss[2])
        exp.append(ss[-1])
        company.append(html.xpath(f'//ul[@class="sojob-list"]/li[{j}]/div/div[2]/p[1]/a/text()')[0])
        href.append(html.xpath(f'//ul[@class="sojob-list"]/li[{j}]/div/div[1]/h3/a/@href')[0])

Each page holds 40 job listings, which is why the inner loop runs from 1 to 40.
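The loop above splits each listing's title attribute on `_` and reads the salary, location, education, and experience by position. A defensive variant of that step is sketched below. It is an illustration under assumptions: `JobInfo` and `parse_info` are hypothetical names, and the sample strings are made up rather than taken from the site.

```python
from typing import NamedTuple, Optional


class JobInfo(NamedTuple):
    salary: str
    area: str
    edu: str
    exp: str


def parse_info(title: str) -> Optional[JobInfo]:
    """Split a 'salary_area_edu_exp' string; return None if fields are missing."""
    parts = title.split('_')
    if len(parts) < 4:
        return None
    # Same positions as the scraping loop: first three fields, then the last one
    return JobInfo(parts[0], parts[1], parts[2], parts[-1])


# Made-up sample strings
print(parse_info('15-25k_Beijing_Bachelor_3-5 years'))
print(parse_info('negotiable_Shanghai'))
```

Returning None for a short string lets the caller skip the listing instead of silently storing the wrong field. In the same spirit, indexing an XPath result with `[0]` raises IndexError when a page has fewer than 40 listings, so checking that the result list is non-empty before indexing keeps one sparse page from stopping the run.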

Iterate over the data for every job

for job_href in href:
    # randint excludes the upper bound, so this always pauses for 1 second
    time.sleep(np.random.randint(1, 2))
    # Some detail links are incomplete, so the missing part has to be prepended
    if 'https' not in job_href:
        job_href = 'https://www.liepin.com' + job_href
    response = session.get(job_href, headers = headers)
    html = etree.HTML(response.text)
    content.append(html.xpath('//section[@class="job-intro-container"]/dl[1]//text()')[3])
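Completing the partial detail links can also be done with `urljoin` from the standard library, which leaves absolute URLs unchanged and resolves relative ones against a base. This is a minimal sketch: `complete_link` is a hypothetical helper name and the paths are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.liepin.com'


def complete_link(link):
    """Return an absolute URL for a link that may be relative."""
    return urljoin(BASE_URL, link)


# Made-up paths, for illustration only
print(complete_link('/job/123.shtml'))
print(complete_link('https://www.liepin.com/job/456.shtml'))
print(complete_link('//www.liepin.com/a/789.shtml'))
```

Compared with the substring test `'https' not in job_href`, this also handles protocol-relative links that start with `//`.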

Save the data

# Columns: job title, company, salary, location, education, work experience, job requirements
df = pd.DataFrame({'崗位名稱': job, '公司': company, '薪水': salary, '地域': area, '學歷': edu, '工作經驗': exp, '崗位要求': content})
df.to_csv('./liepin_data_mining.csv', encoding = 'gb18030', index = False)
df.head()
