
news 2025/7/1 13:11:14


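1. Concepts

A crawler is a program that imitates a browser, visits pages on the internet, and extracts data from them.

General-purpose crawler: fetches an entire page.
Focused crawler: fetches a specific part of a page.
Incremental crawler: monitors a site for updates and fetches only the newly published data.

The robots.txt protocol:

A gentlemen's agreement that states which paths a site allows crawlers to visit; nothing enforces it technically. Append /robots.txt to a site's root URL to view it:

https://www.baidu.com/robots.txt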
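
A crawler can check these rules before fetching a page. The sketch below uses the standard library's urllib.robotparser; the rules, the crawler name "my-crawler" and the example.com URLs are made up so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, inlined so the example needs no network access.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_RULES.splitlines())

# Rules are checked in order; the first one that matches the path decides.
can_fetch_public = parser.can_fetch("my-crawler", "https://www.example.com/index.html")
can_fetch_private = parser.can_fetch("my-crawler", "https://www.example.com/private/data.html")
print(can_fetch_public, can_fetch_private)
```

For a live site, set_url() followed by read() would load the real file instead of the inlined rules.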

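1.1 The HTTP protocol

HTTP is a protocol through which a client and a server exchange data.

Common request headers:

User-Agent: identifies the client sending the request, e.g. Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36
Connection: whether to close the connection or keep it open once the request completes (close or keep-alive)

Common response headers:

Content-Type: the type of data the server returns to the client, e.g. text/html; charset=utf-8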
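
These headers can be inspected without sending anything. The following sketch uses only the standard library (urllib.request and email.message rather than requests); the example.com URL and the shortened User-Agent string are placeholders.

```python
from email.message import Message
from urllib.request import Request

# Build (but do not send) a request carrying the two request headers above.
req = Request(
    "https://www.example.com/",  # placeholder URL
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6)",
        "Connection": "keep-alive",
    },
)
# urllib stores header names capitalized, i.e. as "User-agent".
user_agent = req.get_header("User-agent")

# Split a Content-Type value into its media type and charset.
msg = Message()
msg["Content-Type"] = "text/html; charset=utf-8"
media_type = msg.get_content_type()
charset = msg.get_content_charset()
print(user_agent, media_type, charset)
```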

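1.2 The HTTPS protocol

HTTPS is the secure version of HTTP: the data exchanged between client and server is encrypted in transit.

Ways to encrypt the data

Symmetric-key encryption: the client chooses the encryption scheme and sends both the ciphertext and the key to the server, which uses the key to decrypt the ciphertext. (Weakness: the ciphertext and the key can be intercepted together.)
Asymmetric-key encryption: the server generates a key pair and sends the public key to the client. The client encrypts its data with that public key and sends the ciphertext to the server, which decrypts it with the private key. (Weakness: the client cannot be sure that the key it received really came from the server.)
Certificate-based encryption: the server chooses the encryption scheme and submits its public key to a certificate authority. Once the authority has verified the key, it signs it digitally, packages the signed public key into a certificate, and the certificate is sent to the client, which can check the signature before trusting the key.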
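
On the client side, Python's standard ssl module applies the certificate approach. The sketch below assumes the defaults of ssl.create_default_context(); it only inspects the settings and opens no connection.

```python
import ssl

# The default client context trusts the system's CA certificates and
# rejects servers whose certificate cannot be verified.
context = ssl.create_default_context()

requires_certificate = context.verify_mode == ssl.CERT_REQUIRED
checks_hostname = context.check_hostname
print(requires_certificate, checks_hostname)
```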

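2. The requests module

Purpose: send requests the way a browser would.

Workflow when coding with requests:

Specify the URL
Send the request (GET/POST)
Read the response data
Persist the data

2.1 A simple page collector

2.1.1 User-Agent spoofing

Make the request sent by the Python script look as if it came from a browser.

2.1.2 GET requests with parameters

Put the URL's query parameters into a dictionary and pass it as params.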

# -*- coding: utf-8 -*-
# george
# time: 2024/12/24 7:56 PM
# name: test.py
# comment:
import requests

# Reference URL:
# https://www.baidu.com/s?wd=%E6%88%90%E5%8A%9F
url = "https://www.baidu.com/s?"

# User-Agent spoofing: imitate a browser
header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
}

word = input("enter your word:").strip()

# Query parameters carried by the GET request
param = {
    "wd": word
}

# Get the response object
response = requests.get(url, headers=header, params=param)

# Response body as bytes; response.text would return it as a string
page_text = response.content

# Persist to disk
file_name = word + ".html"
with open(file_name, "wb") as f:
    f.write(page_text)
print(f"{file_name} has been saved successfully")
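
To see what passing a params dictionary amounts to, the query string can be built by hand. This sketch uses urllib.parse.urlencode from the standard library instead of requests, with the word from the reference URL above.

```python
from urllib.parse import urlencode

base_url = "https://www.baidu.com/s"
param = {"wd": "成功"}

# urlencode percent-encodes the UTF-8 bytes of each value.
query = urlencode(param)
full_url = f"{base_url}?{query}"
print(full_url)  # https://www.baidu.com/s?wd=%E6%88%90%E5%8A%9F
```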

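2.2 Querying Baidu Translate

2.2.1 POST requests with parameters

Put the parameters into a dictionary and pass it as data.

2.2.2 Ajax requests

After a word is typed in, only part of the page refreshes.

Ajax (Asynchronous JavaScript and XML) is a technique for asynchronous communication in web applications. It uses JavaScript and XML (nowadays usually JSON) to exchange data with the server without reloading the whole page.

2.2.3 Using the json module

Deserialization, loads: converts a JSON string into a Python object such as a dictionary.
Serialization, dump: converts a Python object such as a dictionary into JSON and writes it to a file.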
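
A minimal round trip through both calls might look like this. The payload is made up for illustration, and an in-memory io.StringIO stands in for a real file.

```python
import io
import json

# Deserialize: JSON string -> Python dict (the payload is invented).
raw = '{"data": [{"k": "dog", "v": "n. 狗"}]}'
parsed = json.loads(raw)
first_entry = parsed["data"][0]

# Serialize: Python dict -> JSON text written to a file-like object.
# ensure_ascii=False keeps non-ASCII characters readable instead of \uXXXX escapes.
buffer = io.StringIO()
json.dump(first_entry, buffer, ensure_ascii=False)
written = buffer.getvalue()
print(written)
```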

# -*- coding: utf-8 -*-
# george
# time: 2024/12/24 7:56 PM
# name: test.py
# comment:
import requests
import json

# Reference URL
url = "https://fanyi.baidu.com/sug"

# User-Agent spoofing: imitate a browser
header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36",
    "Referer": "https://fanyi.baidu.com/mtpe-individual/multimodal"
}

word = input("enter your word:").strip()

# Form data carried by the POST request
param = {
    "kw": word
}

# Get the response object
response = requests.post(url, headers=header, data=param)

# Parse the JSON response body
word_text = json.loads(response.text)

# Persist to disk
file_name = word + ".json"
with open(file_name, "w", encoding="utf-8") as f:
    json.dump(word_text["data"][0], f, ensure_ascii=False)
print(f"{file_name} has been saved successfully")

2.3 Movies

2.4 Company URLs

Analyzing dynamically loaded data
I collected each company's URL, but found that each company's detail page is also loaded dynamically.

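3. Data parsing

Regular expressions
bs4
xpath

How data parsing works:

The text to extract is stored either between tags or in tag attributes, so parsing takes two steps:

Locate the target tag
Extract the value stored in the tag or in its attribute

3.1 Parsing with regular expressions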
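
As a small illustration of locating a tag and then extracting attribute values with a regular expression, consider the sketch below. The HTML fragment, class name and paths are invented for the example.

```python
import re

# Made-up HTML fragment standing in for a downloaded page.
html = """
<div class="thumb"><a href="/pic/1.html"><img src="/img/a.jpg" alt="first"></a></div>
<div class="thumb"><a href="/pic/2.html"><img src="/img/b.jpg" alt="second"></a></div>
"""

# Step 1: locate each <div class="thumb"> ... <img>.
# Step 2: capture the src and alt attribute values.
# re.S lets '.' match newlines as well.
pattern = r'<div class="thumb">.*?<img src="(.*?)" alt="(.*?)">'
matches = re.findall(pattern, html, re.S)
print(matches)
```

The non-greedy `.*?` keeps each match inside one div; a greedy `.*` would run on to the last image in the page.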

3.2 Parsing with bs4

bs4 is available only in Python.

3.2.1 How it works

Instantiate a BeautifulSoup object and load the page source into it
Call the object's attributes and methods to locate tags and extract data

3.2.2 Installation

pip3 install beautifulsoup4  -i  https://mirrors.aliyun.com/pypi/simple
pip3 install lxml  -i  https://mirrors.aliyun.com/pypi/simple

3.2.3 Using bs4

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2024/12/26 22:07
# @Author : George
from bs4 import BeautifulSoup
import re

fp = open("./result.html", "r", encoding="utf-8")
soup = BeautifulSoup(fp, "lxml")
# soup.tagName: returns the first occurrence of that tag in the HTML
print(soup.div)
# -------------------------------------------
# Equivalent to soup.tagName
print(soup.find("a"))
# Locate by attribute
print(soup.find("a", href=re.compile(".*ip138.com/$")).text)
# List of every matching tag in the source
print(soup.find_all("a"))
# -------------------------------------------
# select() accepts CSS selectors (id, class, tag, ...) and returns a list
print(soup.select('.center'))
# Hierarchical selectors:
# a space matches descendants at any depth
# > matches direct children only
for i in soup.select('.center > ul > li > a'):
    print(i.text)
print(soup.select('.center > ul a')[0])
# -------------------------------------------
# Text between tags: soup.a.text / string / get_text()
# -- text and get_text() return all text inside the tag, nested tags included
# -- string returns only the tag's own direct text
print(soup.select('.center > ul a')[0].string)
# -------------------------------------------
# Read an attribute value
print(soup.select('.center > ul a')[0]["href"])
fp.close()

3.2.4 Scraping Romance of the Three Kingdoms with bs4

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2024/12/26 22:43
# @Author : George
# ==================================================
# <a href="/guwen/bookv_6dacadad4420.aspx">第一回</a>
# URL of chapter 1:
# https://www.gushiwen.cn/guwen/bookv_6dacadad4420.aspx
# ==================================================
from bs4 import BeautifulSoup
import requests

url = "https://www.gushiwen.cn/guwen/book_46653FD803893E4F7F702BCF1F7CCE17.aspx"

# User-Agent spoofing: imitate a browser
header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
}
# Get the response object
response = requests.get(url, headers=header)
# Instantiate the BeautifulSoup object
soup = BeautifulSoup(response.text, "lxml")
a_list = soup.select(".bookcont > ul > span > a")
fp = open("三國.txt", "w", encoding="utf-8")
# Parse each chapter title and detail-page URL
for tag in a_list:
    title = tag.text
    # href already starts with "/", so the base has no trailing slash
    detail_url = "https://www.gushiwen.cn" + tag["href"]
    # Request the detail page
    detail_page = requests.get(detail_url, headers=header)
    # Parse the chapter content out of the detail page
    detail_soup = BeautifulSoup(detail_page.text, "lxml")
    # Taking the whole div's text loses the line breaks, because the
    # chapter is split across <p> tags:
    # content = detail_soup.find("div", class_="contson").text
    fp.write(title + ":")
    for line in detail_soup.select('.contson > p'):
        fp.write(line.text + "\n")
    fp.write("\n\n")
    print(f"{title} scraped successfully")
fp.close()
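
Building the detail-page URL by string concatenation depends on whether the href starts with a slash. One alternative, sketched here with the standard library's urljoin and the chapter link quoted in the script's header comment, resolves the href against the page it was found on.

```python
from urllib.parse import urljoin

page_url = "https://www.gushiwen.cn/guwen/book_46653FD803893E4F7F702BCF1F7CCE17.aspx"
href = "/guwen/bookv_6dacadad4420.aspx"  # href value of the chapter-1 link

# urljoin resolves the href relative to the page URL.
detail_url = urljoin(page_url, href)
print(detail_url)  # https://www.gushiwen.cn/guwen/bookv_6dacadad4420.aspx
```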


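3.3 Parsing with xpath

The most commonly used parsing method, and the most convenient and efficient.

3.3.1 How xpath parsing works

Instantiate an etree object and load the source of the page to be parsed into it
Call the etree object's xpath method with an xpath expression to locate tags and capture their content

3.3.2 Installation

pip3 install lxml  -i  https://mirrors.aliyun.com/pypi/simple

3.3.3 Usage

xpath locates tags strictly by their position in the page's tag hierarchy.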

<!DOCTYPE html>
<html lang="zh-CN">
<head>
<!--    <meta charset="UTF-8">-->
<!--    <meta name="viewport" content="width=device-width, initial-scale=1.0">-->
    <title>小型HTML頁面示例</title>
</head>
<body>
<div class="container">
    <div class="div1">
        <h2>第一個Div</h2>
        <p>這是第一個div的內容。它使用了類標簽.div1進行樣式定位。</p>
        <p>這是第一個div的內容。它使用了類標簽.div2進行樣式定位。</p>
        <p><title>這是第一個div的內容。它使用了類標簽.div3進行樣式定位。</title></p>
    </div>
    <div class="div2">
        <h2>第二個Div</h2>
        <p>這是第二個div的內容。它使用了類標簽.div2進行樣式定位。</p>
    </div>
    <div class="div3">
        <h2 class="div3_h2">第三個Div</h2>
        <p>這是第三個div的內容。它使用了類標簽.div3進行樣式定位。</p>
    </div>
</div>
</body>
</html>

from lxml import etree

tree_ = etree.parse("./baidu.html")
# r = tree_.xpath("/html/head/title")  # => [<Element title at 0x10a79ee60>], the located element objects
# r = tree_.xpath("/html//title")  # => [<Element title at 0x10a510dc0>], same result
# r = tree_.xpath("//title")  # => [<Element title at 0x106b75dc0>], every title tag in the source

# Locate by attribute
# r = tree_.xpath('//div[@class="div1"]')  # => [<Element div at 0x101807e60>]

# Locate by index; indexes start at 1
# r = tree_.xpath('//div[@class="div1"]/p[3]')

# Get text: /text() returns the tag's direct text, //text() returns all text beneath the tag
# text = tree_.xpath('//div[@class="div1"]/p[3]/text()')
# print(text)  # => ['這是第一個div的內容。它使用了類標簽.div3進行樣式定位。']
# text = tree_.xpath('//div[@class="div1"]//text()')
# print(text)

# Get an attribute
# attr = tree_.xpath('//div[@class="div3"]/h2/@class')  # ['div3_h2']
# print(attr)
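
For readers without lxml installed, the same ideas can be tried with the standard library's xml.etree.ElementTree. Note the swap: ElementTree supports only a subset of XPath (attribute predicates and 1-based indexes work, while /text() and /@attr steps do not), so this is a sketch of the concepts rather than of lxml's API. The page below is a trimmed-down, made-up version of the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, made-up version of the sample page.
page = """
<html>
  <body>
    <div class="div1"><h2>First</h2><p>one</p><p>two</p></div>
    <div class="div3"><h2 class="div3_h2">Third</h2><p>three</p></div>
  </body>
</html>
""".strip()
root = ET.fromstring(page)

# Attribute predicate plus a 1-based positional index.
second_p = root.find(".//div[@class='div1']/p[2]")
# ElementTree has no /text() or /@attr steps; read .text and .get() instead.
second_text = second_p.text
h2_class = root.find(".//div[@class='div3']/h2").get("class")
# itertext() walks all nested text, comparable to //text().
div1 = root.find(".//div[@class='div1']")
all_text = [t.strip() for t in div1.itertext() if t.strip()]
print(second_text, h2_class, all_text)
```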

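3.3.4 Scraping PPT templates

This worked. I relied on the site's free templates back in college, so I am not posting the code, to avoid causing it trouble.

3.3.5 Scraping images

https://pic.netbian.com/4kmeinv/

This attempt failed: the site kept showing a human-verification check on entry, so nothing could be scraped. I will try again later.

4. Simulated login

4.1 Captcha recognition

Recognition services charge a fee, so I wrote my own image-to-text recognition instead.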

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2024/12/28 13:44
# @Author : George
from PIL import Image
import pytesseract

# If tesseract.exe is not on the system PATH,
# give the full path to the tesseract executable here
pytesseract.pytesseract.tesseract_cmd = r'D:\Tesseract-OCR\tesseract.exe'

# Open an image file
image_path = 'abc_1.png'  # replace with your image path
image = Image.open(image_path)

# Run OCR; lang selects the language, e.g. 'chi_sim' for Simplified Chinese
text = pytesseract.image_to_string(image, lang='chi_sim')

# Print the recognized text
print(text)

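4.2 Simulated login logic

5. Cookies

5.1 Meaning and purpose

A cookie is a widely used web mechanism for storing small pieces of data on the user's computer. The data is usually sent to the browser by the web server (so cookies are mostly created by the server) and returned to the server by the browser on later visits. Its main uses are:

Session management: cookies can keep a user's session alive. After a user logs in, for example, the server can send the browser a cookie containing a session ID. The browser includes that ID in later requests automatically, so the server can recognize the user and keep managing the session.
Personalization: cookies can store user preferences such as language, theme color, or font size, so the settings are restored on the next visit.
Tracking user behavior: with cookies a site can record and analyze what users do, such as which pages they visit, how long they stay, and which links they click. This information is useful for site optimization and ad targeting.
Security: in some cases cookies also strengthen a site's security, for example by storing an encrypted token used to verify the user's identity.

Cookies have the following characteristics:

Stored on the client: cookie data lives in the user's browser, not on the server. A persistent cookie therefore survives closing the browser or shutting down the computer until it expires or is deleted.
Sent automatically: when the user visits the site a cookie belongs to, the browser sends the relevant cookie data to the server automatically.
Limited size and number: both the size of each cookie and the number of cookies are limited, depending on the browser and the web server configuration.
Expiry: a cookie can be given an expiry time and stays valid until then. A cookie without an expiry time is a session cookie and is deleted when the user closes the browser.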
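
The difference between a session cookie and a persistent cookie shows up in the Set-Cookie value itself. The sketch below parses two hypothetical values with the standard library's http.cookies; the names sessionid and theme are illustrative.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie values as a server might send them.
session_cookie = SimpleCookie()
session_cookie.load("sessionid=abc123; Path=/; HttpOnly")

persistent_cookie = SimpleCookie()
persistent_cookie.load("theme=dark; Max-Age=3600; Path=/")

morsel = session_cookie["sessionid"]
session_value = morsel.value
# No max-age and no expires attribute: a session cookie.
is_session_cookie = not (morsel["max-age"] or morsel["expires"])
# Attribute values are kept as strings.
max_age = persistent_cookie["theme"]["max-age"]
print(session_value, is_session_cookie, max_age)
```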

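5.2 Sessions: meaning and characteristics

I. Basic concept

A session is the mechanism a server uses to track a user's conversation with it. It lets the server recognize the same user across multiple requests and maintain that user's state, such as login status, shopping-cart contents, and preferences.

II. How a session works

Create the session: when a user first visits the site, the server creates a new session object and assigns it a unique session ID.
Send the session ID: the server usually sends this ID to the browser as a cookie. The browser then attaches it to the headers of every later request automatically.
Identify the user: the server reads the session ID from the request headers to identify the user and look up the matching session data. This is how it keeps the user's state across requests.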
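
The three steps can be sketched as a toy in-memory session store. Everything here is illustrative: the SESSIONS dictionary, the function names and the sessionid cookie name are assumptions, and a real server would also persist and expire entries.

```python
import secrets

# In-memory store: session ID -> per-user state.
SESSIONS = {}

def create_session():
    """Step 1: create a session and return its unique, hard-to-guess ID."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {}
    return session_id

def get_session(cookie_header):
    """Step 3: return the data for a 'sessionid=<id>' Cookie header, or None."""
    name, _, session_id = cookie_header.partition("=")
    if name != "sessionid":
        return None
    return SESSIONS.get(session_id)

sid = create_session()                   # first visit
cookie = f"sessionid={sid}"              # step 2: what the browser sends back
get_session(cookie)["cart"] = ["book"]   # a later request stores state
cart_on_next_request = get_session(cookie)["cart"]
unknown = get_session("sessionid=not-a-real-id")
print(cart_on_next_request, unknown)
```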

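III. What sessions are for

Keeping user state: a session lets the server track state such as login status and shopping-cart contents across requests, so users can move between pages without logging in or entering information again.
Personalized service: based on a user's preferences and history, the server can offer personalized content and services, which improves the user experience.
Security: by validating the session ID on each request, the server can block unauthorized access and illegal operations, which helps protect the user's privacy and data.

IV. How sessions and cookies relate

Sessions and cookies are closely related. A cookie is a small piece of data that the server sends to the browser, which stores it locally; a session is the server-side mechanism for tracking a user's conversation. Cookies usually store the session ID so that the server can recognize the user on later requests. In that sense, cookies are one means by which sessions are implemented.

V. Managing sessions

Managing sessions well matters in network programming. Developers need to keep sessions secure, valid, and maintainable. This includes:

Setting a reasonable session lifetime: users should not be logged out unexpectedly after a short pause, while long-idle sessions should still expire.
Protecting the session ID: the session ID identifies the user, so it must be protected against attack and leakage, for example by sending it only over HTTPS or by using a stronger ID-generation algorithm.
Cleaning up invalid sessions: to save server resources and improve performance, invalid session objects should be removed regularly, for example by setting a session expiry time, or by storing session data in a database and purging expired rows periodically.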

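5.3 A property of HTTP and HTTPS

Both are stateless. Even if your first request logged you in, the server does not know that your second request builds on that login.

Cookies let the server keep track of the client's state.

5.4 Handling cookie values

Manual approach: capture the cookie value with a packet-capture tool and put it into headers. This becomes hard to maintain when the cookie changes dynamically.
Session object:
- It can send requests
- If a request produces a cookie, the session object stores it and sends it along on later requests automatically

6. Proxies

Proxies are used when you need to get around IP blocks or restrictions, or to scrape at scale. A proxy server sits between the client and the target server. It hides your real IP address, adds a layer of security, and sometimes speeds requests up.

Test URL:

https://httpbin.org/get

Hardly any free proxies work these days, and my attempt failed too. I came across a CSDN blog post on building a proxy IP pool ("python3之爬蟲代理IP的使用+建立代理IP池"); the code is well written, but almost none of the IPs it collects are usable anymore.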

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2024/12/28 19:14
# @Author : George
import requests

url = "https://www.baidu.com/s?"

# Make the crawler look like a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}

# Proxy settings
proxies = {
    "http": "154.203.132.49:8080",
    "https": "49.73.4.158:8089"
}
param = {
    "wd": "ip"
}

# verify=False skips TLS certificate verification for this request
response = requests.get(url, headers=headers, proxies=proxies, verify=False, params=param)
with open("proxy.html", "wt", encoding="utf-8") as f:
    f.write(response.text)

My company's network also requires a proxy to reach the internet:

# -*- coding: utf-8 -*-
# george
# time: 2024/12/24 7:56 PM
# name: test.py
# comment:
import requests

user = ""
passwd = ""
url = "https://www.baidu.com/"

# Imitate a browser
header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
}
# Set the proxy; without it there is no internet access
# (ip:port is a placeholder for the proxy's address)
proxies = {
    "http": f"http://{user}:{passwd}@ip:port",
    "https": f"https://{user}:{passwd}@ip:port"
}

# Get the response object
response = requests.get(url, headers=header, proxies=proxies)

# Response body as bytes; response.text would return it as a string
page_text = response.content

# Persist to disk
with open("baidu.html", "wb") as f:
    f.write(page_text)
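
If the proxy password contains characters such as "@" or ":", they have to be percent-encoded before being placed in the proxy URL. A possible helper is sketched below; the function build_proxy_url, the credentials and the address are all hypothetical.

```python
from urllib.parse import quote

def build_proxy_url(user, passwd, host, port, scheme="http"):
    """Return scheme://user:passwd@host:port with the credentials percent-encoded."""
    # safe="" also escapes "/" and "@", which would otherwise break the URL.
    return f"{scheme}://{quote(user, safe='')}:{quote(passwd, safe='')}@{host}:{port}"

# Hypothetical credentials and proxy address.
proxy_url = build_proxy_url("george", "p@ss:word", "10.0.0.1", 8080)
proxies = {"http": proxy_url, "https": proxy_url}
print(proxy_url)
```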

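7. Asynchronous crawling (thread pool)

This section was meant to scrape videos, but I could not work out where the mrd parameter in the Ajax request URL comes from, so I am skipping that for now and will come back to it.

7.1 Scraping videos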

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2024/12/29 13:16
# @Author : George
import requests
import chardet
import os
from lxml import etree
"""
"https://www.pearvideo.com/category_1"
['video_1797596', 'video_1797399', 'video_1797404', 'video_1797260']
https://www.pearvideo.com/video_1797596

Ajax requests
https://www.pearvideo.com/videoStatus.jsp?contId=1797596&mrd=0.4308675768914054
https://www.pearvideo.com/videoStatus.jsp?contId=1797399&mrd=0.6241695396585363
"""


class LiVideo(object):
    def __init__(self):
        # Output folder for the videos
        self.extract_to_dir = "./video"
        os.makedirs(self.extract_to_dir, exist_ok=True)
        # Send a Referer to get past the anti-crawler check
        self.header = {
            "Referer": "https://www.pearvideo.com/",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
        }
        # Encoding
        self.encoding = None
        self.home_url = "https://www.pearvideo.com/category_1"

    def get_HTML(self, url_param):
        response = requests.get(url_param, headers=self.header)
        if response.status_code == 200:
            # Detect the encoding automatically with chardet
            self.encoding = chardet.detect(response.content)['encoding']
            response.encoding = self.encoding
            # Build the etree object
            tree = etree.HTML(response.text)
            return tree

    def deal(self):
        home_tree = self.get_HTML(self.home_url)
        home_url_list = home_tree.xpath("//*[@id='listvideoListUl']/li/div/a[1]/@href")
        name_list = home_tree.xpath("//*[@id='listvideoListUl']/li/div/a[1]/div[2]/text()")
        for name, url in zip(name_list, home_url_list):
            url = "https://www.pearvideo.com/" + url
            # Unfinished: the detail page is fetched but not parsed further
            detail_tree = self.get_HTML(url)


if __name__ == '__main__':
    vi = LiVideo()
    vi.deal()

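7.2 Scraping poetry and prose asynchronously

Once a task is submitted to a thread pool, the pool assigns a thread to run it. Submission through a thread pool is asynchronous: the caller does not wait for the task's result and carries on with the next statement.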
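
A small experiment makes the asynchronous submission visible. In this sketch a 0.2-second sleep stands in for a network request; fake_download and the timings are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(name):
    """Stand-in for a slow network request."""
    time.sleep(0.2)
    return f"{name} done"

start = time.time()
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(fake_download, n) for n in ("a", "b", "c")]
    # submit() has already returned; the tasks are still running in the pool.
    submitted_after = time.time() - start
    # result() blocks until the corresponding task has finished.
    results = [f.result() for f in futures]
total = time.time() - start
print(results, round(submitted_after, 2), round(total, 2))
```

The three tasks overlap, so the total stays close to one sleep rather than three.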

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2024/12/29 14:26
# @Author : George
"""
https://www.gushiwen.cn/guwen/Default.aspx?p=1&type=%e5%b0%8f%e8%af%b4%e5%ae%b6%e7%b1%bb

Second level
https://www.gushiwen.cn/guwen/book_4e6b88d8a0bc.aspx
https://www.gushiwen.cn/guwen/book_a09880163008.aspx

Third level
https://www.gushiwen.cn/guwen/bookv_b630af160f65.aspx
"""
import os
import requests
import chardet
from lxml import etree
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import time
from typing import Dict, List, Tuple


class NovelDownloader:
    def __init__(self):
        self.output_dir = "./novels"
        os.makedirs(self.output_dir, exist_ok=True)
        self.headers = {
            "Referer": "https://www.gushiwen.cn/guwen/Default.aspx?p=1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
        }
        self.base_url = "https://www.gushiwen.cn"
        self.home_url = f"{self.base_url}/guwen/Default.aspx?p=1&type=%e5%b0%8f%e8%af%b4%e5%ae%b6%e7%b1%bb"

    def get_html_tree(self, url: str) -> etree._Element:
        """Fetch the page HTML and return it as an etree object"""
        response = requests.get(url, headers=self.headers)
        if response.status_code != 200:
            raise Exception(f"Failed to get {url}, status code: {response.status_code}")
        encoding = chardet.detect(response.content)['encoding']
        response.encoding = encoding
        return etree.HTML(response.text)

    def get_chapter_details(self) -> Dict[str, Dict[str, str]]:
        """Collect the chapter titles and URLs of each book"""
        home_tree = self.get_html_tree(self.home_url)
        # Book links and titles
        urls = home_tree.xpath("//*[@class='sonspic']/div[1]/p[1]/a[1]/@href")[1:3]
        titles = home_tree.xpath("//*[@class='sonspic']/div[1]/p[1]/a[1]/b/text()")[1:3]
        book_details = {}
        for title, url in zip(titles, urls):
            detail_tree = self.get_html_tree(f"{self.base_url}{url}")
            # Chapter links and titles
            chapter_urls = [f"{self.base_url}{url}" for url in
                            detail_tree.xpath("//*[@class='bookcont']/ul/span/a/@href")]
            chapter_titles = detail_tree.xpath("//*[@class='bookcont']/ul/span/a/text()")
            book_details[title] = dict(zip(chapter_titles, chapter_urls))
        return book_details

    def download_novel(self, title: str, chapters: Dict[str, str]):
        """Download a single book"""
        output_path = os.path.join(self.output_dir, f"{title}.txt")
        print(f"Downloading {title}".center(100, "="))
        with open(output_path, "w", encoding="utf-8") as f:
            for chapter_title, chapter_url in chapters.items():
                response = requests.get(chapter_url, headers=self.headers)
                soup = BeautifulSoup(response.text, "lxml")
                f.write(f"{chapter_title}:\n")
                for paragraph in soup.select('.contson > p'):
                    f.write(f"{paragraph.text}\n")
                f.write("\n\n")
                print(f"{chapter_title} downloaded".center(20, "-"))
        print(f"{title} downloaded".center(100, "="))


def main():
    start_time = time.time()
    downloader = NovelDownloader()
    books = downloader.get_chapter_details()
    # Download the books in parallel with a thread pool
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = [pool.submit(downloader.download_novel, title, chapters)
                   for title, chapters in books.items()]
    # Re-raise any exception that occurred inside a worker thread
    for future in futures:
        future.result()
    print(f"Total time: {time.time() - start_time:.2f} s")


if __name__ == '__main__':
    main()

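8. Asynchronous crawling (coroutines)

I will not repeat the basics of coroutines here; see CSDN.

This section implements an asynchronous crawler on a single thread with coroutines.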
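
The single-thread claim can be checked with a small experiment. In this sketch asyncio.sleep stands in for an awaitable network call; fake_request and the timings are illustrative.

```python
import asyncio
import threading
import time

async def fake_request(name):
    """Stand-in for an awaitable network call."""
    await asyncio.sleep(0.2)  # hands control back to the event loop while waiting
    return name, threading.current_thread().name

async def main():
    # gather runs the coroutines concurrently and returns results in call order.
    return await asyncio.gather(*(fake_request(n) for n in ("jar", "bobo", "test")))

start = time.time()
results = asyncio.run(main())
elapsed = time.time() - start
thread_names = {thread for _, thread in results}
print(results, round(elapsed, 2))
```

All three coroutines report the same thread name, and the elapsed time stays close to a single 0.2-second wait.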

8.1 Building a test server with Flask

I only studied Flask briefly. It does not look complicated, and I plan to learn it in depth later.

pip3 install Flask -i  https://mirrors.aliyun.com/pypi/simple

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2024/12/31 21:47
# @Author : George
"""
@app.route('/'): a decorator that tells Flask which URL should trigger the function below it.
In this example it is the root URL (the site's home page).

return 'Hello, World!': the return value of the view function.
When a user visits the root URL, this string is sent back to the browser.
"""
from flask import Flask, request, jsonify
import time


def get_client_ip(request):
    # Behind a reverse proxy, prefer the X-Forwarded-For header
    forwarded = request.headers.get('X-Forwarded-For', '')
    if forwarded:
        # The first address is usually the client's real IP
        client_ip = forwarded.split(',')[0].strip()
    else:
        client_ip = request.remote_addr
    return client_ip


app = Flask(__name__)


@app.route('/')
def index():
    return 'Hello, World!'


@app.route('/bobo')
def index_bobo():
    # Read the client's User-Agent and Referer
    user_agent = request.headers.get('User-Agent')
    user_referer = request.headers.get('Referer')
    print(user_agent, user_referer)
    # Client IP; falls back to remote_addr when X-Forwarded-For is absent
    client_ip = get_client_ip(request)
    print({'client_ip': client_ip})
    time.sleep(3)
    return 'bobo'


@app.route('/jar')
def index_jar():
    time.sleep(3)
    return 'jar'


@app.route('/test')
def index_test():
    time.sleep(3)
    return 'test'


if __name__ == '__main__':
    app.run(threaded=True)
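
One pitfall when reading X-Forwarded-For: splitting an empty header value gives [''], a non-empty list, so a truthiness check on the split result never falls back to the socket address and the client IP comes out empty. A minimal sketch, with a plain dict standing in for the request headers (client_ip_from and the sample addresses are illustrative):

```python
def client_ip_from(headers, remote_addr):
    """Return the first X-Forwarded-For address, or remote_addr when the header is absent."""
    forwarded = headers.get("X-Forwarded-For", "")
    if forwarded:  # test the string itself, not the list produced by split()
        return forwarded.split(",")[0].strip()
    return remote_addr

empty_split = "".split(",")  # a list holding one empty string, which is truthy
direct = client_ip_from({}, "127.0.0.1")
proxied = client_ip_from({"X-Forwarded-For": "203.0.113.7, 10.0.0.2"}, "10.0.0.2")
print(empty_split, direct, proxied)
```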

8.2 An asynchronous crawler with aiohttp

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2024/12/31 22:08
# @Author : George
import asyncio
import requests
import time
import aiohttp
from threading import current_thread

"""
async def request(url): # took 9.054526805877686 s; requests is synchronous and blocking,
so it has to be replaced with calls from an asynchronous library such as aiohttp
"""
urls = [
    "http://127.0.0.1:5000/jar",
    "http://127.0.0.1:5000/bobo",
    "http://127.0.0.1:5000/test"
]
start = time.time()


# Took 9.054526805877686 s: requests is a synchronous operation
# async def request(url):
#     print("Start downloading", url)
#     response = requests.get(url=url)
#     print(response.text)
#     print("Download finished", url)
#     print("------------")


async def request_2(url):
    print("Start downloading", url, current_thread())
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
        "Referer": "https://movie.douban.com/explore"
    }
    # Proxy (defined here but not passed to session.get below)
    proxy = "http://8.220.204.215:8008"
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as response:
            text = await response.text()
            print(text)
    print("Download finished", url)


async def main():
    tasks = []
    for url in urls:
        tasks.append(asyncio.create_task(request_2(url)))
    await asyncio.wait(tasks)


asyncio.run(main())
end = time.time()
print("Elapsed:", end - start)

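9. Using Selenium

Conveniently retrieves data that a page loads dynamically
Conveniently implements simulated login

Selenium: a module for browser automation

See: Getting Started | Selenium

The core code below uses Selenium to automate the browser: it types in an anime title and plays the episodes automatically.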

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2025/1/1 16:04
# @Author : George
"""
In Selenium 4 the driver path is no longer set directly in webdriver.Chrome;
it is passed through a Service object instead.
This avoids the deprecation warning and ensures the driver loads correctly.
"""
import os.path

from selenium import webdriver
from selenium.webdriver.edge.service import Service as EdgeService
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time
from lxml import etree
import requests
import chardet
import json
from log_test import logger  # the author's own logging module


class VideoAutoPlay(object):
    def __init__(self):
        self.driver = None
        self.count_num = 1
        self.movies_dict = None
        self.home_url = "https://www.agedm.org/"
        self.headers = {
            "Referer": "https://www.agedm.org/",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"
        }

    def get_html_tree(self, url: str) -> etree._Element:
        """Fetch the page HTML and return it as an etree object"""
        response = requests.get(url, headers=self.headers)
        if response.status_code != 200:
            raise Exception(f"Failed to get {url}, status code: {response.status_code}")
        encoding = chardet.detect(response.content)['encoding']
        response.encoding = encoding
        return etree.HTML(response.text)

    def movies_input(self, movie_name, n):
        try:
            if not self.driver:
                self.driver = self.setup_driver()
            self.driver.get(self.home_url)
            # Find the search box
            search_box = self.driver.find_element(By.ID, "query")
            # Type the search text
            search_box.send_keys(movie_name)
            # Submit the search form
            search_box.send_keys(Keys.RETURN)
            # Wait for the search results to load
            # WebDriverWait(self.driver, 10).until(
            #     EC.presence_of_element_located((By.CLASS_NAME, "content_left"))
            # )
            # Second-level request
            self.driver.get(f"https://www.agedm.org/search?query={movie_name}")
            # Find the "play online" button
            # wait = WebDriverWait(self.driver, 10)  # 10-second timeout
            button = self.driver.find_element(By.XPATH, "//*[@class='video_btns']/a[1]")
            # print(button)
            button.click()
            # Wait for the episode list to load
            WebDriverWait(self.driver, 30).until(
                EC.presence_of_element_located((By.CLASS_NAME, "tab-content"))
            )
            detail_url = self.driver.current_url
            tree = self.get_html_tree(detail_url)
            url_dict = {}
            titles = tree.xpath("//*[@class='video_detail_episode']/li/a/text()")[n - 1:]
            urls = tree.xpath("//*[@class='video_detail_episode']/li/a/@href")[n - 1:]
            for title, url in zip(titles, urls):
                url_dict[title] = url
            # print(url_dict)
            return url_dict
        except Exception as e:
            logger.error(f"Error during search: {str(e)}")
            if self.driver:
                self.driver.quit()
                self.driver = None
            return {}

    def setup_driver(self):
        # Settings that improve video playback
        options = webdriver.EdgeOptions()
        options.add_argument('--disable-gpu')  # disable GPU acceleration
        options.add_argument('--no-sandbox')  # disable the sandbox
        options.add_argument('--disable-dev-shm-usage')  # do not use /dev/shm
        options.add_argument('--disable-software-rasterizer')  # disable software rasterization
        options.add_argument('--disable-extensions')  # disable extensions
        options.add_argument('--disable-infobars')  # disable info bars
        options.add_argument('--autoplay-policy=no-user-gesture-required')  # allow autoplay
        options.add_experimental_option('excludeSwitches', ['enable-automation'])  # hide the automation notice
        options.add_experimental_option("useAutomationExtension", False)  # disable the automation extension
        # Point to the driver executable
        service = EdgeService(executable_path='./msedgedriver.exe')
        return webdriver.Edge(options=options, service=service)

    def video_player(self, movie_name, n):
        try:
            url_dict = self.movies_input(movie_name, n)
            if not url_dict:
                logger.error("No playable video links found")
                return
            if not self.driver:
                self.driver = self.setup_driver()
            for title, url in url_dict.items():
                try:
                    # Open the episode page
                    self.driver.get(url)
                    # Switch to the video iframe
                    self.driver.switch_to.frame("iframeForVideo")
                    logger.info(f"Start time: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())}")
                    # Wait for the video element to load
                    wait = WebDriverWait(self.driver, 20)
                    video_element = wait.until(
                        EC.presence_of_element_located((By.CLASS_NAME, "art-video"))
                    )
                    logger.info(f"Found video element: {video_element}")
                    logger.info(f"{movie_name}: episode {title} starting".center(10, "="))
                    # Once the video has loaded, go full screen and start playback
                    page_stage = True
                    while page_stage:
                        paused_state = self.driver.execute_script("return arguments[0].paused;", video_element)
                        print("paused_state", paused_state)
                        if not paused_state:
                            break
                        # Double-click to make the video full screen
                        ActionChains(self.driver).double_click(video_element).perform()
                        time.sleep(3)
                        # Single-click to play if the video is still paused
                        paused_state = self.driver.execute_script("return arguments[0].paused;", video_element)
                        if paused_state:
                            logger.info(f"{movie_name}: episode {title} double-click full screen triggered")
                            ActionChains(self.driver).click(video_element).perform()
                    # # Click the video to start playback
                    # action = ActionChains(self.driver)
                    # action.move_to_element_with_offset(video_element, 100, 100).click().perform()
                    print(f"Playback click time: {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())}")
                    # Wait for the episode to finish playing
                    time.sleep(60 * 21)
                    n = n + 1
                    self.update_movie_count(movie_name, n)
                    logger.info(f"Episode {title} finished".center(10, "="))
                except Exception as e:
                    logger.error(f"Error while playing episode {title}: {str(e)}")
                    continue
        except Exception as e:
            logger.error(f"Video playback failed: {str(e)}")
        finally:
            if self.driver:
                self.driver.quit()
                self.driver = None

    def __del__(self):
        """Make sure the driver is closed when the object is destroyed"""
        if self.driver:
            try:
                self.driver.quit()
            except Exception:
                pass

    def count_read(self, movies, n=1):
        # count.json caches the playback progress
        if not os.path.exists("./count.json"):
            with open("./count.json", "w", encoding="utf-8") as f:
                self.movies_dict = {movies: n}
                json.dump({movies: n}, f, ensure_ascii=False, indent=4)
            return n
        else:
            with open("./count.json", "r", encoding="utf-8") as f:
                self.movies_dict = json.load(f)
            if not self.movies_dict.get(str(movies)):  # title not in the cache yet
                self.movies_dict.update({str(movies): n})
                with open("./count.json", "w", encoding="utf-8") as f:
                    json.dump(self.movies_dict, f, ensure_ascii=False, indent=4)
                return n
            else:  # title found in the cache
                self.count_num = self.movies_dict[str(movies)]
                return self.count_num

    # Update the cached episode count
    def update_movie_count(self, movies, n):
        with open("./count.json", "w", encoding="utf-8") as f:
            self.movies_dict[str(movies)] = n
            json.dump(self.movies_dict, f, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    video = VideoAutoPlay()
    movie = "火影忍者 疾風傳"
    # movie = "神之塔 第二季"
    count_num = video.count_read(movie, 1)
    # video.update_movie_count(movie, 4)
    video.video_player(movie, count_num)