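1. Basic usage of requests
1.1 Introduction to requests
requests is a widely used third-party Python library for sending HTTP requests; it greatly simplifies interaction with web services. Its own documentation once jokingly described it as the only "non-GMO" HTTP library for Python, safe for human consumption.
1.2 Installing requests
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple requests
1.3 Basic syntax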
import requests
url = 'http://www.baidu.com'
response = requests.get(url)
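1.4 The type and attributes of the response
(1) One type:
print(type(response)) # <class 'requests.models.Response'>
(2) Six attributes:
# Set the encoding used to decode the response body
response.encoding = 'utf-8'
# Return the page source as a string
print(response.text)
# Get the URL of the response (not the request headers)
print(response.url)
# Return the body as raw bytes
print(response.content)
# Return the HTTP status code
print(response.status_code)
# Get the response headers
print(response.headers)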
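To see why setting response.encoding matters, recall that response.text is response.content decoded with response.encoding. The sketch below reproduces that decoding step on a hand-made byte string, so it runs without a network connection. The sample text and the variable names are illustrative assumptions, not part of the original example.

```python
# Bytes as a server might send them: UTF-8 encoded page text (sample data).
raw_body = "鄭州天氣".encode("utf-8")

# Decoding with the wrong charset yields mojibake ...
garbled = raw_body.decode("iso-8859-1")
# ... while the right charset recovers the original text.
readable = raw_body.decode("utf-8")

print(repr(garbled))
print(readable)
```

requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset, which is the usual reason Chinese pages come out garbled until encoding is set explicitly.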
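2. GET requests with requests
Here we scrape the Baidu search results page for "鄭州" (Zhengzhou). The process is much the same as with urllib; if you understand urllib, GET requests with requests should pose no difficulty.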
import requests

url = 'https://www.baidu.com/s?'
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
}
data = {
    "wd": "鄭州",
}
# url: resource path; params: query parameters; kwargs: a dict of further keyword arguments
response = requests.get(url=url, params=data, headers=headers)
content = response.text
print(content)
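Differences from a urllib GET request:
1. Parameters are passed via params.
2. Parameters do not need to be urlencoded.
3. No custom Request object is needed.
4. The ? in the resource path can be omitted.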
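Points 2 and 4 can be checked without sending anything. The sketch below, assuming requests is installed, uses requests.Request(...).prepare() to build the final URL only. build_url is a hypothetical helper name introduced for illustration.

```python
from urllib.parse import parse_qs, urlsplit

import requests


def build_url(base_url, params):
    """Return the URL requests would use for a GET, without sending it."""
    prepared = requests.Request("GET", base_url, params=params).prepare()
    return prepared.url


with_mark = build_url("https://www.baidu.com/s?", {"wd": "鄭州"})
without_mark = build_url("https://www.baidu.com/s", {"wd": "鄭州"})

print(with_mark)
# The query string was percent-encoded for us and decodes back to the input.
print(parse_qs(urlsplit(with_mark).query))
# The trailing "?" makes no difference to the result.
print(with_mark == without_mark)
```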
3. POST requests with requests
As before, we use the Baidu Translate POST example from the urllib section:
import json

import requests

url = "https://fanyi.baidu.com/sug"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
    # The original cookie value was tied to one browser session and is omitted here;
    # paste the cookie string copied from your own browser's developer tools.
    "cookie": "<your cookie string>",
}
data = {
    "kw": "eye",
}
response = requests.post(url=url, headers=headers, data=data)
content = response.text
# The body is JSON text; parse it into Python objects
content = json.loads(content)
print(content)
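Differences from a urllib POST request:
1. A POST request needs no manual encoding or decoding.
2. POST parameters are passed via data.
3. No custom Request object is needed.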
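The endpoint answers with JSON, which is why the example calls json.loads on response.text; response.json() does the same in one step. A minimal sketch of the parsing step on a hand-written payload follows. The payload shape (errno, data, k, v) is an assumption for illustration, not a captured response.

```python
import json

# Hypothetical payload, serialized the way many APIs send it:
# non-ASCII characters appear as \uXXXX escapes in the raw text.
sample_text = json.dumps({"errno": 0, "data": [{"k": "eye", "v": "n. 眼睛"}]})
print(sample_text)

# json.loads turns the text into Python objects and decodes the escapes.
parsed = json.loads(sample_text)
for entry in parsed["data"]:
    print(entry["k"], "->", entry["v"])
```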
4. Proxies
import requests

url = "http://www.baidu.com/s?"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
}
data = {
    "wd": "ip",
}
# Proxy to route the request through; the key is the scheme of the target URL,
# and the proxy address should include its own scheme
proxy = {
    "http": "http://23.247.137.142:80",
}
response = requests.get(url=url, params=data, headers=headers, proxies=proxy)
content = response.text
# Save the result page so the reported IP can be inspected in a browser
with open("ip.html", "w", encoding="utf-8") as file:
    file.write(content)
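The comment in the original example mentions a proxy pool, while the code uses a single proxy. One way to rotate through several proxies is sketched below. The addresses come from a reserved documentation range and the helper name pick_proxy is hypothetical.

```python
import random

# Placeholder addresses from the TEST-NET-1 documentation range (192.0.2.0/24);
# replace them with proxies you are allowed to use.
PROXY_POOL = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


def pick_proxy(pool):
    """Return a proxies dict, in the form requests expects, for one random proxy."""
    address = random.choice(pool)
    return {"http": address, "https": address}


proxies = pick_proxy(PROXY_POOL)
print(proxies)
```

The resulting dict can be passed as the proxies argument of requests.get, exactly like the single-proxy dict above.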
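5. Logging in with cookies
We use the personal home page of the Gushiwen site (gushiwen.cn) as an example; its login form includes a CAPTCHA.
First, open the login page, enter an arbitrary password, and open the browser's developer tools. Find the login request and look at its payload, which contains a number of fields:
__VIEWSTATE: MnTNH2SbI9isHX8zdfu1NvmByZXoSVf8Vxj5QIeJ5C8EmgWhaBFQRNjQYMe47E+qOO+ss1LSDNdjYeNRy/bdvD7wktgbMm73Cku21k7NhLMYo79CC54kuz//cZ9kSLKKFvkpppzOssnyET3GX789uH1DMUM=
__VIEWSTATEGENERATOR: C93BE1AE
These two values are not fixed; they change between requests, and so does code. Handling these three variables is the hard part of this example.
Difficulty (1): __VIEWSTATE and __VIEWSTATEGENERATOR
Back on the login page, inspect the source: both variables are present there, in input elements of type hidden, which we call hidden fields.
Fetch the source of the login page: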
import requests
url = "https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx"
headers = {"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)
content = response.text
Parse the value of __VIEWSTATE and __VIEWSTATEGENERATOR. BeautifulSoup works for this, and so does XPath:
from lxml import etree
tree = etree.HTML(content)
# xpath() returns a list, so take the first match
viewstate = tree.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
viewstategenerator = tree.xpath('//input[@name="__VIEWSTATEGENERATOR"]/@value')[0]
print(viewstate)
print(viewstategenerator)
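For readers without lxml, the same hidden fields can be read with the standard library. The sketch below uses html.parser on a hand-written snippet. The HTML and the class name HiddenFieldParser are assumptions for illustration; the real login page contains far more markup.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") == "hidden":
            self.fields[attributes.get("name")] = attributes.get("value", "")


# Made-up sample markup standing in for the login page source.
sample_html = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123=" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" value="C93BE1AE" />
  <input type="text" name="email" />
</form>
"""

parser = HiddenFieldParser()
parser.feed(sample_html)
print(parser.fields)
```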
Difficulty (2): the code CAPTCHA (fetching the CAPTCHA image)
code = tree.xpath('//img[@id="imgCode"]/@src')[0]
code_url = "https://so.gushiwen.cn"+code
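String concatenation works here because the src attribute is a site-relative path. urllib.parse.urljoin is a slightly more robust way to build the same URL, since it also copes with an src that is already absolute. The sample path below is a made-up placeholder, not the site's actual value.

```python
from urllib.parse import urljoin

base = "https://so.gushiwen.cn"

# Relative path (hypothetical sample value for the img src attribute).
relative_result = urljoin(base, "/RandCode.ashx")
# An src that is already absolute is returned unchanged.
absolute_result = urljoin(base, "https://so.gushiwen.cn/RandCode.ashx")

print(relative_result)
print(absolute_result)
```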
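Once we have the CAPTCHA image URL, we can download the image, read the code by eye, and type it at the console. (pytesseract could also be used to recognize it.)
import urllib.request
urllib.request.urlretrieve(url=code_url, filename="code.jpg")
code_name = input("Enter the CAPTCHA: ")
But this approach is flawed. Each request for the CAPTCHA URL produces a fresh image, and urlretrieve shares no cookies with the later login request, so by the time we log in, the code we typed is stale and no longer matches what the server expects. The fix is requests.session(): it returns a Session object, and every request made through that object shares the same cookies.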
session = requests.session()
response_code = session.get(code_url)
# Use the binary body here, because we are downloading an image
content_code = response_code.content
# "wb" mode writes binary data to the file
with open("code.jpg", "wb") as f:
    f.write(content_code)
code_name = input("Enter the CAPTCHA: ")
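The effect of a Session can be seen without contacting the site. In the sketch below, assuming requests is installed, a cookie is placed in the session by hand, standing in for one the server would set when the CAPTCHA is fetched. The cookie name and value are made up.

```python
import requests

login_url = "https://so.gushiwen.cn/user/login.aspx"

session = requests.session()
# Stand-in for a cookie the server would set on the CAPTCHA response.
session.cookies.set("ASP.NET_SessionId", "abc123")

# A request prepared through the session carries the stored cookie ...
via_session = session.prepare_request(requests.Request("POST", login_url))
# ... while a standalone request starts with no cookies at all.
standalone = requests.Request("POST", login_url).prepare()

print(via_session.headers.get("Cookie"))
print(standalone.headers.get("Cookie"))
```

This is why both the CAPTCHA download and the login POST go through the same session object: the server can then tie the typed code to the image it actually served.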
Find the endpoint that the login button posts to, and send the form through the same session:
url_post = "https://www.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fwww.gushiwen.cn%2fuser%2fcollect.aspx"
data_post = {
    "__VIEWSTATE": viewstate,
    "__VIEWSTATEGENERATOR": viewstategenerator,
    "from": "http://www.gushiwen.cn/user/collect.aspx",
    # Replace the two placeholders with your own account and password
    "email": "your_account",
    "pwd": "your_password",
    "code": code_name,
    "denglu": "登錄",
}
response_post = session.post(url=url_post, headers=headers, data=data_post)
content_post = response_post.text
# Save the returned page to check whether the login succeeded
with open("古詩文.html", "w", encoding="utf-8") as f:
    f.write(content_post)
The complete code:
import requests
from lxml import etree

url = "https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
}
# Fetch the login page and read the two hidden fields
response = requests.get(url, headers=headers)
content = response.text
tree = etree.HTML(content)
viewstate = tree.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
viewstategenerator = tree.xpath('//input[@name="__VIEWSTATEGENERATOR"]/@value')[0]
# Build the CAPTCHA image URL
code = tree.xpath('//img[@id="imgCode"]/@src')[0]
code_url = "https://so.gushiwen.cn" + code
# Download the CAPTCHA through a session so the login request shares its cookies
session = requests.session()
response_code = session.get(code_url)
# Use the binary body here, because we are downloading an image
content_code = response_code.content
# "wb" mode writes binary data to the file
with open("code.jpg", "wb") as f:
    f.write(content_code)
code_name = input("Enter the CAPTCHA: ")
# Submit the login form through the same session
url_post = "https://www.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fwww.gushiwen.cn%2fuser%2fcollect.aspx"
data_post = {
    "__VIEWSTATE": viewstate,
    "__VIEWSTATEGENERATOR": viewstategenerator,
    "from": "http://www.gushiwen.cn/user/collect.aspx",
    # Replace the two placeholders with your own account and password
    "email": "your_account",
    "pwd": "your_password",
    "code": code_name,
    "denglu": "登錄",
}
response_post = session.post(url=url_post, headers=headers, data=data_post)
content_post = response_post.text
# Save the returned page to check whether the login succeeded
with open("古詩文.html", "w", encoding="utf-8") as f:
    f.write(content_post)