Regular Expression Practice
- Tools
- Purpose
- Code
- Results
Tools
PyCharm
Purpose
'''
https://www.77xsw.cc/fenlei/1_1/: URL of page 1
https://www.77xsw.cc/fenlei/1_2/: URL of page 2
...
https://www.77xsw.cc/fenlei/1_10/: URL of page 10
'''
Code
import requests
import re
import json

novel_list = []
for i in range(1, 11):
    # Build the URL of page i
    url = 'https://www.77xsw.cc/fenlei/1_' + str(i) + '/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    }
    # Send the request
    response = requests.get(url, headers=headers)
    # print(response.text)
    # Data processing: when working out the regex, look at the raw
    # (un-prettified) response text
    data = response.text
    # A Chinese-character class does not match Chinese punctuation
    # rule = '<span class="sp_2"><a href="(.*?)".*?title="[\u4e00-\u9fa5]*">([\u4e00-\u9fa5]*?)</a></span><span class="sp_3">' # not
    # rule = '<span class="sp_2"><a href="(.*?)".*?title=".*?">(.*?)</a></span><span class="sp_3">' # ok
    rule = '<span class="sp_2"><a href="(.*?)".*?>(.*?)</a></span><span class="sp_3">'
    result = re.findall(rule, data)  # list of (href, title) tuples, one per match
    # Add this page's matches once (extending inside a loop over result
    # would append the same page many times)
    novel_list.extend(result)
# print(novel_list)
novel_tuple = tuple(novel_list)  # converting to a tuple does not remove duplicates
novel_dict = dict([i[1], i[0]] for i in novel_tuple)  # title -> href; repeated titles collapse here
print(novel_dict, len(novel_dict))  # 40 per page, so 10 pages should give 400; the result was 397 because the pages contain duplicates
# Save the data as JSON
with open('novel.json', 'w', encoding='utf8') as f:
    json.dump(novel_dict, f, indent=2, ensure_ascii=False)
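The note in the code about Chinese punctuation can be checked offline. The following is a minimal sketch, assuming a made-up HTML fragment (`sample`, with hypothetical hrefs and titles) shaped like the markup the patterns target. It is not the site's actual response. The names `strict` and `loose` are also introduced here only for illustration.

```python
import re

# Hypothetical fragment imitating the list markup; the second title
# contains a full-width colon, which is Chinese punctuation.
sample = (
    '<span class="sp_2"><a href="/book/1/" title="斗破苍穹">斗破苍穹</a></span><span class="sp_3">'
    '<span class="sp_2"><a href="/book/2/" title="重生:我的时代">重生:我的时代</a></span><span class="sp_3">'
)

# Pattern restricted to the CJK range \u4e00-\u9fa5 (marked "not" in the code above)
strict = r'<span class="sp_2"><a href="(.*?)".*?title="[\u4e00-\u9fa5]*">([\u4e00-\u9fa5]*?)</a></span><span class="sp_3">'
# Pattern that accepts any characters (the one the script uses)
loose = r'<span class="sp_2"><a href="(.*?)".*?>(.*?)</a></span><span class="sp_3">'

strict_hits = re.findall(strict, sample)
loose_hits = re.findall(loose, sample)

print(len(strict_hits))  # the title with punctuation is skipped
print(len(loose_hits))   # both titles are captured
```

The full-width colon lies outside the `\u4e00-\u9fa5` range, so the restricted pattern silently drops that entry, while the permissive pattern keeps it.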
Results
See the attached resource.
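The code comment reports 397 entries instead of 400 because some titles repeat across pages. One way to see which titles repeat is to count them before building the dictionary. This is a rough sketch on hypothetical data (`pairs` stands in for the list of `(href, title)` tuples returned by `re.findall`), not output from the actual site.

```python
from collections import Counter

# Hypothetical (href, title) tuples shaped like the re.findall output
pairs = [
    ('/book/1/', 'Title A'),
    ('/book/2/', 'Title B'),
    ('/book/1/', 'Title A'),  # same novel listed again on a later page
]

# Count how often each title occurs and keep those seen more than once
title_counts = Counter(title for _, title in pairs)
duplicates = {title: n for title, n in title_counts.items() if n > 1}

# A dict keyed by title keeps a single entry per title
novel_dict = {title: href for href, title in pairs}

print(len(pairs), len(novel_dict))
print(duplicates)
```

The difference between `len(pairs)` and `len(novel_dict)` is the number of entries lost to repeated titles.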