This project is for learning purposes only.
1 Scrapy code
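The crawl logic is very simple: pagination is handled through the URL, the links to the detail pages are collected from each listing page, and the content of each detail page is then scraped. The final data is written to an Excel file.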
In testing, the spider retrieved a total of 11,299 Chinese herbal medicine records.
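The pagination and link-resolution steps described above can be sketched with the standard library alone. This is only an illustration, separate from the spider itself: the helper names `listing_urls` and `detail_url` are made up for this sketch, and the example href is hypothetical. The URL pattern and the page range 1 to 26 come from the spider's `start_urls`. Scrapy's `response.urljoin` does roughly what `urllib.parse.urljoin` does here, resolving a link against the page it was found on.

```python
from urllib.parse import urljoin

# Listing-page URL pattern taken from the spider's start_urls (pages 1 to 26).
BASE = "https://www.zysj.com.cn/zhongyaocai/index__{}.html"


def listing_urls(first=1, last=26):
    """Return the listing-page URLs for pages first..last inclusive."""
    return [BASE.format(i) for i in range(first, last + 1)]


def detail_url(listing_url, href):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(listing_url, href)


pages = listing_urls()
# "/zhongyaocai/example.html" is a hypothetical href used only for illustration.
example = detail_url(pages[0], "/zhongyaocai/example.html")
print(len(pages), pages[0], example)
```

Because the page number is part of the URL, every listing page can be requested up front and no "next page" link has to be followed.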
import pandas as pd
import scrapy


class ZhongyaoSpider(scrapy.Spider):
    name = "zhongyao"
    # Listing pages 1 to 26; the page number is part of the URL.
    start_urls = [f"https://www.zysj.com.cn/zhongyaocai/index__{i}.html" for i in range(1, 27)]

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialisation first.
        super().__init__(*args, **kwargs)
        # Scraped items are collected here and written out in closed().
        self.data = []

    def parse(self, response):
        for li in response.css('div#list-content ul li'):
            a_tag = li.css('a')
            title = a_tag.css('::attr(title)').get()
            href = a_tag.css('::attr(href)').get()
            if title and href:
                # Build the full detail-page URL
                detail_url = response.urljoin(href)
                yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'title': title})

    # Parsing logic
    def parse_detail(self, response):
        title = response.meta['title']
        pinyin = response.css('div.item.pinyin_name_phonetic div.item-content::text').get(default='').strip()
        alias = response.css('div.item.alias div.item-content p::text').get(default='').strip()
        english_name = response.css('div.item.english_name div.item-content::text').get(default='').strip()
        # The original reused the alias selector here, which looks like a copy-paste slip.
        # 'div.item.source' is an assumed class name following the div.item.<field> pattern.
        source = response.css('div.item.source div.item-content p::text').get(default='').strip()
        # Nature and flavor
        flavor = response.css('div.item.flavor div.item-content p::text').get(default='').strip()
        # The original reused the flavor selector here; the class name below is assumed in the same way.
        functional_indications = response.css('div.item.functional_indications div.item-content p::text').get(default='').strip()
        usage = response.css('div.item.usage div.item-content p::text').get(default='').strip()
        excerpt = response.css('div.item.excerpt div.item-content::text').get(default='').strip()
        # Habitat (must be assigned, because the item dict below uses it)
        habitat = response.css('div.item.habitat div.item-content p::text').get(default='').strip()
        # Provenance
        provenance = response.css('div.item.provenance div.item-content p::text').get(default='').strip()
        # Shape and properties
        shape_properties = response.css('div.item.shape_properties div.item-content p::text').get(default='').strip()
        # Channel tropism
        attribution = response.css('div.item.attribution div.item-content p::text').get(default='').strip()
        # Original plant form
        prototype = response.css('div.item.prototype div.item-content p::text').get(default='').strip()
        # Discussions by noted physicians
        discuss = response.css('div.item.discuss div.item-content p::text').get(default='').strip()
        # Chemical composition
        chemical_composition = response.css('div.item.chemical_composition div.item-content p::text').get(default='').strip()

        item = {
            'title': title,
            'pinyin': pinyin,
            'alias': alias,
            'source': source,
            'english_name': english_name,
            'habitat': habitat,
            'flavor': flavor,
            'functional_indications': functional_indications,
            'usage': usage,
            'excerpt': excerpt,
            'provenance': provenance,
            'shape_properties': shape_properties,
            'attribution': attribution,
            'prototype': prototype,
            'discuss': discuss,
            'chemical_composition': chemical_composition,
        }
        self.data.append(item)
        yield item

    def closed(self, reason):
        # When the spider closes, save the data to an Excel file.
        # Writing .xlsx with pandas needs an Excel engine such as openpyxl installed.
        df = pd.DataFrame(self.data)
        df.to_excel('zhongyao_data.xlsx', index=False)
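One detail of the extraction code is worth noting: `.get()` returns only the first matching text node, so a field that spans several `<p>` elements keeps just its first piece. If the full text is wanted, one option is to call `.getall()` on the same selector and join the pieces. A minimal sketch of the joining step follows. `join_text` is a hypothetical helper, not part of the original project, and the sample fragments are invented placeholders.

```python
def join_text(fragments):
    """Join text nodes into one string, dropping empty or whitespace-only pieces."""
    cleaned = (piece.strip() for piece in fragments)
    return "\n".join(piece for piece in cleaned if piece)


# Placeholder fragments standing in for the list that .getall() would return.
fragments = ["  first paragraph ", "\n", "second paragraph  "]
joined = join_text(fragments)
print(joined)
```

In the spider this would be used as, for example, `join_text(response.css('div.item.flavor div.item-content p::text').getall())` in place of the `.get(default='').strip()` call.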