Scrapy overview
Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.
Installing Scrapy
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
An older version installed at first raised builtins.AttributeError: module 'OpenSSL.SSL' has no attribute 'SSLv3_METHOD'; upgrading to the latest version, 2.10.0, fixed the problem.
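A quick way to confirm which version actually got installed:

scrapy version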
Using Scrapy
Creating a project and its structure
Create the project
scrapy startproject <project-name>
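The command generates a directory skeleton roughly like the following (minor details may vary between Scrapy releases):

project_name/
    scrapy.cfg            # deploy configuration
    project_name/
        __init__.py
        items.py          # item definitions (data models)
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py   # spider files are created in this package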
Custom spider classes
Create the spider file
scrapy genspider <spider-name> <web-address>
scrapy genspider MyTestSpider www.baidu.com
Normally you should not include the http:// protocol in the address: genspider derives start_urls from allowed_domains, so if you pass a full URL with the protocol you will have to fix start_urls by hand.
import scrapy

class MytestSpider(scrapy.Spider):
    # the spider's name, the value used when running the crawl
    name = 'MyTestSpider'
    # domains the spider is allowed to visit
    allowed_domains = ['www.baidu.com']
    # the starting URLs, i.e. the first addresses to be requested
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        pass
Attributes and methods of the Scrapy response
response.text                          the response body decoded as a string
response.body                          the raw response body as bytes
response.xpath(...)                    run an XPath expression directly against the response content; returns a SelectorList
response.xpath(...).extract()          extract the data values of the matched Selector objects as a list of strings
response.xpath(...).extract_first()    extract the data value of the first Selector in the list (or None if nothing matched)
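A minimal sketch of how these fit together inside a spider's parse method (www.baidu.com is reused from the example above; the XPath expressions are only illustrative):

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://www.baidu.com/']

    def parse(self, response):
        html = response.text   # the full page as a decoded string
        raw = response.body    # the same payload as raw bytes
        # xpath() returns a SelectorList; extract_first() gives the first
        # matched value (or None), extract() gives a list of all values
        title = response.xpath('//title/text()').extract_first()
        links = response.xpath('//a/@href').extract()
        self.logger.info('title=%s, %d links', title, len(links))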
Starting the spider
scrapy crawl <spider-name>
scrapy crawl MyTestSpider
How Scrapy works
1. The engine asks the spiders for URLs to crawl.
2. The engine hands those URLs to the scheduler.
3. The scheduler wraps each URL in a request object, places it in its queue, and dispatches requests from the queue.
4. The engine passes each request to the downloader.
5. The downloader sends the request and fetches the data from the internet.
6. The internet returns the data to the downloader.
7. The downloader returns the data to the engine.
8. The engine hands the data to the spiders.
9. The spiders parse the data and return the results to the engine; any follow-up request goes back to the scheduler (see the sketch after this list).
10. The engine passes the extracted data to the pipelines.
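Most of this machinery never appears in user code: a spider only supplies start URLs (steps 1 and 2), parses responses (step 9), and yields items (step 10). A minimal sketch of how those three points look in code (quotes.toscrape.com is just a stand-in target, not from the original article):

import scrapy

class FlowSpider(scrapy.Spider):
    name = 'flow'
    # steps 1-2: handed to the engine, then queued by the scheduler
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # step 8: the engine delivers the downloaded response here
        for href in response.xpath('//a/@href').extract():
            # step 9: a new request travels back through the engine to the scheduler
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
        # step 10: a yielded item is routed by the engine to the pipelines
        yield {'url': response.url}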
Scrapy spider example
Create the project
scrapy startproject movie
Create the spider
scrapy genspider mv https://www.dytt8.net/html/gndy/china/index.html
import scrapy

class MvSpider(scrapy.Spider):
    name = "mv"
    allowed_domains = ["www.dytt8.net"]
    start_urls = ["https://www.dytt8.net/html/gndy/china/index.html"]

    def parse(self, response):
        pass
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()
Write the pipeline
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class MoviePipeline:
    # runs once, before the spider starts
    def open_spider(self, spider):
        self.fp = open('movie.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    # runs once, after the spider finishes
    def close_spider(self, spider):
        self.fp.close()
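One caveat: self.fp.write(str(item)) writes Python reprs back to back, so movie.json will not actually contain valid JSON. If proper JSON lines are wanted, one possible variant (a sketch, reusing the ItemAdapter already imported above) is:

import json
from itemadapter import ItemAdapter

class MoviePipeline:
    def open_spider(self, spider):
        self.fp = open('movie.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # asdict() converts the Item to a plain dict; write one JSON object per line
        self.fp.write(json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()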
Enable the pipeline in settings.py
BOT_NAME = "movie"

SPIDER_MODULES = ["movie.spiders"]
NEWSPIDER_MODULE = "movie.spiders"

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    "movie.pipelines.MoviePipeline": 300,
}

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
Write the spider
import scrapy
from movie.items import MovieItem


class MvSpider(scrapy.Spider):
    name = "mv"
    allowed_domains = ["www.dytt8.net"]
    start_urls = ["https://www.dytt8.net/html/gndy/china/index.html"]

    def parse(self, response):
        a_list = response.xpath('//div[@class="co_content8"]//td[2]//a[2]')
        for a in a_list:
            name = a.xpath('./text()').extract_first()
            href = a.xpath('./@href').extract_first()
            # the detail (second) page lives at this address
            url = 'https://www.dytt8.net' + href
            # request the detail page, carrying the movie name along in meta
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

    def parse_second(self, response):
        src = response.xpath('//div[@id="Zoom"]//img/@src').extract_first()
        # read back the meta value attached to the request
        name = response.meta['name']
        movie = MovieItem(src=src, name=name)
        # hand the item over to the pipeline
        yield movie
Run and check the results
Change into the project's spider directory and run scrapy crawl mv
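Since the settings above already configure FEED_EXPORT_ENCODING, Scrapy's built-in feed exports are an alternative to the hand-written pipeline; something like the following should write well-formed JSON without any pipeline code (the output file name is arbitrary):

scrapy crawl mv -O movies.json

The -O flag (capital O) overwrites the output file and has existed since Scrapy 2.0; older releases only support -o, which appends.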