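Getting Started with Scrapy
- 1. Scrapy Overview
- 2. Setting Up a Scrapy Environment in PyCharm
- 3. The Four Steps of Using Scrapy
- 4. A Scrapy Starter Example
- 4.1. Define the Target
- 4.2. Write the Spider
- 4.3. Store the Data
- 4.4. Run the Spider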
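1. Scrapy Overview
Scrapy is an application framework written in Python for crawling websites and extracting structured data. It is mainly used for data mining, information processing, data storage, and automated testing. With Scrapy, a working crawler takes only a small amount of code.
The five core components of the Scrapy architecture:
- Scrapy Engine: the core of the framework. It handles communication and data passing among the Spider, Item Pipeline, Downloader, and Scheduler.
- Scheduler: a priority queue of page URLs. It accepts the requests sent by the engine, orders them, and hands them back to the engine when the engine asks for them.
- Downloader: downloads every Request sent by the engine and returns the resulting Responses to the engine, which passes them on to the Spider.
- Spider: the user-written crawler that extracts information (Items) from specific pages. It processes all Responses, extracts data from them, and submits the URLs that need following to the engine, which sends them back to the Scheduler.
- Item Pipeline: processes the Items produced by the Spider and performs post-processing (detailed analysis, filtering, persistent storage, and so on).
Other components:
- Downloader Middlewares: components for customizing and extending download behavior.
- Spider Middlewares: components for customizing and extending the communication between the engine and the Spider.
Official documentation: https://docs.scrapy.org
Tutorial: https://doc.scrapy.org/en/latest/intro/tutorial.html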
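The data flow between the five components can be illustrated with a toy loop in plain Python. This is only a sketch under stated assumptions: FAKE_SITE, download, parse, pipeline, and run_engine are hypothetical names, the "website" is an in-memory dict, and the real Scrapy engine is asynchronous and uses a priority queue rather than this simple FIFO queue.

```python
from collections import deque

# Hypothetical in-memory "website": each page lists item names and an optional next page.
FAKE_SITE = {
    "page1": {"names": ["phone a", "phone b"], "next": "page2"},
    "page2": {"names": ["phone c"], "next": None},
}


def download(url):
    """Downloader: return the 'response' for a URL."""
    return FAKE_SITE[url]


def parse(response):
    """Spider: yield items (dicts) and follow-up URLs (strings)."""
    for name in response["names"]:
        yield {"name": name}
    if response["next"]:
        yield response["next"]


def pipeline(item):
    """Item Pipeline: post-process one item before it is stored."""
    return {"name": item["name"].title()}


def run_engine(start_urls):
    """Engine: move data between scheduler, downloader, spider, and pipeline."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending URLs
    seen = set(start_urls)         # avoid scheduling the same URL twice
    stored = []
    while scheduler:
        url = scheduler.popleft()          # engine asks the scheduler for work
        response = download(url)           # engine hands the request to the downloader
        for result in parse(response):     # engine hands the response to the spider
            if isinstance(result, str):    # a follow-up URL goes back to the scheduler
                if result not in seen:
                    seen.add(result)
                    scheduler.append(result)
            else:                          # an item goes to the pipeline
                stored.append(pipeline(result))
    return stored


print(run_engine(["page1"]))
```

The point of the sketch is the routing: the spider never calls the downloader or the pipeline directly, everything passes through the engine loop.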
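2. Setting Up a Scrapy Environment in PyCharm
1) Create a new crawler project named ScrapyDemo
2) Install the required modules in the Terminal
Scrapy is built on Twisted, an asynchronous networking framework, which is mainly what speeds up the crawler's downloads. Twisted is normally installed automatically as a dependency of Scrapy.
pip install scrapy
pip install twisted
If the installation fails with:
ERROR: Failed building wheel for twisted
error: Microsoft Visual C++ 14.0 or greater is required
download the matching whl file and install it instead:
Python extension package whl downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs/#
Press Ctrl+F to find the whl file you need, then download the build that matches your Python version and platform.
Install:
pip install <absolute path to the whl file>
For example:
pip install F:\PyWhl\Twisted-20.3.0-cp38-cp38m-win_amd64.whl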
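A wheel only installs if the tags in its filename match the interpreter. The following is a minimal sketch of reading those tags and comparing the Python tag with the running interpreter. The helper names (parse_wheel_tags, current_python_tag, python_tag_matches) are hypothetical, the check assumes CPython, and pip's real compatibility check is more thorough (it also considers the ABI and platform tags).

```python
import sys


def parse_wheel_tags(filename):
    """Return (python_tag, abi_tag, platform_tag) from a wheel filename.

    Wheel filenames follow name-version[-build]-python-abi-platform.whl,
    so the last three dash-separated parts are the tags.
    """
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel filename: " + filename)
    return tuple(parts[-3:])


def current_python_tag():
    """Tag of the running interpreter, assuming CPython (for example cp38)."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


def python_tag_matches(filename):
    """True if the wheel's Python tag names the running interpreter."""
    python_tag, _abi_tag, _platform_tag = parse_wheel_tags(filename)
    return current_python_tag() in python_tag.split(".")


wheel = "Twisted-20.3.0-cp38-cp38m-win_amd64.whl"
print(parse_wheel_tags(wheel), "interpreter:", current_python_tag())
print("python tag matches:", python_tag_matches(wheel))
```

If the Python tag does not match, pip refuses the file with a "not a supported wheel on this platform" message, so it is worth checking before downloading.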
3) Create the Scrapy project ScrapyDemo in the Terminal
scrapy startproject ScrapyDemo
This generates the project directory structure.
4) Create the core spider file SpiderDemo.py in the spiders folder
Final project structure and notes:
ScrapyDemo/                  crawler project
├── ScrapyDemo/              project package
│   ├── spiders/             spider files
│   │   ├── __init__.py
│   │   └── SpiderDemo.py    custom core spider file
│   ├── __init__.py
│   ├── items.py             target data (item definitions)
│   ├── middlewares.py       middlewares, proxies
│   ├── pipelines.py         pipelines for processing the scraped data
│   └── settings.py          crawler settings
└── scrapy.cfg               project configuration file
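3. The Four Steps of Using Scrapy
1) Define the target
Decide which website to crawl
Decide which item fields to scrape: items.py
Definition: field_name = scrapy.Field()
2) Write the spider
Implement the core spider file: spiders/SpiderDemo.py
3) Store the data
Design pipelines to store the scraped content: settings.py, pipelines.py
4) Run the spider
Option 1: run in the Terminal (from cmd, first change to the project root directory)
scrapy crawl dangdang (the spider name)
Switching location in cmd:
Change drive: F:
Change directory: cd A/B/...
Option 2: run a file from PyCharm
Create a run file run.py in the crawler project directory, then right-click it and choose Run.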
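Because the scrapy crawl command has to be started from inside the project, it can help to check where scrapy.cfg lives before running. The following is a minimal sketch; find_project_root and demo are hypothetical helpers (not part of Scrapy), and the demo builds a throwaway directory tree rather than touching a real project.

```python
import os
import tempfile


def find_project_root(start):
    """Walk upward from `start` until a directory containing scrapy.cfg is found.

    Returns the directory path, or None if the filesystem root is reached first.
    """
    current = os.path.abspath(start)
    while True:
        if os.path.isfile(os.path.join(current, "scrapy.cfg")):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent


def demo():
    """Build a temporary ScrapyDemo-like tree and locate its root from a subfolder."""
    with tempfile.TemporaryDirectory() as tmp:
        spiders = os.path.join(tmp, "ScrapyDemo", "spiders")
        os.makedirs(spiders)
        with open(os.path.join(tmp, "scrapy.cfg"), "w", encoding="utf-8") as cfg:
            cfg.write("[settings]\ndefault = ScrapyDemo.settings\n")
        return find_project_root(spiders) == os.path.abspath(tmp)


print(demo())
```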
4. A Scrapy Starter Example
4.1. Define the Target
1) Scrape mobile phone listings from Dangdang: https://category.dangdang.com/cid4004279.html
2) Define the item fields to scrape: items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


# 1) Define the target
# 1.2) Define the item fields to scrape
class ScrapyDemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Product name
    name = scrapy.Field()
    # Price
    price = scrapy.Field()
4.2. Write the Spider
SpiderDemo.py
# Starter example

# 1) Define the target
# 1.1) Scrape mobile phone listings from Dangdang: https://category.dangdang.com/cid4004279.html

# 2) Write the spider
import scrapy
from scrapy.http import Response

from ..items import ScrapyDemoItem


class SpiderDemo(scrapy.Spider):
    # Spider name, used when running the spider
    name = "dangdang"
    # Domains the spider is allowed to visit
    allowed_domains = ['category.dangdang.com']
    # Start URLs: the first pages requested
    start_urls = ['https://category.dangdang.com/cid4004279.html']

    # Pagination analysis
    # Page 1: https://category.dangdang.com/cid4004279.html
    # Page 2: https://category.dangdang.com/pg2-cid4004279.html
    # Page 3: https://category.dangdang.com/pg3-cid4004279.html
    # ......
    page = 1

    # Handle each response
    def parse(self, response: Response):
        li_list = response.xpath('//ul[@id="component_47"]/li')
        for li in li_list:
            # Product name
            name = li.xpath('.//img/@alt').extract_first()
            print(name)
            # Product price
            price = li.xpath('.//p[@class="price"]/span[1]/text()').extract_first()
            print(price)
            # Each item is handed to the pipelines as soon as it is built
            demo = ScrapyDemoItem(name=name, price=price)
            # yield passes the item on to the pipelines; parse() resumes here afterwards
            yield demo

        # Every page is parsed the same way, so the request for the next page reuses parse() as its callback
        # page starts at 1, so this condition requests pages 2 through 11
        if self.page <= 10:
            self.page += 1
            url = f"https://category.dangdang.com/pg{self.page}-cid4004279.html"
            # scrapy.Request is Scrapy's request object
            # yield pauses parse() and hands the request back to the engine
            yield scrapy.Request(url=url, callback=self.parse)
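The pagination pattern from the analysis above (page 1 has no pg prefix, later pages use pgN-) can be captured in a small helper. This is a sketch only: page_url and page_urls are hypothetical names, and the URL pattern is the one observed in the pagination analysis, which the site may change.

```python
BASE = "https://category.dangdang.com"
CATEGORY = "cid4004279"


def page_url(page):
    """Build the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        # The first page has no pgN- prefix
        return f"{BASE}/{CATEGORY}.html"
    return f"{BASE}/pg{page}-{CATEGORY}.html"


def page_urls(last_page):
    """All listing URLs from page 1 up to and including last_page."""
    return [page_url(page) for page in range(1, last_page + 1)]


for url in page_urls(3):
    print(url)
```

Generating the URLs up front is an alternative to incrementing a counter inside parse(); either approach produces the same addresses.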
Supplement: attributes and methods of the Response object
'''
1) Get the response as a string
response.text
2) Get the response as raw bytes
response.body
3) Parse the response content
response.xpath()
'''
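To see what the XPath expressions in the spider select, the same queries can be tried on a small snippet with the standard library. This is a rough sketch with assumptions: SAMPLE is made-up, well-formed markup (not the real Dangdang page), and xml.etree.ElementTree supports only a subset of XPath, so attribute and text values are read with .get() and .text instead of /@alt and /text() as in Scrapy's response.xpath().

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the structure the spider expects.
SAMPLE = """
<html><body>
<ul id="component_47">
  <li><a><img alt="Phone A"/></a><p class="price"><span>100.00</span><span>120.00</span></p></li>
  <li><a><img alt="Phone B"/></a><p class="price"><span>200.00</span></p></li>
</ul>
</body></html>
"""


def extract_products(markup):
    """Return one dict per <li>, with the image alt text and the first price span."""
    root = ET.fromstring(markup.strip())
    products = []
    for li in root.findall('.//ul[@id="component_47"]/li'):
        img = li.find('.//img')
        price = li.find('.//p[@class="price"]/span[1]')
        products.append({
            "name": img.get("alt") if img is not None else None,
            "price": price.text if price is not None else None,
        })
    return products


print(extract_products(SAMPLE))
```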
4.3. Store the Data
settings.py
# Scrapy settings for ScrapyDemo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

# 3) Store the data
# 3.1) Configure the crawler, then enable and register the pipelines

# Crawler project name
BOT_NAME = "ScrapyDemo"

SPIDER_MODULES = ["ScrapyDemo.spiders"]
NEWSPIDER_MODULE = "ScrapyDemo.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "ScrapyDemo (+http://www.yourdomain.com)"
# User-Agent setting
USER_AGENT = 'Mozilla/5.0'

# Obey robots.txt rules
# Whether to obey robots.txt (the generated project template sets True); set to False here to avoid some crawl restrictions
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay (in seconds), used to control the crawl rate
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# Whether cookies are enabled (default: True)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Accept-Language": "en",
#}
# Request headers
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# "ScrapyDemo.middlewares.ScrapydemoSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# "ScrapyDemo.middlewares.ScrapydemoDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# "ScrapyDemo.pipelines.ScrapydemoPipeline": 300,
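#}

# Item pipelines
ITEM_PIPELINES = {
    # Several pipelines may be enabled; the number is the priority (customarily 0-1000), and lower values run first
    # The module path must start with this project's package name (ScrapyDemo); a path under a package
    # that does not exist here, such as scrapy_dangdang, raises ModuleNotFoundError
    # Scrape pages
    'ScrapyDemo.pipelines.ScrapyDemoPipeline': 300,
    # Save the data
    'ScrapyDemo.pipelines.ScrapyDemoSinkPiepline': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)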
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

# Log level (default: DEBUG) and log file path
LOG_LEVEL = 'INFO'
# LOG_FILE = "spider.log"
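The priority numbers in ITEM_PIPELINES decide the order in which pipelines receive each item. The following is a minimal sketch of that ordering rule only; pipeline_order is a hypothetical helper and Scrapy's own loading logic does more than sort a dict.

```python
# Same shape as the ITEM_PIPELINES setting: import path -> priority.
ITEM_PIPELINES = {
    "ScrapyDemo.pipelines.ScrapyDemoSinkPiepline": 301,
    "ScrapyDemo.pipelines.ScrapyDemoPipeline": 300,
}


def pipeline_order(pipelines):
    """Return the pipeline paths sorted by priority, lowest number first."""
    return [path for path, _priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in pipeline_order(ITEM_PIPELINES):
    print(path)
```

Here the printing pipeline (300) sees each item before the CSV pipeline (301), regardless of the order in which the entries are written in the dict.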
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

# 3) Store the data
# 3.2) Store the data with pipelines
# A pipeline only runs if it is enabled in settings.py

import os
import csv


# Scrape pages
class ScrapyDemoPipeline:
    # Print each item handed to the pipeline
    def process_item(self, item, spider):
        print(item)
        return item


# Save the data
class ScrapyDemoSinkPiepline:
    # item is the ScrapyDemoItem yielded by the spider; it behaves like a dict
    def process_item(self, item, spider):
        path = r'C:\Users\cc\Desktop\scrapy_test.csv'
        # Write the header only when the file is new or empty, not once per item
        write_header = not os.path.exists(path) or os.path.getsize(path) == 0
        with open(path, 'a', newline='', encoding='utf-8') as csvfile:
            # Column names
            fields = ['name', 'price']
            writer = csv.DictWriter(csvfile, fieldnames=fields)
            if write_header:
                writer.writeheader()
            # Write the row
            writer.writerow(ItemAdapter(item).asdict())
        # Return the item so that any later pipeline still receives it
        return item
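Since process_item() runs once per item, the CSV header should be written only once. Below is a minimal standalone sketch of that append pattern using only the standard library; append_row and demo are hypothetical helpers, the sample rows are made up, and the demo writes to a temporary directory rather than the Desktop path used in the pipeline.

```python
import csv
import os
import tempfile

FIELDS = ["name", "price"]


def append_row(path, row):
    """Append one row, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        writer.writerow(row)


def demo():
    """Append two rows to a temporary file and read the result back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "scrapy_test.csv")
        append_row(path, {"name": "Phone A", "price": "100.00"})
        append_row(path, {"name": "Phone B", "price": "200.00"})
        with open(path, newline="", encoding="utf-8") as csvfile:
            return list(csv.reader(csvfile))


print(demo())
```

Opening the file once in the pipeline's open_spider() hook and closing it in close_spider() would avoid reopening it for every item; the per-call version is kept here because it mirrors the structure of the example.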
4.4. Run the Spider
run.py
# 4) Run the spider
from scrapy import cmdline

cmdline.execute('scrapy crawl dangdang'.split())
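With the other files left unchanged, the original run of this example reported the following errors:
ERROR: Twisted-20.3.0-cp38-cp38m-win_amd64.whl is not a supported wheel on this platform
builtins.ModuleNotFoundError: No module named 'scrapy_dangdang'
The first message comes from pip and means that the wheel's tags (cp38, cp38m, win_amd64) do not match the Python interpreter in use, so the cause is probably a Twisted version compatibility problem. It is not yet resolved and will be followed up later. The second appears when ITEM_PIPELINES in settings.py points at a package named scrapy_dangdang, which does not exist in this project; the pipeline paths need to start with ScrapyDemo.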