
news 2025/7/10 20:17:10


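Getting Started with Scrapy

- 1. Scrapy Overview
- 2. Setting Up a Scrapy Environment in PyCharm
- 3. The Four Steps of Using Scrapy
- 4. A Scrapy Starter Example
  - 4.1 Define the Target
  - 4.2 Write the Spider
  - 4.3 Store the Data
  - 4.4 Run the Spider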

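1. Scrapy Overview

Scrapy is an application framework written in Python for crawling websites and extracting structured data. It is mainly used for data mining, information processing, data archiving, and automated testing. With Scrapy, a working crawler takes only a small amount of code.

The five core components of the Scrapy architecture: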


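- Scrapy Engine: the core of the framework. It handles communication and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
- Scheduler: a priority queue of page URLs. It accepts the requests sent by the engine, orders them, and hands them back when the engine asks for the next one.
- Downloader: downloads every Request the engine sends and returns the resulting Responses to the engine, which passes them to the Spider.
- Spider: user-written crawler code that extracts information (Items) from specific pages. It processes every Response, extracts data, and submits the URLs to follow to the engine, which sends them back to the Scheduler.
- Item Pipeline: post-processes the Items the Spider extracts (detailed analysis, filtering, persistent storage, and so on).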
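The flow between these components can be sketched in plain Python. This is only a toy model, not Scrapy's implementation: `run_crawl`, `fake_download`, `fake_parse`, `upper_name`, and the in-memory `PAGES` "site" are all hypothetical names, and a simple FIFO queue stands in for the Scheduler's priority queue.

```python
from collections import deque


def run_crawl(start_urls, download, parse, pipelines):
    """Toy engine loop: scheduler -> downloader -> spider -> item pipelines."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending URLs
    seen = set(start_urls)         # URLs already scheduled, to avoid repeats
    stored = []
    while scheduler:
        url = scheduler.popleft()             # engine takes the next request
        response = download(url)              # Downloader returns a response
        for result in parse(url, response):   # Spider yields items or URLs
            if isinstance(result, str):       # follow-up URL: back to the scheduler
                if result not in seen:
                    seen.add(result)
                    scheduler.append(result)
            else:                             # item: pass through each pipeline
                for pipeline in pipelines:
                    result = pipeline(result)
                stored.append(result)
    return stored


# Hypothetical in-memory "site" used instead of real HTTP requests
PAGES = {
    "page1": {"items": ["phone A", "phone B"], "next": "page2"},
    "page2": {"items": ["phone C"], "next": None},
}


def fake_download(url):
    return PAGES[url]


def fake_parse(url, response):
    for name in response["items"]:
        yield {"name": name}
    if response["next"]:
        yield response["next"]


def upper_name(item):
    return {"name": item["name"].upper()}


items = run_crawl(["page1"], fake_download, fake_parse, [upper_name])
print(items)
```

In real Scrapy the engine drives these steps asynchronously; the sketch only shows who hands what to whom.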

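Other components:

- Downloader Middlewares: components that let you customize and extend the download behavior.
- Spider Middlewares: components that let you customize and extend the communication between the engine and the Spider.

Official documentation: https://docs.scrapy.org

Tutorial: https://doc.scrapy.org/en/latest/intro/tutorial.html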

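2. Setting Up a Scrapy Environment in PyCharm

1) Create a new crawler project named ScrapyDemo

2) Install the required modules in the Terminal

Scrapy is built on Twisted, an asynchronous networking framework that lets the crawler download pages concurrently rather than one at a time. pip installs Twisted automatically as a Scrapy dependency, so the second command is only needed if that step fails.

pip install scrapy
pip install twisted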
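To confirm that both packages ended up in the interpreter PyCharm is using, a small check with the standard library's `importlib.metadata` can help. This is a minimal sketch; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("scrapy", "twisted"):
    version = installed_version(name)
    print(name, version if version else "not installed")
```

If either line prints "not installed", the Terminal is probably attached to a different interpreter than the project.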

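If the installation fails with:

ERROR: Failed building wheel for twisted
error: Microsoft Visual C++ 14.0 or greater is required

download a matching prebuilt .whl file and install that instead.

Python extension package .whl downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs/#

Press Ctrl+F to find the .whl file you need, then download the build that matches your Python version and platform.

Install it:

pip install <absolute path to the .whl file>

For example:

pip install F:\PyWhl\Twisted-20.3.0-cp38-cp38m-win_amd64.whl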
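A wheel's filename encodes the Python version and platform it was built for, and pip rejects a wheel whose tags do not match the running interpreter. The sketch below splits the filename from the example into its tags and prints the tag of the current interpreter for comparison. `parse_wheel_filename` and `current_python_tag` are hypothetical helpers, and the comparison assumes CPython.

```python
import sys


def parse_wheel_filename(filename):
    """Split a wheel filename into (name, version, python_tag, abi_tag, platform_tag)."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    # Layout: name-version[-build]-python-abi-platform; the last three parts are tags
    name, version = parts[0], parts[1]
    python_tag, abi_tag, platform_tag = parts[-3:]
    return name, version, python_tag, abi_tag, platform_tag


def current_python_tag():
    """CPython tag of the running interpreter, for example cp38 for Python 3.8."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


info = parse_wheel_filename("Twisted-20.3.0-cp38-cp38m-win_amd64.whl")
print(info)
print("wheel built for", info[2], "| this interpreter is", current_python_tag())
```

If the two tags differ, look for a wheel built for the interpreter you actually have.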

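3) Create the Scrapy project ScrapyDemo in the Terminal

scrapy startproject ScrapyDemo

This generates the project directory structure.

4) Create the core spider file SpiderDemo.py in the spiders folder

Final project structure, with notes:

ScrapyDemo/                          project root
├── ScrapyDemo/                      project package
│   ├── spiders/                     spider files
│   │   ├── __init__.py
│   │   └── SpiderDemo.py            custom spider (core logic)
│   ├── __init__.py
│   ├── items.py                     item definitions (the data to scrape)
│   ├── middlewares.py               middlewares, proxies
│   ├── pipelines.py                 pipelines that process the scraped data
│   └── settings.py                  crawler settings
└── scrapy.cfg                       project configuration file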

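3. The Four Steps of Using Scrapy

1) Define the target

Decide which website to crawl.

Decide which item fields to scrape, in items.py.

Definition: field_name = scrapy.Field()

2) Write the spider

Implement the spider's core logic in spiders/SpiderDemo.py

3) Store the data

Design pipelines to store the scraped content: settings.py, pipelines.py

4) Run the spider

Option 1: run it from the Terminal (from cmd, first switch to the project root directory)

scrapy crawl dangdang

Here dangdang is the spider name.

Switching location in cmd:

Change drive: F:
Change directory: cd A/B/...

Option 2: run a file from PyCharm

Create a run file, run.py, in the project directory, then right-click it and choose Run.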

4. A Scrapy Starter Example

4.1 Define the Target

1) Scrape mobile phone listings from Dangdang: https://category.dangdang.com/cid4004279.html

2) Define the item fields to scrape: items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


# 1) Define the target
# 1.2) Define the item fields to scrape
class ScrapyDemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Product name
    name = scrapy.Field()
    # Price
    price = scrapy.Field()
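The same two fields can also be modeled as a standard-library dataclass, which the Scrapy documentation lists among the supported item types in recent versions (2.2 and later, as far as I know). This is an alternative sketch, not what the article's project uses, and `PhoneItem` and the sample values are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PhoneItem:  # hypothetical name, not part of the original project
    name: Optional[str] = None   # product name
    price: Optional[str] = None  # price text as scraped from the page


item = PhoneItem(name="Example phone", price="1999.00")
print(asdict(item))
```

A dataclass gives type hints and defaults, while `scrapy.Item` raises an error when code assigns a field that was never declared; either choice works for a two-field item like this one.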

4.2 Write the Spider

SpiderDemo.py

# Starter example

# 1) Define the target
# 1.1) Scrape mobile phone listings from Dangdang: https://category.dangdang.com/cid4004279.html

# 2) Write the spider
import scrapy
from scrapy.http import Response

from ..items import ScrapyDemoItem


class SpiderDemo(scrapy.Spider):
    # Spider name, used when running the spider
    name = "dangdang"
    # Domains the spider is allowed to visit
    allowed_domains = ['category.dangdang.com']
    # Start URLs: the first pages the spider requests
    start_urls = ['https://category.dangdang.com/cid4004279.html']

    # Pagination pattern
    # Page 1: https://category.dangdang.com/cid4004279.html
    # Page 2: https://category.dangdang.com/pg2-cid4004279.html
    # Page 3: https://category.dangdang.com/pg3-cid4004279.html
    # ......
    page = 1

    # Handle each response
    def parse(self, response: Response):
        li_list = response.xpath('//ul[@id="component_47"]/li')
        for li in li_list:
            # Product name
            name = li.xpath('.//img/@alt').extract_first()
            print(name)
            # Product price
            price = li.xpath('.//p[@class="price"]/span[1]/text()').extract_first()
            print(price)
            # Build one item per product and hand it to the pipelines
            demo = ScrapyDemoItem(name=name, price=price)
            # yield passes the item on to the pipelines; parse() resumes afterwards
            yield demo

        # Every page is parsed the same way, so request the next page with
        # parse() as the callback (as written, this requests pages 2 through 11)
        if self.page <= 10:
            self.page += 1
            url = f"https://category.dangdang.com/pg{self.page}-cid4004279.html"
            # scrapy.Request is Scrapy's request object; yield hands it to the engine
            yield scrapy.Request(url=url, callback=self.parse)
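The pagination pattern noted in the spider's comments can be isolated in a small helper, which makes the special case for page 1 explicit. A minimal sketch; `page_url`, `BASE`, and `CATEGORY` are hypothetical names, and the URL shapes are the three listed in the comments.

```python
BASE = "https://category.dangdang.com"
CATEGORY = "cid4004279"


def page_url(page):
    """Build the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        # The first page has no pgN- prefix
        return f"{BASE}/{CATEGORY}.html"
    return f"{BASE}/pg{page}-{CATEGORY}.html"


urls = [page_url(n) for n in range(1, 4)]
for url in urls:
    print(url)
```

Generating the URLs from a page number also makes the upper page limit easy to see and change.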

Supplement: attributes and methods of the Response object

'''
1) Get the response body as a string
response.text
2) Get the response body as bytes
response.body
3) Parse the response content
response.xpath()
'''
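The extraction logic behind `response.xpath()` can be tried offline on a small fragment. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath: it has no `/@alt` or `text()` steps, so the attribute and text are read with `.get()` and `.text` instead. The `SAMPLE` markup is a hypothetical, well-formed stand-in for the listing page; real pages usually are not valid XML and need Scrapy's own selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the product list the spider targets
SAMPLE = """
<div>
  <ul id="component_47">
    <li><a><img alt="Phone A" src="a.jpg"/></a>
        <p class="price"><span>100.00</span><span>120.00</span></p></li>
    <li><a><img alt="Phone B" src="b.jpg"/></a>
        <p class="price"><span>200.00</span></p></li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE.strip())
products = []
for li in root.findall(".//ul[@id='component_47']/li"):
    img = li.find(".//img")                          # first image in the entry
    price = li.find(".//p[@class='price']/span[1]")  # first span in the price line
    products.append({
        "name": img.get("alt") if img is not None else None,
        "price": price.text if price is not None else None,
    })

print(products)
```

The `is not None` checks mirror what `extract_first()` does in the spider: a missing node yields `None` instead of raising an error.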

4.3 Store the Data

settings.py

# Scrapy settings for ScrapyDemo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

# 3) Store the data
# 3.1) Configure the crawler and enable the pipelines

# Project (bot) name
BOT_NAME = "ScrapyDemo"

SPIDER_MODULES = ["ScrapyDemo.spiders"]
NEWSPIDER_MODULE = "ScrapyDemo.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "ScrapyDemo (+http://www.yourdomain.com)"
# User-Agent setting
USER_AGENT = 'Mozilla/5.0'

# Obey robots.txt rules
# Whether to obey robots.txt (True in the generated project template);
# the author sets it to False to avoid some crawl restrictions
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay in seconds, used to control the crawl rate
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# Whether cookies are handled (enabled by default; uncomment to disable)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}
# Request headers
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "ScrapyDemo.middlewares.ScrapydemoSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "ScrapyDemo.middlewares.ScrapydemoDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "ScrapyDemo.pipelines.ScrapydemoPipeline": 300,
#}

# Project pipelines
ITEM_PIPELINES = {
    # Several pipelines may be enabled; the number sets the order
    # (customary range 0-1000) and lower values run first.
    # Each path must start with the project package name, ScrapyDemo.
    # Print scraped items
    'ScrapyDemo.pipelines.ScrapyDemoPipeline': 300,
    # Save the data
    'ScrapyDemo.pipelines.ScrapyDemoSinkPiepline': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

# Log level (default DEBUG) and log file path
LOG_LEVEL = 'INFO'
# LOG_FILE = "spider.log"
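The numbers in ITEM_PIPELINES decide the order in which each item visits the pipelines, with lower values running first. The ordering rule can be sketched without Scrapy; this models only the sorting, not Scrapy's loading code, and the paths and the `pipeline_order` helper are hypothetical.

```python
# Hypothetical setting, deliberately written out of order
ITEM_PIPELINES = {
    "myproject.pipelines.SavePipeline": 301,
    "myproject.pipelines.PrintPipeline": 300,
}


def pipeline_order(setting):
    """Return the pipeline paths sorted by their order value, lowest first."""
    return [path for path, _ in sorted(setting.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
print(order)
```

Each pipeline receives whatever the previous one returned from `process_item`, so a stage that cleans data should carry a lower number than the stage that saves it.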

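pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

# 3) Store the data
# 3.2) Store data through pipelines
# A pipeline only runs if it is enabled in settings.py

import os
import csv


# Print scraped items
class ScrapyDemoPipeline:
    # Print each item handed to the pipeline
    def process_item(self, item, spider):
        print(item)
        return item


# Save the data
class ScrapyDemoSinkPiepline:
    # item is the ScrapyDemoItem yielded by the spider; it behaves like a dict
    def process_item(self, item, spider):
        path = r'C:\Users\cc\Desktop\scrapy_test.csv'
        # Write the header only when the file is new or empty,
        # otherwise it would be repeated before every row
        need_header = not os.path.exists(path) or os.path.getsize(path) == 0
        with open(path, 'a', newline='', encoding='utf-8') as csvfile:
            # Column headers
            fields = ['name', 'price']
            writer = csv.DictWriter(csvfile, fieldnames=fields)
            if need_header:
                writer.writeheader()
            # Write the row
            writer.writerow(ItemAdapter(item).asdict())
        # Return the item so that any later pipeline still receives it
        return item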
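The CSV pipeline above opens the file again for every item. Scrapy also calls the optional `open_spider` and `close_spider` methods on a pipeline when the spider starts and finishes, so an alternative is to open the file once and keep the writer around. The sketch below assumes that approach; `CsvExportPipeline` and its default path are hypothetical, and it overwrites the file on each run instead of appending.

```python
import csv


class CsvExportPipeline:  # hypothetical name, not part of the original project
    def __init__(self, path="scrapy_test.csv"):
        self.path = path
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the file and write the header
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(self.file, fieldnames=["name", "price"])
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Called for every item: write one row and pass the item along
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle
        self.file.close()
```

The class has no Scrapy imports, so it can be exercised by calling the three methods directly with plain dicts before enabling it in ITEM_PIPELINES.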

4.4 Run the Spider

run.py

# 4) Run the spider
from scrapy import cmdline

cmdline.execute('scrapy crawl dangdang'.split())

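Leave the other files unchanged. The author's run of this example failed with:

ERROR: Twisted-20.3.0-cp38-cp38m-win_amd64.whl is not a supported wheel on this platform
builtins.ModuleNotFoundError: No module named 'scrapy_dangdang'

The author attributed this to a Twisted version compatibility problem and left it unresolved. The two messages most likely have separate causes. The first comes from pip: a wheel's tags (here cp38 and win_amd64) must match the installed Python version and platform. The second is raised when ITEM_PIPELINES in settings.py refers to a package named scrapy_dangdang; the pipeline paths must start with this project's package name, ScrapyDemo.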
