


Getting Started with Scrapy

    • 1. Scrapy Overview
    • 2. Setting Up a Scrapy Environment in PyCharm
    • 3. The Four Steps of Using Scrapy
    • 4. A First Scrapy Example
      • 4.1 Define the Target
      • 4.2 Build the Spider
      • 4.3 Store the Data
      • 4.4 Run the Crawler

1. Scrapy Overview

Scrapy is an application framework written in Python for crawling websites and extracting structured data. It is widely used for data mining, information processing, data storage, and automated testing. With Scrapy, a working crawler takes only a small amount of code and can start scraping quickly.

The five main components of the Scrapy architecture:

[Figure: Scrapy architecture diagram]

  • Scrapy Engine: the core of the framework; it coordinates the communication and data flow between the Spider, Item Pipeline, Downloader, and Scheduler
  • Scheduler: a priority queue of page URLs; it receives requests from the engine, queues them in a defined order, and hands them back when the engine asks for them
  • Downloader: downloads all Requests sent by the engine and returns the resulting Responses to the engine, which passes them to the Spider for processing
  • Spider: user-defined crawling logic; it processes all Responses, extracts the data (Items) from specific pages, and submits follow-up URLs to the engine, which feeds them back into the Scheduler
  • Item Pipeline: post-processes the Items produced by the Spider (detailed analysis, filtering, persistent storage, and so on)

Other components:

  • Downloader Middlewares: components for customizing and extending the download step
  • Spider Middlewares: components for customizing and extending the communication between the engine and the Spider
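
In day-to-day use you only write the Spider and the Item Pipeline; the engine, scheduler, and downloader are driven by the framework. A minimal sketch of that division of labor (the spider and pipeline names here are hypothetical; the target is Scrapy's own tutorial site):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # the engine, scheduler, and downloader fetch these URLs for us
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # each yielded dict/Item is routed by the engine to the item pipelines
        for text in response.css("span.text::text").getall():
            yield {"quote": text}

class PrintPipeline:
    # runs only if registered under ITEM_PIPELINES in settings.py
    def process_item(self, item, spider):
        print(item)
        return item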

Official documentation: https://docs.scrapy.org

Tutorial: https://doc.scrapy.org/en/latest/intro/tutorial.html

2. Setting Up a Scrapy Environment in PyCharm

1) Create a new crawler project named ScrapyDemo

2) Install the required modules in the Terminal

Scrapy is built on Twisted, an event-driven asynchronous networking framework, which is what gives Scrapy its fast, non-blocking downloads (pip pulls Twisted in automatically as a Scrapy dependency, but it can also be installed explicitly):

pip install scrapy
pip install twisted
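
To confirm that the installation succeeded, Scrapy ships a version command:

scrapy version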

If the installation fails with:

ERROR: Failed building wheel for twisted
error: Microsoft Visual C++ 14.0 or greater is required

you need to download and install a matching wheel (.whl) file instead:

Unofficial Windows wheels for Python packages: https://www.lfd.uci.edu/~gohlke/pythonlibs/#

Use Ctrl+F to find the package you need and download the wheel that matches your Python version and platform.

Then install it:

pip install <absolute path to the .whl file>

For example:

pip install F:\PyWhl\Twisted-20.3.0-cp38-cp38m-win_amd64.whl

3) Create the crawler project ScrapyDemo in the Terminal

scrapy startproject ScrapyDemo

This generates the project directory structure.

4) Create the core spider file SpiderDemo.py under the spiders folder

The final project structure, with notes:

ScrapyDemo/                          crawler project
├── ScrapyDemo/                      project package
│   ├── spiders/                     spider files
│   │   ├── __init__.py
│   │   └── SpiderDemo.py            custom core spider file
│   ├── __init__.py
│   ├── items.py                     definitions of the data to scrape
│   ├── middlewares.py               middlewares, proxies
│   ├── pipelines.py                 pipelines that process the scraped data
│   └── settings.py                  crawler configuration
└── scrapy.cfg                       project configuration file

3. The Four Steps of Using Scrapy

1) Define the target

Decide which website to crawl.

Declare the entity (fields) to scrape in items.py.

Definition pattern: field_name = scrapy.Field()

2) Build the spider

Write the core spider logic in spiders/SpiderDemo.py.

3) Store the data

Design pipelines to store the scraped content: settings.py and pipelines.py.

4) Run the crawler

Option 1: run from the Terminal (in cmd, switch to the project root first)

scrapy crawl dangdang    # "dangdang" is the spider's name

Switching directories in cmd:

Change drive: F:
Change directory: cd A/B/...

Option 2: run from PyCharm

Create a run file run.py in the project directory, then right-click it and run (see section 4.4).

4. A First Scrapy Example

4.1 Define the Target

1) Scrape phone listings from Dangdang: https://category.dangdang.com/cid4004279.html

2) Declare the entity fields to scrape in items.py:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

# 1) Define the target
# 1.2) Declare the item fields to scrape
class ScrapyDemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # product name
    name = scrapy.Field()
    # product price
    price = scrapy.Field()
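
A ScrapyDemoItem is dict-like, which is why a pipeline can later write it with csv.DictWriter; a quick illustration (the values are hypothetical):

from ScrapyDemo.items import ScrapyDemoItem

item = ScrapyDemoItem(name='Some Phone', price='¥1999')
print(item['name'])    # 'Some Phone'
print(dict(item))      # {'name': 'Some Phone', 'price': '¥1999'}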

4.2 Build the Spider

SpiderDemo.py

# Tutorial example

# 1) Define the target
# 1.1) Scrape phone listings from Dangdang: https://category.dangdang.com/cid4004279.html

# 2) Build the spider
import scrapy
from scrapy.http import Response
from ..items import ScrapyDemoItem

class SpiderDemo(scrapy.Spider):
    # spider name, used when running the crawler
    name = "dangdang"
    # domains the spider is allowed to crawl
    allowed_domains = ['category.dangdang.com']
    # start URL, requested on the first visit
    start_urls = ['https://category.dangdang.com/cid4004279.html']

    # Pagination pattern:
    # page 1: https://category.dangdang.com/cid4004279.html
    # page 2: https://category.dangdang.com/pg2-cid4004279.html
    # page 3: https://category.dangdang.com/pg3-cid4004279.html
    # ...
    page = 1

    # handle each response
    def parse(self, response: Response):
        li_list = response.xpath('//ul[@id="component_47"]/li')
        for li in li_list:
            # product name
            name = li.xpath('.//img/@alt').extract_first()
            print(name)
            # product price
            price = li.xpath('.//p[@class="price"]/span[1]/text()').extract_first()
            print(price)
            # wrap the data in an item and hand it to the pipelines
            demo = ScrapyDemoItem(name=name, price=price)
            # yield passes the item to the pipelines, then control returns here
            yield demo
        # every page is parsed the same way, so schedule the next page with parse() as the callback
        if self.page <= 10:
            self.page += 1
            url = f"https://category.dangdang.com/pg{self.page}-cid4004279.html"
            # scrapy.Request schedules another request through the engine
            yield scrapy.Request(url=url, callback=self.parse)

Supplement: useful attributes and methods of the Response object

'''
1) get the response as a decoded string
response.text
2) get the raw response bytes
response.body
3) parse the response content
response.xpath()
'''
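
These are easy to try inside a parse() callback; a minimal sketch (the XPath here is illustrative, not tied to the Dangdang page):

def parse(self, response):
    html_str = response.text                         # decoded body as str
    raw_bytes = response.body                        # raw body as bytes
    title = response.xpath('//title/text()').get()   # extract with an XPath selector
    print(len(html_str), len(raw_bytes), title)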

4.3 Store the Data

settings.py

# Scrapy settings for ScrapyDemo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

# 3) Store the data
# 3.1) Configure the crawler and register the pipelines

# project name
BOT_NAME = "ScrapyDemo"

SPIDER_MODULES = ["ScrapyDemo.spiders"]
NEWSPIDER_MODULE = "ScrapyDemo.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "ScrapyDemo (+http://www.yourdomain.com)"
# User-Agent configuration
USER_AGENT = 'Mozilla/5.0'

# Obey robots.txt rules
# whether to obey robots.txt (default: True); set to False to avoid some crawl restrictions
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# maximum concurrency
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# download delay in seconds, used to throttle the crawl rate
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# whether cookies are enabled (default: True)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}
# request headers
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "ScrapyDemo.middlewares.ScrapydemoSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "ScrapyDemo.middlewares.ScrapydemoDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "ScrapyDemo.pipelines.ScrapydemoPipeline": 300,
#}
# item pipelines
ITEM_PIPELINES = {
    # multiple pipelines may be registered; the number is the order
    # (customarily 0-1000), and lower values run first
    # prints each scraped item
    'ScrapyDemo.pipelines.ScrapyDemoPipeline': 300,
    # saves the data
    'ScrapyDemo.pipelines.ScrapyDemoSinkPipeline': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

# log level (default: DEBUG) and, optionally, a log file path
LOG_LEVEL = 'INFO'
# LOG_FILE = "spider.log"

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

# 3) Store the data
# 3.2) Store the data through pipelines
# A pipeline only runs if it is enabled in settings.py

import csv

# prints each scraped item
class ScrapyDemoPipeline:
    # every item yielded by the spider passes through here
    def process_item(self, item, spider):
        print(item)
        return item

# saves the data
class ScrapyDemoSinkPipeline:
    # item is the ScrapyDemoItem object yielded by the spider; it is dict-like
    def process_item(self, item, spider):
        with open(r'C:\Users\cc\Desktop\scrapy_test.csv', 'a', newline='', encoding='utf-8') as csvfile:
            # header fields
            fields = ['name', 'price']
            writer = csv.DictWriter(csvfile, fieldnames=fields)
            # write the header only once, while the file is still empty
            if csvfile.tell() == 0:
                writer.writeheader()
            # write one row per item
            writer.writerow(dict(item))
        return item
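
Reopening the file for every item works but is wasteful; a sketch of a more efficient variant using the pipeline's open_spider/close_spider hooks (the output filename is illustrative):

import csv

class ScrapyDemoSinkPipeline:
    def open_spider(self, spider):
        # called once when the crawl starts
        self.file = open('scrapy_test.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.DictWriter(self.file, fieldnames=['name', 'price'])
        self.writer.writeheader()

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        # called once when the crawl ends
        self.file.close()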

4.4 Run the Crawler

run.py

# 4) Run the crawler

from scrapy import cmdline

cmdline.execute('scrapy crawl dangdang'.split())
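
An equivalent way to start the crawl from Python is Scrapy's CrawlerProcess API, which avoids re-invoking the command line:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load settings.py and run the spider registered as "dangdang"
process = CrawlerProcess(get_project_settings())
process.crawl('dangdang')
process.start()  # blocks until the crawl finishes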

With every other file unchanged, the example as originally published failed at run time with:

ERROR: Twisted-20.3.0-cp38-cp38m-win_amd64.whl is not a supported wheel on this platform
builtins.ModuleNotFoundError: No module named 'scrapy_dangdang'

These are two separate problems, not a Twisted version conflict. The wheel error means the file's tags do not match the local interpreter: CPython 3.8 uses the cp38-cp38 ABI tag (the trailing "m" was dropped in Python 3.8), so pip rejects a cp38-cp38m wheel; download the wheel whose filename matches your exact Python version and platform. The ModuleNotFoundError came from ITEM_PIPELINES referencing a package named scrapy_dangdang that does not exist in this project; the pipeline paths must use the project package name ScrapyDemo, as in the settings.py shown above.

