

news 2025/7/15 1:40:58


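Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also fetch data returned by APIs (for example, Amazon Associates Web Services) or serve as a general-purpose web crawler. Scrapy is versatile and can be used for data mining, monitoring, and automated testing.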

Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows.

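Scrapy consists mainly of the following components:

- Engine (Scrapy Engine)
  Handles the data flow of the whole system and triggers events (the core of the framework).
- Scheduler
  Accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL to crawl next and removes duplicate URLs.
- Downloader
  Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted's efficient asynchronous model).
- Spiders
  Spiders do the main work: they extract the information you need, the so-called items (Item), from specific pages. They can also extract links so that Scrapy goes on to crawl the next page.
- Item Pipeline
  Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and removing unneeded information. After a spider has parsed a page, the items are sent to the item pipeline and processed by several components in a specific order.
- Downloader Middlewares
  A layer between the Scrapy engine and the downloader that processes the requests and responses passing between them.
- Spider Middlewares
  A layer between the Scrapy engine and the spiders that processes the spiders' response input and request output.
- Scheduler Middlewares
  Middleware between the Scrapy engine and the scheduler that processes the requests and responses sent from the engine to the scheduler.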

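The Scrapy run flow is roughly as follows:

1. The engine takes a link (URL) from the scheduler for the next crawl.
2. The engine wraps the URL in a request (Request) and passes it to the downloader.
3. The downloader fetches the resource and wraps it in a response (Response).
4. The spider parses the Response.
5. If an item (Item) is parsed out, it is handed to the item pipeline for further processing.
6. If a link (URL) is parsed out, the URL is handed to the scheduler to wait for crawling.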
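The flow described here can be sketched in plain Python. This is a toy model under stated assumptions, not Scrapy's implementation: the `Request` dataclass, the `Scheduler` class, the `FAKE_WEB` dictionary, and the `download`, `parse`, and `crawl` functions are all hypothetical stand-ins, and nothing touches the network.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    priority: int
    url: str = field(compare=False)


class Scheduler:
    """Priority queue of requests that drops duplicate URLs."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def enqueue(self, request):
        if request.url in self._seen:
            return False
        self._seen.add(request.url)
        heapq.heappush(self._heap, request)
        return True

    def next_request(self):
        return heapq.heappop(self._heap) if self._heap else None


# Hypothetical in-memory "web": URL -> (page title, outgoing links)
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b", "http://example.com/"]),
    "http://example.com/b": ("page b", []),
}


def download(request):
    """Downloader stand-in: builds a response dict instead of doing network I/O."""
    title, links = FAKE_WEB[request.url]
    return {"url": request.url, "title": title, "links": links}


def parse(response):
    """Spider stand-in: yields one item, then a request for every link."""
    yield {"title": response["title"]}
    for link in response["links"]:
        yield Request(priority=0, url=link)


def crawl(start_url):
    """Engine stand-in: moves requests, responses, and items between the parts."""
    scheduler = Scheduler()
    scheduler.enqueue(Request(priority=0, url=start_url))
    items = []  # plays the role of the item pipeline
    while (request := scheduler.next_request()) is not None:
        response = download(request)
        for result in parse(response):
            if isinstance(result, Request):
                scheduler.enqueue(result)  # links go back to the scheduler
            else:
                items.append(result)  # items go to the pipeline
    return items


items = crawl("http://example.com/")
```

Each of the three pages is visited once even though they link to each other, because the scheduler refuses URLs it has already seen.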

I. Installation

    1. Install wheel
       pip install wheel
    2. Install lxml
       https://pypi.python.org/pypi/lxml/4.1.0
    3. Install pyopenssl
       https://pypi.python.org/pypi/pyOpenSSL/17.5.0
    4. Install Twisted
       https://www.lfd.uci.edu/~gohlke/pythonlibs/
    5. Install pywin32
       https://sourceforge.net/projects/pywin32/files/
    6. Install scrapy
       pip install scrapy

Note: on Windows, Scrapy depends on pywin32. Download and install the build that matches your system (32-bit or 64-bit) from https://sourceforge.net/projects/pywin32/
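After installing, it can help to confirm which of these packages Python can actually see. The sketch below uses the standard-library `importlib.metadata` module (Python 3.8+). The helper name `installed_version` is an assumption made for this illustration, and the distribution names in the loop are the usual PyPI names, which may differ from what your environment reports.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("wheel", "lxml", "pyOpenSSL", "Twisted", "pywin32", "Scrapy"):
        print(name, installed_version(name) or "not installed")
```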

II. Spider examples

Beginner: the 100 newest shows on Meijutt (http://www.meijutt.com/new100.html)

1. Create a project

scrapy startproject movie

2. Create the spider

cd movie
scrapy genspider meiju meijutt.com

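3. The directories and files are created automatically

4. File descriptions:

- scrapy.cfg: the project's configuration. It mainly gives the Scrapy command-line tool its basic settings. (The settings that actually govern the crawler live in settings.py.)
- items.py: defines the data storage templates used to structure data, similar to a Django Model.
- pipelines: data-processing behavior, typically persisting the structured data.
- settings.py: the configuration file, e.g. recursion depth, concurrency, download delay.
- spiders: the spider directory, where you create files and write the crawling rules.

Note: spider files are usually named after the website's domain.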
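As an illustration of the three settings.py options mentioned in the file descriptions (recursion depth, concurrency, download delay), a fragment might look like the following. The setting names are Scrapy's own; the values are assumptions chosen for the example, not recommendations.

```python
# settings.py (fragment); the values are illustrative assumptions
DEPTH_LIMIT = 3            # maximum link depth to follow from the start URLs (0 means no limit)
CONCURRENT_REQUESTS = 8    # maximum number of requests the downloader performs concurrently
DOWNLOAD_DELAY = 1.5       # seconds to wait between consecutive requests to the same site
```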

5. Define the data storage template

items.py

    import scrapy


    class MovieItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        name = scrapy.Field()

6. Write the spider

meiju.py

    # -*- coding: utf-8 -*-
    import scrapy
    from movie.items import MovieItem


    class MeijuSpider(scrapy.Spider):
        name = "meiju"
        allowed_domains = ["meijutt.com"]
        start_urls = ['http://www.meijutt.com/new100.html']

        def parse(self, response):
            # Each <li> under the "top-list" <ul> is one show
            movies = response.xpath('//ul[@class="top-list  fn-clear"]/li')
            for each_movie in movies:
                item = MovieItem()
                # The show name is stored in the link's title attribute
                item['name'] = each_movie.xpath('./h5/a/@title').extract()[0]
                yield item
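To see what the two XPath expressions in the spider select, the same kind of query can be tried offline. The sketch below is not Scrapy: it uses the standard-library `xml.etree.ElementTree`, which supports only a small XPath subset, and the `SAMPLE` markup is a hypothetical, simplified page invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on the structure the spider expects
SAMPLE = """
<html><body>
  <ul class="top-list  fn-clear">
    <li><h5><a href="/a.html" title="Show A">Show A</a></h5></li>
    <li><h5><a href="/b.html" title="Show B">Show B</a></h5></li>
  </ul>
</body></html>
"""

root = ET.fromstring(SAMPLE)
names = []
# Outer query: one <li> per show; inner query: the link inside its <h5>
for li in root.findall(".//ul[@class='top-list  fn-clear']/li"):
    link = li.find("./h5/a")
    if link is not None and link.get("title"):
        names.append(link.get("title"))

print(names)
```

Checking that the link exists before reading its title avoids the `IndexError` that `extract()[0]` raises when an `<li>` has no matching link.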

7. Edit the configuration file

Add the following to settings.py:

ITEM_PIPELINES = {'movie.pipelines.MoviePipeline':100}
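The number attached to each pipeline decides the order in which items pass through the pipelines: lower values run first. The sketch below imitates that ordering in plain Python. `StripWhitespace`, `AddLength`, and `run_pipelines` are hypothetical names made up for this illustration, and the dictionary maps classes directly, whereas a real settings.py uses dotted import paths.

```python
class StripWhitespace:
    def process_item(self, item, spider):
        item["name"] = item["name"].strip()
        return item


class AddLength:
    def process_item(self, item, spider):
        item["length"] = len(item["name"])
        return item


# Declared out of order on purpose: the numbers, not the position, decide the order
ITEM_PIPELINES = {AddLength: 300, StripWhitespace: 100}


def run_pipelines(item, pipelines, spider=None):
    """Pass the item through each pipeline, lowest number first."""
    for cls, _order in sorted(pipelines.items(), key=lambda pair: pair[1]):
        item = cls().process_item(item, spider)
    return item


result = run_pipelines({"name": "  Show A  "}, ITEM_PIPELINES)
print(result)
```

Because `StripWhitespace` (100) runs before `AddLength` (300), the length is computed on the cleaned name.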

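8. Write the data-processing script

pipelines.py

    class MoviePipeline(object):
        def process_item(self, item, spider):
            # Append each show name to a UTF-8 text file, one name per line
            with open("my_meiju.txt", 'a', encoding="utf8") as fp:
                fp.write(item['name'] + '\n')
            # Return the item so that any later pipeline still receives it
            return item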

9. Run the spider

cd movie
scrapy crawl meiju --nolog

10. Results

Intermediate: crawling Xiaohuar (http://www.xiaohuar.com/list-1-1.html)

1. Create a project

scrapy startproject xhspider

2. Create the spider

cd xhspider
scrapy genspider xh xiaohuar.com

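3. The directories and files are created automatically

4. File descriptions:

- scrapy.cfg: the project's configuration. It mainly gives the Scrapy command-line tool its basic settings. (The settings that actually govern the crawler live in settings.py.)
- items.py: defines the data storage templates used to structure data, similar to a Django Model.
- pipelines: data-processing behavior, typically persisting the structured data.
- settings.py: the configuration file, e.g. recursion depth, concurrency, download delay.
- spiders: the spider directory, where you create files and write the crawling rules.

Note: spider files are usually named after the website's domain.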

5. Define the data storage template

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class XhspiderItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        addr = scrapy.Field()
        name = scrapy.Field()

6. Write the spider

    # -*- coding: utf-8 -*-
    import scrapy

    # Import the structured data template defined in items.py
    from xhspider.items import XhspiderItem


    class XhSpider(scrapy.Spider):
        # Spider name, must be unique
        name = "xh"
        # Domains the spider is allowed to visit
        allowed_domains = ["xiaohuar.com"]
        # Start URL
        start_urls = ['http://www.xiaohuar.com/list-1-1.html']

        def parse(self, response):
            # Select the <a> tag of every picture
            allPics = response.xpath('//div[@class="img"]/a')
            for pic in allPics:
                # Handle each picture separately: take its name and address
                item = XhspiderItem()
                name = pic.xpath('./img/@alt').extract()[0]
                addr = pic.xpath('./img/@src').extract()[0]
                # Site-relative addresses need the host prepended
                if addr.startswith('/d/file/'):
                    addr = 'http://www.xiaohuar.com' + addr
                item['name'] = name
                item['addr'] = addr
                # Return the scraped data
                yield item
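The spider turns site-relative image addresses into absolute ones by checking for the `/d/file/` prefix and prepending the host. A more general way to do the same thing is `urllib.parse.urljoin` from the standard library, sketched below. The helper name `absolute_url` and the sample paths are assumptions made for this example. Scrapy responses also provide a `response.urljoin()` method that resolves against the response's own URL.

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.xiaohuar.com/list-1-1.html"


def absolute_url(page_url, src):
    """Resolve an image address against the page it was found on."""
    return urljoin(page_url, src)


# Site-relative path: the host is prepended
rooted = absolute_url(PAGE_URL, "/d/file/example.jpg")
# Already absolute: returned unchanged
external = absolute_url(PAGE_URL, "http://cdn.example.com/x.jpg")
# Page-relative path: resolved against the page's directory
relative = absolute_url(PAGE_URL, "img/x.jpg")

print(rooted, external, relative)
```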

7. Edit the configuration file

    # Set the class that processes the returned data and its execution priority
    ITEM_PIPELINES = {
        'xhspider.pipelines.XhspiderPipeline': 300,
    }

8. Write the data-processing script

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import os
    import ssl
    import urllib.request


    class XhspiderPipeline(object):
        def process_item(self, item, spider):
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}
            print(item['addr'])
            # Skip TLS certificate verification for the download
            ctx = ssl._create_unverified_context()
            req = urllib.request.Request(url=item['addr'], headers=headers)
            res = urllib.request.urlopen(req, context=ctx)
            # Create the output directory if it does not exist yet
            os.makedirs('./allpic', exist_ok=True)
            file_name = os.path.join('./allpic', item['name'] + '.jpg')
            with open(file_name, 'wb') as f:
                f.write(res.read())
            return item
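The pipeline builds the file name directly from `item['name']`, which comes from the page's `alt` text. If that text contains characters such as `/` or `?`, opening the file can fail. One possible guard is sketched below. The helper `safe_filename`, its defaults, and the replacement character are assumptions of this example, not part of the original pipeline.

```python
import re


def safe_filename(name, default="unnamed", max_length=100):
    """Replace characters that are invalid in Windows or POSIX file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", name).strip(" .")
    return cleaned[:max_length] or default


print(safe_filename("a/b:c"))
print(safe_filename("   "))
```

In the pipeline it would be used as `os.path.join('./allpic', safe_filename(item['name']) + '.jpg')`.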

9. Run the spider

cd xhspider
scrapy crawl xh --nolog

Results:

Ultimate: I want every picture on Xiaohuar

Note: the ultimate version is a modification of the intermediate one.

xh.py

    # -*- coding: utf-8 -*-
    import scrapy
    from xhspider.items import XhspiderItem


    class AllxhSpider(scrapy.Spider):
        # Spider name, must be unique
        name = "allxh"
        # Domains the spider is allowed to visit
        allowed_domains = ["xiaohuar.com"]
        # Start URL
        start_urls = ['http://www.xiaohuar.com/hua/']
        # Set of list-page URLs that have already been scheduled
        url_set = set()

        def parse(self, response):
            # Only pages whose URL starts with http://www.xiaohuar.com/list-
            # are scraped for picture names and addresses
            if response.url.startswith("http://www.xiaohuar.com/list-"):
                allPics = response.xpath('//div[@class="img"]/a')
                for pic in allPics:
                    # Handle each picture separately: take its name and address
                    item = XhspiderItem()
                    name = pic.xpath('./img/@alt').extract()[0]
                    addr = pic.xpath('./img/@src').extract()[0]
                    if addr.startswith('/d/file/'):
                        addr = 'http://www.xiaohuar.com' + addr
                    item['name'] = name
                    item['addr'] = addr
                    # Return the scraped data
                    yield item

            # Collect every link on the page
            urls = response.xpath("//a/@href").extract()
            for url in urls:
                # Follow links that start with http://www.xiaohuar.com/list-
                # and are not in the set yet
                if url.startswith("http://www.xiaohuar.com/list-"):
                    if url not in AllxhSpider.url_set:
                        AllxhSpider.url_set.add(url)
                        # The original called self.make_requests_from_url(url),
                        # which is deprecated; build the Request explicitly
                        # and name parse as the callback
                        yield scrapy.Request(url, callback=self.parse)
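The link-following logic of the ultimate version comes down to two checks: does the link start with the list-page prefix, and has it been seen before. The sketch below isolates those checks in a plain function so they can be tried without running a crawl. `new_list_links` and the sample URLs (including `about.html`) are hypothetical. Scrapy's scheduler also filters duplicate requests by default, so the explicit set mainly makes the behavior visible.

```python
LIST_PREFIX = "http://www.xiaohuar.com/list-"


def new_list_links(hrefs, seen, prefix=LIST_PREFIX):
    """Return links that match the prefix and were not seen before, in page order."""
    fresh = []
    for href in hrefs:
        if href.startswith(prefix) and href not in seen:
            seen.add(href)
            fresh.append(href)
    return fresh


seen = set()
first = new_list_links(
    [
        "http://www.xiaohuar.com/list-1-1.html",
        "http://www.xiaohuar.com/about.html",     # wrong prefix, skipped
        "http://www.xiaohuar.com/list-1-2.html",
        "http://www.xiaohuar.com/list-1-1.html",  # duplicate on the same page, skipped
    ],
    seen,
)
second = new_list_links(
    [
        "http://www.xiaohuar.com/list-1-2.html",  # seen on the first page, skipped
        "http://www.xiaohuar.com/list-1-3.html",
    ],
    seen,
)
print(first, second)
```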
