

news 2025/7/9 21:02:43


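This is a fairly basic and common crawler, written up here to record and consolidate the relevant knowledge.

I. Prerequisites

This project uses the Scrapy framework for crawling, so Scrapy has to be installed first: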

pip install scrapy
# Mirror inside mainland China
pip install scrapy -i https://pypi.douban.com/simple

Because the data is saved to a database, pymysql is also needed for the database operations:

pip install pymysql
# Mirror inside mainland China
pip install pymysql -i https://pypi.douban.com/simple

Also create the corresponding table in the database:

create database spider01 charset utf8;
use spider01;
# Only name and src are created here, to keep things simple
create table book(
    id int primary key auto_increment,
    name varchar(188),
    src varchar(188)
);

II. Creating the project

In a terminal, change into the folder where the project should be stored.

1. Create the project

scrapy startproject scrapy_book

Once the project has been created, its structure is as follows:

2. Change to the spiders directory

cd scrapy_book\scrapy_book\spiders

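3. Generate the spider file

Because the crawl involves extracting links, a CrawlSpider file is generated here:

scrapy genspider -t crawl read www.dushu.com

Note: first change the value of follow on line 11 to False. Otherwise the spider keeps following the links it extracts from every page it reaches; setting it to False avoids downloading too much.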
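To make the effect of follow more concrete, here is a toy model in plain Python. It is not Scrapy's scheduler, only a sketch of the idea. It assumes a hypothetical category with 6 pages in which every page links to the next two pages; links_on and crawl are illustrative names, not Scrapy functions.

```python
from collections import deque


def links_on(page: int, last_page: int = 6) -> list[int]:
    """Hypothetical pagination bar: each page links to the next two pages."""
    return [p for p in (page + 1, page + 2) if p <= last_page]


def crawl(start: int, follow: bool) -> list[int]:
    """Return the pages handed to the rule callback, in request order.

    The start page itself is only used as a source of links here.
    """
    seen = {start}
    queue = deque()

    def enqueue(source: int) -> None:
        for target in links_on(source):
            if target not in seen:  # each page is requested at most once
                seen.add(target)
                queue.append(target)

    # Links on the start page are always extracted.
    enqueue(start)
    parsed = []
    while queue:
        page = queue.popleft()
        parsed.append(page)
        if follow:
            # Only with follow=True are links on this page extracted as well.
            enqueue(page)
    return parsed


print(crawl(1, follow=False))  # pages linked directly from the start page
print(crawl(1, follow=True))   # every page reachable through pagination links
```

With these assumptions the first call reaches only pages 2 and 3, while the second one walks on to page 6. The real numbers depend on how many page links the site shows on each page.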

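4. Project structure

We will modify four files in total to complete the crawler:

read.py: the custom spider file, which does the actual crawling
items.py: defines the data structure, as a class that inherits from scrapy.Item
pipelines.py: the pipeline file; it contains a single class that post-processes the scraped data
settings.py: the configuration file, for example whether to obey robots.txt and which User-Agent to send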

III. Page analysis

1. Book analysis

Dushu.com home page:

On dushu.com, pick any category. Foreign fiction is used as the example here.

Here we simply scrape each book's cover image and title. This can of course be extended.

Use XPath to analyze the images on the first page.

From the page's HTML we get:

Title: //div[@class="bookslist"]//img/@alt

Cover image URL: //div[@class="bookslist"]//img/@data-original (data-original is used instead of src because the page lazy-loads its images)
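The two expressions can be tried offline on a small fragment. The sketch below uses a hypothetical, heavily simplified piece of markup (the real page is larger and may be structured differently) and the standard library's xml.etree.ElementTree, which supports only a subset of XPath. For that reason the attributes are read with .get() rather than with /@alt; inside Scrapy, response.xpath accepts the full expressions shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the URLs and titles are placeholders.
SAMPLE = """
<html><body>
  <div class="bookslist">
    <ul>
      <li><img src="placeholder.png" data-original="https://img.example.com/a.jpg" alt="Book A"/></li>
      <li><img src="placeholder.png" data-original="https://img.example.com/b.jpg" alt="Book B"/></li>
    </ul>
  </div>
  <div class="footer"><img src="logo.png" alt="Site logo"/></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Same idea as //div[@class="bookslist"]//img: every img below the book list.
imgs = root.findall(".//div[@class='bookslist']//img")

# alt holds the title; data-original holds the real image URL, while src
# only points at the lazy-loading placeholder.
books = [(img.get("alt"), img.get("data-original")) for img in imgs]

for name, src in books:
    print(name, src)
```

The image in the footer is skipped because it does not sit under the bookslist container, which is the reason for anchoring the expression on that div.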

2. Page number analysis

Page 1: Foreign Fiction - Dushu | dushu.com, or https://www.dushu.com/book/1176_1.html

Page 2: Foreign Fiction - Dushu | dushu.com

Page 3: Foreign Fiction - Dushu | dushu.com

The page URLs follow a pattern that matches the expression r"/book/1176_\d+\.html"
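The expression can be checked with Python's re module before it goes into the spider. Only the first URL below comes from the page analysis; the others are assumed examples written to follow or break the pattern. The check uses search rather than a full match, on the understanding that the link extractor's allow patterns are applied to the complete URL in the same way.

```python
import re

PAGE_LINK = re.compile(r"/book/1176_\d+\.html")

candidates = [
    "https://www.dushu.com/book/1176_1.html",   # page 1, from the analysis above
    "https://www.dushu.com/book/1176_2.html",   # assumed URL of page 2
    "https://www.dushu.com/book/1176_13.html",  # assumed URL of a later page
    "https://www.dushu.com/book/1176.html",     # hypothetical: no page number
    "https://www.dushu.com/book/1175_2.html",   # hypothetical: another category
]

# search() finds the pattern anywhere in the URL, so the scheme and the
# host in front of /book/ do not get in the way.
matches = [url for url in candidates if PAGE_LINK.search(url)]

for url in matches:
    print(url)
```

Only the three URLs with the category id 1176 followed by an underscore and a page number are kept.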

IV. Completing the project

1. Modify items.py

Define the structure of the scraped data yourself:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyBookItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Book title
    name = scrapy.Field()
    # Cover image URL
    src = scrapy.Field()

2. Modify settings.py

Uncomment ITEM_PIPELINES on line 65, and add the configuration for your own database below it.
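As a rough sketch, the relevant part of settings.py could look like the excerpt below. The setting names are the ones the pipeline looks up, and the database name and charset come from the table created earlier. The host, port, user and password are placeholder assumptions that have to be replaced with your own values.

```python
# settings.py (excerpt)

# Enable the pipeline; 300 is the priority from the generated template.
ITEM_PIPELINES = {
    "scrapy_book.pipelines.ScrapyBookPipeline": 300,
}

# Database settings read by the pipeline. All values are placeholders.
DB_HOST = "127.0.0.1"      # assumption: MySQL runs on the same machine
DB_PORT = 3306             # assumption: default MySQL port; keep it an int
DB_USER = "root"           # assumption
DB_PASSWORD = "change-me"  # pipelines.py must look up exactly this name
DB_NAME = "spider01"
DB_CHARSET = "utf8"
```

The port is written as a number rather than a string because it is passed straight on to pymysql.connect.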

3. Modify pipelines.py

This file handles the scraped data:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
# Load the project settings
from scrapy.utils.project import get_project_settings
import pymysql


class ScrapyBookPipeline:
    # Runs once, when the spider is opened
    def open_spider(self, spider):
        settings = get_project_settings()
        # Read the database configuration; the names must match settings.py
        self.host = settings['DB_HOST']
        self.port = settings['DB_PORT']
        self.user = settings['DB_USER']
        self.password = settings['DB_PASSWORD']
        self.name = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']
        self.connect()

    def connect(self):
        self.conn = pymysql.connect(
            host=self.host,
            port=self.port,
            user=self.user,
            password=self.password,
            database=self.name,
            charset=self.charset,
        )
        self.cursor = self.conn.cursor()

    # Runs for every item
    def process_item(self, item, spider):
        # Adjust this to your own table structure; mine is the book table.
        # The %s placeholders let pymysql escape the values, so quotes in a
        # title cannot break the statement.
        sql = 'insert into book(name,src) values(%s,%s)'
        # Execute the SQL statement
        self.cursor.execute(sql, (item['name'], item['src']))
        # Commit
        self.conn.commit()
        # Return the item so that any later pipeline still receives it
        return item

    # Runs once, when the spider is closed
    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
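One detail of the insert statement deserves a closer look: values that are formatted directly into the SQL text can break it, for instance when a book title contains quotation marks. The sketch below shows the difference using the standard library's sqlite3 module as a stand-in, purely so that it runs without a MySQL server. The title and URL are made up, and sqlite3 writes placeholders as ? where pymysql uses %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table book(id integer primary key autoincrement, name text, src text)"
)

name = 'The "Great" Novel'  # hypothetical title containing double quotes
src = "https://img.example.com/great.jpg"  # placeholder URL

# String formatting splices the title's quotes into the statement itself.
formatted = 'insert into book(name,src) values("{}","{}")'.format(name, src)
try:
    conn.execute(formatted)
    formatted_ok = True
except sqlite3.Error as exc:
    formatted_ok = False
    print("formatted statement failed:", exc)

# With placeholders the values travel separately from the SQL text.
conn.execute("insert into book(name, src) values(?, ?)", (name, src))
conn.commit()

rows = conn.execute("select name, src from book").fetchall()
print(rows)
```

With pymysql the same idea is written as cursor.execute('insert into book(name,src) values(%s,%s)', (name, src)).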

4. Modify read.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# The editor may underline this import as an error; that is an editor issue and the import works when the spider runs
from scrapy_book.items import ScrapyBookItem


class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    # Use the URL of the first page, so that every page matches the allow rule and none is missed
    start_urls = ["https://www.dushu.com/book/1176_1.html"]

    # allow: regular expression that selects the links to extract
    # callback: method that handles each response
    # follow: whether to keep applying the rule to the pages reached this way; False here
    rules = (
        Rule(LinkExtractor(allow=r"/book/1176_\d+\.html"), callback="parse_item", follow=False),
    )

    def parse_item(self, response):
        # Select every book image on the current page
        img_list = response.xpath('//div[@class="bookslist"]//img')
        for img in img_list:
            name = img.xpath('./@alt').extract_first()
            src = img.xpath('./@data-original').extract_first()
            book = ScrapyBookItem(name=name, src=src)
            # Hand the item over to the pipeline for storage
            yield book

5. Run the crawl

In a terminal, go to the spiders folder and run: scrapy crawl read

Here read is the value of name in read.py under the spiders folder.

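6. Results

A total of 40 (records per page) * 13 (pages) = 520 records were downloaded.

Changing follow to True in read.py downloads the data for every book in this category, 100 pages in total. If you are on mobile data, download with care so that you do not run out of credit.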

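V. Conclusion

This crawler project should be applicable to quite a few scenarios, and there is not much to it, so following along and writing it yourself does no harm. If there is demand for the code, I will post the project's repository address later. I have not been learning web crawling for long, so writing this down helps me sort out my thinking and gives me something to refer back to when I need it.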
