I. Final result
II. Project code
2.1 Creating the project
This article builds a high-performance crawler with Scrapy, an asynchronous crawling framework. Creating and running a Scrapy project takes three steps:
1. Create the project:
scrapy startproject weibo_hot
2. Create the spider:
scrapy genspider hot_search "weibo.com"
3. Run the spider:
scrapy crawl hot_search
Note: hot_search is the value of the spider's name attribute.
4. Write the item:
import scrapy


class WeiboHotItem(scrapy.Item):
    index = scrapy.Field()
    topic_flag = scrapy.Field()
    icon_desc_color = scrapy.Field()
    small_icon_desc = scrapy.Field()
    small_icon_desc_color = scrapy.Field()
    is_hot = scrapy.Field()
    is_gov = scrapy.Field()
    note = scrapy.Field()
    mid = scrapy.Field()
    url = scrapy.Field()
    flag = scrapy.Field()
    name = scrapy.Field()
    word = scrapy.Field()
    pos = scrapy.Field()
    icon_desc = scrapy.Field()
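The item above only declares fields. As a rough sketch of how one raw hot-search entry might be narrowed down to those fields before an item is built, the helper below keeps only the declared keys. The function name `pick_item_fields` and the sample entry are hypothetical and do not come from the original project; a plain dict stands in for `WeiboHotItem` so the snippet runs without Scrapy installed.

```python
# Field names copied from WeiboHotItem above.
ITEM_FIELDS = (
    "index", "topic_flag", "icon_desc_color", "small_icon_desc",
    "small_icon_desc_color", "is_hot", "is_gov", "note", "mid", "url",
    "flag", "name", "word", "pos", "icon_desc",
)


def pick_item_fields(raw_entry, index):
    """Return a dict holding only the keys that WeiboHotItem declares.

    Hypothetical helper: raw_entry is assumed to be one decoded JSON
    object from the hot-search response; its real shape may differ.
    """
    data = {key: raw_entry[key] for key in ITEM_FIELDS if key in raw_entry}
    data["index"] = index  # position in the list, supplied by the caller
    return data


if __name__ == "__main__":
    # Made-up sample entry, for illustration only.
    sample = {"word": "example topic", "mid": "123", "unknown_key": 1}
    print(pick_item_fields(sample, 1))
```

Filtering first matters because a `scrapy.Item` raises a `KeyError` when given a field it does not declare; in a spider, the filtered dict could then be passed on as `WeiboHotItem(**pick_item_fields(entry, i))`.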
5. Write the item pipeline that saves the parsed data:
import os

from itemadapter import ItemAdapter

from .settings import DATA_URI
from .Utils import Tool

tool = Tool()


class WeiboHotPipeline:
    def open_spider(self, spider):
        self.hot_line = "index,mid,word,label_name,raw_hot,category,onboard_time\n"
        data_dir = os.path.join(DATA_URI)
        file_path = data_dir + '/hot.csv'
        # Check whether the output location exists; create the folder if it does not
        if os.path.isfile(file_path):
            self.data_file = open(file_path, 'a', encoding='utf-8')
        else:
            if not os.path.exists(data_dir):
                os.makedirs(data_dir)
            self.data_file = open(file_path, 'a', encoding='utf-8')
            # New file: write the CSV header first
            self.data_file.write(self.hot_line)

    def close_spider(self, spider):  # Runs automatically when a spider is closed
        self.data_file.close()

    def process_item(self, item, spider):
        try:
            hot_line = '{},{},{},{},{},{},{}\n'.format(
                item.get('index', ''),
                item.get('mid', ''),
                item.get('word', ''),
                item.get('label_name', ''),
                item.get('raw_hot', ''),
                tool.translate_chars(item.get('category', '')),
                tool.get_format_time(item.get('onboard_time', '')),
            )
            self.data_file.write(hot_line)
        except Exception as e:
            print("hot error here >>>>>>>>>>>>>", e, "<<<<<<<<<<<<< error here")
        return item
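The pipeline above joins values with commas by hand, so a topic that itself contains a comma would shift the columns in hot.csv. A minimal sketch of the same "append, and write the header only for a new file" logic using the standard `csv` module, which quotes such values, might look like the following. The function name `append_rows` and the sample rows are assumptions for illustration and are not part of the original project.

```python
import csv
import os
import tempfile

# Column order taken from the header line used in WeiboHotPipeline.
HEADER = ["index", "mid", "word", "label_name", "raw_hot", "category", "onboard_time"]


def append_rows(file_path, rows):
    """Append dict rows to a CSV file, writing the header only for a new file."""
    data_dir = os.path.dirname(file_path)
    if data_dir:
        os.makedirs(data_dir, exist_ok=True)
    is_new = not os.path.isfile(file_path)
    # newline="" lets the csv module control line endings itself
    with open(file_path, "a", encoding="utf-8", newline="") as f:
        # Missing keys become empty cells; keys outside HEADER are ignored
        writer = csv.DictWriter(f, fieldnames=HEADER, restval="", extrasaction="ignore")
        if is_new:
            writer.writeheader()
        for row in rows:
            writer.writerow(row)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data", "hot.csv")
        append_rows(path, [{"index": 1, "mid": "123", "word": "a,b"}])  # made-up row
        append_rows(path, [{"index": 2, "mid": "456", "word": "c"}])
        with open(path, encoding="utf-8", newline="") as f:
            print(f.read())
```

Inside a pipeline, the same writer could be created in `open_spider` and reused in `process_item`, so the file is opened only once per crawl.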
III. Notes
Change the following setting in settings.py:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # Disabled; otherwise Weibo's robots.txt rules stop the crawler from fetching any data
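Two further settings are implied by the code in this article, although the original does not show them: the pipeline imports `DATA_URI` from settings.py, and Scrapy only runs a pipeline that is listed in `ITEM_PIPELINES`. A sketch of what those entries might look like follows; the directory value, the module path `weibo_hot.pipelines` (Scrapy's default layout for a project named weibo_hot), and the priority 300 are all assumptions.

```python
# settings.py fragment: the values below are illustrative assumptions

# Directory where WeiboHotPipeline writes hot.csv
DATA_URI = "./data"

# Enable the pipeline; lower numbers run earlier (customarily 0-1000)
ITEM_PIPELINES = {
    "weibo_hot.pipelines.WeiboHotPipeline": 300,
}
```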
IV. Running the crawler
V. Project documentation
VI. Getting the full source code
Fellow learners: the complete source code for this example has been uploaded to the WeChat official account “一个努力奔跑的snail”. Reply 热搜榜 (hot search list) in the account's chat to get it.