Scraping Lianjia Second-Hand Housing Data and Storing It in Excel
Contents
- Development Environment Setup
- Crawler Workflow Analysis
- Core Code Implementation
- Key Commands in Detail
- Advanced Optimizations
- Notes and Extensions
1. Development Environment Setup
1.1 Installing Required Packages
pip install requests beautifulsoup4 lxml openpyxl pandas
- requests: HTTP request library (version ≥ 2.25.1)
- beautifulsoup4: HTML parsing library (version ≥ 4.11.2)
- lxml: fast parser backend used by BeautifulSoup in the code below
- openpyxl: Excel file library (version ≥ 3.1.2)
- pandas: data analysis library (version ≥ 2.0.3)
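These version floors translate directly into a requirements.txt (used by the quick start at the end of this article; lxml is included because the code below parses with the lxml backend):
```
requests>=2.25.1
beautifulsoup4>=4.11.2
lxml
openpyxl>=3.1.2
pandas>=2.0.3
```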
1.2 Verifying the Environment
import requests
from bs4 import BeautifulSoup
import pandas as pd

print("All libraries loaded successfully!")
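To also confirm that the installed versions meet the floors above, each of these libraries exposes a standard `__version__` attribute:
```python
import requests
import bs4
import openpyxl
import pandas

# Print each installed version to compare against the minimums listed above
for lib in (requests, bs4, openpyxl, pandas):
    print(f"{lib.__name__:<15} {lib.__version__}")
```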
2. Crawler Workflow Analysis
2.1 Technical Roadmap
Request the paginated listing URLs → parse the HTML with BeautifulSoup → extract and clean each field → collect rows into a DataFrame → write the result to Excel.
2.2 Target Page Structure
https://cq.lianjia.com/ershoufang/
├── div.leftContent
│   └── ul.sellListContent
│       └── li[data-houseid]                    # one listing
│           ├── div.title > a                   # title
│           ├── div.flood > div                 # address
│           ├── div.priceInfo > div.totalPrice  # total price
│           └── div.followInfo                  # follower count
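Before running the full crawler, these selectors can be sanity-checked against a locally saved copy of the listing page (a minimal sketch; `page.html` is a hypothetical snapshot you saved from the URL above):
```python
from bs4 import BeautifulSoup

# Hypothetical local snapshot of https://cq.lianjia.com/ershoufang/
with open('page.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

# Each li[data-houseid] under ul.sellListContent is one listing
for li in soup.select('ul.sellListContent > li[data-houseid]'):
    print(li.select_one('div.title a').get_text(strip=True))
```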
3. Core Code Implementation
3.1 Complete Code (with Detailed Comments)
"""
鏈家二手房數(shù)據(jù)采集器
版本:1.2
"""import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36','Accept-Language': 'zh-CN,zh;q=0.9'
}def get_house_data(max_page=5):"""獲取鏈家二手房數(shù)據(jù)參數(shù):max_page: 最大爬取頁數(shù)(默認5頁)返回:pandas.DataFrame格式的清洗后數(shù)據(jù)"""all_data = []for page in range(1, max_page+1):url = f"https://cq.lianjia.com/ershoufang/pg{page}/"try:response = requests.get(url, headers=headers, timeout=10)response.raise_for_status() sleep(1.5) soup = BeautifulSoup(response.text, 'lxml')house_list = soup.select('ul.sellListContent > li[data-houseid]')for house in house_list:try:title = house.select_one('div.title a').text.strip()address = house.select_one('div.flood > div').text.strip()total_price = house.select_one('div.totalPrice').text.strip()unit_price = house.select_one('div.unitPrice').text.strip()follow = house.select_one('div.followInfo').text.split('/')[0].strip()cleaned_data = {'標題': title,'地址': address.replace(' ', ''),'總價(萬)': float(total_price.replace('萬', '')),'單價(元/㎡)': int(unit_price.replace('元/㎡', '').replace(',', '')),'關注量': int(follow.replace('人關注', ''))}all_data.append(cleaned_data)except Exception as e:print(f"數(shù)據(jù)解析異常:{str(e)}")continueexcept requests.exceptions.RequestException as e:print(f"網(wǎng)絡請求失敗:{str(e)}")continuereturn pd.DataFrame(all_data)def save_to_excel(df, filename='house_data.xlsx'):"""將數(shù)據(jù)保存為Excel文件參數(shù):df: pandas.DataFrame數(shù)據(jù)框filename: 輸出文件名"""writer = pd.ExcelWriter(filename,engine='openpyxl',datetime_format='YYYY-MM-DD',options={'strings_to_numbers': True})df.to_excel(writer,index=False,sheet_name='鏈家數(shù)據(jù)',float_format="%.2f",freeze_panes=(1,0))writer.book.save(filename)print(f"數(shù)據(jù)已保存至 {filename}")if __name__ == '__main__':house_df = get_house_data(max_page=3)if not house_df.empty:save_to_excel(house_df)print(f"成功采集 {len(house_df)} 條數(shù)據(jù)")else:print("未獲取到有效數(shù)據(jù)")
4. Key Commands in Detail
4.1 Core Method Reference
4.1.1 pandas.to_excel Parameters
df.to_excel(
    excel_writer,          # file path or an ExcelWriter object
    sheet_name='Sheet1',   # name of the target worksheet
    na_rep='',             # string shown for missing values
    float_format=None,     # format string for floats, e.g. "%.2f"
    columns=None,          # optional subset of columns to write
    header=True,           # write out column names
    index=True,            # write out the row index
    index_label=None,      # column label for the index
    startrow=0,            # upper-left cell row offset
    startcol=0,            # upper-left cell column offset
    engine=None,           # e.g. 'openpyxl' or 'xlsxwriter'
    merge_cells=True,      # merge MultiIndex header cells
    inf_rep='inf'          # string shown for infinity
)
Note: the `encoding` parameter accepted by older pandas releases was removed in pandas 2.0.
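For this project, the most useful of these parameters combine as follows (a short sketch reusing the `house_df` frame from Section 3):
```python
# Drop the index column, round floats to 2 decimals in the output,
# and freeze the header row so it stays visible while scrolling
house_df.to_excel(
    'house_data.xlsx',
    index=False,
    sheet_name='Lianjia Data',
    float_format="%.2f",
    freeze_panes=(1, 0)   # freeze 1 row, 0 columns
)
```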
4.2 Anti-Scraping Countermeasures
# 1. Send realistic browser headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://cq.lianjia.com/'
}

# 2. Route requests through proxies (addresses are placeholders)
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

# 3. Randomize the delay between requests
import random
sleep(random.uniform(1, 3))
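These three measures can be folded into one request helper (a sketch; `fetch_page` is a hypothetical name, and the proxy addresses above are placeholders you would replace with working ones):
```python
import random
from time import sleep

import requests

def fetch_page(url, headers, proxies=None, retries=3):
    """Fetch a URL with a random delay and simple retries."""
    for attempt in range(1, retries + 1):
        sleep(random.uniform(1, 3))  # random pacing before each try
        try:
            resp = requests.get(url, headers=headers,
                                proxies=proxies, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt}/{retries} failed: {e}")
    return None  # all retries exhausted
```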
5. Advanced Optimizations
5.1 Data Storage Optimizations
Writing several cities into one workbook, one sheet each:

with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Chongqing')
    df2.to_excel(writer, sheet_name='Beijing')
Appending to an existing workbook (note: the old pattern of assigning `writer.book = load_workbook(filename)` no longer works on recent pandas, where `ExcelWriter.book` is read-only; append mode replaces it):

def append_to_excel(df, filename, sheet_name='Sheet1'):
    # Open the workbook in append mode (pandas >= 1.4, openpyxl engine)
    with pd.ExcelWriter(filename, engine='openpyxl', mode='a',
                        if_sheet_exists='overlay') as writer:
        # Continue below the last occupied row of the target sheet
        startrow = writer.sheets[sheet_name].max_row
        df.to_excel(writer, sheet_name=sheet_name,
                    startrow=startrow, index=False, header=False)
5.2 Exception Monitoring
import logging

logging.basicConfig(
    filename='spider.log',
    level=logging.ERROR,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

try:
    house_df = get_house_data()
except Exception as e:
    logging.error(f"Critical error: {e}", exc_info=True)
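For long-running crawls, a single spider.log file can grow without bound; the standard library's rotating handler caps it (a sketch with assumed size limits):
```python
import logging
from logging.handlers import RotatingFileHandler

# Keep spider.log at ~1 MB, retaining up to 3 rotated copies
handler = RotatingFileHandler('spider.log', maxBytes=1_000_000, backupCount=3)
handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))

logger = logging.getLogger('spider')
logger.setLevel(logging.ERROR)
logger.addHandler(handler)

logger.error("example entry")  # written to spider.log and rotated as needed
```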
6. Notes
- Legal compliance
  Strictly observe the Cybersecurity Law and the site's robots.txt, and throttle your crawl rate.
- Data cleaning
  It is advisable to add field validation:

  def validate_price(price):
      # A plausible total price lies between 10 and 2000 (units: 10k CNY)
      return 10 < price < 2000

  # e.g. house_df = house_df[house_df['Total Price (10k CNY)'].apply(validate_price)]
- Performance tuning
  - Use multithreaded fetching with a bounded worker count (see the sketch after this list)
  - Use the lxml parser instead of html.parser
  - Skip BeautifulSoup's output-formatting features (e.g. prettify())
- Storage options

  | Storage | Pros | Cons |
  | --- | --- | --- |
  | Excel | convenient to view | poor performance on large datasets |
  | CSV | universal format | no multi-sheet support |
  | SQLite | lightweight database | requires SQL knowledge |
  | MySQL | suited to large-scale storage | requires deploying a database server |
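The bounded multithreading mentioned under performance tuning might look like this (a sketch; it reuses the `headers` dict from Section 3 and caps concurrency at 4 workers to keep the request rate polite):
```python
from concurrent.futures import ThreadPoolExecutor

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def fetch_listing_page(page):
    """Download one listing page; returns HTML text, or None on failure."""
    url = f"https://cq.lianjia.com/ershoufang/pg{page}/"
    try:
        resp = requests.get(url, headers=headers, timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.exceptions.RequestException:
        return None

# Bounded concurrency: at most 4 pages in flight at a time
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch_listing_page, range(1, 6)))
```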
# Quick Start Guide
1. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```
2. Run the crawler:
   ```bash
   python lianjia_spider.py
   ```
3. Output files:
   - house_data.xlsx: the complete, cleaned dataset
   - spider.log: error log
With this setup, stable collection on the order of 100,000 records per day is achievable; adjust the crawl frequency and storage backend to your actual needs. Always comply with applicable laws and regulations and use scraping techniques responsibly.