
WebScraper Utility Class Notes: A Flexible, Easy-to-Use Scraping Framework

Preface:

Install the required packages and drivers:

pip install selenium webdriver-manager

1. Class: WebScraper

This utility class wraps the core functionality for browser control, page interaction, and data extraction, aiming to provide a flexible and easy-to-use scraping framework.

2. Initializer

__init__(browser_type="chrome", headless=True, user_agent=None, proxy=None, timeout=30, debug=False)

  • Purpose: initialize a scraper instance and configure the browser and dev tools

  • Parameters

    • browser_type: browser type, one of "chrome", "firefox", "edge"
    • headless: whether to run the browser in headless mode
    • user_agent: custom User-Agent string
    • proxy: proxy server configuration, in the form {"http": "http://proxy.example.com:8080", "https": "http://proxy.example.com:8080"}
    • timeout: operation timeout in seconds
    • debug: whether to enable debug mode
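As a quick sanity check, the proxy dictionary format above can be split into host and port the same way the Firefox branch of the implementation later does. A minimal stdlib sketch (the proxy URL is a placeholder, not a real server):

```python
def parse_proxy(proxy_url):
    """Split an 'http://host:port' proxy string into (host, port)."""
    host, port = proxy_url.replace("http://", "").replace("https://", "").split(":")
    return host, int(port)

proxy = {"http": "http://proxy.example.com:8080",
         "https": "http://proxy.example.com:8080"}
host, port = parse_proxy(proxy["http"])
print(host, port)  # proxy.example.com 8080
```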

3. Browser Control Methods

open_url(url)

  • Purpose: open the given URL

  • Parameters

    • url: target URL
  • Returns: whether the page finished loading

close()

  • Purpose: close the browser instance
  • Parameters: none

refresh()

  • Purpose: refresh the current page
  • Parameters: none

go_back()

  • Purpose: go back to the previous page
  • Parameters: none

4. Element Location and Interaction Methods

find_element(selector, by="css", timeout=None)

  • Purpose: find a single element

  • Parameters

    • selector: selector string
    • by: selector type, one of "css", "xpath", "id", "class", "name", "link_text", "partial_link_text", "tag_name"
    • timeout: how long to wait for the element to appear, in seconds
  • Returns: the element found, or None

find_elements(selector, by="css", timeout=None)

  • Purpose: find multiple elements
  • Parameters: same as find_element
  • Returns: a list of the elements found

click(element=None, selector=None, by="css", timeout=None)

  • Purpose: click an element

  • Parameters

    • element: element object (takes precedence)
    • selector: selector string (used when element is None)
    • by: selector type
    • timeout: how long to wait for the element to appear
  • Returns: whether the operation succeeded

type_text(text, element=None, selector=None, by="css", timeout=None, clear_first=True)

  • Purpose: type text into an input field

  • Parameters

    • text: the text to type
    • element: element object (takes precedence)
    • selector: selector string (used when element is None)
    • by: selector type
    • timeout: how long to wait for the element to appear
    • clear_first: whether to clear the input field first
  • Returns: whether the operation succeeded

5. Scrolling Methods

scroll(direction="down", amount=None, element=None, smooth=True, duration=0.5)

  • Purpose: scroll the page or an element

  • Parameters

    • direction: scroll direction, one of "up", "down", "left", "right"
    • amount: scroll distance in pixels; defaults to 50% of the page height/width
    • element: the element to scroll; defaults to the whole page
    • smooth: whether to scroll smoothly
    • duration: scroll duration in seconds
  • Returns: whether the operation succeeded

scroll_to_element(element=None, selector=None, by="css", timeout=None, align="center")

  • Purpose: scroll to the given element

  • Parameters

    • element: element object (takes precedence)
    • selector: selector string (used when element is None)
    • by: selector type
    • timeout: how long to wait for the element to appear
    • align: element alignment, one of "top", "center", "bottom"
  • Returns: whether the operation succeeded

scroll_to_bottom(element=None, steps=10, delay=0.5)

  • Purpose: scroll to the bottom of the page or an element

  • Parameters

    • element: the element to scroll; defaults to the whole page
    • steps: number of scroll steps
    • delay: delay between steps, in seconds
  • Returns: whether the operation succeeded
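The stopping rule behind scroll_to_bottom (scroll, wait, stop once the scroll height stops growing) can be simulated without a browser. A sketch against a made-up "lazy loading" page whose height grows on each scroll until it settles:

```python
def scroll_until_stable(get_height, do_scroll, steps=10):
    """Scroll repeatedly until the reported height stops changing
    (or the step budget runs out). Returns the steps performed."""
    performed = 0
    for _ in range(steps):
        last_height = get_height()
        do_scroll()
        performed += 1
        if get_height() == last_height:  # no new content loaded
            break
    return performed

class FakePage:
    """Stand-in for a lazily loading page (illustration only)."""
    def __init__(self, final_height=3000, growth=1000):
        self.height = 1000
        self.final = final_height
        self.growth = growth
    def get_height(self):
        return self.height
    def scroll(self):
        # Each scroll triggers more content until the final height is reached.
        self.height = min(self.height + self.growth, self.final)

page = FakePage()
steps = scroll_until_stable(page.get_height, page.scroll, steps=10)
print(steps)  # 3
```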

6. Pagination Methods

next_page(selector=None, method="click", url_template=None, page_param="page", next_page_func=None)

  • Purpose: go to the next page

  • Parameters

    • selector: selector for the next-page button (used when method is "click")
    • method: pagination method, one of "click", "url", "function"
    • url_template: URL template (used when method is "url")
    • page_param: page-number parameter name (used when method is "url")
    • next_page_func: custom pagination function (used when method is "function")
  • Returns: whether pagination succeeded

has_next_page(selector=None, check_func=None)

  • Purpose: check whether there is a next page

  • Parameters

    • selector: selector for the next-page button
    • check_func: custom check function
  • Returns: a boolean indicating whether a next page exists
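The "is there a next page?" check boils down to a heuristic on the button's attributes, which can be shown in isolation. A sketch mirroring the implementation's rule (a control counts as usable unless it carries a disabled attribute or a "disabled"/"inactive" class):

```python
def looks_enabled(disabled_attr, class_attr):
    """Heuristic used by has_next_page for a next-page control."""
    if disabled_attr:
        return False
    if class_attr and ("disabled" in class_attr or "inactive" in class_attr):
        return False
    return True

print(looks_enabled(None, "n next"))           # True
print(looks_enabled(None, "n next disabled"))  # False
print(looks_enabled("true", "n next"))         # False
```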

set_page(page_num, url_template=None, page_param="page")

  • Purpose: jump to the given page number

  • Parameters

    • page_num: target page number
    • url_template: URL template
    • page_param: page-number parameter name
  • Returns: whether the operation succeeded
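URL-based pagination (set_page, and next_page with method="url") is plain placeholder substitution. A sketch of the exact template rule the implementation uses, with a hypothetical listing URL:

```python
def build_page_url(url_template, page_num, page_param="page"):
    """Substitute the page number into a '{page}'-style URL template,
    as set_page / next_page (method='url') do."""
    return url_template.replace(f"{{{page_param}}}", str(page_num))

url = build_page_url("https://example.com/list?page={page}", 3)
print(url)  # https://example.com/list?page=3
```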

7. Data Extraction Methods

get_text(element=None, selector=None, by="css", timeout=None)

  • Purpose: get an element's text content
  • Parameters: same as find_element
  • Returns: the text content, or None

get_attribute(attribute, element=None, selector=None, by="css", timeout=None)

  • Purpose: get the value of an element attribute

  • Parameters

    • attribute: attribute name
    • other parameters same as find_element
  • Returns: the attribute value, or None

extract_data(template)

  • Purpose: extract page data according to a template

  • Parameters

    • template: data extraction template, a dict mapping field names to selectors or extraction functions
  • Returns: the extracted data
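The template dispatch in extract_data — a string means "selector for text", a 2-tuple means (selector, attribute), a callable means custom logic — can be exercised against a stub scraper without a browser. The selectors and page data below are made up for illustration:

```python
class StubScraper:
    """Stand-in for WebScraper with canned page data (illustration only)."""
    def get_text(self, selector=None, **kw):
        return {"h1.title": "Hello"}.get(selector)
    def get_attribute(self, attribute, selector=None, **kw):
        return {("a.link", "href"): "/next"}.get((selector, attribute))

def extract_data(scraper, template):
    # Same rule dispatch as WebScraper.extract_data
    out = {}
    for field, rule in template.items():
        if isinstance(rule, str):                       # selector -> text
            out[field] = scraper.get_text(selector=rule)
        elif isinstance(rule, tuple) and len(rule) == 2:  # (selector, attribute)
            out[field] = scraper.get_attribute(selector=rule[0], attribute=rule[1])
        elif callable(rule):                            # custom function
            out[field] = rule(scraper)
        else:
            out[field] = None
    return out

data = extract_data(StubScraper(), {
    "title": "h1.title",
    "next_link": ("a.link", "href"),
    "const": lambda s: 42,
})
print(data)  # {'title': 'Hello', 'next_link': '/next', 'const': 42}
```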

8. DevTools Methods

start_capturing_network()

  • Purpose: start capturing network requests
  • Parameters: none

stop_capturing_network()

  • Purpose: stop capturing network requests
  • Parameters: none

get_captured_requests(filter_type=None, url_pattern=None)

  • Purpose: get the captured network requests

  • Parameters

    • filter_type: filter by request type, e.g. "xhr", "fetch", "script", "image", "stylesheet"
    • url_pattern: filter by URL pattern; regular expressions are supported
  • Returns: the list of matching requests
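The filtering step of get_captured_requests works on plain dicts (Performance API resource entries carry a `name` URL and an `initiatorType`), so it can be shown on sample data. The entries below are fabricated; the rules mirror the implementation, including the "xhr" alias for "xmlhttprequest":

```python
import re

def filter_requests(entries, filter_type=None, url_pattern=None):
    """Apply get_captured_requests' filtering rules to resource entries."""
    out = []
    for req in entries:
        if filter_type:
            initiator = req.get("initiatorType", "").lower()
            if filter_type.lower() == "xhr":       # alias for xmlhttprequest
                if initiator != "xmlhttprequest":
                    continue
            elif filter_type.lower() not in initiator:
                continue
        if url_pattern and not re.search(url_pattern, req.get("name", "")):
            continue
        out.append(req)
    return out

entries = [  # made-up sample entries
    {"name": "https://example.com/api/items", "initiatorType": "xmlhttprequest"},
    {"name": "https://example.com/logo.png", "initiatorType": "img"},
    {"name": "https://example.com/app.js", "initiatorType": "script"},
]
print([r["name"] for r in filter_requests(entries, filter_type="xhr")])
# ['https://example.com/api/items']
```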

add_request_interceptor(pattern, handler_func)

  • Purpose: add a request interceptor

  • Parameters

    • pattern: URL match pattern
    • handler_func: handler function; receives the request object and may modify the request or return a custom response
  • Returns: interceptor ID

9. Auxiliary Methods

wait_for_element(selector, by="css", timeout=None, condition="visible")

  • Purpose: wait for an element to satisfy a given condition

  • Parameters

    • selector: selector string
    • by: selector type
    • timeout: timeout in seconds
    • condition: wait condition, one of "visible", "present", "clickable", "invisible", "not_present"
  • Returns: the element, or None

execute_script(script, *args)

  • Purpose: execute JavaScript code

  • Parameters

    • script: JavaScript code
    • *args: arguments passed to the JavaScript
  • Returns: the result of the JavaScript execution

set_delay(min_delay, max_delay=None)

  • Purpose: set a random delay between operations

  • Parameters

    • min_delay: minimum delay in seconds
    • max_delay: maximum delay in seconds; if None, the delay is fixed at min_delay
  • Returns: none
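The delay bookkeeping is easy to verify on its own: the implementation draws a uniform random duration between the two bounds, and collapses to a fixed delay when max_delay is omitted. A sketch with the sleep left out:

```python
import random

class DelayMixin:
    """Just the delay bookkeeping from WebScraper, without the sleep."""
    def __init__(self):
        self._min_delay = 0.5
        self._max_delay = 1.5
    def set_delay(self, min_delay, max_delay=None):
        self._min_delay = min_delay
        self._max_delay = max_delay if max_delay is not None else min_delay
    def next_delay(self):
        # WebScraper._perform_delay sleeps for exactly this duration
        return random.uniform(self._min_delay, self._max_delay)

d = DelayMixin()
d.set_delay(1.0, 2.0)
assert all(1.0 <= d.next_delay() <= 2.0 for _ in range(100))
d.set_delay(0.3)  # max omitted -> fixed delay
assert d._max_delay == 0.3
```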

take_screenshot(path=None)

  • Purpose: take a screenshot of the current page

  • Parameters

    • path: path to save to; if None, the image data is returned
  • Returns: the image's binary data if path is None, otherwise the save result

10. Implementation

python

import time
import random
import json
import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, WebDriverException
from selenium.webdriver.common.action_chains import ActionChains

# Optional: For easier driver management
try:
    from webdriver_manager.chrome import ChromeDriverManager
    from webdriver_manager.firefox import GeckoDriverManager
    from webdriver_manager.microsoft import EdgeChromiumDriverManager
    WEBDRIVER_MANAGER_AVAILABLE = True
except ImportError:
    WEBDRIVER_MANAGER_AVAILABLE = False
    print("Consider installing webdriver-manager for easier driver setup: pip install webdriver-manager")


class WebScraper:
    _BY_MAP = {
        "css": By.CSS_SELECTOR,
        "xpath": By.XPATH,
        "id": By.ID,
        "class": By.CLASS_NAME,  # Note: find by class name only works for a single class
        "name": By.NAME,
        "link_text": By.LINK_TEXT,
        "partial_link_text": By.PARTIAL_LINK_TEXT,
        "tag_name": By.TAG_NAME,
    }

    def __init__(self, browser_type="chrome", headless=True, user_agent=None, proxy=None, timeout=30, debug=False):
        self.browser_type = browser_type.lower()
        self.headless = headless
        self.user_agent = user_agent
        self.proxy = proxy
        self.timeout = timeout
        self.debug = debug
        self.driver = None
        self.current_page_num = 1  # For URL-based pagination
        self._min_delay = 0.5
        self._max_delay = 1.5
        self._network_requests_raw = []  # To store JS collected network entries
        self._setup_driver()

    def _print_debug(self, message):
        if self.debug:
            print(f"[DEBUG] {message}")

    def _setup_driver(self):
        self._print_debug(f"Setting up {self.browser_type} browser...")
        options = None
        service = None
        if self.browser_type == "chrome":
            options = webdriver.ChromeOptions()
            if self.user_agent:
                options.add_argument(f"user-agent={self.user_agent}")
            if self.headless:
                options.add_argument("--headless")
                options.add_argument("--window-size=1920x1080")  # Often needed for headless
            if self.proxy:
                if "http" in self.proxy:  # Basic proxy, for more auth use selenium-wire
                    options.add_argument(f"--proxy-server={self.proxy['http']}")
                elif "https" in self.proxy:  # Selenium typically uses one proxy for all
                    options.add_argument(f"--proxy-server={self.proxy['https']}")
            options.add_argument("--disable-gpu")
            options.add_argument("--no-sandbox")
            options.add_argument("--disable-dev-shm-usage")  # Overcome limited resource problems
            if WEBDRIVER_MANAGER_AVAILABLE:
                try:
                    service = webdriver.chrome.service.Service(ChromeDriverManager().install())
                    self.driver = webdriver.Chrome(service=service, options=options)
                except Exception as e:
                    self._print_debug(f"WebDriverManager for Chrome failed: {e}. Falling back to default PATH.")
                    self.driver = webdriver.Chrome(options=options)  # Fallback to PATH
            else:
                self.driver = webdriver.Chrome(options=options)
        elif self.browser_type == "firefox":
            options = webdriver.FirefoxOptions()
            if self.user_agent:
                options.set_preference("general.useragent.override", self.user_agent)
            if self.headless:
                options.add_argument("--headless")
            if self.proxy:
                # Firefox proxy setup is more involved via preferences
                if "http" in self.proxy:
                    host, port = self.proxy['http'].replace('http://', '').split(':')
                    options.set_preference("network.proxy.type", 1)
                    options.set_preference("network.proxy.http", host)
                    options.set_preference("network.proxy.http_port", int(port))
                if "https" in self.proxy:  # Assuming same proxy for https
                    host, port = self.proxy['https'].replace('https://', '').split(':')
                    options.set_preference("network.proxy.ssl", host)
                    options.set_preference("network.proxy.ssl_port", int(port))
                # options.set_preference("network.proxy.share_proxy_settings", True)  # if one proxy for all
            if WEBDRIVER_MANAGER_AVAILABLE:
                try:
                    service = webdriver.firefox.service.Service(GeckoDriverManager().install())
                    self.driver = webdriver.Firefox(service=service, options=options)
                except Exception as e:
                    self._print_debug(f"WebDriverManager for Firefox failed: {e}. Falling back to default PATH.")
                    self.driver = webdriver.Firefox(options=options)
            else:
                self.driver = webdriver.Firefox(options=options)
        elif self.browser_type == "edge":
            options = webdriver.EdgeOptions()
            if self.user_agent:
                options.add_argument(f"user-agent={self.user_agent}")
            if self.headless:
                options.add_argument("--headless")  # Edge uses Chromium engine
                options.add_argument("--window-size=1920x1080")
            if self.proxy and "http" in self.proxy:  # Basic proxy
                options.add_argument(f"--proxy-server={self.proxy['http']}")
            options.add_argument("--disable-gpu")
            if WEBDRIVER_MANAGER_AVAILABLE:
                try:
                    service = webdriver.edge.service.Service(EdgeChromiumDriverManager().install())
                    self.driver = webdriver.Edge(service=service, options=options)
                except Exception as e:
                    self._print_debug(f"WebDriverManager for Edge failed: {e}. Falling back to default PATH.")
                    self.driver = webdriver.Edge(options=options)
            else:
                self.driver = webdriver.Edge(options=options)
        else:
            raise ValueError(f"Unsupported browser: {self.browser_type}")
        self.driver.implicitly_wait(self.timeout / 2)  # Implicit wait for elements
        self.driver.set_page_load_timeout(self.timeout)
        self._print_debug(f"{self.browser_type} browser setup complete.")

    def _get_selenium_by(self, by_string):
        by_string = by_string.lower()
        if by_string not in self._BY_MAP:
            raise ValueError(f"Invalid selector type: {by_string}. Supported: {list(self._BY_MAP.keys())}")
        return self._BY_MAP[by_string]

    def _perform_delay(self):
        time.sleep(random.uniform(self._min_delay, self._max_delay))

    # --- Browser Control ---
    def open_url(self, url):
        self._print_debug(f"Opening URL: {url}")
        try:
            self.driver.get(url)
            self._perform_delay()
            # A simple check; for true "loaded" status, might need to wait for a specific element
            return self.driver.execute_script("return document.readyState") == "complete"
        except WebDriverException as e:
            self._print_debug(f"Error opening URL {url}: {e}")
            return False

    def close(self):
        if self.driver:
            self._print_debug("Closing browser.")
            self.driver.quit()
            self.driver = None

    def refresh(self):
        self._print_debug("Refreshing page.")
        self.driver.refresh()
        self._perform_delay()

    def go_back(self):
        self._print_debug("Going back to previous page.")
        self.driver.back()
        self._perform_delay()

    # --- Element Location & Interaction ---
    def find_element(self, selector, by="css", timeout=None):
        wait_timeout = timeout if timeout is not None else self.timeout
        self._print_debug(f"Finding element by {by}: '{selector}' with timeout {wait_timeout}s")
        try:
            wait = WebDriverWait(self.driver, wait_timeout)
            element = wait.until(EC.presence_of_element_located((self._get_selenium_by(by), selector)))
            return element
        except TimeoutException:
            self._print_debug(f"Element not found by {by}: '{selector}' within {wait_timeout}s.")
            return None
        except Exception as e:
            self._print_debug(f"Error finding element by {by}: '{selector}': {e}")
            return None

    def find_elements(self, selector, by="css", timeout=None):
        wait_timeout = timeout if timeout is not None else self.timeout
        self._print_debug(f"Finding elements by {by}: '{selector}' with timeout {wait_timeout}s")
        try:
            # Wait for at least one element to be present to ensure page readiness
            WebDriverWait(self.driver, wait_timeout).until(
                EC.presence_of_all_elements_located((self._get_selenium_by(by), selector)))
            # Then find all elements without further explicit wait beyond implicit
            return self.driver.find_elements(self._get_selenium_by(by), selector)
        except TimeoutException:
            self._print_debug(f"No elements found by {by}: '{selector}' within {wait_timeout}s.")
            return []
        except Exception as e:
            self._print_debug(f"Error finding elements by {by}: '{selector}': {e}")
            return []

    def click(self, element=None, selector=None, by="css", timeout=None):
        if not element and selector:
            element = self.wait_for_element(selector, by, timeout, condition="clickable")
        if element:
            try:
                self._print_debug(f"Clicking element: {element.tag_name} (selector: {selector})")
                # Try JavaScript click if standard click is intercepted
                try:
                    element.click()
                except WebDriverException:  # e.g. ElementClickInterceptedException
                    self._print_debug("Standard click failed, trying JavaScript click.")
                    self.driver.execute_script("arguments[0].click();", element)
                self._perform_delay()
                return True
            except Exception as e:
                self._print_debug(f"Error clicking element: {e}")
                return False
        self._print_debug("Element not provided or not found for click.")
        return False

    def type_text(self, text, element=None, selector=None, by="css", timeout=None, clear_first=True):
        if not element and selector:
            element = self.wait_for_element(selector, by, timeout, condition="visible")
        if element:
            try:
                self._print_debug(f"Typing text '{text}' into element: {element.tag_name} (selector: {selector})")
                if clear_first:
                    element.clear()
                element.send_keys(text)
                self._perform_delay()
                return True
            except Exception as e:
                self._print_debug(f"Error typing text: {e}")
                return False
        self._print_debug("Element not provided or not found for typing.")
        return False

    # --- Scrolling Methods ---
    def scroll(self, direction="down", amount=None, element=None, smooth=True, duration=0.5):
        self._print_debug(f"Scrolling {direction}...")
        target = "window"
        if element:
            target = "arguments[0]"  # Element will be passed as arguments[0]
        behavior = "smooth" if smooth else "auto"
        # Elements expose clientHeight/clientWidth; innerHeight/innerWidth exist only on window
        if direction in ("down", "up"):
            default_amount = f"{target}.clientHeight / 2" if element else "window.innerHeight / 2"
        else:
            default_amount = f"{target}.clientWidth / 2" if element else "window.innerWidth / 2"
        scroll_val = amount if amount is not None else default_amount
        if direction == "down":
            script = f"{target}.scrollBy({{ top: {scroll_val}, left: 0, behavior: '{behavior}' }});"
        elif direction == "up":
            script = f"{target}.scrollBy({{ top: -{scroll_val}, left: 0, behavior: '{behavior}' }});"
        elif direction == "left":
            script = f"{target}.scrollBy({{ top: 0, left: -{scroll_val}, behavior: '{behavior}' }});"
        elif direction == "right":
            script = f"{target}.scrollBy({{ top: 0, left: {scroll_val}, behavior: '{behavior}' }});"
        else:
            self._print_debug(f"Invalid scroll direction: {direction}")
            return False
        try:
            if element:
                self.driver.execute_script(script, element)
            else:
                self.driver.execute_script(script)
            time.sleep(duration)  # Allow time for smooth scroll to complete
            return True
        except Exception as e:
            self._print_debug(f"Error during scroll: {e}")
            return False

    def scroll_to_element(self, element=None, selector=None, by="css", timeout=None, align="center"):
        if not element and selector:
            element = self.find_element(selector, by, timeout)
        if element:
            self._print_debug(f"Scrolling to element (selector: {selector}) with align: {align}")
            try:
                # 'block' can be 'start', 'center', 'end', or 'nearest'.
                # 'inline' is similar for horizontal. For simplicity, map align to 'block'.
                align_js = "{ behavior: 'smooth', block: 'center', inline: 'nearest' }"
                if align == "top":
                    align_js = "{ behavior: 'smooth', block: 'start', inline: 'nearest' }"
                elif align == "bottom":
                    align_js = "{ behavior: 'smooth', block: 'end', inline: 'nearest' }"
                self.driver.execute_script(f"arguments[0].scrollIntoView({align_js});", element)
                self._perform_delay()  # Give it a moment to scroll
                return True
            except Exception as e:
                self._print_debug(f"Error scrolling to element: {e}")
                return False
        self._print_debug("Element not provided or not found for scroll_to_element.")
        return False

    def scroll_to_bottom(self, element=None, steps=10, delay=0.5):
        self._print_debug("Scrolling to bottom...")
        if element:
            last_height_script = "return arguments[0].scrollHeight"
            scroll_script = "arguments[0].scrollTop = arguments[0].scrollHeight;"
        else:
            last_height_script = "return document.body.scrollHeight"
            # window.scrollTo works in standards mode, where assigning
            # document.body.scrollTop is ignored by many browsers
            scroll_script = "window.scrollTo(0, document.body.scrollHeight);"
        try:
            for _ in range(steps):
                if element:
                    last_height = self.driver.execute_script(last_height_script, element)
                    self.driver.execute_script(scroll_script, element)
                else:
                    last_height = self.driver.execute_script(last_height_script)
                    self.driver.execute_script(scroll_script)
                time.sleep(delay)
                if element:
                    new_height = self.driver.execute_script(last_height_script, element)
                else:
                    new_height = self.driver.execute_script(last_height_script)
                if new_height == last_height:  # Reached bottom or no more content loaded
                    break
            self._print_debug("Scrolled to bottom (or no more content loaded).")
            return True
        except Exception as e:
            self._print_debug(f"Error scrolling to bottom: {e}")
            return False

    # --- Pagination Methods ---
    def next_page(self, selector=None, method="click", url_template=None, page_param="page", next_page_func=None):
        self._print_debug(f"Attempting to go to next page using method: {method}")
        if method == "click":
            if not selector:
                self._print_debug("Selector for next page button is required for 'click' method.")
                return False
            next_button = self.wait_for_element(selector, condition="clickable")
            if next_button:
                return self.click(element=next_button)
            else:
                self._print_debug("Next page button not found or not clickable.")
                return False
        elif method == "url":
            if not url_template:
                self._print_debug("URL template is required for 'url' method.")
                return False
            self.current_page_num += 1
            next_url = url_template.replace(f"{{{page_param}}}", str(self.current_page_num))
            return self.open_url(next_url)
        elif method == "function":
            if not callable(next_page_func):
                self._print_debug("A callable function is required for 'function' method.")
                return False
            try:
                return next_page_func(self)  # Pass scraper instance to the custom function
            except Exception as e:
                self._print_debug(f"Custom next_page_func failed: {e}")
                return False
        else:
            self._print_debug(f"Invalid pagination method: {method}")
            return False

    def has_next_page(self, selector=None, check_func=None):
        self._print_debug("Checking for next page...")
        if callable(check_func):
            try:
                return check_func(self)
            except Exception as e:
                self._print_debug(f"Custom check_func for has_next_page failed: {e}")
                return False
        elif selector:
            # Check if element is present and, often, if it's not disabled
            element = self.find_element(selector)
            if element:
                is_disabled = element.get_attribute("disabled")
                class_attr = element.get_attribute("class")
                # Common patterns for disabled buttons
                if is_disabled or (class_attr and ("disabled" in class_attr or "inactive" in class_attr)):
                    self._print_debug("Next page element found but appears disabled.")
                    return False
                return True
            return False
        self._print_debug("No selector or check_func provided for has_next_page.")
        return False  # Default to no next page if insufficient info

    def set_page(self, page_num, url_template=None, page_param="page"):
        if not url_template:
            self._print_debug("URL template is required for set_page.")
            return False
        self._print_debug(f"Setting page to: {page_num}")
        self.current_page_num = page_num
        target_url = url_template.replace(f"{{{page_param}}}", str(page_num))
        return self.open_url(target_url)

    # --- Data Extraction Methods ---
    def get_text(self, element=None, selector=None, by="css", timeout=None):
        if not element and selector:
            element = self.find_element(selector, by, timeout)
        if element:
            try:
                text = element.text
                self._print_debug(f"Extracted text: '{text[:50]}...' from element (selector: {selector})")
                return text
            except Exception as e:
                self._print_debug(f"Error getting text: {e}")
                return None
        self._print_debug("Element not provided or not found for get_text.")
        return None

    def get_attribute(self, attribute, element=None, selector=None, by="css", timeout=None):
        if not element and selector:
            element = self.find_element(selector, by, timeout)
        if element:
            try:
                value = element.get_attribute(attribute)
                self._print_debug(f"Extracted attribute '{attribute}': '{value}' from element (selector: {selector})")
                return value
            except Exception as e:
                self._print_debug(f"Error getting attribute '{attribute}': {e}")
                return None
        self._print_debug("Element not provided or not found for get_attribute.")
        return None

    def extract_data(self, template):
        """Extracts data based on a template.
        Template format: {"field_name": "css_selector" or ("css_selector", "attribute_name") or callable}.
        If callable, it receives the scraper instance.
        To extract multiple items (e.g., a list), the selector should point to the parent of those items,
        and the callable should handle finding and processing sub-elements.
        For simplicity here, we assume template values are selectors for single items,
        or callables for custom logic.
        """
        self._print_debug(f"Extracting data with template: {template}")
        extracted_data = {}
        for field_name, rule in template.items():
            value = None
            try:
                if isinstance(rule, str):  # Simple CSS selector for text
                    value = self.get_text(selector=rule)
                elif isinstance(rule, tuple) and len(rule) == 2:  # (selector, attribute)
                    value = self.get_attribute(selector=rule[0], attribute=rule[1])
                elif callable(rule):  # Custom extraction function
                    value = rule(self)  # Pass scraper instance
                else:
                    self._print_debug(f"Invalid rule for field '{field_name}': {rule}")
                extracted_data[field_name] = value
            except Exception as e:
                self._print_debug(f"Error extracting field '{field_name}' with rule '{rule}': {e}")
                extracted_data[field_name] = None
        return extracted_data

    # --- DevTools Methods (Limited by standard Selenium) ---
    def start_capturing_network(self):
        """Clears previously captured network requests (from JS).
        Actual continuous network capture requires selenium-wire or the browser's DevTools Protocol.
        """
        self._print_debug("Starting network capture (clearing previous JS logs).")
        self._network_requests_raw = []
        # Note: This doesn't actively "start" a capture process in the browser's network panel.
        # It just prepares our internal list for new entries gathered by get_captured_requests.

    def stop_capturing_network(self):
        """Conceptually stops. With the JS method, new calls to get_captured_requests
        will include data up to this point, but nothing explicitly 'stops' in the browser.
        """
        self._print_debug("Stopping network capture (conceptual for JS method).")
        # No direct action for JS based capture, it's always available.

    def get_captured_requests(self, filter_type=None, url_pattern=None):
        """Gets network requests using the JavaScript Performance API. This is a snapshot.
        filter_type: e.g., 'script', 'img', 'css', 'xmlhttprequest', 'fetch'
        url_pattern: Regex string to filter URLs.
        """
        self._print_debug("Getting captured network requests via JavaScript Performance API.")
        try:
            # Get all resource timing entries
            current_entries = self.driver.execute_script(
                "return window.performance.getEntriesByType('resource');")
            if isinstance(current_entries, list):
                self._network_requests_raw.extend(current_entries)  # Append new ones
            # Deduplicate based on 'name' (URL) and 'startTime' to keep it somewhat manageable
            seen = set()
            deduplicated_requests = []
            for entry in sorted(self._network_requests_raw, key=lambda x: x.get('startTime', 0)):
                identifier = (entry.get('name'), entry.get('startTime'))
                if identifier not in seen:
                    deduplicated_requests.append(entry)
                    seen.add(identifier)
            self._network_requests_raw = deduplicated_requests
            filtered_requests = []
            for req in self._network_requests_raw:
                # req is a dictionary like:
                # {'name': url, 'entryType': 'resource', 'startTime': 123.45, 'duration': 67.89,
                #  'initiatorType': 'script'/'img'/'css'/'link'/'xmlhttprequest', etc.}
                if filter_type:
                    # initiatorType is more reliable for filtering than entryType (always 'resource')
                    initiator = req.get('initiatorType', '').lower()
                    if filter_type.lower() == "xhr":  # Common alias
                        if initiator != 'xmlhttprequest':
                            continue
                    elif filter_type.lower() not in initiator:
                        continue
                if url_pattern:
                    if not re.search(url_pattern, req.get('name', '')):
                        continue
                filtered_requests.append(req)
            self._print_debug(f"Found {len(filtered_requests)} filtered network requests.")
            return filtered_requests
        except WebDriverException as e:
            self._print_debug(f"Error getting network requests via JS: {e}")
            return []

    def add_request_interceptor(self, pattern, handler_func):
        """NOTE: True request interception is NOT reliably possible with standard Selenium.
        This requires tools like SeleniumWire or direct DevTools Protocol interaction,
        which are more complex to set up and manage.
        This method is a placeholder to acknowledge the design spec.
        """
        self._print_debug("WARNING: add_request_interceptor is not implemented with standard Selenium. "
                          "Consider using SeleniumWire for this functionality.")
        # To make it "runnable" without error, return a dummy ID
        return f"dummy_interceptor_id_{pattern}"

    # --- Auxiliary Methods ---
    def wait_for_element(self, selector, by="css", timeout=None, condition="visible"):
        wait_timeout = timeout if timeout is not None else self.timeout
        self._print_debug(f"Waiting for element by {by}: '{selector}' to be {condition} (timeout: {wait_timeout}s)")
        try:
            wait = WebDriverWait(self.driver, wait_timeout)
            sel_by = self._get_selenium_by(by)
            if condition == "visible":
                element = wait.until(EC.visibility_of_element_located((sel_by, selector)))
            elif condition == "present":
                element = wait.until(EC.presence_of_element_located((sel_by, selector)))
            elif condition == "clickable":
                element = wait.until(EC.element_to_be_clickable((sel_by, selector)))
            elif condition == "invisible":
                # We want the element if it exists and is invisible, or None otherwise.
                # Wait for presence, then check visibility.
                present_element = wait.until(EC.presence_of_element_located((sel_by, selector)))
                if not present_element.is_displayed():
                    element = present_element
                else:  # Element is present AND visible, so condition "invisible" is false
                    raise TimeoutException(f"Element '{selector}' was visible, not invisible.")
            elif condition == "not_present":
                # This doesn't return the element. We signal success by returning True
                # or failure by returning None after timeout.
                if wait.until(EC.invisibility_of_element_located((sel_by, selector))):  # Waits for staleness or non-presence
                    self._print_debug(f"Element by {by}: '{selector}' confirmed not present or invisible.")
                    return True  # Indicates success for this condition, though no element is returned
                else:  # Should not happen if invisibility_of_element_located works as expected
                    return None
            else:
                raise ValueError(f"Unsupported condition: {condition}")
            return element
        except TimeoutException:
            self._print_debug(f"Element by {by}: '{selector}' did not meet condition '{condition}' within {wait_timeout}s.")
            return None
        except Exception as e:
            self._print_debug(f"Error waiting for element '{selector}' condition '{condition}': {e}")
            return None

    def execute_script(self, script, *args):
        self._print_debug(f"Executing script: {script[:100]}...")
        try:
            return self.driver.execute_script(script, *args)
        except WebDriverException as e:
            self._print_debug(f"Error executing script: {e}")
            return None

    def set_delay(self, min_delay, max_delay=None):
        self._print_debug(f"Setting delay: min={min_delay}, max={max_delay}")
        self._min_delay = min_delay
        self._max_delay = max_delay if max_delay is not None else min_delay

    def take_screenshot(self, path=None):
        self._print_debug(f"Taking screenshot. Path: {path if path else 'Return as PNG data'}")
        try:
            if path:
                return self.driver.save_screenshot(path)  # Returns True on success
            else:
                return self.driver.get_screenshot_as_png()  # Returns binary data
        except WebDriverException as e:
            self._print_debug(f"Error taking screenshot: {e}")
            return None if path is None else False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

11. Example Application

python

from WebScraper import WebScraper  # assumes the class above is saved as WebScraper.py
import time
import json


def main():
    # Initialize a WebScraper instance (non-headless so we can watch it work)
    scraper = WebScraper(
        browser_type="chrome",
        headless=False,
        timeout=15,
        debug=True
    )
    try:
        # 1. Open the Baidu home page
        baidu_url = "https://www.baidu.com"
        print("Opening the Baidu home page...")
        if not scraper.open_url(baidu_url):
            print("Failed to open the Baidu home page!")
            return

        # 2. Enter a search keyword and run the search
        search_keyword = "人工智能发展趋势"  # change to any search keyword
        print(f"Searching for: {search_keyword}")
        # Locate the search box and type the keyword
        search_input = scraper.find_element(selector="#kw", by="css")
        if not search_input:
            print("Search box element not found!")
            return
        scraper.type_text(text=search_keyword, element=search_input)
        # Click the search button
        if not scraper.click(selector="#su", by="css"):
            print("Failed to click the search button!")
            return
        # Wait for the results to load
        time.sleep(2)
        print("Loading search results...")

        # 3. Scroll to the bottom of the page
        print("Scrolling to the bottom of the page...")
        if scraper.scroll_to_bottom(steps=10, delay=0.1):
            print("Reached the bottom of the page")
        else:
            print("Failed to scroll to the bottom")

        # 4. Extract the result titles from the current page
        print("Extracting search results from the current page...")
        result_titles = scraper.find_elements(selector="h3.t a", by="css")
        if result_titles:
            print(f"Found {len(result_titles)} result titles:")
            for i, title in enumerate(result_titles, 1):
                title_text = title.text
                print(f"{i}. {title_text}")
        else:
            print("No result titles found")

        # 5. Page through more results (pages 2-3 as a demo)
        for page in range(2, 4):
            print(f"\nTurning to page {page}...")
            # Method 1: use next_page to click the next-page button
            next_button_selector = ".n"  # CSS selector for Baidu's next-page button
            if scraper.next_page(selector=next_button_selector, method="click"):
                print(f"Now on page {page}, waiting for it to load...")
                time.sleep(2)
                # Scroll to the bottom of the new page
                scraper.scroll_to_bottom(steps=10, delay=0.3)
                time.sleep(1)
                # Extract the new page's results
                result_titles = scraper.find_elements(selector="h3.t a", by="css")
                if result_titles:
                    print(f"Page {page}: found {len(result_titles)} result titles:")
                    for i, title in enumerate(result_titles, 1):
                        title_text = title.text
                        print(f"{i}. {title_text}")
                else:
                    print(f"Page {page}: no result titles found")
            else:
                print(f"Failed to turn to page {page}; this may be the last page")
                break

        # 6. Extract structured data with extract_data
        print("\nExtracting structured data with extract_data:")
        data_template = {
            # Literal values and the current URL must go through callables:
            # extract_data treats a bare string as a CSS selector
            "search_keyword": lambda scraper: search_keyword,
            "current_time": lambda scraper: time.strftime("%Y-%m-%d %H:%M:%S"),
            "page_title": "title",
            "current_url": lambda scraper: scraper.driver.current_url,
            "result_count": lambda scraper: len(scraper.find_elements("h3.t a"))
        }
        extracted_data = scraper.extract_data(data_template)
        for key, value in extracted_data.items():
            print(f"{key}: {value}")

        # 7. Save the data to a JSON file
        with open("baidu_search_results.json", "w", encoding="utf-8") as f:
            json.dump(extracted_data, f, ensure_ascii=False, indent=2)
        print("\nData saved to baidu_search_results.json")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        # 8. Close the browser
        scraper.close()
        print("Browser closed")


if __name__ == "__main__":
    main()

相關(guān)文章:

  • 網(wǎng)站如何做搜狗搜索引擎百度下載安裝免費
  • 網(wǎng)站建設(shè)公司 南京谷歌seo優(yōu)化中文章
  • 河南省建筑一體化平臺管理系統(tǒng)seo技術(shù)經(jīng)理
  • php動態(tài)網(wǎng)站開發(fā)期末考試網(wǎng)絡(luò)營銷公司名字大全
  • 大渡口網(wǎng)站建設(shè)哪家好武漢網(wǎng)站優(yōu)化公司
  • 免費申請香港網(wǎng)站公司推廣咨詢
  • 吉林市網(wǎng)站制作廣州網(wǎng)站制作服務(wù)
  • 電商網(wǎng)站開發(fā)商營銷型網(wǎng)站重要特點是
  • 個人怎么注冊小微企業(yè)搜索引擎優(yōu)化面對哪些困境
  • 學(xué)校網(wǎng)站的英文奇葩網(wǎng)站100個
  • 網(wǎng)站模板設(shè)計開發(fā)站長工具四葉草
  • wordpress怎么上傳logo百度關(guān)鍵詞優(yōu)化工具
  • 石家莊專業(yè)建站公司網(wǎng)頁版百度云
  • 代理服務(wù)器地址seo搜索引擎優(yōu)化服務(wù)
  • 騰訊云做網(wǎng)站需要報備百度人工服務(wù)熱線24小時
  • 專業(yè)網(wǎng)站設(shè)計哪家好可以做產(chǎn)品推廣的軟件有哪些
  • 代銷網(wǎng)站源碼seo優(yōu)化技術(shù)是什么
  • wordpress半透明主題關(guān)鍵詞優(yōu)化排名軟件
  • 禁止wordpress獲取隱私新手怎么入行seo
  • 武漢網(wǎng)站推廣霸屏寧波正規(guī)優(yōu)化seo軟件
  • 網(wǎng)站域名備案在哪里職業(yè)培訓(xùn)機構(gòu)需要什么資質(zhì)
  • 谷搜易外貿(mào)網(wǎng)站建設(shè)站長之家關(guān)鍵詞挖掘工具
  • 如何利用源代碼做網(wǎng)站今日頭條最新
  • 網(wǎng)站建設(shè)需求 百度文庫三十個知識點帶你學(xué)黨章
  • 盛澤做網(wǎng)站的微信賣貨小程序怎么做
  • 設(shè)計廣告圖用什么軟件好用有利于seo優(yōu)化的是
  • 網(wǎng)站的站內(nèi)結(jié)構(gòu)錨文本是如何做的seo指什么
  • 做電影網(wǎng)站模板教學(xué)設(shè)計免費注冊個人網(wǎng)站不花錢
  • java網(wǎng)站開發(fā)是干什么安徽網(wǎng)絡(luò)關(guān)鍵詞優(yōu)化
  • 工信局網(wǎng)站備案查詢溫州seo網(wǎng)站推廣