news 2025/7/14 16:09:06

1. Installation

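Beautiful Soup is an HTML/XML parsing library whose main job is to parse HTML/XML documents and extract data from them.
lxml can traverse a document piece by piece, whereas Beautiful Soup works on the HTML DOM: it loads the whole document and builds the complete tree before you can query it. Its time and memory overhead is therefore considerably higher, and it is slower than using lxml directly.
In exchange, Beautiful Soup makes parsing HTML simple and has a very friendly API. It supports CSS selectors, can use the HTML parser from the Python standard library, and can also use lxml's HTML and XML parsers.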

pip install beautifulsoup4
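
A quick way to confirm that the installation works is to parse a short string. The following is a minimal sketch that uses the standard-library "html.parser" backend, so no extra parser package is needed; the sample markup is invented for this check.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this check
html = "<html><head><title>Hello</title></head><body><p class='msg'>It works</p></body></html>"

# "html.parser" ships with Python, so only beautifulsoup4 must be installed
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string                    # text inside <title>
msg_text = soup.select_one("p.msg").get_text()    # first match of a CSS selector

print(title_text)
print(msg_text)
```

To use the "lxml" parser, as the examples later in this article do, the lxml package has to be installed separately.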

2. Usage example

from bs4 import BeautifulSoup
import requests
import asyncio
import functools
import re

house_info = []


async def get_page(page_index):
    """Fetch one page of Lianjia listings without blocking the event loop."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'}
    request = functools.partial(
        requests.get,
        f'https://sh.lianjia.com/ershoufang/pudong/pg{page_index}/',
        headers=headers,
    )
    loop = asyncio.get_running_loop()
    # requests is a blocking library, so run the call in the default thread pool
    response = await loop.run_in_executor(None, request)
    return response


def get_house_info(soup):
    """Extract the listing details with CSS selectors and append them to house_info."""
    house_info_list = soup.select('.info')  # one .info element per listing
    reg = re.compile(r'\s')  # \s already matches newlines
    for html in house_info_list:
        house_info.append({
            'title': reg.sub('', html.select('.title a')[0].get_text()),
            'house_pattern': reg.sub('', html.select('.houseInfo')[0].get_text()),
            'price': reg.sub('', html.select('.unitPrice')[0].get_text()),
            'location': reg.sub('', html.select('.positionInfo')[0].get_text()),
            'total': reg.sub('', html.select('.totalPrice')[0].get_text()),
        })


async def get_first_page():
    """Fetch the first page asynchronously, parse it, and print the listings."""
    response = await get_page(1)
    soup = BeautifulSoup(response.text, 'lxml')
    get_house_info(soup)
    print(house_info)


if __name__ == '__main__':
    asyncio.run(get_first_page())
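
The extraction step can be tried without any network access. The sketch below runs the same CSS selectors as the example above against a small static snippet. The markup in `SAMPLE_HTML`, and the helper names `clean_text` and `parse_listings`, are assumptions made for illustration; the structure of the real page may differ. Unlike indexing with `[0]`, `select_one` returns None when nothing matches, so a missing field does not raise an IndexError.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names used above
SAMPLE_HTML = """
<ul>
  <li>
    <div class="info">
      <div class="title"><a href="#">Bright 2-bedroom
          flat</a></div>
      <div class="houseInfo">2 rooms | 89 m2</div>
      <div class="positionInfo">Pudong - Lujiazui</div>
      <div class="totalPrice"><span>520</span> wan</div>
      <div class="unitPrice">58,000 yuan/m2</div>
    </div>
  </li>
</ul>
"""

WHITESPACE = re.compile(r"\s+")

# Output key -> CSS selector, relative to one .info element
FIELDS = {
    "title": ".title a",
    "house_pattern": ".houseInfo",
    "price": ".unitPrice",
    "location": ".positionInfo",
    "total": ".totalPrice",
}


def clean_text(parent, selector):
    """Return the whitespace-free text of the first match, or None if absent."""
    node = parent.select_one(selector)
    if node is None:
        return None
    return WHITESPACE.sub("", node.get_text())


def parse_listings(soup):
    """Build one dict per .info element."""
    return [
        {key: clean_text(card, selector) for key, selector in FIELDS.items()}
        for card in soup.select(".info")
    ]


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
listings = parse_listings(sample_soup)
print(listings)
```

Returning a list instead of appending to a module-level variable also makes the function easier to call once per page.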

3. Creating the soup object

soup = BeautifulSoup(markup="", features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None)

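markup: the HTML or XML document to parse, given as a string (str or bytes) or as an open file object.
features: the name or type of the parser to use, for example "html.parser", "lxml", "lxml-xml" or "html5lib". The default is None, in which case Beautiful Soup picks the best parser that happens to be installed and emits a warning, so it is safer to name the parser explicitly.
builder: a TreeBuilder instance or class to use instead of looking one up from features. The default is None; this is rarely needed, because features normally selects the builder.
parse_only: a SoupStrainer that restricts parsing to the matching parts of the document. The html5lib parser ignores this argument.
from_encoding: the character encoding of the document, as a string. The default is None, meaning the encoding is detected automatically. It only matters when markup is bytes.
exclude_encodings: a list of encodings to rule out during automatic encoding detection.
element_classes: a dictionary that maps built-in classes such as Tag and NavigableString to replacement classes used while building the tree. The default is None, meaning the built-in classes are used.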
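
Two of these parameters are easier to understand from a small example. The sketch below assumes only the standard-library parser; the sample markup is made up. It uses `parse_only` with a `SoupStrainer` to keep only `<a>` tags, and `from_encoding` to decode a byte string whose encoding is known in advance.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Sample markup invented for illustration
html = '<html><body><a href="/a">A</a><p>skip me</p><a href="/b">B</a></body></html>'

# parse_only takes a SoupStrainer; here only <a> tags end up in the tree
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)

# from_encoding is relevant for bytes input whose encoding is already known
raw = "<p>caf\u00e9</p>".encode("latin-1")
soup2 = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
text = soup2.p.get_text()
print(text)
```

Restricting the parse this way can save memory on large documents, at the cost of losing everything outside the matched tags.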

4. The soup object

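Most of the methods below are defined on Tag, which BeautifulSoup subclasses, so they can be called on the soup itself or on any element found in it. Searches return Tag objects, not new BeautifulSoup objects.
soup.prettify(encoding=None, formatter="minimal"): returns the document as an indented, human-readable string with properly closed tags (or as bytes if encoding is given).
soup.title: returns the first <title> tag in the document, or None if there is none.
soup.head: returns the <head> tag as a Tag object.
soup.body: returns the <body> tag as a Tag object.
soup.html: returns the <html> tag as a Tag object.
soup.find(name, attrs, recursive, string): returns the first element that matches, or None. name is the tag name, attrs is a dictionary of attributes, recursive controls whether all descendants or only direct children are searched, and string matches the element's text. Further keyword arguments are treated as attribute filters.
soup.find_all(name, attrs, recursive, string, limit): takes the same filters as find() but returns all matching elements as a list-like ResultSet. limit caps the number of results.
soup.select(selector): finds elements with a CSS selector (tag name, class, id, attribute and so on) and returns the matching Tag objects as a list.
soup.get_text(): returns the text of all elements in the document, joined into one string.
soup.get(attrName): returns the value of the named attribute, or None if it is missing. Call it on a Tag, for example soup.div.get('class'); the soup object itself has no attributes.
soup.find_parents(name, attrs, limit): returns a list of all ancestors of the current element that match.
soup.find_next_sibling(name, attrs, string): returns the next matching element at the same level, or None.
soup.find_previous_sibling(name, attrs, string): returns the previous matching element at the same level, or None.
soup.find_next(name, attrs, string): returns the next matching element in document order, or None.
soup.find_previous(name, attrs, string): returns the previous matching element in document order, or None.
soup.decompose(): removes the current element from the tree and destroys it together with its contents.
soup.encode(encoding="utf-8", formatter="minimal"): renders the parsed document as a byte string in the given encoding.
soup.decode(): renders the parsed document as a Unicode string (str).
soup.new_tag(name, namespace=None, attrs={}, **kwargs): creates a new Tag that belongs to this soup; it is not part of the tree until it is inserted.
soup.new_string(s): creates a new NavigableString.
soup.replace_with(replacement): replaces the current element with the given element or string.
soup.wrap(wrapper): wraps the current element in the given tag.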


from bs4 import BeautifulSoup

html_str = '<html><head><title>I am the title</title></head><body><div><div class="div1">I am div1</div><div class="div2">I am div2</div><div class="div3">I am div3</div></div></body></html>'
soup = BeautifulSoup(html_str, 'lxml')

print('title:', soup.title)
print('head:', soup.head)
print('body:', soup.body)
print('html:', soup.html)
print('find:', soup.find('div', attrs={'class': 'div1'}))
print('find_all:', soup.find_all('div'))
print('select:', soup.select('.div1'))
print('get_text:', soup.select('.div1')[0].get_text())
print('get:', soup.select('.div1')[0].get('class'))

# The navigation methods are called on an element, here the first .div1
div1 = soup.select('.div1')[0]
print('find_parents:', div1.find_parents('div'))
print('find_next_sibling:', div1.find_next_sibling())
print('find_previous_sibling:', div1.find_previous_sibling())  # None: div1 is the first child
print('find_next:', div1.find_next())
print('find_previous:', div1.find_previous())
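
The example above covers searching and navigation but not the methods that modify the tree. The following sketch, on made-up markup and with the standard-library parser, shows `decompose()`, `new_tag()`, `replace_with()` and `wrap()` working together; the URL is a placeholder.

```python
from bs4 import BeautifulSoup

# Sample markup invented for illustration
html = '<div id="box"><p class="old">old text</p><span class="ad">ad</span></div>'
soup = BeautifulSoup(html, "html.parser")

# decompose(): remove the <span> from the tree and destroy it
soup.find("span", class_="ad").decompose()

# new_tag() + replace_with(): swap the <p> for a newly created <a>
link = soup.new_tag("a", href="https://example.com")
link.string = "new link"
soup.find("p", class_="old").replace_with(link)

# wrap(): enclose the <a> in a new <li>
link.wrap(soup.new_tag("li"))

result = str(soup)
print(result)
# <div id="box"><li><a href="https://example.com">new link</a></li></div>
```

Note that these calls are made on individual elements; only `new_tag()` is called on the soup object, because the new tag has to be associated with a document.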
