

news 2025/7/7 14:14:57


Preface

Author's homepage: 雪碧有白泡泡
Personal website: 雪碧's personal website

Contents

Preface
Regular expressions
Running the conversion
Book giveaway

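Preface

In an age of information explosion, the vast amount of text on the internet is like an endless beach. Yet the truly valuable information is often buried in all kinds of web pages and has to be filtered and organized before it can be put to good use. Fortunately, Python, a powerful programming language, can help us with this task.

This article shows how to use Python to convert web page text to Markdown, which makes web content easier to read and process. Whether you save an article as a local file or convert it to another format, Markdown provides clean, simple layout and formatting so that we can focus on the content itself.

Regular expressions

To keep the Markdown conversion of a page accurate, we can rewrite the HTML with regular expressions, as follows: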

import re

__all__ = ['Tomd', 'convert']

# (prefix, suffix) Markdown wrappers for each supported HTML tag.
MARKDOWN = {
    'h1': ('\n# ', '\n'),
    'h2': ('\n## ', '\n'),
    'h3': ('\n### ', '\n'),
    'h4': ('\n#### ', '\n'),
    'h5': ('\n##### ', '\n'),
    'h6': ('\n###### ', '\n'),
    'code': ('`', '`'),
    'ul': ('', ''),
    'ol': ('', ''),
    'li': ('- ', ''),
    'blockquote': ('\n> ', '\n'),
    'em': ('**', '**'),
    'strong': ('**', '**'),
    'block_code': ('\n```\n', '\n```\n'),
    'span': ('', ''),
    'p': ('\n', '\n'),
    'p_with_out_class': ('\n', '\n'),
    'inline_p': ('', ''),
    'inline_p_with_out_class': ('', ''),
    'b': ('**', '**'),
    'i': ('*', '*'),
    'del': ('~~', '~~'),
    'hr': ('\n---', '\n\n'),
    'thead': ('\n', '|------\n'),
    'tbody': ('\n', '\n'),
    'td': ('|', ''),
    'th': ('|', ''),
    'tr': ('', '\n')
}

# Patterns for block-level elements (raw strings so that \s reaches the regex engine).
BlOCK_ELEMENTS = {
    'h1': r'<h1.*?>(.*?)</h1>',
    'h2': r'<h2.*?>(.*?)</h2>',
    'h3': r'<h3.*?>(.*?)</h3>',
    'h4': r'<h4.*?>(.*?)</h4>',
    'h5': r'<h5.*?>(.*?)</h5>',
    'h6': r'<h6.*?>(.*?)</h6>',
    'hr': r'<hr/>',
    'blockquote': r'<blockquote.*?>(.*?)</blockquote>',
    'ul': r'<ul.*?>(.*?)</ul>',
    'ol': r'<ol.*?>(.*?)</ol>',
    'block_code': r'<pre.*?><code.*?>(.*?)</code></pre>',
    'p': r'<p\s.*?>(.*?)</p>',
    'p_with_out_class': r'<p>(.*?)</p>',
    'thead': r'<thead.*?>(.*?)</thead>',
    'tr': r'<tr>(.*?)</tr>'
}

# Patterns for inline elements, applied inside each block's content.
INLINE_ELEMENTS = {
    'td': r'<td>(.*?)</td>',
    'tr': r'<tr>(.*?)</tr>',
    'th': r'<th>(.*?)</th>',
    'b': r'<b>(.*?)</b>',
    'i': r'<i>(.*?)</i>',
    'del': r'<del>(.*?)</del>',
    'inline_p': r'<p\s.*?>(.*?)</p>',
    'inline_p_with_out_class': r'<p>(.*?)</p>',
    'code': r'<code.*?>(.*?)</code>',
    'span': r'<span.*?>(.*?)</span>',
    'ul': r'<ul.*?>(.*?)</ul>',
    'ol': r'<ol.*?>(.*?)</ol>',
    'li': r'<li.*?>(.*?)</li>',
    'img': r'<img.*?src="(.*?)".*?>(.*?)</img>',
    'a': r'<a.*?href="(.*?)".*?>(.*?)</a>',
    'em': r'<em.*?>(.*?)</em>',
    'strong': r'<strong.*?>(.*?)</strong>'
}

# Leftover tags that are stripped from the final Markdown.
DELETE_ELEMENTS = ['<span.*?>', '</span>', '<div.*?>', '</div>']


class Element:
    def __init__(self, start_pos, end_pos, content, tag, is_block=False):
        self.start_pos = start_pos
        self.end_pos = end_pos
        self.content = content
        self._elements = []
        self.is_block = is_block
        self.tag = tag
        self._result = None

        if self.is_block:
            self.parse_inline()

    def __str__(self):
        wrapper = MARKDOWN.get(self.tag)
        self._result = '{}{}{}'.format(wrapper[0], self.content, wrapper[1])
        return self._result

    def parse_inline(self):
        # Rewrite every inline tag found in this block's content.
        for tag, pattern in INLINE_ELEMENTS.items():
            if tag == 'a':
                self.content = re.sub(pattern, r'[\g<2>](\g<1>)', self.content)
            elif tag == 'img':
                self.content = re.sub(pattern, r'![\g<2>](\g<1>)', self.content)
            elif self.tag == 'ul' and tag == 'li':
                self.content = re.sub(pattern, r'- \g<1>', self.content)
            elif self.tag == 'ol' and tag == 'li':
                self.content = re.sub(pattern, r'1. \g<1>', self.content)
            elif self.tag == 'thead' and tag == 'tr':
                self.content = re.sub(pattern, '\\g<1>\n', self.content.replace('\n', ''))
            elif self.tag == 'tr' and tag == 'th':
                self.content = re.sub(pattern, r'|\g<1>', self.content.replace('\n', ''))
            elif self.tag == 'tr' and tag == 'td':
                self.content = re.sub(pattern, r'|\g<1>', self.content.replace('\n', ''))
            else:
                wrapper = MARKDOWN.get(tag)
                self.content = re.sub(pattern, r'{}\g<1>{}'.format(wrapper[0], wrapper[1]), self.content)


class Tomd:
    def __init__(self, html='', options=None):
        self.html = html
        self.options = options
        self._markdown = ''

    def convert(self, html, options=None):
        elements = []
        for tag, pattern in BlOCK_ELEMENTS.items():
            for m in re.finditer(pattern, html, re.I | re.S | re.M):
                element = Element(start_pos=m.start(),
                                  end_pos=m.end(),
                                  content=''.join(m.groups()),
                                  tag=tag,
                                  is_block=True)
                can_append = True
                # Iterate over a copy: the list is modified inside the loop.
                for e in elements[:]:
                    if e.start_pos < m.start() and e.end_pos > m.end():
                        # The new match is nested inside an existing element.
                        can_append = False
                    elif e.start_pos > m.start() and e.end_pos < m.end():
                        # An existing element is nested inside the new match.
                        elements.remove(e)
                if can_append:
                    elements.append(element)

        # Emit the blocks in document order.
        elements.sort(key=lambda element: element.start_pos)
        self._markdown = ''.join([str(e) for e in elements])

        for pattern in DELETE_ELEMENTS:
            self._markdown = re.sub(pattern, '', self._markdown)
        return self._markdown

    @property
    def markdown(self):
        self.convert(self.html, self.options)
        return self._markdown


_inst = Tomd()
convert = _inst.convert

This code is a utility class for converting HTML to Markdown. It uses regular expressions to parse HTML tags and converts them to the corresponding Markdown according to predefined rules.

The code defines an Element class that represents an individual element in the HTML. It stores the element's start position, end position, content, and tag type, and it provides a parse_inline method that parses inline elements and converts them to Markdown.
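As a rough illustration of the inline step, the sketch below applies a handful of the same substitution patterns with `re.sub` and back-references. The helper name `inline_to_markdown` and the reduced rule table `INLINE_RULES` are assumptions made for this example; they are not part of the module above.

```python
import re

# Reduced, illustrative rule table: (pattern, replacement template).
INLINE_RULES = [
    (r'<a.*?href="(.*?)".*?>(.*?)</a>', r'[\g<2>](\g<1>)'),  # link: [text](url)
    (r'<b>(.*?)</b>', r'**\g<1>**'),                          # bold
    (r'<i>(.*?)</i>', r'*\g<1>*'),                            # italic
    (r'<code.*?>(.*?)</code>', r'`\g<1>`'),                   # inline code
]


def inline_to_markdown(fragment: str) -> str:
    """Apply each rule in turn; later rules see the output of earlier ones."""
    for pattern, template in INLINE_RULES:
        fragment = re.sub(pattern, template, fragment)
    return fragment


print(inline_to_markdown('See <a href="https://example.com">the <b>docs</b></a>.'))
```

Because the rules run one after another, the link rule first produces `[the <b>docs</b>](...)` and the bold rule then rewrites the tag left inside the link text.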

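Tomd is the main conversion class. It takes an HTML string and provides a convert method that performs the conversion. convert iterates over the predefined HTML tag patterns and uses regular expressions to match the corresponding parts of the HTML string. For each match it creates an Element object, skips matches nested inside a block that was already found, and orders the remaining elements by their position in the document. Finally, it returns the resulting Markdown string.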
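The block-matching idea can be sketched in isolation as follows. This is a simplified stand-in, not the Tomd implementation: `find_blocks` and the three-entry `BLOCK_PATTERNS` table are hypothetical names, and the overlap check is done in one pass after all matches are collected.

```python
import re

# Illustrative subset of block patterns.
BLOCK_PATTERNS = {
    'h1': r'<h1.*?>(.*?)</h1>',
    'blockquote': r'<blockquote.*?>(.*?)</blockquote>',
    'p': r'<p>(.*?)</p>',
}


def find_blocks(html):
    """Return (start, end, tag, content) for outermost matches, in document order."""
    spans = []
    for tag, pattern in BLOCK_PATTERNS.items():
        for m in re.finditer(pattern, html, re.I | re.S):
            spans.append((m.start(), m.end(), tag, m.group(1)))
    # Drop any span that lies strictly inside another span.
    outer = [
        s for s in spans
        if not any(o[0] < s[0] and o[1] > s[1] for o in spans)
    ]
    return sorted(outer, key=lambda s: s[0])


html = '<p>intro</p><blockquote><p>quoted</p></blockquote><h1>Title</h1>'
for start, end, tag, content in find_blocks(html):
    # Inner markup such as the nested <p> is left for the inline pass.
    print(tag, repr(content))
```

Sorting by start offset is what lets the output follow the order of the page even though the patterns are tried tag by tag.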

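At the top of the module, the MARKDOWN dictionary defines the Markdown format for each HTML tag. The BlOCK_ELEMENTS and INLINE_ELEMENTS dictionaries define the regular-expression patterns used to match block-level and inline elements in the HTML string. The DELETE_ELEMENTS list defines the HTML elements that should be deleted.

Now that we have an HTML-to-Markdown tool, we can use it to convert web pages.

Running the conversion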

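First, the result_file function builds the path of the file in which a result is saved. It takes the user name (used as a subfolder name), the file name, and the folder name as parameters, creates a new file under the specified folder path, and returns the path of that file.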
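A minimal sketch of the same idea with `pathlib`, assuming the base directory is passed in explicitly rather than derived from `__file__`. The function name `make_result_path` and its parameter order are hypothetical.

```python
import tempfile
from pathlib import Path


def make_result_path(base_dir, folder_name, username, file_name):
    """Create base_dir/folder_name/username if needed and return the file path inside it."""
    folder = Path(base_dir) / folder_name / username
    folder.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    path = folder / file_name
    path.touch(exist_ok=True)                  # create an empty file on first use
    return path


# Demonstrated in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    readme = make_result_path(tmp, "articles", "some_user", "README.md")
    print(readme.exists(), readme.name)
```

Passing `exist_ok=True` avoids the separate existence check and the broad `try/except` around directory creation.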

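The get_headers function reads cookies from a text file and stores them in a dictionary, which the crawler then sends as its request headers. It takes the path of the text file containing the cookies as its parameter.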
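The parsing step might look like the sketch below. It assumes the file holds one `Name: value` pair per line, which is an assumption about the file format. Splitting only on the first colon keeps values such as URLs intact. `parse_header_lines` is a hypothetical helper that works on a list of lines so it can be tried without a file.

```python
def parse_header_lines(lines):
    """Turn 'Name: value' lines into a dict, splitting only on the first colon."""
    headers = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers


sample = [
    "User-Agent: Mozilla/5.0\n",
    "Cookie: uuid=abc123; token=xyz\n",
    "Referer: https://blog.csdn.net/\n",
]
print(parse_header_lines(sample))
```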

The delete_ele function removes the specified tags from a BeautifulSoup object. It takes a BeautifulSoup object and a list of selectors to delete, uses the object's select method to find the matching tags, and removes each one with decompose.

The delete_ele_attr function removes the specified attributes from the tags in a BeautifulSoup object. It takes a BeautifulSoup object and a list of attribute names, uses find_all to select every tag, and removes each listed attribute with Python's del statement.

The delete_blank_ele function removes empty tags from a BeautifulSoup object. It takes a BeautifulSoup object and a list of exceptions. Any tag that is not in the exception list and has empty content is removed with decompose.

The TaskQueue class is a simple task queue that stores visited and unvisited URLs in two lists and provides a set of methods for working with them.
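Since several worker threads share the queue, a lock-protected variant is one possible alternative. The sketch below is hypothetical and differs from TaskQueue in two ways that are design choices of this example: it hands out URLs in first-in, first-out order, and it returns `None` when nothing is left.

```python
import threading
from collections import deque


class SafeTaskQueue:
    """Hypothetical thread-safe variant: FIFO of pending URLs plus a record of visited ones."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()
        self._seen = set()   # every URL ever queued, for duplicate checks
        self.visited = []    # URLs already handed out, in order

    def put(self, url):
        """Queue a URL once; return False if it was queued before."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            self._pending.append(url)
            return True

    def pop(self):
        """Return the next pending URL, or None when the queue is empty."""
        with self._lock:
            if not self._pending:
                return None
            url = self._pending.popleft()
            self.visited.append(url)
            return url

    def __len__(self):
        with self._lock:
            return len(self._pending)


q = SafeTaskQueue()
q.put("https://blog.csdn.net/a")
q.put("https://blog.csdn.net/a")  # duplicate, ignored
q.put("https://blog.csdn.net/b")
print(len(q), q.pop(), q.visited)
```

Checking for emptiness and removing the item under the same lock avoids the gap between the two steps that separate list operations would leave.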

import contextlib
import os
import re
import sys
import threading
import time

import requests
from bs4 import BeautifulSoup, Comment

# Assumption: the converter from the previous section is saved as tomd.py.
from tomd import Tomd


def result_file(folder_username, file_name, folder_name):
    """Return <script dir>/../<folder_name>/<folder_username>/<file_name>, creating it if missing."""
    folder = os.path.join(os.path.dirname(os.path.realpath(__file__)), "..", folder_name, folder_username)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, file_name)
    if not os.path.exists(path):
        open(path, "w").close()
    return path


def get_headers(cookie_path: str):
    """Read 'name: value' lines from a text file into a dict."""
    cookies = {}
    with open(cookie_path, "r", encoding="utf-8") as f:
        for line in f:
            if ":" not in line:
                continue
            # Split on the first colon only, so values may contain colons.
            name, value = line.split(":", 1)
            cookies[name.strip()] = value.strip()
    return cookies


def delete_ele(soup: BeautifulSoup, tags: list):
    """Remove every element matching the given CSS selectors."""
    for ele in tags:
        for useless_tag in soup.select(ele):
            useless_tag.decompose()


def delete_ele_attr(soup: BeautifulSoup, attrs: list):
    """Remove the given attributes from every tag."""
    for attr in attrs:
        for useless_attr in soup.find_all():
            del useless_attr[attr]


def delete_blank_ele(soup: BeautifulSoup, eles_except: list):
    """Remove tags with no text, except those listed in eles_except."""
    for useless_attr in soup.find_all():
        try:
            if useless_attr.name not in eles_except and useless_attr.text == "":
                useless_attr.decompose()
        except Exception:
            pass


class TaskQueue(object):
    def __init__(self):
        self.VisitedList = []
        self.UnVisitedList = []

    def getVisitedList(self):
        return self.VisitedList

    def getUnVisitedList(self):
        return self.UnVisitedList

    def InsertVisitedList(self, url):
        if url not in self.VisitedList:
            self.VisitedList.append(url)

    def InsertUnVisitedList(self, url):
        if url not in self.UnVisitedList:
            self.UnVisitedList.append(url)

    def RemoveVisitedList(self, url):
        self.VisitedList.remove(url)

    def PopUnVisitedList(self, index=0):
        # Returns an empty list when nothing is left.
        url = []
        if index and self.UnVisitedList:
            url = self.UnVisitedList[index]
            del self.UnVisitedList[:index]
        elif self.UnVisitedList:
            url = self.UnVisitedList.pop()
        return url

    def getUnVisitedListLength(self):
        return len(self.UnVisitedList)


class CSDN(object):
    def __init__(self, username, folder_name, cookie_path):
        # self.headers = {
        #     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
        # }
        self.headers = get_headers(cookie_path)
        self.s = requests.Session()
        self.username = username
        self.TaskQueue = TaskQueue()
        self.folder_name = folder_name
        self.url_num = 1

    def start(self):
        """Walk the paginated article list and queue every [title, href] pair."""
        num = 0
        articles = [None]
        while len(articles) > 0:
            num += 1
            url = 'https://blog.csdn.net/' + self.username + '/article/list/' + str(num)
            response = self.s.get(url=url, headers=self.headers)
            html = response.text
            soup = BeautifulSoup(html, "html.parser")
            articles = soup.find_all('div', attrs={"class": "article-item-box csdn-tracking-statistics"})
            for article in articles:
                article_title = article.a.text.strip().replace('        ', '：')
                article_href = article.a['href']
                with ensure_memory(sys.getsizeof(self.TaskQueue.UnVisitedList)):
                    self.TaskQueue.InsertUnVisitedList([article_title, article_href])

    def get_md(self, url):
        """Download one article and return its body as Markdown."""
        response = self.s.get(url=url, headers=self.headers)
        html = response.text
        soup = BeautifulSoup(html, 'lxml')  # requires the lxml package
        content = soup.select_one("#content_views")
        # Remove comments
        for useless_tag in content.find_all(string=lambda text: isinstance(text, Comment)):
            useless_tag.extract()
        # Remove useless tags
        tags = ["svg", "ul", ".hljs-button.signin"]
        delete_ele(content, tags)
        # Remove tag attributes
        attrs = ["class", "name", "id", "onclick", "style", "data-token", "rel"]
        delete_ele_attr(content, attrs)
        # Remove empty tags
        eles_except = ["img", "br", "hr"]
        delete_blank_ele(content, eles_except)
        # Convert to Markdown
        md = Tomd(str(content)).markdown
        return md

    def write_readme(self):
        """Write README.md with a numbered index of all queued articles."""
        print("+" * 100)
        print("[++] Start crawling posts by {} ......".format(self.username))
        print("+" * 100)
        readme_path = result_file(self.username, file_name="README.md", folder_name=self.folder_name)
        with open(readme_path, 'w', encoding='utf-8') as readme_file:
            readme_head = "# Posts by " + self.username + "\n"
            readme_file.write(readme_head)
            for [article_title, article_href] in self.TaskQueue.UnVisitedList[::-1]:
                text = str(self.url_num) + '. [' + article_title + '](' + article_href + ')\n'
                readme_file.write(text)
                self.url_num += 1
        self.url_num = 1

    def get_all_articles(self):
        """Worker loop: pop articles until the queue is empty."""
        while True:
            try:
                task = self.TaskQueue.PopUnVisitedList()
            except IndexError:
                break  # another worker took the last item
            if not task:
                break  # queue drained
            article_title, article_href = task
            try:
                # Replace characters that are not allowed in file names.
                file_name = re.sub(r'[\\/:：*?"<>|]', '-', article_title) + ".md"
                article_path = result_file(folder_username=self.username, file_name=file_name, folder_name=self.folder_name)
                md_head = "# " + article_title + "\n"
                md = md_head + self.get_md(article_href)
                print("[++++] Processing URL: {}".format(article_href))
                with open(article_path, "w", encoding="utf-8") as article_file:
                    article_file.write(md)
            except Exception:
                print("[----] Error while processing URL: {}".format(article_href))
            self.url_num += 1

    def muti_spider(self, thread_num):
        # Each worker loops until the queue is empty, so one batch of threads is enough.
        thread_list = [threading.Thread(target=self.get_all_articles) for _ in range(thread_num)]
        for th in thread_list:
            th.start()
        for th in thread_list:
            th.join()


lock = threading.Lock()
total_mem = 1024 * 1024 * 500  # 500MB spare memory


@contextlib.contextmanager
def ensure_memory(size):
    """Reserve `size` bytes from the shared budget, waiting until enough is free."""
    global total_mem
    while 1:
        with lock:
            if total_mem > size:
                total_mem -= size
                break
        time.sleep(5)
    try:
        yield
    finally:
        # Give the reservation back even if the body raised.
        with lock:
            total_mem += size


def spider_user(username: str, cookie_path: str, thread_num: int = 10, folder_name: str = "articles"):
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)
    csdn = CSDN(username, folder_name, cookie_path)
    csdn.start()
    # Write the index first: the workers empty the queue that write_readme reads.
    csdn.write_readme()
    csdn.muti_spider(thread_num)


def spider(usernames: list, cookie_path: str, thread_num: int = 10, folder_name: str = "articles"):
    for username in usernames:
        try:
            user_thread = threading.Thread(target=spider_user, args=(username, cookie_path, thread_num, folder_name))
            user_thread.start()
            print("[++] Started crawl thread for posts by {} ......".format(username))
        except Exception:
            print("[--] Failed to start crawl thread for posts by {} ......".format(username))