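In an earlier post I described how to extract Wikipedia infoboxes with BeautifulSoup; a Spider can fetch site content in the same way. Having recently learned Selenium + PhantomJS, I am now using them to collect the infoboxes (InfoBox) of tourist attractions on Baidu Baike. This is also the preliminary corpus preparation for the entity alignment and attribute alignment work in my graduation project. I hope the article is useful to you.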
Source code
# coding=utf-8
"""
Created on 2015-09-04 @author: Eastmount
"""

import time
import re
import os
import sys
import codecs
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.action_chains import ActionChains

#Open PhantomJS (raw string so the backslashes in the Windows path stay literal)
driver = webdriver.PhantomJS(executable_path=r"G:\phantomjs-1.9.1-windows\phantomjs.exe")
#driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver,10)
info = None #global variable: handle of the output file

#Get the infobox of 5A tourist spots
def getInfobox(name):
    global info
    try:
        #create paths and txt files
        basePathDirectory = "Tourist_spots_5A"
        if not os.path.exists(basePathDirectory):
            os.makedirs(basePathDirectory)
        baiduFile = os.path.join(basePathDirectory,"BaiduSpider.txt")
        if not os.path.exists(baiduFile):
            info = codecs.open(baiduFile,'w','utf-8')
        else:
            info = codecs.open(baiduFile,'a','utf-8')

        #locate input  notice: 1.visit url by unicode 2.write files
        print name.rstrip('\n') #delete char '\n'
        driver.get("http://baike.baidu.com/")
        elem_inp = driver.find_element_by_xpath("//form[@id='searchForm']/input")
        elem_inp.send_keys(name)
        elem_inp.send_keys(Keys.RETURN)
        #codecs.open works in binary mode, so '\n' is not translated; write '\r\n' explicitly
        info.write(name.rstrip('\n')+'\r\n')
        time.sleep(2)
        print driver.current_url
        print driver.title

        #load infobox basic-info cmn-clearfix
        elem_name = driver.find_elements_by_xpath("//div[@class='basic-info cmn-clearfix']/dl/dt")
        elem_value = driver.find_elements_by_xpath("//div[@class='basic-info cmn-clearfix']/dl/dd")
        for e in elem_name:
            print e.text
        for e in elem_value:
            print e.text

        #create dictionary key-value
        #A dictionary is a hash table: entries are placed by hash and the
        #original order is not recorded; use tuples if order matters
        elem_dic = dict(zip(elem_name,elem_value))
        for key in elem_dic:
            print key.text,elem_dic[key].text
            info.writelines(key.text+" "+elem_dic[key].text+'\r\n')
        time.sleep(5)

    except Exception,e: #'utf8' codec can't decode byte
        print "Error: ",e
    finally:
        print '\n'
        if info is not None: #the file may not have been opened if an early step failed
            info.write('\r\n')

#Main function
def main():
    global info
    #By function get information
    source = open("Tourist_spots_5A_BD.txt",'r')
    for name in source:
        name = unicode(name,"utf-8")
        if u'故宮' in name: #else add a '?'
            name = u'北京故宮'
        getInfobox(name)
    print 'End Read Files!'
    source.close()
    if info is not None:
        info.close()
    driver.close()

main()
Results
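The script reads the names of the national 5A scenic spots from a txt file on the F drive, then drives the Phantomjs.exe browser to visit each one in turn and extract its InfoBox values. If you run into the encoding error “'ascii' codec can't encode characters”, you can set the interpreter's default encoding to utf-8 with the following code: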
#Set the default encoding to utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
#Show the current default encoding
print sys.getdefaultencoding()
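To see where this error comes from, here is a minimal sketch. It is written so that it also runs on Python 3, and the sample string and the helper name `try_encode` are illustrative assumptions, not part of the original script. It shows that the ascii codec cannot represent Chinese text, while utf-8 can:

```python
# -*- coding: utf-8 -*-
# Illustrative text only; any non-ASCII string behaves the same way.
text = u"北京故宮"

def try_encode(s, encoding):
    """Return the encoded bytes, or None if the codec cannot represent s."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError:
        return None

ascii_bytes = try_encode(text, "ascii")  # ascii cannot represent Chinese characters
utf8_bytes = try_encode(text, "utf-8")   # utf-8 can represent any Unicode text

print(ascii_bytes)
print(utf8_bytes.decode("utf-8") == text)
```

Encoding explicitly at the point where text leaves the program, as in the second call, is an alternative to changing the process-wide default.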
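Corresponding page source
The corresponding Baidu Baike InfoBox source is shown in the figure below. As the XPath expressions in the code indicate, the InfoBox is a div with class 'basic-info cmn-clearfix' whose dl lists hold each attribute name in a dt element and its value in a dd element. For the basics used in the code, see my earlier posts or my Python crawler column. Selenium is not only good at automated testing; it is also well suited to simple crawlers.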
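The dt/dd pairing can be tried without a browser. The sketch below uses the standard-library xml.etree.ElementTree, which supports a limited XPath subset, in place of Selenium. The HTML fragment and its values are hypothetical and only mimic the structure the script's XPath expressions expect; the real Baidu Baike markup may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the structure the XPath expressions expect.
SAMPLE = """
<html><body>
<div class="basic-info cmn-clearfix">
  <dl>
    <dt>Name</dt><dd>Example Spot</dd>
    <dt>Level</dt><dd>5A</dd>
  </dl>
  <dl>
    <dt>Location</dt><dd>Example City</dd>
  </dl>
</div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())
box = ".//div[@class='basic-info cmn-clearfix']"

# Attribute names live in <dt>, attribute values in <dd>, both in document order.
names = [dt.text for dt in root.findall(box + "/dl/dt")]
values = [dd.text for dd in root.findall(box + "/dl/dd")]

for name, value in zip(names, values):
    print(name + ": " + value)
```

Because both lists come back in document order, pairing them position by position reproduces the attribute/value rows of the box.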
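Encoding issues
At this point you may still run into the “'ascii' codec can't encode characters” error.
It occurs because text written to a txt file opened in the default way is encoded as ASCII, whereas your text actually needs 'utf-8', so the file has to be opened with an explicit encoding as follows.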
import os
import codecs
#Use the open method provided by codecs to specify the encoding of the file;
#when reading, it converts the content to internal unicode automatically
if not os.path.exists(baiduFile):
    info = codecs.open(baiduFile,'w','utf-8')
else:
    info = codecs.open(baiduFile,'a','utf-8')
#codecs.open works in binary mode and does not translate newlines, so write '\r\n' explicitly
info.writelines(key.text+":"+elem_dic[key].text+'\r\n')
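The same idea can be sketched in a form that also runs on Python 3. Note the substitutions: this version uses io.open instead of the codecs.open call in the post, and the helper `append_line`, the temporary directory and the sample lines are assumptions made for the demonstration. Append mode creates the file when it is missing, so the existence check is not strictly required here.

```python
import io
import os
import tempfile

def append_line(path, line):
    """Append one line as UTF-8, creating the file if it does not exist."""
    # newline='' disables newline translation, so '\r\n' is written exactly as given.
    with io.open(path, 'a', encoding='utf-8', newline='') as f:
        f.write(line + u'\r\n')

# Hypothetical output location used only for this demonstration.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "BaiduSpider.txt")

append_line(demo_file, u"Name: 北京故宮")  # first call creates the file
append_line(demo_file, u"Level: 5A")       # second call appends to it

# Read the raw bytes back to confirm the encoding and the line endings.
with io.open(demo_file, 'rb') as f:
    raw = f.read()
print(raw.decode('utf-8').split(u'\r\n'))
```

Opening the file inside a with block also closes it after each write, which avoids holding a handle open across many pages.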
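Summary
From this code you can learn a basic approach to automated crawling, as well as how to display key-value pairs with a for loop; the pairs correspond to the attributes and attribute values shown in the InfoBox. They are built with the following code:
elem_dic = dict(zip(elem_name,elem_value))
However, the final output is not in InfoBox order. Why? As the comment in the source code notes, a dictionary is a hash table: entries are placed by hash and the original insertion order is not recorded. To preserve the order, iterate over the zipped pairs (tuples) directly instead of going through the dictionary.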
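One way to keep the InfoBox order is sketched below. `FakeElement` is a hypothetical stand-in for the Selenium elements (only their .text attribute is used), and the sample values are invented, so the sketch runs without a browser.

```python
from collections import OrderedDict

class FakeElement(object):
    """Hypothetical stand-in for a Selenium WebElement; only .text is used."""
    def __init__(self, text):
        self.text = text

elem_name = [FakeElement(t) for t in (u"Name", u"Location", u"Level")]
elem_value = [FakeElement(t) for t in (u"Example Spot", u"Example City", u"5A")]

# Option 1: iterate over the pairs directly; zip preserves the list order.
lines = [k.text + u" " + v.text for k, v in zip(elem_name, elem_value)]

# Option 2: keep dictionary-style lookup with an order-preserving mapping.
elem_dic = OrderedDict((k.text, v.text) for k, v in zip(elem_name, elem_value))

for line in lines:
    print(line)
```

Option 1 is enough when the pairs are only printed or written out; Option 2 helps when values also need to be looked up by attribute name later.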
Finally, I hope this article helps you. There is also an introductory article on the basics.