
news 2025/7/7 15:40:09


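Preface:

A search engine built on the Boost library

Why Boost?

From a technical standpoint: the project uses many Boost interfaces.
From the standpoint of what the engine stores: the indexed content is the Boost library documentation.

Expected result: the user types a keyword into the browser, and the browser shows the matching results.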

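STEP1: Import the Boost data onto the server

The Boost documentation serves as the server's data source, so the relevant Boost files have to be copied onto the server first.

1. Import the data source onto the server

The HTML files shipped with Boost are used as the data source.

Data source URL: Index of main/release/1.78.0/source

Download the archive from the Boost website, then upload it to the Linux machine with the rz command.

2. Extract the archive

Run tar xzf <archive name> to obtain the extracted directory.

That directory contains a great deal of material.

Only the contents of the html folder under doc are used as the data source (that is where the HTML files live).

Create the directory data/input to hold the files from doc/html:

mkdir -p data/input

cp -r boost_1_78_0/doc/html/* data/input/

The data source is now ready.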

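STEP2: Data processing module

Before any processing, note that the data source currently lives in files on disk. To use it, it has to be loaded into memory, so the first step is to collect the paths of those files.

1. Collect the file paths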

#include <boost/filesystem.hpp>

// src_path = "data/input": directory that holds the HTML files
// files_list: output container for the collected file paths
bool enumfile(const std::string &src_path, std::vector<std::string> *files_list)
{
    // Boost.Filesystem is used because older C++ standards have no portable filesystem API
    namespace fs = boost::filesystem;
    // path is the class that wraps path handling
    fs::path root_path(src_path);
    // Check that the path exists
    if (!fs::exists(root_path))
    {
        std::cerr << "file not exists" << std::endl;
        return false;
    }
    // A default-constructed recursive_directory_iterator acts as the end iterator
    fs::recursive_directory_iterator end;
    // Walk the tree recursively and filter the entries
    for (fs::recursive_directory_iterator iter(root_path); iter != end; iter++)
    {
        // Skip anything that is not a regular file (directories, for example)
        if (!fs::is_regular_file(*iter))
            continue;
        // Skip files whose extension is not .html (images, for example)
        if (iter->path().extension() != ".html")
            continue;
        // An HTML file: store its path in the container
        files_list->push_back(iter->path().string());
    }
    return true;
}

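Tips:

boost::filesystem::path

filesystem is a Boost module that provides many components for working with files.

path is a class in that module with many path-related member functions,

for example path::extension(), which returns the file extension.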
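On C++17 or later, the same walk can also be written with the standard library alone. The sketch below swaps boost::filesystem for std::filesystem; the name enum_html_files is made up for illustration and is not part of the project.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: collect every regular *.html file under src_path.
// Returns false when src_path is not an existing directory.
bool enum_html_files(const std::string &src_path, std::vector<std::string> *files_list)
{
    assert(files_list != nullptr);
    const fs::path root_path(src_path);
    if (!fs::is_directory(root_path))
        return false;

    for (const fs::directory_entry &entry : fs::recursive_directory_iterator(root_path))
    {
        // Skip directories and other non-regular entries
        if (!entry.is_regular_file())
            continue;
        // Keep only files whose extension is exactly ".html"
        if (entry.path().extension() != ".html")
            continue;
        files_list->push_back(entry.path().string());
    }
    return true;
}
```

Note that the extension check is case-sensitive, so a file named INDEX.HTML would be skipped by this sketch.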

2. Process the file contents

With the file paths collected, the file I/O helpers can open each file and process its contents: extract the title, the content, and the URL.

// files_list: container holding the file paths
// results: output container for the parsed documents
// ns_util::fileutil::readfile: helper that reads a file's contents
// docinfo_t is defined as follows:
typedef struct docinfo
{
    std::string title;
    std::string content;
    std::string url;
} docinfo_t;

static bool parsehtml(const std::vector<std::string> &files_list, std::vector<docinfo_t> *results)
{
    // file is a local file path
    for (const std::string &file : files_list)
    {
        std::string result;
        // 1. Read the file
        ns_util::fileutil::readfile(file, &result);
        docinfo_t doc;
        // 2. Parse the title
        if (!parsertitle(result, &doc.title))
        {
            continue;
        }
        // 3. Parse the content
        if (!parsercontent(result, &doc.content))
        {
            continue;
        }
        // 4. Build the URL
        if (!parserurl(file, &doc.url))
        {
            continue;
        }
        // Store the parsed document; moving it avoids copying the strings
        results->push_back(std::move(doc));
    }
    return true;
}
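parsehtml calls ns_util::fileutil::readfile, which this article does not show. A minimal sketch of such a helper might look like the following; the free function read_file and its bool return value are assumptions, not the project's actual code.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical stand-in for ns_util::fileutil::readfile:
// read the whole file at file_path into *out.
// Returns false if the file cannot be opened.
bool read_file(const std::string &file_path, std::string *out)
{
    assert(out != nullptr);
    std::ifstream in(file_path, std::ios::in | std::ios::binary);
    if (!in.is_open())
        return false;

    std::ostringstream buffer;
    buffer << in.rdbuf(); // copy the entire stream in one go
    *out = buffer.str();
    return true;
}
```

Returning a bool would let the caller skip unreadable files instead of parsing an empty string.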

Extracting the title

In an HTML file, the title starts with <title> and ends with </title>.

For example:

In HTML source, the text between <title> and </title> is the title.

The code follows directly from this:

// file: the contents of the file
// title: receives the extracted title
static bool parsertitle(const std::string &file, std::string *title)
{
    size_t begin = file.find("<title>"); // position of <title>
    if (begin == std::string::npos)
    {
        return false;
    }
    size_t end = file.find("</title>"); // position of </title>
    if (end == std::string::npos)
    {
        return false;
    }
    if (begin > end)
    {
        return false;
    }
    begin += std::string("<title>").size();
    *title = file.substr(begin, end - begin); // cut out the title text
    return true;
}

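Extracting the content

Content starts after a '>' and ends at the next '<': in <>xxxx<>xxxx, each xxxx is content.

For example, in HTML source the text that sits between tags is the content.

Note, however, that a '>' is not always followed by content. For example:

<a name="xpressive.legal"></a><p>

Here each '>' is followed directly by '<', which opens another HTML tag.

Whatever follows a '>' and is not a '<' is content.

The code below follows these rules: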

// file: the contents of the file
// content: receives the extracted content
static bool parsercontent(const std::string &file, std::string *content)
{
    // Strip the tags with a two-state scan
    enum status
    {
        Label,
        Content
    };
    enum status s = Label;
    for (char c : file) // read the file character by character
    {
        switch (s)
        {
        case Label: // inside a tag
            if (c == '>') // the tag ends here; content may follow
                s = Content;
            break;
        case Content:
            if (c == '<') // a new tag starts
                s = Label;
            else // a content character
            {
                // Replace '\n' with a space, because '\n' later serves
                // as the separator between documents in the output file
                if (c == '\n')
                    c = ' ';
                content->push_back(c);
            }
            break;
        default:
            break;
        }
    }
    return true;
}
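To see the two-state scan at work on a small input, the sketch below applies the same idea to an in-memory string. strip_tags and strip_tags_demo are hypothetical names; unlike the function above, this version returns the text instead of writing through an out-parameter.

```cpp
#include <cassert>
#include <string>

// Hypothetical demo of the tag-stripping state machine.
// Like the function above, it starts in the "inside a tag" state.
std::string strip_tags(const std::string &html)
{
    enum class State { Tag, Text };
    State s = State::Tag;
    std::string text;
    for (char c : html)
    {
        if (s == State::Tag)
        {
            if (c == '>') // the tag closes; text may follow
                s = State::Text;
        }
        else if (c == '<') // a new tag opens
        {
            s = State::Tag;
        }
        else // a text character; newlines become spaces
        {
            text.push_back(c == '\n' ? ' ' : c);
        }
    }
    return text;
}

// Usage: the tags vanish and the newline inside the text becomes a space.
void strip_tags_demo()
{
    assert(strip_tags("<a name=\"xpressive.legal\"></a><p>Legal\nnotice</p>") == "Legal notice");
}
```

Because the scan starts in the tag state, any text that precedes the first '>' is dropped.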

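Extracting the URL

More precisely, the URL is assembled rather than extracted: as planned, the results page should show the URL of the page that contains the match.

All of the data comes from https://www.boost.org/doc/libs/1_78_0/doc/html/

The paths stored in the container have the form data/input/<file name>.

So the URL is built as https://www.boost.org/doc/libs/1_78_0/doc/html/ + <file name>.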

// file_path: the local file path
// url: receives the assembled URL
// src_path: "data/input", defined outside this function
static bool parserurl(const std::string &file_path, std::string *url)
{
    std::string url_head = "https://www.boost.org/doc/libs/1_78_0/doc/html";
    // Drop the local prefix and keep "/<file name>"
    std::string url_tail = file_path.substr(src_path.size());
    *url = url_head + url_tail;
    return true;
}
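parserurl relies on a src_path variable defined outside the function. A self-contained variant, shown here only as a sketch, takes both prefixes as parameters and rejects paths that do not start with the source directory. make_url and its parameter names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical variant of parserurl: swap the local prefix src_path
// for the online prefix url_head. Returns false if file_path does
// not start with src_path.
bool make_url(const std::string &file_path, const std::string &src_path,
              const std::string &url_head, std::string *url)
{
    assert(url != nullptr);
    // compare() also fails when file_path is shorter than src_path
    if (file_path.compare(0, src_path.size(), src_path) != 0)
        return false;
    *url = url_head + file_path.substr(src_path.size());
    return true;
}
```

With src_path set to data/input and url_head set to https://www.boost.org/doc/libs/1_78_0/doc/html, a path such as data/input/array.html would map to https://www.boost.org/doc/libs/1_78_0/doc/html/array.html.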

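3. Save the processed contents

Each file's contents are now stored in a vector<docinfo_t>. Next, every docinfo_t is written out in a fixed format and saved to disk for later use.

Why use a fixed format? It makes the fields easy to extract again, as the index-building step shows.

How is it formatted? One special character separates title, content, and url within a document, and another separates one document from the next.

Each element of vector<docinfo_t> is formatted as

title\3content\3url\n (one complete document)

and written to data/raw_html/raw.txt.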

// results: container holding the parsed documents
// output: path of the file to write
bool savehtml(const std::vector<docinfo_t> &results, const std::string &output)
{
#define sep '\3' // title\3content\3url\n
    // Write in binary mode
    std::ofstream out(output, std::ios::out | std::ios::binary);
    if (!out.is_open())
    {
        std::cerr << "open " << output << " failed!" << std::endl;
        return false;
    }
    for (const docinfo_t &item : results) // one record per document
    {
        std::string out_string;
        out_string = item.title;
        out_string += sep;          // title\3
        out_string += item.content; // title\3content
        out_string += sep;          // title\3content\3
        out_string += item.url;     // title\3content\3url
        out_string += '\n';         // title\3content\3url\n
        out.write(out_string.c_str(), out_string.size());
    }
    out.close();
    return true;
}

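STEP3: Index module

What is an index? It is the lookup rule the search engine uses.

For example, typing "hello world" into a browser brings up a large number of pages; the index is what leads from "hello world" to those pages.

There are two kinds of index:

Forward index: finds a document's contents from its document id. It is backed by a vector<docinfo_t> in which the subscript is the document id and the element is the document.
Inverted index: finds document ids from a keyword, and then the document contents through those ids. It is built from the weight each keyword carries in each document.

What gets indexed? The formatted data source stored on disk.

Building the forward index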

// line: one record read from the formatted file
// out: container for the split fields
// forwardindex: the forward index, of type vector<docinfo>
struct docinfo
{
    std::string title;
    std::string content;
    std::string url;
    uint64_t docid;
};
// ... file reading omitted ...
docinfo *bulidforwardindex(const std::string &line)
{
    // Split line into title, content, and url
    std::vector<std::string> out;
    const std::string sep = "\3"; // fields are separated by '\3'
    ns_util::stringutil::split(line, &out, sep); // string-splitting helper
    if (out.size() != 3)
    {
        return nullptr;
    }
    docinfo doc;
    doc.title = out[0];
    doc.content = out[1];
    doc.url = out[2];
    doc.docid = forwardindex.size(); // the id is the subscript in forwardindex
    forwardindex.push_back(std::move(doc));
    // Return the new entry so the inverted index can be built from it.
    // The pointer can be invalidated by a later push_back.
    return &forwardindex.back();
}
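The splitting helper ns_util::stringutil::split is not shown in this article. The sketch below is one way to write such a helper with the standard library only; split_fields is a hypothetical name, and the choice to keep empty fields is an assumption about the desired behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for ns_util::stringutil::split:
// cut line at every occurrence of sep and append the pieces to *out.
// Empty fields are kept, so "a\3\3b" yields three pieces.
void split_fields(const std::string &line, std::vector<std::string> *out,
                  const std::string &sep)
{
    assert(out != nullptr && !sep.empty());
    std::size_t begin = 0;
    while (true)
    {
        const std::size_t pos = line.find(sep, begin);
        if (pos == std::string::npos)
        {
            out->push_back(line.substr(begin)); // last field
            break;
        }
        out->push_back(line.substr(begin, pos - begin));
        begin = pos + sep.size();
    }
}
```

Keeping empty fields means a record with a missing title still splits into three pieces, so the out.size() != 3 check only rejects records with the wrong number of separators.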

Building the inverted index

// wordmap: unordered_map<string, wordcnt>; counts how often each token appears in the title and in the content
// invertedindex: unordered_map<string, invertedlist>; maps a keyword to the pages that contain it
// ilist: invertedlist, where typedef vector<invertedelem> invertedlist
// item: invertedelem
struct invertedelem
{
    uint64_t docid;
    std::string word;
    int weight;
    invertedelem() : docid(0), weight(0) {}
};

bool buildinvertedindex(const docinfo &doc)
{
    struct wordcnt // occurrences of one token in the title and in the content
    {
        int titlecnt;   // occurrences in the title
        int contentcnt; // occurrences in the content
        wordcnt() : titlecnt(0), contentcnt(0) {}
    };
    std::string title = doc.title;     // the full title
    std::string content = doc.content; // the full content

    // Tokenize the title with jieba
    std::vector<std::string> titlecut;
    ns_util::jiebautil::cutstring(title, &titlecut);
    std::unordered_map<std::string, wordcnt> wordmap;
    for (auto &s : titlecut)
    {
        boost::to_lower(s);    // case-insensitive
        wordmap[s].titlecnt++; // count the token in the title
    }

    // Tokenize the content with jieba
    std::vector<std::string> contentcut;
    ns_util::jiebautil::cutstring(content, &contentcut);
    for (auto &s : contentcut)
    {
        boost::to_lower(s);
        wordmap[s].contentcnt++; // count the token in the content
    }

    // word -> (docid, word, weight)
#define X 10
#define Y 1
    // Build the inverted index, one entry per token
    for (auto &wmap : wordmap)
    {
        invertedelem item;
        item.docid = doc.docid;
        item.word = wmap.first;
        // Weight of the token in this page: 10 per title hit, 1 per content hit
        item.weight = X * wmap.second.titlecnt + Y * wmap.second.contentcnt;
        // Link the token to this page
        invertedlist &ilist = invertedindex[wmap.first];
        ilist.push_back(std::move(item));
    }
    return true;
}
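The weighting rule is easier to check on a toy example. The sketch below builds a tiny inverted index with the same rule (10 per title hit, 1 per content hit) but uses whitespace splitting as a stand-in for jieba. Posting, ToyIndex, tokenize, and add_document are all hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

struct Posting // plays the role of invertedelem
{
    std::uint64_t docid;
    int weight;
};

using ToyIndex = std::unordered_map<std::string, std::vector<Posting>>;

// Lowercase s and split it on whitespace (a stand-in for jieba).
std::vector<std::string> tokenize(std::string s)
{
    for (char &c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    std::istringstream in(s);
    std::vector<std::string> words;
    for (std::string w; in >> w;)
        words.push_back(w);
    return words;
}

// Add one document: weight = 10 * title hits + 1 * content hits.
void add_document(ToyIndex *index, std::uint64_t docid,
                  const std::string &title, const std::string &content)
{
    assert(index != nullptr);
    std::unordered_map<std::string, int> weight;
    for (const std::string &w : tokenize(title))
        weight[w] += 10;
    for (const std::string &w : tokenize(content))
        weight[w] += 1;
    // One posting per distinct token in this document
    for (const auto &entry : weight)
        (*index)[entry.first].push_back(Posting{docid, entry.second});
}
```

For a document titled "Boost Filesystem" with the content "the filesystem library and boost", the tokens boost and filesystem each get weight 11, while library gets weight 1.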

Now that forwardindex and invertedindex each hold their data according to their own rules, the upper layer needs two interfaces to read from them:

// Lookup in the forward index
// docid: id of the document to fetch
docinfo *getforwardindex(uint64_t docid)
{
    if (docid >= forwardindex.size())
    {
        std::cerr << "no expected doc" << std::endl;
        return nullptr;
    }
    return &forwardindex[docid];
}

// Lookup in the inverted index
// word: the keyword used as the key
invertedlist *getinvertedlist(const std::string &word)
{
    auto iter = invertedindex.find(word);
    if (invertedindex.end() == iter)
    {
        return nullptr;
    }
    return &(iter->second);
}

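STEP4: Server-side module

The server side receives the keywords sent by the client, tokenizes them, looks them up in the index built earlier, and returns the matching pages.

Workflow:

InitSearcher: initialization

Obtain the singleton and build the index (file --> memory).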

// Index: pointer to the index object
// input: defined outside this function
void initsearcher()
{
    Index = ns_index::index::getinstance(); // obtain the singleton
    if (Index == nullptr)
    {
        std::cerr << "getinstance fail" << std::endl;
        exit(1);
    }
    std::cout << "get instance success" << std::endl;
    Index->buildindex(input); // build the index
}
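The article does not show how getinstance is implemented. One common way to write a singleton accessor in C++11 and later is a function-local static, sketched below. The class index_sketch is hypothetical, and the project's own implementation may differ (a mutex-guarded pointer is another frequent choice).

```cpp
#include <cassert>

// Hypothetical class showing one common way to write getinstance():
// a function-local static, which C++11 initializes exactly once even
// when several threads call getinstance() at the same time.
class index_sketch
{
public:
    static index_sketch *getinstance()
    {
        static index_sketch instance;
        return &instance;
    }

    // A singleton must not be copied
    index_sketch(const index_sketch &) = delete;
    index_sketch &operator=(const index_sketch &) = delete;

private:
    index_sketch() = default;
};

// Every call hands back the same object.
void singleton_demo()
{
    assert(index_sketch::getinstance() == index_sketch::getinstance());
}
```

With this form getinstance never returns nullptr, so a null check like the one in initsearcher would matter only for pointer-based implementations.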

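Search: the search itself

Tokenize the user's keywords, look each token up in the inverted index, and collect the results in inverted_list_all.

Sort the results by weight.

Fetch the matching documents by document id.

To resemble a real results page, build a summary of each document's content.

Serialize the results with Json and return them to the client.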

// query: the keywords entered by the user
// jsonstring: receives the result to return
void search(const std::string &query, std::string *jsonstring)
{
    // Tokenize the query
    std::vector<std::string> cutwords;
    ns_util::jiebautil::cutstring(query, &cutwords);

    ns_index::invertedlist inverted_list_all;
    for (auto word : cutwords)
    {
        boost::to_lower(word);
        ns_index::invertedlist *inverted_list = Index->getinvertedlist(word);
        if (inverted_list == nullptr)
        {
            std::cerr << "get inverted_list err" << std::endl;
            continue;
        }
        inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());
    }

    // Sort by relevance, highest weight first
    std::sort(inverted_list_all.begin(), inverted_list_all.end(),
              [](const ns_index::invertedelem &e1, const ns_index::invertedelem &e2)
              { return e1.weight > e2.weight; });

    // Serialize the results with Json and hand them back to the caller
    Json::Value root;
    for (auto &item : inverted_list_all)
    {
        ns_index::docinfo *doc = Index->getforwardindex(item.docid);
        if (nullptr == doc)
        {
            std::cerr << "get doc fail" << std::endl;
            continue;
        }
        Json::Value elem;
        elem["title"] = doc->title;
        elem["desc"] = getdesc(doc->content, item.word); // summary of the content
        elem["url"] = doc->url;
        // for debugging
        elem["weight"] = item.weight;
        elem["docid"] = (int)doc->docid;
        root.append(elem);
    }
    Json::StyledWriter writer;
    *jsonstring = writer.write(root);
}

std::string getdesc(const std::string &content, const std::string &word)
{
    int prev_step = 50;
    int next_step = 100;
    // content.find(word) is case-sensitive, but the indexed words are lowercased
    // ("split" vs "Split"), so search case-insensitively instead
    auto iter = std::search(content.begin(), content.end(), word.begin(), word.end(),
                            [](unsigned char x, unsigned char y)
                            { return std::tolower(x) == std::tolower(y); });
    if (iter == content.end())
    {
        return "none1";
    }
    int pos = std::distance(content.begin(), iter);
    int start = 0;
    int end = content.size() - 1;
    if (pos - prev_step > start)
    {
        start = pos - prev_step;
    }
    if (pos + next_step < end)
    {
        end = pos + next_step;
    }
    if (start > end)
        return "none2";
    std::string desc = content.substr(start, end - start);
    return desc;
}
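The summary window is easy to get wrong at the edges of the text, so here is a standalone sketch of the same idea using unsigned arithmetic. make_snippet and its before/after parameters are hypothetical; unlike getdesc, it returns an empty string when the word is not found.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical variant of getdesc: return up to `before` characters ahead
// of the first case-insensitive match of word, plus up to `after`
// characters counted from the start of the match.
std::string make_snippet(const std::string &content, const std::string &word,
                         std::size_t before, std::size_t after)
{
    auto iter = std::search(content.begin(), content.end(), word.begin(), word.end(),
                            [](unsigned char x, unsigned char y)
                            { return std::tolower(x) == std::tolower(y); });
    if (iter == content.end())
        return "";

    const std::size_t pos = static_cast<std::size_t>(std::distance(content.begin(), iter));
    // Clamp the window start at the beginning of the text
    const std::size_t start = pos > before ? pos - before : 0;
    // substr clamps the length at the end of the string
    return content.substr(start, (pos - start) + after);
}
```

For instance, make_snippet("The Boost Filesystem library", "filesystem", 6, 10) yields "Boost Filesystem": six characters before the match and ten from the match onward.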

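The server-side module mostly just calls the index module's interfaces: the server side is the upper layer and the index module is the lower layer.

For clarity, the figure below outlines the relationship between the two.

Server-index relationship diagram: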

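STEP5: HTTP service module

The HTTP service module sits at the application layer, the topmost layer of the server. It starts the server, handles the socket work (creating, binding, listening, and accepting connections), receives client requests, and returns the server's results.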

#include "searcher.hpp"
#include "cpphttplib/httplib.h"

const std::string root_path = "./wwwroot";

int main()
{
    ns_searcher::searcher sear;
    sear.initsearcher();

    httplib::Server svr;
    svr.set_base_dir(root_path.c_str());
    svr.Get("/s", [&sear](const httplib::Request &req, httplib::Response &rsp)
            {
                // The keyword arrives as the "word" query parameter
                if (!req.has_param("word"))
                {
                    rsp.set_content("A search keyword is required!", "text/plain; charset=utf-8");
                    return;
                }
                std::string word = req.get_param_value("word");
                std::string json_string;
                sear.search(word, &json_string);
                rsp.set_content(json_string, "application/json");
            });
    svr.listen("0.0.0.0", 18088);
    return 0;
}

STEP6: Add logging to the server

Logging makes it possible to monitor the server's state and simplifies managing the server side.
