
news 2025/7/6 23:55:38


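Project background

First, what is a search engine? Simply put, it is something like Baidu, which we use every day: we type in what we want to find, and Baidu returns related content. What does Baidu usually return? Let's take a look.

How a search engine works

Here is a brief outline of how our search engine works.

We send a request to the server, for example a search for the keyword "boost". After receiving the request, the server searches its own resources, builds a response from the results, and sends it back to us.

The Boost library

Boost is a battle-tested, portable, open-source C++ library that serves as a complement to the standard library. It is very powerful, but it has one small shortcoming: its documentation does not support search. If we want to look up a function on the cplusplus site, for example, search is available there.

The Boost documentation does not offer this, and we do not know whether it will in the future.

Project goal

The goal of the project is simple: we add a search feature for Boost. As mentioned above, the server needs resources to search, and there are two ways to get them:

Search other web resources: this requires a crawler and takes some technical skill
Download Boost and search the resources locally

We use the second approach and download the Boost library.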

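High-level flow of the Boost search engine

Data cleaning

After downloading Boost, we process every file whose extension is html, that is, we clean the data. We first look at a simple html file, then save its title, content, and url.

Building the index

We build an index from the cleaned fields so that later lookups are fast. There are many details here, which we cover later.

Handling requests

We process the request and fetch the results through the index. Because there can be many results, we sort them by weight before sending them to the client.

Front-end page

We use front-end techniques to present the returned results, and with that the project is complete.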

Tech stack and environment

Tech stack

Back end: C/C++, C++11, STL, the Boost libraries, Jsoncpp, cppjieba, cpp-httplib
Front end: HTML5, CSS, JS, jQuery, Ajax

Environment

CentOS 7 virtual machine, vim, gcc (g++), Makefile, VS Code

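Understanding indexes

What is an index? The idea is simple: we number the files, and a number identifies exactly one file. That is the basic principle. Indexes come in two kinds, forward and inverted.

Forward index: find a file by its number; the result is unique
Inverted index: find file IDs by keyword

This may still sound abstract, so here is an example with two files.

Forward index

We number each file.

Document ID	Document name	Document content
1	Document A	Hello, I am a college student
2	Document B	Hello, I am a working adult

The forward index is straightforward: given a document ID, we get the document content directly.

Inverted index

We tokenize every document and collect the distinct words. Under each distinct word we hang the IDs of the documents that contain it.

Keyword	Document ID
hello	1, 2
I	1, 2
am	1, 2
college student	1
working adult	2

The inverted index takes a keyword and gives us the document IDs.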
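The two tables above can be sketched in a few lines of C++. This is only an illustration under assumptions: `MiniDoc`, `MakeForwardIndex`, and `MakeInvertedIndex` are hypothetical names, the words are tokenized by hand and lowercased, and IDs start at 1 to match the tables (the real index later uses 0-based vector positions).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical miniature document type, for illustration only.
struct MiniDoc
{
    std::string name;
    std::vector<std::string> words; // tokenized by hand
};

// Forward index: vector position + 1 is the document ID.
typedef std::vector<MiniDoc> ForwardIndex;
// Inverted index: keyword -> IDs of the documents that contain it.
typedef std::unordered_map<std::string, std::vector<int> > InvertedIndex;

ForwardIndex MakeForwardIndex()
{
    ForwardIndex docs;
    docs.push_back(MiniDoc{"Document A", {"hello", "i", "am", "college student"}});
    docs.push_back(MiniDoc{"Document B", {"hello", "i", "am", "working adult"}});
    return docs;
}

InvertedIndex MakeInvertedIndex(const ForwardIndex &docs)
{
    InvertedIndex inverted;
    for (std::size_t i = 0; i < docs.size(); ++i)
    {
        const int doc_id = static_cast<int>(i) + 1; // IDs start at 1, as in the tables
        for (const std::string &word : docs[i].words)
        {
            std::vector<int> &ids = inverted[word];
            // Record each document at most once per keyword.
            if (ids.empty() || ids.back() != doc_id)
                ids.push_back(doc_id);
        }
    }
    return inverted;
}
```

Looking up "hello" in the resulting map yields both IDs, while "college student" yields only document 1.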

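How to tokenize

We said above that documents are tokenized. Why tokenize? To make lookups faster. How should we do it? We could split the text by hand, but someone has already written a library for this, so we simply use it. If we did split by hand, it would look like this:

Hello, I am a college student: hello / I / am / college student
Hello, I am a working adult: hello / I / am / working adult

Note that I split these casually; a real tokenizer may split differently. There is one more way to improve efficiency. In a document, words such as "了", "从", "吗", "the", and "a" often carry little meaning, so we can ignore them while tokenizing. Words like these are called stop words.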
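As a rough sketch of the stop-word idea, the filter below drops every token found in a stop-word set. `RemoveStopWords` is a hypothetical helper, not part of the project or of cppjieba, and the stop-word set is supplied by the caller.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Returns a copy of `words` with every stop word removed.
// Hypothetical helper for illustration; the stop-word set is up to the caller.
std::vector<std::string> RemoveStopWords(const std::vector<std::string> &words,
                                         const std::unordered_set<std::string> &stop_words)
{
    std::vector<std::string> kept;
    for (const std::string &w : words)
    {
        if (stop_words.count(w) == 0) // keep only words that are not stop words
            kept.push_back(w);
    }
    return kept;
}
```

With the stop words {"the", "a"}, the tokens the / boost / a / library reduce to boost / library, so fewer entries end up in the inverted index.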

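Simulating a lookup

Let's walk through the lookup flow.

User enters "hello" -> look it up in the inverted index -> extract the document IDs (1, 2) -> go to the forward index -> find the document content -> summarize each result as title + content (desc) + url -> build the response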
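The flow above can be sketched as one function. Everything here is an assumption made for illustration: `Page`, `Hit`, and `Lookup` are hypothetical names, document IDs are 0-based vector positions, and the summary is simply the first `desc_len` bytes of the content (a real implementation would need care not to cut a multi-byte UTF-8 character in half).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration only.
struct Page { std::string title; std::string content; std::string url; };
struct Hit  { std::string title; std::string desc;    std::string url; };

// keyword -> inverted index -> document IDs -> forward index -> summarized hits
std::vector<Hit> Lookup(const std::string &keyword,
                        const std::vector<Page> &forward,
                        const std::unordered_map<std::string, std::vector<std::size_t> > &inverted,
                        std::size_t desc_len)
{
    std::vector<Hit> hits;
    auto it = inverted.find(keyword);
    if (it == inverted.end())
        return hits; // unknown keyword: empty result
    for (std::size_t id : it->second)
    {
        if (id >= forward.size())
            continue; // ignore IDs that do not exist in the forward index
        const Page &p = forward[id];
        // substr clamps the length, so short contents are returned whole.
        hits.push_back(Hit{p.title, p.content.substr(0, desc_len), p.url});
    }
    return hits;
}
```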

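Data cleaning

First we download the Boost library, using the latest version, which is 1.83.0 in my case. We download it to the desktop, upload it into the virtual machine with the rz command on CentOS, and then extract it.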

[qkj@localhost install]$ rz -E
[qkj@localhost install]$ ll
total 141256
-rw-r--r--. 1 qkj qkj 144645738 Sep  9 00:15 boost_1_83_0.tar.gz
[qkj@localhost install]$ tar xzf boost_1_83_0.tar.gz 
[qkj@localhost install]$ ll
total 141260
drwxr-xr-x. 8 qkj qkj      4096 Aug  8 14:40 boost_1_83_0
-rw-r--r--. 1 qkj qkj 144645738 Sep  9 00:15 boost_1_83_0.tar.gz
[qkj@localhost install]$

Now let's look at what the library contains.

[qkj@localhost install]$ cd boost_1_83_0/
[qkj@localhost boost_1_83_0]$ ll
total 112
drwxr-xr-x. 139 qkj qkj  8192 Aug  8 14:40 boost
-rw-r--r--.   1 qkj qkj   851 Aug  8 14:02 boost-build.jam
-rw-r--r--.   1 qkj qkj 20245 Aug  8 14:02 boostcpp.jam
-rw-r--r--.   1 qkj qkj   989 Aug  8 14:02 boost.css
-rw-r--r--.   1 qkj qkj  6308 Aug  8 14:02 boost.png
-rw-r--r--.   1 qkj qkj  2486 Aug  8 14:02 bootstrap.bat
-rwxr-xr-x.   1 qkj qkj 10811 Aug  8 14:02 bootstrap.sh
drwxr-xr-x.   7 qkj qkj   196 Aug  8 14:14 doc
-rw-r--r--.   1 qkj qkj   769 Aug  8 14:02 index.htm
-rw-r--r--.   1 qkj qkj  5418 Aug  8 14:40 index.html
-rw-r--r--.   1 qkj qkj   291 Aug  8 14:02 INSTALL
-rw-r--r--.   1 qkj qkj 11947 Aug  8 14:02 Jamroot
drwxr-xr-x. 148 qkj qkj  4096 Aug  8 14:40 libs
-rw-r--r--.   1 qkj qkj  1338 Aug  8 14:02 LICENSE_1_0.txt
drwxr-xr-x.   4 qkj qkj   159 Aug  8 14:02 more
-rw-r--r--.   1 qkj qkj   542 Aug  8 14:02 README.md
-rw-r--r--.   1 qkj qkj  2608 Aug  8 14:02 rst.css
drwxr-xr-x.   2 qkj qkj   171 Aug  8 14:02 status
drwxr-xr-x.  14 qkj qkj   256 Aug  8 14:02 tools
[qkj@localhost boost_1_83_0]$

This is everything in the Boost distribution. To keep the project simple, we only use the html files under doc/html. Indexing every html file in the distribution is left for later.

boost_1_83_0/doc/html

[qkj@localhost doc]$ cd html/
[qkj@localhost html]$ ll
total 2900
-rw-r--r--.  1 qkj qkj   3476 Aug  8 14:24 about.html
drwxr-xr-x.  2 qkj qkj     82 Aug  8 14:25 accumulators
-rw-r--r--.  1 qkj qkj   5858 Aug  8 14:25 accumulators.html
drwxr-xr-x.  2 qkj qkj    168 Aug  8 14:26 align
-rw-r--r--.  1 qkj qkj   4440 Aug  8 14:26 align.html
drwxr-xr-x.  2 qkj qkj     78 Aug  8 14:26 any
-rw-r--r--.  1 qkj qkj   9011 Aug  8 14:26 any.html
drwxr-xr-x.  3 qkj qkj     78 Aug  8 14:26 array
-rw-r--r--.  1 qkj qkj   8377 Aug  8 14:26 array.html
-rw-r--r--.  1 qkj qkj  36597 Aug  8 14:30 array_types.html
-rw-r--r--.  1 qkj qkj 286811 Aug  8 14:29 asio_HTML.manifest
-rw-r--r--.  1 qkj qkj   6685 Aug  8 14:35 Assignable.html
-rw-r--r--.  1 qkj qkj    700 Aug  8 14:02 atomic.html
-rw-r--r--.  1 qkj qkj  20627 Aug  8 14:30 auxiliary.html
drwxr-xr-x.  2 qkj qkj     31 Aug  8 14:02 bbv2
...

Next we copy everything under boost_1_83_0/doc/html into a single directory of our own.

[qkj@localhost boost_searcher]$ mkdir data/input -p
[qkj@localhost boost_searcher]$ cp -rf ../../install/boost_1_83_0/doc/html/* data/input/

Let's check the result.

[qkj@localhost boost_searcher]$ cd data/input/
[qkj@localhost input]$ ll
total 2900
-rw-r--r--.  1 qkj qkj   3476 Sep  9 00:31 about.html
drwxr-xr-x.  2 qkj qkj     82 Sep  9 00:31 accumulators
-rw-r--r--.  1 qkj qkj   5858 Sep  9 00:31 accumulators.html
drwxr-xr-x.  2 qkj qkj    168 Sep  9 00:31 align
-rw-r--r--.  1 qkj qkj   4440 Sep  9 00:31 align.html
drwxr-xr-x.  2 qkj qkj     78 Sep  9 00:31 any
-rw-r--r--.  1 qkj qkj   9011 Sep  9 00:31 any.html
drwxr-xr-x.  3 qkj qkj     78 Sep  9 00:31 array
-rw-r--r--.  1 qkj qkj   8377 Sep  9 00:31 array.html

Now we can start stripping tags. We create a source file for that.

[qkj@localhost boost_searcher]$ touch parser.cc

Understanding tags

Before we talk about stripping tags, we need to know what a tag is. Here is an html file opened at random.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">    
<html>    
<head>    
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">    
<title>Chapter 45. Boost.YAP</title>    
<link rel="stylesheet" href="../../doc/src/boostbook.css" type="text/css">    
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">    
<link rel="home" href="index.html" title="The Boost C++ Libraries BoostBook Documentation Subset">    
<link rel="up" href="libraries.html" title="Part I. The Boost C++ Libraries (BoostBook Subset)">    
<link rel="prev" href="xpressive/appendices.html" title="Appendices">    
<link rel="next" href="boost_yap/manual.html" title="Manual">    
<meta name="viewport" content="width=device-width, initial-scale=1">    
</head>    
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">    
<table cellpadding="2" width="100%"><tr>    
<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../boost.png"></td>             
<td align="center"><a href="../../index.html">Home</a></td>    
<td align="center"><a href="../../libs/libraries.htm">Libraries</a></td>    
<td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>

Anything enclosed in <> like this is a tag. Tags generally come in pairs. They are of no value to us right now, so we need to clean them out. The cleaned data is also saved to a file.

[qkj@localhost boost_searcher]$ mkdir data/raw_html -p
[qkj@localhost boost_searcher]$ cd data/
[qkj@localhost data]$ ll
total 16
drwxrwxr-x. 58 qkj qkj 12288 Sep  9 00:31 input     // the source html is stored here
drwxrwxr-x.  2 qkj qkj     6 Sep  9 00:44 raw_html  // the cleaned html is stored here
[qkj@localhost data]$

Next, how should we store the cleaned documents? First let's see how many source html files there are.

[qkj@localhost input]$ ls -Rl | grep -E "*.html" | wc -l
8581
[qkj@localhost input]$

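We could create one output file per source html file, but that would be a lot of files. Instead we put all cleaned documents into a single file and separate the documents with '\3', in a format like this:

XXXXXXXXXXXXXXXXX\3YYYYYYYYYYYYYYYYYYYYY\3ZZZZZZZZZZZZZZZZZZZZZZZZZ\3

Why '\3'? In the ASCII table, control characters are non-printable. The documents we collected (the html pages under data/input) consist almost entirely of printable characters and are very unlikely to contain such a control character, so using it as a separator will not collide with the document content.

However, we will not use exactly the format above. Instead we strip every '\n' from each document and use the following format:

Like this: title\3content\3url \n title\3content\3url \n title\3content\3url \n ...
This lets us call getline(ifstream, line) and get one whole document at a time: title\3content\3url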
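To make the record layout concrete, here is a minimal sketch that writes one record and splits it back into its three fields using only the standard library. `MakeRecord`, `SplitRecord`, and `kFieldSep` are hypothetical names; the project itself later does the splitting with a Boost helper.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Field separator inside one record; '\n' terminates the record.
const char kFieldSep = '\3';

// Builds "title\3content\3url\n". Assumes the fields contain neither '\3' nor '\n'.
std::string MakeRecord(const std::string &title, const std::string &content, const std::string &url)
{
    std::string record = title;
    record += kFieldSep;
    record += content;
    record += kFieldSep;
    record += url;
    record += '\n';
    return record;
}

// Splits one line (without the trailing '\n', as std::getline returns it) on '\3'.
// Empty fields are preserved.
std::vector<std::string> SplitRecord(const std::string &line)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true)
    {
        const std::string::size_type pos = line.find(kFieldSep, start);
        if (pos == std::string::npos)
        {
            fields.push_back(line.substr(start)); // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

A well-formed line always splits into exactly three fields, which gives the reader a cheap sanity check for each record.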

We create a file to hold the tag-stripped content.

drwxrwxr-x. 58 qkj qkj 12288 Sep  9 01:03 input
drwxrwxr-x.  2 qkj qkj     6 Sep  9 01:03 raw_html
[qkj@localhost data]$ 
[qkj@localhost data]$ cd raw_html/
[qkj@localhost raw_html]$ touch raw.txt
[qkj@localhost raw_html]$ ll
total 0
-rw-rw-r--. 1 qkj qkj 0 Sep  9 02:32 raw.txt

Tag-cleaning skeleton

Now we write the basic skeleton of parser.cc.

#include <iostream>
#include <string>
#include <vector>
#include <cassert>

// Directory that holds all the html pages
const std::string src_path = "data/input";
// Text file that stores the cleaned data of every page
const std::string output = "data/raw_html/raw.txt";

// Parsed form of one page
typedef struct DocInfo
{
    std::string title;   // document title
    std::string content; // document content
    std::string url;     // url of this document on the official site
} DocInfo_t;

static bool EnumFile(const std::string &src_path, std::vector<std::string> *file_list);
static bool ParseHtml(const std::vector<std::string> &file_list, std::vector<DocInfo_t> *results);
static bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output);

int main(void)
{
    // Holds the names of all html files
    std::vector<std::string> file_list;

    // Step 1: EnumFile enumerates every file name (with path), html pages only,
    // so that the files can be read one by one later
    if (false == EnumFile(src_path, &file_list))
    {
        std::cerr << "failed to enumerate file names" << std::endl;
        return 1;
    }

    // Step 2: read the content of every file and parse it into a DocInfo_t
    std::vector<DocInfo_t> results;
    if (false == ParseHtml(file_list, &results))
    {
        std::cerr << "failed to parse files" << std::endl;
        return 2;
    }

    // Step 3: write the parsed content to output, fields separated by \3
    // and documents separated by \n
    if (false == SaveHtml(results, output))
    {
        std::cerr << "failed to save file" << std::endl;
        return 3;
    }
    return 0;
}

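The basic approach is as follows.

Collect the names of all source html files and store them in an array
Walk the array, strip the tags from each file, and pack the result into a DocInfo_t struct holding title, content, and url; the structs go into another array
Walk the struct array and write the content to the destination file in the agreed format

Installing the `Boost` library

Before implementing the interfaces above, we need to install the Boost development package, because we use its functions.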

[qkj@localhost BoostSearchEngine]$ sudo yum install -y boost-devel
[sudo] password for qkj:

Here is a quick introduction to Boost; its manual is the reference to consult.

What we need are the file-related functions (boost::filesystem), so let's look at those.

Implementing `EnumFile`

Now we implement EnumFile. Its job is to collect the names of all files with the html extension under the given src_path directory and store them in the file_list array.

static bool EnumFile(const std::string &src_path, std::vector<std::string> *file_list)

The implementation is as follows.

// Requires: #include <boost/filesystem.hpp>
static bool EnumFile(const std::string &src_path, std::vector<std::string> *file_list)
{
    assert(file_list);
    namespace fs = boost::filesystem; // a common shorthand; C++ supports namespace aliases
    fs::path root_path(src_path);     // define a path object
    if (fs::exists(root_path) == false) // check whether the path exists
    {
        std::cerr << src_path << " does not exist" << std::endl;
        return false;
    }
    // A default-constructed iterator marks the end of the recursive traversal
    fs::recursive_directory_iterator end;
    for (fs::recursive_directory_iterator iter(root_path); iter != end; iter++)
    {
        // Only regular files
        if (fs::is_regular_file(*iter) == false)
        {
            // directories and the like
            continue;
        }
        // A regular file must have the .html extension
        if (iter->path().extension() != ".html")
        {
            continue;
        }
        std::cout << "debug: " << iter->path().string() << std::endl;
        // At this point it is definitely a regular file ending in .html
        file_list->push_back(iter->path().string());
    }
    return true;
}

Let's test it. First we write a Makefile.

cc=g++
parser:parser.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
.PHONY:clean
clean:
	rm parser

Running it shows that it works.

[qkj@localhost BoostSearchEngine]$ make
g++ -o parser parser.cc -std=c++11 -lboost_system -lboost_filesystem
[qkj@localhost BoostSearchEngine]$ ll
total 104
drwxrwxr-x. 4 qkj qkj    35 Sep  9 01:03 data
-rw-rw-r--. 1 qkj qkj   117 Sep  9 01:41 Makefile
-rwxrwxr-x. 1 qkj qkj 89152 Sep  9 01:43 parser
-rw-rw-r--. 1 qkj qkj  8398 Sep  9 01:43 parser.cc
[qkj@localhost BoostSearchEngine]$ ./parser 
debug: data/input/about.html
debug: data/input/accumulators/user_s_guide.html
debug: data/input/accumulators/acknowledgements.html
debug: data/input/accumulators/reference.html
debug: data/input/accumulators.html
...

Implementing `ParseHtml`

Here we start parsing each html file.

static bool ParseHtml(const std::vector<std::string> &file_list, std::vector<DocInfo_t> *results)

Below is the skeleton.

static bool ParseTitle(const std::string &file, std::string *title);
static bool ParseContent(const std::string &file, std::string *content);
static bool ParseUrl(const std::string &file_path, std::string *url);

static bool ParseHtml(const std::vector<std::string> &file_list, std::vector<DocInfo_t> *results)
{
    assert(results);
    for (auto &file_path : file_list)
    {
        // 1. Read the file
        std::string result;
        if (false == ns_util::FileUtil::ReadFile(file_path, &result))
        {
            continue;
        }
        DocInfo_t doc;
        // 2. Extract the title
        if (false == ParseTitle(result, &doc.title))
        {
            continue;
        }
        // 3. Extract the content, which essentially means stripping tags
        if (false == ParseContent(result, &doc.content))
        {
            continue;
        }
        // 4. Build the url
        if (false == ParseUrl(file_path, &doc.url))
        {
            continue;
        }
        // Parsing has definitely succeeded at this point
        results->push_back(std::move(doc)); // move instead of copy
    }
    return true;
}

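The flow is:

Read each file into a string
Extract the title from the string
Extract the content from the string
Build the url from the file path

We implement these functions one by one below.

Reading file contents

We put this function in a utility collection (util.hpp), since we may need it again later.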

#pragma once
#include <iostream>
#include <assert.h>
#include <fstream>
#include <string>

// A collection of utilities
namespace ns_util
{
    /// @brief Helpers for reading files
    class FileUtil
    {
    public:
        /// @brief Read the content of a file into out
        /// @param file_path
        /// @param out
        /// @return
        static bool ReadFile(const std::string &file_path, std::string *out)
        {
            assert(out);
            std::ifstream in(file_path, std::ios::in);
            if (in.is_open() == false)
            {
                std::cerr << "failed to open " << file_path << std::endl;
                return false;
            }
            std::string line;
            // Note: getline does not keep the trailing \n
            while (std::getline(in, line))
            {
                *out += line;
            }
            in.close();
            return true;
        }
    };
}
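One side effect of the loop above is worth noting: because getline drops the '\n' and the lines are appended back to back, the last word of one line and the first word of the next end up glued together ("foo" followed by "bar" becomes "foobar"). The sketch below is a hypothetical variant, not the author's code, that joins lines with a single space instead. `ReadAllJoined` is an assumed name, and it takes a stream so it can be tried without a file on disk.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Reads every line from `in` and appends it to *out, putting one space
// between consecutive lines so that words on adjacent lines stay separate.
// Hypothetical variant of ReadFile, for illustration only.
bool ReadAllJoined(std::istream &in, std::string *out)
{
    if (out == nullptr)
        return false;
    std::string line;
    bool first = true;
    while (std::getline(in, line))
    {
        if (!first)
            out->push_back(' '); // stands in for the newline that getline removed
        *out += line;
        first = false;
    }
    return true;
}
```

Whether the extra space matters depends on the tokenizer; for a keyword search over English documentation it keeps words at line boundaries intact.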

Extracting the title

Looking at one of our html files again, the title sits inside a <title> tag.

We extract the title from the string as follows.

static bool ParseTitle(const std::string &file, std::string *title)
{
    assert(title);
    std::size_t begin = file.find("<title>");
    if (begin == std::string::npos)
    {
        return false;
    }
    std::size_t end = file.find("</title>"); // position of the closing tag
    if (end == std::string::npos)
    {
        return false;
    }
    // Skip past the opening tag so that begin points at the title text
    begin += std::string("<title>").size();
    if (begin > end)
    {
        return false;
    }
    *title = file.substr(begin, end - begin);
    return true;
}

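Extracting the content

Getting the content does not mean taking everything in the file. We need to strip the tags, and for that we use a state machine.

Tags are delimited by < and >, so the state machine only needs two states. We assume the first character of the file is <.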

static bool ParseContent(const std::string &file, std::string *content)
{
    assert(content);
    // This is the core of tag stripping
    // We use a simple two-state state machine
    enum status
    {
        LABLE,
        CONTENT
    };
    enum status s = LABLE; // assume the first character is '<'
    for (char ch : file)   // taken by value on purpose: ch is modified below and file is const
    {
        switch (s)
        {
        case LABLE:
            if (ch == '>')
            {
                // The current tag has been fully consumed
                s = CONTENT;
            }
            break;
        case CONTENT:
            if (ch == '<')
            {
                // Another tag starts, possibly right away as in <><>
                s = LABLE;
            }
            else
            {
                // Detail: we do not want '\n' in the content,
                // because '\n' is our record separator.
                // It should not appear here, since the file was read with getline,
                // but we should not rely on someone else's code for that.
                if (ch == '\n')
                {
                    ch = ' ';
                }
                content->push_back(ch);
            }
            break;
        default:
            break;
        }
    }
    return true;
}

Building the url

One point needs discussing here. We have to assemble the url, so let's compare the url on the official site with our local path.

Official url: https://www.boost.org/doc/libs/1_83_0/doc/html/accumulators.html
Local path:   data/input/accumulators.html   // because we copied the contents of doc/html/ into data/input

// So we assemble the url like this
url_head = "https://www.boost.org/doc/libs/1_83_0/doc/html";
url_tail = [data/input](removed) /accumulators.html  =>  url_tail = /accumulators.html
url = url_head + url_tail;   // this forms the official link

Here is the code.

static bool ParseUrl(const std::string &file_path, std::string *url)
{
    assert(url);
    //  url_head = "https://www.boost.org/doc/libs/1_83_0/doc/html"
    //  url_tail = "/accumulators.html"
    std::string url_head = "https://www.boost.org/doc/libs/1_83_0/doc/html";
    // Drop the local prefix (src_path, i.e. "data/input") from the file path
    std::string url_tail = file_path.substr(src_path.size());
    *url = url_head + url_tail;
    return true;
}

Let's verify this with a small debug function.

void ShowDoc(const DocInfo_t &doc)
{
    std::cout << "title: " << doc.title << std::endl;
    std::cout << "content: " << doc.content << std::endl;
    std::cout << "url: " << doc.url << std::endl;
}

static bool ParseHtml(const std::vector<std::string> &file_list, std::vector<DocInfo_t> *results)
{
    assert(results);
    for (auto &file_path : file_list)
    {
        // 1. Read the file
        std::string result;
        if (false == ns_util::FileUtil::ReadFile(file_path, &result))
        {
            continue;
        }
        DocInfo_t doc;
        // 2. Extract the title
        if (false == ParseTitle(result, &doc.title))
        {
            continue;
        }
        // 3. Extract the content, which essentially means stripping tags
        if (false == ParseContent(result, &doc.content))
        {
            continue;
        }
        // 4. Build the url
        if (false == ParseUrl(file_path, &doc.url))
        {
            continue;
        }
        // for debug
        ShowDoc(doc);
        // break;
        // Parsing has definitely succeeded at this point
        results->push_back(std::move(doc)); // move instead of copy
    }
    return true;
}

Here is our test result.

title: Struct template result&lt;This(InputIterator, InputIterator)&gt;
content: Struct template result&lt;This(InputIterator, InputIterator)&gt;HomeLibrariesPeopleFAQMoreStruct template result&lt;This(InputIterator, InputIterator)&gt;boost::proto::functional::distance::result&lt;This(InputIterator, InputIterator)&gt;Synopsis// In header: &lt;boost/proto/functional/std/iterator.hpp&gt;template&lt;typename This, typename InputIterator&gt; struct result&lt;This(InputIterator, InputIterator)&gt; {  // types  typedef typename std::iterator_traits&lt;      typename boost::remove_const&lt;        typename boost::remove_reference&lt;InputIterator&gt;::type      &gt;::type    &gt;::difference_type type;};Copyright ? 2008 Eric Niebler        Distributed under the Boost Software License, Version 1.0. (See accompanying        file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)      
url: https://www.boost.org/doc/libs/1_83_0/doc/html/boost/proto/functional/distance/resu_1_3_32_5_26_2_1_1_2_4.html

We take this url and check it on the official site, and it is indeed the right page.

Implementing `SaveHtml`

static bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output);

We now have a struct for every file, so we can write them to the destination file.

// Requires: #include <fstream>
static bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output)
{
#define SEP "\3"
    // We write the documents in the format below; remember that \n was removed from the content
    // title\3content\3url\n title\3content\3url\n title\3content\3url\n
    // explicit basic_ofstream (const char* filename,
    //                       ios_base::openmode mode = ios_base::out);
    std::ofstream out(output, std::ios::out | std::ios::binary);
    if (out.is_open() == false)
    {
        std::cerr << "failed to open file " << output << std::endl;
        return false;
    }
    for (auto &e : results)
    {
        std::string str = e.title;
        str += SEP;
        str += e.content;
        str += SEP;
        str += e.url;
        str += "\n";
        out.write(str.c_str(), str.size());
    }
    out.close();
    return true;
}

Now we verify that every document was saved: the number of lines in raw.txt should match the number of html files.

[qkj@localhost BoostSearchEngine]$ ls ./data/input/ -Rl | grep -E "*.html" | wc -l
8581
[qkj@localhost BoostSearchEngine]$ cat ./data/raw_html/raw.txt | wc -l
8581
[qkj@localhost BoostSearchEngine]$

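Building the index

Next we build the index. Building an index really means constructing data structures for storage and search that speed up the path keyword -> document ID -> document content. As discussed above, we build both a forward index and an inverted index.

Installing and using jieba

For tokenization we use the cppjieba tool. Running the command below fetches it.

[qkj@localhost install]$ git clone https://github.com/yanyiwu/cppjieba.git

Let's look at what cppjieba contains.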

[qkj@localhost install]$ tree cppjieba/
cppjieba/
├── ChangeLog.md
├── CMakeLists.txt
├── deps
│   ├── CMakeLists.txt
│   ├── gtest
│   │   ├── CMakeLists.txt
│   │   ├── include
│   │   │   └── gtest
│   │   │       ├── gtest-death-test.h
│   │   │       ├── gtest.h
│   │   │       ├── gtest-message.h
│   │   │       ├── gtest-param-test.h
│   │   │       ├── gtest-param-test.h.pump
│   │   │       ├── gtest_pred_impl.h
│   │   │       ├── gtest-printers.h
│   │   │       ├── gtest_prod.h
│   │   │       ├── gtest-spi.h
│   │   │       ├── gtest-test-part.h
│   │   │       ├── gtest-typed-test.h
│   │   │       └── internal
│   │   │           ├── gtest-death-test-internal.h
│   │   │           ├── gtest-filepath.h
│   │   │           ├── gtest-internal.h
│   │   │           ├── gtest-linked_ptr.h
│   │   │           ├── gtest-param-util-generated.h
│   │   │           ├── gtest-param-util-generated.h.pump
│   │   │           ├── gtest-param-util.h
│   │   │           ├── gtest-port.h
│   │   │           ├── gtest-string.h
│   │   │           ├── gtest-tuple.h
│   │   │           ├── gtest-tuple.h.pump
│   │   │           ├── gtest-type-util.h
│   │   │           └── gtest-type-util.h.pump
│   │   └── src
│   │       ├── gtest-all.cc
│   │       ├── gtest.cc
│   │       ├── gtest-death-test.cc
│   │       ├── gtest-filepath.cc
│   │       ├── gtest-internal-inl.h
│   │       ├── gtest_main.cc
│   │       ├── gtest-port.cc
│   │       ├── gtest-printers.cc
│   │       ├── gtest-test-part.cc
│   │       └── gtest-typed-test.cc
│   └── limonp
├── dict
│   ├── hmm_model.utf8
│   ├── idf.utf8
│   ├── jieba.dict.utf8
│   ├── pos_dict
│   │   ├── char_state_tab.utf8
│   │   ├── prob_emit.utf8
│   │   ├── prob_start.utf8
│   │   └── prob_trans.utf8
│   ├── README.md
│   ├── stop_words.utf8
│   └── user.dict.utf8
├── include
│   └── cppjieba
│       ├── DictTrie.hpp
│       ├── FullSegment.hpp
│       ├── HMMModel.hpp
│       ├── HMMSegment.hpp
│       ├── Jieba.hpp
│       ├── KeywordExtractor.hpp
│       ├── MixSegment.hpp
│       ├── MPSegment.hpp
│       ├── PosTagger.hpp
│       ├── PreFilter.hpp
│       ├── QuerySegment.hpp
│       ├── SegmentBase.hpp
│       ├── SegmentTagged.hpp
│       ├── TextRankExtractor.hpp
│       ├── Trie.hpp
│       └── Unicode.hpp
├── LICENSE
├── README_EN.md
├── README.md
└── test
    ├── CMakeLists.txt
    ├── demo.cpp
    ├── load_test.cpp
    ├── testdata
    │   ├── curl.res
    │   ├── extra_dict
    │   │   └── jieba.dict.small.utf8
    │   ├── gbk_dict
    │   │   ├── hmm_model.gbk
    │   │   └── jieba.dict.gbk
    │   ├── jieba.dict.0.1.utf8
    │   ├── jieba.dict.0.utf8
    │   ├── jieba.dict.1.utf8
    │   ├── jieba.dict.2.utf8
    │   ├── load_test.urls
    │   ├── review.100
    │   ├── review.100.res
    │   ├── server.conf
    │   ├── testlines.gbk
    │   ├── testlines.utf8
    │   ├── userdict.2.utf8
    │   ├── userdict.english
    │   ├── userdict.utf8
    │   └── weicheng.utf8
    └── unittest
        ├── CMakeLists.txt
        ├── gtest_main.cpp
        ├── jieba_test.cpp
        ├── keyword_extractor_test.cpp
        ├── pos_tagger_test.cpp
        ├── pre_filter_test.cpp
        ├── segments_test.cpp
        ├── textrank_test.cpp
        ├── trie_test.cpp
        └── unicode_test.cpp

16 directories, 98 files
[qkj@localhost install]$

We only care about two directories here.

cppjieba/include : the header files
cppjieba/dict : the dictionaries

Now let's start using jieba. The repository ships a demo.cpp for testing, so we copy it to a convenient location.

[qkj@localhost test]$ pwd
/home/qkj/install/cppjieba/test
[qkj@localhost test]$ ll
total 16
-rw-rw-r--. 1 qkj qkj  148 Sep  9 03:38 CMakeLists.txt
-rw-rw-r--. 1 qkj qkj 2797 Sep  9 03:38 demo.cpp
-rw-rw-r--. 1 qkj qkj 1532 Sep  9 03:38 load_test.cpp
drwxrwxr-x. 4 qkj qkj 4096 Sep  9 03:38 testdata
drwxrwxr-x. 2 qkj qkj  255 Sep  9 03:38 unittest
[qkj@localhost test]$ cp demo.cpp ../..
[qkj@localhost test]$ cd ../../
[qkj@localhost install]$ ll
total 8
drwxr-xr-x. 8 qkj qkj 4096 Aug  8 14:40 boost_1_83_0
drwxrwxr-x. 8 qkj qkj  215 Sep  9 03:38 cppjieba
-rw-rw-r--. 1 qkj qkj 2797 Sep  9 03:49 demo.cpp
[qkj@localhost install]$

First of all, we cannot compile it as is; the compiler reports an error.

[qkj@localhost install]$ g++ demo.cpp 
demo.cpp:1:10: fatal error: cppjieba/Jieba.hpp: No such file or directory
 #include "cppjieba/Jieba.hpp"
          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
[qkj@localhost install]$

This is because the paths to the headers and dictionaries are wrong from where demo.cpp now sits. We add symbolic links to fix that.

[qkj@localhost install]$ ln -s  cppjieba/include/ inc
[qkj@localhost install]$ ln -s  cppjieba/dict/ dict
[qkj@localhost install]$ ll
total 8
drwxr-xr-x. 8 qkj qkj 4096 Aug  8 14:40 boost_1_83_0
drwxrwxr-x. 8 qkj qkj  215 Sep  9 03:38 cppjieba
-rw-rw-r--. 1 qkj qkj 2797 Sep  9 03:49 demo.cpp
lrwxrwxrwx. 1 qkj qkj   14 Sep  9 03:50 dict -> cppjieba/dict/
lrwxrwxrwx. 1 qkj qkj   17 Sep  9 03:50 inc -> cppjieba/include/
[qkj@localhost install]$ cp -rf cppjieba/deps/limonp/ cppjieba/include/cppjieba/
[qkj@localhost install]$

Next we edit demo.cpp so that its include and dictionary paths go through the new links.

We compile again and find that there is still an error.

[qkj@localhost install]$ g++ demo.cpp 
In file included from inc/cppjieba/Jieba.hpp:4,
                 from demo.cpp:1:
inc/cppjieba/QuerySegment.hpp:7:10: fatal error: limonp/Logging.hpp: No such file or directory
 #include "limonp/Logging.hpp"
          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.

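This is because cppjieba/deps/limonp is actually an empty directory (limonp is a git submodule, and a plain git clone leaves it empty), so the copy we made earlier is empty too.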

[qkj@localhost install]$ cd  cppjieba/include/cppjieba/limonp/
[qkj@localhost limonp]$ ll
total 0
[qkj@localhost limonp]$

We need to download this directory by hand.

[qkj@localhost install]$ git clone https://github.com/yanyiwu/limonp.git

Then we copy the downloaded headers into cppjieba/deps/limonp and copy that directory again into cppjieba/include/cppjieba/.

[qkj@localhost install]$ cp -rf limonp/include/limonp/ cppjieba/deps/
[qkj@localhost install]$ cp -rf cppjieba/deps/limonp/ cppjieba/include/cppjieba/
[qkj@localhost install]$

That is all it takes. Now we compile and run.

[qkj@localhost install]$ g++ demo.cpp -std=c++11
[qkj@localhost install]$ ll
total 480
-rwxrwxr-x. 1 qkj qkj 482896 Sep  9 05:50 a.out
drwxr-xr-x. 8 qkj qkj   4096 Aug  8 14:40 boost_1_83_0
drwxrwxr-x. 8 qkj qkj    215 Sep  9 03:38 cppjieba
-rw-rw-r--. 1 qkj qkj   2852 Sep  9 05:28 demo.cpp
lrwxrwxrwx. 1 qkj qkj     14 Sep  9 03:50 dict -> cppjieba/dict/
lrwxrwxrwx. 1 qkj qkj     17 Sep  9 03:50 inc -> cppjieba/include/
drwxrwxr-x. 6 qkj qkj    171 Sep  9 05:46 limonp
[qkj@localhost install]$ ./a.out 
他来到了网易杭研大厦
[demo] Cut With HMM
他/来到/了/网易/杭研/大厦
[demo] Cut Without HMM 
他/来到/了/网易/杭/研/大厦
我来到北京清华大学
[demo] CutAll
我/来到/北京/清华/清华大学/华大/大学
小明硕士毕业于中国科学院计算所，后在日本京都大学深造
[demo] CutForSearch
小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/，/后/在/日本/京都/大学/日本京都大学/深造

Index skeleton

We create a new file.

[qkj@localhost BoostSearchEngine]$ touch index.hpp
[qkj@localhost BoostSearchEngine]$ ll
total 124
drwxrwxr-x. 4 qkj qkj     35 Sep  9 01:03 data
-rw-rw-r--. 1 qkj qkj      0 Sep  9 02:48 index.hpp
-rw-rw-r--. 1 qkj qkj    117 Sep  9 01:41 Makefile
-rwxrwxr-x. 1 qkj qkj 110008 Sep  9 02:48 parser
-rw-rw-r--. 1 qkj qkj   6361 Sep  9 02:47 parser.cc
-rw-rw-r--. 1 qkj qkj    783 Sep  9 02:48 util.hpp
[qkj@localhost BoostSearchEngine]$

To be clear about what we need: we build a forward index and an inverted index, and we provide two lookup interfaces.

// Requires: <string>, <vector>, <unordered_map>, <cstdint>
namespace ns_index
{
    struct DocInfo
    {
        std::string title;   // document title
        std::string content; // document content
        std::string url;     // url on the official site
        uint64_t doc_id;     // document id; its purpose is explained later
    };

    /// @brief Helper element of the inverted index
    struct InvertedElem
    {
        uint64_t doc_id;  // document id
        std::string word; // keyword
        int weight;       // weight, explained later
    };

    // Inverted list: one keyword maps to a group of InvertedElem
    typedef std::vector<struct InvertedElem> InvertedList;

    class Index
    {
    public:
        /// @brief Look up the forward index by doc_id, i.e. get the document content
        /// @param doc_id document id
        /// @return address of the document struct
        struct DocInfo *GetForwardIndex(const uint64_t doc_id)
        {
            return nullptr;
        }

        /// @brief Get the inverted list for a keyword
        /// @param word keyword
        /// @return
        InvertedList *GetInvertedList(const std::string &word)
        {
            return nullptr;
        }

        /// @brief Build the forward and inverted indexes from the cleaned data; this is the most important step
        /// @param src_path path of the tag-stripped data file
        /// @return
        bool BuildIndex(const std::string &src_path)
        {
            // build the forward index
            // build the inverted index
            return true;
        }

    private:
        // These two functions are not exposed to the outside

        /// @brief Build a forward index entry from a string, so that a document id leads to its content
        /// @param line a string holding all the content of one html document
        /// @return
        DocInfo *BuildForwardIndex(const std::string &line)
        {
            return nullptr;
        }

        /// @brief Build inverted index entries from one document struct; requires tokenization
        /// @param doc the document struct
        /// @return
        bool BuildInvertedIndex(const DocInfo &doc)
        {
            return true;
        }

    private:
        // Forward index: the vector subscript doubles as the id, which makes lookups efficient
        std::vector<struct DocInfo> forward_index;
        // Inverted index: a keyword may appear in many documents, so each keyword maps to a group of InvertedElem
        std::unordered_map<std::string, InvertedList> inverted_index;
    };
}

We now implement these functions one by one.

BuildIndex: building the index

bool BuildIndex(const std::string &src_path);

This function builds the index from the data we have already cleaned.

bool BuildIndex(const std::string &src_path)
{
    std::ifstream in(src_path, std::ios::in | std::ios::binary);
    if (in.is_open() == false)
    {
        std::cerr << "file path " << src_path << " is invalid" << std::endl;
        return false;
    }
    int count = 0; // lets us watch the progress of index building
    std::string line;
    while (std::getline(in, line))
    {
        // Each line holds the cleaned content of one html document
        // Build the forward index entry
        DocInfo *doc = BuildForwardIndex(line);
        if (doc == nullptr)
        {
            std::cerr << "failed to build a forward index entry: " << line << std::endl;
            continue;
        }
        // Build the inverted index entries
        BuildInvertedIndex(*doc);
        count++;
        if (count % 50 == 0)
        {
            // A progress bar can be added later
            std::cout << "documents indexed so far: " << count << std::endl;
        }
    }
    return true;
}

Building the forward index

This one is very easy to implement. The array subscript naturally serves as the document ID, so we only need to turn each cleaned document into a struct and append it to the array.

/// @brief Build a forward index entry from a string, so that a document id leads to its content
/// @param line a string holding all the content of one html document
/// @return
DocInfo *BuildForwardIndex(const std::string &line)
{
    // title\3content\3url\n
    std::vector<std::string> results;
    const std::string sep = "\3";
    ns_util::StringUtil::Split(line, &results, sep); // string splitting from the utility collection
    if (results.size() != 3)
        return nullptr;
    DocInfo doc;
    doc.title = results[0];
    doc.content = results[1];
    doc.url = results[2];
    // The document id is the array subscript
    doc.doc_id = forward_index.size(); // note: this is the forward index
    forward_index.push_back(std::move(doc));
    return &(forward_index[forward_index.size() - 1]);
}

Now we add the splitting code to the utility collection.

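// Requires: #include <boost/algorithm/string.hpp>
/// @brief String splitting
class StringUtil
{
public:
    static void Split(const std::string &target, std::vector<std::string> *out, const std::string sep)
    {
        assert(out);
        // We use the ready-made split function from Boost;
        // token_compress_on merges adjacent separators into one
        boost::split(*out, target, boost::is_any_of(sep), boost::token_compress_on);
    }
};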

Building the inverted index

Next we build the inverted index from the struct we just created. This is where tokenization comes in.

struct word_cnt
{
    int title_cnt;
    int content_cnt;
    word_cnt() : title_cnt(0), content_cnt(0) {}
};

bool BuildInvertedIndex(const DocInfo &doc)
{
    // Temporary storage for word frequencies
    std::unordered_map<std::string, word_cnt> word_map;

    // 1. Tokenize the title
    std::vector<std::string> title_words;
    ns_util::JiebaUtil::CutString(doc.title, &title_words);
    // The index is case-insensitive,
    // so the user's query should not be case-sensitive either
    for (std::string s : title_words) // copied on purpose: to_lower modifies s
    {
        boost::to_lower(s);
        word_map[s].title_cnt++; // operator[] creates a zero-initialized word_cnt on first use
    }

    // 2. Tokenize the document content
    std::vector<std::string> content_words;
    ns_util::JiebaUtil::CutString(doc.content, &content_words);
    for (auto s : content_words)
    {
        boost::to_lower(s);
        word_map[s].content_cnt++;
    }
    // Every word now has its number of occurrences in the title and in the content

    // 3. Build the inverted lists
    for (auto &word_pair : word_map)
    {
        /*
        struct InvertedElem
        {
            uint64_t doc_id;  // document id
            std::string word; // keyword
            int weight;       // weight, explained later
        };
        */
        InvertedElem item;
        item.doc_id = doc.doc_id; // this is why DocInfo carries an id
        item.word = word_pair.first;
        item.weight = _build_relevance(word_pair.second); // compute the weight
        // Append to the inverted list
        // typedef std::vector<struct InvertedElem> InvertedList;
        // std::unordered_map<std::string, InvertedList> inverted_index;
        InvertedList &inverted_list = inverted_index[word_pair.first];
        inverted_list.push_back(std::move(item));
    }
    return true;
}

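Bringing in jieba

Because the inverted index needs word segmentation, we bring in cppjieba and wrap string cutting as a utility. We make it available in the project through symbolic links.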

[qkj@localhost BoostSearchEngine]$ ln -s /home/qkj/install/cppjieba/include/cppjieba cppjieba
[qkj@localhost BoostSearchEngine]$ ln -s /home/qkj/install/cppjieba/dict/ dict
[qkj@localhost BoostSearchEngine]$ ll
total 24
lrwxrwxrwx. 1 qkj qkj   43 Sep  9 06:00 cppjieba -> /home/qkj/install/cppjieba/include/cppjieba
drwxrwxr-x. 4 qkj qkj   35 Sep  9 01:03 data
lrwxrwxrwx. 1 qkj qkj   32 Sep  9 06:01 dict -> /home/qkj/install/cppjieba/dict/
-rw-rw-r--. 1 qkj qkj 6379 Sep  9 03:15 index.hpp
-rw-rw-r--. 1 qkj qkj  117 Sep  9 01:41 Makefile
-rw-rw-r--. 1 qkj qkj 6361 Sep  9 02:47 parser.cc
-rw-rw-r--. 1 qkj qkj 1199 Sep  9 03:15 util.hpp
[qkj@localhost BoostSearchEngine]$

Now we can write the segmentation utility.

const char *const DICT_PATH = "./dict/jieba.dict.utf8";
const char *const HMM_PATH = "./dict/hmm_model.utf8";
const char *const USER_DICT_PATH = "./dict/user.dict.utf8";
const char *const IDF_PATH = "./dict/idf.utf8";
const char *const STOP_WORD_PATH = "./dict/stop_words.utf8";

/// @brief Thin wrapper around jieba word segmentation
class JiebaUtil
{
public:
    static void CutString(const std::string &src, std::vector<std::string> *out)
    {
        assert(out);
        jieba.CutForSearch(src, *out);
    }

private:
    static cppjieba::Jieba jieba;
};
cppjieba::Jieba JiebaUtil::jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH);

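Weight Calculation

First, what is a weight? One way to look at it: a word that is searched for frequently deserves a higher weight, and within a single document, the more often a keyword appears, the higher that document's weight for the keyword. Here we keep the calculation simple.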

    int _build_relevance(const struct word_cnt &word)
    {
#define X 10
#define Y 1
        return X * word.title_cnt + Y * word.content_cnt;
    }

So what is the weight used for? At search time one keyword can map to many documents, and the weight lets us put the most relevant documents first.
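To make the ranking idea concrete, here is a minimal sketch that computes weights with the same 10:1 ratio and sorts by weight in descending order. `Posting`, `Relevance`, and `SortByWeightDesc` are hypothetical names chosen for illustration; they are not taken from the project's index.hpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one entry of an inverted list.
struct Posting
{
    std::uint64_t doc_id;
    int weight;
};

// Same formula as _build_relevance: a title hit counts 10, a content hit counts 1.
int Relevance(int title_cnt, int content_cnt)
{
    return 10 * title_cnt + 1 * content_cnt;
}

// Orders the postings so that the highest weight comes first.
void SortByWeightDesc(std::vector<Posting> *postings)
{
    assert(postings);
    std::sort(postings->begin(), postings->end(),
              [](const Posting &a, const Posting &b) { return a.weight > b.weight; });
}
```

A document whose title contains the keyword twice (weight 20) is then listed before one with a single title hit and a single content hit (weight 11), which in turn beats five content-only hits (weight 5).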

Our structure now looks like this.

`GetForwardIndex`

This function finds a document's content from its id.

struct DocInfo *GetForwardIndex(const uint64_t doc_id)
{
    // doc_id is unsigned, so only the upper bound needs checking.
    if (doc_id >= forward_index.size())
    {
        std::cerr << "doc_id " << doc_id << " is out of range" << std::endl;
        return nullptr;
    }
    return &(forward_index[doc_id]);
}

`GetInvertedList`

This function returns the inverted list for a keyword.

InvertedList *GetInvertedList(const std::string &word)
{
    auto it = inverted_index.find(word);
    if (it == inverted_index.end())
    {
        std::cerr << "keyword " << word << " not found" << std::endl;
        return nullptr;
    }
    return &(it->second);
}

One small task remains: making Index a singleton.

Making Index a Singleton

We now turn Index into a singleton, for two reasons. First, the Boost search engine never needs more than one Index object; a single index is enough to serve every lookup. Second, building an index is very expensive, because every page has to be segmented, counted, filled in, and inserted, so building it more than once would waste a great deal of time.

#pragma once
// Headers needed by this file (util.hpp provides ns_util).
#include <cstdint>
#include <fstream>
#include <iostream>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>
#include <boost/algorithm/string.hpp>
#include "util.hpp"

namespace ns_index
{
    struct DocInfo
    {
        std::string title;   // document title
        std::string content; // document content
        std::string url;     // URL on the official site
        uint64_t doc_id;     // document id, needed when building the inverted index
    };

    /// @brief Helper element of the inverted index
    struct InvertedElem
    {
        uint64_t doc_id;  // document id
        std::string word; // keyword
        int weight;       // weight
    };

    // Inverted list: one keyword maps to a group of InvertedElem
    typedef std::vector<struct InvertedElem> InvertedList;

    class Index
    {
    private:
        Index() {}
        Index(const Index &) = delete;
        Index &operator=(const Index &) = delete;
        static Index *instance;
        static std::mutex mtx;

    public:
        ~Index() {}

        static Index *GetInstance()
        {
            // Creation is not thread-safe on its own, so lock;
            // the outer check avoids locking once the instance exists.
            if (nullptr == instance)
            {
                std::lock_guard<std::mutex> lock(mtx);
                if (nullptr == instance)
                {
                    instance = new Index;
                }
            }
            return instance;
        }

        /// @brief Forward lookup: get the document content from its doc_id
        /// @param doc_id document id
        /// @return address of the document struct, or nullptr if out of range
        struct DocInfo *GetForwardIndex(const uint64_t doc_id)
        {
            // doc_id is unsigned, so only the upper bound needs checking.
            if (doc_id >= forward_index.size())
            {
                std::cerr << "doc_id " << doc_id << " is out of range" << std::endl;
                return nullptr;
            }
            return &(forward_index[doc_id]);
        }

        /// @brief Inverted lookup: get the inverted list of a keyword
        /// @param word keyword
        /// @return the inverted list, or nullptr if the keyword is unknown
        InvertedList *GetInvertedList(const std::string &word)
        {
            auto it = inverted_index.find(word);
            if (it == inverted_index.end())
            {
                std::cerr << "keyword " << word << " not found" << std::endl;
                return nullptr;
            }
            return &(it->second);
        }

        /// @brief Build the forward and inverted indexes from a file; this is the heaviest step
        /// @param src_path path of the file produced by stripping the HTML tags
        /// @return true on success
        bool BuildIndex(const std::string &src_path)
        {
            std::ifstream in(src_path, std::ios::in | std::ios::binary);
            if (in.is_open() == false)
            {
                std::cerr << "input file " << src_path << " is invalid" << std::endl;
                return false;
            }
            int count = 0;
            std::string line;
            while (std::getline(in, line))
            {
                // Each line holds the extracted content of one HTML page.
                // Build the forward index entry.
                DocInfo *doc = BuildForwardIndex(line);
                if (doc == nullptr)
                {
                    std::cerr << "failed to build a forward index entry: " << line << std::endl;
                    continue;
                }
                // Build the inverted index entries.
                BuildInvertedIndex(*doc);
                count++;
                if (count % 50 == 0)
                {
                    // A progress bar can be added later.
                    // LOG(NORMAL, "Indexed " + std::to_string(count) + " documents so far");
                    std::cout << "documents indexed so far: " << count << std::endl;
                }
            }
            return true;
        }

    private:
        /// @brief Build one forward index entry, so that a doc_id leads to the document content
        /// @param line a string holding all the content of one HTML document
        /// @return address of the new entry, or nullptr if the line is malformed
        DocInfo *BuildForwardIndex(const std::string &line)
        {
            // title\3content\3url\n
            std::vector<std::string> results;
            const std::string sep = "\3";
            ns_util::StringUtil::Split(line, &results, sep);
            if (results.size() != 3)
                return nullptr;
            DocInfo doc;
            doc.title = results[0];
            doc.content = results[1];
            doc.url = results[2];
            doc.doc_id = forward_index.size(); // note: this is the forward list
            forward_index.push_back(std::move(doc));
            return &(forward_index[forward_index.size() - 1]);
        }

        // Used for word-frequency counting
        struct word_cnt
        {
            int title_cnt;
            int content_cnt;
            word_cnt() : title_cnt(0), content_cnt(0) {}
        };

        /// @brief Build the inverted index entries of one document; requires word segmentation
        /// @param doc the document struct
        /// @return true on success
        bool BuildInvertedIndex(const DocInfo &doc)
        {
            // Temporary word-frequency table for this document
            std::unordered_map<std::string, word_cnt> word_map;

            // 1. Segment the title
            std::vector<std::string> title_words;
            ns_util::JiebaUtil::CutString(doc.title, &title_words);
            // The index is case-insensitive, so user queries must be lowercased as well.
            for (std::string s : title_words)
            {
                boost::to_lower(s);
                word_map[s].title_cnt++;
            }

            // 2. Segment the content
            std::vector<std::string> content_words;
            ns_util::JiebaUtil::CutString(doc.content, &content_words);
            for (auto s : content_words)
            {
                boost::to_lower(s);
                word_map[s].content_cnt++;
            }

            // 3. Build the inverted lists
            for (auto &word_pair : word_map)
            {
                InvertedElem item;
                item.doc_id = doc.doc_id; // this is why DocInfo carries its own id
                item.word = word_pair.first;
                item.weight = _build_relevance(word_pair.second);
                // Append to the inverted list of this word.
                InvertedList &inverted_list = inverted_index[word_pair.first];
                inverted_list.push_back(std::move(item));
            }
            return true;
        }

    private:
        /// @brief Compute the weight of a word
        /// @param word title and content counts of the word
        /// @return the weight
        int _build_relevance(const struct word_cnt &word)
        {
#define X 10
#define Y 1
            return X * word.title_cnt + Y * word.content_cnt;
        }

    private:
        // Forward index: the vector subscript doubles as the id, which makes lookups efficient
        std::vector<struct DocInfo> forward_index;
        // Inverted index: a keyword may appear in many documents, so it maps to a group of InvertedElem
        std::unordered_map<std::string, InvertedList> inverted_index;
    };
    Index *Index::instance = nullptr;
    std::mutex Index::mtx;
}
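The `GetInstance` above takes a lock by hand. As an alternative, and only as a sketch rather than what this project does, C++11 guarantees that a function-local static is initialized exactly once even when several threads call the function at the same time, which gives a shorter singleton. `IndexSketch` and its members are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical singleton built on a function-local static.
class IndexSketch
{
public:
    static IndexSketch &Instance()
    {
        static IndexSketch inst; // initialized once, thread-safe since C++11
        return inst;
    }
    IndexSketch(const IndexSketch &) = delete;
    IndexSketch &operator=(const IndexSketch &) = delete;

    // Stores one document; stands in for the real index-building work.
    void Add(const std::string &doc)
    {
        assert(!doc.empty());
        docs_.push_back(doc);
    }
    std::size_t Size() const { return docs_.size(); }

private:
    IndexSketch() = default;
    std::vector<std::string> docs_;
};
```

Every call to `IndexSketch::Instance()` returns a reference to the same object, so the expensive index is built at most once per process. Note that only the initialization is synchronized; concurrent calls to `Add` would still need their own locking.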

Search Engine Module

Now we start on the search module, beginning with its basic code structure. We create a new file for it.

[qkj@localhost BoostSearchEngine]$ touch searcher.hpp

Here is the skeleton.

namespace ns_searcher
{
    struct InvertedElemPrint
    {
        uint64_t doc_id;                // document id
        int weight;                     // weight
        std::vector<std::string> words; // keywords
        InvertedElemPrint() : doc_id(0), weight(0) {}
    };

    class Searcher
    {
    public:
        Searcher() {}
        ~Searcher() {}

        // input: the file produced by stripping the HTML tags
        void InitSearcher(const std::string &input)
        {
            // 1. Get the Index instance
            // 2. Build the index from the input
        }

        // query: the word or sentence to search for
        // json_string: the result, as a JSON string
        void Search(const std::string &query, std::string *json_string)
        {
            // 1. Segment the query, lowercasing each word
            // 2. Look up the inverted list of each keyword
            // 3. Merge and sort: order the results by weight, descending
            // 4. Build the JSON string
        }

    private:
        ns_index::Index *index; // used for index lookups
    };
}

InitSearcher

This is the initialization step. It does two things:

Get the Index object
Build the index through that object

void InitSearcher(const std::string &input)
{
    // Get (or create) the Index singleton.
    index = ns_index::Index::GetInstance();
    // std::cout << "got the singleton" << std::endl;
    // Build the forward and inverted indexes through it.
    index->BuildIndex(input);
    // std::cout << "forward and inverted indexes built" << std::endl;
}

Search

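This is the actual search procedure. Given the text we want to look up, the function works as follows:

Segment the input and store the lowercased words in an array
For each element of the array, fetch its inverted list and append the list's contents to one combined list
Sort the combined list in descending order of weight
Use the ids in the list to find the document contents and build the JSON string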

void Search(const std::string &query, std::string *json_string)
{
    // 1. Segment the query first; lookups come afterwards.
    std::vector<std::string> words;
    ns_util::JiebaUtil::CutString(query, &words);

    // 2. Run one lookup per word.
    ns_index::InvertedList inverted_list_all; // holds the contents of every inverted list found
    for (std::string s : words)
    {
        boost::to_lower(s); // the index ignores case, so the query must too
        // Inverted lookup first.
        ns_index::InvertedList *inverted_list = index->GetInvertedList(s);
        if (nullptr == inverted_list)
        {
            continue;
        }
        // Found: keep every element of the list.
        // Imperfect: one word may match many documents, and one document may match many keywords.
        inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());
    }

    // 3. Sort by weight, descending.
    std::sort(inverted_list_all.begin(), inverted_list_all.end(),
              [](const ns_index::InvertedElem &e1, const ns_index::InvertedElem &e2) {
                  return e1.weight > e2.weight;
              });

    // 4. Build the JSON string (serialization); filled in once jsoncpp is set up:
    // *json_string = writer.write(root);
}

The implementation above has an imperfection. One word can map to several document ids, so the document ids reached through several keywords may repeat. Consider the following example.

Keyword	Document IDs
hello	1, 2
I	1, 2
am	1, 2
college student	1
working adult	2

If we segment "hello, I" and append each word's inverted list to the combined list, we get [document 1, document 2, document 1, document 2]. We will fix this later.
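One possible way to remove the duplicates, sketched here under our own assumptions and not necessarily the fix this project ends up with, is to group the hits by document id, add up their weights, and remember which keywords matched. The `InvertedElemPrint` struct in the skeleton has the same three fields. `Hit`, `MergedHit`, and `MergeByDoc` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for ns_index::InvertedElem.
struct Hit
{
    std::uint64_t doc_id;
    std::string word;
    int weight;
};

// One entry per document after merging.
struct MergedHit
{
    std::uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

// Groups hits by doc_id, sums the weights, and sorts by total weight, descending.
std::vector<MergedHit> MergeByDoc(const std::vector<Hit> &hits)
{
    std::unordered_map<std::uint64_t, MergedHit> by_doc;
    for (const Hit &h : hits)
    {
        assert(h.weight >= 0); // weights are counts, so never negative
        MergedHit &m = by_doc[h.doc_id];
        m.doc_id = h.doc_id;
        m.weight += h.weight;
        m.words.push_back(h.word);
    }
    std::vector<MergedHit> merged;
    merged.reserve(by_doc.size());
    for (auto &kv : by_doc)
        merged.push_back(std::move(kv.second));
    std::sort(merged.begin(), merged.end(),
              [](const MergedHit &a, const MergedHit &b) {
                  if (a.weight != b.weight)
                      return a.weight > b.weight;
                  return a.doc_id < b.doc_id; // stable order for equal weights
              });
    return merged;
}
```

For the example above, the four hits collapse into two entries, one per document, each carrying both matched keywords.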

Installing and Using jsoncpp

We need to cover installing and using jsoncpp, since we have to build a JSON string here. JSON is our serialization and deserialization format.

[qkj@localhost BoostSearchEngine]$ sudo yum install -y jsoncpp-devel

Let's try jsoncpp out.

[qkj@localhost install]$ touch test.cc

#include <iostream>
#include <string>
#include <jsoncpp/json/json.h>

int main()
{
    Json::Value root;
    Json::Value item1;
    item1["key1"] = "value11";
    item1["key2"] = "value22";

    Json::Value item2;
    item2["key1"] = "value1";
    item2["key2"] = "value2";

    root.append(item1);
    root.append(item2);

    Json::StyledWriter writer;
    std::string s = writer.write(root);
    std::cout << s << std::endl;
    return 0;
}

Here is the result.

[qkj@localhost install]$ g++ test.cc  -ljsoncpp
[qkj@localhost install]$ ./a.out 
[{"key1" : "value11","key2" : "value22"},{"key1" : "value1","key2" : "value2"}
]
[qkj@localhost install]$

Now we can finish the Search function.

void Search(const std::string &query, std::string *json_string)
{
    // 1. Segment the query first; lookups come afterwards.
    std::vector<std::string> words;
    ns_util::JiebaUtil::CutString(query, &words);

    // 2. Run one lookup per word.
    ns_index::InvertedList inverted_list_all; // holds the contents of every inverted list found
    for (std::string s : words)
    {
        boost::to_lower(s); // the index ignores case, so the query must too
        // Inverted lookup first.
        ns_index::InvertedList *inverted_list = index->GetInvertedList(s);
        if (nullptr == inverted_list)
        {
            continue;
        }
        // Found: keep every element of the list.
        // Imperfect: one word may match many documents, and one document may match many keywords.
        inverted_list_all.insert(inverted_list_all.end(), inverted_list->begin(), inverted_list->end());
    }

    // 3. Sort by weight, descending.
    std::sort(inverted_list_all.begin(), inverted_list_all.end(),
              [](const ns_index::InvertedElem &e1, const ns_index::InvertedElem &e2) {
                  return e1.weight > e2.weight;
              });

    // 4. Build the JSON string (serialization).
    Json::Value root;
    for (auto &item : inverted_list_all)
    {
        // Forward lookup.
        ns_index::DocInfo *doc = index->GetForwardIndex(item.doc_id);
        if (nullptr == doc)
        {
            continue;
        }
        // We now have the document content.
        Json::Value elem;
        elem["title"] = doc->title;
        elem["desc"] = doc->content;
        elem["url"] = doc->url;
        root.append(elem); // appended in sorted order
    }
    Json::StyledWriter writer; // use this format for now
    *json_string = writer.write(root);
}

Search Test

Now let's run an end-to-end search test.

#include "searcher.hpp"

const std::string input = "data/raw_html/raw.txt";

int main()
{
    ns_searcher::Searcher *search = new ns_searcher::Searcher();
    search->InitSearcher(input);

    std::string query;
    std::string json_string;
    while (true)
    {
        std::cout << "Enter a keyword# ";
        // std::cin >> query; // would stop at the first space
        if (!std::getline(std::cin, query))
            break; // stop on EOF instead of looping forever
        search->Search(query, &json_string);
        std::cout << json_string << std::endl;
    }
    delete search;
    return 0;
}

Here is the Makefile.

cc=g++
PARSER=parser
SSVR=search_server

.PHONY:all
all:$(PARSER) $(SSVR)

$(SSVR):server.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem -ljsoncpp
$(PARSER):parser.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem

.PHONY:clean
clean:
	rm -f $(PARSER) $(SSVR)

Let's test it. Below is the result for a single HTML document, and the content is far too long. It is better to cut out just a part of it.

{"desc" : "Struct template bound_launcherHomeLibrariesPeopleFAQMoreStruct template bound_launcherboost::process::v2::bound_launcher — Utility class to bind initializers to a launcher. Synopsis// In header: &lt;boost/process/v2/bind_launcher.hpp&gt;template&lt;typename Launcher, typename ... Init&gt; struct bound_launcher {  // construct/copy/destruct  template&lt;typename Launcher_, typename ... Init_&gt;     bound_launcher(Launcher_ &amp;&amp;, Init_ &amp;&amp;...);  // public member functions  template&lt;typename ExecutionContext, typename Args, typename ... Inits&gt;     auto operator()(ExecutionContext &amp;,                     const typename std::enable_if&lt; std::is_convertible&lt; ExecutionContext &amp;, boost::asio::execution_context &amp; &gt;::value, filesystem::path &gt;::type &amp;,                     Args &amp;&amp;, Inits &amp;&amp;...);  template&lt;typename ExecutionContext, typename Args, typename ... Inits&gt;     auto operator()(ExecutionContext &amp;, error_code &amp;,                     const typename std::enable_if&lt; std::is_convertible&lt; ExecutionContext &amp;, boost::asio::execution_context &amp; &gt;::value, filesystem::path &gt;::type &amp;,                     Args &amp;&amp;, Inits &amp;&amp;...);  template&lt;typename Executor, typename Args, typename ... Inits&gt;     auto operator()(Executor,                     const typename std::enable_if&lt; boost::asio::execution::is_executor&lt; Executor &gt;::value||boost::asio::is_executor&lt; Executor &gt;::value, filesystem::path &gt;::type &amp;,                     Args &amp;&amp;, Inits &amp;&amp;...);  template&lt;typename Executor, typename Args, typename ... 
Inits&gt;     auto operator()(Executor, error_code &amp;,                     const typename std::enable_if&lt; boost::asio::execution::is_executor&lt; Executor &gt;::value||boost::asio::is_executor&lt; Executor &gt;::value, filesystem::path &gt;::type &amp;,                     Args &amp;&amp;, Inits &amp;&amp;...);  // private member functions  template&lt;std::size_t ... Idx, typename ExecutionContext, typename Args,            typename ... Inits&gt;     auto invoke(unspecified, ExecutionContext &amp;,                 const typename std::enable_if&lt; std::is_convertible&lt; ExecutionContext &amp;, boost::asio::execution_context &amp; &gt;::value, filesystem::path &gt;::type &amp;,                 Args &amp;&amp;, Inits &amp;&amp;...);  template&lt;std::size_t ... Idx, typename ExecutionContext, typename Args,            typename ... Inits&gt;     auto invoke(unspecified, ExecutionContext &amp;, error_code &amp;,                 const typename std::enable_if&lt; std::is_convertible&lt; ExecutionContext &amp;, boost::asio::execution_context &amp; &gt;::value, filesystem::path &gt;::type &amp;,                 Args &amp;&amp;, Inits &amp;&amp;...);  template&lt;std::size_t ... Idx, typename Executor, typename Args,            typename ... Inits&gt;     auto invoke(unspecified, Executor,                 const typename std::enable_if&lt; boost::asio::execution::is_executor&lt; Executor &gt;::value||boost::asio::is_executor&lt; Executor &gt;::value, filesystem::path &gt;::type &amp;,                 Args &amp;&amp;, Inits &amp;&amp;...);  template&lt;std::size_t ... Idx, typename Executor, typename Args,            typename ... 
Inits&gt;     auto invoke(unspecified, Executor, error_code &amp;,                 const typename std::enable_if&lt; boost::asio::execution::is_executor&lt; Executor &gt;::value||boost::asio::is_executor&lt; Executor &gt;::value, filesystem::path &gt;::type &amp;,                 Args &amp;&amp;, Inits &amp;&amp;...);};DescriptionThis can be used when multiple processes shared some settings, e.g. Template Parameterstypename LauncherThe inner launcher to be used typename ... Initbound_launcher         public       construct/copy/destructtemplate&lt;typename Launcher_, typename ... Init_&gt;   bound_launcher(Launcher_ &amp;&amp; l, Init_ &amp;&amp;... init);bound_launcher public member functionstemplate&lt;typename ExecutionContext, typename Args, typename ... Inits&gt;   auto operator()(ExecutionContext &amp; context,                   const typename std::enable_if&lt; std::is_convertible&lt; ExecutionContext &amp;, boost::asio::execution_context &amp; &gt;::value, filesystem::path &gt;::type &amp; executable,                   Args &amp;&amp; args, Inits &amp;&amp;... inits);template&lt;typename ExecutionContext, typename Args, typename ... Inits&gt;   auto operator()(ExecutionContext &amp; context, error_code &amp; ec,                   const typename std::enable_if&lt; std::is_convertible&lt; ExecutionContext &amp;, boost::asio::execution_context &amp; &gt;::value, filesystem::path &gt;::type &amp; executable,                   Args &amp;&amp; args, Inits &amp;&amp;... inits);template&lt;typename Executor, typename Args, typename ... Inits&gt;   auto operator()(Executor exec,                   const typename std::enable_if&lt; boost::asio::execution::is_executor&lt; Executor &gt;::value||boost::asio::is_executor&lt; Executor &gt;::value, filesystem::path &gt;::type &amp; executable,                   Args &amp;&amp; args, Inits &amp;&amp;... inits);template&lt;typename Executor, typename Args, typename ... 
Inits&gt;   auto operator()(Executor exec, error_code &amp; ec,                   const typename std::enable_if&lt; boost::asio::execution::is_executor&lt; Executor &gt;::value||boost::asio::is_executor&lt; Executor &gt;::value, filesystem::path &gt;::type &amp; executable,                   Args &amp;&amp; args, Inits &amp;&amp;... inits);bound_launcher private member functionstemplate&lt;std::size_t ... Idx, typename ExecutionContext, typename Args,          typename ... Inits&gt;   auto invoke(unspecified, ExecutionContext &amp; context,               const typename std::enable_if&lt; std::is_convertible&lt; ExecutionContext &amp;, boost::asio::execution_context &amp; &gt;::value, filesystem::path &gt;::type &amp; executable,               Args &amp;&amp; args, Inits &amp;&amp;... inits);template&lt;std::size_t ... Idx, typename ExecutionContext, typename Args,          typename ... Inits&gt;   auto invoke(unspecified, ExecutionContext &amp; context, error_code &amp; ec,               const typename std::enable_if&lt; std::is_convertible&lt; ExecutionContext &amp;, boost::asio::execution_context &amp; &gt;::value, filesystem::path &gt;::type &amp; executable,               Args &amp;&amp; args, Inits &amp;&amp;... inits);template&lt;std::size_t ... Idx, typename Executor, typename Args,          typename ... Inits&gt;   auto invoke(unspecified, Executor exec,               const typename std::enable_if&lt; boost::asio::execution::is_executor&lt; Executor &gt;::value||boost::asio::is_executor&lt; Executor &gt;::value, filesystem::path &gt;::type &amp; executable,               Args &amp;&amp; args, Inits &amp;&amp;... inits);template&lt;std::size_t ... Idx, typename Executor, typename Args,          typename ... 
Inits&gt;   auto invoke(unspecified, Executor exec, error_code &amp; ec,               const typename std::enable_if&lt; boost::asio::execution::is_executor&lt; Executor &gt;::value||boost::asio::is_executor&lt; Executor &gt;::value, filesystem::path &gt;::type &amp; executable,               Args &amp;&amp; args, Inits &amp;&amp;... inits);Copyright ? 2006-2012 Julio M. Merino Vidal, Ilya Sokolov,      Felipe Tanus, Jeff Flinn, Boris SchaelingCopyright ? 2016 Klemens D. Morgenstern        Distributed under the Boost Software License, Version 1.0. (See accompanying        file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)      ","title" : "Struct template bound_launcher","url" : "https://www.boost.org/doc/libs/1_83_0/doc/html/boost/process/v2/bound_launcher.html"},

Extracting a Summary

void Search(const std::string &query, std::string *json_string)
{
    // ...
    // 4. Build the JSON string (serialization).
    Json::Value root;
    for (auto &item : inverted_list_all)
    {
        // ...
        // We now have the document content.
        Json::Value elem;
        elem["title"] = doc->title;
        elem["desc"] = make_summary(doc->content, item.word); // extract the summary around the keyword
        elem["url"] = doc->url;
        root.append(elem); // appended in sorted order
    }
    Json::StyledWriter writer; // use this format for now
    *json_string = writer.write(root);
}

We could cut the content at an arbitrary place, but usually we want the part that is related to the search keyword.

std::string make_summary(const std::string &content, const std::string &word)
{
    // Caveat: content comes from the forward index and keeps the document's original case,
    // while word has already been lowercased, so a plain content.find(word) can miss matches.
    // The keyword may also be absent from the content altogether, although that is very rare.
    // std::size_t pos = content.find(word);
    // if (pos == std::string::npos)
    //     return "None";
    auto item = std::search(content.begin(), content.end(), word.begin(), word.end(),
                            [](unsigned char x, unsigned char y) {
                                return std::tolower(x) == std::tolower(y);
                            });
    if (item == content.end())
        return "None";

    // Found: compute the distance from begin() to the iterator.
    std::size_t pos = std::distance(content.begin(), item);
    const std::size_t prev_step = 50;
    const std::size_t next_step = 100;

    // Take 50 bytes before the match and 100 bytes after it.
    std::size_t begin = 0;
    // size_t is unsigned, so subtracting unconditionally would wrap around.
    if (pos > prev_step)
    {
        begin = pos - prev_step;
    }
    std::size_t end = pos + next_step;
    if (end > content.size())
    {
        end = content.size();
    }

    // Guard against an empty range.
    if (end > begin)
    {
        std::string desc = content.substr(begin, end - begin);
        desc += "....";
        return desc;
    }
    else
        return "None";
}
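The case-insensitive lookup is the heart of make_summary, so here it is isolated as a small sketch that can be tried on its own. `FindIgnoreCase` is a hypothetical helper, not a function from searcher.hpp; it handles ASCII case only and returns a byte offset.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Returns the byte offset of the first case-insensitive match of `word`
// in `text`, or std::string::npos if there is none.
std::size_t FindIgnoreCase(const std::string &text, const std::string &word)
{
    assert(!word.empty());
    auto it = std::search(text.begin(), text.end(), word.begin(), word.end(),
                          [](unsigned char a, unsigned char b) {
                              // unsigned char keeps std::tolower well-defined for bytes >= 0x80
                              return std::tolower(a) == std::tolower(b);
                          });
    if (it == text.end())
        return std::string::npos;
    return static_cast<std::size_t>(it - text.begin());
}
```

Searching "Boost FileSystem path" for the lowercased keyword "filesystem" gives offset 6, whereas `std::string::find` would report no match because of the capital letters.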

Let's test it.

Enter a keyword# filesystem
[{"desc" : "boost::asio::execution_context &amp; &gt;::value, filesystem::path &gt;::type &amp;,                     Args &amp;&amp;, Inits &amp;&amp;...);  templ....","title" : "Struct template bound_launcher","url" : "https://www.boost.org/doc/libs/1_83_0/doc/html/boost/process/v2/bound_launcher.html"},.....
]

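Integrated Debugging

Next we check that the results really are sorted by weight in descending order, by adding the weight to the JSON output.

The idea is this: we already hold every element of the inverted lists and use its id to find the document text, and each of those elements also carries its weight, so we can print it alongside.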

Enter a keyword# split
[{"desc" : "Class template split_iteratorHomeLibrariesPeopleFAQMoreClass template split_iteratorboost::algorithm::split_iterato....","title" : "Class template split_iterator","url" : "https://www.boost.org/doc/libs/1_83_0/doc/html/boost/algorithm/split_iterator.html","weight" : 37},{"desc" : "ual, BucketTraits, SizeType, BoolFlags &gt;::type split_bucket_hash_equal_t;  typedef split_bucket_hash_equal_t::key_equal                            ....","title" : "Struct template hashdata_internal","url" : "https://www.boost.org/doc/libs/1_83_0/doc/html/boost/intrusive/hashdata_internal.html","weight" : 20},.....
]

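A few takeaways from debugging:

When computing the weight we take the title first, but the content is the whole page with its tags stripped, which still contains the title. A word in the title is therefore counted twice, so a single title occurrence contributes 10 + 1 = 11.
We do not know jieba's exact segmentation rules, but that does not matter here.
One item is still open: the duplicate-document problem.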
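The double counting in the first point can be reproduced with a small sketch. It assumes plain whitespace tokenization instead of jieba, and that the stored content is the title followed by the body; `ToLower`, `CountToken`, and `PageWeight` are hypothetical helpers written only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>

// Lowercases a copy of `s` (ASCII only).
std::string ToLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Counts the whitespace-separated tokens of `text` equal to `word`, ignoring case.
int CountToken(const std::string &text, const std::string &word)
{
    assert(!word.empty());
    const std::string needle = ToLower(word);
    std::istringstream in(text);
    std::string token;
    int count = 0;
    while (in >> token)
    {
        if (ToLower(token) == needle)
            ++count;
    }
    return count;
}

// Weight of `word` for a page whose stored content is "title + body".
int PageWeight(const std::string &title, const std::string &body, const std::string &word)
{
    const std::string content = title + " " + body; // the de-tagged page still contains the title
    return 10 * CountToken(title, word) + 1 * CountToken(content, word);
}
```

A page titled "Split iterator" whose body never mentions the word still scores 11 for "split": 10 from the title plus 1 from the copy of the title inside the content.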

After debugging, we rename the file.

[qkj@localhost BoostSearchEngine]$ mv server.cc debug.cc

We update the Makefile to match.

cc=g++
PARSER=parser
DUG=debug

.PHONY:all
all:$(PARSER) $(DUG)

$(DUG):debug.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem -ljsoncpp
$(PARSER):parser.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem

.PHONY:clean
clean:
	rm -f $(PARSER) $(DUG)

Search Server

Now we start writing the networked version of the server. First we create the file.

[qkj@localhost BoostSearchEngine]$ touch http_server.cc

#include "searcher.hpp"

int main()
{
    return 0;
}

We update the Makefile here as well.

cc=g++
PARSER=parser
DUG=debug
HTTP_SERVER=http_server

.PHONY:all
all:$(PARSER) $(DUG) $(HTTP_SERVER)

$(DUG):debug.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem -ljsoncpp
$(PARSER):parser.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
$(HTTP_SERVER):http_server.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem -ljsoncpp

.PHONY:clean
clean:
	rm -f $(PARSER) $(DUG) $(HTTP_SERVER)

Let's test the build.

[qkj@localhost BoostSearchEngine]$ make
g++ -o parser parser.cc -std=c++11 -lboost_system -lboost_filesystem
g++ -o debug debug.cc -std=c++11  -lboost_system -lboost_filesystem -ljsoncpp
g++ -o http_server http_server.cc -std=c++11 -lpthread -lboost_system -lboost_filesystem -ljsoncpp
[qkj@localhost BoostSearchEngine]$ ll
total 1548
lrwxrwxrwx. 1 qkj qkj     43 Sep  9 06:00 cppjieba -> /home/qkj/install/cppjieba/include/cppjieba
drwxrwxr-x. 4 qkj qkj     35 Sep  9 01:03 data
-rwxrwxr-x. 1 qkj qkj 658128 Sep  9 20:02 debug
-rw-rw-r--. 1 qkj qkj    483 Sep  9 09:16 debug.cc
lrwxrwxrwx. 1 qkj qkj     32 Sep  9 06:01 dict -> /home/qkj/install/cppjieba/dict/
-rwxrwxr-x. 1 qkj qkj 401400 Sep  9 20:02 http_server
-rw-rw-r--. 1 qkj qkj     51 Sep  9 20:02 http_server.cc
-rw-rw-r--. 1 qkj qkj   6102 Sep  9 08:33 index.hpp
-rw-rw-r--. 1 qkj qkj    446 Sep  9 19:58 Makefile
-rwxrwxr-x. 1 qkj qkj 481760 Sep  9 20:02 parser
-rw-rw-r--. 1 qkj qkj   6361 Sep  9 02:47 parser.cc
-rw-rw-r--. 1 qkj qkj   4626 Sep  9 19:42 searcher.hpp
-rw-rw-r--. 1 qkj qkj   1779 Sep  9 08:27 util.hpp

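Upgrading gcc

We could write the networking layer ourselves, and may do so in a later upgrade, but here we use the cpp-httplib library, which is simple to use. There is one catch: cpp-httplib needs a fairly recent compiler. With an old one it either fails to compile or misbehaves at run time.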

[qkj@localhost BoostSearchEngine]$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

Now we upgrade directly.

[qkj@localhost BoostSearchEngine]$ sudo yum install centos-release-scl
[qkj@localhost BoostSearchEngine]$ sudo yum install devtoolset-8-gcc*
[qkj@localhost BoostSearchEngine]$ scl enable devtoolset-8 bash
[qkj@localhost BoostSearchEngine]$ source /opt/rh/devtoolset-8/enable
[qkj@localhost BoostSearchEngine]$ mv /usr/bin/gcc /usr/bin/gcc-4.8.5
[qkj@localhost BoostSearchEngine]$ ln -s /opt/rh/devtoolset-8/root/bin/gcc /usr/bin/gcc
[qkj@localhost BoostSearchEngine]$ mv /usr/bin/g++ /usr/bin/g++-4.8.5
[qkj@localhost BoostSearchEngine]$ ln -s /opt/rh/devtoolset-8/root/bin/g++ /usr/bin/g++
[qkj@localhost BoostSearchEngine]$ mv /usr/bin/c++ /usr/bin/c++-4.8.5
[qkj@localhost BoostSearchEngine]$ ln -s /opt/rh/devtoolset-8/root/bin/c++ /usr/bin/c++
[qkj@localhost BoostSearchEngine]$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr --mandir=/opt/rh/devtoolset-8/root/usr/share/man --infodir=/opt/rh/devtoolset-8/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-x86_64-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC) 
[qkj@localhost BoostSearchEngine]$

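Bringing in the cpp-httplib Library

We download version 0.7.15, because newer versions may fail at run time in this environment.
Here we download it to the desktop and then transfer it to the virtual machine, either by drag and drop or with rz; both ways work.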

[qkj@localhost install]$ rz -E
[qkj@localhost install]$ ll
total 596
-rwxrwxr-x. 1 qkj qkj  15424 Sep  9 09:09 a.out
drwxr-xr-x. 8 qkj qkj   4096 Aug  8 14:40 boost_1_83_0
-rw-r--r--. 1 qkj qkj 584053 Sep  9 20:23 cpp-httplib-v0.7.15.zip
drwxrwxr-x. 8 qkj qkj    215 Sep  9 03:38 cppjieba
-rw-rw-r--. 1 qkj qkj    421 Sep  9 09:09 test.cc
[qkj@localhost install]$

Then we create a symbolic link to it inside our project.

[qkj@localhost BoostSearchEngine]$ ln -s /home/qkj/install/cpp-httplib-v0.7.15/ cpp-httplib
[qkj@localhost BoostSearchEngine]$ ll
total 1548
lrwxrwxrwx. 1 qkj qkj     38 Sep  9 20:30 cpp-httplib -> /home/qkj/install/cpp-httplib-v0.7.15/
lrwxrwxrwx. 1 qkj qkj     43 Sep  9 06:00 cppjieba -> /home/qkj/install/cppjieba/include/cppjieba
drwxrwxr-x. 4 qkj qkj     35 Sep  9 01:03 data
-rwxrwxr-x. 1 qkj qkj 658128 Sep  9 20:02 debug
-rw-rw-r--. 1 qkj qkj    483 Sep  9 09:16 debug.cc
lrwxrwxrwx. 1 qkj qkj     32 Sep  9 06:01 dict -> /home/qkj/install/cppjieba/dict/
-rwxrwxr-x. 1 qkj qkj 401400 Sep  9 20:02 http_server
-rw-rw-r--. 1 qkj qkj     51 Sep  9 20:02 http_server.cc
-rw-rw-r--. 1 qkj qkj   6102 Sep  9 08:33 index.hpp
-rw-rw-r--. 1 qkj qkj    446 Sep  9 19:58 Makefile
-rwxrwxr-x. 1 qkj qkj 481760 Sep  9 20:02 parser
-rw-rw-r--. 1 qkj qkj   6361 Sep  9 02:47 parser.cc
-rw-rw-r--. 1 qkj qkj   4626 Sep  9 19:42 searcher.hpp
-rw-rw-r--. 1 qkj qkj   1779 Sep  9 08:27 util.hpp
[qkj@localhost BoostSearchEngine]$

Testing cpp-httplib

Let's test the httplib library, starting with a build.

[qkj@localhost BoostSearchEngine]$ make
g++ -o http_server http_server.cc -std=c++11  -lboost_system -lboost_filesystem -ljsoncpp
/opt/rh/devtoolset-8/root/usr/lib/gcc/x86_64-redhat-linux/8/libstdc++_nonshared.a(thread48.o): In function `std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)())':
(.text._ZNSt6thread15_M_start_threadESt10unique_ptrINS_6_StateESt14default_deleteIS1_EEPFvvE+0x11): undefined reference to `pthread_create'
/opt/rh/devtoolset-8/root/usr/lib/gcc/x86_64-redhat-linux/8/libstdc++_nonshared.a(thread48.o): In function `std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)())':
(.text._ZNSt6thread15_M_start_threadESt10shared_ptrINS_10_Impl_baseEEPFvvE+0x60): undefined reference to `pthread_create'
/tmp/ccGWpu61.o: In function `std::thread::thread<httplib::ThreadPool::worker, , void>(httplib::ThreadPool::worker&&)':
http_server.cc:(.text._ZNSt6threadC2IN7httplib10ThreadPool6workerEJEvEEOT_DpOT0_[_ZNSt6threadC5IN7httplib10ThreadPool6workerEJEvEEOT_DpOT0_]+0x21): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
make: *** [http_server] Error 1
[qkj@localhost BoostSearchEngine]$

The build fails because httplib needs the pthread library to be linked in.

cc=g++
PARSER=parser
DUG=debug
HTTP_SERVER=http_server

.PHONY:all
all:$(PARSER) $(DUG) $(HTTP_SERVER)

$(DUG):debug.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem -ljsoncpp
$(PARSER):parser.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
$(HTTP_SERVER):http_server.cc
	$(cc) -o $@ $^ -std=c++11 -lpthread -lboost_system -lboost_filesystem -ljsoncpp

.PHONY:clean
clean:
	rm -f $(PARSER) $(DUG) $(HTTP_SERVER)

We continue testing by creating one simple route. The library is very pleasant to use.

Here is our code.

#include "cpp-httplib/httplib.h"

int main()
{
    httplib::Server svr;
    // The route pattern is matched against the request path, so it must start with '/'.
    svr.Get("/hi", [](const httplib::Request &req, httplib::Response &rsp) {
        rsp.set_content("hello world!", "text/plain; charset=utf-8");
    });
    svr.listen("0.0.0.0", 8081);
    return 0;
}

[qkj@localhost install]$ netstat -ntlp
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:44227         0.0.0.0:*               LISTEN      1903/node           
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:8081            0.0.0.0:*               LISTEN      4191/./http_server  
tcp        0      0 192.168.122.1:53        0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      -                   
tcp6       0      0 :::111                  :::*                    LISTEN      -                   
tcp6       0      0 :::22                   :::*                    LISTEN      -                   
tcp6       0      0 ::1:631                 :::*                    LISTEN      -                   
tcp6       0      0 ::1:25                  :::*                    LISTEN      -                   
[qkj@localhost install]$

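Opening the port

This is because our virtual machine has not opened the port to access from the external network, so we need to open it. Let's look at which ports are currently open; the rules for opening a port are below.

Opening ports on CentOS

Setting the root directory

Generally, a web server has a root directory. Creating one like this is enough.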

[qkj@localhost BoostSearchEngine]$ mkdir wwwroot

Here we set the root directory on the server.

#include <string>
#include "cpp-httplib/httplib.h"

const std::string root_path = "./wwwroot";

int main()
{
    httplib::Server svr;
    // Set the root directory
    svr.set_base_dir(root_path.c_str());
    svr.Get("/hi", [](const httplib::Request &req, httplib::Response &rsp) {
        rsp.set_content("hello world!", "text/plain; charset=utf-8");
    });
    svr.listen("0.0.0.0", 8080);
    return 0;
}

Let's continue testing.

Note that this happens because there is nothing under our root directory yet. Generally, the home page is a file named index.html. Let's create it.

[qkj@localhost wwwroot]$ touch index.html
[qkj@localhost wwwroot]$ ll
total 8
-rw-rw-r--. 1 qkj qkj    0 Sep  9 21:10 index.html
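
To make the role of the root directory concrete, here is a minimal sketch of how a static file server might map a request path to a file under wwwroot. ResolvePath is a hypothetical helper written only for this illustration; it is not cpp-httplib's implementation, and the fallback to index.html for directory paths is simply the usual convention described above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (an assumption for illustration, not a cpp-httplib API):
// maps a request path such as "/" or "/style.css" to a file path under `root`.
// Returns an empty string when the path is rejected.
std::string ResolvePath(const std::string &root, const std::string &url_path)
{
    // Reject anything that is not an absolute request path or that tries
    // to climb out of the root directory with "..".
    if (url_path.empty() || url_path[0] != '/' ||
        url_path.find("..") != std::string::npos)
    {
        return "";
    }

    std::string path = root + url_path;

    // A request for a directory ("/" or "/docs/") is served from its index.html.
    if (path.back() == '/')
    {
        path += "index.html";
    }
    return path;
}
```

Under this mapping, a request for / is looked up as ./wwwroot/index.html, which is why an empty root directory has nothing to serve.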

Writing the search server

Now we can write our server. It is very simple.

#include <iostream>
#include <string>
#include "cpp-httplib/httplib.h"
#include "searcher.hpp"

const std::string root_path = "./wwwroot";
const std::string input = "data/raw_html/raw.txt";

int main()
{
    // Initialize the searcher
    ns_searcher::Searcher search;
    search.InitSearcher(input);

    httplib::Server svr;
    svr.set_base_dir(root_path.c_str()); // set the root directory
    svr.Get("/s", [&search](const httplib::Request &req, httplib::Response &rsp) {
        if (req.has_param("word") == false)
        {
            rsp.set_content("A search keyword is required", "text/plain; charset=utf-8");
            return;
        }
        std::string word = req.get_param_value("word");
        std::cout << "User searched for: " << word << std::endl;
        std::string json_string;
        search.Search(word, &json_string);
        rsp.set_content(json_string, "application/json");
    });
    std::cout << "Server started successfully" << std::endl;
    svr.listen("0.0.0.0", 8081);
    return 0;
}
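
The handler above depends on the word parameter of a request such as /s?word=boost. As a rough illustration of what that lookup involves, the sketch below splits a request target into key/value pairs. ParseQuery is a hypothetical helper, not the parser cpp-httplib uses, and it deliberately skips percent-decoding.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper (an assumption for illustration): extracts the query
// parameters from a request target like "/s?word=boost&page=2".
// Percent-decoding is intentionally left out of this sketch.
std::map<std::string, std::string> ParseQuery(const std::string &target)
{
    std::map<std::string, std::string> params;

    std::size_t pos = target.find('?');
    if (pos == std::string::npos)
    {
        return params; // no query string at all
    }

    const std::string query = target.substr(pos + 1);
    std::size_t start = 0;
    while (start <= query.size())
    {
        // Each pair ends at the next '&' or at the end of the string.
        std::size_t end = query.find('&', start);
        if (end == std::string::npos)
        {
            end = query.size();
        }

        const std::string pair = query.substr(start, end - start);
        if (!pair.empty())
        {
            std::size_t eq = pair.find('=');
            if (eq == std::string::npos)
            {
                params[pair] = ""; // key without a value
            }
            else
            {
                params[pair.substr(0, eq)] = pair.substr(eq + 1);
            }
        }
        start = end + 1;
    }
    return params;
}
```

A request without the word key yields a map that does not contain it, which corresponds to the has_param("word") == false branch in the handler.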

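Front-end code

The front end is optional for this project and we won't cover it in depth here. If you want to learn it, see the site below.

HTML: defines the page structure, the skeleton of the page
CSS: defines the page style, the flesh of the page
JS: handles front-end/back-end interaction, the soul of the page

Recommended site for learning front-end development: http://www.w3school.com.cn

Page structure

The page structure we designed looks like this.

Based on the above, our HTML can be written like this.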

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>boost search engine</title>
</head>
<body>
    <div class="container">
        <div class="search">
            <input type="text" value="Enter search keywords...">
            <button>Search</button>
        </div>
        <div class="result">
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.</p>
                <i>https://search.gitee.com/?skin=rec&type=repository&q=cpp-httplib</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.</p>
                <i>https://search.gitee.com/?skin=rec&type=repository&q=cpp-httplib</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.</p>
                <i>https://search.gitee.com/?skin=rec&type=repository&q=cpp-httplib</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.</p>
                <i>https://search.gitee.com/?skin=rec&type=repository&q=cpp-httplib</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.</p>
                <i>https://search.gitee.com/?skin=rec&type=repository&q=cpp-httplib</i>
            </div>
        </div>
    </div>
</body>
</html>

Page style

The page above looks a bit plain, so let's style it.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>boost search engine</title>
    <style>
        /* Remove all default margins and padding (the HTML box model) */
        * {
            /* Set the margin */
            margin: 0;
            /* Set the padding */
            padding: 0;
        }
        /* Make the body fill 100% of the html element */
        html,
        body {
            height: 100%;
        }
        /* Class selector: .container */
        .container {
            /* Set the width of the div */
            width: 800px;
            /* Center it horizontally through the margins */
            margin: 0px auto;
            /* Top margin, to keep some distance from the top of the page */
            margin-top: 15px;
        }
        /* Compound selector: search inside container */
        .container .search {
            /* Same width as the parent element */
            width: 100%;
            /* Height is 52px */
            height: 52px;
        }
        /* Select the input element first, then set its properties. input: tag selector */
        /* The height set on input does not include the border */
        .container .search input {
            /* Float left */
            float: left;
            width: 600px;
            height: 50px;
            /* Border properties: width, style, color */
            border: 1px solid black;
            /* Remove the right border of the input box */
            border-right: none;
            /* Left padding, so the text does not touch the left border */
            padding-left: 10px;
            /* Font color and size inside the input */
            color: #CCC;
            font-size: 15px;
        }
        /* Select the button element first, then set its properties. button: tag selector */
        .container .search button {
            /* Float left */
            float: left;
            width: 150px;
            height: 52px;
            /* Button background color, #4e6ef2 */
            background-color: #4e6ef2;
            /* Button font color */
            color: #FFF;
            /* Font size */
            font-size: 19px;
            font-family: Georgia, 'Times New Roman', Times, serif;
        }
        .container .result {
            width: 100%;
        }
        .container .result .item {
            margin-top: 15px;
        }
        .container .result .item a {
            /* Block-level element, on its own line */
            display: block;
            /* Remove the underline of the a tag */
            text-decoration: none;
            /* Font size of the text in the a tag */
            font-size: 20px;
            /* Font color */
            color: #4e6ef2;
        }
        .container .result .item a:hover {
            /* Effect when the mouse hovers over the a tag */
            text-decoration: underline;
        }
        .container .result .item p {
            margin-top: 5px;
            font-size: 16px;
            font-family: 'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
        }
        .container .result .item i {
            /* Block-level element, on its own line */
            display: block;
            /* Cancel the italic style */
            font-style: normal;
            color: green;
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="search">
            <input type="text" value="Enter search keywords...">
            <button>Search</button>
        </div>
        <div class="result">
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.</p>
                <i>https://search.gitee.com/?skin=rec&type=repository&q=cpp-httplib</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.</p>
                <i>https://search.gitee.com/?skin=rec&type=repository&q=cpp-httplib</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.</p>
                <i>https://search.gitee.com/?skin=rec&type=repository&q=cpp-httplib</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.</p>
                <i>https://search.gitee.com/?skin=rec&type=repository&q=cpp-httplib</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary. This is the summary. This is the summary. This is the summary. This is the summary.</p>
                <i>https://search.gitee.com/?skin=rec&type=repository&q=cpp-httplib</i>
            </div>
        </div>
    </div>
</body>
</html>

Front-end/back-end interaction

Next we connect the front end to the back end. Again, we paste the code directly.

<!-- Build the skeleton -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script src="http://code.jquery.com/jquery-2.1.1.min.js"></script>
    <title>boost search engine</title>
    <!-- Reset margins and padding to zero -->
    <style>
        * {
            /* Set the margin */
            margin: 0;
            /* Set the padding */
            padding: 0;
        }
        html,
        body {
            height: 100%;
        }
        /* Center the content. A selector that starts with a dot is a class selector */
        .container {
            /* This is the outermost frame */
            width: 800px;
            margin: 0px auto;
            margin-top: 15px;
        }
        /* Compound selector */
        .container .search {
            width: 100%;
            /* Why 52 is explained later */
            height: 52px;
        }
        .container .search input {
            /* Float */
            float: left;
            width: 600px;
            height: 50px;
            /* Set the border */
            border: 1px solid black;
            /* Remove the right border */
            border-right: none;
            padding-left: 10px;
            color: #ccc;
            font-size: 15px;
        }
        .container .search button {
            /* Float */
            float: left;
            width: 120px;
            height: 52px;
            /* Background color */
            background-color: #4e6ef2;
            /* Font color */
            color: #fff;
            /* Font size */
            font-size: 19px;
            /* Font family */
            font-family: 'Times New Roman', Times, serif;
        }
        .container .result {
            width: 100%;
        }
        .container .result .item {
            margin-top: 15px;
        }
        .container .result .item a {
            display: block;
            /* Remove the underline */
            text-decoration: none;
            font-size: 20px;
            color: #4e6ef2;
        }
        .container .result .item a:hover {
            text-decoration: underline;
        }
        .container .result .item p {
            margin: 5px;
            font-size: 16px;
            font-family: 'Times New Roman', Times, serif;
        }
        .container .result .item i {
            display: block;
            /* Cancel italics */
            font-style: normal;
            color: green;
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="search">
            <input type="text" value="Enter search keywords...">
            <button onclick="Search()">Search</button>
        </div>
        <div class="result">
            <!-- Page content is generated dynamically -->
            <!-- <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary, this is the summary, this is the summary, this is the summary</p>
                <i>https://www.bilibili.com/</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary.</p>
                <i>https://www.bilibili.com/</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary.</p>
                <i>https://www.bilibili.com/</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary.</p>
                <i>https://www.bilibili.com/</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary.</p>
                <i>https://www.bilibili.com/</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary.</p>
                <i>https://www.bilibili.com/</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the summary. This is the summary.</p>
                <i>https://www.bilibili.com/</i>
            </div> -->
        </div>
    </div>
    <script>
        function Search() {
            // alert("hello js");
            // 1. Extract the query with jQuery
            let query = $(".container .search input").val();
            if (query == '' || query == null) {
                return;
            }
            console.log("query = " + query);
            // 2. Send the HTTP request
            $.ajax({
                type: "GET",
                url: "/s?word=" + query,
                success: function (data) {
                    console.log(data);
                    // Build the new page content dynamically
                    BuildHtml(data);
                }
            });
        }
        function BuildHtml(data) {
            // Fixed: the original checked an undefined variable `date`
            if (data == '' || data == null) {
                document.write("No results found");
                return;
            }
            let result_lable = $(".container .result");
            result_lable.empty();
            for (let elem of data) {
                // console.log(elem.title);
                // console.log(elem.url);
                let a_lable = $("<a>", {
                    text: elem.title,
                    href: elem.url,
                    target: "_blank"
                });
                let p_lable = $("<p>", {
                    text: elem.desc
                });
                let i_lable = $("<i>", {
                    text: elem.url
                });
                let div_lable = $("<div>", {
                    class: "item"
                });
                a_lable.appendTo(div_lable);
                p_lable.appendTo(div_lable);
                i_lable.appendTo(div_lable);
                div_lable.appendTo(result_lable);
            }
        }
    </script>
</body>
</html>

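Project result

Now we can use our project as a search service. Let's take a look.

Project supplements

Let's add a few things; some small details have not been covered yet.

Improving deduplication

As mentioned in the search service section, when a query is split into several keywords, the same document id can appear in the results of more than one keyword. In that case we need to deduplicate, in three steps.

Find the duplicated ids
Add up the weights recorded for each id
Rebuild the result list, then look up the documents and build the JSON string

Below is the situation we ran into.

This is what we need to handle.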

struct InvertedElemPrint
{
    uint64_t doc_id;                // document id
    int weight;                     // weight
    std::vector<std::string> words; // one id can correspond to several words
    InvertedElemPrint() : doc_id(0), weight(0) {}
};

class Searcher
{
public:
    Searcher() {}
    // ....
    void Search(const std::string &query, std::string *json_string)
    {
        // 1. Tokenize the query first, then look up each token
        std::vector<std::string> words;
        ns_util::JiebaUtil::CutString(query, &words);

        // 2. Trigger a search for each token in turn
        std::unordered_map<uint64_t, InvertedElemPrint> tokens_map; // doc id -> InvertedElemPrint
        std::vector<InvertedElemPrint> inverted_list_all;           // deduplicated results
        for (std::string s : words)
        {
            boost::to_lower(s);
            // Look up the inverted index first
            ns_index::InvertedList *inverted_list = index->GetInvertedList(s);
            if (nullptr == inverted_list)
            {
                continue;
            }
            // Walk the inverted list to collect all document ids
            for (const auto &elem : *inverted_list)
            {
                // operator[] returns the existing entry for this id, or creates a new one
                auto &item = tokens_map[elem.doc_id];
                item.doc_id = elem.doc_id;
                // Record the keyword as well
                item.words.push_back(elem.word);
                // Accumulate the weight
                item.weight += elem.weight;
            }
            // At this point entries with the same id have been merged
        }
        // Copy the merged InvertedElemPrint entries into the result vector
        for (const auto &item : tokens_map)
        {
            inverted_list_all.push_back(item.second);
        }

        // 3. Sort by relevance in descending order, using the merged weights
        std::sort(inverted_list_all.begin(), inverted_list_all.end(),
                  [](const InvertedElemPrint &e1, const InvertedElemPrint &e2) {
                      return e1.weight > e2.weight;
                  });

        // 4. Build the JSON string (serialization)
        Json::Value root;
        for (auto &item : inverted_list_all)
        {
            // Look up the forward index
            ns_index::DocInfo *doc = index->GetForwardIndex(item.doc_id);
            if (nullptr == doc)
            {
                continue;
            }
            // We now have the document content
            Json::Value elem;
            elem["title"] = doc->title;
            elem["desc"] = make_summary(doc->content, item.words[0]); // the summary is extracted around a keyword
            elem["url"] = doc->url;
            // for debug
            //  elem["id"] = (int)item.doc_id;
            //  elem["weight"] = item.weight; // converted to string automatically
            root.append(elem); // appended in sorted order
        }
        Json::StyledWriter writer; // we use this format for now
        *json_string = writer.write(root);
    }

private:
    // ....
    ns_index::Index *index; // the index used for lookups
};
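
The merge step can also be tried in isolation. The sketch below reproduces the same idea (group hits by document id, sum the weights, sort in descending order) with simplified stand-in types. Hit, Merged and MergeHits are names assumed for this example and are not part of the project.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-ins for the project's types (assumed shapes for illustration).
struct Hit
{
    uint64_t doc_id;
    std::string word;
    int weight;
};

struct Merged
{
    uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

// Groups hits by document id, sums their weights, and returns the merged
// entries sorted by total weight in descending order.
std::vector<Merged> MergeHits(const std::vector<Hit> &hits)
{
    std::unordered_map<uint64_t, Merged> by_id;
    for (const auto &h : hits)
    {
        Merged &m = by_id[h.doc_id]; // creates a zeroed entry on first sight
        m.doc_id = h.doc_id;
        m.weight += h.weight;
        m.words.push_back(h.word);
    }

    std::vector<Merged> out;
    out.reserve(by_id.size());
    for (const auto &kv : by_id)
    {
        out.push_back(kv.second);
    }

    std::sort(out.begin(), out.end(),
              [](const Merged &a, const Merged &b) { return a.weight > b.weight; });
    return out;
}
```

For example, if document 1 is hit by two keywords with weights 3 and 4 and document 2 by one keyword with weight 5, document 1 ends up first with a combined weight of 7.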

Adding logging

To add logging, we create a new file.

[qkj@localhost BoostSearchEngine]$ touch log.hpp

#pragma once
#include <iostream>
#include <string>
#include <ctime>

#define NORMAL 1
#define WARNING 2
#define DEBUG 3
#define FATAL 4

// #LEVEL turns the level name into a string, so LOG(NORMAL, ...) prints [NORMAL]
#define LOG(LEVEL, MESSAGE) log(#LEVEL, MESSAGE, __FILE__, __LINE__)

// inline, because this header may be included from more than one source file
inline void log(std::string level, std::string message, std::string file, int line)
{
    std::cout << "[" << level << "]"
              << "[" << time(nullptr) << "]"
              << "[" << message << "]"
              << "[" << file << "]"
              << "[:" << line << "]" << std::endl;
}
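
To see what the macro produces, here is a small variant that returns the log line as a string instead of printing it, which makes the result easy to inspect. FormatLog and LOG_LINE are hypothetical names used only in this sketch, the timestamp is left out so the output is deterministic, and the bracket layout is simplified.

```cpp
#include <cassert>
#include <sstream>
#include <string>

#define NORMAL 1
#define FATAL 4

// Hypothetical variant of the LOG macro for illustration. The # operator
// stringizes the argument as written, so the level appears by name
// ("NORMAL"), not by its numeric value.
#define LOG_LINE(LEVEL, MESSAGE) FormatLog(#LEVEL, MESSAGE, __FILE__, __LINE__)

// Builds "[level][message][file:line]" and returns it.
std::string FormatLog(const std::string &level, const std::string &message,
                      const std::string &file, int line)
{
    std::ostringstream oss;
    oss << "[" << level << "]"
        << "[" << message << "]"
        << "[" << file << ":" << line << "]";
    return oss.str();
}
```

A call such as LOG_LINE(NORMAL, "index built") yields a line beginning with [NORMAL][index built], followed by the file name and line number of the call site.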

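Adding logging to the index

Adding logging to the searcher

Adding logging to the server

Project extensions

Here we can extend the project.

Improving the summary

As we know, stop words can be removed during word segmentation. We have not done that so far, because removing stop words places a heavy demand on resources. So we treat it as an extension here. jieba also ships a stop-word list, so let's use it.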

class JiebaUtil
{
public:
    static void CutString(const std::string &src, std::vector<std::string> *out)
    {
        assert(out);
        ns_util::JiebaUtil::get_instance()->CutStringHelper(src, out);
    }

private:
    /// @brief Word segmentation happens here
    /// @param src
    /// @param out
    void CutStringHelper(const std::string &src, std::vector<std::string> *out)
    {
        jieba.CutForSearch(src, *out);
        for (auto iter = out->begin(); iter != out->end();)
        {
            auto it = stop_words.find(*iter);
            if (it != stop_words.end())
            {
                // This is a stop word, so remove it.
                // Use the iterator returned by erase to avoid iterator invalidation.
                // std::cout << *iter << std::endl;
                iter = out->erase(iter);
            }
            else
            {
                iter++;
            }
        }
    }

    static JiebaUtil *get_instance()
    {
        static std::mutex mtx;
        if (nullptr == instance)
        {
            mtx.lock();
            if (nullptr == instance)
            {
                instance = new JiebaUtil;
                instance->InitJiebaUtil();
            }
            mtx.unlock();
        }
        return instance;
    }

    // Load the stop words
    void InitJiebaUtil()
    {
        std::ifstream in(STOP_WORD_PATH);
        if (in.is_open() == false)
        {
            LOG(FATAL, "failed to load stop words");
            return;
        }
        std::string line;
        while (std::getline(in, line))
        {
            stop_words.insert(std::make_pair(line, true));
        }
        in.close();
    }

private:
    static JiebaUtil *instance;
    cppjieba::Jieba jieba;
    std::unordered_map<std::string, bool> stop_words;
    JiebaUtil() : jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH) {}
    // copy constructor etc. are deleted
};
JiebaUtil *JiebaUtil::instance = nullptr;
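
The filtering loop is the part that is easy to get wrong, so here it is as a standalone sketch that runs without cppjieba. RemoveStopWords is a hypothetical free function, and it uses std::unordered_set where the class above uses std::unordered_map<std::string, bool>; the erase pattern is the same.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical standalone version of the stop-word filter (illustration only).
// Removes every element of *out that appears in stop_words, in place.
void RemoveStopWords(const std::unordered_set<std::string> &stop_words,
                     std::vector<std::string> *out)
{
    assert(out);
    for (auto iter = out->begin(); iter != out->end();)
    {
        if (stop_words.count(*iter) != 0)
        {
            // erase invalidates iter; continue from the iterator it returns.
            iter = out->erase(iter);
        }
        else
        {
            ++iter;
        }
    }
}
```

Each erase on a std::vector shifts the remaining elements, so this loop is quadratic in the worst case. For the short token lists produced from a single query that cost is unlikely to matter.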

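Deploying the service in the background

We can run it as a daemon process.

The nohup command

Running nohup:

nohup: runs the service so that it ignores the hangup signal (SIGHUP), much like a daemon, so the service stays reachable after XShell is closed.

For example: nohup ./http_server

To run the program in the background, append & to the command. The program then no longer occupies the terminal.

For example: nohup ./http_server &

Files created by nohup:

After running the nohup command above, a file named nohup.out is created to store the log output. You can view it with cat.

setsid

We can also daemonize the process in the following way.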

#pragma once

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

void daemonize()
{
    int fd = 0;
    // 1. Ignore SIGPIPE
    signal(SIGPIPE, SIG_IGN);
    // 2. Change the working directory of the process
    // chdir();
    // 3. Make sure we are not a process group leader
    if (fork() > 0)
        exit(0);
    // 4. Become the leader of a new, independent session
    setsid();
    // 5. Redirect 0, 1, 2
    if ((fd = open("/dev/null", O_RDWR)) != -1) // fd == 3
    {
        dup2(fd, STDIN_FILENO);
        dup2(fd, STDOUT_FILENO);
        dup2(fd, STDERR_FILENO);
        // 6. Close the fd we no longer need
        if (fd > STDERR_FILENO)
            close(fd);
    }
    // Alternative for step 6: close(0, 1, 2)
    // Strongly discouraged
}

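Other extensions

In the search engine, the weights decide the order in which results are displayed, and we can layer additional algorithms on top of them, such as paid ranking, hot-topic statistics, or extra weight for certain documents.
We can also use a database to add user login and registration, which brings MySQL into the project.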
