
news 2025/7/1 6:56:47

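Preface

Lucene full-text retrieval consists of two main processes: indexing and searching. Indexing writes documents to disk and builds index files in a specified format. The compression and data-structure design involved are quite ingenious. These notes record what I learned about StoredField, DocValue, and the on-disk BKD tree.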

References:

https://juejin.cn/post/6978437292549636132
https://juejin.cn/user/2559318800998141/posts
Lucene 原理與代碼分析完整版.pdf (Lucene: Principles and Code Analysis, complete edition)
https://lucene.apache.org/core/9_9_0/core/org/apache/lucene/codecs/lucene99/package-summary.html#package.description
美團外賣搜索基于 Elasticsearch 的優化實踐 (Meituan Waimai search: optimization practice based on Elasticsearch)

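1. Lucene data categories

Index data in Lucene is stored in several logical layers. From largest to smallest:

index: an index represents the complete storage of one kind of data
segment: an index consists of one or more segments
doc: a segment stores documents (docs) one by one; each segment is a collection of docs
field: each doc consists of multiple fields; the field holds the actual text, similar to one property of a JSON object
term: a field's value can be tokenized into multiple terms; the term is the most basic unit. Each field can also store its own term vector, which can be used for features such as similarity computation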
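The containment hierarchy above can be pictured with a toy data model. This is only an illustrative sketch: the class names `Index`, `Segment`, and `Doc` and the whitespace tokenizer are assumptions made for the example, not Lucene's actual classes or analyzers.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    # field name -> raw field value, similar to one JSON object
    fields: Dict[str, str]

    def terms(self, field_name: str) -> List[str]:
        # Naive lowercase/whitespace tokenizer standing in for a real analyzer.
        return self.fields.get(field_name, "").lower().split()


@dataclass
class Segment:
    name: str
    docs: List[Doc] = field(default_factory=list)


@dataclass
class Index:
    segments: List[Segment] = field(default_factory=list)

    def doc_count(self) -> int:
        # An index is the union of the docs in all of its segments.
        return sum(len(segment.docs) for segment in self.segments)


index = Index([
    Segment("_0", [Doc({"title": "Lucene in Action"}), Doc({"title": "Search basics"})]),
    Segment("_1", [Doc({"title": "BKD tree notes"})]),
])
```

Here `index.doc_count()` walks index -> segment -> doc, and `Doc.terms("title")` goes one level further, from field to term.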

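By kind of data, Lucene divides what it handles into these categories:

PostingList: the postings (inverted) list, i.e. inverted-index data of the form term -> [doc1, doc3, doc5].
BlockTree: the mapping from a term to its PostingList. The index into this mapping is represented with an FST, a tree-like structure similar to a trie, which is why Lucene calls it BlockTree. I prefer to call it TermDict.
StoredField: storage for the original data of ordinary fields.
DocValue: key-value data stored column-wise. It is mainly used for numeric and date fields and speeds up sorting and filtering on a field.
TermVector: term vector information. For each document and field it records the terms and their frequencies, and optionally their positions and offsets. It is a per-document list of terms, not a dense numeric embedding; it is typically used for things like highlighting or finding similar documents.
Norms: stores normalization information, for example weighting certain fields.
PointValue: information used to speed up range queries.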
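The term -> [doc1, doc3, doc5] shape of a postings list can be shown with a few lines of code. This is a minimal sketch assuming whitespace tokenization and in-memory dictionaries; Lucene's real postings are compressed on disk and reached through the term dictionary, none of which is modeled here. The function name `build_postings` is made up for the example.

```python
from collections import defaultdict
from typing import Dict, List


def build_postings(docs: List[str]) -> Dict[str, List[int]]:
    """Map each term to the ascending list of doc ids that contain it."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() so that a doc id is recorded once per term.
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
    # Sort terms so the result resembles an ordered term dictionary.
    return dict(sorted(postings.items()))


docs = [
    "lucene stores postings",
    "postings map terms to docs",
    "lucene uses an fst",
]
postings = build_postings(docs)
```

Looking up `postings["lucene"]` then returns the doc ids without scanning any document text, which is the whole point of the inverted layout.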

Data maintained by each segment index in Lucene 9.9.0 (https://lucene.apache.org/core/9_9_0/core/org/apache/lucene/codecs/lucene99/package-summary.html#package.description):

Segment info. This contains metadata about a segment, such as the number of documents, what files it uses, and information about how the segment is sorted.
Field names. This contains metadata about the set of named fields used in the index.
Stored Field values. This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, url, or an identifier to access a database. The set of stored fields are what is returned for each hit when searching. This is keyed by document number.
Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.
Term Frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document, unless frequencies are omitted (IndexOptions.DOCS). Note: the frequency of a term within a document, together with the number of documents that contain it, is what drives scoring. A term that occurs often in the current document but rarely across all documents gives that document a high score for that term.
Term Proximity data. For each term in the dictionary, the positions that the term occurs in each document. Note that this will not exist if all fields in all documents omit position data.
Normalization factors. For each field in each document, a value is stored that is multiplied into the score for hits on that field.
Term Vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors.
- TextField: Reader or String indexed for full-text search.
- StringField: String indexed verbatim as a single token
- IntPoint: int indexed for exact/range queries.
- LongPoint: long indexed for exact/range queries.
- FloatPoint: float indexed for exact/range queries.
- DoublePoint: double indexed for exact/range queries.
- SortedDocValuesField: byte[] indexed column-wise for sorting/faceting
- SortedSetDocValuesField: SortedSet<byte[]> indexed column-wise for sorting/faceting
- NumericDocValuesField: long indexed column-wise for sorting/faceting
- SortedNumericDocValuesField: SortedSet<long> indexed column-wise for sorting/faceting
- StoredField: Stored-only value for retrieving in summary results.
Per-document values. Like stored values, these are also keyed by document number, but are generally intended to be loaded into main memory for fast access. Whereas stored values are generally intended for summary results from searches, per-document values are useful for things like scoring factors.
Live documents. An optional file indicating which documents are live. Note: "live" here means not deleted. When documents in a segment are deleted, the segment's data is not rewritten; this file, kept outside the segment data, records the segment's latest state.
Point values. Optional pair of files, recording dimensionally indexed fields, to enable fast numeric range filtering and large numeric values like BigInteger and BigDecimal (1D) and geographic shape intersection (2D, 3D).
Vector values. The vector format stores numeric vectors in a format optimized for random access and computation, supporting high-dimensional nearest-neighbor search.
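The note under Term Frequency data (a term that is frequent in one document but rare across the collection scores high) is the classic TF-IDF intuition. The sketch below uses a simplified textbook formula, `tf * (ln(N / df) + 1)`, chosen for readability. It is not Lucene's scoring code; Lucene's default similarity is BM25, which refines the same idea. The function name `tf_idf` and the tiny corpus are assumptions for the example.

```python
import math
from typing import List


def tf_idf(term: str, doc: List[str], corpus: List[List[str]]) -> float:
    """Simplified TF-IDF weight of `term` in `doc` relative to `corpus`."""
    tf = doc.count(term)                      # frequency inside this document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(corpus) / df) + 1.0    # rarer terms get a larger weight
    return tf * idf


corpus = [
    "bkd tree bkd index".split(),
    "tree index".split(),
    "index file".split(),
]
rare_and_frequent = tf_idf("bkd", corpus[0], corpus)
common_everywhere = tf_idf("index", corpus[0], corpus)
```

In this corpus "bkd" appears twice in the first document and nowhere else, so it outweighs "index", which appears in every document.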

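By storage direction, the data falls into two forms:

Forward (general) storage: stores the containment hierarchy from the index down to the term: Index -> Segment -> Document -> Field -> Term. Each level stores its own information plus metadata about the next level. StoredField and DocValue are stored this way.
Inverted storage: for example the inverted index (PostingList + BlockTree).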
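Within the forward form, StoredField and DocValue differ in layout: stored fields keep all values of one document together, while doc values keep all values of one field together. The following sketch illustrates why the column layout suits sorting. The data, and the helper names `to_column` and `sort_doc_ids`, are invented for the example, and no on-disk encoding or compression is modeled.

```python
from typing import Dict, List

# Row-wise layout (like stored fields): one record per document.
rows: List[Dict[str, object]] = [
    {"title": "a", "price": 30},
    {"title": "b", "price": 10},
    {"title": "c", "price": 20},
]


def to_column(rows: List[Dict[str, object]], field_name: str) -> List[object]:
    """Column-wise layout (like doc values): position i holds the value for doc id i."""
    return [row[field_name] for row in rows]


def sort_doc_ids(column: List[object]) -> List[int]:
    """Sort doc ids by one field while touching only that field's column."""
    return sorted(range(len(column)), key=lambda doc_id: column[doc_id])


price_column = to_column(rows, "price")
sorted_ids = sort_doc_ids(price_column)
```

Sorting reads only `price_column`; the row-wise records are needed again only to fetch the fields of the hits that are actually returned.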

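2. Lucene storage files

The files of one index live in a single directory. All files belonging to a segment have the same name and different extensions. The extensions correspond to the different file formats described below. When the compound file format is used (the default for small segments), these files (except the segment info file, the lock file, and the deleted-documents file) are collapsed into a single .cfs file.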

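Segments info: file names are never reused. That is, when any file is saved to the directory, it is given a name that has never been used before. This is achieved with a simple generation scheme: the first segments file is segments_1, then segments_2, and so on. The generation is a sequential long integer represented in alphanumeric (base 36) form. The file mainly holds segment metadata: segments_N records how many segments the index contains and how many documents each segment holds, while the actual data lives in the field and term files.
Write.lock: the write lock is stored in the index directory by default and is named "write.lock". If the lock directory differs from the index directory, the write lock is named "XXXX-write.lock", where "XXXX" is a unique prefix derived from the full path of the index directory. If this file exists, a writer is currently modifying the index (adding or removing documents). The lock file ensures that only one writer modifies the index at a time.
Fields, Field Index, Field Data: keyed by document number. This is the forward storage form described above; these files record how many fields the segment contains, each field's name and indexing options, and the field data.
Term Vector Index, Term Vector Data: when a field is configured to store term vectors, Lucene extracts the information for every term in that field and stores it per document, in files separate from the postings. The data records the frequency, and optionally the positions and offsets, of every term in the document. After the inverted index has located a document ID, the term-level detail of that document can then be read back directly, for example for highlighting or finding similar documents.
Vector values: numeric vectors (for example embeddings supplied by the application) indexed for high-dimensional nearest-neighbor search. At query time a query vector is compared with the stored vectors to produce a similarity score. These vectors are not derived from the term vector data above.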
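The generation-based naming of the segments file described above can be reproduced in a few lines. This is a sketch of the naming rule only (sequential generation, written in lowercase base 36); the helper names `to_base36` and `segments_file_name` are made up for the example and are not Lucene APIs.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Render a non-negative integer in lowercase base 36."""
    if n < 0:
        raise ValueError("generation must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def segments_file_name(generation: int) -> str:
    """Name of the segments file for a given commit generation."""
    return "segments_" + to_base36(generation)


names = [segments_file_name(g) for g in (1, 2, 10, 35, 36)]
```

Because the generation only ever increases, each commit gets a name that has not been used before, and base 36 keeps the names short.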


3. Lucene data storage

Note: the Lucene version studied here is 9.9.0.

3.1 StoredField

In Lucene, fields may be stored, in which case their text is stored in the index literally, in a non-inverted manner. Fields that are inverted are called indexed. A field may be both stored and indexed.

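Stored fields hold the field values of each document. The main questions here are how each data type is stored and how the data is compressed when it is finally written to the index. Lucene's field data types fall into the following broad categories: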