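In "LLM RAG in Practice (29) | Exploring PDF Parsing for RAG", we saw that parsing a document yields structured or semi-structured data. The main task now is to split that data into smaller chunks so that detailed features can be extracted, and then to embed those chunks to represent their semantics. The position of this step in the RAG pipeline is shown in Figure 1:
The most common chunking methods are rule-based, using techniques such as a fixed chunk size or an overlap between adjacent chunks. For multi-level documents, we can use the RecursiveCharacterTextSplitter[1] provided by Langchain to define multi-level separators.
In practice, however, the rigid predefined rules (chunk size or overlap size) mean that rule-based chunking can easily produce chunks that are too small, leaving the retrieved context incomplete, or too large, carrying noise.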
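To make the limitation concrete, here is a minimal plain-Python sketch of fixed-size chunking with overlap. It is not Langchain's RecursiveCharacterTextSplitter; the function name chunk_fixed and the sample text are made up for illustration.

```python
def chunk_fixed(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "BERT is bidirectional. GPT reads left to right. Chunking ignores this."
chunks = chunk_fixed(sample, chunk_size=30, overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

Because the cut positions depend only on character counts, the chunks start and end in the middle of sentences, which is the kind of incomplete context described above.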
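For chunking, therefore, the most elegant approach is clearly one based on semantics. Semantic chunking aims to make each chunk contain as much semantically self-contained information as possible.
This article explores semantic chunking methods and explains their principles and applications. We will cover three types of methods: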
- Embedding-based
- Model-based
- LLM-based
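1. Embedding-based methods
Both LlamaIndex and Langchain provide an embedding-based semantic chunker. The two implementations follow essentially the same idea, so we will use LlamaIndex as the example.
Note that the semantic chunker is only available in recent versions of LlamaIndex. The version I had installed previously, 0.9.45, did not include this algorithm, so I created a new conda environment and installed the newer version 0.10.12: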
pip install llama-index-core
pip install llama-index-readers-file
pip install llama-index-embeddings-openai
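It is worth mentioning that LlamaIndex 0.10.12 is modular and can be installed piece by piece, so only a few key components are installed here. The installed versions are as follows: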
(llamaindex_010) Florian:~ Florian$ pip list | grep llama
llama-index-core 0.10.12
llama-index-embeddings-openai 0.1.6
llama-index-readers-file 0.1.5
llamaindex-py-client 0.1.13
The test code is as follows:
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import SimpleDirectoryReader

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"

# load documents
dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()

embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

nodes = splitter.get_nodes_from_documents(documents)
for node in nodes:
    print('-' * 100)
    print(node.get_content())
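The main flow of the splitter.get_nodes_from_documents function is shown in Figure 2:
The "sentences" mentioned in Figure 2 is a Python list in which each member is a dictionary containing four (key, value) pairs. The keys have the following meanings:
- sentence: the current sentence;
- index: the position of the current sentence;
- combined_sentence: a sliding window covering the sentences at [index - self.buffer_size, index, index + self.buffer_size], which is 3 sentences when self.buffer_size = 1 (the default). It is used to compute the semantic relatedness between sentences. Combining the preceding and following sentences reduces noise and better captures the relationship between consecutive sentences;
- combined_sentence_embedding: the embedding of combined_sentence.
As this analysis shows, embedding-based semantic chunking essentially computes similarity over a sliding window (combined_sentence). Adjacent sentences that satisfy the threshold are grouped into the same chunk, and a new chunk starts where the distance between neighbouring windows exceeds it.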
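The idea can be sketched in plain Python as follows. This is a simplified illustration rather than the LlamaIndex implementation: toy_embed is a bag-of-words stand-in for a real embedding model, and the helper names and sample sentences are assumptions made for the example. The buffer_size and percentile threshold play the same role as the two parameters in the test code above.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values: list[float], pct: float) -> float:
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences, buffer_size=1, threshold_pct=95, embed=toy_embed):
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    # Build the sliding window (combined_sentence) around every sentence.
    combined = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [embed(c) for c in combined]
    # Distance between each window and the next one.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    cutoff = percentile(distances, threshold_pct)
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > cutoff:  # large jump in meaning: close the chunk after sentence i
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


sentences = [
    "BERT uses masked language models.",
    "Masked language models help BERT.",
    "Hares hide eggs.",
    "Eggs attract hares.",
]
chunks = semantic_chunks(sentences, buffer_size=1, threshold_pct=95)
for chunk in chunks:
    print(chunk)
```

With a high percentile only the largest jumps become breakpoints, which is consistent with the coarse chunks seen in the test run.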
Below we use a directory containing the BERT paper[2] as the input path. Here are some of the results:
(llamaindex_010) Florian:~ Florian$ python /Users/Florian/Documents/june_pdf_loader/test_semantic_chunk.py
...
...
----------------------------------------------------------------------------------------------------
We argue that current techniques restrict the
power of the pre-trained representations, espe-
cially for the fine-tuning approaches. The ma-
jor limitation is that standard language models are
unidirectional, and this limits the choice of archi-
tectures that can be used during pre-training. For
example, in OpenAI GPT, the authors use a left-to-
right architecture, where every token can only at-
tend to previous tokens in the self-attention layers
of the Transformer (Vaswani et al., 2017). Such re-
strictions are sub-optimal for sentence-level tasks,
and could be very harmful when applying fine-
tuning based approaches to token-level tasks such
as question answering, where it is crucial to incor-
porate context from both directions.
In this paper, we improve the fine-tuning based
approaches by proposing BERT: Bidirectional
Encoder Representations from Transformers.
BERT alleviates the previously mentioned unidi-
rectionality constraint by using a “masked lan-
guage model” (MLM) pre-training objective, in-
spired by the Cloze task (Taylor, 1953). The
masked language model randomly masks some of
the tokens from the input, and the objective is to
predict the original vocabulary id of the maskedarXiv:1810.04805v2 [cs.CL] 24 May 2019
----------------------------------------------------------------------------------------------------
word based only on its context. Unlike left-to-
right language model pre-training, the MLM ob-
jective enables the representation to fuse the left
and the right context, which allows us to pre-
train a deep bidirectional Transformer. In addi-
tion to the masked language model, we also use
a “next sentence prediction” task that jointly pre-
trains text-pair representations. The contributions
of our paper are as follows:
• We demonstrate the importance of bidirectional
pre-training for language representations. Un-
like Radford et al. (2018), which uses unidirec-
tional language models for pre-training, BERT
uses masked language models to enable pre-
trained deep bidirectional representations. This
is also in contrast to Peters et al.
----------------------------------------------------------------------------------------------------
...
...
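Embedding-based methods: summary
- The test results show that the chunks are relatively coarse-grained.
- Figure 2 also shows that this method works page by page and does not directly handle chunks that span multiple pages.
- In general, the performance of embedding-based methods depends heavily on the embedding model. The actual effect needs further evaluation.
2. Model-based methods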
2.1 Naive BERT
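Recall BERT's pre-training procedure, which includes a binary classification task, next sentence prediction (NSP), that teaches the model the relationship between two sentences. Two sentences are fed into BERT together, and the model predicts whether the second sentence follows the first.
We can apply this principle to design a simple chunking method. First split the document into sentences. Then use a sliding window to feed each pair of adjacent sentences into BERT for an NSP judgment, as shown in Figure 3:
If the predicted score is below a preset threshold, the semantic relationship between the two sentences is weak, and that position can serve as a split point, as shown between sentence 2 and sentence 3 in Figure 3.
The advantage of this method is that it can be used directly, with no training or fine-tuning.
However, it considers only the preceding and following sentences when deciding on a split point and ignores information from other segments. Its prediction efficiency is also relatively low.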
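A minimal sketch of this pairwise procedure is shown below. The splitting function takes any scoring function, so the example runs offline with a hypothetical word-overlap scorer. make_bert_nsp_scorer shows how an NSP score could be obtained with the Hugging Face transformers library, assuming transformers and torch are installed and the bert-base-uncased checkpoint can be downloaded. The function names and the threshold values are illustrative assumptions, not taken from the original experiment.

```python
from typing import Callable, List


def split_by_pair_score(
    sentences: List[str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Start a new chunk wherever score_fn(previous, current) falls below threshold."""
    chunks: List[List[str]] = []
    for sentence in sentences:
        if chunks and score_fn(chunks[-1][-1], sentence) >= threshold:
            chunks[-1].append(sentence)  # strong link: extend the current chunk
        else:
            chunks.append([sentence])    # first sentence or weak link: new chunk
    return chunks


def make_bert_nsp_scorer(model_name: str = "bert-base-uncased") -> Callable[[str, str], float]:
    """Optional real scorer; needs `transformers` and `torch` and downloads the model."""
    import torch
    from transformers import BertForNextSentencePrediction, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForNextSentencePrediction.from_pretrained(model_name).eval()

    def score(first: str, second: str) -> float:
        inputs = tokenizer(first, second, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        # In the Hugging Face NSP head, index 0 means "second follows first".
        return torch.softmax(logits, dim=-1)[0, 0].item()

    return score


def word_overlap(first: str, second: str) -> float:
    """Offline stand-in for the NSP probability: Jaccard overlap of the words."""
    a, b = set(first.lower().split()), set(second.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


sentences = ["the cat sat", "the cat slept", "stocks fell sharply", "stocks fell again"]
chunks = split_by_pair_score(sentences, word_overlap, threshold=0.3)
print(chunks)
```

To try the BERT variant, pass make_bert_nsp_scorer() as score_fn. Each adjacent pair then costs one forward pass, which illustrates why the method is slow on long documents.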
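2.2 Cross Segment Attention
The paper "Text Segmentation by Cross Segment Attention"[3] proposes three cross-segment attention models, as shown in Figure 4:
Figure 4(a) shows the cross-segment BERT model, which frames text segmentation as a sentence-by-sentence classification task. The context around a potential break (k tokens on each side) is fed into the model, and the hidden state corresponding to [CLS] is passed to a softmax classifier that decides whether to split at that potential break.
The paper also proposes two other models. Both first use a BERT model to obtain a vector representation of each sentence. The vectors of several consecutive sentences are then fed into a Bi-LSTM (Figure 4(b)) or another BERT (Figure 4(c)) to predict whether each sentence is a segment boundary.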
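As an illustration of the input to the cross-segment BERT model in Figure 4(a), the sketch below builds one token sequence per candidate break from k tokens on each side. It assumes whitespace tokens instead of BERT wordpieces and a [CLS] left [SEP] right [SEP] layout. The function name and sample document are hypothetical, and the classifier itself is not included.

```python
from typing import List, Tuple


def cross_segment_inputs(sentences: List[List[str]], k: int) -> List[Tuple[int, List[str]]]:
    """For every gap between sentences, return (gap_index, token sequence).

    gap_index i is the candidate break after sentences[i]. The sequence is
    [CLS] + last k tokens on the left + [SEP] + first k tokens on the right + [SEP].
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    examples = []
    for i in range(len(sentences) - 1):
        left = [tok for sent in sentences[: i + 1] for tok in sent][-k:]
        right = [tok for sent in sentences[i + 1:] for tok in sent][:k]
        examples.append((i, ["[CLS]", *left, "[SEP]", *right, "[SEP]"]))
    return examples


doc = [s.split() for s in ["BERT is bidirectional .", "It uses masked tokens .", "Hares hide eggs ."]]
examples = cross_segment_inputs(doc, k=3)
for gap, tokens in examples:
    print(gap, " ".join(tokens))
```

Each sequence would then be classified independently, so the model sees only the local context around the candidate break.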
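At the time, these three models achieved state-of-the-art results, as shown in Figure 5:
So far, however, I have found only training code for this paper[4]. I have not found a ready-to-use inference model.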
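2.3 SeqModel
The cross-segment models vectorize each sentence independently, without taking broader context into account. SeqModel, proposed in the paper "Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation"[5], improves on this.
SeqModel[6] uses BERT to encode multiple sentences at once, modelling dependencies across a longer context before computing the sentence vectors. It then predicts whether a segment boundary occurs after each sentence. In addition, the model uses a self-adaptive sliding window to speed up inference without sacrificing accuracy. A schematic of SeqModel is shown in Figure 6.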
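The sketch below shows one simplified reading of a sliding window whose next start position depends on the predictions made in the current window. It is not the exact rule from the paper or the SpokenNLP code. The window size, the function names, and the stub predictor that stands in for the BERT-based model are all assumptions made for illustration.

```python
from typing import Callable, List


def adaptive_window_boundaries(
    sentences: List[str],
    predict: Callable[[List[str]], List[bool]],
    window_size: int = 4,
) -> List[int]:
    """Return every index i where a boundary is predicted after sentences[i]."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    boundaries: List[int] = []
    start = 0
    while start < len(sentences):
        window = sentences[start:start + window_size]
        flags = predict(window)  # one flag per sentence in the window
        hits = [start + j for j, flag in enumerate(flags) if flag]
        boundaries.extend(hits)
        # Adaptive step: restart right after the last predicted boundary,
        # or skip the whole window when no boundary was found.
        start = hits[-1] + 1 if hits else start + len(window)
    return boundaries


def stub_predict(window: List[str]) -> List[bool]:
    """Hypothetical stand-in for the model: a boundary follows any sentence ending in '!'."""
    return [sentence.endswith("!") for sentence in window]


sentences = ["a", "b!", "c", "d", "e!", "f"]
boundaries = adaptive_window_boundaries(sentences, stub_predict, window_size=4)
print(boundaries)
```

Starting the next window right after the last predicted boundary means every new window begins at the start of a segment, while windows without a boundary are skipped in a single step.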
SeqModel can be used through the ModelScope[7] framework. The code is as follows:
from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

p = pipeline(
    task=Tasks.document_segmentation,
    model='damo/nlp_bert_document-segmentation_english-base'
)

print('-' * 100)

result = p(documents='We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. • We show that pre-trained representations reduce the need for many heavily-engineered taskspecific architectures. BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures. Today is a good day')

print(result[OutputKeys.TEXT])
The test data has one extra sentence appended at the end, "Today is a good day", but the result does not separate "Today is a good day" from the rest.
(modelscope) Florian:~ Florian$ python /Users/Florian/Documents/june_pdf_loader/test_seqmodel.py
2024-02-24 17:09:36,288 - modelscope - INFO - PyTorch version 2.2.1 Found.
2024-02-24 17:09:36,288 - modelscope - INFO - Loading ast index from /Users/Florian/.cache/modelscope/ast_indexer
...
...
----------------------------------------------------------------------------------------------------
...
...
We demonstrate the importance of bidirectional pre-training for language representations.Unlike Radford et al.(2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations.This is also in contrast to Peters et al.(2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.• We show that pre-trained representations reduce the need for many heavily-engineered taskspecific architectures.BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.Today is a good day
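3. LLM-based methods
The paper "Dense X Retrieval: What Retrieval Granularity Should We Use?"[8] introduces a new retrieval unit called the proposition. A proposition is defined as an atomic expression within text: each proposition encapsulates a distinct fact and is presented in a concise, self-contained natural-language format.
So how do we obtain these propositions? In the paper, this is done by constructing a prompt and interacting with an LLM.
Both LlamaIndex and Langchain implement the relevant algorithm. The demonstration below uses LlamaIndex.
The LlamaIndex implementation generates propositions using the prompt provided in the paper: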
PROPOSITIONS_PROMPT = PromptTemplate(
"""Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of
context.
1. Split compound sentence into simple sentences. Maintain the original phrasing from the input
whenever possible.
2. For any named entity that is accompanied by additional descriptive information, separate this
information into its own distinct proposition.
3. Decontextualize the proposition by adding necessary modifier to nouns or entire sentences
and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the
entities they refer to.
4. Present the results as a list of strings, formatted in JSON.

Input: Title: ˉEostre. Section: Theories and interpretations, Connection to Easter Hares. Content:
The earliest evidence for the Easter Hare (Osterhase) was recorded in south-west Germany in
1678 by the professor of medicine Georg Franck von Franckenau, but it remained unknown in
other parts of Germany until the 18th century. Scholar Richard Sermon writes that "hares were
frequently seen in gardens in spring, and thus may have served as a convenient explanation for the
origin of the colored eggs hidden there for children. Alternatively, there is a European tradition
that hares laid eggs, since a hare’s scratch or form and a lapwing’s nest look very similar, and
both occur on grassland and are first seen in the spring. In the nineteenth century the influence
of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe.
German immigrants then exported the custom to Britain and America where it evolved into the
Easter Bunny."
Output: [ "The earliest evidence for the Easter Hare was recorded in south-west Germany in
1678 by Georg Franck von Franckenau.", "Georg Franck von Franckenau was a professor of
medicine.", "The evidence for the Easter Hare remained unknown in other parts of Germany until
the 18th century.", "Richard Sermon was a scholar.", "Richard Sermon writes a hypothesis about
the possible explanation for the connection between hares and the tradition during Easter", "Hares
were frequently seen in gardens in spring.", "Hares may have served as a convenient explanation
for the origin of the colored eggs hidden in gardens for children.", "There is a European tradition
that hares laid eggs.", "A hare’s scratch or form and a lapwing’s nest look very similar.", "Both
hares and lapwing’s nests occur on grassland and are first seen in the spring.", "In the nineteenth
century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular
throughout Europe.", "German immigrants exported the custom of the Easter Hare/Rabbit to
Britain and America.", "The custom of the Easter Hare/Rabbit evolved into the Easter Bunny in
Britain and America." ]

Input: {node_text}
Output:"""
)
In the embedding-based section above, we installed the key components of LlamaIndex 0.10.12. To use DenseXRetrievalPack, we also need to run pip install llama-index-llms-openai. After installation, the LlamaIndex-related components are as follows:
(llamaindex_010) Florian:~ Florian$ pip list | grep llama
llama-index-core 0.10.12
llama-index-embeddings-openai 0.1.6
llama-index-llms-openai 0.1.6
llama-index-readers-file 0.1.5
llamaindex-py-client 0.1.13
The test code is as follows:
from llama_index.core.readers import SimpleDirectoryReader
from llama_index.core.llama_pack import download_llama_pack

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"

# Download and install dependencies
DenseXRetrievalPack = download_llama_pack(
    "DenseXRetrievalPack", "./dense_pack"
)

# If you have already downloaded DenseXRetrievalPack, you can import it directly.
# from llama_index.packs.dense_x_retrieval import DenseXRetrievalPack

# Load documents
dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()

# Use LLM to extract propositions from every document/node
dense_pack = DenseXRetrievalPack(documents)

response = dense_pack.run("YOUR_QUERY")
The test code above shows the basic usage of the DenseXRetrievalPack class. Now let us look at its source code:
class DenseXRetrievalPack(BaseLlamaPack):
    def __init__(
        self,
        documents: List[Document],
        proposition_llm: Optional[LLM] = None,
        query_llm: Optional[LLM] = None,
        embed_model: Optional[BaseEmbedding] = None,
        text_splitter: TextSplitter = SentenceSplitter(),
        similarity_top_k: int = 4,
    ) -> None:
        """Init params."""
        self._proposition_llm = proposition_llm or OpenAI(
            model="gpt-3.5-turbo",
            temperature=0.1,
            max_tokens=750,
        )

        embed_model = embed_model or OpenAIEmbedding(embed_batch_size=128)

        nodes = text_splitter.get_nodes_from_documents(documents)
        sub_nodes = self._gen_propositions(nodes)

        all_nodes = nodes + sub_nodes
        all_nodes_dict = {n.node_id: n for n in all_nodes}

        service_context = ServiceContext.from_defaults(
            llm=query_llm or OpenAI(),
            embed_model=embed_model,
            num_output=self._proposition_llm.metadata.num_output,
        )

        self.vector_index = VectorStoreIndex(
            all_nodes, service_context=service_context, show_progress=True
        )

        self.retriever = RecursiveRetriever(
            "vector",
            retriever_dict={
                "vector": self.vector_index.as_retriever(
                    similarity_top_k=similarity_top_k
                )
            },
            node_dict=all_nodes_dict,
        )

        self.query_engine = RetrieverQueryEngine.from_args(
            self.retriever, service_context=service_context
        )
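As the code shows, text_splitter first divides the documents into the original nodes, and self._gen_propositions is then called to obtain the corresponding sub_nodes. A VectorStoreIndex is built from nodes + sub_nodes, and this index is queried through a RecursiveRetriever. The recursive retriever can search using the small chunks while passing the associated larger chunks to the generation stage.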
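The small-to-big idea can be illustrated with a few lines of plain Python. This sketch does not use LlamaIndex: it indexes only the propositions, replaces vector similarity with a hypothetical word-overlap score, and uses made-up class names and data. What it shows is the lookup from a matched proposition back to its parent chunk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    node_id: str
    text: str


@dataclass
class Proposition:
    text: str
    parent_id: str  # id of the chunk the proposition was derived from


def word_overlap(query: str, text: str) -> int:
    """Stand-in for vector similarity: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_small_to_big(
    query: str,
    propositions: List[Proposition],
    chunks: Dict[str, Chunk],
    top_k: int = 2,
) -> List[Chunk]:
    """Rank the small units, then return the distinct parent chunks they point to."""
    ranked = sorted(propositions, key=lambda p: word_overlap(query, p.text), reverse=True)
    parents: List[Chunk] = []
    for prop in ranked[:top_k]:
        parent = chunks[prop.parent_id]
        if parent not in parents:  # several propositions may share one parent
            parents.append(parent)
    return parents


chunks = {
    "c1": Chunk("c1", "BERT uses masked language models. It is pre-trained on unlabeled text."),
    "c2": Chunk("c2", "Hares were frequently seen in gardens in spring."),
}
propositions = [
    Proposition("BERT uses masked language models.", "c1"),
    Proposition("BERT is pre-trained on unlabeled text.", "c1"),
    Proposition("Hares were frequently seen in gardens in spring.", "c2"),
]
result = retrieve_small_to_big("which models does BERT use", propositions, chunks)
print([chunk.node_id for chunk in result])
```

Matching happens on the short, self-contained propositions, while the text handed to the generation stage is the larger parent chunk.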
The test document is again the BERT paper. Debugging shows that the text in sub_nodes[].text is not the original text; it has been rewritten:
> /Users/Florian/anaconda3/envs/llamaindex_010/lib/python3.11/site-packages/llama_index/packs/dense_x_retrieval/base.py(91)__init__()
90
---> 91 all_nodes = nodes + sub_nodes
92 all_nodes_dict = {n.node_id: n for n in all_nodes}
ipdb> sub_nodes[20]
IndexNode(id_='ecf310c7-76c8-487a-99f3-f78b273e00d9', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Our paper demonstrates the importance of bidirectional pre-training for language representations.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[21]
IndexNode(id_='4911332e-8e30-47d8-a5bc-ed7cbaa8e042', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Radford et al. (2018) uses unidirectional language models for pre-training.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[22]
IndexNode(id_='83aa82f8-384a-4b06-92c8-d6277c4162bf', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='BERT uses masked language models to enable pre-trained deep bidirectional representations.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[23]
IndexNode(id_='2ac635c2-ccb0-4e62-88c7-bcbaef3ef38a', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Peters et al. (2018a) uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[24]
IndexNode(id_='e37b17cf-30dd-4114-a3c5-9921b8cf0a77', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Pre-trained representations reduce the need for many heavily-engineered task-specific architectures.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
The relationship between sub_nodes and nodes is shown in Figure 7. It is a small-to-big index structure.
The small-to-big index structure is built by self._gen_propositions, whose code is as follows:
    async def _aget_proposition(self, node: TextNode) -> List[TextNode]:
        """Get proposition."""
        inital_output = await self._proposition_llm.apredict(
            PROPOSITIONS_PROMPT, node_text=node.text
        )
        outputs = inital_output.split("\n")

        all_propositions = []

        for output in outputs:
            if not output.strip():
                continue
            if not output.strip().endswith("]"):
                if not output.strip().endswith('"') and not output.strip().endswith(
                    ","
                ):
                    output = output + '"'
                output = output + " ]"
            if not output.strip().startswith("["):
                if not output.strip().startswith('"'):
                    output = '"' + output
                output = "[ " + output

            try:
                propositions = json.loads(output)
            except Exception:
                # fallback to yaml
                try:
                    propositions = yaml.safe_load(output)
                except Exception:
                    # fallback to next output
                    continue

            if not isinstance(propositions, list):
                continue

            all_propositions.extend(propositions)

        assert isinstance(all_propositions, list)
        nodes = [TextNode(text=prop) for prop in all_propositions if prop]

        return [IndexNode.from_text_node(n, node.node_id) for n in nodes]

    def _gen_propositions(self, nodes: List[TextNode]) -> List[TextNode]:
        """Get propositions."""
        sub_nodes = asyncio.run(
            run_jobs(
                [self._aget_proposition(node) for node in nodes],
                show_progress=True,
                workers=8,
            )
        )

        # Flatten list
        return [node for sub_node in sub_nodes for node in sub_node]
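For each original node, self._aget_proposition is called asynchronously: it sends PROPOSITIONS_PROMPT to the LLM to get inital_output, then parses the propositions from inital_output and builds TextNode objects. Finally, these TextNodes are linked to the original node via [IndexNode.from_text_node(n, node.node_id) for n in nodes].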
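The line-by-line repair step is easier to follow in isolation. The sketch below is a simplified, standalone variant rather than the library code: it uses only json, so it strips a trailing comma instead of falling back to YAML. The function name parse_propositions and the sample reply are assumptions made for illustration.

```python
import json
from typing import List


def parse_propositions(llm_output: str) -> List[str]:
    """Collect propositions from an LLM reply that should be a JSON list of strings."""
    propositions: List[str] = []
    for line in llm_output.split("\n"):
        line = line.strip()
        if not line:
            continue
        if not line.endswith("]"):
            # Drop a trailing comma, then close the string and the list if needed.
            line = line.rstrip(",").rstrip()
            if not line.endswith('"'):
                line += '"'
            line += " ]"
        if not line.startswith("["):
            if not line.startswith('"'):
                line = '"' + line
            line = "[ " + line
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue  # the library tries YAML here; this sketch just skips the line
        if isinstance(parsed, list):
            propositions.extend(str(p) for p in parsed if p)
    return propositions


reply = (
    '[ "BERT is bidirectional.",\n'
    '"GPT reads left to right.",\n'
    '"Radford et al. use unidirectional models." ]'
)
props = parse_propositions(reply)
for prop in props:
    print(prop)
```

Each line of the reply is wrapped into a small JSON list of its own, so a reply that spreads one list over several lines still yields all of its propositions.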
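It is worth mentioning that the original paper used LLM-generated propositions as training data to fine-tune a text generation model. That text generation model[9] is now publicly available, and interested readers can try it out.
LLM-based methods: summary
In general, this approach of using an LLM to construct propositions achieves finer-grained chunks. Together with the original nodes it forms a small-to-big index structure, which offers a new idea for semantic chunking. However, it depends on an LLM, which is relatively expensive.
4. Conclusion
This article explored the principles and implementations of three types of semantic chunking methods and offered some observations on each.
In general, semantic chunking is a more elegant approach and a key to optimizing RAG.
References: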
[1] https://github.com/langchain-ai/langchain/blob/v0.1.9/libs/langchain/langchain/text_splitter.py#L851C1-L851C6
[2] https://arxiv.org/pdf/1810.04805.pdf
[3] https://arxiv.org/abs/2004.14535
[4] https://github.com/aakash222/text-segmentation-NLP/
[5] https://arxiv.org/pdf/2107.09278.pdf
[6] https://github.com/alibaba-damo-academy/SpokenNLP
[7] https://github.com/modelscope/modelscope/
[8] https://arxiv.org/pdf/2312.06648.pdf
[9] https://github.com/chentong0/factoid-wiki