LangChain Study Notes
- 【LangChain】Vector stores
- 【LangChain】Vector stores: FAISS
Overview
Facebook AI Similarity Search (FAISS) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that may not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
FAISS documentation
This article shows how to use the functionality related to the FAISS vector database.
Prerequisites
pip install faiss-gpu # For GPUs supporting CUDA 7.5+
# OR
pip install faiss-cpu # For CPU-only installation
Contents
We want to use OpenAIEmbeddings, so we must obtain an OpenAI API key.
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

# Uncomment the following line if you need to initialize FAISS with no AVX2 optimization
# os.environ['FAISS_NO_AVX2'] = '1'
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader
Related API links:
OpenAIEmbeddings from langchain.embeddings.openai
CharacterTextSplitter from langchain.text_splitter
FAISS from langchain.vectorstores
TextLoader from langchain.document_loaders
from langchain.document_loaders import TextLoader

loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
Reference API:
TextLoader from langchain.document_loaders
db = FAISS.from_documents(docs, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)
Result:
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
Similarity Search with score
There are some FAISS-specific methods. One of them is similarity_search_with_score, which allows you to return not only the documents but also the distance score of the query to them. The returned distance score is the L2 distance; therefore, a lower score is better.
docs_and_scores = db.similarity_search_with_score(query)
docs_and_scores[0]
Result:
(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'}),0.36913747)
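To make the score above concrete: the L2 distance behind it is plain Euclidean distance, so an identical vector scores 0.0. A minimal standalone sketch with made-up 2-D vectors (no FAISS required):

```python
import math

def l2_distance(a, b):
    # Euclidean (L2) distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up 2-D "embeddings", purely for illustration
query_vec = [1.0, 0.0]
doc_vecs = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]}

scores = {name: l2_distance(query_vec, vec) for name, vec in doc_vecs.items()}
print(scores)  # doc_a is identical to the query, so its distance is 0.0
```

Real embedding vectors have hundreds or thousands of dimensions, but the "lower score = closer" interpretation is the same.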
It is also possible to search for documents similar to a given embedding vector using similarity_search_by_vector, which accepts an embedding vector as a parameter instead of a string.
# Embed the query text into a vector
embedding_vector = embeddings.embed_query(query)
# Pass the embedding vector (not a string) as the argument
docs_and_scores = db.similarity_search_by_vector(embedding_vector)
Saving and loading
You can also save and load a FAISS index. This is useful so that you don't have to recreate it every time you use it.
db.save_local("faiss_index")
new_db = FAISS.load_local("faiss_index", embeddings)
docs = new_db.similarity_search(query)
docs[0]
Result:
Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})
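The point of save_local/load_local is to build the index once and reuse it. Here is a dependency-free sketch of that "load if cached, else build and persist" pattern, demonstrated with pickle on a plain dict standing in for the index (an illustration of the pattern, not FAISS's actual serialization format):

```python
import os
import pickle
import tempfile

def load_or_build(path, build_fn):
    # Reuse a persisted index if one exists; otherwise build and save it
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)   # cached: load from disk
    index = build_fn()              # first run: build from scratch
    with open(path, "wb") as f:
        pickle.dump(index, f)
    return index

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "toy_index.pkl")
    first = load_or_build(path, lambda: {"docs": ["foo", "bar"]})  # builds and saves
    second = load_or_build(path, lambda: {"docs": []})             # loads the cached copy
    print(second)  # -> {'docs': ['foo', 'bar']}
```

With a real FAISS store, the expensive part you are skipping on the second run is re-embedding every document.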
Merging
You can also merge two FAISS vector stores.
db1 = FAISS.from_texts(["foo"], embeddings)
db2 = FAISS.from_texts(["bar"], embeddings)
# Inspect the first store's docstore
db1.docstore._dict
Result:
{'068c473b-d420-487a-806b-fb0ccea7f711': Document(page_content='foo', metadata={})}
# Inspect the second store's docstore
db2.docstore._dict
Result:
{'807e0c63-13f6-4070-9774-5c6f0fbb9866': Document(page_content='bar', metadata={})}
# Merge db2 into db1
db1.merge_from(db2)
# Inspect the merged store
db1.docstore._dict
Result:
{'068c473b-d420-487a-806b-fb0ccea7f711': Document(page_content='foo', metadata={}),
 '807e0c63-13f6-4070-9774-5c6f0fbb9866': Document(page_content='bar', metadata={})}
Similarity Search with filtering
The FAISS vectorstore can also support filtering. Since FAISS itself does not natively support filtering, we have to do it manually.
This works by first fetching more than k results and then filtering them. You can filter the documents based on metadata.
You can also set the fetch_k parameter when calling any search method to set how many documents you want to fetch before filtering. Here is a small example:
from langchain.schema import Document

# Build some test documents for the examples that follow
list_of_documents = [
    Document(page_content="foo", metadata=dict(page=1)),
    Document(page_content="bar", metadata=dict(page=1)),
    Document(page_content="foo", metadata=dict(page=2)),
    Document(page_content="barbar", metadata=dict(page=2)),
    Document(page_content="foo", metadata=dict(page=3)),
    Document(page_content="bar burr", metadata=dict(page=3)),
    Document(page_content="foo", metadata=dict(page=4)),
    Document(page_content="bar bruh", metadata=dict(page=4)),
]
# Build the vector store
db = FAISS.from_documents(list_of_documents, embeddings)
# A plain, unfiltered search first, for comparison later
results_with_scores = db.similarity_search_with_score("foo")
# Print the results
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
Related API: Document from langchain.schema
Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 2}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 3}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 4}, Score: 5.159960813797904e-15
Now we make the same query call, but we filter for only page = 1:
# Filtering: filter selects only documents whose metadata has page=1
results_with_scores = db.similarity_search_with_score("foo", filter=dict(page=1))
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
Result:
Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15
Content: bar, Metadata: {'page': 1}, Score: 0.3131446838378906
The same thing can also be done with max_marginal_relevance_search.
# max_marginal_relevance_search with the same filter
results = db.max_marginal_relevance_search("foo", filter=dict(page=1))
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
Result (note: unlike above, no Score is returned):
Content: foo, Metadata: {'page': 1}
Content: bar, Metadata: {'page': 1}
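For intuition, maximal marginal relevance re-ranks results by trading off relevance to the query against similarity to results already selected. The following is a standalone sketch of that selection rule with made-up similarity numbers; it illustrates the idea, not LangChain's actual implementation:

```python
def mmr_select(query_sim, pairwise_sim, k=2, lambda_mult=0.5):
    # query_sim[i]: similarity of candidate i to the query (higher = more relevant)
    # pairwise_sim[i][j]: similarity between candidates i and j
    selected = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Penalize candidates that resemble something already selected
            diversity_penalty = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * diversity_penalty
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; candidate 2 is different but less relevant
query_sim = [0.90, 0.85, 0.30]
pairwise_sim = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
print(mmr_select(query_sim, pairwise_sim, k=2))  # -> [0, 2]
```

Plain similarity search would return the two near-duplicates [0, 1]; MMR skips the redundant candidate 1 in favor of the more diverse candidate 2.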
Here is an example of how to set the fetch_k parameter when calling similarity_search.
Usually you want fetch_k to be much larger than k, because fetch_k is the number of documents that will be fetched before filtering. If you set fetch_k to a small number, you might not get enough documents left to filter from.
# k sets the number of documents returned after filtering;
# fetch_k sets how many documents are fetched before filtering
results = db.similarity_search("foo", filter=dict(page=1), k=1, fetch_k=4)
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")
Result:
Content: foo, Metadata: {'page': 1}
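The filter mechanics described above (fetch fetch_k candidates, drop those whose metadata doesn't match, keep the top k) can be sketched in plain Python with toy data, no FAISS involved:

```python
def filtered_top_k(candidates, metadata_filter, k=1, fetch_k=4):
    # candidates: list of (content, metadata, score) sorted by ascending score
    fetched = candidates[:fetch_k]  # step 1: take the fetch_k nearest candidates
    kept = [c for c in fetched      # step 2: drop those failing the metadata filter
            if all(c[1].get(key) == val for key, val in metadata_filter.items())]
    return kept[:k]                 # step 3: keep the best k survivors

candidates = [
    ("foo", {"page": 1}, 0.0),
    ("foo", {"page": 2}, 0.0),
    ("bar", {"page": 1}, 0.31),
    ("barbar", {"page": 2}, 0.35),
]
print(filtered_top_k(candidates, {"page": 1}, k=1, fetch_k=4))
# -> [('foo', {'page': 1}, 0.0)]
```

This also shows why a too-small fetch_k is risky: if fetch_k were 1 here and the nearest candidate happened to be on the wrong page, nothing would survive the filter.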
Summary
This article covered the basics of using FAISS.
The basic workflow:
- Load and split the documents
- Build a vector store from the embeddings:
db = FAISS.from_documents(docs, embeddings)
- On top of that, run similarity search, filtered search, and other operations.
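To recap the shape of that workflow without any external services, here is a toy, dependency-free analogue of FAISS.from_documents plus similarity_search. The character-frequency "embedding" and brute-force L2 scan are stand-ins for illustration only; they are not how FAISS or OpenAIEmbeddings work internally:

```python
import math

class ToyVectorStore:
    """A tiny analogue of a vector store: embed texts at build time,
    then answer queries by brute-force L2 distance."""

    def __init__(self, texts, embed_fn):
        self.embed_fn = embed_fn
        self.entries = [(t, embed_fn(t)) for t in texts]

    def similarity_search(self, query, k=1):
        qv = self.embed_fn(query)
        dist = lambda v: math.sqrt(sum((a - b) ** 2 for a, b in zip(qv, v)))
        return [t for t, v in sorted(self.entries, key=lambda e: dist(e[1]))[:k]]

# A crude "embedding": character-frequency vector over a..z (illustration only)
def toy_embed(text):
    v = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            v[ord(ch) - ord("a")] += 1.0
    return v

store = ToyVectorStore(["hello world", "goodbye moon"], toy_embed)
print(store.similarity_search("hello", k=1))  # -> ['hello world']
```

Swapping toy_embed for a real embedding model and the brute-force scan for a FAISS index gives you exactly the FAISS.from_documents / similarity_search flow shown in this article.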
Reference:
https://python.langchain.com/docs/integrations/vectorstores/faiss