Contents
- Preface
- Introduction to self-querying
- Code implementation
- Summary
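Preface
The RAG retrieval approach that is popular right now works like this: an embedding model converts the data into vectors that are stored in a vector database; the user's query is then vectorized as well, similar data is recalled from the vector database, the results are assembled into a context template, and that template is passed to the LLM to answer the query.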
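The flow described above can be sketched in a few lines. This is only an illustration under loud assumptions: `embed` is a toy word-count stand-in for a real embedding model, the three-sentence corpus is made up, and `retrieve` and `build_prompt` are hypothetical helper names, not part of langchain, DashVector, or DashScope.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus entries most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Fill a simple context template with the recalled passages."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Made-up corpus, for illustration only.
corpus = [
    "DashVector is a vector database service",
    "DashScope provides embedding and chat models",
    "Wool sweaters should be washed in cold water",
]
question = "which service is a vector database"
top = retrieve(question, corpus, k=1)
prompt = build_prompt(question, top)
```

In a real pipeline the resulting `prompt` would be sent to the LLM; here it is only constructed, so the sketch runs without any API key.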
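Converting the user's query directly into a vector query does not always give good results. Suppose we have stored a set of product vectors in the vector database and the user says: "I want a black wool sweater that costs less than 20 yuan." With a conventional embedding approach, the query can lose information when it is turned into a vector, and effectively becomes a search for "black wool sweater" with the price constraint dropped.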
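A toy illustration of that failure mode, and of the fix that self-querying aims for, might look like the following. Everything here is an assumption made for the example: the `Product` class, the three-item catalog, and the word-overlap score standing in for vector similarity. The split into a semantic part and a price limit is written by hand; in self-querying that split is produced by the LLM.

```python
from dataclasses import dataclass


@dataclass
class Product:
    description: str
    price: float


def overlap_score(query: str, text: str) -> float:
    """Toy similarity: Jaccard overlap of lowercase word sets."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


# Made-up catalog, for illustration only.
catalog = [
    Product("black wool sweater", 89.0),
    Product("black wool sweater budget edition", 18.0),
    Product("red cotton shirt", 15.0),
]

query = "black wool sweater under 20"

# Similarity alone: the price constraint is just extra words in the query,
# so the best textual match wins regardless of its price.
by_similarity = max(catalog, key=lambda p: overlap_score(query, p.description))

# Self-querying idea: separate the semantic part from the metadata filter,
# apply the filter first, then rank what is left by similarity.
semantic_part, max_price = "black wool sweater", 20.0
candidates = [p for p in catalog if p.price < max_price]
by_filter = max(candidates, key=lambda p: overlap_score(semantic_part, p.description))
```

With this catalog, `by_similarity` is the 89-yuan sweater, while `by_filter` is the 18-yuan one that actually satisfies the request.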
For cases like this we can optimize RAG retrieval by rewriting the query before it hits the vector store, and langchain's self-querying is a good way to do that. Here I try it out with Alibaba Cloud's DashVector vector database and the DashScope LLM, and the optimized retrieval results are quite good.
Most material online uses OpenAI's embeddings and LLM, but I personally think Alibaba's LLM and vector database are already very good, and OpenAI has blocked API calls from mainland China. The domestic cloud services are cheap and easy to use, so why not give them a try? I have written several hands-on posts about DashVector and DashScope before; if you are interested, take a look:
LLM: text chunking (langchain) and vectorized storage (Alibaba Cloud DashVector), with embedding into an LLM in practice
LLM: Alibaba Cloud DashVector + ModelScope, a summary of real-time multimodal text-to-image search
LLM: a summary of combining langchain with Alibaba DashScope (Tongyi Qianwen LLM) and DashVector (vector database)
Prerequisites
- Make sure you have activated a Tongyi Qianwen API key and a vector retrieval service (DashVector) API key
- Install the dependencies:
pip install langchain
pip install langchain-community
pip install dashvector
pip install dashscope
Introduction to self-querying
Put simply, self-querying converts the user's natural-language query into a structured query with two parts:
- Query: the text used for the semantic (vector) search
- Filter: conditions on the metadata of the content being searched
In the example from the figure above, the user asks: "What did bar say about foo?" After the self-querying conversion, this becomes two parts:
- Search for data about foo
- Restricted to documents whose author is bar
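The two parts listed above can be pictured as a small data structure. The sketch below uses hypothetical stand-in classes (`StructuredQuery`, `Comparison`) defined locally for illustration; they are simplified and are not claimed to match langchain's own classes. The decomposition of the question is written by hand here, whereas in self-querying the LLM produces it.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Comparison:
    """One metadata condition, e.g. author == 'bar' (hypothetical stand-in)."""
    attribute: str
    comparator: str  # only "eq" is handled in this sketch
    value: Any


@dataclass
class StructuredQuery:
    """The two parts extracted from a natural-language question."""
    query: str          # text used for the semantic (vector) search
    filter: Comparison  # condition applied to document metadata


def matches(metadata: dict, cond: Comparison) -> bool:
    """Check one equality condition against a document's metadata."""
    if cond.comparator != "eq":
        raise ValueError(f"unsupported comparator: {cond.comparator}")
    return metadata.get(cond.attribute) == cond.value


# "What did bar say about foo?" -> search for "foo", restricted to author "bar"
structured = StructuredQuery(query="foo", filter=Comparison("author", "eq", "bar"))

# Made-up documents, for illustration only.
docs = [
    {"text": "foo is great", "author": "bar"},
    {"text": "foo is overrated", "author": "baz"},
]
kept = [d["text"] for d in docs if matches(d, structured.filter)]
```

Only the document written by bar survives the filter; the semantic search for "foo" would then run over that reduced set.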
Code implementation
Replace DASHSCOPE_API_KEY, DASHVECTOR_API_KEY, and DASHVECTOR_ENDPOINT with the values you obtained when activating the services on Alibaba Cloud.
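Since the script depends on all three settings, a small pre-flight check can save a confusing failure later. The following is a minimal sketch; `missing_settings` is a hypothetical helper name, and treating an empty string as "missing" is an assumption of this example rather than documented behavior of either service.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("DASHSCOPE_API_KEY", "DASHVECTOR_API_KEY", "DASHVECTOR_ENDPOINT")


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_missing = missing_settings({"DASHSCOPE_API_KEY": "sk-example"})
```

Calling `missing_settings()` with no argument checks the real process environment, so the keys can be exported in the shell instead of being written into the source file.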
import os

from langchain_core.documents import Document
from langchain_community.vectorstores.dashvector import DashVector
from langchain_community.embeddings.dashscope import DashScopeEmbeddings
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.chat_models.tongyi import ChatTongyi
from langchain_core.vectorstores import VectorStore


class SelfQuerying:
    def __init__(self):
        # Both DASHSCOPE_API_KEY and DASHVECTOR_API_KEY must be activated
        os.environ["DASHSCOPE_API_KEY"] = ""
        os.environ["DASHVECTOR_API_KEY"] = ""
        os.environ["DASHVECTOR_ENDPOINT"] = ""
        self.llm = ChatTongyi(temperature=0)

    def handle_embeddings(self) -> 'VectorStore':
        docs = [
            Document(
                page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
                metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
            ),
            Document(
                page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
                metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
            ),
            Document(
                page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
                metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
            ),
            Document(
                page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
                metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
            ),
            Document(
                page_content="Toys come alive and have a blast doing so",
                metadata={"year": 1995, "genre": "animated"},
            ),
            Document(
                page_content="Three men walk into the Zone, three men walk out of the Zone",
                metadata={
                    "year": 1979,
                    "director": "Andrei Tarkovsky",
                    "genre": "thriller",
                    "rating": 9.9,
                },
            ),
        ]
        # Specify the collection name in the vector database
        vectorstore = DashVector.from_documents(docs, DashScopeEmbeddings(), collection_name="langchain")
        return vectorstore

    def build_querying_retriever(self, vectorstore: 'VectorStore', enable_limit: bool = False) -> 'SelfQueryRetriever':
        """
        Build the self-querying retriever.
        :param vectorstore: the vector database
        :param enable_limit: whether to query only the top k results
        :return: the configured retriever
        """
        metadata_field_info = [
            AttributeInfo(
                name="genre",
                description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
                type="string",
            ),
            AttributeInfo(
                name="year",
                description="The year the movie was released",
                type="integer",
            ),
            AttributeInfo(
                name="director",
                description="The name of the movie director",
                type="string",
            ),
            AttributeInfo(name="rating", description="A 1-10 rating for the movie", type="float"),
        ]
        document_content_description = "Brief summary of a movie"
        retriever = SelfQueryRetriever.from_llm(
            self.llm,
            vectorstore,
            document_content_description,
            metadata_field_info,
            enable_limit=enable_limit,
        )
        return retriever

    def handle_query(self, query: str):
        """
        Return the retrieval results for the optimized query.
        :param query: the user's natural-language question
        :return: the list of matching documents
        """
        # Use the LLM to rewrite the query, then retrieve with the rewritten form
        retriever = self.build_querying_retriever(self.handle_embeddings())
        response = retriever.invoke(query)
        return response


if __name__ == '__main__':
    q = SelfQuerying()
    # Filter on a metadata attribute only
    print(q.handle_query("I want to watch a movie rated higher than 8.5"))
    # Filter on a metadata attribute plus semantic content
    print(q.handle_query("Has Greta Gerwig directed any movies about women"))
    # Composite filter
    print(q.handle_query("What's a highly rated (above 8.5) science fiction film?"))
    # Composite filter plus semantic content
    print(q.handle_query("What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"))
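The code above has three main steps:
- Run the embedding, storing the Documents together with their metadata in DashVector
- Build the self-querying retriever, which requires describing up front the metadata fields our documents support, plus a short description of the document content
- Run the queries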
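To make the filtering half of these steps concrete, the sketch below applies hand-written metadata conditions to the same six metadata records in plain Python. It is an illustration only: the `(attribute, comparator, value)` tuples and the helper names are hypothetical, the LLM and the vector store are left out entirely, and excluding documents that lack a field is an assumption of this sketch, not confirmed DashVector behavior.

```python
import operator

# Comparator names mapped to Python operators (a small, assumed subset).
COMPARATORS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}

# Metadata copied from the six example documents.
movies = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    {"year": 1995, "genre": "animated"},
    {"year": 1979, "director": "Andrei Tarkovsky", "genre": "thriller", "rating": 9.9},
]


def satisfies(metadata: dict, conditions: list) -> bool:
    """True if every (attribute, comparator, value) condition holds."""
    for attribute, comparator, value in conditions:
        if attribute not in metadata:
            return False  # assumption: a document without the field cannot match
        if not COMPARATORS[comparator](metadata[attribute], value):
            return False
    return True


def apply_filter(conditions: list) -> list:
    """Keep only the records that satisfy all conditions."""
    return [m for m in movies if satisfies(m, conditions)]


# "I want to watch a movie rated higher than 8.5"
high_rated = apply_filter([("rating", "gt", 8.5)])

# "a movie after 1990 but before 2005 ... preferably animated"
toys_era = apply_filter([("year", "gt", 1990), ("year", "lt", 2005), ("genre", "eq", "animated")])
```

Here `high_rated` contains the 2006 and 1979 films and `toys_era` contains only the 1995 one; in the real retriever the semantic part of the question is additionally used to rank whatever passes the filter.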
Running the code prints the following results:
# "I want to watch a movie rated higher than 8.5"
[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006})]
# "Has Greta Gerwig directed any movies about women"
[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'director': 'Greta Gerwig', 'rating': 8.3, 'year': 2019})]
# "What's a highly rated (above 8.5) science fiction film?"
[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'director': 'Greta Gerwig', 'rating': 8.3, 'year': 2019})]
# "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
[Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995})]
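Summary
This post showed how to use langchain's self-query to optimize vector retrieval, with Alibaba Cloud's DashVector and the DashScope LLM used for the code demonstration. Readers can activate both services and try it out for themselves.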