Retrieval-Augmented Generation (RAG)
This page describes how to use RAG with LLMariner.
Embedding API
If you just want to generate embeddings, you can use the Embedding API, which is compatible with the OpenAI API.
Here are examples using the llma CLI and curl:
llma embeddings create --model intfloat-e5-mistral-7b-instruct --input "sample text"
curl \
  --request POST \
  --header "Authorization: Bearer ${LLMARINER_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "sentence-transformers-all-MiniLM-L6-v2-f16",
    "input": "sample text"
  }' \
  http://localhost:8080/v1/embeddings
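The same endpoint can also be called from the OpenAI Python library. The sketch below is an illustration, not part of LLMariner itself: the model name and local endpoint are assumed from the examples above, and the cosine-similarity helper is included because comparing embeddings this way is a typical step in a RAG pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(texts, base_url="http://localhost:8080/v1", api_key="<LLMariner API key>"):
    """Return one embedding per input text via the OpenAI-compatible API.

    The endpoint, API key, and model name here are assumptions taken from
    the examples above; adjust them for your deployment.
    """
    # Imported here so the helper above stays usable without the SDK installed.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key=api_key)
    resp = client.embeddings.create(
        model="intfloat-e5-mistral-7b-instruct",
        input=texts,
    )
    return [d.embedding for d in resp.data]
```

For example, `cosine_similarity(*embed(["query", "document"]))` gives a score close to 1.0 for semantically similar texts.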
Vector Store API
The first step is to create a vector store and add files to it. Here is an example script using the OpenAI Python library:
from openai import OpenAI

client = OpenAI(
    base_url="<LLMariner Endpoint URL>",
    api_key="<LLMariner API key>",
)

filename = "llmariner_overview.txt"
with open(filename, "w") as fp:
    fp.write("LLMariner builds a software stack that provides LLM as a service. It provides the OpenAI-compatible API.")

file = client.files.create(
    file=open(filename, "rb"),
    purpose="assistants",
)
print("Uploaded file. ID=%s" % file.id)

vs = client.beta.vector_stores.create(
    name="Test vector store",
)
print("Created vector store. ID=%s" % vs.id)

vfs = client.beta.vector_stores.files.create(
    vector_store_id=vs.id,
    file_id=file.id,
)
print("Created vector store file. ID=%s" % vfs.id)
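Files are typically ingested asynchronously, so a file may not be searchable the moment `client.beta.vector_stores.files.create` returns. The polling helper below is a sketch: it assumes vector store file objects expose a `status` field that reaches `"completed"`, as in the OpenAI API, and the timeout and interval values are arbitrary.

```python
import time


def all_completed(statuses):
    """True once every file in the vector store has finished processing."""
    return bool(statuses) and all(s == "completed" for s in statuses)


def wait_for_ingestion(client, vector_store_id, timeout=60, interval=2):
    """Poll the Vector Store Files API until ingestion finishes or times out.

    `client` is an OpenAI client configured for the LLMariner endpoint,
    as in the script above.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        files = client.beta.vector_stores.files.list(vector_store_id=vector_store_id)
        if all_completed([f.status for f in files.data]):
            return True
        time.sleep(interval)
    return False
```

Calling `wait_for_ingestion(client, vs.id)` after the script above returns `True` once the uploaded file is ready for retrieval.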
Once the files are added to the vector store, you can run a chat completion request with the RAG model.
from openai import OpenAI

client = OpenAI(
    base_url="<Base URL (e.g., http://localhost:8080/v1)>",
    api_key="<API key secret>",
)

completion = client.chat.completions.create(
    model="google-gemma-2b-it-q4_0",
    messages=[
        {"role": "user", "content": "What is LLMariner?"},
    ],
    tool_choice={
        "choice": "auto",
        "type": "function",
        "function": {
            "name": "rag",
        },
    },
    tools=[
        {
            "type": "function",
            "function": {
                "name": "rag",
                "parameters": {
                    "vector_store_name": "Test vector store",
                },
            },
        },
    ],
    stream=True,
)

for response in completion:
    print(response.choices[0].delta.content, end="")
print("\n")
If you want to hit the API endpoint directly, you can use curl. Here is an example:
curl \
  --request POST \
  --header "Authorization: Bearer ${LLMARINER_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "google-gemma-2b-it-q4_0",
    "messages": [{"role": "user", "content": "What is LLMariner?"}],
    "tool_choice": {
      "choice": "auto",
      "type": "function",
      "function": {
        "name": "rag"
      }
    },
    "tools": [{
      "type": "function",
      "function": {
        "name": "rag",
        "parameters": "{\"vector_store_name\":\"Test vector store\"}"
      }
    }]
  }' \
  http://localhost:8080/v1/chat/completions