Features
1 - Inference with Open Models
You can run inference by using the OpenAI Python library, the `llma` CLI, or the API endpoint.
Chat Completion
Here is an example chat completion command with the `llma` CLI.
llma chat completions create --model google-gemma-2b-it-q4_0 --role user --completion "What is k8s?"
If you want to use the Python library, you first need to create an API key:
llma auth api-keys create <key name>
You can then pass the API key to initialize the OpenAI client and run the completion:
from openai import OpenAI

client = OpenAI(
    base_url="<Base URL (e.g., http://localhost:8080/v1)>",
    api_key="<API key secret>"
)

completion = client.chat.completions.create(
    model="google-gemma-2b-it-q4_0",
    messages=[
        {"role": "user", "content": "What is k8s?"}
    ],
    stream=True
)

for response in completion:
    print(response.choices[0].delta.content, end="")
print("\n")
You can also just call `client = OpenAI()` if you set the environment variables `OPENAI_BASE_URL` and `OPENAI_API_KEY`.
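For example, here is a minimal sketch that relies on those environment variables (the values below are placeholders for your own endpoint and key):

import os

from openai import OpenAI

# Placeholders: point these at your LLMariner endpoint and API key.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080/v1"
os.environ["OPENAI_API_KEY"] = "<API key secret>"

# With the environment variables set, no arguments are needed.
client = OpenAI()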
If you want to hit the API endpoint directly, you can use `curl`. Here is an example.
curl \
--request POST \
--header "Authorization: Bearer ${LLMARINER_TOKEN}" \
--header "Content-Type: application/json" \
--data '{"model": "google-gemma-2b-it-q4_0", "messages": [{"role": "user", "content": "What is k8s?"}]}' \
http://localhost:8080/v1/chat/completions
Please see the fine-tuning page if you want to generate a fine-tuned model and use it for chat completion.
Audio-to-Text
Note
Work-in-progress.
2 - Supported Open Models
Models that are Officially Supported
The following is a list of supported models that we have validated.
| Model | Quantizations | Supporting runtimes |
|---|---|---|
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | None | vLLM |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | AWQ | vLLM |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Base | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | Ollama |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | Ollama |
| deepseek-ai/deepseek-coder-6.7b-base | None | vLLM, Ollama |
| deepseek-ai/deepseek-coder-6.7b-base | AWQ | vLLM |
| deepseek-ai/deepseek-coder-6.7b-base | Q4_0 | vLLM, Ollama |
| fixie-ai/ultravox-v0_3 | None | vLLM |
| google/gemma-2b-it | None | Ollama |
| google/gemma-2b-it | Q4_0 | Ollama |
| intfloat/e5-mistral-7b-instruct | None | vLLM |
| meta-llama/Meta-Llama-3.3-70B-Instruct | AWQ, FP8-Dynamic | vLLM |
| meta-llama/Meta-Llama-3.1-70B-Instruct | AWQ | vLLM |
| meta-llama/Meta-Llama-3.1-70B-Instruct | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | vLLM, Ollama |
| meta-llama/Meta-Llama-3.1-8B-Instruct | None | vLLM |
| meta-llama/Meta-Llama-3.1-8B-Instruct | AWQ | vLLM, Triton |
| meta-llama/Meta-Llama-3.1-8B-Instruct | Q4_0 | vLLM, Ollama |
| nvidia/Llama-3.1-Nemotron-70B-Instruct | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | vLLM |
| nvidia/Llama-3.1-Nemotron-70B-Instruct | FP8-Dynamic | vLLM |
| mistralai/Mistral-7B-Instruct-v0.2 | Q4_0 | Ollama |
| sentence-transformers/all-MiniLM-L6-v2-f16 | None | Ollama |
Please note that some models work only with specific inference runtimes.
Using Other Models in HuggingFace
First, create a k8s secret that contains the HuggingFace API key.
kubectl create secret generic \
huggingface-key \
-n llmariner \
--from-literal=apiKey=${HUGGING_FACE_HUB_TOKEN}
The above command assumes that LLMariner runs in the `llmariner` namespace.
Then deploy LLMariner with the following `values.yaml`.
model-manager-loader:
  downloader:
    kind: huggingFace
    huggingFace:
      cacheDir: /tmp/.cache/huggingface/hub
  huggingFaceSecret:
    name: huggingface-key
    apiKeyKey: apiKey
  baseModels:
  - Qwen/Qwen2-7B
  - TheBloke/TinyLlama-1.1B-Chat-v1.0-AWQ
Then the model should be loaded by `model-manager-loader`. Once the loading completes, the model name should show up in the output of `llma models list`.
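Because the API is OpenAI-compatible, you can also verify the loaded models programmatically. Here is a minimal sketch (the endpoint URL and API key are placeholders):

from openai import OpenAI

client = OpenAI(
    base_url="<Base URL (e.g., http://localhost:8080/v1)>",
    api_key="<API key secret>"
)

# List every model currently served by LLMariner.
for model in client.models.list():
    print(model.id)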
3 - Retrieval-Augmented Generation (RAG)
An Example Flow
The first step is to create a vector store and create files in the vector store. Here is an example script with the OpenAI Python library:
from openai import OpenAI

client = OpenAI(
    base_url="<LLMariner Endpoint URL>",
    api_key="<LLMariner API key>"
)

filename = "llmariner_overview.txt"

with open(filename, "w") as fp:
    fp.write("LLMariner builds a software stack that provides LLM as a service. It provides the OpenAI-compatible API.")

file = client.files.create(
    file=open(filename, "rb"),
    purpose="assistants",
)
print("Uploaded file. ID=%s" % file.id)

vs = client.beta.vector_stores.create(
    name='Test vector store',
)
print("Created vector store. ID=%s" % vs.id)

vfs = client.beta.vector_stores.files.create(
    vector_store_id=vs.id,
    file_id=file.id,
)
print("Created vector store file. ID=%s" % vfs.id)
Once the files are added to the vector store, you can run the completion request with the RAG model.
from openai import OpenAI

client = OpenAI(
    base_url="<Base URL (e.g., http://localhost:8080/v1)>",
    api_key="<API key secret>"
)

completion = client.chat.completions.create(
    model="google-gemma-2b-it-q4_0",
    messages=[
        {"role": "user", "content": "What is LLMariner?"}
    ],
    tool_choice={
        "choice": "auto",
        "type": "function",
        "function": {
            "name": "rag"
        }
    },
    tools=[
        {
            "type": "function",
            "function": {
                "name": "rag",
                "parameters": "{\"vector_store_name\":\"Test vector store\"}"
            }
        }
    ],
    stream=True
)

for response in completion:
    print(response.choices[0].delta.content, end="")
print("\n")
If you want to hit the API endpoint directly, you can use `curl`. Here is an example.
curl \
--request POST \
--header "Authorization: Bearer ${LLMARINER_TOKEN}" \
--header "Content-Type: application/json" \
--data '{
"model": "google-gemma-2b-it-q4_0",
"messages": [{"role": "user", "content": "What is LLMariner?"}],
"tool_choice": {
"choice": "auto",
"type": "function",
"function": {
"name": "rag"
}
},
"tools": [{
"type": "function",
"function": {
"name": "rag",
"parameters": "{\"vector_store_name\":\"Test vector store\"}"
}
}]}' \
http://localhost:8080/v1/chat/completions
Embedding API
If you want to just generate embeddings, you can use the Embedding API, which is compatible with the OpenAI API.
Here are examples:
llma embeddings create --model intfloat-e5-mistral-7b-instruct --input "sample text"
curl \
--request POST \
--header "Authorization: Bearer ${LLMARINER_TOKEN}" \
--header "Content-Type: application/json" \
--data '{
"model": "sentence-transformers-all-MiniLM-L6-v2-f16",
"input": ""sample text,
}' \
http://localhost:8080/v1/embeddings
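Because the Embedding API is OpenAI-compatible, you can also generate embeddings with the OpenAI Python library. Here is a minimal sketch (the endpoint URL and API key are placeholders):

from openai import OpenAI

client = OpenAI(
    base_url="<Base URL (e.g., http://localhost:8080/v1)>",
    api_key="<API key secret>"
)

resp = client.embeddings.create(
    model="intfloat-e5-mistral-7b-instruct",
    input="sample text",
)
# Each input string produces one embedding vector.
print(len(resp.data[0].embedding))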
4 - Model Fine-tuning
Submitting a Fine-Tuning Job
You can use the OpenAI Python library to submit a fine-tuning job. Here is an example snippet that uploads a training file and uses that to run a fine-tuning job.
from openai import OpenAI

client = OpenAI(
    base_url="<LLMariner Endpoint URL>",
    api_key="<LLMariner API key>"
)

# Placeholder: path to your training data file.
training_filename = "my-training-data.jsonl"

file = client.files.create(
    file=open(training_filename, "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    model="google-gemma-2b-it",
    suffix="fine-tuning",
    training_file=file.id,
)
print('Created job. ID=%s' % job.id)
Once a fine-tuning job is submitted, a k8s Job is created. The Job runs in the namespace associated with the user's project.
You can check the status of the job with the Python script or the `llma` CLI.
print(client.fine_tuning.jobs.list())
llma fine-tuning jobs list
llma fine-tuning jobs get <job-id>
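If you prefer to wait for completion programmatically, here is a minimal polling sketch. It assumes the `client` and `job` objects from the snippet above, and the status values follow the OpenAI fine-tuning API:

import time

# Poll the job until it reaches a terminal state.
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print("status: %s" % job.status)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(30)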
Once the job completes, you can check the generated models.
fine_tuned_model = client.fine_tuning.jobs.list().data[0].fine_tuned_model
print(fine_tuned_model)
Then you can get the model ID and use that for the chat completion request.
completion = client.chat.completions.create(
    model=fine_tuned_model,
    ...
Debugging a Fine-Tuning Job
You can use the `llma` CLI to check the logs and exec into the pod.
llma fine-tuning jobs logs <job-id>
llma fine-tuning jobs exec <job-id>
Managing Quota
LLMariner allows users to manage GPU quotas through its integration with Kueue.
You can install Kueue with the following command:
export VERSION=v0.6.2
kubectl apply -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml
Once the install completes, you should see `kueue-controller-manager` in the `kueue-system` namespace.
$ kubectl get po -n kueue-system
NAME READY STATUS RESTARTS AGE
kueue-controller-manager-568995d897-bzxg6 2/2 Running 0 161m
You can then define `ResourceFlavor`, `ClusterQueue`, and `LocalQueue` to manage quota. For example, when you want to allocate 10 GPUs to `team-a`, whose project namespace is `team-a-ns`, you can define `ClusterQueue` and `LocalQueue` as follows (the example assumes a `ResourceFlavor` named `gpu-flavor` has already been defined):
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  namespaceSelector: {} # match all.
  cohort: org-x
  resourceGroups:
  - coveredResources: [gpu]
    flavors:
    - name: gpu-flavor
      resources:
      - name: gpu
        nominalQuota: 10
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: team-a-ns
  name: team-a-queue
spec:
  clusterQueue: team-a
5 - General-purpose Training
Creating a Training Job
You can create a training job from local PyTorch code by running the following command.
llma batch jobs create \
--image="pytorch-2.1" \
--from-file=my-pytorch-script.py \
--from-file=requirements.txt \
--file-id=<file-id> \
--command "python -u /scripts/my-pytorch-script.py"
Once a training job is created, a k8s Job is created. The Job runs the command specified in the `--command` flag, and files specified in the `--from-file` flag are mounted to the /scripts directory in the container. If you specify the optional `--file-id` flag, the file will be downloaded to the /data directory in the container.
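As an illustration, here is a hypothetical /scripts/my-pytorch-script.py that inspects the mounted directories; the exact layout under /data depends on the file you uploaded, so treat this as a sketch:

import os

SCRIPTS_DIR = "/scripts"  # files passed via --from-file are mounted here
DATA_DIR = "/data"        # the file passed via --file-id is downloaded here

# Print what was actually placed in each directory before relying on specific filenames.
for directory in (SCRIPTS_DIR, DATA_DIR):
    if os.path.isdir(directory):
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            print("%s (%d bytes)" % (path, os.path.getsize(path)))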
You can check the status of the job by running the following command.
llma batch jobs list
llma batch jobs get <job-id>
Debugging a Training Job
You can use the `llma` CLI to check the logs of a training job.
llma batch jobs logs <job-id>
PyTorch Distributed Data Parallel
LLMariner supports PyTorch Distributed Data Parallel (DDP) training. You can run a DDP training job by specifying the number of per-node GPUs and the number of workers with the `--gpu` and `--workers` flags, respectively.
llma batch jobs create \
--image="pytorch-2.1" \
--from-file=my-pytorch-ddp-script.py \
--gpu=1 \
--workers=3 \
--command "python -u /scripts/my-pytorch-ddp-script.py"
The created training job is pre-configured with the standard DDP environment variables: MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK.
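For reference, here is a minimal sketch of what a hypothetical my-pytorch-ddp-script.py could look like; it relies only on the standard torch.distributed environment-variable initialization, which reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK:

import torch
import torch.distributed as dist

def main():
    # The default "env://" init method reads MASTER_ADDR, MASTER_PORT,
    # WORLD_SIZE, and RANK that the training job pre-configures.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    print("worker %d/%d initialized" % (dist.get_rank(), dist.get_world_size()))
    # ... build the model, wrap it in torch.nn.parallel.DistributedDataParallel, and train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()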
6 - Jupyter Notebook
Creating a Jupyter Notebook
To create a Jupyter Notebook, run:
llma workspace notebooks create my-notebook
By default, there is no GPU allocated to the Jupyter Notebook. If you want to allocate a GPU to the Jupyter Notebook, run:
llma workspace notebooks create my-gpu-notebook --gpu 1
There are other options that you can specify when creating a Jupyter Notebook, such as the environment. You can see the list of options by using the `--help` flag.
Once the Jupyter Notebook is created, you can access it by running:
# Open the Jupyter Notebook in your browser
llma workspace notebooks open my-notebook
Stopping and Restarting a Jupyter Notebook
To stop a Jupyter Notebook, run:
llma workspace notebooks stop my-notebook
To restart a Jupyter Notebook, run:
llma workspace notebooks start my-notebook
You can check the current status of the Jupyter Notebook by running:
llma workspace notebooks list
llma workspace notebooks get my-notebook
OpenAI API Integration
Jupyter Notebooks can be integrated with the OpenAI API. A created Jupyter Notebook is pre-configured with the OpenAI API URL and API key. All you need to do is install the `openai` package.
To install the `openai` package, run the following command in the Jupyter Notebook terminal:
pip install openai
Now, you can use the OpenAI API in the Jupyter Notebook. Here is an example of using OpenAI API in the Jupyter Notebook:
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="google-gemma-2b-it-q4_0",
    messages=[
        {"role": "user", "content": "What is k8s?"}
    ],
    stream=True
)

for response in completion:
    print(response.choices[0].delta.content, end="")
print("\n")
Note
By default, the pre-configured API key is a JWT and it can expire. You can also pass your own API key to the `OpenAI` client.
7 - API and GPU Usage Optimization
Note
Work-in-progress.
API Usage Visibility
Inference Request Rate-limiting
Optimize GPU Utilization
Auto-scaling of Inference Runtimes
Scheduled Scale Up and Down of Inference Runtimes
8 - User Management
LLMariner installs Dex by default. Dex is an identity service that uses OpenID Connect for authentication.
The Helm chart for Dex is located at https://github.com/llmariner/rbac-manager/tree/main/deployments/dex-server. It uses a built-in local connector and has the following configuration by default:
staticPasswords:
- userID: 08a8684b-db88-4b73-90a9-3cd1661f5466
  username: admin
  email: admin@example.com
  # bcrypt hash of the string "password": $(echo password | htpasswd -BinC 10 admin | cut -d: -f2)
  hash: "$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
You can switch a connector to an IdP in your environment (e.g., LDAP, GitHub). Here is an example connector configuration with Okta:
global:
  auth:
    oidcIssuerUrl: https://<LLMariner endpoint URL>/v1/dex

dex-server:
  oauth2:
    passwordConnector:
      enable: false
    responseTypes:
    - code
  connectors:
  - type: oidc
    id: okta
    name: okta
    config:
      issuer: <Okta issuer URL>
      clientID: <Client ID of an Okta application>
      clientSecret: <Client secret of an Okta application>
      redirectURI: https://<LLMariner endpoint URL>/v1/dex/callback
      insecureSkipEmailVerified: true
  enablePasswordDb: false
  staticPassword:
    enable: false
Please refer to the Dex documentation for more details.
The Helm chart for Dex creates an ingress so that HTTP requests to `/v1/dex` are routed to Dex. This endpoint URL works as the OIDC issuer URL that the CLI and backend servers use.
9 - Access Control with Organizations and Projects
Overview
Basic Concepts
LLMariner provides access control with two concepts: Organizations and Projects. The basic concept follows the OpenAI API.
You can define one or more organizations. Within each organization, you can define one or more projects. For example, you can create an organization for each team in your company, and each team can create individual projects based on their needs.
A project controls the visibility of resources such as models and fine-tuning jobs. For example, a model that is generated by a fine-tuning job in project `P` is only visible to members of project `P`.
A project is also associated with a Kubernetes namespace. Fine-tuning jobs for project `P` run in the Kubernetes namespace associated with `P` (and quota management is applied).
Roles
Each user has an organization role and a project role. These roles control which resources a user can access and which actions the user can take.
An organization role is either `owner` or `reader`. A project role is either `owner` or `member`. If you want to allow a user to use LLMariner without any organization/project management privilege, you can grant the organization role `reader` and the project role `member`. If you want to allow a user to manage the project, you can grant the project role `owner`.
Here is a diagram that shows an example role assignment.
The following summarizes how these roles implement the access control:

- A user can access resources in project `P` in organization `O` if the user is a `member` of `P`, an `owner` of `P`, or an `owner` of `O`.
- A user can manage project `P` (e.g., add a new member) in organization `O` if the user is an `owner` of `P` or an `owner` of `O`.
- A user can manage organization `O` (e.g., add a new member) if the user is an `owner` of `O`.
- A user can create a new organization if the user is an `owner` of the initial organization that is created by default.
Please note that a user who has the `reader` organization role cannot access resources in the organization unless the user is added to a project in the organization.
Creating Organizations and Projects
You can use the `llma` CLI to create a new organization and a project.
Creating a new Organization
You can run the following command to create a new organization.
llma admin organizations create <organization title>
Note
You can also type `llma admin orgs` instead of `llma admin organizations`.
You can confirm that the new organization is created by running:
llma admin organizations list
Then you can add a user member to the organization.
llma admin organizations add-member <organization title> --email <email-address of the member> --role <role>
The role can be either `owner` or `reader`.
You can confirm organization members by running:
llma admin organizations list-members <organization title>
Creating a new Project
You can take a similar flow to create a new project. To create a new project, run:
llma admin projects create --title <project title> --organization-title <organization title>
To confirm the project is created, run:
llma admin projects list
Then you can add a user member to the project.
llma admin projects add-member <project title> --email <email-address of the member> --role <role>
The role can be either `owner` or `member`.
You can confirm project members by running:
llma admin projects list-members --title <project title> --organization-title <organization title>
If you want to manage a project in a different organization, you can pass `--organization-title <title>` in each command. Otherwise, the organization in the current context is used. You can also change the current context by running:
llma context set
Choosing an Organization and a Project
You can use `llma context set` to set the current context.
llma context set
Then the selected context is applied to CLI commands (e.g., `llma models list`).
When you create a new API key, the key will be associated with the project in the current context. Suppose that a user runs the following commands:
llma context set # Choose project my-project
llma auth api-keys create my-key
The newly created API key is associated with project `my-project`.