1 - Inference with Open Models

Users can run chat completion with open models such as Google Gemma, Llama, and Mistral. To run chat completion, you can use the OpenAI Python library, the llma CLI, or the API endpoint directly.

Chat Completion

Here is an example chat completion command with the llma CLI.

llma chat completions create --model google-gemma-2b-it-q4_0 --role user --completion "What is k8s?"

If you want to use the Python library, you first need to create an API key:

llma auth api-keys create <key name>

You can then pass the API key to initialize the OpenAI client and run the completion:

from openai import OpenAI

client = OpenAI(
  base_url="<Base URL (e.g., http://localhost:8080/v1)>",
  api_key="<API key secret>"
)

completion = client.chat.completions.create(
  model="google-gemma-2b-it-q4_0",
  messages=[
    {"role": "user", "content": "What is k8s?"}
  ],
  stream=True
)
for response in completion:
  print(response.choices[0].delta.content, end="")
print("\n")

You can also just call `client = OpenAI()` if you set the environment variables `OPENAI_BASE_URL` and `OPENAI_API_KEY`.
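
For example, here is a minimal sketch (the endpoint URL and key below are placeholders for your deployment):

import os

from openai import OpenAI

# Placeholders: point these at your LLMariner endpoint and API key.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080/v1"
os.environ["OPENAI_API_KEY"] = "<API key secret>"

# With the environment variables set, the client needs no arguments.
client = OpenAI()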

If you want to hit the API endpoint directly, you can use curl. Here is an example.

curl \
  --request POST \
  --header "Authorization: Bearer ${LLMARINER_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{"model": "google-gemma-2b-it-q4_0", "messages": [{"role": "user", "content": "What is k8s?"}]}' \
  http://localhost:8080/v1/chat/completions

Please see the fine-tuning page if you want to generate a fine-tuned model and use it for chat completion.

Audio-to-Text
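
As a rough sketch, assuming the deployment exposes an OpenAI-compatible audio transcription endpoint and a speech-to-text model is deployed (the model ID and audio file below are hypothetical), a request could look like this:

from openai import OpenAI

client = OpenAI(
  base_url="<Base URL (e.g., http://localhost:8080/v1)>",
  api_key="<API key secret>"
)

# Hypothetical model ID and local audio file; replace them with a model that
# is actually deployed in your environment.
with open("sample.wav", "rb") as audio_file:
  transcription = client.audio.transcriptions.create(
    model="<speech-to-text model ID>",
    file=audio_file,
  )
print(transcription.text)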

2 - Supported Open Models

The following shows the supported models.

Models that are Officially Supported

The following is a list of supported models that we have validated.

Model | Quantizations | Supporting runtimes
TinyLlama/TinyLlama-1.1B-Chat-v1.0 | None | vLLM
TinyLlama/TinyLlama-1.1B-Chat-v1.0 | AWQ | vLLM
deepseek-ai/DeepSeek-Coder-V2-Lite-Base | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | Ollama
deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | Ollama
deepseek-ai/deepseek-coder-6.7b-base | None | vLLM, Ollama
deepseek-ai/deepseek-coder-6.7b-base | AWQ | vLLM
deepseek-ai/deepseek-coder-6.7b-base | Q4_0 | vLLM, Ollama
fixie-ai/ultravox-v0_3 | None | vLLM
google/gemma-2b-it | None | Ollama
google/gemma-2b-it | Q4_0 | Ollama
intfloat/e5-mistral-7b-instruct | None | vLLM
meta-llama/Meta-Llama-3.3-70B-Instruct | AWQ, FP8-Dynamic | vLLM
meta-llama/Meta-Llama-3.1-70B-Instruct | AWQ | vLLM
meta-llama/Meta-Llama-3.1-70B-Instruct | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | vLLM, Ollama
meta-llama/Meta-Llama-3.1-8B-Instruct | None | vLLM
meta-llama/Meta-Llama-3.1-8B-Instruct | AWQ | vLLM, Triton
meta-llama/Meta-Llama-3.1-8B-Instruct | Q4_0 | vLLM, Ollama
nvidia/Llama-3.1-Nemotron-70B-Instruct | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | vLLM
nvidia/Llama-3.1-Nemotron-70B-Instruct | FP8-Dynamic | vLLM
mistralai/Mistral-7B-Instruct-v0.2 | Q4_0 | Ollama
sentence-transformers/all-MiniLM-L6-v2-f16 | None | Ollama

Please note that some models work only with specific inference runtimes.

Using Other Models in HuggingFace

First, create a k8s secret that contains the HuggingFace API key.

kubectl create secret generic \
  huggingface-key \
  -n llmariner \
  --from-literal=apiKey=${HUGGING_FACE_HUB_TOKEN}

The above command assumes that LLMariner runs in the llmariner namespace.

Then deploy LLMariner with the following values.yaml.

model-manager-loader:
  downloader:
    kind: huggingFace
    huggingFace:
      cacheDir: /tmp/.cache/huggingface/hub
  huggingFaceSecret:
    name: huggingface-key
    apiKeyKey: apiKey

  baseModels:
  - Qwen/Qwen2-7B
  - TheBloke/TinyLlama-1.1B-Chat-v1.0-AWQ

Then the model should be loaded by model-manager-loader. Once the loading completes, the model name should show up in the output of llma models list.
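
Once the model shows up, you can use it in the same way as the officially supported models. For example, with the llma CLI (the model ID below is a placeholder; use the ID reported by llma models list):

llma models list
llma chat completions create --model <model ID> --role user --completion "What is k8s?"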

3 - Retrieval-Augmented Generation (RAG)

This page describes how to use RAG with LLMariner.

An Example Flow

The first step is to create a vector store and create files in the vector store. Here is an example script with the OpenAI Python library:

from openai import OpenAI

client = OpenAI(
  base_url="<LLMariner Endpoint URL>",
  api_key="<LLMariner API key>"
)

filename = "llmariner_overview.txt"
with open(filename, "w") as fp:
  fp.write("LLMariner builds a software stack that provides LLM as a service. It provides the OpenAI-compatible API.")
file = client.files.create(
  file=open(filename, "rb"),
  purpose="assistants",
)
print("Uploaded file. ID=%s" % file.id)

vs = client.beta.vector_stores.create(
  name='Test vector store',
)
print("Created vector store. ID=%s" % vs.id)

vfs = client.beta.vector_stores.files.create(
  vector_store_id=vs.id,
  file_id=file.id,
)
print("Created vector store file. ID=%s" % vfs.id)

Once the files are added to the vector store, you can run a completion request with the RAG model.

from openai import OpenAI

client = OpenAI(
  base_url="<Base URL (e.g., http://localhost:8080/v1)>",
  api_key="<API key secret>"
)

completion = client.chat.completions.create(
  model="google-gemma-2b-it-q4_0",
  messages=[
    {"role": "user", "content": "What is LLMariner?"}
  ],
  tool_choice={
    "choice": "auto",
    "type": "function",
    "function": {
      "name": "rag"
    }
  },
  tools=[
    {
      "type": "function",
      "function": {
        "name": "rag",
        "parameters": "{\"vector_store_name\":\"Test vector store\"}"
      }
    }
  ],
  stream=True
)
for response in completion:
  print(response.choices[0].delta.content, end="")
print("\n")

If you want to hit the API endpoint directly, you can use curl. Here is an example.

curl \
  --request POST \
  --header "Authorization: Bearer ${LLMARINER_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "google-gemma-2b-it-q4_0",
    "messages": [{"role": "user", "content": "What is LLMariner?"}],
    "tool_choice": {
      "choice": "auto",
      "type": "function",
      "function": {
        "name": "rag"
      }
    },
    "tools": [{
      "type": "function",
      "function": {
        "name": "rag",
        "parameters": "{\"vector_store_name\":\"Test vector store\"}"
      }
    }]
  }' \
  http://localhost:8080/v1/chat/completions

Embedding API

If you want to just generate embeddings, you can use the Embedding API, which is compatible with the OpenAI API.

Here are examples:

llma embeddings create --model intfloat-e5-mistral-7b-instruct --input "sample text"


curl \
  --request POST \
  --header "Authorization: Bearer ${LLMARINER_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "sentence-transformers-all-MiniLM-L6-v2-f16",
    "input": "sample text"
  }' \
  http://localhost:8080/v1/embeddings
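
You can also use the OpenAI Python library, since the Embedding API is OpenAI-compatible. Here is a minimal sketch (the base URL and API key are placeholders):

from openai import OpenAI

client = OpenAI(
  base_url="<Base URL (e.g., http://localhost:8080/v1)>",
  api_key="<API key secret>"
)

embedding = client.embeddings.create(
  model="intfloat-e5-mistral-7b-instruct",
  input="sample text",
)
# Each item in data contains an embedding vector; print its dimensionality.
print(len(embedding.data[0].embedding))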

4 - Model Fine-tuning

This page describes how to fine-tune models with LLMariner.

Submitting a Fine-Tuning Job

You can use the OpenAI Python library to submit a fine-tuning job. Here is an example snippet that uploads a training file and uses that to run a fine-tuning job.

from openai import OpenAI

client = OpenAI(
  base_url="<LLMariner Endpoint URL>",
  api_key="<LLMariner API key>"
)

file = client.files.create(
  file=open(training_filename, "rb"),
  purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
  model="google-gemma-2b-it",
  suffix="fine-tuning",
  training_file=file.id,
)
print('Created job. ID=%s' % job.id)

Once a fine-tuning job is submitted, a k8s Job is created. The Job runs in the namespace associated with the user's project.

You can check the status of the job with the OpenAI Python library or the llma CLI.

print(client.fine_tuning.jobs.list())
llma fine-tuning jobs list
llma fine-tuning jobs get <job-id>

Once the job completes, you can check the generated models.

fine_tuned_model = client.fine_tuning.jobs.list().data[0].fine_tuned_model
print(fine_tuned_model)

Then you can get the model ID and use that for the chat completion request.

completion = client.chat.completions.create(
  model=fine_tuned_model,
  ...

Debugging a Fine-Tuning Job

You can use the llma CLI to check the logs and exec into the pod.

llma fine-tuning jobs logs <job-id>
llma fine-tuning jobs exec <job-id>

Managing Quota

LLMariner allows users to manage GPU quotas through integration with Kueue.

You can install Kueue with the following command:

export VERSION=v0.6.2
kubectl apply -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml

Once the install completes, you should see kueue-controller-manager in the kueue-system namespace.

$ kubectl get po -n kueue-system
NAME                                        READY   STATUS    RESTARTS   AGE
kueue-controller-manager-568995d897-bzxg6   2/2     Running   0          161m

You can then define ResourceFlavor, ClusterQueue, and LocalQueue to manage quota. For example, when you want to allocate 10 GPUs to team-a whose project namespace is team-a-ns, you can define ClusterQueue and LocalQueue as follows:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  namespaceSelector: {} # match all.
  cohort: org-x
  resourceGroups:
  - coveredResources: [gpu]
    flavors:
    - name: gpu-flavor
      resources:
      - name: gpu
        nominalQuota: 10
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: team-a-ns
  name: team-a-queue
spec:
  clusterQueue: team-a
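
The gpu-flavor referenced by the ClusterQueue above also needs to exist as a ResourceFlavor. Here is a minimal sketch (the node label is hypothetical; adjust it to match your GPU nodes, or omit nodeLabels entirely):

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor
spec:
  nodeLabels:
    # Hypothetical label; use whatever label identifies your GPU nodes.
    example.com/gpu: "true"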

5 - General-purpose Training

LLMariner allows users to run general-purpose training jobs in their Kubernetes clusters.

Creating a Training Job

You can create a training job from local PyTorch code by running the following command.

llma batch jobs create \
  --image="pytorch-2.1" \
  --from-file=my-pytorch-script.py \
  --from-file=requirements.txt \
  --file-id=<file-id> \
  --command "python -u /scripts/my-pytorch-script.py"

Once a training job is created, a k8s Job is created. The job runs the command specified in the --command flag, and the files specified in the --from-file flag are mounted to the /scripts directory in the container. If you specify the optional --file-id flag, the file is downloaded to the /data directory in the container.
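
For illustration, here is a minimal sketch of what my-pytorch-script.py could look like (the script itself is hypothetical; the /scripts and /data paths follow the mounting behavior described above):

# my-pytorch-script.py: mounted under /scripts via --from-file.
import os

import torch

print("CUDA available:", torch.cuda.is_available())

# A file passed with --file-id (if any) is downloaded under /data.
if os.path.isdir("/data"):
  print("Files in /data:", os.listdir("/data"))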

You can check the status of the job by running the following command.

llma batch jobs list
llma batch jobs get <job-id>

Debugging a Training Job

You can use the llma CLI to check the logs of a training job.

llma batch jobs logs <job-id>

PyTorch Distributed Data Parallel

LLMariner supports PyTorch Distributed Data Parallel (DDP) training. You can run a DDP training job by specifying the number of per-node GPUs and the number of workers in the --gpu and --workers flags, respectively.

llma batch jobs create \
  --image="pytorch-2.1" \
  --from-file=my-pytorch-ddp-script.py \
  --gpu=1 \
  --workers=3 \
  --command "python -u /scripts/my-pytorch-ddp-script.py"

The created training job is pre-configured with the DDP environment variables MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK.
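
For illustration, here is a minimal sketch of how a training script can pick up those variables (the NCCL backend is an assumption for GPU training):

import torch.distributed as dist

# The default env:// init method reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE,
# and RANK from the environment, which the training job pre-configures.
dist.init_process_group(backend="nccl")
print("rank %d of %d" % (dist.get_rank(), dist.get_world_size()))

# ... wrap your model with torch.nn.parallel.DistributedDataParallel here ...

dist.destroy_process_group()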

6 - Jupyter Notebook

LLMariner allows users to run a Jupyter Notebook in a Kubernetes cluster. This functionality is useful when users want to run ad-hoc Python scripts that require GPUs.

Creating a Jupyter Notebook

To create a Jupyter Notebook, run:

llma workspace notebooks create my-notebook

By default, there is no GPU allocated to the Jupyter Notebook. If you want to allocate a GPU to the Jupyter Notebook, run:

llma workspace notebooks create my-gpu-notebook --gpu 1

There are other options that you can specify when creating a Jupyter Notebook, such as environment. You can see the list of options by using the --help flag.

Once the Jupyter Notebook is created, you can access it by running:

# Open the Jupyter Notebook in your browser
llma workspace notebooks open my-notebook

Stopping and Restarting a Jupyter Notebook

To stop a Jupyter Notebook, run:

llma workspace notebooks stop my-notebook

To restart a Jupyter Notebook, run:

llma workspace notebooks start my-notebook

You can check the current status of the Jupyter Notebook by running:

llma workspace notebooks list
llma workspace notebooks get my-notebook

OpenAI API Integration

A Jupyter Notebook can be integrated with the OpenAI API. The created Jupyter Notebook is pre-configured with the OpenAI API URL and API key, so all you need to do is install the openai package.

To install the openai package, run the following command in the Jupyter Notebook terminal:

pip install openai

Now you can use the OpenAI API in the Jupyter Notebook. Here is an example:

from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
  model="google-gemma-2b-it-q4_0",
  messages=[
    {"role": "user", "content": "What is k8s?"}
  ],
  stream=True
)
for response in completion:
  print(response.choices[0].delta.content, end="")
print("\n")

7 - API and GPU Usage Optimization

API Usage Visibility

Inference Request Rate-limiting

Optimize GPU Utilization

Auto-scaling of Inference Runtimes

Scheduled Scale Up and Down of Inference Runtimes

8 - User Management

This page describes how to manage users.

LLMariner installs Dex by default. Dex is an identity service that uses OpenID Connect for authentication.

The Helm chart for Dex is located at https://github.com/llmariner/rbac-manager/tree/main/deployments/dex-server. It uses a built-in local connector and has the following configuration by default:

staticPasswords:
- userID: 08a8684b-db88-4b73-90a9-3cd1661f5466
  username: admin
  email: admin@example.com
  # bcrypt hash of the string "password": $(echo password | htpasswd -BinC 10 admin | cut -d: -f2)
  hash: "$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"

You can switch a connector to an IdP in your environment (e.g., LDAP, GitHub). Here is an example connector configuration with Okta:

global:
  auth:
    oidcIssuerUrl: https://<LLMariner endpoint URL>/v1/dex

dex-server:
  oauth2:
    passwordConnector:
      enable: false
    responseTypes:
    - code
  connectors:
  - type: oidc
    id: okta
    name: okta
    config:
      issuer: <Okta issuer URL>
      clientID: <Client ID of an Okta application>
      clientSecret: <Client secret of an Okta application>
      redirectURI: https://<LLMariner endpoint URL>/v1/dex/callback
      insecureSkipEmailVerified: true
  enablePasswordDb: false
  staticPassword:
    enable: false

Please refer to the Dex documentation for more details.

The Helm chart for Dex creates an Ingress so that HTTP requests to /v1/dex are routed to Dex. This endpoint URL works as the OIDC issuer URL that the CLI and backend servers use.

9 - Access Control with Organizations and Projects

This page describes how to configure access control using organizations and projects.

Overview

Basic Concepts

LLMariner provides access control with two concepts: Organizations and Projects. The basic concepts follow the OpenAI API.

You can define one or more organizations, and within each organization you can define one or more projects. For example, you can create an organization for each team in your company, and each team can create individual projects based on their needs.

A project controls the visibility of resources such as models and fine-tuning jobs. For example, a model that is generated by a fine-tuning job in project P is visible only to members of project P.

A project is also associated with a Kubernetes namespace. Fine-tuning jobs for project P run in the Kubernetes namespace associated with P (and quota management is applied).

Roles

Each user has an organization role and a project role, and these roles control resources that a user can access and actions that a user can take.

An organization role is either owner or reader. A project role is either owner or member. If you want to allow a user to use LLMariner without any organization/project management privilege, you can grant the organization role reader and the project role member. If you want to allow a user to manage the project, you can grant the project role owner.

Here is a diagram that shows an example role assignment.

The following summarizes how these roles implement access control:

  • A user can access resources in project P in organization O if the user is a member of P, an owner of P, or an owner of O.
  • A user can manage project P (e.g., add a new member) in organization O if the user is an owner of P or an owner of O.
  • A user can manage organization O (e.g., add a new member) if the user is an owner of O.
  • A user can create a new organization if the user is an owner of the initial organization that is created by default.

Please note that a user who has the reader organization role cannot access resources in the organization unless the user is added to a project in the organization.

Creating Organizations and Projects

You can use the llma CLI to create a new organization and a project.

Creating a new Organization

You can run the following command to create a new organization.

llma admin organizations create <organization title>

You can confirm that the new organization is created by running:

llma admin organizations list

Then you can add a user member to the organization.

llma admin organizations add-member <organization title> --email <email-address of the member> --role <role>

The role can be either owner or reader.

You can confirm organization members by running:

llma admin organizations list-members <organization title>

Creating a new Project

You can take a similar flow to create a new project. To create a new project, run:

llma admin projects create --title <project title> --organization-title <organization title>

To confirm the project is created, run:

llma admin projects list

Then you can add a user member to the project.

llma admin projects add-member <project title> --email <email-address of the member> --role <role>

The role can be either owner or member.

You can confirm project members by running:

llma admin projects list-members --title <project title> --organization-title <organization title>

If you want to manage a project in a different organization, you can pass --organization-title <title> in each command. Otherwise, the organization in the current context is used. You can also change the current context by running:

llma context set

Choosing an Organization and a Project

You can use llma context set to set the current context.

llma context set

Then the selected context is applied to CLI commands (e.g., llma models list).

When you create a new API key, the key will be associated with the project in the current context. Suppose that a user runs the following commands:

llma context set # Choose project my-project
llma auth api-keys create my-key

The newly created API key is associated with project my-project.