# Model Loading

## Overview
LLMariner hosts LLMs in a Kubernetes cluster by downloading models from source repositories and uploading them to an S3-compatible object store. The following source model repositories are supported:
- LLMariner official model repository
- Hugging Face repositories
- Ollama repositories
- S3 bucket
If you already know which models you would like to download, you can specify them
in values.yaml. Here is an example configuration where two models are downloaded from
the LLMariner official model repository.

```yaml
model-manager-loader:
  baseModels:
  - google/gemma-2b-it-q4_0
  - sentence-transformers/all-MiniLM-L6-v2-f16
```
You can always update values.yaml and upgrade the Helm chart to download additional models.
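For example, an upgrade typically looks like the following. This is a minimal sketch: the release name, chart reference, and namespace are assumptions, so substitute the values you used at install time.

```bash
# Re-apply values.yaml so model-manager-loader picks up the new baseModels entries.
# Release name, chart reference, and namespace below are assumptions; use your own.
helm upgrade llmariner llmariner/llmariner \
  -n llmariner \
  -f values.yaml
```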
You can also download additional models by running llma models (base|fine-tuned) create. For example, the following command
downloads deepseek-r1:1.5b from the Ollama repository.

```bash
llma models create base deepseek-r1:1.5b --source-repository ollama
```
You can check the status of the download with:
```bash
llma models list
```
Once the download has been completed, you can activate the model. The activated model becomes ready
for inference once inference-manager-engine loads the model.
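For example, activation can look like the following. This is a sketch that assumes your version of the llma CLI exposes an activate subcommand; run llma models --help to confirm the exact syntax and the model ID to use.

```bash
# Activate the downloaded model so inference-manager-engine loads it.
# The activate subcommand and the model ID are assumptions; verify with llma models --help.
llma models activate deepseek-r1:1.5b
```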
Note: To download models from Hugging Face, you need additional configuration to provide the Hugging Face API key to model-manager-loader. Please see the Hugging Face Repositories section below for details.

## Model Configuration
There are two ways to configure model deployment: the Helm chart and the model API.
In the Helm chart, you can put your own configuration under inference-manager-engine.model to
control GPU allocation, extra flags passed to the runtime, the number of replicas, and so on. Here is an example:
```yaml
inference-manager-engine:
  model:
    default:
      runtimeName: vllm
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
    overrides:
      meta-llama/Llama-3.2-1B-Instruct:
        vllmExtraFlags:
        - --enable-lora
        - --max-lora-rank
        - "64"
      openai/gpt-oss-120b:
        image: vllm/vllm-openai:gptoss
        replicas: 2
```
Please see https://artifacthub.io/packages/helm/inference-manager-engine/inference-manager-engine?modal=values for details.
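To review the available settings locally, you can also dump the chart's default values. This assumes you have added the LLMariner Helm repository under the alias llmariner; adjust the chart reference to match your setup.

```bash
# Print the default values of the inference-manager-engine chart.
# The repository alias "llmariner" is an assumption; use the alias you configured.
helm show values llmariner/inference-manager-engine
```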
To use the model API to configure model deployment, you first
need to set inference-manager-engine.model.enableOverrideWithModelConfig to true.
Then you can specify the deployment configuration when running llma models (base|fine-tuned) create or llma models update. For example,
the following command will deploy four replicas of an inference
runtime to serve deepseek-r1:1.5b. Two GPUs are allocated to each
replica.
```bash
llma models create base deepseek-r1:1.5b \
  --source-repository ollama \
  --replicas 4 \
  --gpu 2
```
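You can adjust an already-registered model in the same way with llma models update. The sketch below assumes update accepts the same --replicas and --gpu flags as create; check llma models update --help for the exact options.

```bash
# Scale the existing deployment down to two replicas.
# Assumes llma models update accepts --replicas; verify with --help.
llma models update deepseek-r1:1.5b --replicas 2
```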
## Official Model Repository

This is the default configuration. The following is a list of supported models that we have validated.
| Model | Quantizations | Supported runtimes |
|---|---|---|
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | None | vLLM |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | AWQ | vLLM |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Base | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | Ollama |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | Ollama |
| deepseek-ai/deepseek-coder-6.7b-base | None | vLLM, Ollama |
| deepseek-ai/deepseek-coder-6.7b-base | AWQ | vLLM |
| deepseek-ai/deepseek-coder-6.7b-base | Q4_0 | vLLM, Ollama |
| fixie-ai/ultravox-v0_3 | None | vLLM |
| google/gemma-2b-it | None | Ollama |
| google/gemma-2b-it | Q4_0 | Ollama |
| intfloat/e5-mistral-7b-instruct | None | vLLM |
| meta-llama/Llama-3.2-1B-Instruct | None | vLLM |
| meta-llama/Meta-Llama-3.3-70B-Instruct | AWQ, FP8-Dynamic | vLLM |
| meta-llama/Meta-Llama-3.1-70B-Instruct | AWQ | vLLM |
| meta-llama/Meta-Llama-3.1-70B-Instruct | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | vLLM, Ollama |
| meta-llama/Meta-Llama-3.1-8B-Instruct | None | vLLM |
| meta-llama/Meta-Llama-3.1-8B-Instruct | AWQ | vLLM, Triton |
| meta-llama/Meta-Llama-3.1-8B-Instruct | Q4_0 | vLLM, Ollama |
| nvidia/Llama-3.1-Nemotron-70B-Instruct | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | vLLM |
| nvidia/Llama-3.1-Nemotron-70B-Instruct | FP8-Dynamic | vLLM |
| mistralai/Mistral-7B-Instruct-v0.2 | Q4_0 | Ollama |
| sentence-transformers/all-MiniLM-L6-v2-f16 | None | Ollama |
Please note that some models work only with specific inference runtimes.
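As an illustration, a model from this list can be requested with the same CLI flow shown earlier. The sketch below assumes the official model repository is the default source when --source-repository is omitted, and that quantized variants use the suffixed IDs shown in the values.yaml example above; confirm both with llma models create --help.

```bash
# Download a validated model from the LLMariner official model repository.
# Default-source behavior and the quantization-suffix naming are assumptions.
llma models create base google/gemma-2b-it-q4_0
```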
## Hugging Face Repositories
First, create a k8s secret that contains the Hugging Face API key.
```bash
kubectl create secret generic \
  huggingface-key \
  -n llmariner \
  --from-literal=apiKey=${HUGGING_FACE_HUB_TOKEN}
```
The above command assumes that LLMariner runs in the llmariner namespace.
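Optionally, you can verify that the secret exists and contains the apiKey field before proceeding.

```bash
# Confirm the secret was created in the expected namespace.
kubectl get secret huggingface-key -n llmariner

# Optionally check that the apiKey field holds the expected token.
kubectl get secret huggingface-key -n llmariner -o jsonpath='{.data.apiKey}' | base64 --decode
```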
Then deploy LLMariner with the following values.yaml.

```yaml
model-manager-loader:
  downloader:
    kind: huggingFace
    huggingFace:
      cacheDir: /tmp/.cache
  huggingFaceSecret:
    name: huggingface-key
    apiKeyKey: apiKey
  baseModels:
  - Qwen/Qwen2-7B
  - TheBloke/TinyLlama-1.1B-Chat-v1.0-AWQ
```
Then the model should be loaded by model-manager-loader. Once the loading completes, the model name
should show up in the output of llma models list.
When you use a GGUF model with vLLM, please specify --tokenizer=<original model> in vllmExtraFlags. Here is an example
configuration for Phi 4.
```yaml
inference-manager-engine:
  ...
  model:
    default:
      runtimeName: vllm
    overrides:
      lmstudio-community/phi-4-GGUF/phi-4-Q6_K.gguf:
        preloaded: true
        resources:
          limits:
            nvidia.com/gpu: 1
        vllmExtraFlags:
        - --tokenizer
        - microsoft/phi-4
```
## Ollama Repositories

You can configure Ollama as the model source repository by setting model-manager-loader.downloader.kind to ollama. The following is an example values.yaml that downloads deepseek-r1:1.5b from Ollama.
```yaml
model-manager-loader:
  downloader:
    kind: ollama
  baseModels:
  - deepseek-r1:1.5b
```
## S3 Bucket
If you want to download models from your S3 bucket, you can specify the bucket configuration under
model-manager-loader.downloader.s3. For example, if you store model files under s3://my-models/v1/base-models/<model-name>,
you can specify the downloader config as follows:
```yaml
model-manager-loader:
  downloader:
    kind: s3
    s3:
      # The S3 endpoint URL.
      endpointUrl: https://s3.us-west-2.amazonaws.com
      # The region name where the models are stored.
      region: us-west-2
      # The bucket name where the models are stored.
      bucket: my-models
      # The path prefix of the model.
      pathPrefix: v1/base-models
      # Set to true if the bucket is public and we don't want to
      # use the credential attached to the pod.
      isPublic: false
```
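With this configuration, model-manager-loader looks for model files under the configured bucket and path prefix. The AWS CLI commands below illustrate how to check and populate that location; the exact per-model file layout depends on how you package the model, so treat the paths as an illustration.

```bash
# List the models currently available under the configured prefix.
aws s3 ls s3://my-models/v1/base-models/

# Upload local model files so they appear under s3://my-models/v1/base-models/<model-name>/.
# "my-model" is a placeholder; replace it with your model's name and local directory.
aws s3 cp --recursive ./my-model s3://my-models/v1/base-models/my-model/
```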