# Model Loading

## Overview
LLMariner hosts LLMs in a Kubernetes cluster by downloading models from source repositories and uploading them to an S3-compatible object store. The following source model repositories are supported:
- LLMariner official model repository
- Hugging Face repositories
- Ollama repositories
- S3 bucket
If you already know which models you would like to download, you can specify them in `values.yaml`. Here is an example configuration where two models are downloaded from the LLMariner official model repository:
```yaml
model-manager-loader:
  baseModels:
  - google/gemma-2b-it-q4_0
  - sentence-transformers/all-MiniLM-L6-v2-f16
```
You can always update `values.yaml` and upgrade the Helm chart to download additional models.
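A minimal sketch of that upgrade step, assuming the release is named `llmariner` and runs in the `llmariner` namespace (substitute the chart reference you used for the original install):

```bash
# Re-apply values.yaml; model-manager-loader picks up the new baseModels entries.
helm upgrade llmariner <your-llmariner-chart> \
  --namespace llmariner \
  --values values.yaml
```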
You can also run `llma models (base|fine-tuned) create` to download additional models. For example, the following command will download `deepseek-r1:1.5b` from the Ollama repository:
```bash
llma models create base deepseek-r1:1.5b --source-repository ollama
```
You can check the status of the download with:

```bash
llma models list
```
Once the download has completed, you can activate the model. The activated model becomes ready for inference once `inference-manager-engine` loads it.
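Activation is performed with the same CLI. Treat the following as a hedged sketch and confirm the exact subcommand with `llma models --help` on your version:

```bash
# Hypothetical invocation; verify the subcommand name in your CLI version.
llma models activate deepseek-r1:1.5b
```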
> **Note:** To download models from Hugging Face, you need additional configuration to pass the Hugging Face API key to `model-manager-loader`. Please see the Hugging Face Repositories section below for details.

## Model Configuration
There are two ways to configure model deployment: the Helm chart and the model API.

In the Helm chart, you can put your own configuration under `inference-manager-engine.model` to control GPU allocation, extra flags passed to the runtime, the number of replicas, and so on. Here is an example:
```yaml
inference-manager-engine:
  model:
    default:
      runtimeName: vllm
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
    overrides:
      meta-llama/Llama-3.2-1B-Instruct:
        vllmExtraFlags:
        - --enable-lora
        - --max-lora-rank
        - "64"
      openai/gpt-oss-120b:
        image: vllm/vllm-openai:gptoss
        replicas: 2
```
Please see https://artifacthub.io/packages/helm/inference-manager-engine/inference-manager-engine?modal=values for details.
To use the model API to configure model deployment, you first need to set `inference-manager-engine.model.enableOverrideWithModelConfig` to `true`.
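In `values.yaml`, that setting looks like:

```yaml
inference-manager-engine:
  model:
    enableOverrideWithModelConfig: true
```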
Then you can specify the deployment configuration when running `llma models (base|fine-tuned) create` or `llma models update`. For example, the following command deploys four replicas of an inference runtime to serve `deepseek-r1:1.5b`, with two GPUs allocated to each replica:
```bash
llma models create base deepseek-r1:1.5b \
  --source-repository ollama \
  --replicas 4 \
  --gpu 2
```
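An existing deployment can be adjusted later in the same way; assuming `llma models update` accepts the same flags (a sketch, not verified against every CLI version):

```bash
# Scale the runtime serving this model down to two replicas.
llma models update deepseek-r1:1.5b --replicas 2
```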
## Official Model Repository
This is the default configuration. The following is a list of supported models that we have validated:
| Model | Quantizations | Supported runtimes |
|---|---|---|
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | None | vLLM |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | AWQ | vLLM |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Base | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | Ollama |
| deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | Ollama |
| deepseek-ai/deepseek-coder-6.7b-base | None | vLLM, Ollama |
| deepseek-ai/deepseek-coder-6.7b-base | AWQ | vLLM |
| deepseek-ai/deepseek-coder-6.7b-base | Q4_0 | vLLM, Ollama |
| fixie-ai/ultravox-v0_3 | None | vLLM |
| google/gemma-2b-it | None | Ollama |
| google/gemma-2b-it | Q4_0 | Ollama |
| intfloat/e5-mistral-7b-instruct | None | vLLM |
| meta-llama/Llama-3.2-1B-Instruct | None | vLLM |
| meta-llama/Meta-Llama-3.3-70B-Instruct | AWQ, FP8-Dynamic | vLLM |
| meta-llama/Meta-Llama-3.1-70B-Instruct | AWQ | vLLM |
| meta-llama/Meta-Llama-3.1-70B-Instruct | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | vLLM, Ollama |
| meta-llama/Meta-Llama-3.1-8B-Instruct | None | vLLM |
| meta-llama/Meta-Llama-3.1-8B-Instruct | AWQ | vLLM, Triton |
| meta-llama/Meta-Llama-3.1-8B-Instruct | Q4_0 | vLLM, Ollama |
| nvidia/Llama-3.1-Nemotron-70B-Instruct | Q2_K, Q3_K_M, Q3_K_S, Q4_0 | vLLM |
| nvidia/Llama-3.1-Nemotron-70B-Instruct | FP8-Dynamic | vLLM |
| mistralai/Mistral-7B-Instruct-v0.2 | Q4_0 | Ollama |
| sentence-transformers/all-MiniLM-L6-v2-f16 | None | Ollama |
Please note that some models work only with specific inference runtimes. Quantized variants are referenced with a suffix on the model name; for example, `google/gemma-2b-it-q4_0` in the earlier `values.yaml` refers to the Q4_0 build of `google/gemma-2b-it`.
## Hugging Face Repositories
First, create a Kubernetes secret that contains the Hugging Face API key:

```bash
kubectl create secret generic \
  huggingface-key \
  -n llmariner \
  --from-literal=apiKey=${HUGGING_FACE_HUB_TOKEN}
```
The above command assumes that LLMariner runs in the `llmariner` namespace.
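The command also reads the token from the `HUGGING_FACE_HUB_TOKEN` environment variable; if it is not already set:

```bash
# Substitute a token created at https://huggingface.co/settings/tokens.
export HUGGING_FACE_HUB_TOKEN=<your token>
```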
Then deploy LLMariner with the following `values.yaml`:
```yaml
model-manager-loader:
  downloader:
    kind: huggingFace
    huggingFace:
      cacheDir: /tmp/.cache
  huggingFaceSecret:
    name: huggingface-key
    apiKeyKey: apiKey
  baseModels:
  - Qwen/Qwen2-7B
  - TheBloke/TinyLlama-1.1B-Chat-v1.0-AWQ
```
Then the models should be loaded by `model-manager-loader`. Once the loading completes, the model names should show up in the output of `llma models list`.
When you use a GGUF model with vLLM, please specify `--tokenizer=<original model>` in `vllmExtraFlags`. Here is an example configuration for Phi 4:
```yaml
inference-manager-engine:
  ...
  model:
    default:
      runtimeName: vllm
    overrides:
      lmstudio-community/phi-4-GGUF/phi-4-Q6_K.gguf:
        preloaded: true
        resources:
          limits:
            nvidia.com/gpu: 1
        vllmExtraFlags:
        - --tokenizer
        - microsoft/phi-4
```
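For completeness, the GGUF file itself still has to be pulled by the loader. Assuming the Hugging Face downloader configured above, the matching `baseModels` entry would name the file directly (a sketch inferred from the override key, not a verified example):

```yaml
model-manager-loader:
  baseModels:
  - lmstudio-community/phi-4-GGUF/phi-4-Q6_K.gguf
```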
## Ollama Repositories
You can configure Ollama as the model source repository by setting `model-manager-loader.downloader.kind` to `ollama`. The following is an example `values.yaml` that downloads `deepseek-r1:1.5b` from Ollama:
```yaml
model-manager-loader:
  downloader:
    kind: ollama
  baseModels:
  - deepseek-r1:1.5b
```
## S3 Bucket
If you want to download models from your S3 bucket, you can specify the bucket configuration under `model-manager-loader.downloader.s3`. For example, if you store model files under `s3://my-models/v1/base-models/<model-name>`, you can specify the downloader config as follows:
```yaml
model-manager-loader:
  downloader:
    kind: s3
    s3:
      # The S3 endpoint URL.
      endpointUrl: https://s3.us-west-2.amazonaws.com
      # The region name where the models are stored.
      region: us-west-2
      # The bucket name where the models are stored.
      bucket: my-models
      # The path prefix of the model.
      pathPrefix: v1/base-models
      # Set to true if the bucket is public and we don't want to
      # use the credentials attached to the pod.
      isPublic: false
```
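With this layout, each model's files live under the path prefix plus the model name. A quick sanity check of the bucket layout from your workstation (assuming the AWS CLI and read access to the bucket):

```bash
# Should list one prefix per model, i.e. v1/base-models/<model-name>/.
aws s3 ls s3://my-models/v1/base-models/
```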