Model Loading

This page describes how to load models into LLMariner.

Overview

LLMariner hosts LLMs in a Kubernetes cluster by downloading models from source repositories and uploading them to an S3-compatible object store. The following source model repositories are supported:

  • LLMariner official model repository
  • Hugging Face repositories
  • Ollama repositories
  • S3 bucket

If you already know which models you would like to download, you can specify them in values.yaml. Here is an example configuration that downloads two models from the LLMariner official model repository.

model-manager-loader:
  baseModels:
  - google/gemma-2b-it-q4_0
  - sentence-transformers/all-MiniLM-L6-v2-f16

You can always update values.yaml and upgrade the Helm chart to download additional models.
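
For example, assuming LLMariner was installed as a Helm release named llmariner in the llmariner namespace (adjust the release name and the chart reference to match your installation), the upgrade could look like this:

helm upgrade llmariner <llmariner-chart> \
  -n llmariner \
  -f values.yaml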

You can also run llma models create to download additional models. For example, the following command will download deepseek-r1:1.5b from the Ollama repository.

llma models create deepseek-r1:1.5b --source-repository ollama

You can check the status of the download with:

llma models list

Official Model Repository

This is the default configuration. The following is a list of models that we have validated.

Model                                         Quantizations                Supporting runtimes
TinyLlama/TinyLlama-1.1B-Chat-v1.0            None                         vLLM
TinyLlama/TinyLlama-1.1B-Chat-v1.0            AWQ                          vLLM
deepseek-ai/DeepSeek-Coder-V2-Lite-Base       Q2_K, Q3_K_M, Q3_K_S, Q4_0   Ollama
deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct   Q2_K, Q3_K_M, Q3_K_S, Q4_0   Ollama
deepseek-ai/deepseek-coder-6.7b-base          None                         vLLM, Ollama
deepseek-ai/deepseek-coder-6.7b-base          AWQ                          vLLM
deepseek-ai/deepseek-coder-6.7b-base          Q4_0                         vLLM, Ollama
fixie-ai/ultravox-v0_3                        None                         vLLM
google/gemma-2b-it                            None                         Ollama
google/gemma-2b-it                            Q4_0                         Ollama
intfloat/e5-mistral-7b-instruct               None                         vLLM
meta-llama/Llama-3.2-1B-Instruct              None                         vLLM
meta-llama/Meta-Llama-3.3-70B-Instruct        AWQ, FP8-Dynamic             vLLM
meta-llama/Meta-Llama-3.1-70B-Instruct        AWQ                          vLLM
meta-llama/Meta-Llama-3.1-70B-Instruct        Q2_K, Q3_K_M, Q3_K_S, Q4_0   vLLM, Ollama
meta-llama/Meta-Llama-3.1-8B-Instruct         None                         vLLM
meta-llama/Meta-Llama-3.1-8B-Instruct         AWQ                          vLLM, Triton
meta-llama/Meta-Llama-3.1-8B-Instruct         Q4_0                         vLLM, Ollama
nvidia/Llama-3.1-Nemotron-70B-Instruct        Q2_K, Q3_K_M, Q3_K_S, Q4_0   vLLM
nvidia/Llama-3.1-Nemotron-70B-Instruct        FP8-Dynamic                  vLLM
mistralai/Mistral-7B-Instruct-v0.2            Q4_0                         Ollama
sentence-transformers/all-MiniLM-L6-v2-f16    None                         Ollama

Please note that some models work only with specific inference runtimes.
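
If a model defaults to a runtime you do not want, you can select a different one in the inference-manager-engine configuration. Here is a minimal sketch, assuming a per-model runtimeName is accepted under model.overrides (mirroring the default runtimeName setting shown in the GGUF example below):

inference-manager-engine:
  model:
    default:
      runtimeName: vllm
    overrides:
      # Assumption: run this model on Ollama instead of the default vLLM.
      deepseek-ai/deepseek-coder-6.7b-base:
        runtimeName: ollama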

Hugging Face Repositories

First, create a Kubernetes secret that contains your Hugging Face API key.

kubectl create secret generic \
  huggingface-key \
  -n llmariner \
  --from-literal=apiKey=${HUGGING_FACE_HUB_TOKEN}
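
The ${HUGGING_FACE_HUB_TOKEN} expansion assumes the token has already been exported in your shell, for example:

export HUGGING_FACE_HUB_TOKEN=<your Hugging Face access token>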

The above command assumes that LLMariner runs in the llmariner namespace.

Then deploy LLMariner with the following values.yaml.

model-manager-loader:
  downloader:
    kind: huggingFace
  huggingFaceSecret:
    name: huggingface-key
    apiKeyKey: apiKey

  baseModels:
  - Qwen/Qwen2-7B
  - TheBloke/TinyLlama-1.1B-Chat-v1.0-AWQ

The models are then loaded by model-manager-loader. Once loading completes, the model names show up in the output of llma models list.
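
For example, to check for the Qwen model from the configuration above:

llma models list | grep Qwen2-7B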

When you use a GGUF model with vLLM, pass --tokenizer <original model> via vllmExtraFlags so that vLLM loads the original model's tokenizer. Here is an example configuration for Phi 4.

inference-manager-engine:
  ...
  model:
    default:
      runtimeName: vllm
    overrides:
      lmstudio-community/phi-4-GGUF/phi-4-Q6_K.gguf:
        preloaded: true
        resources:
          limits:
            nvidia.com/gpu: 1
        vllmExtraFlags:
        - --tokenizer
        - microsoft/phi-4
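
For reference, here is a sketch of the matching loader-side configuration, assuming the Hugging Face downloader accepts a repository-relative GGUF file path in baseModels (as the model name in the overrides above suggests):

model-manager-loader:
  downloader:
    kind: huggingFace
  huggingFaceSecret:
    name: huggingface-key
    apiKeyKey: apiKey
  baseModels:
  - lmstudio-community/phi-4-GGUF/phi-4-Q6_K.gguf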

Ollama Repositories

You can configure Ollama as the model source repository by setting model-manager-loader.downloader.kind to ollama. The following example values.yaml downloads deepseek-r1:1.5b from Ollama.

model-manager-loader:
  downloader:
    kind: ollama
  baseModels:
  - deepseek-r1:1.5b

S3 Bucket

If you want to download models from your S3 bucket, you can specify the bucket configuration under model-manager-loader.downloader.s3. For example, if you store model files under s3://my-models/v1/base-models/<model-name>, you can specify the downloader config as follows:

model-manager-loader:
  downloader:
    kind: s3
    s3:
      # The s3 endpoint URL.
      endpointUrl: https://s3.us-west-2.amazonaws.com
      # The region name where the models are stored.
      region: us-west-2
      # The bucket name where the models are stored.
      bucket: my-models
      # The path prefix of the model.
      pathPrefix: v1/base-models
      # Set to true if the bucket is public and the credentials
      # attached to the pod should not be used.
      isPublic: false
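
With this configuration, the loader expects each model's files directly under the given prefix. You can sanity-check the bucket layout with the AWS CLI; the model name below is illustrative:

aws s3 ls s3://my-models/v1/base-models/meta-llama/Meta-Llama-3.1-8B-Instruct/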