Inference with Open Models
You can run inference with the OpenAI Python library, the llma CLI, or the API endpoint directly.
Chat Completion
Here is an example chat completion command with the llma CLI.
llma chat completions create --model google-gemma-2b-it-q4_0 --role user --completion "What is k8s?"
If you want to use the Python library, you first need to create an API key:
llma auth api-keys create <key name>
You can then pass the API key to initialize the OpenAI client and run the completion:
from openai import OpenAI

client = OpenAI(
    base_url="<Base URL (e.g., http://localhost:8080/v1)>",
    api_key="<API key secret>"
)

completion = client.chat.completions.create(
    model="google-gemma-2b-it-q4_0",
    messages=[
        {"role": "user", "content": "What is k8s?"}
    ],
    stream=True
)

for response in completion:
    print(response.choices[0].delta.content, end="")
print("\n")
You can also just call client = OpenAI() if you set the environment variables OPENAI_BASE_URL and OPENAI_API_KEY.
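Here is a minimal sketch of that setup. The base URL and key values below are placeholders; replace them with your own endpoint and API key secret.
import os

from openai import OpenAI

# The OpenAI client reads these variables automatically when no arguments are passed.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080/v1"
os.environ["OPENAI_API_KEY"] = "<API key secret>"

client = OpenAI()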
If you want to hit the API endpoint directly, you can use curl. Here is an example.
curl \
--request POST \
--header "Authorization: Bearer ${LLMARINER_TOKEN}" \
--header "Content-Type: application/json" \
--data '{"model": "google-gemma-2b-it-q4_0", "messages": [{"role": "user", "content": "What is k8s?"}]}' \
http://localhost:8080/v1/chat/completions
Please see the fine-tuning page if you want to create a fine-tuned model and use it for chat completion.
Tool Calling
vLLM requires additional flags to use tool calling (see the vLLM documentation for details).
You can specify the flags with vllmExtraFlags. Here is an example configuration:
inference-manager-engine:
  ...
  model:
    overrides:
      meta-llama-Meta-Llama-3.3-70B-Instruct-fp8-dynamic:
        runtimeName: vllm
        resources:
          limits:
            nvidia.com/gpu: 4
        vllmExtraFlags:
        - --chat-template
        - examples/tool_chat_template_llama3.1_json.jinja
        - --enable-auto-tool-choice
        - --tool-call-parser
        - llama3_json
        - --max-model-len
        - "8192"
Here is an example curl command:
curl \
--request POST \
--header "Authorization: Bearer ${LLMARINER_TOKEN}" \
--header "Content-Type: application/json" \
--data '{
  "model": "meta-llama-Meta-Llama-3.3-70B-Instruct-fp8-dynamic",
  "messages": [{"role": "user", "content": "What is the weather like in San Francisco?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a given location.",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "City and country"
          }
        }
      },
      "strict": true
    }
  }]
}' http://localhost:8080/v1/chat/completions
The output will include tool_calls in its message:
{
  ...
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "chatcmpl-tool-e698e3e36f354d089302b79486e4a702",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco, USA\"}"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls",
      "stop_reason": 128008
    }
  ],
  ...
}
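The same request can be made with the OpenAI Python client. Below is a minimal sketch, assuming the same base URL and API key setup as in the Chat Completion section:
from openai import OpenAI

client = OpenAI(
    base_url="<Base URL (e.g., http://localhost:8080/v1)>",
    api_key="<API key secret>"
)

completion = client.chat.completions.create(
    model="meta-llama-Meta-Llama-3.3-70B-Instruct-fp8-dynamic",
    messages=[{"role": "user", "content": "What is the weather like in San Francisco?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country"}
                }
            }
        }
    }]
)

# Inspect the tool call that the model requested.
tool_call = completion.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)
Your application would then run the requested function with the returned arguments and send the result back to the model in a follow-up message with role "tool".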
Audio Transcription
LLMariner supports the /v1/audio/transcriptions API. You can use a model like OpenAI Whisper for this API.
CLI:
llma audio transcriptions create --model openai-whisper-large-v3-turbo --file <audio file>
Python:
from openai import OpenAI

client = OpenAI(
    base_url="<Base URL (e.g., http://localhost:8080/v1)>",
    api_key="<API key secret>"
)

response = client.audio.transcriptions.create(
    model="openai-whisper-large-v3-turbo",
    file=open("<audio file>", "rb")
)
print(response)
curl:
curl \
--request POST \
--header "Authorization: Bearer ${LLMARINER_TOKEN}" \
--header "Content-Type: multipart/form-data" \
-F model=openai-whisper-large-v3-turbo \
-F file="@<audio file>" \
http://localhost:8080/v1/audio/transcriptions
Model Response API
LLMariner supports the /v1/responses API. You can, for example, use openai/gpt-oss-120b for this API.
curl \
--request POST \
--header "Authorization: Bearer ${LLMARINER_TOKEN}" \
--header 'Content-Type: application/json' \
--data '{
  "model": "openai-gpt-oss-120b",
  "input": "What is the capital of France?",
  "tools": [{
    "type": "function",
    "name": "get_weather",
    "description": "Get current temperature for a given location.",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "City and country e.g. Bogotá, Colombia"
        }
      },
      "required": [
        "location"
      ],
      "additionalProperties": false
    }
  }],
  "tool_choice": "auto"
}' \
http://localhost:8080/v1/responses
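You can also call the Responses API from the OpenAI Python client via client.responses.create. Here is a minimal sketch, assuming a recent openai package version that includes Responses API support and the same client setup as above:
from openai import OpenAI

client = OpenAI(
    base_url="<Base URL (e.g., http://localhost:8080/v1)>",
    api_key="<API key secret>"
)

response = client.responses.create(
    model="openai-gpt-oss-120b",
    input="What is the capital of France?"
)

# output_text concatenates the text parts of the response output.
print(response.output_text)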
Model Runtime Configuration
We currently support vLLM, Ollama, and Nvidia Triton Inference Server
as an inference runtime. You can change a runtime for each model. For
example, in the following configuration, the default runtime is set to
vLLM, and Ollama is used for deepseek-r1:1.5b.
inference-manager-engine:
  ...
  model:
    default:
      runtimeName: vllm # Default runtime
      resources:
        limits:
          nvidia.com/gpu: 1
    overrides:
      lmstudio-community/phi-4-GGUF/phi-4-Q6_K.gguf:
        preloaded: true
        vllmExtraFlags:
        - --tokenizer
        - microsoft/phi-4
      deepseek-r1:1.5b:
        runtimeName: ollama # Override the default runtime
        preloaded: true
        resources:
          limits:
            nvidia.com/gpu: 0
By default, one pod serves only one model. If you want to make one Ollama pod serve multiple models, you can set dynamicModelLoading to true.
inference-manager-engine:
  ollama:
    dynamicModelLoading: true
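Whichever runtime serves a model, clients reach it through the same OpenAI-compatible endpoint. As a minimal sketch, assuming the deepseek-r1:1.5b override above is deployed and the client setup from the Chat Completion section:
from openai import OpenAI

client = OpenAI(
    base_url="<Base URL (e.g., http://localhost:8080/v1)>",
    api_key="<API key secret>"
)

# The request looks the same whether vLLM or Ollama serves the model.
completion = client.chat.completions.create(
    model="deepseek-r1:1.5b",
    messages=[{"role": "user", "content": "What is k8s?"}]
)
print(completion.choices[0].message.content)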