Inference with Open Models
You can run chat completion with the llma CLI, the Python library, or the API endpoint.

Chat Completion

Here is an example chat completion command with the llma CLI:
llma chat completions create --model google-gemma-2b-it-q4_0 --role user --completion "What is k8s?"
If you want to use the Python library, you first need to create an API key:
llma auth api-keys create <key name>
You can then pass the API key to initialize the OpenAI client and run the completion:
from openai import OpenAI

client = OpenAI(
    base_url="<Base URL (e.g., http://localhost:8080/v1)>",
    api_key="<API key secret>"
)

completion = client.chat.completions.create(
    model="google-gemma-2b-it-q4_0",
    messages=[
        {"role": "user", "content": "What is k8s?"}
    ],
    stream=True
)

for response in completion:
    print(response.choices[0].delta.content, end="")
print("\n")
You can also just call client = OpenAI() if you set the environment variables OPENAI_BASE_URL and OPENAI_API_KEY.
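For example, here is a minimal sketch; the base URL and key below are placeholders for your own values:

import os
from openai import OpenAI

# Placeholders: point these at your own endpoint and API key secret.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080/v1"
os.environ["OPENAI_API_KEY"] = "<API key secret>"

# With the environment variables set, no arguments are needed.
client = OpenAI()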
If you want to hit the API endpoint directly, you can use curl. Here is an example:
curl \
--request POST \
--header "Authorization: Bearer ${LLMARINER_TOKEN}" \
--header "Content-Type: application/json" \
--data '{"model": "google-gemma-2b-it-q4_0", "messages": [{"role": "user", "content": "What is k8s?"}]}' \
http://localhost:8080/v1/chat/completions
Please see the fine-tuning page if you want to generate a fine-tuned model and use it for chat completion.
Tool Calling
vLLM requires additional flags to enable tool calling. You can specify the flags with vllmExtraFlags. Here is an example configuration:
inference-manager-engine:
  ...
  model:
    overrides:
      meta-llama-Meta-Llama-3.3-70B-Instruct-fp8-dynamic:
        runtimeName: vllm
        resources:
          limits:
            nvidia.com/gpu: 4
        vllmExtraFlags:
        - --chat-template
        - examples/tool_chat_template_llama3.1_json.jinja
        - --enable-auto-tool-choice
        - --tool-call-parser
        - llama3_json
        - --max-model-len
        - "8192"
Here is an example curl command:
curl \
  --request POST \
  --header "Authorization: Bearer ${LLMARINER_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "meta-llama-Meta-Llama-3.3-70B-Instruct-fp8-dynamic",
    "messages": [{"role": "user", "content": "What is the weather like in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "City and country"
            }
          }
        },
        "strict": true
      }
    }]
  }' http://localhost:8080/v1/chat/completions
The output will have tool_calls in its message:
{
  ...
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "chatcmpl-tool-e698e3e36f354d089302b79486e4a702",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco, USA\"}"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls",
      "stop_reason": 128008
    }
  ],
  ...
}
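To act on this response from the Python library, a common pattern is to parse the arguments, run the function, and send the result back as a tool message. The sketch below assumes the same client setup as in the chat completion example above; get_weather is a stand-in for your own implementation.

import json

# Stand-in for a real weather lookup; replace with your own implementation.
def get_weather(location):
    return f"Sunny and 18C in {location}"

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and country"}
            }
        }
    }
}]

messages = [{"role": "user", "content": "What is the weather like in San Francisco?"}]

# First request: the model decides whether to call the tool.
response = client.chat.completions.create(
    model="meta-llama-Meta-Llama-3.3-70B-Instruct-fp8-dynamic",
    messages=messages,
    tools=tools
)
message = response.choices[0].message

if message.tool_calls:
    # Keep the assistant message that contains the tool calls in the history.
    messages.append(message)
    for tool_call in message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = get_weather(**args)
        # Return the function result to the model as a tool message.
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": result,
        })
    # Second request: the model uses the tool result to write the final answer.
    final = client.chat.completions.create(
        model="meta-llama-Meta-Llama-3.3-70B-Instruct-fp8-dynamic",
        messages=messages,
        tools=tools
    )
    print(final.choices[0].message.content)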
Audio-to-Text
Note: Work-in-progress.

Model Runtime Configuration
We currently support vLLM, Ollama, and Nvidia Triton Inference Server as inference runtimes. You can change the runtime for each model. For example, in the following configuration, the default runtime is set to vLLM, and Ollama is used for deepseek-r1:1.5b.
inference-manager-engine:
  ...
  model:
    default:
      runtimeName: vllm # Default runtime
      resources:
        limits:
          nvidia.com/gpu: 1
    overrides:
      lmstudio-community/phi-4-GGUF/phi-4-Q6_K.gguf:
        preloaded: true
        vllmExtraFlags:
        - --tokenizer
        - microsoft/phi-4
      deepseek-r1:1.5b:
        runtimeName: ollama # Override the default runtime
        preloaded: true
        resources:
          limits:
            nvidia.com/gpu: 0
By default, one pod serves only one model. If you want to make one Ollama pod serve multiple models, you can set dynamicModelLoading to true.
inference-manager-engine:
  ollama:
    dynamicModelLoading: true