Inference with Open Models

Users can run chat completion with open models such as Google Gemma, Meta Llama, and Mistral. To run chat completion, you can use the OpenAI Python library, the llma CLI, or the API endpoint directly.

Chat Completion

Here is an example chat completion command using the llma CLI:

llma chat completions create --model google-gemma-2b-it-q4_0 --role user --completion "What is k8s?"

If you want to use the Python library, you first need to create an API key:

llma auth api-keys create <key name>

You can then pass the API key to initialize the OpenAI client and run the completion:

from openai import OpenAI

client = OpenAI(
  base_url="<Base URL (e.g., http://localhost:8080/v1)>",
  api_key="<API key secret>"
)

completion = client.chat.completions.create(
  model="google-gemma-2b-it-q4_0",
  messages=[
    {"role": "user", "content": "What is k8s?"}
  ],
  stream=True
)
# Print each streamed chunk as it arrives; the final chunk may carry no content.
for response in completion:
  print(response.choices[0].delta.content or "", end="")
print("\n")

You can also just call `client = OpenAI()` if you set the environment variables `OPENAI_BASE_URL` and `OPENAI_API_KEY`.
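For example, here is a minimal sketch, assuming the same local endpoint as above, that sets both variables from Python before creating the client:

import os

from openai import OpenAI

# Assumed values; replace them with your actual endpoint and API key secret.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080/v1"
os.environ["OPENAI_API_KEY"] = "<API key secret>"

# With both variables set, no constructor arguments are needed.
client = OpenAI()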

If you want to call the API endpoint directly, you can use curl. Here is an example:

curl \
  --request POST \
  --header "Authorization: Bearer ${LLMARINER_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{"model": "google-gemma-2b-it-q4_0", "messages": [{"role": "user", "content": "What is k8s?"}]}' \
  http://localhost:8080/v1/chat/completions

Please see the fine-tuning page if you want to generate a fine-tuned model and use it for chat completion.

Tool Calling

vLLM requires additional flags to enable tool calling. You can specify these flags with `vllmExtraFlags`. Here is an example configuration:

inference-manager-engine:
  ...
  model:
    overrides:
      meta-llama-Meta-Llama-3.3-70B-Instruct-fp8-dynamic:
        runtimeName: vllm
        resources:
          limits:
            nvidia.com/gpu: 4
        vllmExtraFlags:
        # Chat template that formats tool definitions for Llama models.
        - --chat-template
        - examples/tool_chat_template_llama3.1_json.jinja
        # Let the model decide when to invoke a tool.
        - --enable-auto-tool-choice
        # Parser for the Llama JSON tool-call output format.
        - --tool-call-parser
        - llama3_json
        - --max-model-len
        - "8192"

Here is an example curl command:

curl \
  --request POST \
  --header "Authorization: Bearer ${LLMARINER_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{
  "model": "meta-llama-Meta-Llama-3.3-70B-Instruct-fp8-dynamic",
  "messages": [{"role": "user", "content": "What is the weather like in San Francisco?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current temperature for a given location.",
        "parameters": {
          "type": "object",
          "properties": {
          "location": {
            "type": "string",
            "description": "City and country"
          }
        }
      },
      "strict": true
    }
  }]
}' http://localhost:8080/v1/chat/completions

The output contains `tool_calls` in its message:

{
  ...
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "chatcmpl-tool-e698e3e36f354d089302b79486e4a702",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco, USA\"}"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls",
      "stop_reason": 128008
    }
  ],
  ...
}
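
The same tool call can be made with the OpenAI Python library. Here is a minimal sketch, assuming the client is initialized as in the chat completion example above; `get_weather` is the hypothetical tool from the curl example:

import json

from openai import OpenAI

client = OpenAI(
  base_url="<Base URL (e.g., http://localhost:8080/v1)>",
  api_key="<API key secret>"
)

# Same tool definition as in the curl example.
tools = [{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current temperature for a given location.",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {"type": "string", "description": "City and country"}
      }
    },
    "strict": True
  }
}]

completion = client.chat.completions.create(
  model="meta-llama-Meta-Llama-3.3-70B-Instruct-fp8-dynamic",
  messages=[{"role": "user", "content": "What is the weather like in San Francisco?"}],
  tools=tools
)

# When finish_reason is "tool_calls", the message carries the requested calls.
message = completion.choices[0].message
for tool_call in message.tool_calls or []:
  args = json.loads(tool_call.function.arguments)
  print(tool_call.function.name, args)

Note that your application is responsible for executing the tool itself and sending the result back in a follow-up request as a message with the `tool` role.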

Audio-to-Text