Install across Multiple Clusters

Install LLMariner across multiple Kubernetes clusters.

LLMariner provisions the LLM stack as a set of Kubernetes Deployments. In a typical configuration, all the services run in a single Kubernetes cluster, but you can also spread them across multiple clusters. For example, you can deploy the control plane components in a CPU-only Kubernetes cluster and the rest of the components in GPU compute clusters.

LLMariner can be deployed into multiple GPU clusters, and these clusters can span multiple cloud providers (including GPU-specific clouds like CoreWeave) and on-prem environments.

Deploying Control Plane Components

You can deploy only the control plane components by passing additional parameters to the LLMariner Helm chart.

In values.yaml, set tags.worker to false and global.workerServiceIngress.create to true, and set other values so that an ingress and a service are created to receive requests from worker clusters.

Here is an example values.yaml.

tags:
  worker: false

global:
  ingress:
    ingressClassName: kong
    controllerUrl: https://api.llm.mydomain.com
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt
      konghq.com/response-buffering: "false"
    # Enable TLS for the ingresses.
    tls:
      hosts:
      - api.llm.mydomain.com
      secretName: api-tls
  # Create ingress for gRPC requests coming from worker clusters.
  workerServiceIngress:
    create: true
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt
      konghq.com/protocols: grpc,grpcs
  workerServiceGrpcService:
    annotations:
      konghq.com/protocol: grpc

# Create a separate load balancer for gRPC streaming requests from inference-manager-engine.
inference-manager-server:
  workerServiceTls:
    enable: true
    secretName: inference-cert
  workerServiceGrpcService:
    type: LoadBalancer
    port: 443
    annotations:
      external-dns.alpha.kubernetes.io/hostname: inference.llm.mydomain.com

# Create a separate load balancer for HTTPS requests from session-manager-agent.
session-manager-server:
  workerServiceTls:
    enable: true
    secretName: session-cert
  workerServiceHttpService:
    type: LoadBalancer
    port: 443
    externalTrafficPolicy: Local
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
      external-dns.alpha.kubernetes.io/hostname: session.llm.mydomain.com
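
With these values in place, install the control plane with a standard Helm invocation. The following is a sketch that assumes the values above are saved as control-plane-values.yaml and that the release goes into the llmariner namespace; substitute the chart reference from your installation guide if it differs.

helm upgrade --install llmariner oci://public.ecr.aws/cloudnatix/llmariner-charts/llmariner \
  --namespace llmariner \
  --create-namespace \
  -f control-plane-values.yaml

After the install completes, kubectl get svc -n llmariner should show the two LoadBalancer services created for inference-manager-server and session-manager-server with external addresses, and the external-dns annotations above should create the corresponding DNS records.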

Deploying Worker Components

To deploy LLMariner to a worker GPU cluster, you first need to obtain a registration key for the cluster.

llma admin clusters register <cluster-name>

The command generates a new registration key. The following example command extracts the key into an environment variable.

REGISTRATION_KEY=$(llma admin clusters register <cluster-name> | sed -n 's/.*Registration Key: "\([^"]*\)".*/\1/p')

Then you need to make the LLMariner worker components use the registration key when they make gRPC calls to the control plane.

To make that happen, first create a Kubernetes secret containing the key.

REGISTRATION_KEY=clusterkey-...

kubectl create secret generic \
  -n llmariner \
  cluster-registration-key \
  --from-literal=regKey="${REGISTRATION_KEY}"

The secret needs to be created in the namespace where LLMariner will be deployed.
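
Before installing the worker components, you can optionally confirm that the secret holds the expected key. The following check simply decodes the regKey field of the secret:

kubectl get secret cluster-registration-key \
  -n llmariner \
  -o jsonpath='{.data.regKey}' | base64 --decode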

When installing the Helm chart for the worker components, you need to specify additional configuration in values.yaml. Here is an example.

tags:
  control-plane: false

global:
  objectStore:
    s3:
      endpointUrl: <S3 endpoint>
      region: <S3 region>
      bucket: <S3 bucket name>

  awsSecret:
    name: aws
    accessKeyIdKey: accessKeyId
    secretAccessKeyKey: secretAccessKey

  worker:
    controlPlaneAddr: api.llm.mydomain.com:443
    tls:
      enable: true
    registrationKeySecret:
      name: cluster-registration-key
      key: regKey

inference-manager-engine:
  inferenceManagerServerWorkerServiceAddr: inference.llm.mydomain.com:443

job-manager-dispatcher:
  notebook:
    llmarinerBaseUrl: https://api.llm.mydomain.com/v1

session-manager-agent:
  sessionManagerServerWorkerServiceAddr: session.llm.mydomain.com:443

model-manager-loader:
  baseModels:
  - <model name, e.g. google/gemma-2b-it-q4_0>
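
Then install the chart into the worker cluster. As above, this is a sketch that assumes the values are saved as worker-values.yaml and that the release goes into the llmariner namespace, which must be the same namespace that holds the cluster-registration-key secret; the chart reference is the same assumption as in the control plane example.

helm upgrade --install llmariner oci://public.ecr.aws/cloudnatix/llmariner-charts/llmariner \
  --namespace llmariner \
  -f worker-values.yaml

Once the worker components are running, they use the registration key to open gRPC connections to the control plane addresses configured above, and the cluster should appear in the output of llma admin clusters list (assuming your CLI version provides the list subcommand).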