API and GPU Usage Optimization

API Usage Visibility

Inference Request Rate-limiting

Optimize GPU Utilization

Auto-scaling of Inference Runtimes

Scheduled Scale Up and Down of Inference Runtimes

If you want to scale model runtimes up and down on a fixed schedule, you can enable the scheduled shutdown feature.

Here is an example values.yaml. The scaleUp and scaleDown fields take standard cron expressions, which are evaluated in the specified timeZone.

  inference-manager-engine:
    runtime:
      scheduledShutdown:
        enable: true
        schedule:
          # Pods run between 9 AM and 5 PM.
          scaleUp:   "0 9 * * *"
          scaleDown: "0 17 * * *"
          timeZone:  "Asia/Tokyo"
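
Assuming the deployment is managed with Helm, a values file like the one above can be applied by upgrading the release. The release name, chart reference, and namespace below are placeholders; substitute the ones from your installation.

  # Placeholder release/chart/namespace names -- use your own.
  helm upgrade --install <release-name> <chart-reference> \
    --namespace <namespace> \
    -f values.yaml

After the upgrade, the engine scales runtime pods down at the scaleDown time and back up at the scaleUp time, so no GPU capacity is held outside the configured window.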