mirror of
https://github.com/ollama/ollama.git
synced 2026-05-13 06:21:28 +00:00
The MLX runner now routes model work through a locked worker thread. The status handler also went through that worker, solely to sample memory, so a scheduler health ping could queue behind a long prefill or generation until its 10s context expired, causing /v1/status to return 500 and the server to treat the runner as unhealthy. While Metal doesn't change VRAM reporting, CUDA does. Cache the last memory sample and have status perform only a short, best-effort refresh. If the worker is busy, status returns the cached value while a single background refresh continues and updates the cache once the worker becomes available. The in-flight guard and lifecycle context keep this from spawning unbounded refreshes while preserving live VRAM refresh behavior for CUDA. Fixes #16081 |
||
|---|---|---|
| agent | ||
| cmd | ||
| create | ||
| imagegen | ||
| internal/mlxthread | ||
| mlxrunner | ||
| models | ||
| safetensors | ||
| server | ||
| tokenizer | ||
| tools | ||
| transfer | ||