ollama/x
Daniel Hiltgen 206b049508
mlx: avoid status timeout during inference (#16086)
The MLX runner now routes all model work through a locked worker thread. The status handler used that same worker, but only to sample memory usage, so a scheduler health ping could queue behind a long prefill or generation until its 10s context expired, causing /v1/status to return 500 and the server to treat the runner as unhealthy.

Metal's VRAM reporting doesn't change during inference, but CUDA's does. Cache the last memory sample and have status perform only a short best-effort refresh. If the worker is busy, status returns the cached value while a single background refresh continues and updates the cache once the worker becomes available. The in-flight guard and lifecycle context keep this from spawning unbounded refreshes while preserving live VRAM refresh behavior for CUDA.

Fixes #16081
2026-05-11 16:03:38 -07:00
agent x/cmd: enable web search and web fetch with flag (#13690) 2026-01-12 13:59:40 -08:00
cmd Reapply "don't require pulling stubs for cloud models" again (#14608) 2026-03-06 14:27:47 -08:00
create mlx: Gemma4 MTP speculative decoding (#15980) 2026-05-05 08:55:04 -07:00
imagegen mlx: update the imagegen runner for mlx thread affinity (#16096) 2026-05-11 13:05:06 -07:00
internal/mlxthread Update MLX and MLX-C with threading fixes (#15845) 2026-05-03 10:03:14 -07:00
mlxrunner mlx: avoid status timeout during inference (#16086) 2026-05-11 16:03:38 -07:00
models mlx: Gemma4 MTP speculative decoding (#15980) 2026-05-05 08:55:04 -07:00
safetensors mlx: Support NVIDIA TensorRT Model Optimizer import (#15566) 2026-04-27 18:28:10 -07:00
server mlx: Support NVIDIA TensorRT Model Optimizer import (#15566) 2026-04-27 18:28:10 -07:00
tokenizer New models (#15861) 2026-04-28 11:50:12 -07:00
tools add ability to disable cloud (#14221) 2026-02-12 15:47:00 -08:00
transfer mlx: refined model push behavior (#15431) 2026-05-08 14:25:30 -07:00