ollama/x/mlxrunner
Daniel Hiltgen 206b049508
mlx: avoid status timeout during inference (#16086)
The MLX runner now routes model work through a locked worker thread. The status handler used that same worker, solely to sample memory, so a scheduler health ping could sit behind a long prefill or generation until its 10s context expired, causing /v1/status to return 500 and the server to treat the runner as unhealthy.

While Metal doesn't change VRAM reporting, CUDA does. Cache the last memory sample and make status perform only a short best-effort refresh. If the worker is busy, status returns the cached value while a single background refresh continues and updates the cache when the worker becomes available. The in-flight guard and lifecycle context keep this from spawning unbounded refreshes while preserving live VRAM refresh behavior for CUDA.

Fixes #16081
2026-05-11 16:03:38 -07:00
batch                  mlxrunner: decouple models from attention cache storage layout  2026-04-27 20:04:46 -07:00
cache                  mlx: Gemma4 MTP speculative decoding (#15980)  2026-05-05 08:55:04 -07:00
mlx                    mlx: Gemma4 MTP speculative decoding (#15980)  2026-05-05 08:55:04 -07:00
model                  mlx: Gemma4 MTP speculative decoding (#15980)  2026-05-05 08:55:04 -07:00
sample                 mlx: Gemma4 MTP speculative decoding (#15980)  2026-05-05 08:55:04 -07:00
cache.go               mlxrunner: schedule periodic snapshots during prefill  2026-03-26 13:32:11 -07:00
cache_test.go          mlxrunner: schedule periodic snapshots during prefill  2026-03-26 13:32:11 -07:00
cache_trie.go          mlxrunner: share KV cache across conversations with common prefixes  2026-03-18 16:06:33 -07:00
cache_trie_test.go     mlxrunner: share KV cache across conversations with common prefixes  2026-03-18 16:06:33 -07:00
client.go              Update MLX and MLX-C with threading fixes (#15845)  2026-05-03 10:03:14 -07:00
imports.go             New models (#15861)  2026-04-28 11:50:12 -07:00
mtp.go                 mlx: Gemma4 MTP speculative decoding (#15980)  2026-05-05 08:55:04 -07:00
pipeline.go            mlx: Gemma4 MTP speculative decoding (#15980)  2026-05-05 08:55:04 -07:00
runner.go              mlx: Gemma4 MTP speculative decoding (#15980)  2026-05-05 08:55:04 -07:00
server.go              mlx: avoid status timeout during inference (#16086)  2026-05-11 16:03:38 -07:00
status_memory.go       mlx: avoid status timeout during inference (#16086)  2026-05-11 16:03:38 -07:00
status_memory_test.go  mlx: avoid status timeout during inference (#16086)  2026-05-11 16:03:38 -07:00
utf8_buffer.go         consolidate the tokenizer (#14327)  2026-02-19 15:55:45 -08:00
utf8_buffer_test.go    consolidate the tokenizer (#14327)  2026-02-19 15:55:45 -08:00