ollama/x
Daniel Hiltgen 206b049508
mlx: avoid status timeout during inference (#16086)
The MLX runner now routes all model work through a locked worker thread. The status handler used that same worker, but only to sample memory usage, so a scheduler health ping could queue behind a long prefill or generation until its 10s context expired, causing /v1/status to return 500 and the server to treat the runner as unhealthy.

Metal's VRAM reporting doesn't change during inference, but CUDA's does. Cache the last memory sample and have status perform only a short best-effort refresh. If the worker is busy, status returns the cached value while a single background refresh continues and updates the cache once the worker becomes available. The in-flight guard and lifecycle context keep this from spawning unbounded refreshes while preserving live VRAM refresh behavior for CUDA.

Fixes #16081
2026-05-11 16:03:38 -07:00
agent x/cmd: enable web search and web fetch with flag (#13690) 2026-01-12 13:59:40 -08:00
cmd Reapply "don't require pulling stubs for cloud models" again (#14608) 2026-03-06 14:27:47 -08:00
create mlx: Gemma4 MTP speculative decoding (#15980) 2026-05-05 08:55:04 -07:00
imagegen mlx: update the imagegen runner for mlx thread affinity (#16096) 2026-05-11 13:05:06 -07:00
internal/mlxthread Update MLX and MLX-C with threading fixes (#15845) 2026-05-03 10:03:14 -07:00
mlxrunner mlx: avoid status timeout during inference (#16086) 2026-05-11 16:03:38 -07:00
models mlx: Gemma4 MTP speculative decoding (#15980) 2026-05-05 08:55:04 -07:00
safetensors mlx: Support NVIDIA TensorRT Model Optimizer import (#15566) 2026-04-27 18:28:10 -07:00
server mlx: Support NVIDIA TensorRT Model Optimizer import (#15566) 2026-04-27 18:28:10 -07:00
tokenizer New models (#15861) 2026-04-28 11:50:12 -07:00
tools add ability to disable cloud (#14221) 2026-02-12 15:47:00 -08:00
transfer mlx: refined model push behavior (#15431) 2026-05-08 14:25:30 -07:00