ollama/x/models
Patrick Devine 15e6076d79
mlx: Gemma4 MTP speculative decoding (#15980)
This change adds support for MTP (multi-token prediction) speculative decoding for the
gemma4 model family.

It includes:
  * support for importing safetensors-based gemma4 draft models with `ollama create`
  * a new DRAFT command in the Modelfile for specifying draft models
  * a `--quantize-draft` flag for `ollama create` to quantize the draft model
  * cache support for speculation
  * changes to the rotating cache so that it handles MTP correctly
  * sampling support for draft model token prediction

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2026-05-05 08:55:04 -07:00
| Name | Last commit | Date |
| --- | --- | --- |
| gemma3 | mlxrunner: decouple models from attention cache storage layout | 2026-04-27 20:04:46 -07:00 |
| gemma4 | mlx: Gemma4 MTP speculative decoding (#15980) | 2026-05-05 08:55:04 -07:00 |
| glm4_moe_lite | mlxrunner: decouple models from attention cache storage layout | 2026-04-27 20:04:46 -07:00 |
| laguna | New models (#15861) | 2026-04-28 11:50:12 -07:00 |
| llama | mlxrunner: decouple models from attention cache storage layout | 2026-04-27 20:04:46 -07:00 |
| nn | mlxrunner: decouple models from attention cache storage layout | 2026-04-27 20:04:46 -07:00 |
| qwen3 | mlxrunner: decouple models from attention cache storage layout | 2026-04-27 20:04:46 -07:00 |
| qwen3_5 | mlxrunner: decouple models from attention cache storage layout | 2026-04-27 20:04:46 -07:00 |
| qwen3_5_moe | MLX: add header vendoring and remove go build tag (#14642) | 2026-03-09 17:24:45 -07:00 |