This change adds support for MTP (multi-token prediction) speculative decoding for the
gemma4 model family.
It includes:
* support for importing safetensors-based gemma4 draft models with `ollama create`
* a new DRAFT command in the Modelfile for specifying draft models
* a `--quantize-draft` flag for the `ollama create` command to quantize the draft model (both shown in the sketch after this list)
* cache support for speculative decoding
* changes to the rotating cache so it handles MTP correctly
* sampling support for draft model token prediction
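
To make the new pieces concrete, here is a hypothetical end-to-end use. Only the DRAFT command and the `--quantize-draft` flag come from this change; the paths, model name, and quantization value are invented for illustration:

```
# Modelfile (hypothetical paths; DRAFT names the safetensors draft model)
FROM ./gemma4/model.safetensors
DRAFT ./gemma4-draft/model.safetensors
```

```
ollama create gemma4-mtp -f Modelfile --quantize-draft q4_K_M
```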
---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Register sequences with Add/Remove; each Sample call takes any subset of
registered slots and samples one token per row, appending to each slot's
ring-buffer history. When all slots share the same Options and their penalty rings are full, a single fused transform pass runs over the whole batch via a persistent pooled history tensor; otherwise the call falls back to per-slot serial processing indexed against the same pool.
Performance is unchanged for a single sequence, which is all that is
exposed for now.
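
A minimal sketch of the registration/Sample shape described above, using stand-in types (Options, slot, the logits layout) rather than ollama's actual ones, with the fused batch path elided:

```go
// Sketch of the Add/Remove/Sample shape; types are illustrative stand-ins.
package sketch

// Options carries per-slot sampling settings; only the field the sketch
// needs is shown.
type Options struct {
	RepeatLastN int
}

type slot struct {
	opts Options
	ring []int32 // fixed-capacity penalty history (order-insensitive)
}

// Sampler holds the registered sequences, keyed by slot ID.
type Sampler struct {
	slots map[int]*slot
}

func NewSampler() *Sampler { return &Sampler{slots: map[int]*slot{}} }

// Add registers a sequence so later Sample calls may include it.
func (s *Sampler) Add(id int, opts Options) {
	s.slots[id] = &slot{opts: opts, ring: make([]int32, 0, opts.RepeatLastN)}
}

// Remove unregisters a finished sequence.
func (s *Sampler) Remove(id int) { delete(s.slots, id) }

// Sample takes any subset of registered slot IDs plus one logits row per
// ID, returns one token per row, and appends each token to its slot's
// history. The change described above runs one fused transform pass over
// the whole batch when every slot shares Options and its penalty ring is
// full, falling back to per-slot serial work otherwise; both paths
// collapse here into a greedy pick so the sketch stays short.
func (s *Sampler) Sample(ids []int, logits [][]float32) []int32 {
	out := make([]int32, len(ids))
	for i, id := range ids {
		var tok int32
		for j, v := range logits[i] {
			if v > logits[i][tok] {
				tok = int32(j)
			}
		}
		sl := s.slots[id]
		switch {
		case cap(sl.ring) > 0 && len(sl.ring) == cap(sl.ring):
			copy(sl.ring, sl.ring[1:]) // simple shift; the real code keeps a ring index
			sl.ring[len(sl.ring)-1] = tok
		case cap(sl.ring) > 0:
			sl.ring = append(sl.ring, tok)
		}
		out[i] = tok
	}
	return out
}
```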
AppendToken used to concatenate the new token onto the history tensor
and slice it back to RepeatLastN every decode step, churning the graph
shape and reallocating a fresh tensor each call. The stateful penalties
don't care about order within the window, so a fixed-capacity ring with
one SliceUpdate per append keeps the tensor shape constant across
steps.
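
A slice-based sketch of the same ring idea, with a plain Go slice standing in for the fixed-shape tensor and a single in-place write standing in for SliceUpdate:

```go
package sketch

// tokenRing keeps the last capacity tokens with one in-place write per
// append, so the backing storage never changes shape. This is the
// slice-based analogue of one SliceUpdate on a fixed-shape tensor.
type tokenRing struct {
	buf  []int32
	next int  // index of the next write position
	full bool // true once capacity appends have happened
}

func newTokenRing(capacity int) *tokenRing {
	return &tokenRing{buf: make([]int32, capacity)}
}

// Append overwrites one element instead of concatenating and re-slicing,
// which is safe because the penalties don't care about token order
// within the window.
func (r *tokenRing) Append(tok int32) {
	r.buf[r.next] = tok
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}
```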
Match the ollamarunner and OpenAI semantics: raw, full-vocab log-softmax
with the top-K ranked by probability. The pass is skipped on the GPU when the request doesn't ask for logprobs, so decode doesn't pay for it otherwise.
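
For reference, a plain-Go version of that computation: a numerically stable full-vocab log-softmax, then the top K entries ranked by probability. The real path runs fused on the GPU; the names here are illustrative:

```go
package sketch

import (
	"math"
	"sort"
)

// logprob pairs a token ID with its log-probability.
type logprob struct {
	Token int
	Value float64
}

// topLogprobs computes a numerically stable log-softmax over the full
// vocabulary, then returns the k highest-probability tokens. Ranking by
// probability and by log-probability is equivalent since log is monotonic.
func topLogprobs(logits []float64, k int) []logprob {
	// log-softmax: x_i - max - log(sum_j exp(x_j - max))
	maxv := math.Inf(-1)
	for _, v := range logits {
		if v > maxv {
			maxv = v
		}
	}
	var sum float64
	for _, v := range logits {
		sum += math.Exp(v - maxv)
	}
	lse := maxv + math.Log(sum)

	out := make([]logprob, len(logits))
	for i, v := range logits {
		out[i] = logprob{Token: i, Value: v - lse}
	}
	sort.Slice(out, func(a, b int) bool { return out[a].Value > out[b].Value })
	if k < len(out) {
		out = out[:k]
	}
	return out
}
```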
- Collapse MLX sampling state into a single sample.Sampler struct (options + history).
- Replace the interface-based sampler chain (TopP, TopK, penalty, etc.) with function-based transforms (sketched after this list).
- Update request/pipeline wiring to use *sample.Sampler, seed history from prompt tokens, and append generated tokens each step.
- Implement top_p, min_p, repeat_penalty, and frequency_penalty
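
A sketch of what the function-based transform shape could look like, with top_p as the example; the signature and names are assumptions, not the actual code:

```go
package sketch

import (
	"math"
	"sort"
)

// transform rewrites a logits vector in place; the pipeline composes a
// slice of these instead of walking an interface-based sampler chain.
type transform func(logits []float32)

// topP returns a transform that masks every token outside the smallest
// set whose cumulative softmax probability reaches p (nucleus sampling).
func topP(p float32) transform {
	return func(logits []float32) {
		if len(logits) == 0 {
			return
		}
		idx := make([]int, len(logits))
		for i := range idx {
			idx[i] = i
		}
		sort.Slice(idx, func(a, b int) bool { return logits[idx[a]] > logits[idx[b]] })

		// softmax denominator over the full vocab, numerically stable
		maxv := logits[idx[0]]
		var sum float64
		for _, i := range idx {
			sum += math.Exp(float64(logits[i] - maxv))
		}

		var cum float64
		cut := len(idx)
		for rank, i := range idx {
			cum += math.Exp(float64(logits[i]-maxv)) / sum
			if cum >= float64(p) {
				cut = rank + 1 // keep tokens up to and including this one
				break
			}
		}
		for _, i := range idx[cut:] {
			logits[i] = float32(math.Inf(-1)) // mask out of the nucleus
		}
	}
}
```

Composition then reduces to a loop over a []transform slice applied to each logits row before the final pick.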