This change adds support for MTP (multi-token prediction) speculative decoding for the
gemma4 model family.
It includes:
* support for importing safetensors-based gemma4 draft models with `ollama create`
* a new DRAFT command in the Modelfile for specifying draft models
* a `--quantize-draft` flag for the `ollama create` command to quantize the draft model (both shown in the sketch after this list)
* cache support for speculative decoding
* changes to the rotating cache so it handles MTP correctly
* sampling support for draft model token prediction
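
To make the new pieces concrete, here is a hypothetical end-to-end use. Only the DRAFT command and the `--quantize-draft` flag come from this change; the paths, model name, and quantization value are invented for illustration:

```
# Modelfile (hypothetical paths; DRAFT names the safetensors draft model)
FROM ./gemma4/model.safetensors
DRAFT ./gemma4-draft/model.safetensors
```

```
ollama create gemma4-mtp -f Modelfile --quantize-draft q4_K_M
```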
---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Register sequences with Add/Remove; each Sample call takes any subset of
registered slots and samples one token per row, appending to each slot's
ring-buffer history. When all slots share the same Options and their penalty rings are full, a single fused transform pass runs over the whole batch via a persistent pooled history tensor; otherwise the call falls back to per-slot serial processing indexed against the same pool.
Performance is unchanged for a single sequence, which is all that is
exposed for now.
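
A minimal sketch of the registration/Sample shape described above, using stand-in types (Options, slot, the logits layout) rather than ollama's actual ones, with the fused batch path elided:

```go
// Sketch of the Add/Remove/Sample shape; types are illustrative stand-ins.
package sketch

// Options carries per-slot sampling settings; only the field the sketch
// needs is shown.
type Options struct {
	RepeatLastN int
}

type slot struct {
	opts Options
	ring []int32 // fixed-capacity penalty history (order-insensitive)
}

// Sampler holds the registered sequences, keyed by slot ID.
type Sampler struct {
	slots map[int]*slot
}

func NewSampler() *Sampler { return &Sampler{slots: map[int]*slot{}} }

// Add registers a sequence so later Sample calls may include it.
func (s *Sampler) Add(id int, opts Options) {
	s.slots[id] = &slot{opts: opts, ring: make([]int32, 0, opts.RepeatLastN)}
}

// Remove unregisters a finished sequence.
func (s *Sampler) Remove(id int) { delete(s.slots, id) }

// Sample takes any subset of registered slot IDs plus one logits row per
// ID, returns one token per row, and appends each token to its slot's
// history. The change described above runs one fused transform pass over
// the whole batch when every slot shares Options and its penalty ring is
// full, falling back to per-slot serial work otherwise; both paths
// collapse here into a greedy pick so the sketch stays short.
func (s *Sampler) Sample(ids []int, logits [][]float32) []int32 {
	out := make([]int32, len(ids))
	for i, id := range ids {
		var tok int32
		for j, v := range logits[i] {
			if v > logits[i][tok] {
				tok = int32(j)
			}
		}
		sl := s.slots[id]
		switch {
		case cap(sl.ring) > 0 && len(sl.ring) == cap(sl.ring):
			copy(sl.ring, sl.ring[1:]) // simple shift; the real code keeps a ring index
			sl.ring[len(sl.ring)-1] = tok
		case cap(sl.ring) > 0:
			sl.ring = append(sl.ring, tok)
		}
		out[i] = tok
	}
	return out
}
```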
AppendToken used to concatenate the new token onto the history tensor
and slice it back to RepeatLastN every decode step, churning the graph
shape and reallocating a fresh tensor each call. The stateful penalties
don't care about order within the window, so a fixed-capacity ring with
one SliceUpdate per append keeps the tensor shape constant across
steps.
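
A slice-based sketch of the same ring idea, with a plain Go slice standing in for the fixed-shape tensor and a single in-place write standing in for SliceUpdate:

```go
package sketch

// tokenRing keeps the last capacity tokens with one in-place write per
// append, so the backing storage never changes shape. This is the
// slice-based analogue of one SliceUpdate on a fixed-shape tensor.
type tokenRing struct {
	buf  []int32
	next int  // index of the next write position
	full bool // true once capacity appends have happened
}

func newTokenRing(capacity int) *tokenRing {
	return &tokenRing{buf: make([]int32, capacity)}
}

// Append overwrites one element instead of concatenating and re-slicing,
// which is safe because the penalties don't care about token order
// within the window.
func (r *tokenRing) Append(tok int32) {
	r.buf[r.next] = tok
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}
```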
Match the ollamarunner and OpenAI semantics: raw, full-vocab log-softmax
with the top-K ranked by probability. The pass is skipped on the GPU when the request doesn't ask for logprobs, so decode doesn't pay for it otherwise.
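
For reference, a plain-Go version of that computation: a numerically stable full-vocab log-softmax, then the top K entries ranked by probability. The real path runs fused on the GPU; the names here are illustrative:

```go
package sketch

import (
	"math"
	"sort"
)

// logprob pairs a token ID with its log-probability.
type logprob struct {
	Token int
	Value float64
}

// topLogprobs computes a numerically stable log-softmax over the full
// vocabulary, then returns the k highest-probability tokens. Ranking by
// probability and by log-probability is equivalent since log is monotonic.
func topLogprobs(logits []float64, k int) []logprob {
	// log-softmax: x_i - max - log(sum_j exp(x_j - max))
	maxv := math.Inf(-1)
	for _, v := range logits {
		if v > maxv {
			maxv = v
		}
	}
	var sum float64
	for _, v := range logits {
		sum += math.Exp(v - maxv)
	}
	lse := maxv + math.Log(sum)

	out := make([]logprob, len(logits))
	for i, v := range logits {
		out[i] = logprob{Token: i, Value: v - lse}
	}
	sort.Slice(out, func(a, b int) bool { return out[a].Value > out[b].Value })
	if k < len(out) {
		out = out[:k]
	}
	return out
}
```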
- Collapse MLX sampling state into a single sample.Sampler struct (options + history).
- Replace the interface-based sampler chain (TopP, TopK, penalty, etc.) with function-based transforms (sketched after this list).
- Update request/pipeline wiring to use *sample.Sampler, seed history from prompt tokens, and append generated tokens each step.
- Implement top_p, min_p, repeat_penalty, and frequency_penalty
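
A sketch of what the function-based transform shape could look like, with top_p as the example; the signature and names are assumptions, not the actual code:

```go
package sketch

import (
	"math"
	"sort"
)

// transform rewrites a logits vector in place; the pipeline composes a
// slice of these instead of walking an interface-based sampler chain.
type transform func(logits []float32)

// topP returns a transform that masks every token outside the smallest
// set whose cumulative softmax probability reaches p (nucleus sampling).
func topP(p float32) transform {
	return func(logits []float32) {
		if len(logits) == 0 {
			return
		}
		idx := make([]int, len(logits))
		for i := range idx {
			idx[i] = i
		}
		sort.Slice(idx, func(a, b int) bool { return logits[idx[a]] > logits[idx[b]] })

		// softmax denominator over the full vocab, numerically stable
		maxv := logits[idx[0]]
		var sum float64
		for _, i := range idx {
			sum += math.Exp(float64(logits[i] - maxv))
		}

		var cum float64
		cut := len(idx)
		for rank, i := range idx {
			cum += math.Exp(float64(logits[i]-maxv)) / sum
			if cum >= float64(p) {
				cut = rank + 1 // keep tokens up to and including this one
				break
			}
		}
		for _, i := range idx[cut:] {
			logits[i] = float32(math.Inf(-1)) // mask out of the nucleus
		}
	}
}
```

Composition then reduces to a loop over a []transform slice applied to each logits row before the final pick.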