This PR separates prompt caching from the public shift request option for native llama-server requests.
Previously, shift controlled two different mechanisms:
* context shifting / overflow behavior
* per-request llama-server cache_prompt
That meant callers could not request shift: false without also disabling prompt caching.
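A minimal sketch of the decoupling described above, not the actual Ollama request type. The struct, the `buildRequest` helper, and the `context_shift` field name are hypothetical; only `cache_prompt` is named in this description, and the sketch assumes caching stays enabled regardless of shift.

```go
package main

// completionRequest is a hypothetical, simplified stand-in for the body sent
// to llama-server. The real request carries many more fields.
type completionRequest struct {
	Prompt       string `json:"prompt"`
	CachePrompt  bool   `json:"cache_prompt"`
	ContextShift bool   `json:"context_shift"` // assumed field name
}

// buildRequest derives the two settings independently. A nil shift means the
// caller did not set the option, so the (assumed) default of true applies.
func buildRequest(prompt string, shift *bool) completionRequest {
	req := completionRequest{
		Prompt:       prompt,
		CachePrompt:  true, // no longer tied to shift
		ContextShift: true,
	}
	if shift != nil {
		req.ContextShift = *shift
	}
	return req
}
```

With this shape, passing a pointer to false disables context shifting while the request still asks llama-server to reuse the cached prompt.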
Fixes #16635
Adding a scalar of a different data type to a tensor, or multiplying a
tensor by one, can promote the tensor and cause performance issues.
This change adds several guards against over-promotion.
The batched MTP accept paths advance the cache by the whole accepted run
before streaming it to the client. If the stream was cancelled partway
(e.g. the caller disconnects), the loop returned before recording the
remaining accepted tokens, leaving the cache offset ahead of
session.outputs. close() then indexed the token log past its end and
panicked with a slice-bounds error.
Record the whole run to session.outputs before streaming any of it, so a
cancelled stream can no longer desync the cache from the token log.
The same bug is present on main, with identical mechanics: the accept
paths there commit the cache to before+accepted and then stream in a loop
that returns on cancellation before recording the rest.
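The ordering fix can be sketched as follows. This is an illustration under assumed names (`session`, `accept`, `emit`, `errCancelled`), not the runner's actual code: the cache offset and the token log are both updated for the whole run before the first token is streamed, so an early return cannot leave them out of step.

```go
package main

import "errors"

// errCancelled is a hypothetical sentinel for a client disconnect.
var errCancelled = errors.New("stream cancelled")

// session is a simplified stand-in: cacheOffset models how far the cache has
// advanced, outputs models the token log that close() later indexes.
type session struct {
	cacheOffset int
	outputs     []int
}

// accept commits an accepted run. The cache advances by the whole run, so the
// whole run is recorded before any token is streamed.
func (s *session) accept(run []int, emit func(tok int) error) error {
	s.cacheOffset += len(run)
	s.outputs = append(s.outputs, run...)

	for _, tok := range run {
		if err := emit(tok); err != nil {
			// Returning early is now safe: the log already matches the cache.
			return err
		}
	}
	return nil
}
```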
Prefill no longer splits its batch at each requested snapshot offset. The
session schedules the pending offsets on every cache before prefill, runs the
forward in full-size chunks, and attaches the captured snapshots to the trie
afterward. Offsets the prefill never crosses (it leaves one token for decode
seeding) are dropped instead of materializing a node for tokens never written,
and snapshots from an abandoned prefill are released on session close.
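A rough sketch of the scheduling idea, using a hypothetical `planPrefill` helper. It assumes prefill writes tokens [0, total-1) and treats an offset as crossed when it is positive and no greater than total-1; the exact boundary rule in the real session may differ.

```go
package main

// planPrefill is a hypothetical helper. Chunk boundaries depend only on
// chunkSize (which must be positive), never on the snapshot offsets, and
// offsets the prefill cannot reach are reported as dropped.
func planPrefill(total, chunkSize int, offsets []int) (chunks [][2]int, kept, dropped []int) {
	// Leave one token for decode seeding.
	end := total - 1
	if end < 0 {
		end = 0
	}

	// Full-size chunks, with a shorter final chunk if needed.
	for start := 0; start < end; start += chunkSize {
		stop := start + chunkSize
		if stop > end {
			stop = end
		}
		chunks = append(chunks, [2]int{start, stop})
	}

	// Offsets are scheduled up front; the ones never crossed are dropped
	// rather than turned into nodes for tokens that were never written.
	for _, off := range offsets {
		if off > 0 && off <= end {
			kept = append(kept, off)
		} else {
			dropped = append(dropped, off)
		}
	}
	return chunks, kept, dropped
}
```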
Speculation used a parallel hierarchy of wrapper cache types that shadowed
the live caches and reconciled against them on commit. Replace it with
snapshot/restore on the live caches themselves: a cache snapshots itself as
a write crosses each offset, and the runner commits a batched draft by
restoring to the accepted count. The wrappers and the comparison plumbing
around them are gone.
Snapshots are lazy. A KV or rotating capture indexes into the live buffer and
owns no memory until a destructive write forces a copy-out, so rejecting a
draft is free.
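The lazy capture can be illustrated with a small copy-on-destructive-write sketch. All names here (`liveBuffer`, `snapshot`, `Overwrite`, `Restore`) are hypothetical and the buffer is a plain slice rather than an MLX array; the point is only that a snapshot is an index into live data until something is about to overwrite the range it covers.

```go
package main

// liveBuffer is a hypothetical stand-in for a KV buffer.
type liveBuffer struct {
	data  []float32
	snaps []*snapshot
}

// snapshot covers the first n entries of the live buffer. It owns no memory
// until a destructive write forces a copy into owned.
type snapshot struct {
	buf   *liveBuffer
	n     int
	owned []float32
}

// Snapshot records the current length; nothing is copied.
func (b *liveBuffer) Snapshot() *snapshot {
	s := &snapshot{buf: b, n: len(b.data)}
	b.snaps = append(b.snaps, s)
	return s
}

// Append only adds entries past every existing snapshot, so it is not
// destructive and triggers no copy.
func (b *liveBuffer) Append(v ...float32) { b.data = append(b.data, v...) }

// Overwrite is destructive for any snapshot covering index i, so those
// snapshots copy their range out first.
func (b *liveBuffer) Overwrite(i int, v float32) {
	for _, s := range b.snaps {
		if s.owned == nil && i < s.n {
			s.owned = append([]float32(nil), b.data[:s.n]...)
		}
	}
	b.data[i] = v
}

// Values returns the captured contents, from the copy if one was forced.
func (s *snapshot) Values() []float32 {
	if s.owned != nil {
		return s.owned
	}
	return s.buf.data[:s.n]
}

// Restore rolls the live buffer back to the snapshot. When no copy was
// forced this is just a truncation.
func (b *liveBuffer) Restore(s *snapshot) {
	if s.owned == nil {
		b.data = b.data[:s.n]
		return
	}
	b.data = append(b.data[:0], s.owned...)
}
```

In this sketch, rejecting a draft that only appended tokens is a slice truncation, which matches the "free" case described above.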
Recurrent layers now validate in the same batched pass rather than falling
back to serial. A gated-delta layer reports its interior split offsets and
hands back the recurrent state at each one, which the cache records as a
snapshot.
CausalConv1D and GatedDelta now run their scan in segments cut at optional
WithSnapshotSplits offsets and return the recurrent state at each boundary
instead of just the final state. The output is identical to the unsegmented
scan; segmenting only adds a few kernel launches, not extra recurrence compute.
This lets a batched forward capture interior recurrent state without re-running
the scan, which the cache will use for speculative validation rollback points.
RecurrentCache.Put and the Qwen3.5 layer now thread the boundary-state slices,
committing the final entry as the live state.
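To make the segmenting concrete, here is a toy sketch with a scalar linear recurrence h = decay*h + x standing in for the real kernels. `scan` and `segmentedScan` are hypothetical names; the sketch assumes the split offsets are ascending and lie strictly between 0 and the sequence length.

```go
package main

// scan runs h = decay*h + x over xs starting from h0, writes each step's
// state to out, and returns the final state.
func scan(h0, decay float64, xs, out []float64) float64 {
	h := h0
	for i, x := range xs {
		h = decay*h + x
		out[i] = h
	}
	return h
}

// segmentedScan runs the same recurrence in segments cut at splits and
// returns the state at each boundary, with the final state as the last entry.
// Each segment starts from the previous segment's state, so no step is
// recomputed.
func segmentedScan(h0, decay float64, xs []float64, splits []int) (out, states []float64) {
	out = make([]float64, len(xs))
	bounds := append(append([]int(nil), splits...), len(xs))

	h, start := h0, 0
	for _, end := range bounds {
		h = scan(h, decay, xs[start:end], out[start:end])
		states = append(states, h)
		start = end
	}
	return out, states
}
```

The caller can treat the last entry of `states` as the live state and the earlier entries as rollback points.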
cache.go had grown to hold every cache kind. Move KVCache (and its
speculative wrappers) to kvcache.go and RotatingKVCache (and its
sliding-window mask applier) to rotating.go, leaving cache.go with the
shared interfaces and the Speculation transaction. Pure relocation;
no behavior change.
Work that panics on the locked MLX worker goroutine was recovered and
re-raised on the caller, so the printed trace pointed at the re-panic
site in this package rather than the code that actually panicked.
Capture the worker stack at recovery and carry it through a value that
implements error, so the runtime prints the original location in the
fatal trace.
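A minimal sketch of the technique, with hypothetical names (`workerPanic`, `runOnWorker`) and a plain goroutine in place of the locked MLX worker. The stack is captured inside the deferred recover, while the panicking frames are still on the worker's stack, and travels to the caller in a value whose Error method includes it.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// workerPanic carries the recovered value together with the worker's stack.
type workerPanic struct {
	value any
	stack []byte
}

// Error makes workerPanic an error, so the runtime prints this text when the
// value is used in an unrecovered panic.
func (p *workerPanic) Error() string {
	return fmt.Sprintf("%v\n\nworker stack:\n%s", p.value, p.stack)
}

// runOnWorker runs fn on another goroutine. If fn panics, the panic is
// re-raised on the caller with the worker's stack attached.
func runOnWorker(fn func()) {
	done := make(chan *workerPanic, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				// debug.Stack here still includes the frames that panicked.
				done <- &workerPanic{value: r, stack: debug.Stack()}
				return
			}
			done <- nil
		}()
		fn()
	}()
	if p := <-done; p != nil {
		panic(p)
	}
}
```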
Add support for launching the hermes-desktop app alongside the hermes agent from ollama launch. If hermes-desktop is not already installed, the first run goes through the install.
Bump llama.cpp to b9509, which includes the upstream Gemma 4 12B multimodal projector fixes for the n_head=0 divide-by-zero crash seen on x86/CUDA/Linux/Windows.
Fixes #16479
Fixes #16489
Fixes #16491
Fixes #16492
Fixes #16495
Windows installer and app cleanup could leave llama-server.exe running when ollama.exe was killed directly, so cleanup now includes llama-server.exe and taskkill /T.
llama.cpp b9478 added a default 30s SSE ping that emits colon-only comment frames (":\n\n") while streamed requests are idle; Ollama treated non-data SSE lines as JSON, so skip SSE comments in completion and chat streams.
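The filtering can be sketched per line as below. `sseData` is a hypothetical helper, not the function in Ollama's stream readers; it relies only on the SSE rule that a line starting with a colon is a comment.

```go
package main

import "strings"

// sseData returns the payload of a "data:" line. It reports false for blank
// separator lines, comment lines such as the ":" ping, and other SSE fields,
// none of which should be handed to the JSON decoder.
func sseData(line string) (payload string, ok bool) {
	line = strings.TrimRight(line, "\r")
	if line == "" || strings.HasPrefix(line, ":") {
		return "", false
	}
	data, found := strings.CutPrefix(line, "data:")
	if !found {
		return "", false
	}
	return strings.TrimSpace(data), true
}
```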
* llama: add laguna (poolside) arch via a llama.cpp patch under llama/compat/models
The pinned llama.cpp does not include poolside Laguna yet. Add it as an Ollama-owned source file plus a small registration patch under llama/compat/models/. apply-patch.cmake now applies every *.patch under llama/compat/ (the hooks patch plus each arch patch), so adding an architecture only adds files under llama/compat/models/ and needs no new cmake.
* cleanup patch to keep windows happy
---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
llama-server model loads could time out after the fixed load duration even while tensor-loading progress dots were still being emitted, so track raw runner output activity and use OLLAMA_LOAD_TIMEOUT as a stall deadline.
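The difference between a fixed deadline and a stall deadline can be sketched like this. `loadWatch` is a hypothetical type; times are expressed as elapsed time since the load began to keep the logic easy to follow, and the real code's bookkeeping may differ.

```go
package main

import "time"

// loadWatch tracks runner output activity during a model load.
type loadWatch struct {
	stallTimeout time.Duration // e.g. the value of OLLAMA_LOAD_TIMEOUT
	lastActivity time.Duration // elapsed time at the most recent output
}

// Activity records that the runner produced output (for example a progress
// dot) at the given elapsed time.
func (w *loadWatch) Activity(elapsed time.Duration) {
	w.lastActivity = elapsed
}

// Stalled reports whether the runner has been silent for longer than the
// timeout. A load that keeps emitting output never trips this, however long
// it runs in total.
func (w *loadWatch) Stalled(elapsed time.Duration) bool {
	return elapsed-w.lastActivity > w.stallTimeout
}
```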
Fixes #16416
Fixes #16412
Default integrated GPU filtering dropped the supported ROCm gfx1151 Radeon 8060S unless OLLAMA_IGPU_ENABLE was set, so add a ROCm gfx-target allowlist with gfx1151 as the first admitted target. The Radeon 8060S is a known-good iGPU.
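A small sketch of the allowlist check, with hypothetical names (`allowedIGPUTargets`, `keepGPU`) and a simplified view of the inputs; the real discovery code sees more device attributes than this.

```go
package main

// allowedIGPUTargets lists ROCm gfx targets for integrated GPUs that are
// admitted by default. gfx1151 is the only entry named in this change.
var allowedIGPUTargets = map[string]bool{
	"gfx1151": true,
}

// keepGPU reports whether a device survives integrated-GPU filtering.
// igpuEnabled models OLLAMA_IGPU_ENABLE being set.
func keepGPU(integrated bool, gfxTarget string, igpuEnabled bool) bool {
	if !integrated || igpuEnabled {
		return true
	}
	return allowedIGPUTargets[gfxTarget]
}
```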
Fixes #16423