Commit graph

5464 commits

Author SHA1 Message Date
ParthSareen
6a6b975607 ci: test darwin xcode pin 2026-06-17 12:08:27 -07:00
Patrick Devine
8c432fc88a
llama: update llama.cpp to b9672 (#16775) 2026-06-16 23:15:52 -07:00
Jeffrey Morgan
acfb50d9af
models: add cohere2_moe (Command A / North) to the MLX engine (#16670)
Implements Cohere2MoeForCausalLM (e.g. CohereLabs/North-Mini-Code-1.0)
2026-06-16 23:15:21 -07:00
Jeffrey Morgan
0f047feef5
llm: context shift allow shiftable prompts (#16764) 2026-06-16 12:55:52 -07:00
Patrick Devine
9e4ed74efe
integration: look for the "hf" tool in integration tests (#16765)
The "huggingface-cli" tool is deprecated, so only try to use the "hf" tool.
2026-06-16 11:04:54 -07:00
Jeffrey Morgan
bbb40a0a6c
server: context shift for context windows larger than 8k, add error when hitting context limit (#16712) 2026-06-15 11:36:50 -07:00
Jeffrey Morgan
993acc7504
model: update lfm2 parser/renderer for optional thinking (#16359) 2026-06-14 20:37:08 -07:00
Jeffrey Morgan
7ea692cb2b
llama: update llama.cpp to b9637 (#16609) 2026-06-14 20:05:08 -07:00
Parth Sareen
12e04379cd
launch: Fix launch provider drift (#16683) 2026-06-11 17:21:46 -07:00
Parafee41
f8a48df24d
llm: decouple prompt caching from context shift (#16639)
This PR separates prompt caching from the public shift request option for native llama-server requests.

Previously, shift controlled two different mechanisms:

* context shifting / overflow behavior
* per-request llama-server cache_prompt

That meant callers could not request shift: false without also disabling prompt caching.

Fixes #16635
2026-06-11 16:05:24 -07:00
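
A minimal sketch of the decoupling described in #16639, not the code from the PR. The `runnerRequest` type, the `buildRequest` helper, and the `context_shift` field name are hypothetical; only `cache_prompt` and the public `shift` option are named in the commit message. The sketch also assumes prompt caching defaults to on.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runnerRequest is a hypothetical body sent to llama-server. cache_prompt is
// the field named in the commit message; context_shift is an assumed name.
type runnerRequest struct {
	Prompt       string `json:"prompt"`
	CachePrompt  bool   `json:"cache_prompt"`
	ContextShift bool   `json:"context_shift"`
}

// buildRequest maps the public shift option onto overflow behavior only.
// Prompt caching is decided separately and is assumed to stay on by default.
func buildRequest(prompt string, shift *bool) runnerRequest {
	req := runnerRequest{Prompt: prompt, CachePrompt: true, ContextShift: true}
	if shift != nil {
		req.ContextShift = *shift
	}
	return req
}

func main() {
	off := false
	body, err := json.Marshal(buildRequest("hello", &off))
	if err != nil {
		panic(err)
	}
	// shift: false no longer turns cache_prompt off.
	fmt.Println(string(body))
}
```

With this shape a caller can pass shift: false and still reuse the cached prompt prefix.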
Patrick Devine
82e0ddb6fe
mlxrunner: harden linear/embedding layers against over-promotion (#16682)
Adding a scalar of a different data type to a tensor, or multiplying a
tensor by one, can promote the tensor and cause performance issues.

This change adds several guards against over-promotion.
2026-06-11 13:56:25 -07:00
Jesse Gross
1abd56b6e6 mlxrunner: record committed MTP drafts before streaming them
The batched MTP accept paths advance the cache by the whole accepted run
before streaming it to the client. If the stream was cancelled partway
(e.g. the caller disconnects), the loop returned before recording the
remaining accepted tokens, leaving the cache offset ahead of
session.outputs. close() then indexed the token log past its end and
panicked with a slice-bounds error.

Record the whole run to session.outputs before streaming any of it, so a
cancelled stream can no longer desync the cache from the token log.

The same bug is present on main, with identical mechanics: the accept
paths there commit the cache to before+accepted and then stream in a loop
that returns on cancellation before recording the rest.
2026-06-09 00:39:19 -07:00
Jesse Gross
ded2db7d86 mlxrunner: capture prefill snapshots across the forward
Prefill no longer splits its batch at each requested snapshot offset. The
session schedules the pending offsets on every cache before prefill, runs the
forward in full-size chunks, and attaches the captured snapshots to the trie
afterward. Offsets the prefill never crosses (it leaves one token for decode
seeding) are dropped instead of materializing a node for tokens never written,
and snapshots from an abandoned prefill are released on session close.
2026-06-09 00:39:19 -07:00
Jesse Gross
d00622060f mlxrunner: drive MTP speculation through cache snapshots
Speculation used a parallel hierarchy of wrapper cache types that shadowed
the live caches and reconciled against them on commit. Replace it with
snapshot/restore on the live caches themselves: a cache snapshots itself as
a write crosses each offset, and the runner commits a batched draft by
restoring to the accepted count. The wrappers and the comparison plumbing
around them are gone.

Snapshots are lazy. A KV or rotating capture indexes into the live buffer and
owns no memory until a destructive write forces a copy-out, so rejecting a
draft is free.

Recurrent layers now validate in the same batched pass rather than falling
back to serial. A gated-delta layer reports its interior split offsets and
hands back the recurrent state at each one, which the cache records as a
snapshot.
2026-06-09 00:39:19 -07:00
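
A rough sketch of the lazy snapshot idea in d00622060f, under heavy simplification. The `cache` and `snapshot` types and their methods are hypothetical and operate on a plain integer slice rather than KV tensors. A capture stores only an offset into the live buffer; a copy is made only when a destructive write would touch the captured range.

```go
package main

import "fmt"

// snapshot is a lazy capture of the first n entries of a live buffer.
type snapshot struct {
	n     int
	owned []int // nil until a destructive write forces a copy-out
}

// cache is a hypothetical live buffer that tracks its snapshots.
type cache struct {
	data  []int
	snaps []*snapshot
}

// add appends one entry; appending never invalidates a snapshot.
func (c *cache) add(v int) { c.data = append(c.data, v) }

// capture records the current offset only; it copies nothing.
func (c *cache) capture() *snapshot {
	s := &snapshot{n: len(c.data)}
	c.snaps = append(c.snaps, s)
	return s
}

// overwrite is a destructive write. Snapshots covering index i copy out
// their range first so they can still be restored later.
func (c *cache) overwrite(i, v int) {
	for _, s := range c.snaps {
		if s.owned == nil && i < s.n {
			s.owned = append([]int(nil), c.data[:s.n]...)
		}
	}
	c.data[i] = v
}

// restore rolls the live buffer back to the snapshot.
func (c *cache) restore(s *snapshot) {
	if s.owned != nil {
		c.data = append(c.data[:0], s.owned...)
		return
	}
	c.data = c.data[:s.n] // nothing was overwritten: truncate only
}

func main() {
	c := &cache{}
	for _, v := range []int{1, 2, 3} {
		c.add(v)
	}
	s := c.capture()
	c.add(4) // draft tokens
	c.add(5)
	c.restore(s) // draft rejected
	fmt.Println(c.data, s.owned == nil)
}
```

In this model rejecting a draft is a truncation and the snapshot never owned memory, which is the property the commit message describes.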
Jesse Gross
177aefb8a9 nn/recurrent: return per-boundary states from the gated-delta kernels
CausalConv1D and GatedDelta now run their scan in segments cut at optional
WithSnapshotSplits offsets and return the recurrent state at each boundary
instead of just the final state. The output is identical to the unsegmented
scan; segmenting only adds a few kernel launches, not extra recurrence compute.

This lets a batched forward capture interior recurrent state without re-running
the scan, which the cache will use for speculative validation rollback points.
RecurrentCache.Put and the Qwen3.5 layer now thread the boundary-state slices,
committing the final entry as the live state.
2026-06-09 00:39:19 -07:00
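
The segmented scan in 177aefb8a9 can be pictured with a scalar toy recurrence. The `scan` function below is a hypothetical analogue, not the GatedDelta kernel: it runs h = a*h + x in segments cut at the split offsets and returns the state at each boundary plus the final state. It assumes the splits are strictly increasing and lie inside the input.

```go
package main

import "fmt"

// scan runs the recurrence h = a*h + x over xs in segments cut at splits.
// It returns every output and the state at each boundary; the last entry
// of states is the final state.
func scan(h0, a float64, xs []float64, splits []int) (out, states []float64) {
	h := h0
	start := 0
	bounds := append(append([]int(nil), splits...), len(xs))
	for _, end := range bounds {
		for _, x := range xs[start:end] {
			h = a*h + x
			out = append(out, h)
		}
		states = append(states, h) // state at this boundary
		start = end
	}
	return out, states
}

func main() {
	xs := []float64{1, 2, 3, 4}
	whole, _ := scan(0, 0.5, xs, nil)
	split, states := scan(0, 0.5, xs, []int{2})
	fmt.Println(whole)
	fmt.Println(split)
	fmt.Println(states)
}
```

Both calls perform the same arithmetic in the same order, so the outputs match; the split version additionally hands back the interior state after two steps, which is what a rollback point would need.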
Jesse Gross
07588c64ee mlxrunner/cache: split KVCache and RotatingKVCache into their own files
cache.go had grown to hold every cache kind. Move KVCache (and its
speculative wrappers) to kvcache.go and RotatingKVCache (and its
sliding-window mask applier) to rotating.go, leaving cache.go with the
shared interfaces and the Speculation transaction. Pure relocation;
no behavior change.
2026-06-09 00:39:19 -07:00
Jesse Gross
4c97a940ca mlxthread: preserve the original stack when worker work panics
Work that panics on the locked MLX worker goroutine was recovered and
re-raised on the caller, so the printed trace pointed at the re-panic
site in this package rather than the code that actually panicked.

Capture the worker stack at recovery and carry it through a value that
implements error, so the runtime prints the original location in the
fatal trace.
2026-06-09 00:39:19 -07:00
Bruce MacDonald
74cbf1d2c2
docs: omp (#16552)
Add docs that explain and set up "oh my pi" (omp)
2026-06-08 11:43:51 -07:00
Bruce MacDonald
5c1e37eb67
docs: hermes desktop (#16549) 2026-06-08 11:43:11 -07:00
Jeffrey Morgan
f0078ae476
docs: update docs examples to use Gemma 4 instead of Gemma 3 (#16607) 2026-06-07 12:43:13 -07:00
Jeffrey Morgan
96201a623a
Add AGENTS.md and CLAUDE.md to root repository (#16604) 2026-06-07 10:57:59 -07:00
Daniel Hiltgen
9c94c2b11e
docs: describe llama.cpp update process (#16603) 2026-06-07 10:27:47 -07:00
Parth Sareen
e09b3f9fb5
openai: align models list with tags (#16556) 2026-06-05 17:59:05 -07:00
Bruce MacDonald
a0099da2d1
launch: use native Windows Hermes config path (#16558) 2026-06-05 17:29:19 -07:00
Chris Chen
25e0e81e12
docs: update Zod example to use native toJSONSchema (#14746)
Co-authored-by: fuleinist <fuleinist@gmail.com>
2026-06-05 16:21:07 -07:00
Bruce MacDonald
87cff95af8
launch: oh-my-pi (#16410) 2026-06-04 17:49:49 -07:00
Patrick Devine
3ef69ef784
mlx: allow the embedding layer to use the nvfp4 global scale (#16527) 2026-06-04 17:40:01 -07:00
Michael Yang
1a7786be14
docs: add cloud model retirement (#16528) 2026-06-04 15:18:38 -07:00
Bruce MacDonald
3370ff8b1c
launch: hermes-desktop app (#16516)
Add support to launch the hermes-desktop app alongside the hermes agent from ollama launch. It will go through the install on first run if hermes-desktop is not already installed.
2026-06-04 11:51:36 -07:00
Daniel Hiltgen
455f57457d
llama.cpp version update (#16511)
Bump llama.cpp to b9509, which includes the upstream Gemma 4 12B multimodal projector fixes for the n_head=0 divide-by-zero crash seen on x86/CUDA/Linux/Windows.

Fixes #16479
Fixes #16489
Fixes #16491
Fixes #16492
Fixes #16495
2026-06-04 08:20:57 -07:00
Bruce MacDonald
1d955ed990
integrations: hermes windows install (#16487) 2026-06-03 17:40:45 -07:00
Eva H
d071237131
docs: add Cline CLI integration doc (#16341) 2026-06-03 20:30:01 -04:00
Daniel Hiltgen
229a1303fb
llama-server: fix gemma4 patch wiring (#16477)
This will fix the "clip.cpp:4399: Unknown projector type" crash.
2026-06-03 14:41:03 -07:00
Parth Sareen
ac3d0657a2
launch: migrate pi (#16213) 2026-06-03 14:35:32 -07:00
Daniel Hiltgen
01557ff313
llama-server: allow GPU offload for projectors (#16473)
Special case Metal iGPUs to enable GPU offload.
2026-06-03 13:58:40 -07:00
Patrick Devine
e5a38739b4
mlx: "requires" in modelfile is being ignored for mlx based models (#16469)
This change fixes an issue in `ollama create --experimental` which
is currently ignoring the REQUIRES command in a Modelfile.
2026-06-03 13:10:57 -07:00
Jeffrey Morgan
5f56a289b3
server: classify mmproj GGUFs as projector layers (#16472) 2026-06-03 12:59:34 -07:00
Eva H
ad8cda255d
launch: clean legacy codex profile before launch (#16467) 2026-06-03 14:49:31 -04:00
Daniel Hiltgen
3e1b4fe39d
Kill llama-server during Windows cleanup (#16458)
Windows installer and app cleanup could leave llama-server.exe running when ollama.exe was killed directly, so cleanup now includes llama-server.exe and taskkill /T.
2026-06-03 10:25:12 -07:00
Daniel Hiltgen
52196f1a97
llama.cpp version update (#16463)
Bump llama.cpp to b9493 and refresh the Laguna compat patch for upstream enum/tokenizer movement and the renamed SWA layer bitmap field.
2026-06-03 10:20:30 -07:00
Patrick Devine
50bbda5660
models: add support for gemma4-12b (#16457) 2026-06-03 07:44:57 -07:00
Daniel Hiltgen
4b5bdd3b25
fix laguna patch build breakage (#16445)
Follow up to #16396

Fix kernel template instantiation so the symbols are exported in the library.
2026-06-02 16:35:19 -07:00
Daniel Hiltgen
e828061b6e
llm: ignore llama-server SSE ping comments (#16443)
llama.cpp b9478 added a default 30s SSE ping that emits colon-only comment frames (":\n\n") while streamed requests are idle; Ollama treated non-data SSE lines as JSON, so skip SSE comments in completion and chat streams.
2026-06-02 15:40:14 -07:00
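
A minimal sketch of the filtering described in #16443, not the actual Ollama stream reader. `dataPayloads` is a hypothetical helper. In the SSE format a line that begins with a colon is a comment, so it is skipped before any JSON decoding is attempted.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// dataPayloads returns the payload of every "data:" line in an SSE stream,
// ignoring blank frame delimiters and comment lines such as keep-alive pings.
func dataPayloads(stream string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		line := sc.Text()
		switch {
		case line == "":
			continue // blank line ends a frame
		case strings.HasPrefix(line, ":"):
			continue // SSE comment, e.g. the idle ping
		case strings.HasPrefix(line, "data:"):
			payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
			out = append(out, payload)
		}
	}
	return out
}

func main() {
	stream := ":\n\ndata: {\"content\":\"hi\"}\n\n:\n\ndata: [DONE]\n\n"
	for _, p := range dataPayloads(stream) {
		fmt.Println(p)
	}
}
```

Only the two data payloads reach the caller; the colon-only frames never get passed to a JSON decoder.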
Bruce MacDonald
7a2073d17b
docs: configure hermes desktop app (#16440) 2026-06-02 14:32:10 -07:00
Daniel Hiltgen
c952708169
llama: add laguna (poolside) arch via a llama.cpp patch under llama/c… (#16396)
* llama: add laguna (poolside) arch via a llama.cpp patch under llama/compat/models

The pinned llama.cpp does not include poolside Laguna yet. Add it as an Ollama-owned source file plus a small registration patch under llama/compat/models/. apply-patch.cmake now applies every *.patch under llama/compat/ (the hooks patch plus each arch patch), so adding an architecture only adds files under llama/compat/models/ and needs no new cmake.

* cleanup patch to keep windows happy

---------

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
2026-06-02 13:17:08 -07:00
Parth Sareen
f57d111754
launch: isolate Codex launch configuration (#16437) 2026-06-02 12:10:46 -07:00
Daniel Hiltgen
c34a79a373
llama.cpp version update (#16426) 2026-06-02 11:46:56 -07:00
Daniel Hiltgen
b051c9cf83
Further harden app markdown URL handling (#16436) 2026-06-02 11:46:14 -07:00
Daniel Hiltgen
b7b7fa0454
llm: detect llama-server load stalls from output (#16427)
llama-server model loads could time out after the fixed load duration even while tensor-loading progress dots were still being emitted, so track raw runner output activity and use OLLAMA_LOAD_TIMEOUT as a stall deadline.

Fixes #16416
Fixes #16412
2026-06-02 11:30:48 -07:00
Daniel Hiltgen
4c076813be
discover: allow Radeon 8060S iGPU by default (#16429)
Default integrated GPU filtering dropped the supported ROCm gfx1151 Radeon 8060S unless OLLAMA_IGPU_ENABLE was set, so add a ROCm gfx-target allowlist with gfx1151 as the first admitted target. The Radeon 8060S is a known-good iGPU.

Fixes #16423
2026-06-02 11:15:01 -07:00
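
The allowlist in #16429 might look roughly like this. `rocmIGPUAllowlist` and `keepGPU` are hypothetical names, and the `igpuEnabled` flag stands in for OLLAMA_IGPU_ENABLE being set; the real discovery code is not shown in this log.

```go
package main

import "fmt"

// rocmIGPUAllowlist holds integrated ROCm targets admitted by default.
// gfx1151 is the only entry the commit message names.
var rocmIGPUAllowlist = map[string]bool{
	"gfx1151": true,
}

// keepGPU reports whether a device survives the default filter. Discrete
// GPUs always pass; integrated GPUs pass when explicitly enabled or when
// their gfx target is on the allowlist.
func keepGPU(integrated bool, gfxTarget string, igpuEnabled bool) bool {
	if !integrated || igpuEnabled {
		return true
	}
	return rocmIGPUAllowlist[gfxTarget]
}

func main() {
	fmt.Println(keepGPU(true, "gfx1151", false))   // allowlisted iGPU
	fmt.Println(keepGPU(true, "gfx-other", false)) // placeholder target, not listed
	fmt.Println(keepGPU(true, "gfx-other", true))  // explicitly enabled
}
```

Keeping the list as data means admitting another known-good iGPU is a one-line change.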