ollama

mirror of https://github.com/ollama/ollama.git synced 2026-07-05 07:11:10 +00:00

Author	SHA1	Message	Date
ParthSareen	6a6b975607	ci: test darwin xcode pin	2026-06-17 12:08:27 -07:00
Patrick Devine	8c432fc88a	llama: update llama.cpp to b9672 (#16775)	2026-06-16 23:15:52 -07:00
Jeffrey Morgan	acfb50d9af	models: add cohere2_moe (Command A / North) to the MLX engine (#16670) Implements Cohere2MoeForCausalLM (e.g. CohereLabs/North-Mini-Code-1.0)	2026-06-16 23:15:21 -07:00
Jeffrey Morgan	0f047feef5	llm: context shift allow shiftable prompts (#16764)	2026-06-16 12:55:52 -07:00
Patrick Devine	9e4ed74efe	integration: look for the "hf" tool in integration tests (#16765) The "huggingface-cli" tool is deprecated, so only try to use the "hf" tool.	2026-06-16 11:04:54 -07:00
Jeffrey Morgan	bbb40a0a6c	server: context shift for context windows larger than 8k, add error when hitting context limit (#16712)	2026-06-15 11:36:50 -07:00
Jeffrey Morgan	993acc7504	model: update lfm2 parser/renderer for optional thinking (#16359)	2026-06-14 20:37:08 -07:00
Jeffrey Morgan	7ea692cb2b	llama: update llama.cpp to b9637 (#16609)	2026-06-14 20:05:08 -07:00
Parth Sareen	12e04379cd	launch: Fix launch provider drift (#16683)	2026-06-11 17:21:46 -07:00
Parafee41	f8a48df24d	llm: decouple prompt caching from context shift (#16639) This PR separates prompt caching from the public shift request option for native llama-server requests. Previously, shift controlled two different mechanisms: (1) context shifting / overflow behavior and (2) per-request llama-server cache_prompt. That meant callers could not request shift: false without also disabling prompt caching. Fixes #16635	2026-06-11 16:05:24 -07:00
Patrick Devine	82e0ddb6fe	mlxrunner: harden linear/embedding layers against over-promotion (#16682) Adding/Multiplying a tensor by a scalar w/ a different data type can cause the tensor to be promoted and cause performance issues. This change adds several guards against over-promotion.	2026-06-11 13:56:25 -07:00
Jesse Gross	1abd56b6e6	mlxrunner: record committed MTP drafts before streaming them The batched MTP accept paths advance the cache by the whole accepted run before streaming it to the client. If the stream was cancelled partway (e.g. the caller disconnects), the loop returned before recording the remaining accepted tokens, leaving the cache offset ahead of session.outputs. close() then indexed the token log past its end and panicked with a slice-bounds error. Record the whole run to session.outputs before streaming any of it, so a cancelled stream can no longer desync the cache from the token log. The same bug is present on main, with identical mechanics: the accept paths there commit the cache to before+accepted and then stream in a loop that returns on cancellation before recording the rest.	2026-06-09 00:39:19 -07:00
Jesse Gross	ded2db7d86	mlxrunner: capture prefill snapshots across the forward Prefill no longer splits its batch at each requested snapshot offset. The session schedules the pending offsets on every cache before prefill, runs the forward in full-size chunks, and attaches the captured snapshots to the trie afterward. Offsets the prefill never crosses (it leaves one token for decode seeding) are dropped instead of materializing a node for tokens never written, and snapshots from an abandoned prefill are released on session close.	2026-06-09 00:39:19 -07:00
Jesse Gross	d00622060f	mlxrunner: drive MTP speculation through cache snapshots Speculation used a parallel hierarchy of wrapper cache types that shadowed the live caches and reconciled against them on commit. Replace it with snapshot/restore on the live caches themselves: a cache snapshots itself as a write crosses each offset, and the runner commits a batched draft by restoring to the accepted count. The wrappers and the comparison plumbing around them are gone. Snapshots are lazy. A KV or rotating capture indexes into the live buffer and owns no memory until a destructive write forces a copy-out, so rejecting a draft is free. Recurrent layers now validate in the same batched pass rather than falling back to serial. A gated-delta layer reports its interior split offsets and hands back the recurrent state at each one, which the cache records as a snapshot.	2026-06-09 00:39:19 -07:00
Jesse Gross	177aefb8a9	nn/recurrent: return per-boundary states from the gated-delta kernels CausalConv1D and GatedDelta now run their scan in segments cut at optional WithSnapshotSplits offsets and return the recurrent state at each boundary instead of just the final state. The output is identical to the unsegmented scan; segmenting only adds a few kernel launches, not extra recurrence compute. This lets a batched forward capture interior recurrent state without re-running the scan, which the cache will use for speculative validation rollback points. RecurrentCache.Put and the Qwen3.5 layer now thread the boundary-state slices, committing the final entry as the live state.	2026-06-09 00:39:19 -07:00
Jesse Gross	07588c64ee	mlxrunner/cache: split KVCache and RotatingKVCache into their own files cache.go had grown to hold every cache kind. Move KVCache (and its speculative wrappers) to kvcache.go and RotatingKVCache (and its sliding-window mask applier) to rotating.go, leaving cache.go with the shared interfaces and the Speculation transaction. Pure relocation; no behavior change.	2026-06-09 00:39:19 -07:00
Jesse Gross	4c97a940ca	mlxthread: preserve the original stack when worker work panics Work that panics on the locked MLX worker goroutine was recovered and re-raised on the caller, so the printed trace pointed at the re-panic site in this package rather than the code that actually panicked. Capture the worker stack at recovery and carry it through a value that implements error, so the runtime prints the original location in the fatal trace.	2026-06-09 00:39:19 -07:00
Bruce MacDonald	74cbf1d2c2	docs: omp (#16552) Add docs for explaining and setting up "oh my pi" (omp)	2026-06-08 11:43:51 -07:00
Bruce MacDonald	5c1e37eb67	docs: hermes desktop (#16549)	2026-06-08 11:43:11 -07:00
Jeffrey Morgan	f0078ae476	docs: update docs examples to use Gemma 4 instead of Gemma 3 (#16607)	2026-06-07 12:43:13 -07:00
Jeffrey Morgan	96201a623a	Add AGENTS.md and CLAUDE.md to root repository (#16604)	2026-06-07 10:57:59 -07:00
Daniel Hiltgen	9c94c2b11e	docs: describe llama.cpp update process (#16603)	2026-06-07 10:27:47 -07:00
Parth Sareen	e09b3f9fb5	openai: align models list with tags (#16556)	2026-06-05 17:59:05 -07:00
Bruce MacDonald	a0099da2d1	launch: use native Windows Hermes config path (#16558)	2026-06-05 17:29:19 -07:00
Chris Chen	25e0e81e12	docs: update Zod example to use native toJSONSchema (#14746) Co-authored-by: fuleinist <fuleinist@gmail.com>	2026-06-05 16:21:07 -07:00
Bruce MacDonald	87cff95af8	launch: oh-my-pi (#16410)	2026-06-04 17:49:49 -07:00
Patrick Devine	3ef69ef784	mlx: allow the embedding layer to use the nvfp4 global scale (#16527)	2026-06-04 17:40:01 -07:00
Michael Yang	1a7786be14	docs: add cloud model retirement (#16528)	2026-06-04 15:18:38 -07:00
Bruce MacDonald	3370ff8b1c	launch: hermes-desktop app (#16516) Add support to launch the hermes-desktop app alongside the hermes agent from ollama launch. It will go through the install on first run if hermes-desktop is not already installed.	2026-06-04 11:51:36 -07:00
Daniel Hiltgen	455f57457d	llama.cpp version update (#16511) Bump llama.cpp to b9509, which includes the upstream Gemma 4 12B multimodal projector fixes for the n_head=0 divide-by-zero crash seen on x86/CUDA/Linux/Windows. Fixes #16479 Fixes #16489 Fixes #16491 Fixes #16492 Fixes #16495	2026-06-04 08:20:57 -07:00
Bruce MacDonald	1d955ed990	integrations: hermes windows install (#16487)	2026-06-03 17:40:45 -07:00
Eva H	d071237131	docs: add Cline CLI integration doc (#16341)	2026-06-03 20:30:01 -04:00
Daniel Hiltgen	229a1303fb	llama-server: fix gemma4 patch wiring (#16477) This will fix the "clip.cpp:4399: Unknown projector type" crash.	2026-06-03 14:41:03 -07:00
Parth Sareen	ac3d0657a2	launch: migrate pi (#16213)	2026-06-03 14:35:32 -07:00
Daniel Hiltgen	01557ff313	llama-server: allow GPU offload for projectors (#16473) Special case Metal iGPUs to enable GPU offload.	2026-06-03 13:58:40 -07:00
Patrick Devine	e5a38739b4	mlx: "requires" in modelfile is being ignored for mlx based models (#16469) This change fixes an issue in `ollama create --experimental` which is currently ignoring the REQUIRES command in a Modelfile.	2026-06-03 13:10:57 -07:00
Jeffrey Morgan	5f56a289b3	server: classify mmproj GGUFs as projector layers (#16472)	2026-06-03 12:59:34 -07:00
Eva H	ad8cda255d	launch: clean legacy codex profile before launch (#16467)	2026-06-03 14:49:31 -04:00
Daniel Hiltgen	3e1b4fe39d	Kill llama-server during Windows cleanup (#16458) Windows installer and app cleanup could leave llama-server.exe running when ollama.exe was killed directly, so cleanup now includes llama-server.exe and taskkill /T.	2026-06-03 10:25:12 -07:00
Daniel Hiltgen	52196f1a97	llama.cpp version update (#16463) Bump llama.cpp to b9493 and refresh the Laguna compat patch for upstream enum/tokenizer movement and the renamed SWA layer bitmap field.	2026-06-03 10:20:30 -07:00
Patrick Devine	50bbda5660	models: add support for gemma4-12b (#16457)	2026-06-03 07:44:57 -07:00
Daniel Hiltgen	4b5bdd3b25	fix laguna patch build breakage (#16445) Follow up to #16396 Fix kernel template instantiation so the symbols are exported in the library.	2026-06-02 16:35:19 -07:00
Daniel Hiltgen	e828061b6e	llm: ignore llama-server SSE ping comments (#16443) llama.cpp b9478 added a default 30s SSE ping that emits colon-only comment frames (":\n\n") while streamed requests are idle; Ollama treated non-data SSE lines as JSON, so skip SSE comments in completion and chat streams.	2026-06-02 15:40:14 -07:00
Bruce MacDonald	7a2073d17b	docs: configure hermes desktop app (#16440)	2026-06-02 14:32:10 -07:00
Daniel Hiltgen	c952708169	llama: add laguna (poolside) arch via a llama.cpp patch under llama/c… (#16396) * llama: add laguna (poolside) arch via a llama.cpp patch under llama/compat/models The pinned llama.cpp does not include poolside Laguna yet. Add it as an Ollama-owned source file plus a small registration patch under llama/compat/models/. apply-patch.cmake now applies every .patch under llama/compat/ (the hooks patch plus each arch patch), so adding an architecture only adds files under llama/compat/models/ and needs no new cmake. cleanup patch to keep windows happy --------- Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>	2026-06-02 13:17:08 -07:00
Parth Sareen	f57d111754	launch: isolate Codex launch configuration (#16437)	2026-06-02 12:10:46 -07:00
Daniel Hiltgen	c34a79a373	llama.cpp version update (#16426)	2026-06-02 11:46:56 -07:00
Daniel Hiltgen	b051c9cf83	More harden app markdown URL handling (#16436)	2026-06-02 11:46:14 -07:00
Daniel Hiltgen	b7b7fa0454	llm: detect llama-server load stalls from output (#16427) llama-server model loads could time out after the fixed load duration even while tensor-loading progress dots were still being emitted, so track raw runner output activity and use OLLAMA_LOAD_TIMEOUT as a stall deadline. Fixes #16416 Fixes #16412	2026-06-02 11:30:48 -07:00
Daniel Hiltgen	4c076813be	discover: allow Radeon 8060S iGPU by default (#16429) Default integrated GPU filtering dropped the supported ROCm gfx1151 Radeon 8060S unless OLLAMA_IGPU_ENABLE was set, so add a ROCm gfx-target allowlist with gfx1151 as the first admitted target. This iGPU is a known-good iGPU. Fixes #16423	2026-06-02 11:15:01 -07:00
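
Illustration for commit e828061b6e (llm: ignore llama-server SSE ping comments): a minimal Go sketch of skipping SSE comment frames while reading a stream. This is not the code from that commit. The function name dataPayloads and the sample stream are hypothetical, and the only behavior assumed is what the commit message states: lines that start with ":" are comments and must not be parsed as JSON.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// dataPayloads is a hypothetical helper: it scans an SSE stream line by line
// and returns the payload of every "data:" line. Blank lines (event
// separators) and lines starting with ":" (SSE comments, such as keep-alive
// pings) are skipped instead of being handed to a JSON decoder.
func dataPayloads(stream string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		line := sc.Text()
		if line == "" || strings.HasPrefix(line, ":") {
			continue // separator or comment frame: nothing to decode
		}
		if strings.HasPrefix(line, "data:") {
			payload := strings.TrimPrefix(line, "data:")
			// The SSE format allows one optional space after the colon.
			out = append(out, strings.TrimPrefix(payload, " "))
		}
	}
	return out
}

func main() {
	// Two data frames with a colon-only ping comment between them.
	stream := "data: {\"content\":\"hi\"}\n\n:\n\ndata: {\"content\":\"there\"}\n\n"
	for _, p := range dataPayloads(stream) {
		fmt.Println(p)
	}
}
```

Filtering on the line prefix before decoding keeps the ping frames out of the JSON path entirely, which is the failure mode the commit message describes.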

5464 commits
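
Illustration for commit 4c97a940ca (mlxthread: preserve the original stack when worker work panics): a minimal Go sketch of the general technique the message describes, capturing the worker goroutine's stack at recovery and carrying it in a value that implements error. This is not the mlxthread code. The names workerPanic, runOnWorker and explode are hypothetical, and a plain goroutine stands in for the locked MLX worker.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// workerPanic (hypothetical) carries the recovered panic value together with
// the stack captured on the worker goroutine at the moment of recovery.
type workerPanic struct {
	value any
	stack []byte
}

// Error includes the worker stack, so anything that prints the error (or the
// runtime, if the caller re-panics with this value) shows the original site.
func (p *workerPanic) Error() string {
	return fmt.Sprintf("%v\n\nworker stack:\n%s", p.value, p.stack)
}

// runOnWorker (hypothetical) runs fn on another goroutine and returns a
// *workerPanic if fn panicked, or nil if it returned normally.
func runOnWorker(fn func()) error {
	done := make(chan error, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				// debug.Stack is called while the panicking frames are
				// still on this goroutine's stack, so they are recorded.
				done <- &workerPanic{value: r, stack: debug.Stack()}
				return
			}
			done <- nil
		}()
		fn()
	}()
	return <-done
}

// explode stands in for work that panics on the worker.
func explode() {
	panic("boom")
}

func main() {
	if err := runOnWorker(explode); err != nil {
		// A real caller might re-raise with panic(err); printing keeps the
		// example from crashing.
		fmt.Println(err)
	}
}
```

The stack has to be captured inside the deferred function on the worker; once the value crosses to the caller, the caller's own stack no longer contains the frames that panicked.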