Commit graph

5394 commits

Daniel Hiltgen
ae4a0b93df lint fixes 2026-05-10 14:10:09 -07:00
Daniel Hiltgen
bd61ddc192 misc discovery fixes and BOS handling 2026-05-10 14:08:57 -07:00
Daniel Hiltgen
993f80182a win: avoid spaces in compiler path 2026-05-09 09:57:58 -07:00
Daniel Hiltgen
2dd437158e win: fix arm64 cross-compile 2026-05-09 09:39:53 -07:00
Daniel Hiltgen
eb6dea8a74 win: fix build 2026-05-09 07:52:24 -07:00
Daniel Hiltgen
00defa8f79 Merge remote-tracking branch 'upstream/main' into llama-runner-phase-0
# Conflicts:
#	envconfig/config.go
#	integration/context_test.go
#	integration/model_arch_test.go
2026-05-08 16:00:29 -07:00
Daniel Hiltgen
c2f2d90a67
test: integration test hardening (#13532)
* test: integration test hardening

Improve reliability on slower systems and reduce flakes. Fix a few
logic flaws in the newer tests, plus general hardening.

* tighten up vision logging

* add new models

* remove some older models - still covered by library scenarios
2026-05-08 15:54:17 -07:00
Daniel Hiltgen
396eff5d84 win: favor ninja for faster developer builds 2026-05-08 15:19:41 -07:00
Daniel Hiltgen
4d142c0724 ci: improvements from #15982 2026-05-08 14:34:44 -07:00
Daniel Hiltgen
1e1b34dada
mlx: refined model push behavior (#15431)
* mlx: refined model push behavior

Refine the algorithm for parallel push of safetensors-based models to get
better reliability and throughput.

* review comments, hardening, and performance tuning for slow links

* review comments
2026-05-08 14:25:30 -07:00
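
The push refinement above is described only at a high level. As a minimal sketch of its general shape, bounded parallelism plus backoff for slow links, here is a hedged Go example; `Layer` and `pushLayer` are hypothetical stand-ins, not the actual ollama client code:

```go
package push

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

// Layer is a hypothetical stand-in for one blob of a safetensors-based model.
type Layer struct{ Digest string }

// pushLayer is a hypothetical upload function; the real client differs.
func pushLayer(ctx context.Context, l Layer) error { return nil }

// pushAll uploads layers with bounded parallelism and exponential backoff,
// the general shape of a parallel push tuned for slow links.
func pushAll(ctx context.Context, layers []Layer, parallel, retries int) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(parallel) // cap in-flight uploads so slow links aren't oversubscribed
	for _, l := range layers {
		g.Go(func() error {
			var err error
			for attempt := 0; attempt <= retries; attempt++ {
				if err = pushLayer(ctx, l); err == nil {
					return nil
				}
				select {
				case <-ctx.Done():
					return ctx.Err()
				case <-time.After(time.Second << attempt): // back off before retrying
				}
			}
			return fmt.Errorf("push %s: %w", l.Digest, err)
		})
	}
	return g.Wait()
}
```
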
Daniel Hiltgen
88935c21b5 scheduler improvements 2026-05-08 12:16:35 -07:00
Daniel Hiltgen
fa7092b1d3 win: arm64 cross-compile build
also DRY out CI steps
2026-05-08 10:38:14 -07:00
Daniel Hiltgen
5bf7da16c2 disable openmp 2026-05-07 19:23:28 -07:00
Daniel Hiltgen
db9c11707b win: fix dependency gathering 2026-05-07 15:59:48 -07:00
Daniel Hiltgen
6b93c29ca1 ci: fix windows dependencies 2026-05-07 11:32:52 -07:00
Parth Sareen
f866e7608f
launch: disable Claude Desktop launch (#16028) 2026-05-07 10:46:18 -07:00
Daniel Hiltgen
cb98d5966e ci: windows mlx tuning
Shorten the build long tail, and get OllamaSetup.exe back under the 2 GB limit
2026-05-07 07:35:22 -07:00
Daniel Hiltgen
0b45854537 ci: fix windows rocm build 2026-05-06 20:01:09 -07:00
Daniel Hiltgen
140fcf2bd0 ci: fix windows llama-server build 2026-05-06 19:38:05 -07:00
Daniel Hiltgen
9c925a2d29 ci: fix windows MLX build 2026-05-06 19:20:48 -07:00
Daniel Hiltgen
d68fd01fa6 refine implementation 2026-05-06 17:26:05 -07:00
jmorganca
868b63ff77 llama/compat: load Ollama-format GGUFs in llama-server
Squashed from upstream/jmorganca/llama-compat on 2026-04-29.
Source tip: 0c33775d37.

Original source commits:
- 25223160d llama/compat: add in-memory shim so llama-server can load Ollama-format GGUFs
- 7449b539a llm,server: route Ollama-format gemma3 blobs through llama/compat
- 436f2e2b1 llama/compat: make patch-apply idempotent
- 8c2c9d4c8 llama/compat: extend gemma3 handler to cover 1B and 270M blobs
- 021389f7b llama/compat: shrink clip.cpp injection from 18 lines to 1
- 61b367ec2 llama/compat: shrink patch to pure call-site hooks (34 -> 20 lines)
- 36049361c llama/compat: simplify shim (gemma3-tested)
- 8fa664865 llama/compat: add qwen35moe text handler
- db0c74530 llama/compat: add qwen35moe vision (clip) support
- 2a388da77 llama/compat: split shared infra into a util TU
- 9a69a17dc llama/compat: document non-public API dependencies
- d0f38a915 llama/compat: add gpt-oss and lfm2 handlers
- 086071822 llama/compat: add mistral3 text handler (vision TODO)
- 63bde9ff7 llama/compat: add mistral3 vision (clip) support
- 3a57b89d5 llama/compat: apply LLaMA RoPE permute to mistral3 vision Q/K
- 99cb87439 llama/compat: add qwen35, gemma4, deepseek-ocr handlers
- 2c7850dba llama/compat: add nemotron_h_moe handler (latent FFN + MTP skip)
- 9e3b54225 llama/compat: add llama4 text + clip handlers
- 034fee349 llama/compat: add gemma4 clip handler (gemma4v projector)
- 9945c5a93 server: remove dhiltgen/* compat redirect table
- 5d4539101 llama/compat: rewrite gemma4 tokenizer model to BPE
- 7e0765327 llama/compat: add glm-ocr text handler + text-loader load-op hook
- f1bd1a25a llama/compat: add glm-ocr clip handler (glm4v projector)
- 4b5cf3420 llama/compat: collapse text-loader hook back to one new patch line
- eb4ecf4fc llama/compat: extend gemma4 clip handler to gemma4a (audio)
- a23a5e76f llama/compat: fix gemma4a per-block norm tensor mapping
- cd2dcaff4 llama/compat: add embeddinggemma handler
- 1ce8a6b26 llama/compat: add qwen3-vl + qwen2.5-vl handlers
- fd98ffa1e llama/compat: add gemma3n + glm4moelite handlers
- cc7bdf0bc llama/compat: handle null buft in maybe_load_tensor
- 0c33775d3 llama/compat: disable mmap when load_op transforms text-side tensors
2026-05-06 17:26:05 -07:00
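
The squashed commits above grow the shim one architecture at a time. Conceptually it is a registry of per-architecture handlers that rewrite Ollama-format GGUF tensor names and metadata into the schema upstream llama.cpp expects at load time; a purely conceptual Go sketch of that registry shape follows (the real shim is C++ patched into llama-server, and every name here is illustrative):

```go
// Conceptual sketch only: the actual shim is C++ inside llama-server.
// A handler rewrites one architecture's Ollama-format tensor names to
// the upstream llama.cpp schema while the model loads.
type compatHandler struct {
	// renameTensor maps an Ollama-format tensor name to the upstream
	// name; ok=false means "pass the name through unchanged".
	renameTensor func(name string) (upstream string, ok bool)
	// transformsTensors reports whether text-side tensors are rewritten
	// during load, in which case mmap must be disabled (last commit above).
	transformsTensors bool
}

// one entry per supported architecture ("gemma3", "qwen35moe", ...)
var compatHandlers = map[string]compatHandler{}

func maybeRemap(arch, name string) string {
	if h, ok := compatHandlers[arch]; ok && h.renameTensor != nil {
		if upstream, ok := h.renameTensor(name); ok {
			return upstream
		}
	}
	return name
}
```
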
Daniel Hiltgen
31e336791a runner: Remove CGO engines, use llama-server exclusively for GGML models
Remove the vendored GGML and llama.cpp backend, the CGO runner, the Go model
implementations, and the sample package.  llama-server (built from upstream
llama.cpp via FetchContent) is now the sole inference engine for GGUF-based
models.  (Safetensors-based models continue to run on the new MLX engine.)
This allows us to pick up new capabilities and fixes from llama.cpp more
rapidly as they come out.

On Windows this now requires recent AMD driver versions that support ROCm v7,
as llama.cpp currently does not support building against v6.
2026-05-06 17:26:05 -07:00
Daniel Hiltgen
94c40c9ad1 broad lint fixes to sidestep CI scope glitch 2026-05-06 17:26:05 -07:00
Parth Sareen
bab59072fb
launch: add plan-aware model gating (#16027) 2026-05-06 14:34:26 -07:00
Eva H
7c2c36bda2
cmd/launch: improve integration backup UX (#15907) 2026-05-06 11:32:54 -04:00
Parth Sareen
d319227df0
server: cache show responses (#15967) 2026-05-05 14:40:18 -07:00
Daniel Hiltgen
2d84ec939c
mlx: partial cleanup of imagegen layout (#15435)
* mlx: partial cleanup of imagegen layout

This moves part of the imagegen safetensors code to the new package.

* test: remove flaky timing test
2026-05-05 14:15:30 -07:00
Patrick Devine
15e6076d79
mlx: Gemma4 MTP speculative decoding (#15980)
This change adds support for MTP (multi-token prediction) speculative decoding for the
gemma4 model family.

It includes:
  * support for importing safetensors-based gemma4 draft models with `ollama create`
  * a new DRAFT command in the Modelfile for specifying draft models
  * a `--quantize-draft` flag for `ollama create` to quantize the draft model
  * cache support for speculation
  * changes to the rotating cache to be able to handle MTP correctly
  * sampling support for draft model token prediction

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2026-05-05 08:55:04 -07:00
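
The entry above names a new DRAFT Modelfile command and a `--quantize-draft` flag but shows no usage. A hedged example of how the pieces would plausibly fit together; the exact DRAFT argument syntax and quantization value format are assumptions:

```
# Hypothetical Modelfile (DRAFT argument syntax is assumed)
FROM ./gemma4-27b        # safetensors-based target model
DRAFT ./gemma4-draft     # multi-token-prediction draft model

# Quantize the draft model at create time (value format assumed):
#   ollama create my-gemma4 -f Modelfile --quantize-draft q8_0
```
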
Parth Sareen
4017af96cd
go: bump to 1.26 (#15904) 2026-05-03 23:24:35 -07:00
Daniel Hiltgen
534342e7e2
Update MLX and MLX-C with threading fixes (#15845)
* Update MLX and MLX-C

* Run MLX CGO work on a locked OS thread

MLX now relies on OS-thread-local execution state for streams, encoders, and caches. Add an mlxthread executor backed by runtime.LockOSThread and route runner initialization, model load, inference, status memory reads, and cleanup through the worker so Go goroutine migration cannot split MLX state across native threads.

Also stop caching default MLX streams before the runner owns the thread and add worker/threaded MLX regression tests.

* mlx: use common status writer

* mlx: bundle missing libjaccl on arm64

Inspired by #15793

* review comments
2026-05-03 10:03:14 -07:00
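
The mlxthread executor described above is a standard CGO pattern: one goroutine pins itself to an OS thread and drains a channel of closures, so every MLX call observes the same thread-local streams, encoders, and caches. A minimal sketch, assuming nothing about the real package beyond what the message states:

```go
package mlxthread

import "runtime"

// Executor runs submitted functions on a single locked OS thread so
// MLX's OS-thread-local state cannot be split by goroutine migration.
type Executor struct {
	work chan func()
}

func New() *Executor {
	e := &Executor{work: make(chan func())}
	go func() {
		// Pin this goroutine to its OS thread for the worker's lifetime.
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		for f := range e.work {
			f()
		}
	}()
	return e
}

// Do runs f on the worker thread and blocks until it returns.
func (e *Executor) Do(f func()) {
	done := make(chan struct{})
	e.work <- func() { f(); close(done) }
	<-done
}

// Close stops the worker once queued work drains.
func (e *Executor) Close() { close(e.work) }
```
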
Parth Sareen
9ba5a04914
launch: claude app (#15937) 2026-05-02 19:19:57 -07:00
Bruce MacDonald
938ca6e274
app: source featured models from experimental recommendations endpoint (#15909)
Replace the hardcoded FEATURED_MODELS list with the
/api/experimental/model-recommendations endpoint so the picker stays in
sync with server-driven recommendations. Inline the merge into useModels
(recommendations first, then the rest of /api/tags) and drop the
standalone mergeModels util.
2026-05-01 11:10:20 -07:00
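
The inlined merge is simple to state: recommendations keep their order, then anything from /api/tags not already present is appended. A hedged sketch of that dedupe in Go (the real code is TypeScript inside the app's useModels hook; names here are illustrative):

```go
// mergeModels keeps recommended models first, then appends the rest of
// /api/tags, deduplicating by name. Illustrative only.
func mergeModels(recommended, tags []string) []string {
	seen := make(map[string]bool, len(recommended))
	out := make([]string, 0, len(recommended)+len(tags))
	for _, list := range [][]string{recommended, tags} {
		for _, name := range list {
			if !seen[name] {
				seen[name] = true
				out = append(out, name)
			}
		}
	}
	return out
}
```
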
Pratham Agarwal
8f39fff70b
fix: resolve OpenClaw gateway launch timeout on Windows by enforcing IPv4 loopback (#15726) 2026-04-30 22:20:08 -04:00
Daniel Hiltgen
4fe5609563
metal: harden for ggml initialization failures (#15755)
* metal: harden for ggml initialization failures

ggml_metal_device_init performs a probe to verify the tensor API compiles.  On
some systems this passes, even though kernel coverage isn't complete, which
results in a later crash when compiling the real kernels.  This change adds a
single retry that disables the tensor API when any of the error strings match
this failure mode.  It also hardens an error case in the Go initDevices to detect
device initialization failures and panic instead of crashing later on a nil
array entry.

Fixes #15734

* review comments

* review comments
2026-04-30 16:28:03 -07:00
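
The hardening above has two halves: a string-matched retry with the tensor API disabled, and failing fast in initDevices instead of dereferencing a nil device later. A hedged Go sketch of both ideas; every name below is an assumption, not ollama's actual API:

```go
package metal

import (
	"fmt"
	"strings"
)

type Device struct{} // illustrative stand-in

// ggmlMetalDeviceInit stands in for the CGO probe described above.
func ggmlMetalDeviceInit(tensorAPI bool) (*Device, error) { return &Device{}, nil }

// error substrings indicating the probe passed but compiling the real
// kernels would crash
var tensorAPIFailureModes = []string{ /* known failure strings */ }

func initMetalDevice() (*Device, error) {
	dev, err := ggmlMetalDeviceInit(true) // first try with the tensor API
	if err != nil {
		for _, s := range tensorAPIFailureModes {
			if strings.Contains(err.Error(), s) {
				return ggmlMetalDeviceInit(false) // single retry, tensor API off
			}
		}
	}
	return dev, err
}

func initDevices(n int) []*Device {
	devs := make([]*Device, n)
	for i := range devs {
		d, err := initMetalDevice()
		if err != nil || d == nil {
			// panic here rather than crash later on a nil array entry
			panic(fmt.Sprintf("metal device %d failed to initialize: %v", i, err))
		}
		devs[i] = d
	}
	return devs
}
```
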
Bruce MacDonald
917324bb4d
app: remove ollama update url env var used for testing (#15905) 2026-04-30 13:14:08 -07:00
Parth Sareen
c7c2837c96
renderers: update gemma4 renderer (#15886) 2026-04-29 18:40:23 -07:00
Parth Sareen
b6447caebc
launch: use vram bytes for model recommendations (#15885) 2026-04-29 18:40:14 -07:00
Eva H
bad32c7244
launch/docs: fix title for pool (#15883) 2026-04-29 17:18:44 -04:00
Eva H
ab2e005bf7
app: align the app launch page with ollama launch (#15753) 2026-04-29 14:45:19 -04:00
Parth Sareen
321cc8a2ba
server/launch: add model recommendations cache endpoint (#15868) 2026-04-28 17:09:04 -07:00
Daniel Hiltgen
87288ced4f
New models (#15861)
* mlx: add laguna model support

* convert: support fp8 safetensors import

Decode HF F8_E4M3 safetensors with block scale companions into GGUF-supported tensor types, and record which output tensors came from FP8 source weights.

Use that source-precision metadata during create quantization: default FP8-sourced GGUFs to Q8_0, keep non-FP8 tensors at their original precision for Q8_0, and promote non-FP8 quantizable tensors to Q8_0 for Q4_K requests.

* ggml: add laguna model support

* server: preserve generate logprobs with builtin parsers

Generate requests were dropping logprob-only chunks whenever a builtin parser buffered visible content. Chat already handled this case, but generate only forwarded chunks with visible response, thinking, or tool-call output.

Keep generate chunks that carry logprobs even when the builtin parser has not flushed visible content yet, and add a regression test that exercises the behavior with a generic thinking parser.

* review comments - perf improvements

* ggml: implement nemotron 3 nano omni

* add poolside integration

* update poolside doc

* adapt to new cache setup

* fix test

* fix test

---------

Co-authored-by: Eva Ho <hoyyeva@gmail.com>
2026-04-28 11:50:12 -07:00
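
F8_E4M3, referenced in the fp8 import work above, is a one-byte float: 1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, and, in the E4M3FN convention HF safetensors use, no infinities and NaN only at 0x7F/0xFF. A self-contained sketch of decoding one byte plus a block scale, the innermost step of such an import path (the real converter handles whole tensors and the scale-companion layout):

```go
package main

import (
	"fmt"
	"math"
)

// decodeE4M3FN converts one F8_E4M3 byte to float32.
func decodeE4M3FN(b uint8) float32 {
	sign := float64(1)
	if b&0x80 != 0 {
		sign = -1
	}
	exp := int(b>>3) & 0xF
	man := float64(b & 0x7)
	switch {
	case exp == 0xF && man == 7: // the only NaN encoding; no infinities
		return float32(math.NaN())
	case exp == 0: // subnormal: no implicit leading 1, exponent 1-bias
		return float32(sign * (man / 8) * math.Exp2(-6))
	default:
		return float32(sign * (1 + man/8) * math.Exp2(float64(exp-7)))
	}
}

func main() {
	// 0x40: sign=0, exp=8, mantissa=0 -> 2.0; block scale 0.5 -> 1.0
	raw, scale := uint8(0x40), float32(0.5)
	fmt.Println(decodeE4M3FN(raw) * scale)
}
```
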
Jesse Gross
2bbe2405fe mlxrunner: decouple models from attention cache storage layout
Models build their own attention masks and read K/V directly from
the cache's buffers, which ties them to the cache's storage layout.
That blocks multi-sequence batching — right-padded rows need a
query-padding mask composed onto every model — and rules out
variants like paged attention where K/V isn't one contiguous tensor.

Caches now hand back a per-layer KVHistory holding post-update K, V,
and a MaskApplier that merges the cache's storage restrictions into
the model's logical mask. Models describe their mask in logical
terms; SDPA composes model, padding, and applier contributions and
dispatches to the kernel's causal or no-mask fast path when it can.
KVHistory still exposes K, V, and the composed mask for manual
attention paths (e.g. CUDA prefill at head_dim > 128).

Performance for single-sequence inference is unchanged.
2026-04-27 20:04:46 -07:00
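
Stated as types, the new contract above is small. A hedged Go sketch of the shape the message implies; KVHistory and MaskApplier are named in the commit, everything else here is an assumption (Array stands in for an MLX tensor handle):

```go
type Array struct{} // stand-in for an MLX tensor handle

// MaskApplier merges the cache's storage restrictions (padding, layout)
// into the model's logical attention mask.
type MaskApplier interface {
	Apply(logicalMask *Array) *Array
}

// KVHistory is what a cache hands back per layer after an update:
// post-update K and V plus the applier for composing the final mask.
// Manual attention paths read K, V, and the composed mask directly.
type KVHistory struct {
	K, V *Array
	Mask MaskApplier
}
```
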
Jesse Gross
bd21678b16 mlxrunner: apply RoPE at per-row positions
Switch RoPE from the scalar-offset kernel (mlx_fast_rope) to the
array-offset one (mlx_fast_rope_dynamic) so each batch row can start
at its own position. The pipeline tracks the current position locally
and passes it to the model through Batch.SeqOffsets; each model
materializes that slice into an int32 array for the RoPE call.

Single-sequence behavior is unchanged; this is the wiring needed
before the runner can batch independent sequences.
2026-04-27 20:04:46 -07:00
Jesse Gross
088dfd89a8 mlxrunner: wrap model forward inputs in a Batch struct
Gives a single extension point for per-call context (positions,
sequence IDs, masks) as multi-sequence batching grows, without having
to churn every model's Forward signature again.
2026-04-27 20:04:46 -07:00
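
Together with the RoPE change above, the Batch wrapper gives every row its own starting position without touching each model again later. A minimal sketch; SeqOffsets is named in these commits, the other names are assumptions:

```go
type Array struct{} // stand-in for an MLX tensor handle
type Model struct{}

// Batch carries per-call context for a Forward pass, so positions,
// sequence IDs, and masks can be added without churning signatures.
type Batch struct {
	Tokens     []int32 // assumed field: input token IDs for this step
	SeqOffsets []int32 // per-row starting positions, consumed by RoPE
}

// Forward takes the whole Batch rather than individual arguments.
func (m *Model) Forward(b Batch) *Array {
	// A model materializes b.SeqOffsets into an int32 MLX array and
	// passes it to the array-offset kernel (mlx_fast_rope_dynamic).
	_ = b.SeqOffsets
	return &Array{}
}
```
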
Eva H
3cab8a7b02
app/server: fix desktop app startup killing active ollama launch sessions (#15657) 2026-04-27 22:52:53 -04:00
Daniel Hiltgen
03aee88186
mlx: Support NVIDIA TensorRT Model Optimizer import (#15566)
* mlx: Support NVIDIA TensorRT Model Optimizer import

* x/create: support FP8 safetensors import

Decode HF F8_E4M3 safetensors with block scale companions into MLX-importable tensor blobs, including compressed-tensors weight_scale metadata, packed NVFP4 layouts, and mixed-precision tensor headers.

Use that source-precision metadata during create quantization: default FP8-sourced imports to mxfp8, allow source FP8 to target MLX low-bit formats, preserve source-quantized NVFP4 layouts, selectively keep or promote tensors based on their source precision, and detect quantized dtype from mixed-precision safetensors manifests.

* review comments
2026-04-27 18:28:10 -07:00
Daniel Hiltgen
ec9b4e9e47
tokenizer: fix multi-regex BPE offset handling (#15844)
Use the current fragment offset when emitting unmatched spans during multi-regex BPE splitting. This avoids duplicating earlier prompt text and inflating token counts for multi-stage BPE tokenizers.
2026-04-27 14:14:27 -07:00
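
The fix above is pure offset arithmetic: spans a later regex stage leaves unmatched must be emitted relative to the fragment being split, not to the start of the prompt. A minimal illustrative sketch of the corrected bookkeeping (not the actual tokenizer code):

```go
import "regexp"

// fragment is a span of the prompt awaiting further splitting.
type fragment struct {
	offset int // absolute start of this span within the prompt
	text   string
}

// splitStage applies one regex stage to a fragment. Gaps between
// matches are emitted at frag.offset+prev; using prev alone (an offset
// into the fragment, not the prompt, as in the bug) would re-emit
// earlier prompt text and inflate token counts.
func splitStage(frag fragment, re *regexp.Regexp) []fragment {
	var out []fragment
	prev := 0
	for _, m := range re.FindAllStringIndex(frag.text, -1) {
		if m[0] > prev { // unmatched span before this match
			out = append(out, fragment{frag.offset + prev, frag.text[prev:m[0]]})
		}
		out = append(out, fragment{frag.offset + m[0], frag.text[m[0]:m[1]]})
		prev = m[1]
	}
	if prev < len(frag.text) { // trailing unmatched span
		out = append(out, fragment{frag.offset + prev, frag.text[prev:]})
	}
	return out
}
```
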
Jesse Gross
4656a07e56 mlxrunner: batch the sampler across multiple sequences
Register sequences with Add/Remove; each Sample call takes any subset of
registered slots and samples one token per row, appending to each slot's
ring-buffer history. When all slots share Options and penalty rings are
full, one fused transform pass runs over the whole batch via a persistent
pooled history tensor; otherwise calls fall back to per-slot serial
processing indexed against the same pool.

Performance is unchanged for a single sequence, which is all that is
exposed for now.
2026-04-25 09:53:53 -07:00
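
The key dispatch decision above, fused batch pass versus per-slot fallback, reduces to one guard. A hedged Go sketch of that shape; all names are assumptions beyond the Sample/Add/Remove behavior the message describes:

```go
// Illustrative skeleton of the batched sampler's dispatch.
type Sampler struct{ /* per-slot Options and penalty rings */ }

func (s *Sampler) sameOptions(slots []int) bool    { return true } // stub
func (s *Sampler) ringsFull(slots []int) bool      { return true } // stub
func (s *Sampler) sampleFused(slots []int) []int32 { return make([]int32, len(slots)) }
func (s *Sampler) sampleOne(slot int) int32        { return 0 }

// Sample draws one token per requested slot. The fused transform pass
// over the pooled history runs only when every slot shares Options and
// its penalty ring is full; otherwise each slot is sampled serially.
func (s *Sampler) Sample(slots []int) []int32 {
	if s.sameOptions(slots) && s.ringsFull(slots) {
		return s.sampleFused(slots)
	}
	out := make([]int32, len(slots))
	for i, slot := range slots {
		out[i] = s.sampleOne(slot)
	}
	return out
}
```
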
Jesse Gross
30f86cb9dd mlxrunner: track sampler history in a fixed-size ring buffer
AppendToken used to concatenate the new token onto the history tensor
and slice it back to RepeatLastN every decode step, churning the graph
shape and reallocating a fresh tensor each call. The stateful penalties
don't care about order within the window, so a fixed-capacity ring with
one SliceUpdate per append keeps the tensor shape constant across
steps.
2026-04-25 09:53:53 -07:00
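
Because the penalties are order-insensitive within the window, AppendToken only needs a ring write. A minimal sketch of the bookkeeping; the real version performs one SliceUpdate on a persistent MLX tensor rather than writing a Go slice:

```go
// tokenRing is a fixed-capacity history: appends overwrite the oldest
// slot, so the tensor shape stays constant across decode steps.
type tokenRing struct {
	buf  []int32 // capacity == RepeatLastN
	next int     // slot the next append overwrites
	n    int     // valid entries, up to len(buf)
}

func newTokenRing(repeatLastN int) *tokenRing {
	return &tokenRing{buf: make([]int32, repeatLastN)}
}

// Append writes one token; this is the slice analogue of the single
// SliceUpdate per decode step described above.
func (r *tokenRing) Append(tok int32) {
	r.buf[r.next] = tok
	r.next = (r.next + 1) % len(r.buf)
	if r.n < len(r.buf) {
		r.n++
	}
}
```
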