llama/compat: add in-memory shim so llama-server can load Ollama-format GGUFs

Older Ollama builds ship GGUFs that diverge slightly from upstream llama.cpp
in arch names, KV keys, tensor names, and (for vision models) file layout
(text+vision in one monolithic file). This adds a self-contained compat
layer that translates those files in memory at load time, so
~/.ollama/models/blobs/* can be served by upstream llama-server with no
re-conversion and no re-download.

Structure:
  llama/compat/
    llama-ollama-compat.{h,cpp}   — the shim (Ollama-owned, ~500 LOC)
    upstream-edits.patch          — ~48 lines of call-site hooks in 6 upstream files
    compat.cmake                  — include()-able CMake fragment
    README.md                     — what/why/how-to-regen

Integration: llama/server/CMakeLists.txt includes compat.cmake and passes
OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND to FetchContent_Declare via
PATCH_COMMAND. When OLLAMA_LLAMA_CPP_SOURCE is set (dev mode), the patch is
skipped so the developer's tree stays untouched.

Currently handles gemma3 (text + vision). Pattern is data-driven — adding
other archs is a new handle_<arch>() + one dispatch line. See README for
the per-arch checklist.

Verified end-to-end: `llama-server --model BLOB --mmproj BLOB` with an
Ollama gemma3:latest blob answers both text prompts ("Paris") and vision
prompts (correct image descriptions).
jmorganca 2026-04-18 23:14:38 -07:00
parent 56c735d871
commit 25223160d8
6 changed files with 823 additions and 0 deletions

llama/compat/README.md (new file, +80)

@@ -0,0 +1,80 @@
# llama.cpp compatibility shim
This directory holds an in-process compatibility layer that lets upstream
`llama-server` load GGUFs produced by older versions of Ollama (and files
pulled from the Ollama registry) without re-converting or re-downloading.
The layer is applied automatically at build time via CMake `FetchContent`'s
`PATCH_COMMAND` — there is no separate "apply patches" step.
## Files
- `llama-ollama-compat.h`, `llama-ollama-compat.cpp` — the shim itself. These
are regular source files owned by Ollama; they get copied into the fetched
llama.cpp source tree during configure.
- `upstream-edits.patch` — small additive edits to upstream files so the
shim gets called. Currently ~48 lines touching 6 files. Kept as a real
`git` patch so re-generation on upstream bumps is one command.
## What the shim does
The shim runs at two well-defined points in the loader:
1. **After `gguf_init_from_file`**, for both the main model loader and the
`mtmd/clip` loader: inspects the just-parsed metadata and decides whether
the file is an Ollama-format GGUF. If so, it mutates the in-memory
`gguf_context` and `ggml_context` (KV names, tensor names, tensor types)
so the rest of the loader sees an upstream-shape file.
2. **After `load_all_data`**: applies any numerical fix-ups that need the
tensors in their final backend buffers (e.g. RMSNorm `+1` if a future
arch needs it — gemma3 doesn't).
Ollama files are recognized by markers: Ollama-specific KV keys (e.g.
`gemma3.mm.tokens_per_image`) or embedded `v.*` / `mm.*` tensors in the
main model file. When no markers are present, every compat function is an
immediate no-op.
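As a minimal sketch of that marker check (assuming only ggml's public
`gguf.h` API; the real logic is `detect_ollama_gemma3` in
`llama-ollama-compat.cpp`):
```
#include "gguf.h"

// Sketch: treat a file as Ollama-format gemma3 when an Ollama-only KV is
// present, or when a KV that upstream always writes is missing.
static bool looks_like_ollama_gemma3(const gguf_context * meta) {
    return gguf_find_key(meta, "gemma3.mm.tokens_per_image") >= 0 ||
           gguf_find_key(meta, "gemma3.attention.layer_norm_rms_epsilon") < 0;
}
```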
## Currently supported architectures
| Arch | Text loader | Clip (mmproj) loader |
|---|---|---|
| `gemma3` | KV injection (`layer_norm_rms_epsilon`, `rope.freq_base`, `rope.freq_base_swa`), tokenizer vocab truncation, drop `v.*`/`mm.*` tensors | Arch rewrite to `clip`, KV synthesis (`clip.vision.*`, `clip.projector_type=gemma3`), tensor renames (`v.patch_embedding` → `v.patch_embd`, `mlp.fc{1,2}` → `ffn_{down,up}`, etc.), F16→F32 promotion for patch/position embeddings (Metal IM2COL requirement) |
## Usage
```
llama-server --model /path/to/ollama-blob --mmproj /path/to/ollama-blob
```
Passing the same monolithic GGUF as both `--model` and `--mmproj` works —
each loader applies its own translation.
Additional architectures are added by implementing a `handle_<arch>()`
and (for vision models) `handle_<arch>_clip()` in `llama-ollama-compat.cpp`
and dispatching them from `translate_metadata` / `translate_clip_metadata`.
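A hypothetical example of that shape (the arch name `foo` and its KV key are
placeholders, not a supported arch; the helpers are the ones defined in
`llama-ollama-compat.cpp`):
```
// Hypothetical: wire up an imaginary arch "foo" the same way gemma3 is wired.
void handle_foo(const llama_model_loader * ml, gguf_context * meta, ggml_context * ctx) {
    if (!has_key(meta, "foo.some.ollama_only_key")) return; // not an Ollama blob
    set_f32_if_missing(meta, "foo.attention.layer_norm_rms_epsilon", 1e-5f);
    add_skip_prefix(ml, "v.");   // drop embedded vision tensors, if any
}

// ...plus one dispatch line in translate_metadata():
//     if (arch_name == "foo") handle_foo(ml, meta, ctx);
```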
## Regenerating `upstream-edits.patch`
After upstream changes the insertion points (rare), re-apply the edits to
a fresh checkout and run:
```
cd /path/to/llama.cpp
git diff -- \
    ggml/include/gguf.h \
    ggml/src/gguf.cpp \
    src/CMakeLists.txt \
    src/llama-model-loader.cpp \
    src/llama-model.cpp \
    tools/mtmd/clip.cpp \
    > /path/to/ollama/llama/compat/upstream-edits.patch
```
## Why not fork llama.cpp or vendor it?
Forking means tracking upstream manually. Vendoring means snapshotting all of
llama.cpp's source in the Ollama tree (the old `llama/llama.cpp/` layout).
This shim keeps upstream unmodified on disk and the Ollama-specific logic
isolated in two files plus a small diff — upstream bumps are usually just
`LLAMA_CPP_VERSION` changes.

llama/compat/compat.cmake (new file, +50)

@@ -0,0 +1,50 @@
# llama.cpp compatibility shim CMake integration
#
# Include this file BEFORE calling FetchContent_Declare(llama_cpp ...) to
# patch the fetched upstream llama.cpp with Ollama's in-process compat
# layer. Example usage:
#
#   include(${CMAKE_CURRENT_SOURCE_DIR}/../compat/compat.cmake)
#
#   FetchContent_Declare(
#       llama_cpp
#       GIT_REPOSITORY ...
#       GIT_TAG        ${LLAMA_CPP_GIT_TAG}
#       GIT_SHALLOW    TRUE
#       PATCH_COMMAND  ${OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND}
#       UPDATE_DISCONNECTED TRUE
#   )
#
# The compat layer consists of:
#   1. Two new source files dropped into the fetched tree's src/
#      (llama-ollama-compat.{h,cpp}); these are Ollama-owned.
#   2. A small patch (upstream-edits.patch) that wires the new files into
#      the build and adds call-sites in upstream loaders.
set(_compat_dir ${CMAKE_CURRENT_LIST_DIR})
# Expose a single variable the main CMakeLists passes into FetchContent's
# PATCH_COMMAND. Uses CMake's own `-E copy` so it's cross-platform; uses
# `git apply` because the patch is in unified git-diff format (same as what
# `git diff` produces; regeneration is one command, see README).
set(OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND
    ${CMAKE_COMMAND} -E copy
        "${_compat_dir}/llama-ollama-compat.h"
        "src/llama-ollama-compat.h"
    COMMAND ${CMAKE_COMMAND} -E copy
        "${_compat_dir}/llama-ollama-compat.cpp"
        "src/llama-ollama-compat.cpp"
    COMMAND git apply --whitespace=nowarn
        "${_compat_dir}/upstream-edits.patch"
    CACHE INTERNAL "llama.cpp compat patch command for FetchContent")
# Also export the individual paths in case callers want to do something
# custom (e.g. emit a dependency on the patch so reconfigures re-apply).
set(OLLAMA_LLAMA_CPP_COMPAT_PATCH_FILE
    "${_compat_dir}/upstream-edits.patch"
    CACHE INTERNAL "Path to the llama.cpp compat patch")
set(OLLAMA_LLAMA_CPP_COMPAT_SOURCES
    "${_compat_dir}/llama-ollama-compat.h"
    "${_compat_dir}/llama-ollama-compat.cpp"
    CACHE INTERNAL "Source files copied into llama.cpp's src/ dir")

llama/compat/llama-ollama-compat.cpp (vendored, new file, +439)

@@ -0,0 +1,439 @@
#include "llama-ollama-compat.h"
#include "ggml.h"
#include "ggml-backend.h"
#include "gguf.h"
#include "llama-impl.h"
#include "llama-model-loader.h"
#include <cstdint>
#include <cstring>
#include <cstdio>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
namespace llama_ollama_compat {
namespace {
// ---- helpers -------------------------------------------------------------
bool has_key(const gguf_context * meta, const char * key) {
    return gguf_find_key(meta, key) >= 0;
}

void set_f32_if_missing(gguf_context * meta, const char * key, float value) {
    if (!has_key(meta, key)) {
        gguf_set_val_f32(meta, key, value);
    }
}

bool any_tensor_with_prefix(const ggml_context * ctx, const char * prefix) {
    const size_t plen = std::strlen(prefix);
    for (ggml_tensor * t = ggml_get_first_tensor(ctx); t; t = ggml_get_next_tensor(ctx, t)) {
        if (std::strncmp(ggml_get_name(t), prefix, plen) == 0) {
            return true;
        }
    }
    return false;
}

const ggml_tensor * find_tensor(const ggml_context * ctx, const char * name) {
    for (ggml_tensor * t = ggml_get_first_tensor(ctx); t; t = ggml_get_next_tensor(ctx, t)) {
        if (std::strcmp(ggml_get_name(t), name) == 0) return t;
    }
    return nullptr;
}

// Truncate a string-typed KV array to `new_n` entries. No-op if absent or
// already that size or smaller.
void truncate_str_arr(gguf_context * meta, const char * key, size_t new_n) {
    const int64_t kid = gguf_find_key(meta, key);
    if (kid < 0) return;
    const size_t cur_n = gguf_get_arr_n(meta, kid);
    if (new_n >= cur_n) return;
    std::vector<std::string> owned;
    owned.reserve(new_n);
    std::vector<const char *> ptrs;
    ptrs.reserve(new_n);
    for (size_t i = 0; i < new_n; ++i) {
        owned.emplace_back(gguf_get_arr_str(meta, kid, i));
    }
    for (const auto & s : owned) ptrs.push_back(s.c_str());
    gguf_set_arr_str(meta, key, ptrs.data(), new_n);
}

// Truncate a primitive-typed KV array to `new_n` entries.
void truncate_data_arr(gguf_context * meta, const char * key, gguf_type elem_type, size_t elem_size, size_t new_n) {
    const int64_t kid = gguf_find_key(meta, key);
    if (kid < 0) return;
    const size_t cur_n = gguf_get_arr_n(meta, kid);
    if (new_n >= cur_n) return;
    const void * data = gguf_get_arr_data(meta, kid);
    std::vector<uint8_t> copy(elem_size * new_n);
    std::memcpy(copy.data(), data, elem_size * new_n);
    gguf_set_arr_data(meta, key, elem_type, copy.data(), new_n);
}

// ---- per-loader state (skip lists + tensor transforms) -------------------

struct TransformSpec {
    std::function<bool(const std::string &)> matches;
    std::function<void(void *, size_t, ggml_type)> apply;
    const char * description;
};

struct LoaderState {
    std::vector<TransformSpec> transforms;
    std::vector<std::string> skip_prefixes;
};

std::mutex g_registry_mutex;
std::unordered_map<const llama_model_loader *, LoaderState> g_registry;

void add_skip_prefix(const llama_model_loader * ml, std::string prefix) {
    std::lock_guard<std::mutex> lk(g_registry_mutex);
    g_registry[ml].skip_prefixes.push_back(std::move(prefix));
}

// ---- gemma3 --------------------------------------------------------------

// Returns true if this looks like an Ollama-format gemma3 blob.
bool detect_ollama_gemma3(const gguf_context * meta, const ggml_context * ctx) {
    // Primary marker: Ollama writes the "mm.tokens_per_image" KV for
    // vision-capable gemma3 blobs. Upstream never uses this key.
    if (has_key(meta, "gemma3.mm.tokens_per_image")) return true;
    // Secondary marker: embedded vision tensors in the main file. Upstream
    // llama.cpp stores vision in a separate mmproj file.
    if (any_tensor_with_prefix(ctx, "v.") ||
        any_tensor_with_prefix(ctx, "mm.")) return true;
    // Tertiary marker: required KV missing. Upstream always writes
    // layer_norm_rms_epsilon; Ollama's old converter did not.
    if (!has_key(meta, "gemma3.attention.layer_norm_rms_epsilon")) return true;
    return false;
}

void handle_gemma3(const llama_model_loader * ml, gguf_context * meta, ggml_context * ctx) {
    if (!detect_ollama_gemma3(meta, ctx)) return;
    LLAMA_LOG_INFO("%s: detected Ollama-format gemma3 GGUF; applying compatibility fixes\n", __func__);

    // 1. Inject required KVs that Ollama's old converter omitted. Defaults
    //    are the gemma3 standard values; only injected if missing, so explicit
    //    values in a file take precedence.
    set_f32_if_missing(meta, "gemma3.attention.layer_norm_rms_epsilon", 1e-6f);
    set_f32_if_missing(meta, "gemma3.rope.freq_base", 1000000.0f);
    set_f32_if_missing(meta, "gemma3.rope.freq_base_swa", 10000.0f);

    // 2. Tokenizer vocab size vs. embedding dim mismatch. Ollama's old
    //    converter leaves special/multimodal tokens (e.g. <image_soft_token>)
    //    in the tokenizer arrays even though the embedding matrix doesn't
    //    cover them. Truncate the tokenizer to match the embedding rows.
    if (const ggml_tensor * tok = find_tensor(ctx, "token_embd.weight")) {
        const size_t embd_rows = tok->ne[1]; // shape is [n_embd, n_vocab]
        truncate_str_arr (meta, "tokenizer.ggml.tokens", embd_rows);
        truncate_data_arr(meta, "tokenizer.ggml.scores",     GGUF_TYPE_FLOAT32, sizeof(float),   embd_rows);
        truncate_data_arr(meta, "tokenizer.ggml.token_type", GGUF_TYPE_INT32,   sizeof(int32_t), embd_rows);
    }

    // 3. Drop embedded vision/projector tensors from the text loader.
    //    Ollama's Go wrapper extracts them to a sidecar mmproj file before
    //    passing --mmproj to llama-server.
    add_skip_prefix(ml, "v.");
    add_skip_prefix(ml, "mm.");

    // Note: no RMSNorm weight shift is required. Ollama's published gemma3
    // blobs already have the +1 shift baked in at conversion time — same as
    // upstream llama.cpp's convert_hf_to_gguf.py.
}

} // anonymous namespace

void translate_metadata(const llama_model_loader * ml,
                        gguf_context * meta,
                        ggml_context * ctx,
                        std::string & arch_name) {
    if (!meta) return;
    // Dispatch. Add more arches as they are wired up.
    if (arch_name == "gemma3") {
        handle_gemma3(ml, meta, ctx);
    }
}
// -------------------------------------------------------------------------
// Clip-side (mmproj) translation
// -------------------------------------------------------------------------
namespace {

// Rename a tensor in BOTH the gguf_context and the ggml_context so that all
// name-based lookups — offset map, ggml_get_tensor, tensor.name — agree.
void rename_tensor(gguf_context * meta, ggml_context * ctx,
                   const char * old_name, const char * new_name) {
    if (!gguf_rename_tensor(meta, old_name, new_name)) return;
    if (ggml_tensor * t = ggml_get_tensor(ctx, old_name)) {
        ggml_set_name(t, new_name);
    }
}

// Rename every tensor whose name contains `needle` by replacing that
// substring with `replacement`. Applies to both `.weight` and `.bias`.
void rename_tensors_containing(gguf_context * meta, ggml_context * ctx,
                               const char * needle, const char * replacement) {
    // Collect names first — renaming while iterating would shift indices.
    std::vector<std::string> renames; // flattened (old, new) pairs
    const int64_t n = gguf_get_n_tensors(meta);
    for (int64_t i = 0; i < n; ++i) {
        const char * name = gguf_get_tensor_name(meta, i);
        std::string s(name);
        size_t pos = s.find(needle);
        if (pos == std::string::npos) continue;
        std::string new_s = s;
        new_s.replace(pos, std::strlen(needle), replacement);
        renames.push_back(s);
        renames.push_back(std::move(new_s));
    }
    for (size_t i = 0; i + 1 < renames.size(); i += 2) {
        rename_tensor(meta, ctx, renames[i].c_str(), renames[i + 1].c_str());
    }
}

// Copy a KV from src_key to dst_key if src_key exists and dst_key doesn't.
template <typename Getter, typename Setter>
bool copy_kv(gguf_context * meta, const char * src_key, const char * dst_key,
             Getter get, Setter set) {
    if (has_key(meta, dst_key)) return true; // already set, keep explicit values
    const int64_t kid = gguf_find_key(meta, src_key);
    if (kid < 0) return false;
    set(meta, dst_key, get(meta, kid));
    return true;
}

void copy_u32_kv(gguf_context * meta, const char * src_key, const char * dst_key) {
    copy_kv(meta, src_key, dst_key,
            gguf_get_val_u32,
            [](gguf_context * m, const char * k, uint32_t v){ gguf_set_val_u32(m, k, v); });
}

void copy_f32_kv(gguf_context * meta, const char * src_key, const char * dst_key) {
    copy_kv(meta, src_key, dst_key,
            gguf_get_val_f32,
            [](gguf_context * m, const char * k, float v){ gguf_set_val_f32(m, k, v); });
}

void set_str(gguf_context * meta, const char * key, const char * value) {
    gguf_set_val_str(meta, key, value);
}
// Tensors marked for F16→F32 promotion. Looked up by tensor name.
// Populated by handle_gemma3_clip; consumed by supply_promoted_tensor_data.
std::mutex g_promote_mutex;
std::unordered_set<std::string> g_promote_f16_to_f32;

void mark_promote_f16_to_f32(const std::string & name) {
    std::lock_guard<std::mutex> lk(g_promote_mutex);
    g_promote_f16_to_f32.insert(name);
}

// Change a tensor's type in the ggml_context. Updates type and strides so
// that ggml_nbytes(t) returns the new-type size, and ggml_dup_tensor
// propagates the new type to any copies.
void set_tensor_type_in_ctx(ggml_context * ctx, const char * name, ggml_type new_type) {
    ggml_tensor * t = ggml_get_tensor(ctx, name);
    if (!t) return;
    t->type  = new_type;
    t->nb[0] = ggml_type_size(new_type);
    t->nb[1] = t->nb[0] * (t->ne[0] / ggml_blck_size(new_type));
    for (int i = 2; i < GGML_MAX_DIMS; ++i) {
        t->nb[i] = t->nb[i - 1] * t->ne[i - 1];
    }
}

// Mark a tensor for F16→F32 promotion and update its type in the
// ggml_context. Used for conv weights that Metal requires as F32.
void promote_tensor_to_f32(gguf_context * meta, ggml_context * ctx, const char * name) {
    // Update ggml_context (clip.cpp reads type from here via ggml_dup_tensor).
    set_tensor_type_in_ctx(ctx, name, GGML_TYPE_F32);
    // Note: we do NOT call gguf_set_tensor_type on `meta`, because that
    // recomputes tensor data offsets based on the new type — but we still
    // have F16 bytes at the original offset. clip.cpp reads the offset from
    // its own tensor_offset map (populated from gguf_context BEFORE this
    // promotion), so leaving meta's offset alone preserves the correct
    // source location. We also don't use meta's type for sizing.
    mark_promote_f16_to_f32(name);
}

// Convert a buffer of F16 values to F32 (separate source and destination).
void convert_f16_to_f32(const uint16_t * src, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = ggml_fp16_to_fp32(src[i]);
    }
}
void handle_gemma3_clip(gguf_context * meta, ggml_context * ctx) {
    // Build clip.* KVs from the gemma3.vision.* KVs already in the file.
    copy_u32_kv(meta, "gemma3.vision.block_count",          "clip.vision.block_count");
    copy_u32_kv(meta, "gemma3.vision.embedding_length",     "clip.vision.embedding_length");
    copy_u32_kv(meta, "gemma3.vision.feed_forward_length",  "clip.vision.feed_forward_length");
    copy_u32_kv(meta, "gemma3.vision.image_size",           "clip.vision.image_size");
    copy_u32_kv(meta, "gemma3.vision.patch_size",           "clip.vision.patch_size");
    copy_u32_kv(meta, "gemma3.vision.attention.head_count", "clip.vision.attention.head_count");
    copy_f32_kv(meta, "gemma3.vision.attention.layer_norm_epsilon", "clip.vision.attention.layer_norm_epsilon");

    // projection_dim is the TEXT model's embedding_length (the mmproj
    // output dim == language model input dim).
    copy_u32_kv(meta, "gemma3.embedding_length", "clip.vision.projection_dim");

    // image_mean / image_std — constant defaults for gemma3 vision.
    if (!has_key(meta, "clip.vision.image_mean")) {
        const float mean[3] = {0.5f, 0.5f, 0.5f};
        gguf_set_arr_data(meta, "clip.vision.image_mean", GGUF_TYPE_FLOAT32, mean, 3);
    }
    if (!has_key(meta, "clip.vision.image_std")) {
        const float std_[3] = {0.5f, 0.5f, 0.5f};
        gguf_set_arr_data(meta, "clip.vision.image_std", GGUF_TYPE_FLOAT32, std_, 3);
    }

    // Top-level clip flags.
    if (!has_key(meta, "clip.has_vision_encoder")) {
        gguf_set_val_bool(meta, "clip.has_vision_encoder", true);
    }
    if (!has_key(meta, "clip.use_gelu")) {
        gguf_set_val_bool(meta, "clip.use_gelu", true);
    }
    set_str(meta, "clip.projector_type", "gemma3");
    set_str(meta, "general.architecture", "clip");

    // Tensor name translation (Ollama -> upstream mtmd convention).
    rename_tensors_containing(meta, ctx, "v.patch_embedding",    "v.patch_embd");
    rename_tensors_containing(meta, ctx, "v.position_embedding", "v.position_embd");
    rename_tensors_containing(meta, ctx, "v.post_layernorm",     "v.post_ln");
    rename_tensors_containing(meta, ctx, ".layer_norm1",         ".ln1");
    rename_tensors_containing(meta, ctx, ".layer_norm2",         ".ln2");
    rename_tensors_containing(meta, ctx, ".attn_output",         ".attn_out");
    rename_tensors_containing(meta, ctx, ".mlp.fc1",             ".ffn_down");
    rename_tensors_containing(meta, ctx, ".mlp.fc2",             ".ffn_up");
    rename_tensors_containing(meta, ctx, "mm.mm_input_projection", "mm.input_projection");
    rename_tensors_containing(meta, ctx, "mm.mm_soft_emb_norm",    "mm.soft_emb_norm");

    // Promote F16 patch-embed / position-embed to F32. Upstream stores these
    // as F32 (see Gemma3VisionModel.tensor_force_quant in convert_hf_to_gguf.py).
    // Metal's IM2COL op requires F32 for these convolution inputs.
    promote_tensor_to_f32(meta, ctx, "v.patch_embd.weight");
    promote_tensor_to_f32(meta, ctx, "v.position_embd.weight");
}
} // anonymous namespace

void translate_clip_metadata(gguf_context * meta, ggml_context * ctx) {
    if (!meta) return;
    // Detection: an Ollama-format gemma3 blob has `gemma3.mm.tokens_per_image`
    // plus embedded `v.*` tensors. Upstream mmproj files use
    // `general.architecture=clip` and don't have gemma3.* KVs.
    if (has_key(meta, "gemma3.mm.tokens_per_image") &&
        any_tensor_with_prefix(ctx, "v.")) {
        LLAMA_LOG_INFO("%s: detected Ollama-format gemma3 GGUF used as mmproj; translating\n", __func__);
        handle_gemma3_clip(meta, ctx);
    }
}

bool supply_promoted_tensor_data(const ggml_tensor * cur,
                                 const char * source_file,
                                 size_t file_offset,
                                 std::vector<uint8_t> & out) {
    {
        std::lock_guard<std::mutex> lk(g_promote_mutex);
        if (g_promote_f16_to_f32.find(ggml_get_name(cur)) == g_promote_f16_to_f32.end()) {
            return false;
        }
    }
    // cur->type is F32 (after promotion). Source bytes are F16 at file_offset.
    if (cur->type != GGML_TYPE_F32) {
        return false;
    }
    const size_t n_elem    = ggml_nelements(cur);
    const size_t src_bytes = n_elem * sizeof(uint16_t);
    const size_t dst_bytes = n_elem * sizeof(float);

    std::vector<uint8_t> src(src_bytes);
    FILE * f = std::fopen(source_file, "rb");
    if (!f) {
        LLAMA_LOG_ERROR("%s: failed to open '%s'\n", __func__, source_file);
        return false;
    }
    if (std::fseek(f, (long) file_offset, SEEK_SET) != 0) {
        std::fclose(f);
        LLAMA_LOG_ERROR("%s: failed to seek in '%s'\n", __func__, source_file);
        return false;
    }
    if (std::fread(src.data(), 1, src_bytes, f) != src_bytes) {
        std::fclose(f);
        LLAMA_LOG_ERROR("%s: failed to read %zu bytes for '%s'\n",
                        __func__, src_bytes, ggml_get_name(cur));
        return false;
    }
    std::fclose(f);

    out.resize(dst_bytes);
    convert_f16_to_f32(reinterpret_cast<const uint16_t *>(src.data()),
                       reinterpret_cast<float *>(out.data()),
                       n_elem);
    LLAMA_LOG_INFO("%s: promoted F16->F32 for %s (%zu elems)\n",
                   __func__, ggml_get_name(cur), n_elem);
    return true;
}

bool should_skip_tensor(const llama_model_loader * ml, const char * tensor_name) {
    std::lock_guard<std::mutex> lk(g_registry_mutex);
    auto it = g_registry.find(ml);
    if (it == g_registry.end()) return false;
    for (const auto & prefix : it->second.skip_prefixes) {
        if (std::strncmp(tensor_name, prefix.c_str(), prefix.size()) == 0) {
            return true;
        }
    }
    return false;
}

void apply_tensor_transforms(const llama_model_loader * ml, ggml_context * ctx) {
    std::vector<TransformSpec> specs;
    {
        std::lock_guard<std::mutex> lk(g_registry_mutex);
        auto it = g_registry.find(ml);
        if (it == g_registry.end()) return;
        specs = it->second.transforms;
    }
    if (specs.empty()) return;

    std::vector<uint8_t> buf;
    for (ggml_tensor * t = ggml_get_first_tensor(ctx); t; t = ggml_get_next_tensor(ctx, t)) {
        if (!t->buffer) continue;
        const std::string name = ggml_get_name(t);
        for (const auto & spec : specs) {
            if (!spec.matches(name)) continue;
            const size_t nbytes = ggml_nbytes(t);
            const size_t n_elem = ggml_nelements(t);
            buf.resize(nbytes);
            ggml_backend_tensor_get(t, buf.data(), 0, nbytes);
            spec.apply(buf.data(), n_elem, t->type);
            ggml_backend_tensor_set(t, buf.data(), 0, nbytes);
        }
    }
}

} // namespace llama_ollama_compat

llama/compat/llama-ollama-compat.h (vendored, new file, +78)

@@ -0,0 +1,78 @@
#pragma once
// Ollama-format GGUF compatibility shim.
//
// Older Ollama builds ship GGUFs that differ from upstream in a handful of
// ways per-architecture. This shim detects those files during load and
// translates them in-memory so the rest of llama.cpp can load them
// unmodified. Single entry point per hook; all logic is data-driven from
// per-architecture rules.
//
// Two hooks:
// 1. translate_metadata() — runs after gguf_init_from_file, mutates KVs
// and (optionally) tensor names on the gguf_context / ggml_context.
// 2. apply_tensor_transforms() — runs after load_all_data, rewrites
// tensor data that differs numerically (e.g. gemma3 RMSNorm +1).
#include <cstdint>
#include <string>
#include <vector>
struct gguf_context;
struct ggml_context;
struct ggml_tensor;
struct llama_model_loader;
namespace llama_ollama_compat {
// Inspect and mutate the just-loaded gguf_context. May update arch_name if
// the file uses an Ollama-specific architecture name. Safe to call for any
// model — no-op when no Ollama markers are present.
//
// If compat was applied, registers any tensor transforms against `ml` for
// apply_tensor_transforms() to consume later.
void translate_metadata(const llama_model_loader * ml,
                        gguf_context * meta,
                        ggml_context * ctx,
                        std::string & arch_name);
// Returns true if the loader should skip this tensor entirely (not add to
// weights_map, not count toward n_tensors). Used to drop embedded vision
// tensors from the text model without physically removing them.
bool should_skip_tensor(const llama_model_loader * ml, const char * tensor_name);
// Called after load_all_data returns for a model context. Applies any
// registered transforms (read tensor data from the backend buffer, modify,
// write back) to tensors in `ctx`. Call once per model context.
void apply_tensor_transforms(const llama_model_loader * ml, ggml_context * ctx);
// Called from the clip loader (tools/mtmd/clip.cpp). If the file is an
// Ollama-format monolithic GGUF (text + embedded vision), rewrites the
// clip-facing view of the metadata so the clip loader sees it as a normal
// mmproj file. Safe to call unconditionally — no-op when not an Ollama file.
//
// Operations:
// - sets general.architecture = "clip"
// - sets clip.has_vision_encoder, clip.projector_type, clip.use_gelu
// - copies gemma3.vision.* KVs into clip.vision.*
// - renames vision tensors (v.patch_embedding -> v.patch_embd, etc.)
// - promotes specific F16 tensors to F32 in the ggml_context so clip
// allocates the correct buffer size
//
// Non-vision text tensors remain in the gguf but are never looked up by
// clip, so they cost nothing.
void translate_clip_metadata(gguf_context * meta, ggml_context * ctx);
// Called from clip.cpp's tensor-loading loop, before reading bytes from the
// file. If this tensor was marked for type promotion by translate_clip_metadata,
// fills `out` with the promoted data (e.g. F16→F32) and returns true. The
// caller should then use `out` instead of reading from the file.
//
// `file_offset` is the absolute file offset of the original (pre-promotion)
// tensor data in the source GGUF.
bool supply_promoted_tensor_data(const ggml_tensor * cur,
                                 const char * source_file,
                                 size_t file_offset,
                                 std::vector<uint8_t> & out);
} // namespace llama_ollama_compat

llama/compat/upstream-edits.patch (new file, +161)

@@ -0,0 +1,161 @@
diff --git a/ggml/include/gguf.h b/ggml/include/gguf.h
index 02d5f221c..2f64264f4 100644
--- a/ggml/include/gguf.h
+++ b/ggml/include/gguf.h
@@ -162,6 +162,9 @@ extern "C" {
     // assumes that at least gguf_get_tensor_size bytes can be read from data
     GGML_API void gguf_set_tensor_data(struct gguf_context * ctx, const char * name, const void * data);
 
+    // rename a tensor. Returns false if old_name is not present.
+    GGML_API bool gguf_rename_tensor(struct gguf_context * ctx, const char * old_name, const char * new_name);
+
     // writing gguf files can be done in 3 ways:
     //
     // - write the entire gguf_context to a binary file in a single pass:
diff --git a/ggml/src/gguf.cpp b/ggml/src/gguf.cpp
index ab3cc9748..e3d06c959 100644
--- a/ggml/src/gguf.cpp
+++ b/ggml/src/gguf.cpp
@@ -1280,6 +1280,16 @@ void gguf_set_tensor_data(struct gguf_context * ctx, const char * name, const vo
     ctx->info[tensor_id].t.data = (void *)(uintptr_t)data; // double cast suppresses warning about casting away const
 }
 
+bool gguf_rename_tensor(struct gguf_context * ctx, const char * old_name, const char * new_name) {
+    const int64_t tensor_id = gguf_find_tensor(ctx, old_name);
+    if (tensor_id < 0) {
+        return false;
+    }
+    // ggml_set_name truncates to GGML_MAX_NAME.
+    ggml_set_name(&ctx->info[tensor_id].t, new_name);
+    return true;
+}
+
 struct gguf_writer_base {
     size_t written_bytes {0u};
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 7b1fcfca0..155a819fa 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -32,6 +32,7 @@ add_library(llama
             llama-model-loader.cpp
             llama-model-saver.cpp
             llama-model.cpp
+            llama-ollama-compat.cpp
             llama-quant.cpp
             llama-sampler.cpp
             llama-vocab.cpp
diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 4e65a45a5..75836c683 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -4,6 +4,7 @@
 #include "ggml.h"
 #include "gguf.h"
 #include "llama-hparams.h"
+#include "llama-ollama-compat.h"
 
 #include <algorithm>
 #include <array>
@@ -549,6 +550,7 @@ llama_model_loader::llama_model_loader(
     }
 
     get_key(llm_kv(LLM_KV_GENERAL_ARCHITECTURE), arch_name, false);
+    llama_ollama_compat::translate_metadata(this, metadata, ctx, arch_name);
     llm_kv = LLM_KV(llm_arch_from_string(arch_name));
 
     files.emplace_back(new llama_file(fname.c_str(), "rb", use_direct_io));
@@ -573,6 +575,9 @@ llama_model_loader::llama_model_loader(
         // so we build a unified tensors index for weights.
         for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
             std::string tensor_name = std::string(cur->name);
+            if (llama_ollama_compat::should_skip_tensor(this, tensor_name.c_str())) {
+                continue;
+            }
             // make sure there is no duplicated tensor names
             if (weights_map.find(tensor_name) != weights_map.end()) {
                 throw std::runtime_error(format("invalid model: tensor '%s' is duplicated", ggml_get_name(cur)));
@@ -683,6 +688,9 @@ llama_model_loader::llama_model_loader(
         // Save tensors data offset info of the main file.
         for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
             std::string tensor_name = std::string(cur->name);
+            if (llama_ollama_compat::should_skip_tensor(this, tensor_name.c_str())) {
+                continue;
+            }
             // make sure there is no duplicated tensor names
             if (weights_map.find(tensor_name) != weights_map.end()) {
                 throw std::runtime_error(format("invalid model: tensor '%s' is duplicated", ggml_get_name(cur)));
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index 4ded484dd..7d3509c23 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -6,6 +6,7 @@
 #include "llama-mmap.h"
 #include "llama-cparams.h"
 #include "llama-model-loader.h"
+#include "llama-ollama-compat.h"
 
 #include "llama-kv-cache.h"
 #include "llama-kv-cache-iswa.h"
@@ -8023,6 +8024,9 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
         if (!ml.load_all_data(ctx, buf_map, use_mlock ? &pimpl->mlock_mmaps : NULL, params.progress_callback, params.progress_callback_user_data)) {
             return false;
         }
+        // Apply any Ollama-format numerical fixups (e.g. gemma3 RMSNorm +1)
+        // while the data is in its final backend buffers.
+        llama_ollama_compat::apply_tensor_transforms(&ml, ctx);
     }
 
     if (use_mmap_buffer) {
diff --git a/tools/mtmd/clip.cpp b/tools/mtmd/clip.cpp
index f0e8786b6..ec2a7d320 100644
--- a/tools/mtmd/clip.cpp
+++ b/tools/mtmd/clip.cpp
@@ -10,6 +10,8 @@
 #include "ggml-backend.h"
 #include "gguf.h"
 
+#include "src/llama-ollama-compat.h"
+
 #include <algorithm>
 #include <cassert>
 #include <cmath>
@@ -985,6 +987,11 @@ struct clip_model_loader {
         ctx_meta.reset(meta);
 
+        // If this is an Ollama-format monolithic GGUF (text + embedded
+        // vision), translate its metadata and tensor names into the
+        // upstream mmproj shape so the rest of this loader runs unchanged.
+        llama_ollama_compat::translate_clip_metadata(ctx_gguf.get(), meta);
+
         const int n_tensors = gguf_get_n_tensors(ctx_gguf.get());
 
         // print gguf info
@@ -2358,11 +2365,25 @@ struct clip_model_loader {
             auto it_off = tensor_offset.find(t->name);
             GGML_ASSERT(it_off != tensor_offset.end() && "no offset for tensor");
             const size_t offset = it_off->second;
+            size_t num_bytes = ggml_nbytes(cur);
+
+            // Ollama-compat: let the compat layer supply promoted tensor
+            // data (e.g. F16→F32 for conv weights) instead of reading
+            // bytes directly from the file.
+            std::vector<uint8_t> compat_buf;
+            if (llama_ollama_compat::supply_promoted_tensor_data(cur, fname.c_str(), offset, compat_buf)) {
+                if (ggml_backend_buft_is_host(buft)) {
+                    std::memcpy(cur->data, compat_buf.data(), num_bytes);
+                } else {
+                    ggml_backend_tensor_set(cur, compat_buf.data(), 0, num_bytes);
+                }
+                continue;
+            }
+
             fin.seekg(offset, std::ios::beg);
             if (!fin) {
                 throw std::runtime_error(string_format("%s: failed to seek for tensor %s\n", __func__, t->name));
             }
-            size_t num_bytes = ggml_nbytes(cur);
             if (ggml_backend_buft_is_host(buft)) {
                 // for the CPU and Metal backend, we can read directly into the tensor
                 fin.read(reinterpret_cast<char *>(cur->data), num_bytes);

llama/server/CMakeLists.txt (modified, +15)

@@ -35,6 +35,20 @@ if(DEFINED ENV{OLLAMA_LLAMA_CPP_SOURCE})
     message(STATUS "Using local llama.cpp source: ${_src}")
 endif()
 
+# Ollama-compat shim: overlays the fetched llama.cpp source with a tiny
+# in-memory translation layer that lets upstream llama-server load GGUFs
+# produced by older Ollama versions (e.g. existing ~/.ollama/models/blobs).
+# See llama/compat/README.md for details.
+#
+# The patch only runs when fetching from GitHub — if a local source override
+# is active, leave the developer's tree alone (they can apply by hand if
+# they want to iterate on the compat layer).
+set(_ollama_compat_patch_cmd "")
+if(NOT DEFINED ENV{OLLAMA_LLAMA_CPP_SOURCE})
+    include(${CMAKE_CURRENT_SOURCE_DIR}/../compat/compat.cmake)
+    set(_ollama_compat_patch_cmd PATCH_COMMAND ${OLLAMA_LLAMA_CPP_COMPAT_PATCH_COMMAND})
+endif()
+
 # Configure upstream build options BEFORE FetchContent_MakeAvailable.
 # When included via FetchContent, llama.cpp sets LLAMA_STANDALONE=OFF
 # so all optional builds default to OFF. We explicitly enable what we need.
@@ -53,6 +67,7 @@ FetchContent_Declare(
     GIT_REPOSITORY "https://github.com/ggml-org/llama.cpp.git"
     GIT_TAG ${LLAMA_CPP_GIT_TAG}
     GIT_SHALLOW TRUE
+    ${_ollama_compat_patch_cmd}
 )
 
 FetchContent_MakeAvailable(llama_cpp)