llama/compat: document non-public API dependencies

Added a "Maintenance" section to llama/compat/README.md and an inline
block to llama-ollama-compat-util.h enumerating the three places the
compat layer depends on something that isn't strictly part of the
guaranteed public upstream API:

  1. Direct ggml_tensor::{type,ne,nb} field writes (public struct,
     no sanctioned post-creation mutators).
  2. const_cast on gguf_get_tensor_name's return to rename in place.
  3. llama_model_loader forward-decl from src/ (used only as an
     opaque pointer key).

Each entry ships with its escape hatch (small upstream patch or a
local replacement). All have been stable for years; they don't need
immediate action, but surfacing them makes the maintenance story
explicit for future readers.
jmorganca 2026-04-19 13:01:46 -07:00
parent 2a388da77b
commit 9a69a17dc2
2 changed files with 59 additions and 0 deletions

llama/compat/README.md

@@ -78,3 +78,36 @@ llama.cpp's source in the Ollama tree (the old `llama/llama.cpp/` layout).
This shim keeps upstream unmodified on disk and the Ollama-specific logic
isolated in two files plus a small diff — upstream bumps are usually just
`LLAMA_CPP_VERSION` changes.

## Maintenance: non-public API dependencies

The compat code is mostly written against stable public APIs (`gguf.h`,
`ggml.h`, `ggml-backend.h`). There are three places where we lean on
something that isn't strictly public:

| Hack | Why | Escape hatch if upstream changes |
|---|---|---|
| Direct writes to `ggml_tensor::type` / `ne[]` / `nb[]` | No sanctioned mutator exists for post-creation tensor reshape/retype. Struct is public so this works today. | Ask upstream to expose `ggml_tensor_set_{type,shape}` helpers, or introduce them in our compat util and submit a PR. |
| `const_cast<char *>(gguf_get_tensor_name(...))` in `rename_tensor` | The pointer points into a mutable `char[GGML_MAX_NAME]` buffer inside a `std::vector` element; the `const` is API hygiene. Lets us rename gguf tensors without a new public helper. | Add `gguf_rename_tensor` to `gguf.h` (10 lines) and drop the `const_cast`. |
| `llama_model_loader` forward-decl from `src/llama-model-loader.h` | Used only as an opaque pointer key for our skip-prefix registry. Never dereferenced. | Replace with `const void *` in our registry signatures. Zero behavioral change. |

None of these have changed in years. If an upstream bump breaks any of
them, each has a trivial workaround. See the top of
`llama-ollama-compat-util.h` for the inline notes.

## Documented hacks inside per-arch handlers

- **`reclaim_slot_as` (qwen35moe patch_embed split)** — repurposes an
orphaned `v.blk.0.attn_k` slot (left over after the QKV merge) as a
newly-synthesized `v.patch_embd.weight.1`. Needed because clip.cpp's
`ctx_meta` is sized for exactly the original tensor count (no_alloc
branch of `gguf_init_from_file` uses `n_tensors * ggml_tensor_overhead()`
with zero slack). Comments in the helper and at the call site explain
the reasoning; the replacement would be a one-line upstream patch that
adds a little slack to the ctx size.
- **Load-op registry overrides `file_offset`** — `maybe_load_tensor` gets
passed the gguf offset by its caller but ignores it when a registered
op exists. Intentional: the ops capture their own source offsets at
translate time (before our renames invalidate them). Documented in the
op-registration helpers.

llama-ollama-compat-util.h

@@ -7,6 +7,32 @@
// definitions live in llama-ollama-compat-util.cpp, which also owns the
// registry globals (tensor skip list, load-op table) that need a single
// translation unit.
//
// ---- Non-public API dependencies (see also README.md "Maintenance") ----
//
// Mostly public: gguf_* and ggml_* accessors from ggml/include/ are all
// stable. `ggml_backend_*` and `ggml_fp16_to_fp32` are stable too.
//
// Three pieces we rely on that aren't strictly guaranteed public:
//
// 1. Direct writes to `ggml_tensor::type`, `ne[]`, `nb[]` — the struct is
// public and fields are spec'd, but there's no sanctioned mutator for
// them post-creation. Used in set_tensor_type / set_tensor_shape /
// reclaim_slot_as. Risk: upstream could in principle introduce an
// opaque-tensor mode; in practice these fields haven't changed in years.
//
// 2. `const_cast<char *>(gguf_get_tensor_name(...))` in rename_tensor.
// The pointer returned points into a mutable char[GGML_MAX_NAME]
// buffer inside a std::vector element. Defined behavior as long as
// upstream keeps name storage in-line (has done so forever).
//
// 3. `llama_model_loader` forward decl from src/llama-model-loader.h
// (internal, not llama.h). Only used as an opaque pointer key for
// the skip-prefix registry — we never dereference it. Could swap for
// `const void *` if upstream ever moved that type around.
//
// All three are trivially replaceable if upstream changes out from under
// us. See llama/compat/README.md for the escape hatches.
#include <cstddef>
#include <cstdint>