mirror of
https://github.com/danny-avila/LibreChat.git
synced 2026-05-13 16:07:30 +00:00
* feat: migrate GitNexus deployment from Fly.io to DigitalOcean droplet
Fly.io's 1GB machine was pegged at ~900MB memory with load spiking to
2.7 under even modest query load. Moving to a 2GB+ DO droplet that can
take advantage of existing credits.
Architecture change: indexes no longer baked into the image. Instead,
a long-lived image (built only when .do/gitnexus/ changes) is pulled
from GHCR, and the deploy workflow rsyncs .gitnexus/ data into
/opt/gitnexus/indexes/<name>/ on the droplet and restarts only the
gitnexus container. Caddy stays running so TLS certs don't churn.
- Add .do/gitnexus/Dockerfile (same native-addon + extension patch
layers as the Fly variant, but no COPY indexes/ step)
- Add .do/gitnexus/docker-compose.yml with gitnexus + caddy services
on an internal bridge network, 1.8GB memory limit, healthcheck
- Add .do/gitnexus/Caddyfile with automatic HTTPS for the configured
subdomain and bearer token auth for all routes except /health
- Add .do/gitnexus/entrypoint.sh that registers every index mounted
at /indexes/<name>/.gitnexus at container start, then runs
gitnexus serve bound to 0.0.0.0 (internal docker network only)
- Add .do/gitnexus/install-extensions.js for LadybugDB FTS/vector
extension pre-install (workaround for upstream bug)
- Add .github/workflows/gitnexus-deploy-do.yml that builds the image
only on Dockerfile/entrypoint changes, pushes to GHCR, rsyncs the
index artifacts to the droplet, and restarts the gitnexus container
- Remove .fly/gitnexus/ and .github/workflows/gitnexus-deploy.yml —
Fly app will be destroyed after DO deploy is verified working
Required new secrets: DO_HOST, DO_USER, DO_SSH_KEY. GITNEXUS_DOMAIN
and API_TOKEN live in /opt/gitnexus/.env on the droplet itself.
* refactor: prefix deploy secrets with GITNEXUS_ for namespace isolation
Rename DO_HOST -> GITNEXUS_DO_HOST, DO_USER -> GITNEXUS_DO_USER, and
DO_SSH_KEY -> GITNEXUS_DO_SSH_KEY so the secrets are clearly scoped
to the gitnexus deploy and don't collide with any other DigitalOcean
secrets LibreChat might add later.
* feat: serve PR indexes alongside main/dev and add /gitnexus command
The index workflow was already building and uploading per-PR indexes
(gitnexus-index-pr-<N>) for contributor PRs, but the deploy workflow
only consumed main and dev artifacts. PR indexes were sitting in
storage doing nothing. This wires them all the way through to the
live MCP server, with proper cleanup when PRs close.
Deploy workflow changes:
- Drop the branches filter on workflow_run so PR index completions
also trigger deploys (PR indexes are already contributor-gated
upstream in gitnexus-index.yml via author_association)
- Resolve all open PRs via the GitHub API, look up each one's latest
non-expired gitnexus-index-pr-<N> artifact, and serve whichever
ones exist. PRs without an index artifact are skipped — we don't
retroactively index anything.
- Per-ref concurrency group so rapid pushes to the same PR coalesce
but different refs still deploy in parallel
- After rsyncing active indexes, prune any /opt/gitnexus/indexes/
folder that isn't in the active set. Safety net for missed PR
close events.
New workflow: gitnexus-cleanup-pr.yml
- Fires on pull_request closed (merged or not)
- SSHs to the droplet, removes /opt/gitnexus/indexes/LibreChat-pr-<N>,
restarts the gitnexus container
New workflow: gitnexus-pr-command.yml
- Listens for issue_comment events where body starts with /gitnexus
- Contributor gated via author_association
- Supports:
  /gitnexus index (index with defaults)
  /gitnexus index embeddings (index with --embeddings)
- Dispatches gitnexus-index.yml with the PR number and head SHA,
reacts to the comment with a rocket emoji
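The command grammar above could be parsed roughly as follows. This is a minimal sketch, not the workflow's actual step: the helper name `parse_gitnexus_command` and its output strings ("default", "embeddings") are hypothetical, and only the first line of the comment body is considered.

```shell
# Hypothetical helper: maps a comment body to an indexing mode.
# Prints "default" or "embeddings"; returns 1 for anything else.
parse_gitnexus_command() {
  body=$1
  # Assumption for this sketch: only the first line of the comment matters.
  first_line=$(printf '%s\n' "$body" | head -n 1)
  case "$first_line" in
    "/gitnexus index embeddings") echo "embeddings" ;;
    "/gitnexus index")            echo "default" ;;
    *)                            return 1 ;;
  esac
}
```

A caller would branch on the exit status first, so comments that merely mention /gitnexus mid-sentence are ignored rather than dispatched.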
Index workflow changes:
- New dispatch inputs pr_number and pr_ref for command-driven runs
- Checkout step uses inputs.pr_ref when set so the PR's head commit
is analyzed instead of the default branch
- Artifact naming falls back through pr_number -> pull_request number
-> ref_name, keeping existing behavior for push/PR events
- Concurrency group switches to pr-<N> when dispatched by the command
so re-runs on the same PR debounce correctly
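The artifact naming fallback described above can be sketched as a small shell function. The function name `artifact_suffix` and its argument order are hypothetical; the `pr-<N>` form follows the gitnexus-index-pr-<N> artifact name mentioned earlier, while using the bare ref name for branch builds is an assumption.

```shell
# Hypothetical helper: picks the artifact suffix using the fallback order
# pr_number (dispatch input) -> pull_request number (event) -> ref_name.
artifact_suffix() {
  input_pr=$1   # inputs.pr_number, empty unless dispatched by the command
  event_pr=$2   # pull_request number, empty for push events
  ref_name=$3   # branch name, for example main or dev
  pr=${input_pr:-$event_pr}
  if [ -n "$pr" ]; then
    echo "pr-$pr"
  else
    echo "$ref_name"   # assumption: branch artifacts are keyed by ref name
  fi
}
```

Because the dispatch input is checked first, a command-driven run on a PR produces the same suffix that a pull_request event for that PR would.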
* chore: remove Fly variant reference from Dockerfile header
The Fly variant no longer exists in the repo, so the comparison
comment is meaningless. Rewritten as a standalone description of
the image's design.
* review: resolve 15 findings from review audit
Critical
- Drop the unused caddy binary from the gitnexus image. Caddy runs in
its own container in this architecture; installing it inside the
gitnexus image added ~40-60MB for no reason and contradicted the
Dockerfile header comment.
Major
- Replace ssh-keyscan TOFU with a GITNEXUS_DO_KNOWN_HOST secret.
Both deploy and cleanup workflows now pin the droplet's host key
from a stored value instead of silently trusting whatever the host
presents at deploy time. Fails the workflow if the secret is empty
so no one accidentally regresses to TOFU.
- Gate gitnexus-cleanup-pr.yml to same-repo PRs via
github.event.pull_request.head.repo.full_name == github.repository.
Fork PR closes no longer produce failed runs when secrets are
withheld by GitHub. The deploy workflow's stale-folder prune step
remains the safety net for any fork-contributor indexes.
- Fail fast in entrypoint.sh when main/dev index registration errors.
Previously `|| echo WARN` swallowed failures so a broken index
passed the docker healthcheck and the deploy was marked green
while queries returned empty. PR indexes stay best-effort
(a corrupt PR index shouldn't take the whole server down).
- Authenticate the droplet with GHCR on every deploy using
GITHUB_TOKEN, so private GHCR packages work without documentation
detours or manual docker login on the host. Bootstrap comments
explain the flow.
- Switch docker-compose caddy.depends_on from the short-form
(service_started) to service_healthy so Caddy doesn't route to a
gitnexus container that's still starting (500ms-60s window after
recreation).
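The fail-fast registration policy from the entrypoint.sh finding above might look something like this. It is a sketch under assumptions: `REGISTER_CMD` is a hypothetical stand-in for whatever command the real entrypoint uses to register an index, and PR indexes are recognized by the `-pr-` naming seen in LibreChat-pr-<N>.

```shell
# Sketch of the registration policy: main/dev failures abort startup,
# PR index failures only warn. REGISTER_CMD is a hypothetical stand-in
# for the real registration command; it receives the index directory.
REGISTER_CMD=${REGISTER_CMD:-false}

register_all() {
  root=$1
  for dir in "$root"/*/; do
    # Skip anything that is not a mounted index (also covers an empty root).
    [ -d "$dir/.gitnexus" ] || continue
    name=$(basename "$dir")
    if "$REGISTER_CMD" "$dir"; then
      echo "registered $name"
      continue
    fi
    case "$name" in
      *-pr-*) echo "WARN: skipping PR index $name" >&2 ;;
      *)      echo "FATAL: failed to register $name" >&2; return 1 ;;
    esac
  done
  return 0
}
```

Returning non-zero before the server starts is what lets the Docker healthcheck, and in turn the deploy, report the failure instead of serving empty results.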
Minor
- Guard the HEAD~1 diff with `git rev-parse --verify HEAD~1` so the
first-commit and workflow_run-from-PR cases default to rebuild
instead of silently skipping a legitimately-changed image.
- Move `packages: write` off the workflow-level permissions and onto
the build-image job. deploy no longer inherits unnecessary GHCR
write access.
- Skip the SSH session in cleanup-pr.yml when no gitnexus-index-pr-<N>
artifact ever existed for the PR. Eliminates ~95% of no-op SSH
round-trips on a busy repo (docs-only PRs, paths-ignored PRs, etc).
- Reload Caddy in-place after config upload with `caddy reload`,
falling back to force-recreate on reload failure and `compose up`
on first-time bootstrap. Picks up Caddyfile or env changes without
losing TLS certs.
- Replace `sleep 5` post-deploy with a real readiness poll against
docker's health status. Fails the workflow if gitnexus doesn't
report healthy within 120s, so a broken startup surfaces in CI
instead of being silently marked green.
- Warn when listPulls hits the 100-item per_page ceiling so a future
growth spurt past 100 open PRs doesn't silently drop indexes.
Nit
- Tighten NODE_OPTIONS --max-old-space-size from 1536 to 1280MB,
giving KuzuDB's C++ heap ~512MB of room under the 1792MB cgroup
limit instead of ~256MB.
- Rewrite the stale "headroom for Caddy" comment in entrypoint.sh
(Caddy lives in a separate container now).
- Restore load-bearing comments in install-extensions.js explaining
the @ladybugdb/core path layout and the throwaway-db cache-priming
pattern.
- Parameterize the docker-compose image reference as
${GITNEXUS_IMAGE:-ghcr.io/danny-avila/librechat-gitnexus:latest}
so forks or pinned version tags can override via /opt/gitnexus/.env.
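Two of the nits above reduce to simple expressions, shown here as a sketch. Compose interpolation uses the same `${VAR:-default}` form as POSIX shell, so the override behaves like `resolve_image` below; the helper names are hypothetical.

```shell
# Same ${VAR:-default} form that the compose file uses for the image.
resolve_image() {
  echo "${GITNEXUS_IMAGE:-ghcr.io/danny-avila/librechat-gitnexus:latest}"
}

# Memory left for native (non-V8) allocations under the 1792MB cgroup
# limit, given a --max-old-space-size value in MB.
native_headroom_mb() {
  heap_cap_mb=$1
  echo $(( 1792 - heap_cap_mb ))
}
```

With the old cap, `native_headroom_mb 1536` gives 256; with the new cap, `native_headroom_mb 1280` gives 512, matching the figures quoted above.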
Deferred
- Finding 12 (memory headroom) addressed partially via the heap cap
reduction; full profiling of KuzuDB C++ allocations under query
load deferred to post-deploy monitoring.
* review: resolve 8 follow-up findings from second review pass
Security
- F1: pipe GHCR token via SSH stdin instead of expanding it into the
remote command string. Previously `"echo '$GH_TOKEN' | docker login"`
expanded the token locally before SSH sent it as an argument, so the
live token was briefly visible in /proc/<pid>/cmdline on the droplet
to any process running as deploy or root. New form uses
`printf '%s' "$GH_TOKEN" | ssh ... "docker login --password-stdin"`
so the token only travels through the encrypted SSH stdin pipe.
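The stdin pattern from F1 can be tried locally without SSH or Docker. In this sketch `send_token` is a hypothetical wrapper; in the workflow the trailing command would be the `ssh ... "docker login --password-stdin"` invocation quoted above.

```shell
# send_token TOKEN COMMAND [ARGS...]
# Pipes the secret to COMMAND's stdin so it never appears in COMMAND's
# argument list. printf '%s' writes the bytes exactly, with no newline.
send_token() {
  token=$1
  shift
  printf '%s' "$token" | "$@"
}
```

For example, `send_token "$GH_TOKEN" cat` echoes the token back unchanged, which makes it easy to confirm that nothing is appended or expanded along the way.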
Reliability
- F2: add json-file log rotation (50m x 3 files) via a YAML anchor
shared by both services. Default Docker logging is unbounded and
would eventually fill the 60GB droplet disk.
- F4: set memswap_limit=1792m to match mem_limit. Without this, Docker
lets the container silently spill onto host swap when KuzuDB's C++
heap overruns the 1792m RAM budget, turning sub-second graph queries
into multi-second ones with no alert. Hard OOM-kill is preferable —
unless-stopped restarts the container, the deploy health poll
catches it, the failure is explicit.
- F5: extend the post-deploy health poll from 24 iterations (120s) to
36 iterations (180s) so it clears Docker's own unhealthy-detection
ceiling (start_period 60s + retries 3 * interval 30s = 150s). A
container that legitimately takes 125s to warm up would previously
fail the deploy at 120s while Docker would still report it as
"starting".
Operability
- F3: document the `--no-deps` escape hatch in the compose file header
so an operator can restart Caddy during a gitnexus outage without
being trapped by the service_healthy dependency (e.g. emergency
Caddyfile fix while gitnexus is thrashing).
- F7: rewrite the misleading service_healthy comment. The old text
said it prevents 502s "after a restart", implying continuous
protection. Clarified that depends_on only governs initial compose
up ordering — during force-recreates Caddy briefly routes to a
starting gitnexus and the deploy's health poll is the actual guard.
- F6: add `shopt -s nullglob` before the prune loop so an empty
/opt/gitnexus/indexes directory is an explicit no-op instead of
relying on the quirk that `rm -rf "*"` (with literal "*") silently
succeeds. Next reader won't have to recognize the bash default.
- F8: soft-fail PR artifact downloads when the artifact disappeared
between resolve and download. Main/dev artifact failures stay fatal
because a missing main/dev index is a real deploy failure, but a
deleted PR artifact no longer aborts the whole deploy.
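The prune loop from F6 might be structured like the sketch below. It requires bash because of `shopt`; the function name `prune_indexes` and the convention of passing the active set as arguments are assumptions, not the workflow's actual code.

```shell
# prune_indexes ROOT [ACTIVE_NAME...]   (bash only)
# Removes every directory under ROOT whose name is not in the active set.
# The body runs in a subshell so nullglob does not leak into the caller.
prune_indexes() (
  root=$1
  shift
  shopt -s nullglob   # an empty ROOT expands to nothing: explicit no-op
  for dir in "$root"/*/; do
    name=$(basename "$dir")
    keep=0
    for active in "$@"; do
      if [ "$name" = "$active" ]; then
        keep=1
      fi
    done
    if [ "$keep" -eq 0 ]; then
      rm -rf -- "$dir"
    fi
  done
)
```

Without nullglob, an empty directory would leave the pattern unexpanded and the loop would run once on a literal `*/` path, which is the quirk the finding wanted to stop relying on.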
* review: resolve 3 NITs from third review pass
- F1: rewrite the printf '%s' comment. The previous version claimed
docker login --password-stdin rejects trailing newlines, which is
inaccurate — docker login strips whitespace. The real reason for
printf over echo is byte-exact output and portability, and the
token-in-process-table security rationale is already documented
in the preceding sentences.
- F2: when a PR artifact download soft-fails, the PR's name stays
in active_names so the prune step keeps the droplet's existing
copy instead of wiping it (stale > empty). Make this transition
visible by spelling it out in the ::warning:: message.
- F3: fencepost fix in the health poll. The previous loop ran 36
iterations and claimed "180s" in the comment, but the final
iteration exits without a trailing sleep, so the real ceiling
was 35 * 5s = 175s. Extended to 37 iterations (36 sleeps * 5s
= 180s) so the comment matches reality.
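The fencepost in F3 is easier to see in code. This is a sketch of a poll loop with the same shape, assuming a hypothetical `wait_healthy` helper that takes the check command as a parameter: N attempts produce N-1 sleeps, so 37 attempts at 5s wait at most 180s.

```shell
# wait_healthy CHECK_CMD ATTEMPTS INTERVAL
# Runs CHECK_CMD up to ATTEMPTS times, sleeping INTERVAL seconds between
# attempts but not after the last one, so the maximum wait is
# (ATTEMPTS - 1) * INTERVAL seconds.
wait_healthy() {
  check=$1
  attempts=$2
  interval=$3
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$check"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$(( i + 1 ))
  done
  return 1
}
```

In a deploy step the check could be a function that compares the output of `docker inspect --format '{{.State.Health.Status}}' gitnexus` with `healthy` (the container name is an assumption), invoked as `wait_healthy check_fn 37 5`.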
| File |
|---|
| Caddyfile |
| docker-compose.yml |
| Dockerfile |
| entrypoint.sh |
| install-extensions.js |