LibreChat

mirror of https://github.com/danny-avila/LibreChat.git synced 2026-06-26 17:31:27 +00:00

Author	SHA1	Message	Date
Danny Avila	7554f2e9e9	🩹 fix: Bump GitNexus to 1.6.5 and Fail-Soft the PR Index Job (#13569 ) * 🩹 fix: Bump GitNexus to 1.6.5 and Fail-Soft the PR Index Job The GitNexus Index workflow began failing on most PRs with "Analysis failed: Maximum call stack size exceeded". Root cause is in the pinned gitnexus@1.5.3 CLI: pipeline.js does `deferredWorkerCalls.push(...chunkWorkerData.calls)`, and once a chunk yields more extracted calls than V8's argument-count limit (~125k on this repo) the spread-push throws a RangeError. It is deterministic on repo size, not flaky — LibreChat simply grew past the threshold, so it fails "more often" as more branches cross it. Stack-size flags don't help; it's an arg-count limit, not stack depth. gitnexus@1.6.5 refactored that code path (the .calls spread-pushes are gone) and indexes this repo cleanly. Bump the indexer, the deploy image tag/build-arg, and the Dockerfile default in lockstep (an index written by 1.6.5 must be served by a 1.6.5 server), and move the co-pinned @ladybugdb/core to 0.16.1 to match. Also make the index job fail-soft on pull_request events so a future tool-internal crash degrades gracefully instead of red-X'ing PRs. Push, dispatch, and /gitnexus command runs still fail loudly, keeping the deploy-gating and completion-comment logic correct. * 🐳 fix: Unbreak the GitNexus Deploy Image for 1.6.5 Addresses two issues in the deploy image surfaced after the 1.6.5 bump: - The image build's lbug-adapter patch grepped dist/mcp/core/lbug-adapter.js for "LOAD EXTENSION fts", but in 1.6.5 that file is a shim re-export and the FTS load moved to dist/core/lbug/lbug-adapter.js. The grep would fail the build on the next image rebuild. The patch is also obsolete: 1.6.5 loads the vector extension itself via loadVectorExtension. Removed the patch step. - The image installed only gitnexus, letting @ladybugdb/core resolve freely via gitnexus's ^0.16.1 range while the index workflow pins 0.16.1 exactly. 
Pin the native DB in the image too (nested under gitnexus so install-extensions.js keeps resolving it), restoring the intended indexer/server lockstep.	2026-06-07 08:03:28 -04:00
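The spread-push failure described in this commit can be illustrated in isolation. The sketch below is hypothetical: `appendAll`, the chunk size, and the 300,000-element array are illustrative choices, not GitNexus code. It appends a large array in bounded slices so that no single call receives more arguments than the engine accepts, whereas `target.push(...source)` passes every element as a separate argument.

```javascript
// Hypothetical helper, not taken from GitNexus: append `source` to `target`
// without spreading the whole array into a single call.
function appendAll(target, source, chunkSize = 10000) {
  for (let i = 0; i < source.length; i += chunkSize) {
    // Each push receives at most `chunkSize` arguments.
    target.push(...source.slice(i, i + chunkSize));
  }
  return target;
}

const calls = Array.from({ length: 300000 }, (_, i) => i);

// Spreading every element at once may throw a RangeError on V8, depending on
// the engine's argument limit. The outcome is recorded rather than assumed.
let spreadOutcome = "ok";
try {
  [].push(...calls);
} catch (err) {
  spreadOutcome = err instanceof RangeError ? "RangeError" : "other error";
}

const merged = appendAll([], calls);
console.log(spreadOutcome, merged.length);
```

Because the limit is on argument count rather than recursion depth, bounding the slice size is what matters here; raising the stack size would not change the per-call argument count.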
Danny Avila	76e9543f99	🧹 chore: Cap PR Indexes at 3 and Add Delete-Before-Sync (#12672 ) * fix: add docker system prune before image pull to prevent disk exhaustion The 60GB droplet filled up after ~40 deploys because each docker compose pull leaves the previous image's layers as dangling/unused. The gitnexus image is ~700MB, so ~40 stale copies ≈ 28GB of dead layers. Combined with indexes, OS, and Docker's build cache, the disk hits 100% and the next pull fails with 'no space left on device'. Add a docker system prune -af --volumes BEFORE pulling the new image on every deploy. This removes stopped containers, unused networks, all images not referenced by a running container, and build cache. Running containers are never touched. Typically frees 1-2GB per deploy (the previous image's layers). Also add a hard 2GB free-space guard after prune so the deploy fails with a clear error instead of letting docker pull attempt a 700MB extract onto a near-full disk. * fix: cap PR indexes at 3 + delete-before-sync for 10GB disk The 10GB droplet has ~2GB free. Each index is ~130MB, so 7 PR indexes (~900MB) plus main+dev (~260MB) plus the ~700MB Docker image leaves almost nothing for image pulls. The deploy failed with 'no space left on device' during docker compose pull. Three changes: 1. Cap PR indexes at MAX_PR_INDEXES=3. The resolve step now sorts PR artifacts by created_at descending and only keeps the 3 most recent. Older PR indexes are logged as evicted and their droplet folders get cleaned by the prune step. 2. Prune BEFORE sync (was after). Freeing disk space from evicted indexes before rsyncing new data is critical on a tight disk. The old order (sync then prune) could briefly hold both old evicted indexes and newly-uploaded ones simultaneously. 3. Delete-before-sync for every index, including main/dev. 
Instead of rsync --delete (which transfers new files then removes extras), rm -rf the target folder before rsync so the disk never holds both old and new copies of the same index (~260MB saved per index). Main/dev are only deleted when a fresh artifact is about to replace them, never evicted between deploys.
Budget on 10GB disk:
  OS + Docker engine:      ~4.0 GB
  Docker image (running):  ~0.7 GB
  main + dev indexes:      ~0.26 GB
  3 PR indexes:            ~0.39 GB
  Docker prune headroom:   ~0.7 GB (for image pull)
  Free:                    ~3.9 GB
* refine: restrict automatic PR indexing to danny-avila authored PRs With 200+ open PRs and a 10GB disk capped at 3 served PR indexes, auto-indexing every contributor PR burns CI minutes for artifacts that will mostly be evicted before anyone queries them. Narrow the pull_request auto-trigger to PRs authored by danny-avila only. Other contributors' PRs can still be indexed on demand via /gitnexus index (contributor-gated comment command) or manual workflow_dispatch; both arrive as workflow_dispatch events and bypass the pull_request filter entirely. * fix: drop --volumes from docker system prune to preserve Caddy TLS state The deploy workflow explicitly handles a caddy-not-running state later in the same step. If Caddy is stopped when the prune runs, --volumes deletes the caddy-data and caddy-config volumes (TLS certs + ACME account keys), forcing a Let's Encrypt re-issuance on next start. LE rate-limits to 5 certs per domain per week, so repeated wipes could brick HTTPS for days. docker system prune -af (without --volumes) still removes stopped containers, unused networks, all dangling/unreferenced images, and build cache, which is where the disk savings come from. Named volumes are left untouched. * fix: rsync-then-swap instead of delete-before-sync The delete-before-sync pattern removed the live index BEFORE rsync ran.
If rsync failed (SSH timeout, disk pressure, network error), the index was already gone — production served nothing for that repo until a later deploy succeeded. Replace with rsync-then-swap: upload to a .new temp directory, and only rm + mv into place after rsync succeeds. On rsync failure, the .new temp is cleaned up and the old index stays live. The cost is ~130MB of extra disk while both old and new coexist, but the prune step runs first and frees evicted PR indexes, so this fits comfortably on the 10GB disk. * fix: fail deploy on main/dev rsync failure, soft-fail PRs only The rsync-then-swap pattern downgraded ALL failures to a warning, so the deploy continued even when LibreChat or LibreChat-dev failed to sync. The job would pull the new image, restart the container, and report success while serving stale or missing core indexes. Split by criticality: main/dev rsync failures now exit 1 (aborting the deploy before the container restart). PR index failures remain soft-fail with a warning — a missing PR index is inconvenient but shouldn't take the whole server down.	2026-04-15 09:46:48 -04:00
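The "cap PR indexes at 3" rule from this commit can be sketched as a sort-and-slice. This is a minimal sketch under assumptions: the artifact objects carry only `name` and `created_at`, the PR numbers and timestamps are invented sample data, and `selectPrIndexes` is a hypothetical name rather than the workflow's actual resolve step.

```javascript
// Assumed cap, mirroring MAX_PR_INDEXES=3 from the commit message.
const MAX_PR_INDEXES = 3;

// Hypothetical helper: keep the newest `max` artifacts, report the rest as evicted.
function selectPrIndexes(artifacts, max = MAX_PR_INDEXES) {
  // Sort a copy newest-first so the caller's array is left untouched.
  const sorted = [...artifacts].sort(
    (a, b) => Date.parse(b.created_at) - Date.parse(a.created_at)
  );
  return { kept: sorted.slice(0, max), evicted: sorted.slice(max) };
}

// Invented sample data for illustration only.
const artifacts = [
  { name: "gitnexus-index-pr-101", created_at: "2026-04-10T09:00:00Z" },
  { name: "gitnexus-index-pr-102", created_at: "2026-04-14T09:00:00Z" },
  { name: "gitnexus-index-pr-103", created_at: "2026-04-12T09:00:00Z" },
  { name: "gitnexus-index-pr-104", created_at: "2026-04-15T09:00:00Z" },
  { name: "gitnexus-index-pr-105", created_at: "2026-04-11T09:00:00Z" },
];

const { kept, evicted } = selectPrIndexes(artifacts);
for (const artifact of evicted) {
  console.log(`evicted: ${artifact.name}`);
}
console.log(kept.map((a) => a.name).join(", "));
```

Returning the evicted set explicitly, rather than dropping it, matches the commit's note that older PR indexes are logged as evicted before the prune step removes their folders.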
Danny Avila	546f006e42	💬 feat: Serialize GitNexus Deploys and Post Completion Comments on PR Commands (#12623) Three related changes that tighten the GitNexus CI/CD loop. Serialized deploys - Previous concurrency group was keyed by head ref with cancel-in-progress, which let deploys targeting different refs (e.g. main push + PR command) run in parallel. That's a data race: the prune-stale-indexes step computes active_names up front, so deploy A rsyncing /opt/gitnexus/indexes/LibreChat-pr-12580 can collide with deploy B pruning the same folder based on a pre-rsync view of the active set. - Collapse to a single global group gitnexus-deploy with cancel-in-progress: false. All deploys queue behind one another. An rsync/docker-compose restart is never killed mid-operation. The 20-minute job timeout bounds queue depth. PR completion feedback - Add an "index complete" comment step in gitnexus-index.yml that fires only when inputs.pr_number is set (i.e. the run came via the /gitnexus command). Posts success or failure with a link to the run and whether embeddings were generated. - Add a "deploy complete" comment step in gitnexus-deploy.yml that handles both trigger paths: workflow_run from a native PR auto-index (PR number recovered from the matrix entry whose runId matches the trigger run), and workflow_dispatch from the index workflow's bot-fallback path (PR number passed through as a new inputs.pr_number). - Plumb inputs.pr_number through the bot-fallback dispatch in gitnexus-index.yml so the deploy workflow knows where to comment for command-triggered runs. - Only comments on the PR that asked for the index, never broadcasts. Workflow rename - Drop the "DigitalOcean" suffix from the deploy workflow's display name and filename. The platform is still DO (.do/gitnexus/ still holds the compose + caddy config) but the workflow itself is platform-agnostic in form and the suffix was visual noise. - File renamed gitnexus-deploy-do.yml -> gitnexus-deploy.yml.
- Concurrency group and all cross-references updated in lock-step. - permissions at deploy job level now includes pull-requests: write so the completion comment can post.	2026-04-11 18:15:56 -04:00
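The two trigger paths for the "deploy complete" comment can be sketched as a single lookup. This is only an illustration: `resolveCommentTarget`, the argument names, and the `{ runId, prNumber }` shape of the matrix entries are assumptions, not the workflow's real data model.

```javascript
// Hypothetical helper: decide which PR (if any) should receive the comment.
function resolveCommentTarget({ eventName, inputPrNumber, triggerRunId, matrix }) {
  if (eventName === "workflow_dispatch") {
    // Command-triggered runs pass the PR number through as an input.
    return inputPrNumber ? Number(inputPrNumber) : null;
  }
  if (eventName === "workflow_run") {
    // Auto-index runs: find the matrix entry produced by the triggering run.
    const entry = matrix.find((e) => e.runId === triggerRunId);
    return entry && entry.prNumber ? entry.prNumber : null;
  }
  // Any other event has no PR to notify, so nothing is posted.
  return null;
}

// Invented sample matrix; main has no PR number attached.
const matrix = [
  { name: "LibreChat", runId: 9001, prNumber: null },
  { name: "LibreChat-pr-12580", runId: 9002, prNumber: 12580 },
];

const fromRun = resolveCommentTarget({ eventName: "workflow_run", triggerRunId: 9002, matrix });
const fromDispatch = resolveCommentTarget({ eventName: "workflow_dispatch", inputPrNumber: "12580", matrix });
const fromMainPush = resolveCommentTarget({ eventName: "workflow_run", triggerRunId: 9001, matrix });
console.log(fromRun, fromDispatch, fromMainPush);
```

Returning `null` when no PR can be identified reflects the stated rule that the workflow only comments on the PR that asked for the index and never broadcasts.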
Danny Avila	8eab39bc8f	🌊 feat: Add GitNexus DigitalOcean Pipeline with PR Index Serving (#12612 ) * feat: migrate GitNexus deployment from Fly.io to DigitalOcean droplet Fly.io's 1GB machine was pegged at ~900MB memory with load spiking to 2.7 under even modest query load. Moving to a 2GB+ DO droplet that can take advantage of existing credits. Architecture change: indexes no longer baked into the image. Instead, a long-lived image (built only when .do/gitnexus/ changes) is pulled from GHCR, and the deploy workflow rsyncs .gitnexus/ data into /opt/gitnexus/indexes/<name>/ on the droplet and restarts only the gitnexus container. Caddy stays running so TLS certs don't churn. - Add .do/gitnexus/Dockerfile (same native-addon + extension patch layers as the Fly variant, but no COPY indexes/ step) - Add .do/gitnexus/docker-compose.yml with gitnexus + caddy services on an internal bridge network, 1.8GB memory limit, healthcheck - Add .do/gitnexus/Caddyfile with automatic HTTPS for the configured subdomain and bearer token auth for all routes except /health - Add .do/gitnexus/entrypoint.sh that registers every index mounted at /indexes/<name>/.gitnexus at container start, then runs gitnexus serve bound to 0.0.0.0 (internal docker network only) - Add .do/gitnexus/install-extensions.js for LadybugDB FTS/vector extension pre-install (workaround for upstream bug) - Add .github/workflows/gitnexus-deploy-do.yml that builds the image only on Dockerfile/entrypoint changes, pushes to GHCR, rsyncs the index artifacts to the droplet, and restarts the gitnexus container - Remove .fly/gitnexus/ and .github/workflows/gitnexus-deploy.yml — Fly app will be destroyed after DO deploy is verified working Required new secrets: DO_HOST, DO_USER, DO_SSH_KEY. GITNEXUS_DOMAIN and API_TOKEN live in /opt/gitnexus/.env on the droplet itself. 
* refactor: prefix deploy secrets with GITNEXUS_ for namespace isolation Rename DO_HOST -> GITNEXUS_DO_HOST, DO_USER -> GITNEXUS_DO_USER, and DO_SSH_KEY -> GITNEXUS_DO_SSH_KEY so the secrets are clearly scoped to the gitnexus deploy and don't collide with any other DigitalOcean secrets LibreChat might add later. * feat: serve PR indexes alongside main/dev and add /gitnexus command The index workflow was already building and uploading per-PR indexes (gitnexus-index-pr-<N>) for contributor PRs, but the deploy workflow only consumed main and dev artifacts. PR indexes were sitting in storage doing nothing. This wires them all the way through to the live MCP server, with proper cleanup when PRs close. Deploy workflow changes: - Drop the branches filter on workflow_run so PR index completions also trigger deploys (PR indexes are already contributor-gated upstream in gitnexus-index.yml via author_association) - Resolve all open PRs via the GitHub API, look up each one's latest non-expired gitnexus-index-pr-<N> artifact, and serve whichever ones exist. PRs without an index artifact are skipped — we don't retroactively index anything. - Per-ref concurrency group so rapid pushes to the same PR coalesce but different refs still deploy in parallel - After rsyncing active indexes, prune any /opt/gitnexus/indexes/ folder that isn't in the active set. Safety net for missed PR close events. 
New workflow: gitnexus-cleanup-pr.yml - Fires on pull_request closed (merged or not) - SSHs to the droplet, removes /opt/gitnexus/indexes/LibreChat-pr-<N>, restarts the gitnexus container New workflow: gitnexus-pr-command.yml - Listens for issue_comment events where body starts with /gitnexus - Contributor gated via author_association - Supports: /gitnexus index — index with defaults /gitnexus index embeddings — index with --embeddings - Dispatches gitnexus-index.yml with the PR number and head SHA, reacts to the comment with a rocket emoji Index workflow changes: - New dispatch inputs pr_number and pr_ref for command-driven runs - Checkout step uses inputs.pr_ref when set so the PR's head commit is analyzed instead of the default branch - Artifact naming falls back through pr_number -> pull_request number -> ref_name, keeping existing behavior for push/PR events - Concurrency group switches to pr-<N> when dispatched by the command so re-runs on the same PR debounce correctly * chore: remove Fly variant reference from Dockerfile header The Fly variant no longer exists in the repo, so the comparison comment is meaningless. Rewritten as a standalone description of the image's design. * review: resolve 15 findings from review audit Critical - Drop the unused caddy binary from the gitnexus image. Caddy runs in its own container in this architecture; installing it inside the gitnexus image added ~40-60MB for no reason and contradicted the Dockerfile header comment. Major - Replace ssh-keyscan TOFU with a GITNEXUS_DO_KNOWN_HOST secret. Both deploy and cleanup workflows now pin the droplet's host key from a stored value instead of silently trusting whatever the host presents at deploy time. Fails the workflow if the secret is empty so no one accidentally regresses to TOFU. - Gate gitnexus-cleanup-pr.yml to same-repo PRs via github.event.pull_request.head.repo.full_name == github.repository. Fork PR closes no longer produce failed runs when secrets are withheld by GitHub. 
The deploy workflow's stale-folder prune step remains the safety net for any fork-contributor indexes. - Fail fast in entrypoint.sh when main/dev index registration errors. Previously `|| echo WARN` swallowed failures so a broken index passed the docker healthcheck and the deploy was marked green while queries returned empty. PR indexes stay best-effort (a corrupt PR index shouldn't take the whole server down). - Authenticate the droplet with GHCR on every deploy using GITHUB_TOKEN, so private GHCR packages work without documentation detours or manual docker login on the host. Bootstrap comments explain the flow. - Switch docker-compose caddy.depends_on from the short-form (service_started) to service_healthy so Caddy doesn't route to a gitnexus container that's still starting (500ms-60s window after recreation). Minor - Guard the HEAD~1 diff with `git rev-parse --verify HEAD~1` so the first-commit and workflow_run-from-PR cases default to rebuild instead of silently skipping a legitimately-changed image. - Move `packages: write` off the workflow-level permissions and onto the build-image job. deploy no longer inherits unnecessary GHCR write access. - Skip the SSH session in cleanup-pr.yml when no gitnexus-index-pr-<N> artifact ever existed for the PR. Eliminates ~95% of no-op SSH round-trips on a busy repo (docs-only PRs, paths-ignored PRs, etc). - Reload Caddy in-place after config upload with `caddy reload`, falling back to force-recreate on reload failure and `compose up` on first-time bootstrap. Picks up Caddyfile or env changes without losing TLS certs. - Replace `sleep 5` post-deploy with a real readiness poll against docker's health status. Fails the workflow if gitnexus doesn't report healthy within 120s, so a broken startup surfaces in CI instead of being silently marked green. - Warn when listPulls hits the 100-item per_page ceiling so a future growth spurt past 100 open PRs doesn't silently drop indexes.
Nit - Tighten NODE_OPTIONS --max-old-space-size from 1536 to 1280MB, giving KuzuDB's C++ heap ~512MB of room under the 1792MB cgroup limit instead of ~256MB. - Rewrite the stale "headroom for Caddy" comment in entrypoint.sh (Caddy lives in a separate container now). - Restore load-bearing comments in install-extensions.js explaining the @ladybugdb/core path layout and the throwaway-db cache-priming pattern. - Parameterize the docker-compose image reference as ${GITNEXUS_IMAGE:-ghcr.io/danny-avila/librechat-gitnexus:latest} so forks or pinned version tags can override via /opt/gitnexus/.env. Deferred - Finding 12 (memory headroom) addressed partially via the heap cap reduction; full profiling of KuzuDB C++ allocations under query load deferred to post-deploy monitoring. * review: resolve 8 follow-up findings from second review pass Security - F1: pipe GHCR token via SSH stdin instead of expanding it into the remote command string. Previously `"echo '$GH_TOKEN' | docker login"` expanded the token locally before SSH sent it as an argument, so the live token was briefly visible in /proc/<pid>/cmdline on the droplet to any process running as deploy or root. New form uses `printf '%s' "$GH_TOKEN" | ssh ... "docker login --password-stdin"` so the token only travels through the encrypted SSH stdin pipe. Reliability - F2: add json-file log rotation (50m x 3 files) via a YAML anchor shared by both services. Default Docker logging is unbounded and would eventually fill the 60GB droplet disk. - F4: set memswap_limit=1792m to match mem_limit. Without this, Docker lets the container silently spill onto host swap when KuzuDB's C++ heap overruns the 1792m RAM budget, turning sub-second graph queries into multi-second ones with no alert. Hard OOM-kill is preferable: unless-stopped restarts the container, the deploy health poll catches it, the failure is explicit.
- F5: extend the post-deploy health poll from 24 iterations (120s) to 36 iterations (180s) so it clears Docker's own unhealthy-detection ceiling (start_period 60s + retries 3 * interval 30s = 150s). A container that legitimately takes 125s to warm up would previously fail the deploy at 120s while Docker would still report it as "starting". Operability - F3: document the `--no-deps` escape hatch in the compose file header so an operator can restart Caddy during a gitnexus outage without being trapped by the service_healthy dependency (e.g. emergency Caddyfile fix while gitnexus is thrashing). - F7: rewrite the misleading service_healthy comment. The old text said it prevents 502s "after a restart", implying continuous protection. Clarified that depends_on only governs initial compose up ordering — during force-recreates Caddy briefly routes to a starting gitnexus and the deploy's health poll is the actual guard. - F6: add `shopt -s nullglob` before the prune loop so an empty /opt/gitnexus/indexes directory is an explicit no-op instead of relying on the quirk that `rm -rf ""` (with literal "") silently succeeds. Next reader won't have to recognize the bash default. - F8: soft-fail PR artifact downloads when the artifact disappeared between resolve and download. Main/dev artifact failures stay fatal because a missing main/dev index is a real deploy failure, but a deleted PR artifact no longer aborts the whole deploy. * review: resolve 3 NITs from third review pass - F1: rewrite the printf '%s' comment. The previous version claimed docker login --password-stdin rejects trailing newlines, which is inaccurate — docker login strips whitespace. The real reason for printf over echo is byte-exact output and portability, and the token-in-process-table security rationale is already documented in the preceding sentences. 
- F2: when a PR artifact download soft-fails, the PR's name stays in active_names so the prune step keeps the droplet's existing copy instead of wiping it (stale > empty). Make this transition visible by spelling it out in the ::warning:: message. - F3: fencepost fix in the health poll. The previous loop ran 36 iterations and claimed "180s" in the comment, but the final iteration exits without a trailing sleep, so the real ceiling was 35 * 5s = 175s. Extended to 37 iterations (36 sleeps * 5s = 180s) so the comment matches reality.	2026-04-11 13:04:46 -04:00
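The fencepost arithmetic in the final review pass can be checked with a small simulation. This is a sketch under assumptions: `pollCeilingSeconds` is a hypothetical name, and the loop only models a poll that checks health on every iteration and sleeps between checks, which is the behavior the commit message describes rather than the workflow's actual shell loop.

```javascript
// Hypothetical model of the poll: check on every iteration, sleep only
// between checks, so N iterations produce N - 1 sleeps.
function pollCeilingSeconds(iterations, sleepSeconds) {
  let waited = 0;
  for (let i = 1; i <= iterations; i++) {
    // (the health check would run here)
    if (i < iterations) {
      waited += sleepSeconds; // no trailing sleep after the final check
    }
  }
  return waited;
}

// Docker's own unhealthy-detection ceiling, from the values quoted in the
// commit message: start_period 60s + retries 3 * interval 30s.
const dockerUnhealthyCeiling = 60 + 3 * 30;

const before = pollCeilingSeconds(36, 5);
const after = pollCeilingSeconds(37, 5);
console.log(before, after, dockerUnhealthyCeiling);
```

Under this model 36 iterations wait 175 seconds and 37 iterations wait 180 seconds, and both exceed the 150-second Docker ceiling, so the extra iteration only makes the "180s" comment accurate.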
Danny Avila	711747a5a0	🔎 fix: Install LadybugDB Extensions and Patch Vector Load for GitNexus Search (#12607 ) Two upstream GitNexus 1.5.3 bugs combine to break query() in serve mode: 1. pool-adapter.ts calls LOAD EXTENSION fts but never INSTALL fts. The CI-produced .gitnexus/ artifact doesn't include the extension cache (~/.kuzu/extension/), so LOAD silently fails in the fresh container and all BM25/FTS searches return empty. 2. pool-adapter.ts only loads the FTS extension — it never loads the vector extension. Every CALL QUERY_VECTOR_INDEX in semanticSearch fails, so hybrid search's semantic leg returns empty too. Combined, query() returns empty even for exact function names because both the BM25 and semantic legs of RRF merging have zero inputs. Meanwhile context()/cypher()/route_map() still work because they use plain Cypher with no extensions. Workaround: - Add install-extensions.js that runs INSTALL fts and INSTALL vector against a throwaway database during Docker build, populating ~/.kuzu/extension/ with the cached extension binaries - Sed-patch pool-adapter.js at build time to also LOAD EXTENSION vector alongside the existing FTS load, wrapped in try/catch - Update deploy workflow to copy install-extensions.js into the build context	2026-04-10 13:41:55 -04:00
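The claim that query() returns nothing when both search legs are empty follows directly from how reciprocal rank fusion (RRF) works. The sketch below is a generic RRF merge, not GitNexus's implementation: `rrfMerge`, the constant `k = 60`, and the sample result IDs are assumptions chosen for illustration.

```javascript
// Generic reciprocal rank fusion: each list contributes 1 / (k + rank) to
// every ID it contains, with ranks starting at 1.
function rrfMerge(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      scores.set(id, (scores.get(id) || 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Both legs empty (FTS not installed, vector extension not loaded):
const bothBroken = rrfMerge([[], []]);

// Invented sample rankings for a healthy BM25 leg and semantic leg:
const healthy = rrfMerge([
  ["a", "b", "c"],
  ["b", "c", "d"],
]);
console.log(bothBroken, healthy);
```

RRF only reorders candidates that at least one leg produced, so with zero inputs from both legs there is nothing to score, even for an exact function name.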
Danny Avila	a121ae3dea	🩺 fix: Capture GitNexus Serve Output and Add Startup Health Check (#12599 ) * fix: capture gitnexus serve output and add startup health check gitnexus serve was backgrounded with no log capture, so crash reasons were invisible in Fly logs. Now pipes output to stdout and waits up to 30s for the server to be ready before starting Caddy, with early exit if the process dies. * fix: install newer libstdc++ for LadybugDB native addon @ladybugdb/core prebuilt binary requires GLIBCXX_3.4.32 (GCC 13+) but node:24-slim ships Bookworm's libstdc++6 which only has 3.4.31. Pull libstdc++6 from Debian Trixie to satisfy the runtime dependency. * fix: separate build tools from libstdc++ upgrade to avoid conflict Installing Trixie's libstdc++6 alongside Bookworm's g++ fails because the Trixie libc6 transitive dep conflicts with Bookworm's libc6-dev. Split into two RUN steps: compile native addons first, remove g++, then upgrade libstdc++ from Trixie with no conflicting packages. * fix: name repo as LibreChat and add dev branch deploy support - Use REPO_NAME build arg (default: LibreChat) as the WORKDIR so gitnexus registers the index with a proper name instead of "repo" - Deploy workflow now triggers on both main and dev branch index runs - Dev branch registers as "LibreChat-dev", main as "LibreChat" - workflow_dispatch gains a branch selector input * feat: serve both main and dev indexes from one container + auto-embed main - Dockerfile now copies multiple indexes from indexes/<name>/.gitnexus and registers each with gitnexus, so one server handles both branches - Deploy workflow downloads latest successful main + dev artifacts in parallel and bundles them into a single deploy - list_repos returns LibreChat and LibreChat-dev; queries target either via the repo parameter - Main branch pushes auto-enable --embeddings for semantic search; dev and PRs remain graph-only for speed (opt-in via dispatch input) - Bump index job timeout to 25m to account for embedding 
generation * fix: raise Fly machine memory to 1GB + add swap + cap Node heap The 512MB machine was getting OOM-killed when gitnexus serve's default --max-old-space-size=8192 over-committed memory during query spikes. - Bump VM memory 512mb -> 1gb - Add 512MB swap file to absorb transient spikes - Cap Node heap at 768MB via NODE_OPTIONS so it stays within the machine's real capacity and leaves headroom for Caddy and the OS	2026-04-10 12:35:11 -04:00
Danny Avila	01a1bc1689	📊 experimental: Add GitNexus CI/CD and deployment configuration (#12577 ) * feat: Add GitNexus CI/CD and deployment configuration - Introduced a Dockerfile for building the GitNexus application with necessary dependencies and configurations. - Added a Caddyfile to set up a reverse proxy with bearer token authentication for secure access to GitNexus. - Created an entrypoint script to validate the API token and start both GitNexus and Caddy services. - Configured Fly.io deployment settings in fly.toml, including health checks and service parameters. - Established GitHub Actions workflows for deploying the GitNexus index and managing deployments to Fly.io. * fix: use npx instead of bunx for native addon compatibility bunx skips node-gyp lifecycle scripts, so @ladybugdb/core's native .node binary never gets compiled/downloaded. npx handles this correctly.	2026-04-08 16:17:19 -04:00

7 commits