LibreChat/packages/api/src
Danny Avila c5610871d0
🐌 fix: Prevent ReDoS in YouTube URL Extraction for URL Context (#13937)
* 🛡️ fix: Prevent ReDoS in YouTube URL extraction for URL Context

The YouTube detection/strip regexes ran as a single global pass over
authenticated, user-controlled chat text. The engine could restart at every
`youtube.com/watch?` occurrence and the lazy `\S*?&` rescanned the rest of a
long non-whitespace token each time, giving quadratic CPU behavior that blocks
the Node event loop (DoS) for Google/Vertex agents with url_context enabled.

- Tokenize on whitespace and skip tokens longer than a real URL, and cap the
  total text scanned, so work is bounded to O(n). URLs never contain whitespace,
  so per-token matching is equivalent.
- Replace the lazy unbounded `(?:\S*?&)?` with the delimiter-bounded
  `(?:[^\s&]*&)*` (no behavior change for real URLs).
- Apply the same discipline to the strip path.
- Add ReDoS regression tests; a 3MB crafted input now completes in <10ms.
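
A minimal sketch of the per-token approach described above, not the committed code. `MAX_TOKEN_LENGTH`, `WATCH_TOKEN`, and `extractWatchIds` are hypothetical names, the cap value is an assumption, and only the `watch?` form is covered. The pattern is anchored to the token start, so a link preceded by punctuation is missed here.

```typescript
// Hypothetical cap; the commit only says "longer than a real URL".
const MAX_TOKEN_LENGTH = 2048;

// Delimiter-bounded query prefix: each repetition consumes one `&`-terminated
// segment, so the engine has a single way to split the query.
const WATCH_TOKEN =
  /^(?:https?:\/\/)?(?:www\.|m\.)?youtube\.com\/watch\?(?:[^\s&]*&)*v=([\w-]{11})/i;

function extractWatchIds(text: string): string[] {
  const ids: string[] = [];
  // URLs contain no whitespace, so each candidate sits inside one token.
  for (const token of text.split(/\s+/)) {
    if (token.length === 0 || token.length > MAX_TOKEN_LENGTH) continue;
    const id = WATCH_TOKEN.exec(token)?.[1];
    if (id !== undefined) ids.push(id);
  }
  return ids;
}
```

An oversized token is skipped before any regex runs, which is what keeps the work per message proportional to its length.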

* 🛡️ fix: Bound the YouTube strip scan by the same total budget

Address Codex P1: the strip path applied only the per-token cap, so a valid URL
followed by many sub-cap malformed tokens still regex-scanned the entire message
(~1s on 3MB). Injected ids only come from the first MAX_YOUTUBE_SCAN_CHARS
(extraction's cap), so a link beyond that is never in injectedIds anyway; cap the
strip scan at the same budget and leave the tail verbatim. 3MB PoC: ~1s -> ~14ms.

* 🧬 fix: Make YouTube URL matching linear instead of capping the scan

The previous fix bounded the scan with per-token + total-scan caps, but the
total-scan cap discarded content: a URL near the end of a long prompt was missed
(extraction sliced to 100k), and large prepended file/quote context exhausted the
strip budget before the real URL (strip skipped it). Codex round 2 (P2 x2).

Replace the backtracking-prone matcher with a linear one: a single regex captures
host + path/query (greedy `[^\s]*`, bounded `{1,63}`/`{0,10}` subdomain repetition,
no lazy/ambiguous quantifier), and the video id is parsed from the capture
afterwards. This is O(n) over arbitrary input, so the scan caps (and the content
they discarded) are removed entirely. Extraction and stripping now scan the whole
message linearly.

Benchmarks (no caps): 3MB attack token ~3ms, 3MB many-token ~4ms, valid URL at end
of 3MB found in ~18ms. Adds regression tests for long-prompt extraction and
stripping past large prepended context.
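
The linear matcher can be sketched roughly as follows. This is an illustration built from the description above, not the committed code: `YOUTUBE_URL`, `parseVideoId`, and `extractYouTubeIds` are hypothetical names, and host validation is deliberately simplified.

```typescript
// Group 1: host, with at most 10 subdomain labels of 1-63 characters each.
// Group 2: path and query, captured greedily up to the next whitespace.
const YOUTUBE_URL =
  /\b((?:[a-z0-9-]{1,63}\.){0,10}(?:youtube\.com|youtu\.be))(\/[^\s]*)/gi;

// The id is parsed from the captures afterwards with anchored or
// single-pass patterns, so no lazy quantifier is involved.
function parseVideoId(host: string, path: string): string | undefined {
  if (/(?:^|\.)youtu\.be$/i.test(host)) {
    return /^\/([\w-]{11})(?![\w-])/.exec(path)?.[1];
  }
  if (!/^\/watch\?/i.test(path)) return undefined;
  return /[?&]v=([\w-]{11})(?![\w-])/.exec(path)?.[1];
}

function extractYouTubeIds(text: string): string[] {
  const ids: string[] = [];
  YOUTUBE_URL.lastIndex = 0; // a global regex keeps state between calls
  let match: RegExpExecArray | null;
  while ((match = YOUTUBE_URL.exec(text)) !== null) {
    const id = parseVideoId(match[1] ?? '', match[2] ?? '');
    if (id !== undefined && ids.indexOf(id) === -1) ids.push(id);
  }
  return ids;
}
```

Because the greedy tail consumes the whole token in one match, a token made of repeated `youtube.com/watch?` fragments yields a single match rather than one restart per fragment.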

* 🔡 fix: Match adjacent + capitalized YouTube URLs after linear rewrite

Codex round 3 (regressions from the linear matcher):
- Stop the path capture at URL-list delimiters (`,` `)` `]` `<` `>`, none of which
  occur in a real YouTube URL) so adjacent links in one token (comma-separated or
  markdown `](url1)](url2)`) are matched separately instead of swallowed.
- Lowercase the path segment before matching route names, since the detection regex
  is case-insensitive (`/WATCH?v=`, `/EMBED/`).

* 🔒 fix: Allowlist URL chars + bounded path parsing for YouTube matching

Codex round 4:
- Replace the path stop-char blocklist with an allowlist of characters that occur
  in real YouTube URLs, so adjacent links separated by any prose delimiter
  (`;`, `|`, etc.) are matched separately instead of swallowed.
- Parse the route with anchored, bounded regexes instead of `path.split('/')`, so a
  malformed path of millions of slashes no longer allocates a huge array / blocks
  the event loop. Also bounds the `v=` param read.
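
A sketch of the allowlist tail and the anchored route parse, under stated assumptions: the commit does not list the characters it permits, so the character class below is a guess, and `YOUTUBE_LINK`, `ROUTE`, and `extractRouteIds` are hypothetical names. Only the path-style routes are handled.

```typescript
// Assumed allowlist of URL characters; `;`, `|`, `,`, `)` and similar prose
// delimiters are absent, so they end the capture.
const YOUTUBE_LINK = /\b(?:www\.|m\.)?youtube\.com(\/[\w\-.\/?=&%#+~]*)/gi;

// Anchored and bounded: looks only at the start of the path and never
// calls path.split('/'), so a flood of slashes allocates nothing.
const ROUTE = /^\/(?:shorts|live|embed|v)\/([\w-]{11})(?![\w-])/i;

function extractRouteIds(text: string): string[] {
  const ids: string[] = [];
  YOUTUBE_LINK.lastIndex = 0;
  let match: RegExpExecArray | null;
  while ((match = YOUTUBE_LINK.exec(text)) !== null) {
    const id = ROUTE.exec(match[1] ?? '')?.[1];
    if (id !== undefined) ids.push(id);
  }
  return ids;
}
```

The `i` flag on `ROUTE` matches capitalized route names such as `/LIVE/` while leaving the captured id in its original case, which matters because video ids are case-sensitive.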

* 🎯 fix: Restrict YouTube matcher to recognized video routes

Codex round 5: a nested video URL inside an unrecognized YouTube URL
(`youtube.com/redirect?q=https://youtu.be/<id>`) was swallowed by the greedy
match and missed. Restrict the matcher to recognized single-video forms
(youtu.be/<id>, /(shorts|live|embed|v)/<id>, /watch?<query>) so an unrecognized
route doesn't match and the global scan continues into the nested link. Stays
linear (verified: 3MB redirect/slash/host floods all <25ms) and keeps the
allowlist tail so adjacent links still split. Adds nested-URL + unrecognized-route
regression tests.
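
One way to picture the route-restricted matcher, as a hedged sketch rather than the committed pattern. `VIDEO_URL` and `findVideoIds` are hypothetical names and the watch-query character class is an assumption.

```typescript
// Only recognized single-video forms can match.
// Group 1: youtu.be id. Group 2: /shorts|live|embed|v/ id. Group 3: watch query.
const VIDEO_URL =
  /\b(?:youtu\.be\/([\w-]{11})(?![\w-])|(?:www\.|m\.)?youtube\.com\/(?:(?:shorts|live|embed|v)\/([\w-]{11})(?![\w-])|watch\?([\w\-.\/=&%#+~]*)))/gi;

function findVideoIds(text: string): string[] {
  const ids: string[] = [];
  VIDEO_URL.lastIndex = 0;
  let match: RegExpExecArray | null;
  while ((match = VIDEO_URL.exec(text)) !== null) {
    // Non-participating groups are undefined at runtime, so fall through.
    const id =
      match[1] ??
      match[2] ??
      /(?:^|&)v=([\w-]{11})(?![\w-])/.exec(match[3] ?? '')?.[1];
    if (id !== undefined) ids.push(id);
  }
  return ids;
}
```

An unrecognized route such as `/redirect` fails right after the host, so nothing is consumed and the global scan moves on to the nested `youtu.be` link.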

* 🎬 fix: Find nested watch links + skip malformed v= duplicates

Codex round 6 (P3 watch-query edges):
- Drop `:` from the path allowlist. It never occurs in a real YouTube path/query,
  but `://` of a nested URL does — so `watch?url=https://youtu.be/<id>` now stops
  the watch match and the scan finds the nested link.
- Scan every `v=` param and return the first valid 11-char id, so a malformed
  earlier `v=` (e.g. `watch?v=tooShort&v=<valid>`) no longer shadows a later valid one.
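
The "first valid `v=`" rule can be illustrated with a small helper. `firstValidVideoId` is a hypothetical name and the function takes the query string without its leading `?`; this is a sketch of the idea, not the committed implementation.

```typescript
// Walk every `v=` parameter in order and return the first value that is a
// well-formed 11-character id; malformed earlier values are skipped.
function firstValidVideoId(query: string): string | undefined {
  const param = /(?:^|&)v=([^&#]*)/g;
  let match: RegExpExecArray | null;
  while ((match = param.exec(query)) !== null) {
    const value = match[1] ?? '';
    if (/^[\w-]{11}$/.test(value)) return value;
  }
  return undefined;
}
```

For `v=tooShort&v=dQw4w9WgXcQ` the first value fails the length check and the second is returned. Each match consumes its own value, so the loop passes over the query once.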

* 🧹 fix: Strip whole YouTube URL incl. colon-containing trailing params

Codex round 7: dropping `:` from the tail (round 6) made the strip path stop mid-URL
on a URL-valued param (`watch?v=<id>&next=https://example.com`), leaving `://example.com`
orphaned. Use a separate strip matcher whose tail re-includes `:` so the whole URL token
is removed, while detection keeps the `:`-excluded tail to still find nested video links.
Also corrects a stale "per-token cap" comment left over from the linear rewrite.
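
The difference between the two tails can be shown with a pair of patterns that differ only in whether `:` is allowed. Everything here is hypothetical (`DETECT_WATCH`, `STRIP_WATCH`, `stripWatchLinks`, and the character classes), and unlike the real strip path, which only removes links whose ids were injected, this sketch removes every watch link.

```typescript
// Detection tail: ':' excluded, so a nested "https://..." ends the match early.
const DETECT_WATCH =
  /(?:https?:\/\/)?(?:www\.)?youtube\.com\/watch\?[\w\-.\/=&%#+~]*/gi;

// Strip tail: ':' included, so URL-valued params are removed with the link.
const STRIP_WATCH =
  /(?:https?:\/\/)?(?:www\.)?youtube\.com\/watch\?[\w\-.\/=&%#+~:]*/gi;

function stripWatchLinks(text: string, matcher: RegExp): string {
  // Remove each match, then tidy the doubled space left behind.
  return text.replace(matcher, '').replace(/ {2,}/g, ' ').trim();
}

const message =
  'Summarize https://www.youtube.com/watch?v=dQw4w9WgXcQ&next=https://example.com please';

// With the strip tail the whole link goes: "Summarize please"
const cleaned = stripWatchLinks(message, STRIP_WATCH);

// With the detection tail an orphan remains: "Summarize ://example.com please"
const orphaned = stripWatchLinks(message, DETECT_WATCH);
```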
2026-06-24 13:06:59 -04:00
acl 🔗 feat: Add Granular Access Control to Shared Links via ACL System (#13051) 2026-06-03 14:17:17 -04:00
actions 🔀 fix: Reconcile Agent Action Credential Merges (#13559) 2026-06-06 15:09:58 -04:00
admin ✂️ fix: Cap Audit Chain Verification and Honor Client Cancellation (#13903) 2026-06-23 08:33:16 -04:00
agents 📺 feat: Google URL Context Param with Native YouTube Video Understanding (#13924) 2026-06-23 22:42:06 -04:00
apiKeys refactor: Migrate @librechat/api build to tsdown (#13595) 2026-06-08 10:54:48 -04:00
app 📈 fix: Isolate RUM Telemetry Proxy Auth from App Auth (#13765) 2026-06-15 12:49:44 -04:00
artifacts 🪡 fix: Artifact Edit Saves (#13358) 2026-05-27 22:03:42 -07:00
auth 🌐 fix: Centralize Outbound Proxy Handling (#13726) 2026-06-14 10:47:49 -04:00
cache 🚰 ci: Close Leaked Redis Clients in Cache Integration Tests (#13649) 2026-06-10 08:59:13 -04:00
cdn refactor: Migrate @librechat/api build to tsdown (#13595) 2026-06-08 10:54:48 -04:00
cluster refactor: Migrate @librechat/api build to tsdown (#13595) 2026-06-08 10:54:48 -04:00
crypto 🧵 refactor: Migrate Endpoint Initialization to TypeScript (#10794) 2025-12-11 16:37:16 -05:00
db 🗂️ feat: Add Private Chat Projects (#13467) 2026-06-03 15:29:18 -04:00
endpoints 🐌 fix: Prevent ReDoS in YouTube URL Extraction for URL Context (#13937) 2026-06-24 13:06:59 -04:00
files 🐛 fix: Prevent Infinite Render Loop on Code-Execution File Preview (#13922) 2026-06-23 16:34:43 -04:00
flow 🤫 refactor: Silent MCP OAuth Refresh on Mid-Session 401 (#13369) 2026-06-10 13:12:42 -04:00
html ⚙️ refactor: lazy-load React Query Devtools (#13639) 2026-06-10 13:06:20 -04:00
langfuse 📋 refactor: Attach Message Context to Langfuse Feedback Scores (#13604) 2026-06-08 15:54:01 -04:00
mcp 💰 fix: Bound MCP tools/list Pagination with Aggregate Budgets (#13909) 2026-06-23 08:34:09 -04:00
memory 🧠 fix: Bound Memory Agent Input (#13606) 2026-06-09 14:38:21 -04:00
middleware 🖇️ feat: Reference Selected Chat Text with Multi-Quote Popup (#13868) 2026-06-21 08:33:11 -04:00
modelSpecs 💬 feat: Conversation Starters for Model Specs (#13710) 2026-06-13 11:38:49 -04:00
oauth refactor: Migrate @librechat/api build to tsdown (#13595) 2026-06-08 10:54:48 -04:00
projects refactor: Migrate @librechat/api build to tsdown (#13595) 2026-06-08 10:54:48 -04:00
prompts refactor: Migrate @librechat/api build to tsdown (#13595) 2026-06-08 10:54:48 -04:00
rum 📈 fix: Isolate RUM Telemetry Proxy Auth from App Auth (#13765) 2026-06-15 12:49:44 -04:00
shared-links 🔐 fix: Gate Shared Startup Config By Link Access (#13897) 2026-06-23 08:28:37 -04:00
skills 🏘️ fix: Scope Skill Sync Status (#13771) 2026-06-15 15:23:49 -04:00
storage refactor: Migrate @librechat/api build to tsdown (#13595) 2026-06-08 10:54:48 -04:00
stream 🖇️ feat: Reference Selected Chat Text with Multi-Quote Popup (#13868) 2026-06-21 08:33:11 -04:00
telemetry 📡 refactor: Gate Noisy Redis OTEL Instrumentation (#13764) 2026-06-15 12:48:20 -04:00
tools fix: Sanitize MCP Tool Schemas for Gemini/Vertex Compatibility (#13623) 2026-06-09 14:16:25 -04:00
types 🧭 fix: Harden User Provided Endpoint URL Protection (#13919) 2026-06-23 16:35:16 -04:00
utils 🧭 fix: Harden User Provided Endpoint URL Protection (#13919) 2026-06-23 16:35:16 -04:00
web 🛟 refactor: Gracefully Skip Unavailable Web Search Rerankers (#13191) 2026-05-19 09:48:12 -04:00
index.ts 🔗 feat: Snapshot Files for Shared-Link Attachments (#13740) 2026-06-20 23:05:13 -04:00
telemetry.ts refactor: Migrate @librechat/api build to tsdown (#13595) 2026-06-08 10:54:48 -04:00