mirror of
https://github.com/danny-avila/LibreChat.git
synced 2026-06-28 10:21:39 +00:00
* 🛡️ fix: Prevent ReDoS in YouTube URL extraction for URL Context

  The YouTube detection/strip regexes ran as a single global pass over authenticated, user-controlled chat text. The engine could restart at every `youtube.com/watch?` occurrence, and the lazy `\S*?&` rescanned the rest of a long non-whitespace token each time, giving quadratic CPU behavior that blocks the Node event loop (DoS) for Google/Vertex agents with url_context enabled.

  - Tokenize on whitespace, skip tokens longer than a real URL, and cap the total text scanned, so work is bounded to O(n). URLs never contain whitespace, so per-token matching is equivalent.
  - Replace the lazy unbounded `(?:\S*?&)?` with the delimiter-bounded `(?:[^\s&]*&)*` (no behavior change for real URLs).
  - Apply the same discipline to the strip path.
  - Add ReDoS regression tests; a 3MB crafted input now completes in <10ms.

* 🛡️ fix: Bound the YouTube strip scan by the same total budget

  Address Codex P1: the strip path applied only the per-token cap, so a valid URL followed by many sub-cap malformed tokens still regex-scanned the entire message (~1s on 3MB). Injected ids only come from the first MAX_YOUTUBE_SCAN_CHARS (extraction's cap), so a link beyond that is never in injectedIds anyway; cap the strip scan at the same budget and leave the tail verbatim. 3MB PoC: ~1s -> ~14ms.

* 🧬 fix: Make YouTube URL matching linear instead of capping the scan

  The previous fix bounded the scan with per-token + total-scan caps, but the total-scan cap discarded content: a URL near the end of a long prompt was missed (extraction sliced to 100k), and large prepended file/quote context exhausted the strip budget before the real URL (strip skipped it). Codex round 2 (P2 x2).

  Replace the backtracking-prone matcher with a linear one: a single regex captures host + path/query (greedy `[^\s]*`, bounded `{1,63}`/`{0,10}` subdomain repetition, no lazy/ambiguous quantifier), and the video id is parsed from the capture afterwards. This is O(n) over arbitrary input, so the scan caps (and the content they discarded) are removed entirely. Extraction and stripping now scan the whole message linearly.

  Benchmarks (no caps): 3MB attack token ~3ms, 3MB many-token ~4ms, valid URL at end of 3MB found in ~18ms. Adds regression tests for long-prompt extraction and stripping past large prepended context.

* 🔡 fix: Match adjacent + capitalized YouTube URLs after linear rewrite

  Codex round 3 (regressions from the linear matcher):

  - Stop the path capture at URL-list delimiters (`,` `)` `]` `<` `>`, none of which occur in a real YouTube URL) so adjacent links in one token (comma-separated or markdown `](url1)](url2)`) are matched separately instead of swallowed.
  - Lowercase the path segment before matching route names, since the detection regex is case-insensitive (`/WATCH?v=`, `/EMBED/`).

* 🔒 fix: Allowlist URL chars + bounded path parsing for YouTube matching

  Codex round 4:

  - Replace the path stop-char blocklist with an allowlist of characters that occur in real YouTube URLs, so adjacent links separated by any prose delimiter (`;`, `|`, etc.) are matched separately instead of swallowed.
  - Parse the route with anchored, bounded regexes instead of `path.split('/')`, so a malformed path of millions of slashes no longer allocates a huge array or blocks the event loop. Also bounds the `v=` param read.

* 🎯 fix: Restrict YouTube matcher to recognized video routes

  Codex round 5: a nested video URL inside an unrecognized YouTube URL (`youtube.com/redirect?q=https://youtu.be/<id>`) was swallowed by the greedy match and missed. Restrict the matcher to recognized single-video forms (youtu.be/<id>, /(shorts|live|embed|v)/<id>, /watch?<query>) so an unrecognized route doesn't match and the global scan continues into the nested link. Stays linear (verified: 3MB redirect/slash/host floods all <25ms) and keeps the allowlist tail so adjacent links still split. Adds nested-URL + unrecognized-route regression tests.

* 🎬 fix: Find nested watch links + skip malformed v= duplicates

  Codex round 6 (P3 watch-query edges):

  - Drop `:` from the path allowlist. It never occurs in a real YouTube path/query, but the `://` of a nested URL does, so `watch?url=https://youtu.be/<id>` now stops the watch match and the scan finds the nested link.
  - Scan every `v=` param and return the first valid 11-char id, so a malformed earlier `v=` (e.g. `watch?v=tooShort&v=<valid>`) no longer shadows a later valid one.

* 🧹 fix: Strip whole YouTube URL incl. colon-containing trailing params

  Codex round 7: dropping `:` from the tail (round 6) made the strip path stop mid-URL on a URL-valued param (`watch?v=<id>&next=https://example.com`), leaving `://example.com` orphaned. Use a separate strip matcher whose tail re-includes `:` so the whole URL token is removed, while detection keeps the `:`-excluded tail to still find nested video links. Also corrects a stale "per-token cap" comment left over from the linear rewrite.
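The detection approach described in the commit notes above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names (`extractYouTubeIds`, `idFromWatchQuery`), the exact character allowlist, and the host boundary check are all assumptions. It shows the general shape: one regex with bounded subdomain repetition and a greedy allowlisted tail (no `:`), restricted to recognized single-video routes, with the video id parsed from the capture afterwards.

```typescript
// Hypothetical sketch: every name and pattern below is an assumption for illustration.

// An 11-character video id at the start of a path, not followed by another id character.
const ID_AT_START = /^([A-Za-z0-9_-]{11})(?![A-Za-z0-9_-])/;
const ID_EXACT = /^[A-Za-z0-9_-]{11}$/;

/** Returns the first `v=` value that is a valid 11-char id, skipping malformed ones. */
function idFromWatchQuery(query: string): string | null {
  // Each match consumes at least "v=", so the loop always advances.
  const param = /(?:^|&)v=([^&#]*)/g;
  let m: RegExpExecArray | null;
  while ((m = param.exec(query)) !== null) {
    const value = m[1] ?? "";
    if (ID_EXACT.test(value)) return value;
  }
  return null;
}

/** Collects distinct video ids from recognized YouTube URL forms in `text`. */
function extractYouTubeIds(text: string): string[] {
  // Bounded subdomain labels, a recognized route, then a greedy allowlisted tail.
  // ":" is left out of the tail so the "://" of a nested URL ends the match.
  const url =
    /(?:[a-z0-9-]{1,63}\.){0,10}(?:youtu\.be\/|youtube\.com\/(?:shorts|live|embed|v)\/|youtube\.com\/(watch\?))([A-Za-z0-9_.~%&=?#+\/-]*)/gi;
  const ids: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = url.exec(text)) !== null) {
    // Assumed boundary rule: reject hosts glued to a preceding label ("notyoutube.com").
    const before = m.index > 0 ? text.charAt(m.index - 1) : "";
    if (/[A-Za-z0-9.-]/.test(before)) continue;

    const tail = m[2] ?? "";
    let id: string | null;
    if (m[1] !== undefined) {
      id = idFromWatchQuery(tail); // /watch?<query> form
    } else {
      const p = ID_AT_START.exec(tail); // youtu.be/<id> and /(shorts|live|embed|v)/<id>
      id = p ? (p[1] ?? null) : null;
    }
    if (id !== null && ids.indexOf(id) === -1) ids.push(id);
  }
  return ids;
}
```

Under these assumptions, `extractYouTubeIds("youtube.com/redirect?q=https://youtu.be/abcdefghijk")` skips the unrecognized `redirect` route and returns the nested id, and a `watch?v=tooShort&v=<valid>` query yields the later valid id. A strip path would need its own matcher whose tail re-includes `:`, as the round 7 note explains; that is not shown here.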
||
|---|---|---|
| .. | ||
| acl | ||
| actions | ||
| admin | ||
| agents | ||
| apiKeys | ||
| app | ||
| artifacts | ||
| auth | ||
| cache | ||
| cdn | ||
| cluster | ||
| crypto | ||
| db | ||
| endpoints | ||
| files | ||
| flow | ||
| html | ||
| langfuse | ||
| mcp | ||
| memory | ||
| middleware | ||
| modelSpecs | ||
| oauth | ||
| projects | ||
| prompts | ||
| rum | ||
| shared-links | ||
| skills | ||
| storage | ||
| stream | ||
| telemetry | ||
| tools | ||
| types | ||
| utils | ||
| web | ||
| index.ts | ||
| telemetry.ts | ||