LibreChat/api/server/services/Files/Code/process.js
Danny Avila c9dee962e7
📂 fix: Preserve Nested Folder Paths for Code-Execution Artifacts (#12848)
* 📂 fix: Preserve Nested Folder Paths for Code-Execution Artifacts

When codeapi reports a generated file at a nested path (`a/b/file.txt`),
`processCodeOutput` was running it through `sanitizeFilename` — which
calls `path.basename()` and then collapses `/` to `_`. The DB row ended
up with `filename: "file.txt"`, `primeFiles` shipped that flat name back
to the next sandbox session, and `cat /mnt/data/a/b/file.txt` 404'd.

Fix: split the sanitizer into two helpers in `packages/api/src/utils/files.ts`:

  - `sanitizeArtifactPath` — segment-wise sanitize while preserving
    `/`. Falls back to basename on `..` traversal, absolute paths, and
    other malformed inputs. The DB record uses this so the next prime()
    can recreate the nested path in the sandbox.

  - `flattenArtifactPath` — encode `/` as `__` for the local
    `saveBuffer` strategies, which key by single-component filename and
    would otherwise create unintended subdirectories under uploads/.

`process.js` is updated to use both: DB filename keeps the path, storage
key flattens. `claimCodeFile` is also keyed on `safeName` so the
(filename, conversationId) compound key stays consistent with the
record `createFile` writes.

Tests:
  +13 unit tests in `files.spec.ts` (sanitizeArtifactPath table,
  flattenArtifactPath round-trip).
  +1 integration test in `process.spec.js` asserting the DB-row vs
  storage-key split for a nested path.
  Updated `process-traversal.spec.js` to mock the new helpers.

64 pass / 0 fail across `Files/Code/`; 36 pass / 0 fail in
`packages/api/src/utils/files.spec.ts`.

Companion: ClickHouse/ai#1327 — the codeapi-side counterpart that stops
phantom file IDs from reaching this code path in the first place. These
two are independent but the matplotlib bug is most cleanly resolved when
both ship.

* 🛡️ fix: Re-add 255-char per-segment cap in sanitizeArtifactPath (codex review P2)

`sanitizeArtifactPath` dropped the 255-char basename cap that
`sanitizeFilename` enforces. Long artifact names then flowed unbounded
into `processCodeOutput`'s storage key (`${file_id}__${flatName}`) and
tripped `ENAMETOOLONG` on filesystems that enforce `NAME_MAX` —
saveBuffer fails, and the file falls back to a download URL instead of
persisting / priming. This was a regression specifically for flat
filenames that the original `sanitizeFilename` would have truncated
safely.

Re-add the cap as a per-path-component limit so it applies cleanly to
both flat and nested paths:

  - Leaf segment: extension-preserving truncation, matching
    `sanitizeFilename`'s shape (`<truncated-stem>-<6 hex>.<ext>`).
  - Non-leaf (directory) segments: plain truncate-and-disambiguate
    (`<truncated-name>-<6 hex>`); directory names don't carry semantic
    extensions worth preserving.
  - Defensive fallback when `path.extname` returns a pathologically long
    "extension" (e.g. `_.aaaa…aaa` after the dotfile underscore prefix
    rewrite turns a long hidden file into a non-dotfile with a 300-char
    "extension"): collapse to whole-segment truncation rather than
    leaving the cap unmet.

+6 unit tests covering: long leaf (regression case), long leaf under a
preserved directory, long non-leaf segment, deeply nested mixed-length,
exact-255 boundary (no truncation), and the dotfile + truncation
interaction.

* 🛡️ fix: Cap flattened storage key against NAME_MAX in processCodeOutput (codex review P1)

Per-segment caps on the path-preserving form aren't enough. Once segments
are joined with `__` for the storage key, deeply-nested or moderately
long paths can still produce a flat form that overflows once
`${file_id}__` is prepended — `${file_id}__a__b__c.csv` for a 3-level
100-char-each path is ~344 chars, well past filesystem NAME_MAX (255).
saveBuffer then trips ENAMETOOLONG and falls back to a download URL,
and the artifact never persists / primes.

`flattenArtifactPath` gets an optional `maxLength` parameter. When set,
the function truncates the flat form to fit, preserving the leaf
extension with the same disambiguating-hex-suffix shape sanitizeFilename
uses. Default (`undefined`) keeps existing call sites uncapped — the cap
is opt-in for callers that are actually building a filesystem key.
Pathologically long "extensions" from `path.extname` (e.g. `.aaaa…aaa`)
fall back to whole-key truncation rather than leaving the cap unmet.

processCodeOutput composes the storage key after `file_id` is known and
passes `255 - file_id.length - 2` as the budget so the full
`${file_id}__${flatName}` string fits in one filesystem path component.
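The budget math can be sketched like this (simplified: a plain slice stands in for the extension-preserving truncation the real `flattenArtifactPath` does):

```javascript
const NAME_MAX = 255;

// Compose the storage key after file_id is known so the cap accounts
// for the actual prefix length: `${file_id}__` costs file_id.length + 2.
function storageKey(file_id, safeName) {
  const flat = safeName.split('/').join('__');
  const budget = NAME_MAX - file_id.length - 2;
  const flatName = flat.length <= budget ? flat : flat.slice(0, budget);
  return `${file_id}__${flatName}`; // always <= NAME_MAX chars
}
```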

+7 unit tests in files.spec.ts:
  - Pass-through when no maxLength supplied (cap is opt-in).
  - Pass-through when flat form fits within maxLength.
  - Truncation with leaf extension preserved (the regression case).
  - Leaf-only overflow with extension preservation.
  - Pathological long-extension fallback (whole-key truncation).
  - No-extension stem truncation.
  - Boundary equality (off-by-one guard).

+1 integration test in process.spec.js: processCodeOutput passes the
file_id-aware budget (`255 - file_id.length - 2`) to flattenArtifactPath.

114/114 across files.spec.ts + Files/Code (49 + 65).

* 🛡️ fix: Determinize + clamp artifact-path truncation (codex review P2 ×2)

Two follow-ups to Codex review on the path/flat-key cap:

1. **Deterministic truncation suffixes**. The previous helpers used
   `crypto.randomBytes(3)` for the disambiguator, mirroring
   `sanitizeFilename`'s shape. That made the truncated form non-
   deterministic: a re-upload of the same long filename would compute a
   *different* storage key, orphaning the previous on-disk file under
   the reused `file_id` returned by `claimCodeFile`.

   New `deterministicHexSuffix(input)` helper hashes the input with
   SHA-256 and takes the first 6 hex chars. Same input → same suffix
   (storage key stable across re-uploads); different inputs sharing a
   truncation prefix still get different suffixes (collision avoidance).
   24 bits ≈ 16M values is collision-safe for our scale (single-digit
   artifacts per turn per (filename, conversationId) bucket).

   Applied to `truncateLeafSegment`, `truncateDirSegment`, and
   `flattenArtifactPath` — every truncation site in the new helpers.
   `sanitizeFilename` (pre-existing) is intentionally left alone; its
   tests rely on the random-bytes mock and it's outside this PR's scope.

2. **Final clamp on flattenArtifactPath result**. The old `Math.max(1,
   maxLength - ext.length - 7)` floor could let the result slip past
   `maxLength` when the extension was nearly as large as the budget
   (e.g. `maxLength=5`, `ext=".txt"`: budget computed as 0, but result
   was `-<6 hex>.txt` = 11 chars). Drop the `Math.max(1, …)` floor and
   add a final `truncated.slice(0, maxLength)` so the contract holds
   for any input. Also short-circuit `maxLength <= 0` to `''` for
   pathological budgets.

Tests updated to compute the expected hash inline (the existing
`randomBytes` mock doesn't apply to the new code path), plus 4 new
regression tests:
  - sanitizeArtifactPath: same input → same output, different inputs →
    different outputs (determinism + collision avoidance).
  - flattenArtifactPath: same input → same output, different inputs
    sharing a truncation prefix → different outputs.
  - flattenArtifactPath: clamp holds when ext.length > maxLength - 7.
  - flattenArtifactPath: returns '' for maxLength <= 0.

53 unit tests pass. 65 integration tests pass.

* 🛡️ fix: Total-path cap + basename for classifier (codex P2 + comprehensive review)

Four follow-ups from the latest reviews on this PR:

1. **Codex P2: total-path cap in sanitizeArtifactPath**. Per-segment
   caps weren't enough — a deeply nested path (3+ at-cap segments) can
   still produce a joined form past Mongo's 1024-byte indexed-key limit
   (4.0 and earlier reject; later versions configurable). Added
   `ARTIFACT_PATH_TOTAL_MAX = 512` and a leaf-only fallback when the
   joined form exceeds it. Same shape as the absolute-path /
   `..`-traversal fallbacks above; the leaf is already segment-capped to
   ≤255, so the final result stays within bounds.

2. **Codex P2: pass basename to classifier/extractor in process.js**.
   With the path-preserving sanitizer, `safeName` can now be a nested
   string like `reports.v1/Makefile`. The classifier's `extensionOf`
   reads that as `v1/Makefile` (the slice after the dot in the directory
   name) and the bare-name branch rejects because it sees a `.`
   anywhere. Result: extensionless artifacts under dotted folders
   (Makefile, Dockerfile, etc.) get misclassified as `other` and skip
   text extraction. Pass `path.basename(safeName)` to both
   `classifyCodeArtifact` and `extractCodeArtifactText` so
   classification matches what the old flat-name flow produced.

3. **Review nit: drop dead `sanitizeFilename` mock in process.spec.js**.
   process.js no longer imports `sanitizeFilename`; the mock was
   misleading dead code.

4. **Review nit: rename misleading `'embedded parent traversal'` test**.
   `path.posix.normalize('a/../escape.txt')` resolves to `escape.txt`
   which goes through the normal segment-split path, not the
   `sanitizeFilename` fallback. Test name now says "resolves embedded
   parent traversal via path normalization" to match the actual code
   path.

+3 regression tests:
  - sanitizeArtifactPath falls back to leaf-only when joined > 512.
  - sanitizeArtifactPath keeps nested path within the 512 budget.
  - process.spec: passes basename (`Makefile` from `reports.v1/Makefile`)
    to classifyCodeArtifact + extractCodeArtifactText.

Existing "caps every segment in a deeply-nested path" test now uses 2
segments (not 3) so the joined form stays under the new total cap; the
3-segment scenario is covered by the new fallback test instead.

55 unit + 66 integration = 121/121 pass.

* 📝 docs: Correct sanitizeArtifactPath JSDoc to match actual schema index

Two doc-only fixes from the latest comprehensive review (both NIT):

1. **Index field list was wrong**. JSDoc claimed the compound unique
   index was `{ file_id, filename, conversationId, context }`. The
   actual index in `packages/data-schemas/src/schema/file.ts:92-95` is
   `{ filename, conversationId, context, tenantId }` with a partial
   filter for `context: FileContext.execute_code`. The cap rationale
   (Mongo 4.0 indexed-key limit) is correct and unchanged; just the
   field list was wrong. Added the schema file path so future readers
   can find the source of truth.

2. **Trade-off acknowledgement**. The reviewer noted that the
   leaf-only fallback loses directory structure, which means the
   model's `cat /mnt/data/<deep>/<path>/file.txt` would 404 on the
   pathological-depth case — partially re-introducing the original
   flat-name bug for >512-char paths. This is intentional (DB write
   failure is strictly worse than losing structure), but the trade-off
   wasn't called out explicitly in the JSDoc. Added a paragraph
   acknowledging it and noting that the cap is monotonically better
   than the pre-PR behavior, where ALL artifacts were treated this way
   regardless of depth.

No code or test changes — pure JSDoc correction. Tests still 55/0.

* 🛡️ fix: Disambiguate sanitized artifact names to keep claimCodeFile keys unique (codex P2)

`sanitizeArtifactPath` is not injective — multiple raw inputs can collapse
onto the same regex-and-normalize output. Codex's example:
`reports 2026/out.csv` and `reports_2026/out.csv` both sanitize to
`reports_2026/out.csv`. `claimCodeFile` is keyed on the schema's compound
unique `(filename, conversationId, context, tenantId)` index, so the
later upload silently matches the earlier record and overwrites the first
artifact's bytes via the reused `file_id` — a single conversation can
drop files when both names are valid in the sandbox.

This collision space isn't strictly new — pre-PR `sanitizeFilename`
(basename-only) had the same property — but the path-preserving form
gives us enough information to fix it for the first time.
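The collision reproduces with nothing more than the character-level replace (the character class here is an assumption; the real sanitizer also normalizes paths):

```javascript
// Two distinct raw inputs collapse onto one sanitized form.
const charSanitize = (s) => s.replace(/[^\w./-]/g, '_');

charSanitize('reports 2026/out.csv'); // 'reports_2026/out.csv'
charSanitize('reports_2026/out.csv'); // 'reports_2026/out.csv' -- identical
```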

**Fix.** When character-level sanitization changed something (regex
replacement, path normalization, dotfile prefix, empty-segment collapse),
embed a deterministic SHA-256 prefix of the **raw input** in the leaf
segment via the new `embedDisambiguatorInLeaf` helper. Same raw input →
same safe form (idempotent for re-uploads); different raw inputs that
would have collided → different safe forms.

**Why "character-level"** specifically:
- The disambiguator fires when `preCapJoined !== inputName` (post-regex
  + dotfile + empty-segment, BUT pre-truncation).
- Truncation alone is already disambiguated by `truncateLeafSegment`'s
  own seg-hash; firing the input-hash branch on truncation would just
  stack a second hash for no collision-avoidance benefit and clutter
  human-readable filenames.

**Three known collision shapes covered:**
1. `out 1.csv` vs `out_1.csv` (and `out@1.csv` vs `out#1.csv`, etc.)
2. `dir//file.txt` vs `dir/file.txt` (empty-segment collapse)
3. `.x` vs `_.x` (dotfile-prefix step)

**Disambiguator + truncation interaction:** for very long mutated leaves,
`truncateLeafSegment` caps at 255 first, then `embedDisambiguatorInLeaf`
re-trims to insert the input hash. The seg-hash from the first pass is
replaced by the input-hash from the second pass — that's intentional
(input-hash is the load-bearing collision-avoidance suffix; seg-hash was
only ever decorative once the input-hash exists). Final clamp ensures
the result never exceeds `ARTIFACT_PATH_SEGMENT_MAX` regardless of input.

**Disambiguator + total-cap fallback:** when joined > 512, we fall back
to the leaf-only form. The leaf has already had the disambiguator
embedded, so collision avoidance survives the pathological-depth case.

**`embedDisambiguatorInLeaf`** uses `dot <= 1` to detect "no real
extension" (covers extensionless names AND dotfile-prefixed leaves like
`_.hidden` — without this, `_.hidden` would split as stem `_` + ext
`.hidden` and produce the awkward `_-<hash>.hidden`).
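A sketch of the `dot <= 1` check, under the assumption that the helper splits the leaf into stem + extension before inserting the hash (hypothetical shape — the real `embedDisambiguatorInLeaf` is in `files.ts`):

```javascript
// dot === -1 -> extensionless; dot === 0 -> dotfile; dot === 1 also
// covers the '_'-prefixed dotfile rewrite ('_.hidden'). All three are
// treated as "no real extension" so the hash lands at the end.
function splitLeaf(leaf) {
  const dot = leaf.lastIndexOf('.');
  if (dot <= 1) return { stem: leaf, ext: '' };
  return { stem: leaf.slice(0, dot), ext: leaf.slice(dot) };
}
```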

**Updated 5 existing tests** that asserted the old collision-prone
outputs — they now verify the disambiguator-included form. The
character-level-only firing rule was load-bearing here: tests for
"clean inputs (no mutation)" and "long inputs (truncation only)" still
pass without any disambiguator clutter.

**+7 regression tests** in a new `collision avoidance (Codex review P2)`
describe block:
1. Different raw inputs sanitizing to the same form get distinct safes
2. Whitespace-vs-underscore in directory segment
3. Dotfile-prefix collision
4. Idempotency: same raw → same safe across calls
5. Clean inputs skip the disambiguator (cosmetic guarantee)
6. Disambiguator survives leaf truncation (long mutated leaf)
7. Disambiguator survives total-cap fallback (pathological depth)

62 unit + 66 integration = 128/128 pass.
2026-04-28 12:52:04 +09:00


const path = require('path');
const { v4 } = require('uuid');
const { logger } = require('@librechat/data-schemas');
const { getCodeBaseURL } = require('@librechat/agents');
const {
  getBasePath,
  logAxiosError,
  sanitizeArtifactPath,
  flattenArtifactPath,
  createAxiosInstance,
  classifyCodeArtifact,
  codeServerHttpAgent,
  codeServerHttpsAgent,
  extractCodeArtifactText,
} = require('@librechat/api');
const {
  Tools,
  megabyte,
  fileConfig,
  FileContext,
  FileSources,
  imageExtRegex,
  inferMimeType,
  EToolResources,
  EModelEndpoint,
  mergeFileConfig,
  getEndpointFileConfig,
} = require('librechat-data-provider');
const { filterFilesByAgentAccess } = require('~/server/services/Files/permissions');
const { createFile, getFiles, updateFile, claimCodeFile } = require('~/models');
const { getStrategyFunctions } = require('~/server/services/Files/strategies');
const { convertImage } = require('~/server/services/Files/images/convert');
const { determineFileType } = require('~/server/utils');

const axios = createAxiosInstance();
/**
 * Creates a fallback download URL response when file cannot be processed locally.
 * Used when: file exceeds size limit, storage strategy unavailable, or download error occurs.
 * @param {Object} params - The parameters.
 * @param {string} params.name - The filename.
 * @param {string} params.session_id - The code execution session ID.
 * @param {string} params.id - The file ID from the code environment.
 * @param {string} params.conversationId - The current conversation ID.
 * @param {string} params.toolCallId - The tool call ID that generated the file.
 * @param {string} params.messageId - The current message ID.
 * @param {number} params.expiresAt - Expiration timestamp (24 hours from creation).
 * @returns {Object} Fallback response with download URL.
 */
const createDownloadFallback = ({
  id,
  name,
  messageId,
  expiresAt,
  session_id,
  toolCallId,
  conversationId,
}) => {
  const basePath = getBasePath();
  return {
    filename: name,
    filepath: `${basePath}/api/files/code/download/${session_id}/${id}`,
    expiresAt,
    conversationId,
    toolCallId,
    messageId,
  };
};
/**
 * Process code execution output files - downloads and saves both images and non-image files.
 * All files are saved to local storage with fileIdentifier metadata for code env re-upload.
 * @param {Object} params - The parameters.
 * @param {ServerRequest} params.req - The Express request object.
 * @param {string} params.id - The file ID from the code environment.
 * @param {string} params.name - The filename.
 * @param {string} params.toolCallId - The tool call ID that generated the file.
 * @param {string} params.session_id - The code execution session ID.
 * @param {string} params.conversationId - The current conversation ID.
 * @param {string} params.messageId - The current message ID.
 * @returns {Promise<MongoFile & { messageId: string, toolCallId: string } | undefined>} The file metadata or undefined if an error occurs.
 */
const processCodeOutput = async ({
  req,
  id,
  name,
  toolCallId,
  conversationId,
  messageId,
  session_id,
}) => {
  const appConfig = req.config;
  const currentDate = new Date();
  const baseURL = getCodeBaseURL();
  const fileExt = path.extname(name).toLowerCase();
  const isImage = fileExt && imageExtRegex.test(name);
  const mergedFileConfig = mergeFileConfig(appConfig.fileConfig);
  const endpointFileConfig = getEndpointFileConfig({
    fileConfig: mergedFileConfig,
    endpoint: EModelEndpoint.agents,
  });
  const fileSizeLimit = endpointFileConfig.fileSizeLimit ?? mergedFileConfig.serverFileSizeLimit;
  try {
    const formattedDate = currentDate.toISOString();
    const response = await axios({
      method: 'get',
      url: `${baseURL}/download/${session_id}/${id}`,
      responseType: 'arraybuffer',
      headers: {
        'User-Agent': 'LibreChat/1.0',
      },
      httpAgent: codeServerHttpAgent,
      httpsAgent: codeServerHttpsAgent,
      timeout: 15000,
    });
    const buffer = Buffer.from(response.data, 'binary');

    // Enforce file size limit
    if (buffer.length > fileSizeLimit) {
      logger.warn(
        `[processCodeOutput] File "${name}" (${(buffer.length / megabyte).toFixed(2)} MB) exceeds size limit of ${(fileSizeLimit / megabyte).toFixed(2)} MB, falling back to download URL`,
      );
      return createDownloadFallback({
        id,
        name,
        messageId,
        toolCallId,
        session_id,
        conversationId,
        expiresAt: currentDate.getTime() + 86400000,
      });
    }

    const fileIdentifier = `${session_id}/${id}`;
    /* `safeName` keeps the directory structure (`a/b/file.txt` -> `a/b/file.txt`)
     * so the next prime() can place the file at the same nested path in the
     * sandbox; flattening would re-create the bug where every nested artifact
     * collapsed into the root and read_file calls 404'd. The flat-form
     * storage key is composed below once `file_id` is known so we can cap
     * the total length at filesystem NAME_MAX. */
    const safeName = sanitizeArtifactPath(name);
    if (safeName !== name) {
      logger.warn(
        `[processCodeOutput] Filename sanitized: "${name}" -> "${safeName}" | conv=${conversationId}`,
      );
    }
    /**
     * Atomically claim a file_id for this (filename, conversationId, context) tuple.
     * Uses $setOnInsert so concurrent calls for the same filename converge on
     * a single record instead of creating duplicates (TOCTOU race fix).
     *
     * Claim by `safeName` (not raw `name`) so the claim and the eventual
     * `createFile` agree on the filename column — otherwise weird inputs
     * (e.g. `"proj name/file@v1.txt"`) would claim under the raw name and
     * then write under the sanitized one, leaving the claim row orphaned.
     */
    const newFileId = v4();
    const claimed = await claimCodeFile({
      filename: safeName,
      conversationId,
      file_id: newFileId,
      user: req.user.id,
    });
    const file_id = claimed.file_id;
    const isUpdate = file_id !== newFileId;
    if (isUpdate) {
      logger.debug(
        `[processCodeOutput] Updating existing file "${safeName}" (${file_id}) instead of creating duplicate`,
      );
    }
    /**
     * Preserve the original `messageId` on update. Each `processCodeOutput`
     * call would otherwise overwrite it with the current run's run id, which
     * decouples the file from the assistant message that originally created
     * it. `getCodeGeneratedFiles` filters by `messageId IN <thread>`, so a
     * stale id (e.g. from a later regeneration / failed re-read attempt)
     * silently excludes the file from priming on subsequent turns.
     */
    const persistedMessageId = isUpdate ? (claimed.messageId ?? messageId) : messageId;

    if (isImage) {
      const usage = isUpdate ? (claimed.usage ?? 0) + 1 : 1;
      const _file = await convertImage(req, buffer, 'high', `${file_id}${fileExt}`);
      const filepath = usage > 1 ? `${_file.filepath}?v=${Date.now()}` : _file.filepath;
      const file = {
        ..._file,
        filepath,
        file_id,
        messageId: persistedMessageId,
        usage,
        filename: safeName,
        conversationId,
        user: req.user.id,
        type: `image/${appConfig.imageOutputType}`,
        createdAt: isUpdate ? claimed.createdAt : formattedDate,
        updatedAt: formattedDate,
        source: appConfig.fileStrategy,
        context: FileContext.execute_code,
        metadata: { fileIdentifier },
      };
      await createFile(file, true);
      return Object.assign(file, { messageId, toolCallId });
    }

    const { saveBuffer } = getStrategyFunctions(appConfig.fileStrategy);
    if (!saveBuffer) {
      logger.warn(
        `[processCodeOutput] saveBuffer not available for strategy ${appConfig.fileStrategy}, falling back to download URL`,
      );
      return createDownloadFallback({
        id,
        name,
        messageId,
        toolCallId,
        session_id,
        conversationId,
        expiresAt: currentDate.getTime() + 86400000,
      });
    }

    const detectedType = await determineFileType(buffer, true);
    const mimeType = detectedType?.mime || inferMimeType(name, '') || 'application/octet-stream';
    /** Check MIME type support - for code-generated files, we're lenient but log unsupported types */
    const isSupportedMimeType = fileConfig.checkType(
      mimeType,
      endpointFileConfig.supportedMimeTypes,
    );
    if (!isSupportedMimeType) {
      logger.warn(
        `[processCodeOutput] File "${name}" has unsupported MIME type "${mimeType}", proceeding with storage but may not be usable as tool resource`,
      );
    }

    /* Compose the storage key here, after `file_id` is known, so the
     * `flattenArtifactPath` cap budget can be calculated against the
     * actual prefix length. The full key has to fit in one filesystem
     * path component (NAME_MAX = 255 on most filesystems); without this
     * cap, deeply-nested artifact paths whose individual segments were
     * within bounds can still produce a flat form that overflows once
     * `${file_id}__` is prepended, causing `ENAMETOOLONG` inside
     * saveBuffer and falling back to a download URL. The 255 figure is
     * the conservative cross-platform NAME_MAX (Linux ext4, NTFS, APFS).
     */
    const NAME_MAX = 255;
    const flatName = flattenArtifactPath(safeName, NAME_MAX - file_id.length - 2);
    const fileName = `${file_id}__${flatName}`;
    const filepath = await saveBuffer({
      userId: req.user.id,
      buffer,
      fileName,
      basePath: 'uploads',
    });

    /* `classifyCodeArtifact` and `extractCodeArtifactText` make
     * extension/bare-name decisions on the input string. With the
     * path-preserving sanitizer they can now receive a nested path like
     * `reports.v1/Makefile`, which the classifier's `extensionOf` reads
     * as `v1/Makefile` (the slice after the dot in the directory name)
     * and the bare-name branch rejects because it sees a `.` anywhere in
     * the string. Result: extensionless artifacts under dotted folders
     * (Makefile, Dockerfile, etc.) get misclassified as `other` and
     * skip text extraction. Pass the basename so classification matches
     * what it would have gotten with the old flat-name flow. */
    const leafName = path.basename(safeName);
    const category = classifyCodeArtifact(leafName, mimeType);
    const text = await extractCodeArtifactText(buffer, leafName, mimeType, category);
    const file = {
      file_id,
      filepath,
      messageId: persistedMessageId,
      object: 'file',
      filename: safeName,
      type: mimeType,
      conversationId,
      user: req.user.id,
      bytes: buffer.length,
      updatedAt: formattedDate,
      metadata: { fileIdentifier },
      source: appConfig.fileStrategy,
      context: FileContext.execute_code,
      usage: isUpdate ? (claimed.usage ?? 0) + 1 : 1,
      createdAt: isUpdate ? claimed.createdAt : formattedDate,
      // Always set `text` explicitly (string or null) so that an update which
      // produces a binary or oversized artifact clears any previously cached
      // text — `createFile` uses findOneAndUpdate with $set semantics, which
      // would otherwise leave a stale value behind.
      text: text ?? null,
    };
    await createFile(file, true);
    return Object.assign(file, { messageId, toolCallId });
  } catch (error) {
    if (error?.message === 'Path traversal detected in filename') {
      logger.warn(
        `[processCodeOutput] Path traversal blocked for file "${name}" | conv=${conversationId}`,
      );
    }
    logAxiosError({
      message: 'Error downloading/processing code environment file',
      error,
    });
    // Fallback for download errors - return download URL so user can still manually download
    return createDownloadFallback({
      id,
      name,
      messageId,
      toolCallId,
      session_id,
      conversationId,
      expiresAt: currentDate.getTime() + 86400000,
    });
  }
};
function checkIfActive(dateString) {
  const givenDate = new Date(dateString);
  const currentDate = new Date();
  const timeDifference = currentDate - givenDate;
  const hoursPassed = timeDifference / (1000 * 60 * 60);
  return hoursPassed < 23;
}

/**
 * Retrieves the `lastModified` time string for a specified file from Code Execution Server.
 *
 * @param {string} fileIdentifier - The identifier for the file (e.g., "session_id/fileId").
 *
 * @returns {Promise<string|null>}
 *   A promise that resolves to the `lastModified` time string of the file if successful,
 *   or null if there is an error in initialization or fetching the info.
 */
async function getSessionInfo(fileIdentifier) {
  try {
    const baseURL = getCodeBaseURL();
    const [path, queryString] = fileIdentifier.split('?');
    const [session_id, fileId] = path.split('/');
    let queryParams = {};
    if (queryString) {
      queryParams = Object.fromEntries(new URLSearchParams(queryString).entries());
    }
    const response = await axios({
      method: 'get',
      url: `${baseURL}/sessions/${session_id}/objects/${fileId}`,
      params: queryParams,
      headers: {
        'User-Agent': 'LibreChat/1.0',
      },
      httpAgent: codeServerHttpAgent,
      httpsAgent: codeServerHttpsAgent,
      timeout: 5000,
    });
    return response.data?.lastModified;
  } catch (error) {
    logAxiosError({
      message: `Error fetching session info: ${error.message}`,
      error,
    });
    return null;
  }
}
/**
 *
 * @param {Object} options
 * @param {ServerRequest} options.req
 * @param {Agent['tool_resources']} options.tool_resources
 * @param {string} [options.agentId] - The agent ID for file access control
 * @returns {Promise<{
 *   files: Array<{ id: string; session_id: string; name: string }>,
 *   toolContext: string,
 * }>}
 */
const primeFiles = async (options) => {
  const { tool_resources, req, agentId } = options;
  const file_ids = tool_resources?.[EToolResources.execute_code]?.file_ids ?? [];
  const agentResourceIds = new Set(file_ids);
  const resourceFiles = tool_resources?.[EToolResources.execute_code]?.files ?? [];

  // Get all files first
  const allFiles = (await getFiles({ file_id: { $in: file_ids } }, null, { text: 0 })) ?? [];

  // Filter by access if user and agent are provided
  let dbFiles;
  if (req?.user?.id && agentId) {
    dbFiles = await filterFilesByAgentAccess({
      files: allFiles,
      userId: req.user.id,
      role: req.user.role,
      agentId,
    });
  } else {
    dbFiles = allFiles;
  }
  dbFiles = dbFiles.concat(resourceFiles);

  const files = [];
  const sessions = new Map();
  let toolContext = '';

  for (let i = 0; i < dbFiles.length; i++) {
    const file = dbFiles[i];
    if (!file) {
      continue;
    }
    if (file.metadata.fileIdentifier) {
      const [path, queryString] = file.metadata.fileIdentifier.split('?');
      const [session_id, id] = path.split('/');
      /**
       * `pushFile` accepts optional overrides so the reupload path can
       * push the FRESH `(session_id, id)` parsed off the new
       * `fileIdentifier`. Without these overrides, the closure would
       * capture the stale pre-reupload refs from the outer loop and
       * the in-memory `files` array (now consumed by
       * `buildInitialToolSessions` to seed `Graph.sessions`) would
       * point at a sandbox object that no longer exists. The DB record
       * gets the new identifier via `updateFile`, but the seed would
       * still inject the old one — bash_tool / read_file would 404
       * trying to mount the file until the next turn re-reads metadata.
       */
      const pushFile = (overrideSessionId, overrideId) => {
        if (!toolContext) {
          toolContext = `- Note: The following files are available in the "${Tools.execute_code}" tool environment:`;
        }
        let fileSuffix = '';
        if (!agentResourceIds.has(file.file_id)) {
          fileSuffix =
            file.context === FileContext.execute_code
              ? ' (from previous code execution)'
              : ' (attached by user)';
        }
        toolContext += `\n\t- /mnt/data/${file.filename}${fileSuffix}`;
        files.push({
          id: overrideId ?? id,
          session_id: overrideSessionId ?? session_id,
          name: file.filename,
        });
      };
      if (sessions.has(session_id)) {
        pushFile();
        continue;
      }
      let queryParams = {};
      if (queryString) {
        queryParams = Object.fromEntries(new URLSearchParams(queryString).entries());
      }
      const reuploadFile = async () => {
        try {
          const { getDownloadStream } = getStrategyFunctions(file.source);
          const { handleFileUpload: uploadCodeEnvFile } = getStrategyFunctions(
            FileSources.execute_code,
          );
          const stream = await getDownloadStream(options.req, file.filepath);
          const fileIdentifier = await uploadCodeEnvFile({
            req: options.req,
            stream,
            filename: file.filename,
            entity_id: queryParams.entity_id,
          });
          // Preserve existing metadata when adding fileIdentifier
          const updatedMetadata = {
            ...file.metadata, // Preserve existing metadata (like S3 storage info)
            fileIdentifier, // Add fileIdentifier
          };
          await updateFile({
            file_id: file.file_id,
            metadata: updatedMetadata,
          });
          /**
           * Parse the FRESH fileIdentifier returned by the reupload and
           * route it through both the dedupe Map and the in-memory
           * `files` list. The original `(session_id, id)` parsed at the
           * top of this iteration refer to the old, expired/missing
           * sandbox object — using them here would silently re-introduce
           * the bug `Graph.sessions` seeding is supposed to fix.
           */
          const [newPath] = fileIdentifier.split('?');
          const [newSessionId, newId] = newPath.split('/');
          sessions.set(newSessionId, true);
          pushFile(newSessionId, newId);
        } catch (error) {
          logger.error(
            `Error re-uploading file ${id} in session ${session_id}: ${error.message}`,
            error,
          );
        }
      };
      const uploadTime = await getSessionInfo(file.metadata.fileIdentifier);
      if (!uploadTime) {
        logger.warn(`Failed to get upload time for file ${id} in session ${session_id}`);
        await reuploadFile();
        continue;
      }
      if (!checkIfActive(uploadTime)) {
        await reuploadFile();
        continue;
      }
      sessions.set(session_id, true);
      pushFile();
    }
  }
  return { files, toolContext };
};
/**
 * Reads a single file from the code-execution sandbox by shelling `cat`
 * through the sandbox `/exec` endpoint. Used by the `read_file` host
 * handler when the requested path is a code-env path (`/mnt/data/...`)
 * or otherwise not resolvable as a skill file. Resolves to
 * `{ content }` from stdout on success, or `null` when the codeapi base
 * URL isn't configured / the read returns no content (caller turns that
 * into a model-visible error). Throws axios-style errors on transport
 * failure so the caller can surface a meaningful error message.
 *
 * `session_id` and `files` come from the seeded `tc.codeSessionContext`
 * (emitted by the agents-side `ToolNode` for `read_file` calls in
 * v3.1.72+) so the read lands in the same sandbox session that holds
 * the agent's prior-turn artifacts.
 *
 * @param {Object} params
 * @param {string} params.file_path - Absolute path inside the sandbox (e.g. `/mnt/data/foo.txt`).
 * @param {string} [params.session_id] - Sandbox session id from the seeded context.
 * @param {Array<{id: string, name: string, session_id?: string}>} [params.files] - File refs to mount.
 * @returns {Promise<{content: string} | null>}
 */
async function readSandboxFile({ file_path, session_id, files }) {
  const baseURL = getCodeBaseURL();
  if (!baseURL) {
    return null;
  }
  /** Single-quote `file_path` with embedded-quote escaping so a malicious
   * filename can't break out of the `cat` command. The handler upstream
   * has already established this is a code-env path the model
   * legitimately asked to read; this just keeps the shell quoting safe. */
  const safePath = `'${file_path.replace(/'/g, `'\\''`)}'`;
  /** @type {Record<string, unknown>} */
  const postData = { lang: 'bash', code: `cat ${safePath}` };
  if (session_id) {
    postData.session_id = session_id;
  }
  if (files && files.length > 0) {
    postData.files = files;
  }
  try {
    const response = await axios({
      method: 'post',
      url: `${baseURL}/exec`,
      data: postData,
      headers: {
        'Content-Type': 'application/json',
        'User-Agent': 'LibreChat/1.0',
      },
      httpAgent: codeServerHttpAgent,
      httpsAgent: codeServerHttpsAgent,
      timeout: 15000,
    });
    const result = response?.data ?? {};
    if (result.stderr && (result.stdout == null || result.stdout === '')) {
      throw new Error(String(result.stderr).trim());
    }
    if (result.stdout == null) {
      return null;
    }
    return { content: String(result.stdout) };
  } catch (error) {
    logAxiosError({
      message: `Error reading sandbox file "${file_path}"`,
      error,
    });
    throw error;
  }
}

module.exports = {
  primeFiles,
  checkIfActive,
  getSessionInfo,
  processCodeOutput,
  readSandboxFile,
};