LibreChat/packages/api/src/flow/manager.ts
Danny Avila c27d6b85a4
🤫 refactor: Silent MCP OAuth Refresh on Mid-Session 401 (#13369)
* 🤫 fix: Silent MCP OAuth Refresh on Mid-Session 401

Avoids the hourly interactive re-auth prompt when an MCP server
(e.g. Azure Entra ID) returns 401 mid-session by attempting a refresh
token exchange first, and only falling back to the interactive OAuth
flow when no refresh token is stored or the refresh server rejects it.

Resolves #13364.

* fix: Use distinct flow type for silent token refresh to avoid cache hit

Addresses the Codex review on PR #13369: `attemptSilentTokenRefresh` was
reusing the `'mcp_get_tokens'` flow type, so
`FlowStateManager.createFlowWithHandler` would short-circuit and return
the same tokens cached by an earlier `getOAuthTokens` call — the very
tokens the server just rejected — without executing the forced-refresh
handler.

Switch silent refresh to the distinct `'mcp_force_refresh_tokens'` flow
type so coalescing still works but stale `mcp_get_tokens` cache entries
are not reused. After a successful refresh, invalidate the
`mcp_get_tokens` flow cache so the next `getOAuthTokens` call reads the
freshly persisted tokens from storage rather than the stale cached
value.

Add a regression test that simulates the real
`FlowStateManager.createFlowWithHandler` cache-hit behavior for
`mcp_get_tokens` and verifies the silent refresh handler still runs and
returns the freshly refreshed tokens.

* fix: Address Codex round-2 review on silent MCP OAuth refresh

Three follow-up findings from Codex on PR #13369:

1. The new `mcp_force_refresh_tokens` flow type was itself cached by
   `FlowStateManager.createFlowWithHandler`, so a subsequent 401 within
   the refreshed token's `expires_at` could re-serve the just-rejected
   token without ever re-running the refresh handler.

2. The factory's `oauthRequired` listener was removed immediately after
   the initial `attemptToConnect` succeeded, so a real mid-session 401
   emitted by `MCPConnection.connectClient` during transport recovery
   had no listener — the OAuth handled-promise would simply time out
   instead of triggering the silent refresh.

3. Routing the silent refresh through a distinct flow type broke
   coalescing with the `mcp_get_tokens` lock used by `getOAuthTokens`,
   letting two paths concurrently redeem the same stored refresh token.
   For providers that rotate refresh tokens (e.g. Azure Entra) the
   second redemption is rejected, kicking the user back into interactive
   OAuth despite a successful refresh elsewhere.

Resolution:

- Drop `FlowStateManager` from the silent-refresh path entirely. Replace
  with a process-local `inflightSilentRefreshes` Map keyed by
  `userId:serverName` that holds only the in-flight Promise (no cached
  result), so every fresh 401 after settlement triggers a fresh
  redemption while concurrent 401s for the same user/server still share
  one redemption.
- Stop calling `cleanupOAuthHandlers()` on successful initial connect,
  keeping the OAuth handler attached for the connection's lifetime so
  mid-session 401s actually reach `attemptSilentTokenRefresh`.
- Add a regression test reproducing the stale-cache scenario by faking
  the `mcp_get_tokens` cache hit and asserting silent refresh still runs
  against storage and returns the fresh tokens.
- Add a coalescing test asserting two concurrent oauthRequired events
  for the same user/server result in a single `forceRefreshTokens` call.
- Clear `inflightSilentRefreshes` in `beforeEach` to prevent
  cross-test leakage; switch the silent-refresh test mocks to
  `mockResolvedValueOnce` / `mockImplementationOnce` so leftover mock
  state cannot leak into later test cases.

Acknowledged remaining gap: the silent refresh still races
`getOAuthTokens`'s `mcp_get_tokens` flow when both run concurrently
(narrow window when an existing connection's local `expires_at` is
still valid but the server invalidated the token, and a new connection
is being created in parallel). The race is self-healing on the next
401 and documented inline.

* fix: Address Codex round-3 review on silent MCP OAuth refresh

Three more findings from Codex on PR #13369:

1. The in-flight silent-refresh promise was unbounded. If
   `forceRefreshTokens()` ever hung (slow provider, dropped TCP), the
   `inflightSilentRefreshes` lock stayed occupied forever and every
   later 401 for the same user/server joined the stuck promise instead
   of starting a fresh attempt or falling back to interactive OAuth.

2. The interactive-OAuth fallback didn't invalidate the
   `mcp_get_tokens` flow cache after persisting fresh tokens. For
   providers that don't issue refresh tokens (so silent refresh
   returns null), the old cache could still feed stale access tokens
   to the next `getOAuthTokens` call until its TTL expired — causing
   an immediate reconnect with the same just-rejected token.

3. When silent refresh failed, the handler fell through to
   `handleOAuthRequired()` whose recent-completion fast path can
   reuse a COMPLETED `mcp_oauth` flow within `PENDING_STALE_MS`. Those
   cached tokens are exactly the ones the server just rejected, so
   the connection would keep adopting them and looping on 401s until
   the cache aged out.

Resolution:

- Wrap `runSilentRefresh()` with a 60-second `withTimeout` (well under
  `connectClient`'s 120s OAuth timeout). On timeout the `.catch`
  resolves to null and the `finally` clears the in-flight entry, so
  the next 401 starts fresh and falls through to interactive OAuth.
- Extract two helpers — `invalidateGetTokensFlow` and
  `invalidateCompletedOAuthFlow` — and call them from the right
  branches: clear `mcp_get_tokens` after silent-refresh success AND
  after interactive-OAuth `storeTokens`; clear the COMPLETED
  `mcp_oauth` state (plus its CSRF mapping) before falling through to
  interactive OAuth so the fast-reuse path can't re-serve the
  rejected tokens.
- Add three regression tests: hung refresh release-the-lock under
  fake timers, completed-OAuth cache invalidation pre-fallback, and
  `mcp_get_tokens` invalidation after interactive token store.

* fix: Address Codex round-4 review on silent MCP OAuth refresh

Three more findings from Codex on PR #13369:

1. (P1) The silent-refresh in-flight lock keyed only by
   `userId:serverName`. In multi-tenant setups where two tenants share a
   userId (e.g. username-based IDs) and the same MCP server name, a
   concurrent mid-session 401 from tenant B would join tenant A's
   in-flight refresh and adopt tenant A's freshly minted tokens onto a
   tenant-B connection — a cross-tenant credential leak.

2. (P2) `invalidateGetTokensFlow` deleted the `mcp_get_tokens` flow
   state regardless of its status. When another connection was
   currently in `getOAuthTokens()` (PENDING flow) and joiners were
   monitoring it, the unconditional delete made those waiters see
   "Flow state not found" and unnecessarily fall back to interactive
   OAuth — even though fresh tokens were already being written.

3. (P2) The 60s `withTimeout` wrapping `runSilentRefresh()` only races
   the promise; it does not cancel the underlying `forceRefreshTokens`
   /  refresh-token HTTP request. If the request returned after a
   subsequent interactive OAuth had stored newer tokens, the late
   completion would `storeTokens` over the newer state. This requires
   a provider that doesn't rotate refresh tokens AND a refresh slower
   than 60s AND a successful interactive OAuth in that window — narrow
   but real.

Resolution:

- Capture `getTenantId()` into a new `factory.tenantId` field at
  factory construction time (before the OAuth handler closes over it
  outside the original request's async context) and include it in the
  silent-refresh lock key as `tenantId:userId:serverName`.
- `invalidateGetTokensFlow` now calls `getFlowState` first and only
  deletes when `status === 'COMPLETED'`. PENDING lookups are left
  alone so concurrent `getOAuthTokens` waiters via `monitorFlow` can
  still settle.
- For (3), document the race as a known limitation inline. Fully
  closing it requires threading an `AbortSignal` through
  `MCPTokenStorage.forceRefreshTokens` and the OAuth refresh handler
  to skip the late `storeTokens` after timeout — out of scope for this
  PR's surgical change.
- Add `getTenantId` to the `MCPOAuthConnectionEvents` test's
  `@librechat/data-schemas` mock so the factory constructor doesn't
  blow up under that suite.
- Add three regression tests: per-tenant lock isolation, PENDING-state
  preservation under `invalidateGetTokensFlow`, and (reused) the
  existing interactive-store invalidation test now driven through
  `getFlowState` returning the COMPLETED state.

* fix: Address silent MCP OAuth refresh review

Restore captured tenant context around token storage and OAuth fallback paths so mid-session callbacks do not lose tenant scope.

Thread AbortSignal through forced refresh and OAuth token requests, cap silent refresh by the connection OAuth timeout, and prevent timed-out refreshes from writing stale credentials after fallback.

Complete pending mcp_get_tokens flows with fresh tokens, add missing FlowState createdAt test fixtures, and cover the new tenant/abort/cache behaviors.

* fix: Tighten tenant-scoped MCP token refresh

Cap silent refresh by both the factory connect timeout and the connection OAuth wait timeout so fallback OAuth wins before the outer connect attempt expires.

Tenant-scope mcp_get_tokens flow ids for both token lookup and refresh invalidation, preventing cross-tenant flow completion or cache deletion when tenants share user ids and server names.

Add regression tests for the omitted initTimeout budget and tenant-prefixed token flow locks.

* fix: Reserve MCP OAuth fallback budget

* fix: Harden MCP OAuth refresh races

* fix: Keep MCP OAuth fallback route-compatible

* test: Add SDK MCP OAuth refresh repro

* fix: Address MCP OAuth refresh review findings

* fix: Address MCP OAuth tenant review findings

* fix: Close MCP OAuth route tenant gaps

* fix: Preserve MCP OAuth refresh flow guards

* fix: Avoid reprocessing MCP OAuth reauth config

* fix: Release timed-out MCP refresh locks

* fix: Release MCP OAuth request callbacks

* fix: Tenant-scope remaining MCP OAuth flow lookups

* ci: Sort imports in MCP OAuth test suites
2026-06-10 13:12:42 -04:00

452 lines
14 KiB
TypeScript

import { Keyv } from 'keyv';
import { logger } from '@librechat/data-schemas';
import type { StoredDataNoRaw } from 'keyv';
import type { FlowState, FlowMetadata, FlowManagerOptions } from './types';
import { registerShutdownTask } from '../app/shutdown';
import { math } from '~/utils/math';
/**
* Lifetime of a PENDING OAuth flow: how long the auth button stays valid and an
* in-flight flow can be reused before it is replaced. Mirrors
* `mcpConfig.OAUTH_HANDLING_TIMEOUT` (`MCP_OAUTH_HANDLING_TIMEOUT`) so the reuse
* window matches the wait the server grants the user. Default: 10 minutes.
*/
export const PENDING_STALE_MS: number = math(
process.env.MCP_OAUTH_HANDLING_TIMEOUT ?? 10 * 60 * 1000,
);
const SECONDS_THRESHOLD = 1e10;
/**
* Normalizes an expiration timestamp to milliseconds.
* Timestamps below 10 billion are assumed to be in seconds (valid until ~2286).
*/
export function normalizeExpiresAt(timestamp: number): number {
return timestamp < SECONDS_THRESHOLD ? timestamp * 1000 : timestamp;
}
export class FlowStateManager<T = unknown> {
private keyv: Keyv;
private ttl: number;
private intervals: Set<NodeJS.Timeout>;
constructor(store: Keyv, options?: FlowManagerOptions) {
if (!options) {
options = { ttl: 60000 * 3 };
}
const { ci = false, ttl } = options;
if (!ci && !(store instanceof Keyv)) {
throw new Error('Invalid store provided to FlowStateManager');
}
this.ttl = ttl;
this.keyv = store;
this.intervals = new Set();
if (!ci) {
this.setupCleanupHandlers();
}
}
private setupCleanupHandlers() {
// Register cleanup with the centralized graceful-shutdown coordinator
// (see ../app/shutdown.ts) rather than attaching direct signal
// handlers — multiple competing handlers race the HTTP drain.
registerShutdownTask('flow manager cleanup', () => {
logger.info('Cleaning up FlowStateManager intervals...');
this.intervals.forEach((interval) => clearInterval(interval));
this.intervals.clear();
});
}
/**
* Flow keys are intentionally NOT tenant-scoped. OAuth callbacks arrive
* without tenant ALS context (the provider redirect doesn't carry
* X-Tenant-Id). Flow IDs are random UUIDs with no collision risk, and
* flow data is ephemeral (TTL-bounded, no sensitive user content).
*/
private getFlowKey(flowId: string, type: string): string {
return `${type}:${flowId}`;
}
private isTokenExpired(flowState: FlowState<T> | undefined): boolean {
if (!flowState?.result || typeof flowState.result !== 'object') {
return false;
}
if (!('expires_at' in flowState.result)) {
return false;
}
const expiresAt = (flowState.result as { expires_at: unknown }).expires_at;
if (typeof expiresAt !== 'number' || !Number.isFinite(expiresAt)) {
return false;
}
return normalizeExpiresAt(expiresAt) < Date.now();
}
/**
* Stores initial PENDING flow state without starting the monitor loop.
* Use this when you need to guarantee the state is persisted before
* performing an action (e.g., an OAuth redirect), then call createFlow()
* separately to start monitoring for completion.
*/
async initFlow(flowId: string, type: string, metadata: FlowMetadata = {}): Promise<void> {
const flowKey = this.getFlowKey(flowId, type);
const initialState: FlowState = {
type,
status: 'PENDING',
metadata,
createdAt: Date.now(),
};
logger.debug(`[${flowKey}] Storing initial flow state`);
await this.keyv.set(flowKey, initialState, this.ttl);
}
/**
* Creates a new flow and waits for its completion
*/
async createFlow(
flowId: string,
type: string,
metadata: FlowMetadata = {},
signal?: AbortSignal,
): Promise<T> {
const flowKey = this.getFlowKey(flowId, type);
let existingState = (await this.keyv.get(flowKey)) as FlowState<T> | undefined;
if (existingState) {
logger.debug(`[${flowKey}] Flow already exists`);
return this.monitorFlow(flowKey, type, signal);
}
await new Promise((resolve) => setTimeout(resolve, 250));
existingState = (await this.keyv.get(flowKey)) as FlowState<T> | undefined;
if (existingState) {
logger.debug(`[${flowKey}] Flow exists on 2nd check`);
return this.monitorFlow(flowKey, type, signal);
}
const initialState: FlowState = {
type,
status: 'PENDING',
metadata,
createdAt: Date.now(),
};
logger.debug(`[${flowKey}] Creating initial flow state`);
await this.keyv.set(flowKey, initialState, this.ttl);
return this.monitorFlow(flowKey, type, signal);
}
private monitorFlow(flowKey: string, type: string, signal?: AbortSignal): Promise<T> {
return new Promise<T>((resolve, reject) => {
const checkInterval = 2000;
let elapsedTime = 0;
let isCleanedUp = false;
let intervalId: NodeJS.Timeout | null = null;
let missingStateRetried = false;
let isRetrying = false;
// Cleanup function to avoid duplicate cleanup
const cleanup = () => {
if (isCleanedUp) return;
isCleanedUp = true;
if (intervalId) {
clearInterval(intervalId);
this.intervals.delete(intervalId);
}
if (signal && abortHandler) {
signal.removeEventListener('abort', abortHandler);
}
};
// Immediate abort handler - responds instantly to abort signal
const abortHandler = async () => {
cleanup();
logger.warn(`[${flowKey}] Flow aborted (immediate)`);
const message = `${type} flow aborted`;
try {
await this.keyv.delete(flowKey);
} catch {
// Ignore delete errors during abort
}
reject(new Error(message));
};
// Register abort handler immediately if signal provided
if (signal) {
if (signal.aborted) {
// Already aborted, reject immediately
cleanup();
reject(new Error(`${type} flow aborted`));
return;
}
signal.addEventListener('abort', abortHandler, { once: true });
}
intervalId = setInterval(async () => {
if (isCleanedUp || isRetrying) return;
try {
let flowState = (await this.keyv.get(flowKey)) as FlowState<T> | undefined;
if (!flowState) {
if (!missingStateRetried) {
missingStateRetried = true;
isRetrying = true;
logger.warn(
`[${flowKey}] Flow state not found, retrying once after 500ms (race recovery)`,
);
await new Promise((r) => setTimeout(r, 500));
flowState = (await this.keyv.get(flowKey)) as FlowState<T> | undefined;
isRetrying = false;
}
if (!flowState) {
cleanup();
logger.error(`[${flowKey}] Flow state not found after retry`);
reject(new Error(`${type} Flow state not found`));
return;
}
}
if (signal?.aborted) {
cleanup();
logger.warn(`[${flowKey}] Flow aborted`);
const message = `${type} flow aborted`;
await this.keyv.delete(flowKey);
reject(new Error(message));
return;
}
if (flowState.status !== 'PENDING') {
cleanup();
logger.debug(`[${flowKey}] Flow completed`);
if (flowState.status === 'COMPLETED' && flowState.result !== undefined) {
resolve(flowState.result);
} else if (flowState.status === 'FAILED') {
await this.keyv.delete(flowKey);
reject(new Error(flowState.error ?? `${type} flow failed`));
}
return;
}
elapsedTime += checkInterval;
if (elapsedTime >= this.ttl) {
cleanup();
logger.error(
`[${flowKey}] Flow timed out | Elapsed time: ${elapsedTime} | TTL: ${this.ttl}`,
);
await this.keyv.delete(flowKey);
reject(new Error(`${type} flow timed out`));
}
logger.debug(`[${flowKey}] Flow state elapsed time: ${elapsedTime}, checking again...`);
} catch (error) {
logger.error(`[${flowKey}] Error checking flow state:`, error);
cleanup();
reject(error);
}
}, checkInterval);
this.intervals.add(intervalId);
});
}
/**
* Completes a flow successfully
*/
async completeFlow(flowId: string, type: string, result: T): Promise<boolean> {
const flowKey = this.getFlowKey(flowId, type);
const flowState = (await this.keyv.get(flowKey)) as FlowState<T> | undefined;
if (!flowState) {
logger.warn(
`[FlowStateManager] completeFlow: flow not found — key=${flowKey}. ` +
'Possible causes: flow TTL expired before callback arrived, flow was never created, or ' +
'the callback is routing to a different instance without shared Keyv storage.',
{ flowId, type },
);
return false;
}
/** Prevent duplicate completion */
if (flowState.status === 'COMPLETED') {
logger.debug(
'[FlowStateManager] Flow already completed, skipping to prevent duplicate completion',
{
flowId,
type,
},
);
return true;
}
const updatedState: FlowState<T> = {
...flowState,
status: 'COMPLETED',
result,
completedAt: Date.now(),
};
await this.keyv.set(flowKey, updatedState, this.ttl);
logger.debug('[FlowStateManager] Flow completed successfully', {
flowId,
type,
});
return true;
}
/**
* Checks if a flow is stale based on its age and status
* @param flowId - The flow identifier
* @param type - The flow type
* @param staleThresholdMs - Age in milliseconds after which a non-pending flow is considered stale (default: 2 minutes)
* @returns Object with isStale boolean and age in milliseconds
*/
async isFlowStale(
flowId: string,
type: string,
staleThresholdMs: number = PENDING_STALE_MS,
): Promise<{ isStale: boolean; age: number; status?: string }> {
const flowKey = this.getFlowKey(flowId, type);
const flowState = (await this.keyv.get(flowKey)) as FlowState<T> | undefined;
if (!flowState) {
return { isStale: false, age: 0 };
}
if (flowState.status === 'PENDING') {
return { isStale: false, age: 0, status: flowState.status };
}
const completedAt = flowState.completedAt || flowState.failedAt;
const createdAt = flowState.createdAt;
let flowAge = 0;
if (completedAt) {
flowAge = Date.now() - completedAt;
} else if (createdAt) {
flowAge = Date.now() - createdAt;
}
return {
isStale: flowAge > staleThresholdMs,
age: flowAge,
status: flowState.status,
};
}
/**
* Marks a flow as failed
*/
async failFlow(flowId: string, type: string, error: Error | string): Promise<boolean> {
const flowKey = this.getFlowKey(flowId, type);
const flowState = (await this.keyv.get(flowKey)) as FlowState | undefined;
if (!flowState) {
return false;
}
if (flowState.status === 'COMPLETED') {
logger.debug(
'[FlowStateManager] Flow already completed, skipping failure to prevent overwrite',
{
flowId,
type,
},
);
return true;
}
const updatedState: FlowState = {
...flowState,
status: 'FAILED',
error: error instanceof Error ? error.message : error,
failedAt: Date.now(),
};
await this.keyv.set(flowKey, updatedState, this.ttl);
return true;
}
/**
* Gets current flow state
*/
async getFlowState(flowId: string, type: string): Promise<StoredDataNoRaw<FlowState<T>> | null> {
const flowKey = this.getFlowKey(flowId, type);
return this.keyv.get(flowKey);
}
/**
* Creates a new flow and waits for its completion, only executing the handler if no existing flow is found
* @param flowId - The ID of the flow
* @param type - The type of flow
* @param handler - Async function to execute if no existing flow is found
* @param signal - Optional AbortSignal to cancel the flow
*/
async createFlowWithHandler(
flowId: string,
type: string,
handler: () => Promise<T>,
signal?: AbortSignal,
): Promise<T> {
const flowKey = this.getFlowKey(flowId, type);
let existingState = (await this.keyv.get(flowKey)) as FlowState<T> | undefined;
if (existingState && !this.isTokenExpired(existingState)) {
logger.debug(`[${flowKey}] Flow already exists with valid token`);
return this.monitorFlow(flowKey, type, signal);
}
await new Promise((resolve) => setTimeout(resolve, 250));
existingState = (await this.keyv.get(flowKey)) as FlowState<T> | undefined;
if (existingState && !this.isTokenExpired(existingState)) {
logger.debug(`[${flowKey}] Flow exists on 2nd check with valid token`);
return this.monitorFlow(flowKey, type, signal);
}
const initialState: FlowState = {
type,
status: 'PENDING',
metadata: {},
createdAt: Date.now(),
};
logger.debug(`[${flowKey}] Creating initial flow state`);
await this.keyv.set(flowKey, initialState, this.ttl);
try {
const result = await handler();
await this.completeFlow(flowId, type, result);
const completedState = (await this.keyv.get(flowKey)) as FlowState<T> | undefined;
if (completedState?.status === 'COMPLETED' && completedState.result !== undefined) {
return completedState.result;
}
return result;
} catch (error) {
await this.failFlow(flowId, type, error instanceof Error ? error : new Error(String(error)));
throw error;
}
}
/**
* Deletes a flow state
*/
async deleteFlow(flowId: string, type: string): Promise<boolean> {
const flowKey = this.getFlowKey(flowId, type);
try {
await this.keyv.delete(flowKey);
logger.debug(`[${flowKey}] Flow deleted`);
return true;
} catch (error) {
logger.error(`[${flowKey}] Error deleting flow:`, error);
return false;
}
}
}