llm-cli-gateway 1.6.0 → 1.7.0
- package/CHANGELOG.md +205 -0
- package/README.md +4 -4
- package/dist/async-job-manager.d.ts +70 -2
- package/dist/async-job-manager.js +166 -6
- package/dist/codex-json-parser.js +4 -1
- package/dist/index.js +81 -18
- package/dist/job-store.d.ts +43 -4
- package/dist/job-store.js +28 -2
- package/package.json +1 -1
- package/socket.yml +19 -0
package/CHANGELOG.md
CHANGED
@@ -2,6 +2,211 @@
 
 All notable changes to the llm-cli-gateway project.
 
+## [1.7.0] - 2026-05-26 — cache-awareness slice 1.5 (async-path flight recorder + codex parser fix)
+
+Closes the two telemetry gaps that v1.6.0 explicitly deferred: async-path
+flight-recorder integration and Codex parser support for the actual
+`cached_input_tokens` field the current Codex CLI emits. Both ship
+together because they jointly close out `cache_state://*` completeness
+for the async tools and the codex CLI.
+
+### Added — async-path flight recorder writes
+
+- `AsyncJobManager` now accepts a `FlightRecorderLike` constructor
+  dependency (defaults to `NoopFlightRecorder` for tests that don't
+  inject one). `StartJobOptions` extended with `writeFlightStart`,
+  `flightRecorderEntry`, and `extractUsage` — pure async tools
+  (`*_request_async`) pass `writeFlightStart: true` so the manager owns
+  the row. The legacy positional `startJob(...)` signature was extended
+  with trailing optional params so existing callers keep working.
+- New private `writeFlightComplete` helper inside the manager fires on
+  every terminal-state code path (close handler, error handler, idle
+  timeout, output overflow, cancelJob, evictCompletedJobs dead-process
+  and exited-mismatch branches). Failure payload mirrors sync-helper
+  semantics: `response = stderr || stdout` on failure, `errorMessage`
+  falls back through override → `job.error` → `job.stderr` →
+  `"Exit code N"`. Single-shot guard set only on successful write so a
+  thrown `logComplete` can be retried by a later terminal callback.
+- New public `armFlightCompleteForDeferral(jobId)` on AsyncJobManager.
+  Called by `awaitJobOrDefer` in `src/index.ts` immediately before
+  returning a `DeferredJobResponse` — this lets the sync handler keep
+  ownership of the rich-metadata `safeFlightComplete` write for
+  sync-inline completions, while still ensuring deferred-from-sync rows
+  get a terminal `logComplete` from the manager when the underlying job
+  finishes. Includes a race-mitigation immediate-write path if the job
+  already terminated before the arm signal landed.
+- `JobStore.markOrphanedOnStartup()` return shape extended from `number`
+  to `{ count, orphaned: Array<{ id, correlationId, startedAt, stdout,
+  stderr, exitCode }> }` so the manager constructor can write FR
+  `logComplete` rows for previously orphaned jobs with proper audit data
+  (durationMs from `startedAt`, response from `stderr || stdout`,
+  errorMessage `"orphaned after gateway restart"`). `SqliteJobStore`
+  SELECTs the per-orphan fields before the orphan-flip UPDATE; no
+  transaction wrapper needed because gateway boot is single-threaded
+  before any new jobs can arrive. `MemoryJobStore` returns
+  `{ count: 0, orphaned: [] }` (in-process state can't be orphaned).
+  Breaking change to the `JobStore` interface; the `PostgresJobStore`
+  stub was updated to match (the impl is still not yet shipped).
+- `cache_state://global`, `cache_state://session/{id}`, and
+  `cache_state://prefix/{hash}` aggregates now include async-job
+  activity. No query changes — `cache_state://*` already didn't filter
+  on `asyncJobId`, so the new rows participate naturally.
+
+### Fixed — Codex parser accepts current CLI's cache-token field
+
+- `src/codex-json-parser.ts` now reads `cached_input_tokens` (preferred,
+  what Codex CLI ≥0.133.0 emits) in addition to the legacy
+  `cache_read_input_tokens` and the bare `cache_read_tokens` fallback.
+  Live smoke-tested against Codex CLI on 2026-05-26 — see
+  `docs/personal-mcp/PROVIDER_CACHE_SURFACES.md` "Codex — field name
+  divergence" for the exact invocation. Cache hits on codex rows now
+  populate the FR's `cache_read_tokens` column.
+
+### Known limitation — sync-deferred-dedup orphan rows
+
+When a sync request dedup-hits an in-flight original job AND the sync
+deadline expires before the original finishes, the dedup'd caller's
+sync-side `logStart` row stays at `status='started'` forever. The
+manager's `logComplete` writes to the ORIGINAL job's correlationId, not
+the dedup'd caller's. This is a pre-existing limitation surfaced by the
+slice's clearer accounting; it predates v1.7.0 and is not a regression.
+A future slice can address it via per-request corrId fan-out.
+
+### Cross-table asymmetry — `canceled` / `orphaned` jobs in the FR
+
+`FlightLogResult.status` only carries `"completed" | "failed"`, so
+canceled and orphaned async jobs are encoded as `"failed"` plus a
+distinguishing `errorMessage`. The underlying `jobs` table in JobStore
+retains the distinct `"canceled"` / `"orphaned"` statuses for
+`getJobSnapshot` callers. External consumers of `~/.llm-cli-gateway/
+logs.db` that filter `status='failed'` will count cancels and boot-time
+orphans as errors; `cache_state://*` aggregation does not distinguish.
+
+### No config or schema changes
+
+No migration. No new opt-in flag. The new behaviour is gated solely on
+whether the caller (handler or `awaitJobOrDefer`) supplies a
+`flightRecorderEntry` to `startJobWithDedup`. Tests/callers that don't
+opt in see no behaviour change (the constructor's default
+`NoopFlightRecorder` short-circuits the FR writes).
+
+### Migration impact
+
+None. SQLite schema and TOML config surface are byte-identical to
+v1.6.1. Rollback is non-destructive (revert the release commit).
+
+### Documentation
+
+- `docs/plans/async-flight-recorder.dag.toml` — new slice plan (Unit A
+  unanimously approved across Codex/Gemini/Grok/Mistral).
+- `docs/plans/async-flight-recorder.pr-body.md` — new PR description.
+- `docs/personal-mcp/ASYNC_FLIGHT_RECORDER_SURFACES.md` — new research
+  note documenting every terminal state, the data contract per FR write
+  site, the sync-path responsibility split table, and the cancel /
+  orphan / dedup limitations.
+- `docs/personal-mcp/PROVIDER_CACHE_SURFACES.md` — Codex section updated
+  to reflect that the parser now accepts `cached_input_tokens`; slice 2
+  "Populated for **claude only** today" claim corrected to include
+  codex.
+- `docs/launch/blog-cache-awareness.md` — slice 1.5 follow-up note in
+  the "What's next" section.
+
+## [1.6.1] - 2026-05-26 — docs-only follow-up to 1.6.0
+
+Pure documentation release; zero source-code changes since 1.6.0.
+
+### Changed — agent-install guidance current with v1.6.0 + five providers
+
+- New `setup/providers/mistral-vibe.md` provider snippet (Mistral was the
+  fifth provider but had no setup/providers/ page; install agents had
+  nothing to point at when the user asked for Mistral coverage).
+- New `setup/assistants/mistral-install-prompt.md` per-assistant install
+  prompt (mirrors the Grok prompt; outbound-only framing,
+  session_logging walk-through, `VIBE_ACTIVE_MODEL` guidance, secret-
+  safety rules preserved).
+- `setup/assistants/ASSISTANT_CONTRACT.md`: Mistral added to "Applies
+  to" and outbound providers; new "Doctor Report Notes (v1.6.0)"
+  paragraph clarifying that the `cache_awareness` block is structural
+  (always present) and that all `[cache_awareness]` flags default off.
+- All 6 per-assistant install prompts (universal, chatgpt, claude,
+  codex, gemini, grok) extended to enumerate all five providers and
+  reference the cache_awareness doctor block.
+- `setup/install-plan.dag.toml` choose-targets / check-diagnostics /
+  apply-client-snippet steps generalised to all five providers; Mistral
+  named outbound-only; cache_awareness must-not-treat-as-blocker note
+  added inline. TOML re-validated.
+- 6 `docs/personal-mcp/connect-*.md` legacy pages now carry an
+  admonition pointing to `setup/providers/` + `ASSISTANT_CONTRACT.md`
+  as canonical.
+
+### Changed — 12 SKILL.md files current with v1.6.0
+
+- All 12 skills (7 under `skills/`, 5 under `.agents/skills/`) extended
+  with `promptParts`, `cache_state://` MCP resources, and (where the
+  skill's centre of gravity is session continuity) the
+  `cache_ttl_expiring_soon` warning. Depth tiered by skill audience:
+  multi-llm-orchestration, model-routing, multi-llm-consensus,
+  implement-review-fix, multi-llm-review, async-job-orchestration,
+  session-workflow, secure-orchestration carry full sections or
+  examples; agent-codex-gate, codex-review-gate, design-review-cycle,
+  red-team-assessment carry tip-level mentions.
+- Plugin-namespaced skills (`.agents/skills/*`) version-bumped 1.5 → 1.6.
+- Exact runtime strings cross-checked against `src/index.ts` (the
+  `provide exactly one of …` / `one of … is required` mutex errors and
+  the `cache_ttl_expiring_soon` warning code).
+
+### Fixed — README / BEST_PRACTICES / integrations doc drift
+
+- README.md: headline + Core Capabilities now name Mistral as the fifth
+  provider; test counts 284 / 221 → 681; new Supply-chain hardening
+  call-out under Security & Quality.
+- BEST_PRACTICES.md: testing coverage / performance lines 284 → 681.
+- integrations/llm-plugin/README.md: Grok + Mistral added to providers
+  list, usage examples, and the "at least one of" requirements list.
+- ENFORCEMENT.md: self-enforcement checklist provider list now Claude /
+  Codex / Gemini / Grok / Mistral.
+
+### Fixed — `docs/launch/blog-cache-awareness.md` accuracy + voice
+
+Technical corrections from the multi-LLM voice + technical review:
+- Mutually-exclusive error-string quotation reformatted so the
+  ``provide exactly one of `prompt` or `promptParts``` example renders
+  correctly in markdown.
+- `lastWriteAt` references corrected to `lastRequestAt` (the actual
+  public field name on `SessionCacheStats`).
+- Security tools sentence rewritten: separates SHA-pinned actions,
+  version-pinned Python/Go tools, and the SHA256-verified gitleaks
+  binary; clarifies that `eslint-plugin-security` runs via the existing
+  eslint config (not security.yml); replaces the inaccurate "Top-level
+  `permissions: contents: read` on every workflow" claim with the
+  accurate least-privilege phrasing.
+- "Signed installer artefacts" → "SHA256-verifiable installer artefacts"
+  (no signing today); npm note adds the sigstore-provenance context.
+- Haiku 3.5 Vertex 2048 caveat added: the in-code alias table
+  conservatively collapses all Haiku variants to 4096.
+- Solorigate / Codecov / xz now link separately.
+- Codex smoke-test evidence now links to
+  `docs/personal-mcp/PROVIDER_CACHE_SURFACES.md` and the CHANGELOG.
+- Three broken links surfaced by lychee CI fixed: Mistral Vibe URL,
+  bare CLAUDE.md link (the file lives outside the gateway repo), and
+  the agent-assurance exclude regex tightened to match bare URLs.
+
+### Fixed — `socket.yml` networkAccess false-positive documentation
+
+- Documented that the `globalThis["fetch"]` flag on `dist/index.js` /
+  `dist/job-store.js` is a substring-match false positive. Neither file
+  contains any actual fetch call; the matches are English-prose
+  occurrences in an error message, the `fetchWith` JSON field name, and
+  a code comment. Verified by sub-agent investigation, no code change
+  required, no attack-surface delta vs 1.5.35.
+
+### Fixed — `lychee.toml` exclusions
+
+- Added `https://npmjs.com/`, `https://help.openai.com/`, and bare
+  `github.com/verivus-oss/agent-assurance` URLs to the exclude list
+  (each is a Cloudflare bot-blocked / private host that returns
+  4xx/5xx to anonymous CI requests). Rationale documented inline.
+
 ## [1.6.0] - 2026-05-26 — cache-awareness phase 1 + security posture
 
 Also includes (beyond cache-awareness):
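The Codex parser entry in the changelog above names three cache-token fields and a preference order (`cached_input_tokens` first, then `cache_read_input_tokens`, then `cache_read_tokens`). The sketch below shows one way such a lookup could be written. It is not taken from `codex-json-parser.js`: the function name `pickCacheReadTokens`, the `CodexUsageLike` type, and the rule that only finite non-negative numbers are accepted are all assumptions made for illustration.

```typescript
// Hypothetical shape: only the three field names come from the changelog.
type CodexUsageLike = {
  cached_input_tokens?: unknown;
  cache_read_input_tokens?: unknown;
  cache_read_tokens?: unknown;
};

// Field names in the preference order the changelog describes.
const CACHE_READ_FIELDS = [
  "cached_input_tokens",
  "cache_read_input_tokens",
  "cache_read_tokens",
] as const;

// Hypothetical helper: returns the first usable value, or undefined.
function pickCacheReadTokens(usage: CodexUsageLike): number | undefined {
  for (const field of CACHE_READ_FIELDS) {
    const value = usage[field];
    // Assumption: accept only finite, non-negative numbers; skip anything else.
    if (typeof value === "number" && Number.isFinite(value) && value >= 0) {
      return value;
    }
  }
  return undefined;
}
```

Keeping the field names in one ordered array means a future rename in the CLI output would only need a new entry at the front of the list.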
package/README.md
CHANGED
@@ -3,7 +3,7 @@
 > *"Without consultation, plans are frustrated, but with many counselors they succeed."*
 > — Proverbs 15:22 (LSB)
 
-A Model Context Protocol (MCP) server providing unified access to Claude Code, Codex, Gemini, and
+A Model Context Protocol (MCP) server providing unified access to Claude Code, Codex, Gemini, Grok, and Mistral (Vibe) CLIs with session management, retry logic, and async job orchestration.
 
 ## Personal MCP Appliance MVP
 
@@ -79,7 +79,7 @@ docker compose -f docker-compose.personal.yml run --rm doctor
 ## Features
 
 ### Core Capabilities
-- **Multi-LLM Orchestration**: Unified interface for Claude Code, Codex, Gemini, and
+- **Multi-LLM Orchestration**: Unified interface for Claude Code, Codex, Gemini, Grok, and Mistral (Vibe) CLIs
 - **Session Management**: Track and resume conversations across all CLIs with persistent storage
 - **Token Optimization**: Automatic 44% reduction on prompts, 37% on responses (opt-in)
 - **Correlation ID Tracking**: Full request tracing across all LLM interactions
@@ -127,12 +127,12 @@ Opt-in flags (all default off) live under `[cache_awareness]` in `~/.llm-cli-gat
 - **Long-Running Jobs**: Non-time-bound async execution via `*_request_async` + polling tools
 
 ### Security & Quality
-- **Comprehensive Testing**:
+- **Comprehensive Testing**: 681 tests covering unit, integration, and regression scenarios with real CLI execution
 - **Input Validation**: Zod schemas prevent injection attacks
 - **No Secret Leakage**: Generic session descriptions only (file permissions 0o600)
 - **No ReDoS**: Bounded regex patterns prevent catastrophic backtracking
 - **Type Safety**: Strict TypeScript with comprehensive error handling
-- **
+- **Supply-chain hardening**: a dedicated `.github/workflows/security.yml` runs actionlint, zizmor, shellcheck, typos, osv-scanner, gitleaks, ruff, bandit, and lychee on every push and PR (see `SECURITY.md` for the threat model)
 
 ## Prerequisites
 
package/dist/async-job-manager.d.ts
CHANGED
@@ -1,8 +1,35 @@
 import type { Logger } from "./logger.js";
 import { type JobHealth } from "./process-monitor.js";
 import { JobStore } from "./job-store.js";
+import { type FlightRecorderLike } from "./flight-recorder.js";
 export type LlmCli = "claude" | "codex" | "gemini" | "grok" | "mistral";
 export type AsyncJobStatus = "running" | "completed" | "failed" | "canceled" | "orphaned";
+/**
+ * Slice 1.5 flight-recorder payload supplied via StartJobOptions.
+ * Decomposed to primitive fields (no nested handler-locals) so retaining
+ * a reference on the in-memory job record doesn't pin large promptParts
+ * or attachments via closure scope.
+ */
+export interface AsyncJobFlightRecorderEntry {
+    model: string;
+    prompt: string;
+    sessionId?: string;
+    stablePrefixHash?: string;
+    stablePrefixTokens?: number;
+}
+/**
+ * Slice 1.5 usage-extraction callback. Closures MUST be constructed from
+ * primitive locals only (e.g. const fmt = params.outputFormat; closure
+ * captures fmt). Capturing the handler's full `params` object pins large
+ * promptParts/attachments for JOB_TTL_MS.
+ */
+export type AsyncJobUsageExtractor = (stdout: string) => {
+    inputTokens?: number;
+    outputTokens?: number;
+    cacheReadTokens?: number;
+    cacheCreationTokens?: number;
+    costUsd?: number;
+};
 export interface AsyncJobSnapshot {
     id: string;
     cli: LlmCli;
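The doc comment on `AsyncJobUsageExtractor` above asks callers to build the closure from primitive locals only. A minimal sketch of that pattern follows. The `HandlerParams` type, the `makeExtractor` name, and the assumed stdout layout (`{"usage": {"input_tokens": ..., "output_tokens": ...}}`) are hypothetical and exist only to make the example runnable; they are not the gateway's handler code.

```typescript
// Mirrors the AsyncJobUsageExtractor return type from the declaration above.
type Usage = {
  inputTokens?: number;
  outputTokens?: number;
  cacheReadTokens?: number;
  cacheCreationTokens?: number;
  costUsd?: number;
};
type UsageExtractor = (stdout: string) => Usage;

// Hypothetical handler params; promptParts stands in for a large payload.
interface HandlerParams {
  outputFormat: string;
  promptParts: string[];
}

function makeExtractor(params: HandlerParams): UsageExtractor {
  // Copy the one primitive the closure needs; the closure never references `params`.
  const fmt = params.outputFormat;
  return (stdout) => {
    if (fmt !== "json") return {};
    try {
      // Assumed stdout layout, for illustration only.
      const parsed = JSON.parse(stdout) as {
        usage?: { input_tokens?: number; output_tokens?: number };
      };
      return {
        inputTokens: parsed.usage?.input_tokens,
        outputTokens: parsed.usage?.output_tokens,
      };
    } catch {
      return {}; // unparseable output yields no usage instead of throwing
    }
  };
}
```

Because the returned function only references `fmt`, a job record that holds on to the extractor for `JOB_TTL_MS` has no reference path back to `promptParts` through this closure.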
@@ -45,6 +72,23 @@ export interface StartJobOptions {
      * etc.) that must persist for the lifetime of the spawned CLI process.
      */
     onComplete?: () => void;
+    /**
+     * Slice 1.5: when true, AsyncJobManager writes a flight-recorder logStart
+     * row at startJob entry using `flightRecorderEntry`. Pure async handlers
+     * (handle*RequestAsync) pass true because they have no upstream
+     * safeFlightStart writer. The sync-deferred path (awaitJobOrDefer) passes
+     * false because the upstream sync handler already wrote logStart keyed on
+     * the same correlationId — a second INSERT would crash on the PK.
+     */
+    writeFlightStart?: boolean;
+    /** Slice 1.5: payload for the FR logStart and the terminal logComplete. */
+    flightRecorderEntry?: AsyncJobFlightRecorderEntry;
+    /**
+     * Slice 1.5: invoked only on terminal `completed` to populate token-usage
+     * fields in the FR logComplete payload. Construct from primitive locals
+     * only (see AsyncJobUsageExtractor doc).
+     */
+    extractUsage?: AsyncJobUsageExtractor;
 }
 export interface StartJobOutcome {
     snapshot: AsyncJobSnapshot;
@@ -60,7 +104,8 @@ export declare class AsyncJobManager {
     private evictionTimer;
     private processMonitor;
     private store;
-
+    private flightRecorder;
+    constructor(logger?: Logger, onJobComplete?: ((cli: LlmCli, durationMs: number, success: boolean) => void) | undefined, store?: JobStore | null, flightRecorder?: FlightRecorderLike);
     /**
      * True iff a durable (or memory) job store is attached. The MCP-tool
      * registration layer ANDs this with persistence.asyncJobsEnabled when
@@ -81,6 +126,29 @@ export declare class AsyncJobManager {
      */
     private buildRequestKey;
     private fireOnComplete;
+    /**
+     * Slice 1.5: write the terminal flight-recorder row. Mirrors sync-path
+     * failure semantics (response = stderr||stdout on failure, errorMessage
+     * falls back through overrideErrorMessage → job.error → job.stderr →
+     * "Exit code N"). Single-shot guard set only on SUCCESSFUL write so a
+     * thrown logComplete can be retried by a later terminal callback; the
+     * FR's WHERE status='started' UPDATE guard remains the actual
+     * idempotency mechanism for the common "retry succeeds, original
+     * succeeded too" case.
+     */
+    private writeFlightComplete;
+    private safeExtractUsage;
+    /**
+     * R2 Codex-Unit-B F1: awaitJobOrDefer calls this when returning a
+     * deferred response. From this point on the sync handler will not write
+     * its own safeFlightComplete, so the manager takes over.
+     *
+     * Race mitigation: if the job already terminated between the sync
+     * deadline expiring and this method firing, write logComplete
+     * synchronously here so the previously-skipped terminal callback's
+     * write isn't lost.
+     */
+    armFlightCompleteForDeferral(jobId: string): void;
     private safeStoreCall;
     /**
      * Flush in-memory stdout/stderr to the durable store if anything changed
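The `writeFlightComplete` doc comment above describes two behaviours that are easy to miss: the error message falls back through a chain of sources, and the single-shot guard is set only after a write succeeds. The sketch below models both with a recorder that throws on its first call. `RecorderLike`, `JobLike`, `writeOnce`, and `flakyRecorder` are hypothetical names for illustration; only the field names and the fallback order echo the diff.

```typescript
// Hypothetical minimal types; only the field names echo the diff.
interface RecorderLike {
  logComplete(correlationId: string, result: { status: string; errorMessage?: string }): void;
}
interface JobLike {
  correlationId: string;
  exitCode: number | null;
  error?: string;
  stderr?: string;
  flightRecorderComplete: boolean;
}

// Hypothetical helper modelling a failed-job terminal write.
function writeOnce(recorder: RecorderLike, job: JobLike, override?: string): boolean {
  if (job.flightRecorderComplete) return false; // an earlier write already landed
  const exitCode = job.exitCode ?? 1;
  // `??` falls through only on null/undefined, so an empty-string stderr is kept as-is.
  const errorMessage = override ?? job.error ?? job.stderr ?? `Exit code ${exitCode}`;
  try {
    recorder.logComplete(job.correlationId, { status: "failed", errorMessage });
    job.flightRecorderComplete = true; // set only after the write succeeded
    return true;
  } catch {
    return false; // guard stays false, so a later terminal callback may retry
  }
}

// A recorder that throws on its first call, then records every later call.
function flakyRecorder(): RecorderLike & { rows: string[] } {
  let calls = 0;
  const rows: string[] = [];
  return {
    rows,
    logComplete(id: string, result: { status: string; errorMessage?: string }): void {
      calls += 1;
      if (calls === 1) throw new Error("simulated write failure");
      rows.push(`${id}:${result.errorMessage}`);
    },
  };
}
```

Calling `writeOnce` three times against `flakyRecorder()` for the same job gives a failed attempt, a successful retry, and then a no-op, which is the sequence the doc comment describes for repeated terminal callbacks.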
@@ -100,7 +168,7 @@ export declare class AsyncJobManager {
      * Existing callers keep working unchanged; forceRefresh is exposed as a trailing
      * optional param for the dedup-aware path.
      */
-    startJob(cli: LlmCli, args: string[], correlationId: string, cwd?: string, idleTimeoutMs?: number, outputFormat?: string, forceRefresh?: boolean, env?: Record<string, string>, onComplete?: () => void): AsyncJobSnapshot;
+    startJob(cli: LlmCli, args: string[], correlationId: string, cwd?: string, idleTimeoutMs?: number, outputFormat?: string, forceRefresh?: boolean, env?: Record<string, string>, onComplete?: () => void, flightRecorderEntry?: AsyncJobFlightRecorderEntry, extractUsage?: AsyncJobUsageExtractor, writeFlightStart?: boolean): AsyncJobSnapshot;
     /**
      * Start a job, with optional dedup against recent identical requests.
      * Returns `{ snapshot, deduped }` so callers can log/report the short-circuit.
package/dist/async-job-manager.js
CHANGED
@@ -3,6 +3,7 @@ import { envWithExtendedPath, getExtendedPath, killProcessGroup, spawnCliProcess
 import { noopLogger } from "./logger.js";
 import { ProcessMonitor } from "./process-monitor.js";
 import { computeRequestKey } from "./job-store.js";
+import { NoopFlightRecorder } from "./flight-recorder.js";
 const MAX_OUTPUT_SIZE = 50 * 1024 * 1024;
 const JOB_TTL_MS = 60 * 60 * 1000; // 1 hour in-memory retention; durable store has its own (longer) retention
 const EVICTION_INTERVAL_MS = 5 * 60 * 1000; // Check every 5 minutes
@@ -61,16 +62,40 @@ export class AsyncJobManager {
     evictionTimer = null;
     processMonitor;
     store;
-
+    flightRecorder;
+    constructor(logger = noopLogger, onJobComplete, store = null, flightRecorder = new NoopFlightRecorder()) {
         this.logger = logger;
         this.onJobComplete = onJobComplete;
         this.processMonitor = new ProcessMonitor(logger);
         this.store = store;
+        this.flightRecorder = flightRecorder;
         if (this.store) {
             try {
-                const orphaned = this.store.markOrphanedOnStartup();
-                if (
-                    this.logger.info(`Marked ${
+                const { count, orphaned } = this.store.markOrphanedOnStartup();
+                if (count > 0) {
+                    this.logger.info(`Marked ${count} in-flight job(s) as orphaned after gateway restart`);
+                }
+                // Slice 1.5: close out the FR row for each orphaned job. The FR
+                // logComplete UPDATE has WHERE status='started' so pre-1.7.0 rows
+                // (where the prior gateway never wrote a logStart) silently
+                // no-op. Wrapped per-orphan so a single bad row can't tank boot.
+                for (const orphan of orphaned) {
+                    try {
+                        const durationMs = Math.max(0, Date.now() - new Date(orphan.startedAt).getTime());
+                        this.flightRecorder.logComplete(orphan.correlationId, {
+                            response: orphan.stderr || orphan.stdout,
+                            durationMs,
+                            retryCount: 0,
+                            circuitBreakerState: "closed",
+                            optimizationApplied: false,
+                            exitCode: orphan.exitCode ?? 1,
+                            errorMessage: "orphaned after gateway restart",
+                            status: "failed",
+                        });
+                    }
+                    catch (err) {
+                        this.logger.error(`Async-path FR logComplete for orphaned job ${orphan.id} failed`, err);
+                    }
                 }
             }
             catch (err) {
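The constructor hunk above closes out each orphaned job inside its own `try`/`catch`, so that one failing write does not stop the rest. The sketch below isolates that loop as a standalone function. `closeOutOrphans`, `OrphanedJob`, and `OrphanRow` are hypothetical names; the field list follows the changelog's description of the `markOrphanedOnStartup()` return shape, and `startedAt` is assumed to be an ISO-8601 string. The injected `write` callback stands in for the recorder's `logComplete`.

```typescript
// Field list follows the changelog; startedAt is assumed to be an ISO-8601 string.
interface OrphanedJob {
  id: string;
  correlationId: string;
  startedAt: string;
  stdout: string;
  stderr: string;
  exitCode: number | null;
}
interface OrphanRow {
  response: string;
  durationMs: number;
  exitCode: number;
  errorMessage: string;
  status: "failed";
}

// Hypothetical helper: `write` stands in for the recorder's logComplete.
function closeOutOrphans(
  orphaned: OrphanedJob[],
  write: (correlationId: string, row: OrphanRow) => void,
  nowMs: number = Date.now(),
): { written: number; failed: string[] } {
  let written = 0;
  const failed: string[] = [];
  for (const orphan of orphaned) {
    try {
      write(orphan.correlationId, {
        response: orphan.stderr || orphan.stdout,
        // Clamp at zero so a startedAt in the future cannot yield a negative duration.
        durationMs: Math.max(0, nowMs - new Date(orphan.startedAt).getTime()),
        exitCode: orphan.exitCode ?? 1,
        errorMessage: "orphaned after gateway restart",
        status: "failed",
      });
      written += 1;
    } catch {
      failed.push(orphan.id); // one bad row does not stop the loop
    }
  }
  return { written, failed };
}
```

Passing `nowMs` explicitly is a choice made here for testability; the shipped constructor calls `Date.now()` inline.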
@@ -129,6 +154,7 @@ export class AsyncJobManager {
                         this.logger.error(`Job ${id} process ${job.process.pid} no longer exists, marking as failed`);
                         this.emitMetrics(job);
                         this.persistComplete(job);
+                        this.writeFlightComplete(job, "failed");
                         this.fireOnComplete(job);
                     }
                     // EPERM: process exists but we can't signal it — ignore
@@ -144,6 +170,7 @@ export class AsyncJobManager {
                 this.logger.error(`Job ${id} has exited flag but was still in running state, marking as failed`);
                 this.emitMetrics(job);
                 this.persistComplete(job);
+                this.writeFlightComplete(job, "failed");
                 this.fireOnComplete(job);
             }
         }
@@ -196,6 +223,96 @@ export class AsyncJobManager {
             this.logger.error(`Job ${job.id} onComplete hook threw`, err);
         }
     }
+    /**
+     * Slice 1.5: write the terminal flight-recorder row. Mirrors sync-path
+     * failure semantics (response = stderr||stdout on failure, errorMessage
+     * falls back through overrideErrorMessage → job.error → job.stderr →
+     * "Exit code N"). Single-shot guard set only on SUCCESSFUL write so a
+     * thrown logComplete can be retried by a later terminal callback; the
+     * FR's WHERE status='started' UPDATE guard remains the actual
+     * idempotency mechanism for the common "retry succeeds, original
+     * succeeded too" case.
+     */
+    writeFlightComplete(job, finalStatus, overrideErrorMessage) {
+        if (!job.flightRecorderEntry)
+            return; // never opted in
+        // R2 Codex-Unit-B F1: only write when armed. Sync-inline requests are
+        // NOT armed at startJob — the sync handler owns the rich-metadata
+        // safeFlightComplete write. Pure async + sync-deferred ARE armed.
+        if (!job.flightCompleteArmed)
+            return;
+        if (job.flightRecorderComplete)
+            return; // already wrote successfully
+        const durationMs = Math.max(0, Date.now() - new Date(job.startedAt).getTime());
+        const usage = finalStatus === "completed" && job.extractUsage ? this.safeExtractUsage(job) : {};
+        const isFailure = finalStatus === "failed";
+        const response = isFailure ? job.stderr || job.stdout : job.stdout;
+        const exitCode = job.exitCode ?? (finalStatus === "completed" ? 0 : 1);
+        const errorMessage = isFailure
+            ? (overrideErrorMessage ?? job.error ?? job.stderr ?? `Exit code ${exitCode}`)
+            : undefined;
+        try {
+            this.flightRecorder.logComplete(job.correlationId, {
+                response,
+                durationMs,
+                retryCount: 0,
+                circuitBreakerState: "closed",
+                optimizationApplied: false,
+                exitCode,
+                errorMessage,
+                status: finalStatus,
+                inputTokens: usage.inputTokens,
+                outputTokens: usage.outputTokens,
+                cacheReadTokens: usage.cacheReadTokens,
+                cacheCreationTokens: usage.cacheCreationTokens,
+                costUsd: usage.costUsd,
+            });
+            // Only mark complete on successful write so a thrown logComplete
+            // can be retried by the next terminal callback.
+            job.flightRecorderComplete = true;
+            // Clear retained references so the GC can reclaim anything the
+            // extractUsage closure captured.
+            job.flightRecorderEntry = undefined;
+            job.extractUsage = undefined;
+        }
+        catch (err) {
+            this.logger.error("Async-path flight recorder logComplete failed", err);
+        }
+    }
+    safeExtractUsage(job) {
+        try {
+            return job.extractUsage?.(job.stdout) ?? {};
+        }
+        catch (err) {
+            this.logger.error(`Job ${job.id} extractUsage threw`, err);
+            return {};
+        }
+    }
+    /**
+     * R2 Codex-Unit-B F1: awaitJobOrDefer calls this when returning a
+     * deferred response. From this point on the sync handler will not write
+     * its own safeFlightComplete, so the manager takes over.
+     *
+     * Race mitigation: if the job already terminated between the sync
+     * deadline expiring and this method firing, write logComplete
+     * synchronously here so the previously-skipped terminal callback's
+     * write isn't lost.
+     */
+    armFlightCompleteForDeferral(jobId) {
+        const job = this.jobs.get(jobId);
+        if (!job)
+            return;
+        if (job.flightCompleteArmed)
+            return; // pure async already armed
+        job.flightCompleteArmed = true;
+        if (job.status === "running")
+            return;
+        // Job already terminal — the close handler's writeFlightComplete
+        // saw flightCompleteArmed=false and skipped. Write now to recover.
+        const finalStatus = job.status === "completed" ? "completed" : "failed";
+        const override = job.canceled ? "canceled by caller" : undefined;
+        this.writeFlightComplete(job, finalStatus, override);
+    }
     safeStoreCall(label, fn) {
         if (!this.store)
             return;
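The `armFlightCompleteForDeferral` method above covers a race: a job can reach a terminal state before the sync path has armed the manager, in which case the close handler's write is skipped and has to be recovered at arm time. The sketch below is a reduced state model of that ordering, not the manager's code. `MiniJob`, `recordedRows`, `writeTerminal`, `onTerminal`, and `arm` are hypothetical names, and cancellation is modelled as a status value here where the shipped code checks a separate `job.canceled` flag.

```typescript
type JobStatus = "running" | "completed" | "failed" | "canceled" | "orphaned";

interface MiniJob {
  status: JobStatus;
  armed: boolean;   // counterpart of flightCompleteArmed
  written: boolean; // counterpart of flightRecorderComplete
}

// Collects terminal rows in place of a real flight recorder.
const recordedRows: Array<{ status: "completed" | "failed"; errorMessage?: string }> = [];

function writeTerminal(job: MiniJob): void {
  if (!job.armed || job.written) return;
  // The recorder only knows completed/failed, so other terminal states map to failed.
  const status = job.status === "completed" ? "completed" : "failed";
  const errorMessage = job.status === "canceled" ? "canceled by caller" : undefined;
  recordedRows.push({ status, errorMessage });
  job.written = true;
}

// Stands in for the process close handler.
function onTerminal(job: MiniJob, final: Exclude<JobStatus, "running">): void {
  job.status = final;
  writeTerminal(job); // no-op while the job is not armed
}

// Stands in for the call made when the sync path decides to defer.
function arm(job: MiniJob): void {
  if (job.armed) return;
  job.armed = true;
  if (job.status !== "running") writeTerminal(job); // recover the skipped write
}
```

Both orderings end with exactly one row: if `onTerminal` runs first, `arm` performs the write; if `arm` runs first, `onTerminal` does. The model also shows the cross-table asymmetry noted in the changelog, since a canceled job surfaces as `"failed"` with a distinguishing message.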
@@ -300,7 +417,7 @@ export class AsyncJobManager {
      * Existing callers keep working unchanged; forceRefresh is exposed as a trailing
      * optional param for the dedup-aware path.
      */
-    startJob(cli, args, correlationId, cwd, idleTimeoutMs, outputFormat, forceRefresh, env, onComplete) {
+    startJob(cli, args, correlationId, cwd, idleTimeoutMs, outputFormat, forceRefresh, env, onComplete, flightRecorderEntry, extractUsage, writeFlightStart) {
         return this.startJobWithDedup(cli, args, correlationId, {
             cwd,
             idleTimeoutMs,
@@ -308,6 +425,9 @@ export class AsyncJobManager {
             forceRefresh,
             env,
             onComplete,
+            flightRecorderEntry,
+            extractUsage,
+            writeFlightStart,
         }).snapshot;
     }
     /**
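The two hunks above extend the positional `startJob` with trailing optional parameters and forward them into the options object that `startJobWithDedup` takes. A reduced sketch of that wrapper pattern follows. The `Options` type and both function names are hypothetical and carry far fewer fields than the real `StartJobOptions`.

```typescript
// Hypothetical, trimmed-down option bag; the real StartJobOptions has more fields.
interface Options {
  cwd?: string;
  onComplete?: () => void;
  writeFlightStart?: boolean;
}

// Stands in for the options-object entry point.
function startWithOptions(cli: string, opts: Options = {}): { cli: string; armed: boolean } {
  // Strict comparison with true: an omitted (undefined) flag counts as not armed.
  return { cli, armed: opts.writeFlightStart === true };
}

// Positional wrapper: new parameters are appended and optional, so call
// sites that pass fewer arguments keep compiling and keep their behaviour.
function startPositional(
  cli: string,
  cwd?: string,
  onComplete?: () => void,
  writeFlightStart?: boolean,
): { cli: string; armed: boolean } {
  return startWithOptions(cli, { cwd, onComplete, writeFlightStart });
}
```

A caller written before the extra parameter existed, such as `startPositional("codex", "/tmp")`, still type-checks and gets `armed: false`; only callers that pass `true` in the new trailing position opt in.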
@@ -319,7 +439,7 @@ export class AsyncJobManager {
  * is returned without spawning a new process. forceRefresh skips dedup entirely.
  */
 startJobWithDedup(cli, args, correlationId, opts = {}) {
-    const { cwd, idleTimeoutMs, outputFormat, forceRefresh, env: extraEnv, onComplete } = opts;
+    const { cwd, idleTimeoutMs, outputFormat, forceRefresh, env: extraEnv, onComplete, flightRecorderEntry, extractUsage, writeFlightStart, } = opts;
     const requestKey = this.buildRequestKey(cli, args, extraEnv);
     if (!forceRefresh && this.store) {
         try {
@@ -405,6 +525,14 @@ export class AsyncJobManager {
     onCompleteFired: false,
     outputDirty: false,
     lastOutputFlushAt: Date.now(),
+    flightRecorderEntry,
+    extractUsage,
+    flightRecorderComplete: false,
+    // R2 Codex-Unit-B F1: pure async path arms now (writeFlightStart=true
+    // means the manager is the only FR writer). Sync-deferred path
+    // arrives with writeFlightStart=false and arms later via
+    // armFlightCompleteForDeferral when awaitJobOrDefer decides to defer.
+    flightCompleteArmed: writeFlightStart === true,
 };
 this.jobs.set(id, job);
 this.safeStoreCall("recordStart", () => this.store.recordStart({
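The hunks above thread three trailing optional parameters from the positional `startJob(...)` into the options object consumed by `startJobWithDedup(...)`. A minimal sketch of why nine-argument callers keep working, assuming trimmed-down stand-ins for both methods (the snapshot fields `hasFlightEntry` and `hasExtractUsage` are hypothetical and exist only for illustration):

```javascript
// Hypothetical, trimmed-down stand-ins for the two entry points in the diff.
function startJobWithDedup(cli, args, correlationId, opts = {}) {
    const { flightRecorderEntry, extractUsage, writeFlightStart } = opts;
    return {
        snapshot: {
            cli,
            correlationId,
            hasFlightEntry: flightRecorderEntry !== undefined,
            hasExtractUsage: typeof extractUsage === "function",
            // Mirrors `flightCompleteArmed: writeFlightStart === true` from the diff:
            // an omitted trailing argument is undefined, so the job starts unarmed.
            flightCompleteArmed: writeFlightStart === true,
        },
        deduped: false,
    };
}

function startJob(cli, args, correlationId, cwd, idleTimeoutMs, outputFormat, forceRefresh, env, onComplete, flightRecorderEntry, extractUsage, writeFlightStart) {
    return startJobWithDedup(cli, args, correlationId, {
        cwd, idleTimeoutMs, outputFormat, forceRefresh, env, onComplete,
        flightRecorderEntry, extractUsage, writeFlightStart,
    }).snapshot;
}

// Legacy call shape: nine positional arguments, nothing flight-recorder related.
const legacy = startJob("codex", ["exec"], "corr-1", undefined, 30000, "json", false, undefined, undefined);

// Pure async call shape: the three new trailing arguments are supplied.
const pureAsync = startJob("codex", ["exec"], "corr-2", undefined, 30000, "json", false, undefined, undefined,
    { model: "default", prompt: "hi" }, () => ({}), true);
```

In this sketch the legacy call ends up with `flightCompleteArmed` false and no flight-recorder entry, which matches the sync-deferred behaviour described in the added comments.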
@@ -417,6 +545,27 @@ export class AsyncJobManager {
     startedAt,
     pid: child.pid ?? null,
 }));
+// Slice 1.5: only opt-in callers (pure async handlers) write logStart
+// here. The sync-deferred path passes writeFlightStart=false because
+// the upstream sync handler already wrote a logStart row keyed on the
+// same correlationId; a duplicate INSERT would crash on the PK.
+if (writeFlightStart && flightRecorderEntry) {
+    try {
+        this.flightRecorder.logStart({
+            correlationId,
+            cli,
+            model: flightRecorderEntry.model,
+            prompt: flightRecorderEntry.prompt,
+            sessionId: flightRecorderEntry.sessionId,
+            asyncJobId: id,
+            stablePrefixHash: flightRecorderEntry.stablePrefixHash,
+            stablePrefixTokens: flightRecorderEntry.stablePrefixTokens,
+        });
+    }
+    catch (err) {
+        this.logger.error("Async-path flight recorder logStart failed", err);
+    }
+}
 this.logger.info(`Job ${id} started for ${cli}`, { correlationId });
 // Idle timeout: kill process if no output activity for idleTimeoutMs
 let idleTimerId;
@@ -439,6 +588,7 @@ export class AsyncJobManager {
 });
 this.emitMetrics(job);
 this.persistComplete(job);
+this.writeFlightComplete(job, "failed");
 this.fireOnComplete(job);
 setTimeout(() => {
     if (!job.exited && job.process)
@@ -473,6 +623,7 @@ export class AsyncJobManager {
         this.logger.error(`Job ${id} error: ${launchError.message}`, { correlationId });
         this.emitMetrics(job);
         this.persistComplete(job);
+        this.writeFlightComplete(job, "failed");
         this.fireOnComplete(job);
     }
 });
@@ -490,6 +641,12 @@ export class AsyncJobManager {
     }
     // Ensure terminal state reaches the durable store (idle-timeout/output-overflow already persisted).
     this.persistComplete(job);
+    // Slice 1.5: retry the FR complete write iff the earlier terminal
+    // callback's logComplete threw. The single-shot guard in
+    // writeFlightComplete makes this a no-op in the common case.
+    const fallbackFlightStatus = job.status === "completed" ? "completed" : "failed";
+    const fallbackOverride = job.status === "canceled" ? "canceled by caller" : undefined;
+    this.writeFlightComplete(job, fallbackFlightStatus, fallbackOverride);
     this.fireOnComplete(job);
     return;
 }
@@ -512,6 +669,7 @@ export class AsyncJobManager {
     }
     this.emitMetrics(job);
     this.persistComplete(job);
+    this.writeFlightComplete(job, job.status === "completed" ? "completed" : "failed", job.status === "canceled" ? "canceled by caller" : undefined);
     this.fireOnComplete(job);
 });
 return { snapshot: this.snapshot(job), deduped: false };
@@ -567,6 +725,7 @@ export class AsyncJobManager {
 killProcessGroup(job.process, "SIGTERM");
 this.logger.info(`Job ${jobId} canceled`, { correlationId: job.correlationId });
 this.persistComplete(job);
+this.writeFlightComplete(job, "failed", "canceled by caller");
 this.fireOnComplete(job);
 setTimeout(() => {
     if (!job.exited && job.process)
@@ -639,6 +798,7 @@ export class AsyncJobManager {
 });
 this.emitMetrics(job);
 this.persistComplete(job);
+this.writeFlightComplete(job, "failed", "Output exceeded maximum size (50MB)");
 this.fireOnComplete(job);
 setTimeout(() => {
     if (!job.exited && job.process)
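The async-job-manager hunks above call `writeFlightComplete` from every terminal path and rely on two flags, `flightCompleteArmed` and `flightRecorderComplete`, so that exactly one completion row is written. The body of `writeFlightComplete` is not part of this diff excerpt, so the following is only a sketch of one way the arm-then-write-once behaviour described in the comments could work. `makeRecorder`, the function signatures, and the payload shape are all assumptions.

```javascript
// Hypothetical in-memory recorder; the real FlightRecorderLike interface is not shown in the diff.
function makeRecorder() {
    const rows = [];
    return {
        rows,
        logComplete(correlationId, result) {
            rows.push({ correlationId, ...result });
        },
    };
}

// Assumed shape of the single-shot guard: skip until armed, never write twice.
function writeFlightComplete(recorder, job, status, errorOverride) {
    if (!job.flightCompleteArmed || job.flightRecorderComplete)
        return false;
    recorder.logComplete(job.correlationId, { status, error: errorOverride });
    // Set only after a successful write, so a throwing logComplete leaves room for a retry.
    job.flightRecorderComplete = true;
    return true;
}

// Follows the armFlightCompleteForDeferral logic in the diff, using job.status for the override.
function armFlightCompleteForDeferral(recorder, job) {
    if (job.flightCompleteArmed)
        return;
    job.flightCompleteArmed = true;
    if (job.status === "running")
        return; // the terminal callback will write later
    // The job finished before it was armed: recover the skipped write now.
    const finalStatus = job.status === "completed" ? "completed" : "failed";
    const override = job.status === "canceled" ? "canceled by caller" : undefined;
    writeFlightComplete(recorder, job, finalStatus, override);
}

const recorder = makeRecorder();
const raceJob = { correlationId: "corr-race", status: "completed", flightCompleteArmed: false, flightRecorderComplete: false };
const skipped = writeFlightComplete(recorder, raceJob, "completed"); // terminal callback ran before arming
armFlightCompleteForDeferral(recorder, raceJob);                     // deferral arms and recovers the write
const repeated = writeFlightComplete(recorder, raceJob, "completed"); // fallback path is a no-op
```

The point of the design, as far as the comments indicate, is that the sync handler keeps ownership of the completion row until it decides to defer, and the manager takes over only from that moment.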
package/dist/codex-json-parser.js
CHANGED
@@ -47,7 +47,10 @@ export function parseCodexJsonStream(stdout) {
     input_tokens: typeof u.input_tokens === "number" ? u.input_tokens : 0,
     output_tokens: typeof u.output_tokens === "number" ? u.output_tokens : 0,
 };
-if (typeof u.
+if (typeof u.cached_input_tokens === "number") {
+    usage.cache_read_tokens = u.cached_input_tokens;
+}
+else if (typeof u.cache_read_input_tokens === "number") {
     usage.cache_read_tokens = u.cache_read_input_tokens;
 }
 else if (typeof u.cache_read_tokens === "number") {
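The parser change above checks `cached_input_tokens` first and falls back to the two older field names. A compact sketch of the same precedence, assuming a standalone helper (`readCacheReadTokens` is a hypothetical name, not part of the package):

```javascript
// Field names come from the hunk above; the first numeric match wins.
const CACHE_READ_FIELDS = ["cached_input_tokens", "cache_read_input_tokens", "cache_read_tokens"];

function readCacheReadTokens(u) {
    for (const key of CACHE_READ_FIELDS) {
        // typeof check (not truthiness) so a legitimate 0 is kept.
        if (typeof u[key] === "number")
            return u[key];
    }
    return undefined; // no cache information reported
}

const current = readCacheReadTokens({ input_tokens: 100, cached_input_tokens: 64, cache_read_tokens: 9 });
const legacyField = readCacheReadTokens({ input_tokens: 100, cache_read_input_tokens: 32 });
const zero = readCacheReadTokens({ cache_read_tokens: 0 });
const missing = readCacheReadTokens({ input_tokens: 100 });
```

Testing with `typeof ... === "number"` rather than truthiness matters here because a cache miss is reported as `0`, which should still be recorded.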
package/dist/index.js
CHANGED
@@ -17,7 +17,7 @@ import { estimateTokens, optimizePrompt as optimizePromptText, optimizeResponse
 import { loadConfig, loadPersistenceConfig, loadCacheAwarenessConfig, } from "./config.js";
 import { checkHealth } from "./health.js";
 import { clearModelRegistryCache, getAvailableCliInfo, getCliInfo, resolveModelAlias, } from "./model-registry.js";
-import { AsyncJobManager } from "./async-job-manager.js";
+import { AsyncJobManager, } from "./async-job-manager.js";
 import { createJobStore } from "./job-store.js";
 import { ApprovalManager } from "./approval-manager.js";
 import { checkReviewIntegrity } from "./review-integrity.js";
@@ -213,10 +213,10 @@ function getJobStore(runtimeLogger = logger) {
     }
     return jobStore;
 }
-function newAsyncJobManager(metrics, runtimeLogger, store = getJobStore(runtimeLogger)) {
+function newAsyncJobManager(metrics, runtimeLogger, store = getJobStore(runtimeLogger), fr = getFlightRecorder(runtimeLogger)) {
     return new AsyncJobManager(runtimeLogger, (cli, durationMs, success) => {
         metrics.recordRequest(cli, durationMs, success);
-    }, store);
+    }, store, fr);
 }
 function getAsyncJobManager(runtimeLogger = logger) {
     asyncJobManager ??= newAsyncJobManager(performanceMetrics, runtimeLogger);
@@ -239,17 +239,19 @@ function resolveGatewayServerRuntime(deps = {}, options = {}) {
 const runtimeSessionManager = deps.sessionManager ?? sessionManager;
 const runtimePerformanceMetrics = deps.performanceMetrics ??
     (options.isolateState ? new PerformanceMetrics() : performanceMetrics);
+// Resolve flight recorder BEFORE async manager so isolateState managers
+// can be wired with the same recorder instance the runtime exposes.
+const runtimeFlightRecorder = deps.flightRecorder ?? getFlightRecorder(runtimeLogger);
 const runtimeAsyncJobManager = deps.asyncJobManager ??
     (options.isolateState
         ? // Factory-created test/HTTP session servers must not mark another instance's
             // durable jobs orphaned. Stdio startup injects the process-global manager.
-            newAsyncJobManager(runtimePerformanceMetrics, runtimeLogger, null)
+            newAsyncJobManager(runtimePerformanceMetrics, runtimeLogger, null, runtimeFlightRecorder)
         : getAsyncJobManager(runtimeLogger));
 const runtimeApprovalManager = deps.approvalManager ??
     (options.isolateState
         ? new ApprovalManager(undefined, runtimeLogger)
         : getApprovalManager(runtimeLogger));
-const runtimeFlightRecorder = deps.flightRecorder ?? getFlightRecorder(runtimeLogger);
 return {
     sessionManager: runtimeSessionManager,
     resourceProvider: deps.resourceProvider ??
@@ -286,7 +288,16 @@ const SYNC_POLL_INTERVAL_MS = 1_000;
  * Start an async job and poll until completion or deadline.
  * Returns the job result if it finishes in time, or a deferral marker.
  */
-async function awaitJobOrDefer(cli, args, corrId, idleTimeoutMs, outputFormat, forceRefresh, runtime = resolveGatewayServerRuntime(), env, onComplete
+async function awaitJobOrDefer(cli, args, corrId, idleTimeoutMs, outputFormat, forceRefresh, runtime = resolveGatewayServerRuntime(), env, onComplete,
+/**
+ * Slice 1.5: when the sync handler has already written a logStart row
+ * keyed on `corrId`, pass these so the manager can write logComplete
+ * (with usage extraction) when the underlying async job terminates —
+ * even if the sync handler returned a deferred response.
+ * `writeFlightStart` is NEVER true on this path: the sync handler is
+ * always the upstream logStart writer.
+ */
+flightRecorderEntry, extractUsage) {
     // U26 fix: ownership of onComplete is a contract. Once this function returns
     // OR throws, the caller MUST consider onComplete consumed — i.e. it has
     // either been run, or the AsyncJobManager has taken ownership of it. The
@@ -336,6 +347,13 @@ async function awaitJobOrDefer(cli, args, corrId, idleTimeoutMs, outputFormat, f
     forceRefresh,
     env,
     onComplete,
+    // Sync-deferred path: the upstream sync handler already wrote
+    // logStart for this corrId, so writeFlightStart stays false. The
+    // manager still writes logComplete on terminal state (which UPDATEs
+    // the sync handler's row), closing the previously-orphaned
+    // sync-deferred case.
+    flightRecorderEntry,
+    extractUsage,
 });
 // Handoff succeeded: AsyncJobManager owns onComplete (it'll fire via
 // fireOnComplete on terminal status, or run inline immediately for dedup).
@@ -369,7 +387,14 @@ async function awaitJobOrDefer(cli, args, corrId, idleTimeoutMs, outputFormat, f
     }
     await new Promise(resolve => setTimeout(resolve, SYNC_POLL_INTERVAL_MS));
 }
-// Deadline exceeded — return deferral
+// Deadline exceeded — return deferral.
+// R2 Codex-Unit-B F1: hand FR-complete ownership to the manager. Until
+// this call, the manager skips writeFlightComplete on terminal so the
+// sync handler's safeFlightComplete (with rich approvalDecision /
+// optimizationApplied metadata) wins for sync-inline completions. From
+// here on the sync handler returns deferred and will NOT write
+// safeFlightComplete, so the manager must.
+runtime.asyncJobManager.armFlightCompleteForDeferral(job.id);
 runtime.logger.info(`[${corrId}] ${cli} sync deadline exceeded (${SYNC_DEADLINE_MS}ms), deferring to async job ${job.id}`);
 return {
     deferred: true,
@@ -495,6 +520,30 @@ function extractUsageAndCost(cli, output, outputFormat) {
     // once we resolve the session id post-run.
     return {};
 }
+/**
+ * Slice 1.5: build the async-job-manager's FR payload from a prep object
+ * (which every prepare*Request returns), plus the bound CLI and output
+ * format primitives needed by extractUsageAndCost. Returning the closure
+ * separately means it captures `cliName` and `fmt` ONLY — never `params`
+ * or `prep` — so retention on AsyncJobRecord is O(constant).
+ */
+function buildAsyncFlightRecorderHandoff(cliName, prep, sessionId, outputFormat) {
+    // Extract primitives BEFORE building the closure — capturing `prep` or
+    // `params` directly would pin large attachments / promptParts on the
+    // AsyncJobRecord for JOB_TTL_MS.
+    const cli = cliName;
+    const fmt = outputFormat;
+    return {
+        flightRecorderEntry: {
+            model: prep.resolvedModel || "default",
+            prompt: prep.effectivePrompt,
+            sessionId,
+            stablePrefixHash: prep.stablePrefixHash ?? undefined,
+            stablePrefixTokens: prep.stablePrefixTokens ?? undefined,
+        },
+        extractUsage: (stdout) => extractUsageAndCost(cli, stdout, fmt),
+    };
+}
 function safeFlightStart(entry, runtime = resolveGatewayServerRuntime()) {
     try {
         runtime.flightRecorder.logStart(entry);
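The handoff builder above returns a plain entry object plus an `extractUsage` closure that only references two primitives. A small usage sketch, assuming the usage extractor is injected as a parameter so the example runs on its own (`buildHandoffSketch` and `stubExtractUsageAndCost` are hypothetical names; the real function calls the module-level `extractUsageAndCost` directly):

```javascript
// Stub standing in for extractUsageAndCost; it only echoes what it was given.
function stubExtractUsageAndCost(cli, stdout, fmt) {
    return { cli, fmt, bytes: stdout.length };
}

// Same shape as buildAsyncFlightRecorderHandoff, with the extractor passed in (an assumption).
function buildHandoffSketch(cliName, prep, sessionId, outputFormat, extractUsageAndCost) {
    // Copy the primitives first so the closure below never mentions `prep`.
    const cli = cliName;
    const fmt = outputFormat;
    return {
        flightRecorderEntry: {
            model: prep.resolvedModel || "default",
            prompt: prep.effectivePrompt,
            sessionId,
            stablePrefixHash: prep.stablePrefixHash ?? undefined,
            stablePrefixTokens: prep.stablePrefixTokens ?? undefined,
        },
        extractUsage: (stdout) => extractUsageAndCost(cli, stdout, fmt),
    };
}

const prep = { resolvedModel: "", effectivePrompt: "summarize this", stablePrefixHash: null, attachments: ["large blob"] };
const handoff = buildHandoffSketch("gemini", prep, "session-1", "json", stubExtractUsageAndCost);
const usage = handoff.extractUsage("abcd");
```

Note that an empty `resolvedModel` falls back to `"default"` because the source uses `||`, while a `null` `stablePrefixHash` becomes `undefined` because it uses `??`.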
@@ -1542,7 +1591,8 @@ export async function handleGeminiRequest(deps, params) {
 args.push(...sessionPlan.args);
 const userProvidedSession = sessionPlan.resumed;
 const effectiveSessionIdHint = sessionPlan.resumed ? params.sessionId : undefined;
-const
+const geminiFrHandoff = buildAsyncFlightRecorderHandoff("gemini", prep, params.sessionId, params.outputFormat);
+const result = await awaitJobOrDefer("gemini", args, corrId, resolveIdleTimeout("gemini", params.idleTimeoutMs), params.outputFormat, params.forceRefresh, runtime, undefined, undefined, geminiFrHandoff.flightRecorderEntry, geminiFrHandoff.extractUsage);
 // Deferred — job still running, return async reference
 if (isDeferredResponse(result)) {
     return buildDeferredToolResponse(result, effectiveSessionIdHint);
@@ -1675,7 +1725,10 @@ export async function handleGeminiRequestAsync(deps, params) {
 // surfaces it in the snapshot).
 assertUpstreamCliArgs("gemini", args);
 assertUpstreamCliEnv("gemini", undefined);
-
+// Slice 1.5: pure async path — no upstream safeFlightStart, so the
+// manager owns both logStart and logComplete for this corrId.
+const geminiAsyncFrHandoff = buildAsyncFlightRecorderHandoff("gemini", prep, effectiveSessionId, params.outputFormat);
+const job = deps.asyncJobManager.startJob("gemini", args, corrId, undefined, resolveIdleTimeout("gemini", params.idleTimeoutMs), params.outputFormat, params.forceRefresh, undefined, undefined, geminiAsyncFrHandoff.flightRecorderEntry, geminiAsyncFrHandoff.extractUsage, true);
 deps.logger.info(`[${corrId}] gemini_request_async started job ${job.id}`);
 const asyncResponse = {
     success: true,
@@ -1745,7 +1798,8 @@ export async function handleGrokRequest(deps, params) {
     createNewSession: params.createNewSession,
 });
 args.push(...sessionResult.resumeArgs);
-const
+const grokFrHandoff = buildAsyncFlightRecorderHandoff("grok", prep, params.sessionId, params.outputFormat);
+const result = await awaitJobOrDefer("grok", args, corrId, resolveIdleTimeout("grok", params.idleTimeoutMs), params.outputFormat, params.forceRefresh, runtime, undefined, undefined, grokFrHandoff.flightRecorderEntry, grokFrHandoff.extractUsage);
 // Deferred — job still running, return async reference
 if (isDeferredResponse(result)) {
     return buildDeferredToolResponse(result, sessionResult.effectiveSessionId);
@@ -1875,7 +1929,8 @@ export async function handleGrokRequestAsync(deps, params) {
 // Start job only after all session I/O succeeds
 assertUpstreamCliArgs("grok", args);
 assertUpstreamCliEnv("grok", undefined);
-const
+const grokAsyncFrHandoff = buildAsyncFlightRecorderHandoff("grok", prep, effectiveSessionId, params.outputFormat);
+const job = deps.asyncJobManager.startJob("grok", args, corrId, undefined, resolveIdleTimeout("grok", params.idleTimeoutMs), params.outputFormat, params.forceRefresh, undefined, undefined, grokAsyncFrHandoff.flightRecorderEntry, grokAsyncFrHandoff.extractUsage, true);
 deps.logger.info(`[${corrId}] grok_request_async started job ${job.id}`);
 const asyncResponse = {
     success: true,
@@ -1943,7 +1998,8 @@ export async function handleMistralRequest(deps, params) {
     createNewSession: params.createNewSession,
 });
 args.push(...sessionResult.resumeArgs);
-
+const mistralFrHandoff = buildAsyncFlightRecorderHandoff("mistral", prep, params.sessionId, params.outputFormat);
+let result = await awaitJobOrDefer("mistral", args, corrId, resolveIdleTimeout("mistral", params.idleTimeoutMs), params.outputFormat, params.forceRefresh, runtime, mistralEnv, undefined, mistralFrHandoff.flightRecorderEntry, mistralFrHandoff.extractUsage);
 if (isDeferredResponse(result)) {
     return buildDeferredToolResponse(result, sessionResult.effectiveSessionId);
 }
@@ -1964,7 +2020,9 @@ export async function handleMistralRequest(deps, params) {
     disallowedTools: params.disallowedTools,
 });
 const retryArgs = [...retryPrep.args, ...sessionResult.resumeArgs];
-
+// Reuse the FR handoff built above — the retry preserves corrId,
+// so the manager's logComplete still updates the original row.
+result = await awaitJobOrDefer("mistral", retryArgs, corrId, resolveIdleTimeout("mistral", params.idleTimeoutMs), params.outputFormat, true, runtime, retryPrep.env, undefined, mistralFrHandoff.flightRecorderEntry, mistralFrHandoff.extractUsage);
 if (isDeferredResponse(result)) {
     return buildDeferredToolResponse(result, sessionResult.effectiveSessionId);
 }
@@ -2092,7 +2150,8 @@ export async function handleMistralRequestAsync(deps, params) {
 }
 assertUpstreamCliArgs("mistral", args);
 assertUpstreamCliEnv("mistral", mistralEnv);
-const
+const mistralAsyncFrHandoff = buildAsyncFlightRecorderHandoff("mistral", prep, effectiveSessionId, params.outputFormat);
+const job = deps.asyncJobManager.startJob("mistral", args, corrId, undefined, resolveIdleTimeout("mistral", params.idleTimeoutMs), params.outputFormat, params.forceRefresh, mistralEnv, undefined, mistralAsyncFrHandoff.flightRecorderEntry, mistralAsyncFrHandoff.extractUsage, true);
 deps.logger.info(`[${corrId}] mistral_request_async started job ${job.id}`);
 const asyncResponse = {
     success: true,
@@ -2193,9 +2252,10 @@ export async function handleCodexRequestAsync(deps, params) {
 // registering the record, ownership stays here and we run it in the catch.
 assertUpstreamCliArgs("codex", args);
 assertUpstreamCliEnv("codex", undefined);
+const codexAsyncFrHandoff = buildAsyncFlightRecorderHandoff("codex", prep, effectiveSessionId, params.outputFormat);
 let job;
 try {
-    job = deps.asyncJobManager.startJob("codex", args, corrId, undefined, resolveIdleTimeout("codex", params.idleTimeoutMs), params.outputFormat, params.forceRefresh, undefined, prepCleanup);
+    job = deps.asyncJobManager.startJob("codex", args, corrId, undefined, resolveIdleTimeout("codex", params.idleTimeoutMs), params.outputFormat, params.forceRefresh, undefined, prepCleanup, codexAsyncFrHandoff.flightRecorderEntry, codexAsyncFrHandoff.extractUsage, true);
     // Handoff succeeded: AsyncJobManager will fire prepCleanup on terminal
     // status. Release our local ownership claim so the catch path doesn't
     // double-fire.
@@ -2461,7 +2521,8 @@ export function createGatewayServer(deps = {}) {
 }
 // Idle timeout only for stream-json (text/json produce no output until done)
 const effectiveIdleTimeout = outputFormat === "stream-json" ? resolveIdleTimeout("claude", idleTimeoutMs) : undefined;
-const
+const claudeSyncFrHandoff = buildAsyncFlightRecorderHandoff("claude", prep, effectiveSessionId, outputFormat);
+const result = await awaitJobOrDefer("claude", args, corrId, effectiveIdleTimeout, outputFormat, forceRefresh, runtime, undefined, undefined, claudeSyncFrHandoff.flightRecorderEntry, claudeSyncFrHandoff.extractUsage);
 // Deferred — job still running, return async reference
 if (isDeferredResponse(result)) {
     return buildDeferredToolResponse(result, effectiveSessionId);
@@ -2703,7 +2764,8 @@ export function createGatewayServer(deps = {}) {
 // completion or deferred). The outer finally MUST NOT clean again.
 const prepCleanup = "cleanup" in prep && typeof prep.cleanup === "function" ? prep.cleanup : undefined;
 try {
-    const
+    const codexSyncFrHandoff = buildAsyncFlightRecorderHandoff("codex", prep, sessionId, outputFormat);
+    const result = await awaitJobOrDefer("codex", args, corrId, resolveIdleTimeout("codex", idleTimeoutMs), outputFormat, forceRefresh, runtime, undefined, prepCleanup, codexSyncFrHandoff.flightRecorderEntry, codexSyncFrHandoff.extractUsage);
     // Deferred — job still running, return async reference. Cleanup
     // ownership belongs to AsyncJobManager via onComplete.
     if (isDeferredResponse(result)) {
@@ -3344,7 +3406,8 @@ export function createGatewayServer(deps = {}) {
     : undefined;
 assertUpstreamCliArgs("claude", args);
 assertUpstreamCliEnv("claude", undefined);
-const
+const claudeAsyncFrHandoff = buildAsyncFlightRecorderHandoff("claude", prep, effectiveSessionId, outputFormat);
+const job = asyncJobManager.startJob("claude", args, corrId, undefined, effectiveIdleTimeout, outputFormat, forceRefresh, undefined, undefined, claudeAsyncFrHandoff.flightRecorderEntry, claudeAsyncFrHandoff.extractUsage, true);
 logger.info(`[${corrId}] claude_request_async started job ${job.id}, outputFormat=${outputFormat}`);
 const asyncResponse = {
     success: true,
package/dist/job-store.d.ts
CHANGED
@@ -51,10 +51,35 @@ export interface JobStore {
     }): void;
     getById(id: string): JobRecord | null;
     findByRequestKey(requestKey: string): JobRecord | null;
-
+    /**
+     * Flip every `status='running'` row to `'orphaned'` at gateway boot.
+     *
+     * Returns the row count AND a snapshot of every row that was flipped, so
+     * AsyncJobManager can write a flight-recorder logComplete with the full
+     * sync-helper-equivalent payload (response from stderr||stdout,
+     * durationMs from startedAt). Pre-slice-1.5 rows that never wrote a
+     * logStart degrade silently to a no-op UPDATE inside the FR.
+     */
+    markOrphanedOnStartup(): {
+        count: number;
+        orphaned: Array<OrphanedJobSnapshot>;
+    };
     evictExpired(): number;
     close(): void;
 }
+/**
+ * Per-orphan snapshot returned by `markOrphanedOnStartup` so the
+ * AsyncJobManager constructor can build a faithful FlightLogResult for
+ * each row it flipped.
+ */
+export interface OrphanedJobSnapshot {
+    id: string;
+    correlationId: string;
+    startedAt: string;
+    stdout: string;
+    stderr: string;
+    exitCode: number | null;
+}
 /**
  * SQLite-backed job store. Default backend for production. Durable across
  * gateway restarts; safe for single-instance deployments.
@@ -69,6 +94,7 @@ export declare class SqliteJobStore implements JobStore {
 private updateCompleteStmt;
 private getByIdStmt;
 private findByRequestKeyStmt;
+private selectRunningOrphansStmt;
 private markOrphanedStmt;
 private deleteExpiredStmt;
 constructor(dbPath: string, logger?: Logger, options?: {
@@ -114,8 +140,15 @@ export declare class SqliteJobStore implements JobStore {
 /**
  * On gateway boot, flip any jobs that were 'running' to 'orphaned'.
  * The child processes were detached but can't be reattached to in this process.
+ *
+ * Returns the row count + a per-orphan snapshot so AsyncJobManager can
+ * write a flight-recorder logComplete with proper audit data
+ * (durationMs from startedAt, response from stderr||stdout).
  */
-markOrphanedOnStartup():
+markOrphanedOnStartup(): {
+    count: number;
+    orphaned: Array<OrphanedJobSnapshot>;
+};
 /**
  * Delete rows whose expires_at has passed. Returns number of rows deleted.
  */
@@ -171,7 +204,10 @@ export declare class MemoryJobStore implements JobStore {
      * In-memory stores have no cross-process state, so any "running" rows here
      * came from this very process and aren't actually orphaned. No-op.
      */
-    markOrphanedOnStartup():
+    markOrphanedOnStartup(): {
+        count: number;
+        orphaned: Array<OrphanedJobSnapshot>;
+    };
     evictExpired(): number;
     close(): void;
 }
@@ -188,7 +224,10 @@ export declare class PostgresJobStore implements JobStore {
     recordComplete(): void;
     getById(): JobRecord | null;
     findByRequestKey(): JobRecord | null;
-    markOrphanedOnStartup():
+    markOrphanedOnStartup(): {
+        count: number;
+        orphaned: Array<OrphanedJobSnapshot>;
+    };
     evictExpired(): number;
     close(): void;
 }
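The declaration comments above say the orphan snapshot lets the manager build a completion payload with `durationMs` derived from `startedAt` and a response taken from `stderr||stdout`. The code that does this lives in the AsyncJobManager constructor and is not in this excerpt, so the following is only a sketch of that mapping. `orphanToFlightResult`, the output field names, and the `"failed"` status are assumptions.

```javascript
// Hypothetical mapping from an OrphanedJobSnapshot to a completion payload.
function orphanToFlightResult(orphan, nowMs) {
    const startedMs = Date.parse(orphan.startedAt);
    return {
        correlationId: orphan.correlationId,
        status: "failed", // assumption: an orphaned job is recorded as a failure
        // Prefer stderr when it is non-empty, otherwise fall back to stdout.
        response: orphan.stderr || orphan.stdout,
        // Guard against an unparseable timestamp and against clock skew.
        durationMs: Number.isNaN(startedMs) ? 0 : Math.max(0, nowMs - startedMs),
        exitCode: orphan.exitCode,
    };
}

const snapshot = {
    id: "job-1",
    correlationId: "corr-1",
    startedAt: "2026-05-26T00:00:00.000Z",
    stdout: "partial output",
    stderr: "",
    exitCode: null,
};
const bootTimeMs = Date.parse("2026-05-26T00:00:05.000Z");
const flightResult = orphanToFlightResult(snapshot, bootTimeMs);
const badTimestamp = orphanToFlightResult({ ...snapshot, startedAt: "not a date" }, bootTimeMs);
```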
package/dist/job-store.js
CHANGED
@@ -73,6 +73,7 @@ export class SqliteJobStore {
 updateCompleteStmt;
 getByIdStmt;
 findByRequestKeyStmt;
+selectRunningOrphansStmt;
 markOrphanedStmt;
 deleteExpiredStmt;
 constructor(dbPath, logger = noopLogger, options = {}) {
@@ -148,6 +149,16 @@ export class SqliteJobStore {
     AND status IN ('running', 'completed')
     ORDER BY started_at DESC
     LIMIT 1
+`);
+// Snapshot every in-flight row's audit data BEFORE the orphan-flip
+// UPDATE so AsyncJobManager can construct a full FlightLogResult per
+// orphan. No transaction wrapper required: gateway boot is
+// single-threaded before any new jobs can arrive, so no
+// status='running' row can be inserted between this SELECT and the
+// UPDATE below.
+this.selectRunningOrphansStmt = this.db.prepare(`
+    SELECT id, correlation_id, started_at, stdout, stderr, exit_code
+    FROM jobs WHERE status = 'running'
 `);
 this.markOrphanedStmt = this.db.prepare(`
     UPDATE jobs
@@ -227,14 +238,29 @@ export class SqliteJobStore {
 /**
  * On gateway boot, flip any jobs that were 'running' to 'orphaned'.
  * The child processes were detached but can't be reattached to in this process.
+ *
+ * Returns the row count + a per-orphan snapshot so AsyncJobManager can
+ * write a flight-recorder logComplete with proper audit data
+ * (durationMs from startedAt, response from stderr||stdout).
  */
 markOrphanedOnStartup() {
     const now = new Date().toISOString();
     // Orphaned jobs retain a short window so callers can fetch the partial output,
     // then evict. Reuse the standard retention.
     const expiresAt = new Date(Date.now() + this.retentionMs).toISOString();
+    // SELECT before UPDATE — gateway boot is single-threaded so no row can
+    // appear in 'running' between the two statements.
+    const rows = (this.selectRunningOrphansStmt.all?.() ?? []);
+    const orphaned = rows.map(row => ({
+        id: row.id,
+        correlationId: row.correlation_id,
+        startedAt: row.started_at,
+        stdout: row.stdout ?? "",
+        stderr: row.stderr ?? "",
+        exitCode: row.exit_code,
+    }));
     const result = this.markOrphanedStmt.run(now, expiresAt);
-    return result?.changes ?? 0;
+    return { count: result?.changes ?? 0, orphaned };
 }
 /**
  * Delete rows whose expires_at has passed. Returns number of rows deleted.
@@ -341,7 +367,7 @@ export class MemoryJobStore {
  * came from this very process and aren't actually orphaned. No-op.
  */
 markOrphanedOnStartup() {
-    return 0;
+    return { count: 0, orphaned: [] };
 }
 evictExpired() {
     const nowIso = new Date().toISOString();
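The SQLite store above snapshots the running rows with a SELECT and only then flips them with an UPDATE, returning both the count and the snapshots. The same select-then-flip order can be sketched against a plain array, assuming an in-memory table in place of SQLite (`markOrphanedSketch` and the `rows` array are hypothetical; the row column names follow the SELECT in the diff):

```javascript
// Hypothetical in-memory version of the select-then-update sequence.
function markOrphanedSketch(rows) {
    // Step 1: snapshot every running row before anything is mutated.
    const orphaned = rows
        .filter(row => row.status === "running")
        .map(row => ({
            id: row.id,
            correlationId: row.correlation_id,
            startedAt: row.started_at,
            stdout: row.stdout ?? "",
            stderr: row.stderr ?? "",
            exitCode: row.exit_code,
        }));
    // Step 2: flip the same rows. This is only safe without a transaction
    // because nothing else can insert a running row between the two steps.
    let count = 0;
    for (const row of rows) {
        if (row.status === "running") {
            row.status = "orphaned";
            count += 1;
        }
    }
    return { count, orphaned };
}

const rows = [
    { id: "a", correlation_id: "corr-a", started_at: "2026-05-26T00:00:00.000Z", stdout: null, stderr: "boom", exit_code: null, status: "running" },
    { id: "b", correlation_id: "corr-b", started_at: "2026-05-26T00:00:01.000Z", stdout: "done", stderr: "", exit_code: 0, status: "completed" },
];
const orphanResult = markOrphanedSketch(rows);
const secondPass = markOrphanedSketch(rows);
```

Running the function a second time returns an empty result, since no row is left in the running state after the first pass.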
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "llm-cli-gateway",
-  "version": "1.
+  "version": "1.7.0",
   "mcpName": "io.github.verivus-oss/llm-cli-gateway",
   "description": "MCP server providing unified access to Claude Code, Codex, Gemini, Grok, and Mistral Vibe CLIs with session management, retry logic, async job orchestration, durable job results, and cross-LLM validation.",
   "license": "MIT",
package/socket.yml
CHANGED
@@ -14,6 +14,25 @@ version: 2
 # src/endpoint-exposure.ts also issues a HEAD probe when verifying
 # tunnel reachability — opt-in via the start:http entry point only.
 #
+# Additionally, Socket may flag `dist/index.js` and `dist/job-store.js`
+# against the `globalThis["fetch"]` rule. This is a substring-match
+# false positive (verified for v1.6.0 by sub-agent investigation on
+# 2026-05-26; same matches exist in v1.5.35). Neither file contains
+# any `fetch(`, `globalThis.fetch`, polyfill import, or any other
+# network-call construct. The matches are:
+# - dist/index.js — the English word "fetch" inside an async-defer
+# error message ("Poll with llm_job_status, fetch with
+# llm_job_result.") AND the JSON field name `fetchWith:
+# "llm_job_result"` (part of the deferred-job response contract).
+# - dist/job-store.js — the word "fetch" inside a code comment on
+# markOrphanedOnStartup() describing how callers retrieve partial
+# output from SQLite.
+# Verify with: `grep -rEn "\bfetch\(|globalThis\.fetch|globalThis\[" dist/`
+# — returns empty. Production code does not import undici / node-fetch
+# / axios / got. The cache-awareness slice (v1.6.0) introduced zero
+# new network surfaces; all I/O is filesystem (SQLite, sessions.json)
+# or in-process.
+#
 # shellAccess
 # src/executor.ts uses child_process.spawn(cmd, args, { ... }) with a
 # fixed allow-list of CLI binaries (claude / codex / gemini / grok /
|