moflo 4.9.30 → 4.9.32

@@ -0,0 +1,167 @@
1
+ # Root-Cause Discipline — Measure Twice, Cut Once
2
+
3
+ **Purpose:** The MoFlo standard for fixing bugs. We do not "shoot first and ask questions later" — we measure twice and cut once. Apply this whenever you are about to write a fix, especially when a previous fix on the same surface didn't fully work.
4
+
5
+ ---
6
+
7
+ ## The Headline Rule
8
+
9
+ **Measure twice, cut once. Step back, understand the problem holistically, then make the simplest fix that eliminates the cause.** Do not pile patch onto patch onto patch.
10
+
11
+ This is the single most important engineering posture in this project. Layered patches have produced the worst regressions, the longest debugging sessions, and the most expensive token bills. When you find yourself reaching for "another layer" — stop.
12
+
13
+ ---
14
+
15
+ ## Before You Write Fix N+1
16
+
17
+ Before adding a new fix on top of an existing one, you MUST answer all four:
18
+
19
+ | Question | If you can't answer | Action |
20
+ |----------|---------------------|--------|
21
+ | What exactly is the failure mode at the lowest level? (Not the symptom — the actual mechanism.) | You don't understand the bug yet | Investigate further; do not fix |
22
+ | Why didn't fix N work? Is it wrong, or just incomplete? | You're guessing at the gap | Read fix N's code + history; reproduce the failure |
23
+ | Would removing fix N + replacing with one cleaner fix simplify the surface? | You haven't considered consolidation | Try the consolidation first |
24
+ | What's the SIMPLEST change that makes the bug structurally impossible? | You're patching symptoms, not causes | Step back further |
25
+
26
+ If three answers are vague, you're in patch-on-patch territory. Stop and re-think.
27
+
28
+ ---
29
+
30
+ ## Patch-on-Patch Smoke Alarms
31
+
32
+ Stop and reconsider when you see yourself doing any of these:
33
+
34
+ | Smoke alarm | What it usually means | The right move |
35
+ |-------------|----------------------|----------------|
36
+ | Adding a "belt-and-suspenders" cleanup | The first cleanup is racing something — find what | Eliminate the race, not double-cleanup |
37
+ | Adding `try/catch` around code that already has `try/catch` | Outer catch is masking inner failure | Surface the inner error, don't double-wrap |
38
+ | Adding a `setTimeout` retry loop on top of an existing retry | Retry won't fix a logic bug | Fix the logic |
39
+ | Bumping a timeout because tests fail intermittently | The op is slower than expected — find why | Fix the slowness or remove the op |
40
+ | Adding a flag/env-var to "skip the broken path" | You're hiding the bug, not fixing it | Fix the path or delete it |
41
+ | Adding a workaround "until we can fix this properly" | You won't come back; "later" never happens | Fix it now or file with full context |
42
+ | Touching three files to fix one bug | Bug is misdiagnosed; one file usually suffices | Re-diagnose |
43
+
44
+ When **two or more** of these apply at once, the fix is almost certainly wrong. Throw it away and re-investigate.
45
+
46
+ ---
47
+
48
+ ## The Holistic Step-Back
49
+
50
+ When fix N didn't work, do these in order — not in parallel, not skipping steps:
51
+
52
+ 1. **Read every prior fix on this surface in full.** Not the commit message — the code. Note what each one was trying to prevent and what it actually does.
53
+ 2. **Reproduce the failure deterministically** before touching code. If you can't reproduce it, you don't understand it.
54
+ 3. **Trace the data flow.** Where does the bad state originate? What writes it? What reads it? What invariant got violated?
55
+ 4. **Question the test, not just the code.** What invariant does the failing test actually encode? Does that invariant match the runtime contract, or is the test stricter? A test stricter than the contract will produce flakes that look like bugs but aren't. (See #1017 case study.)
56
+ 5. **Identify the structural cause** — the place where the bug becomes possible, not the place where it becomes visible.
57
+ 6. **Now consider fixes.** The cheapest fix at the structural cause beats the cleverest fix at the symptom every time. If the cause is "test asserts X, runtime contract is Y, X is stricter," the fix is in the test.
58
+
59
+ If step 6 yields a fix smaller and simpler than the existing patches, **delete the existing patches** as part of the same change. Do not stack.
60
+
61
+ ---
62
+
63
+ ## Code Serves the Specification, Not the Test
64
+
65
+ **Periodically ask: "Am I solving an actual problem, or am I flailing to satisfy a flawed test?"** When several attempted fixes haven't moved the needle, the test framework is a likely suspect — but the response is never to degrade production code to make the test pass.
66
+
67
+ **Never introduce substandard code to satisfy shortcomings of the testing infrastructure.** Production code expresses the runtime contract. Tests verify the contract. When they disagree:
68
+
69
+ | Disagreement | Correct response | Wrong response |
70
+ |--------------|------------------|----------------|
71
+ | Test asserts behavior the runtime never promised | Fix the test to match the contract | Add code to satisfy the test's stricter assertion |
72
+ | Test uses an unrealistic environment (mocks the wrong layer, races a SIGKILL'd daemon, single-session asserts on a multi-session contract) | Fix the test environment | Add retry / sleep / workaround in production code |
73
+ | Test framework can't observe a legitimate runtime path | Add a test hook (`_resetForTest`, `getStateForTest`) that doesn't change runtime behavior | Restructure runtime to make the test framework's observation easier |
74
+ | Test is flaky on one platform but the runtime works | Identify why the test, not the runtime, is sensitive | Bump timeouts / retries / sleeps in production paths |
75
+
76
+ **Code purity check before any "make the test pass" change:** would you ship this change if the test didn't exist? If no, you're degrading the code to satisfy the test. Stop. Fix the test.
77
+
78
+ **Signals you're flailing for the test, not solving the bug:**
79
+
80
+ | Signal | What it actually says |
81
+ |--------|----------------------|
82
+ | You've tried 3+ fixes and nothing has moved the needle | The diagnosis is wrong; investigate before patching again |
83
+ | Each fix gets narrower / more defensive without removing the prior layer | You're piling on, not solving |
84
+ | The runtime works fine in real-world usage but the test fails | The test's spec doesn't match the contract — that's the bug |
85
+ | You'd need to add a sleep, retry, lock, or platform-special-case to make the test happy | Production code is paying for a test-environment limitation |
86
+ | Removing the test makes the bug "go away" | The test was right but the fix is wrong, OR the test was the bug — diagnose which |
87
+
88
+ The user said it directly: **"we never want to introduce substandard code to satisfy shortcomings of our testing infrastructure."** Tests serve the code; the code does not serve the tests.
89
+
90
+ When you find that the test is the actual problem: change the test, document why in the commit message, and (if the change weakens an invariant) add a separate test that captures the invariant the original was *trying* to encode without the false strictness.
91
+
92
+ ---
93
+
94
+ ## Concrete Example: #1017 Hive-Mind Shutdown
95
+
96
+ This is the canonical case study for this guidance — and it has a second-order lesson that makes it even more useful.
97
+
98
+ | Attempt | Approach | Outcome |
99
+ |---------|----------|---------|
100
+ | #1017 first try | Loop list+delete in `clearNamespace` | Race window remained — broadcasts landed mid-loop |
101
+ | #1024 layer 1 | Detach adapter BEFORE `clearNamespace` (after `terminateAgent`) | Race narrowed but not eliminated |
102
+ | #1024 layer 2 | Add `purgeHiveNamespacesDirect` raw sql.js DELETE | Looked bulletproof; actually clobber-prone vs daemon's stale snapshot (#981 single-writer) |
103
+ | #1024 declared green | All 6 CI checks pass once | Same flake reappeared on next PR's CI |
104
+ | #1027 attempt 4 | Move `adapter.detach()` BEFORE `terminateAgent`; delete `purgeHiveNamespacesDirect` | Code simplified by -73 LOC. **Same flake on macos-latest CI.** |
105
+ | #1027 — actual fix | Run launcher a SECOND time after doctor in the populated harness | Test passes. Race is intrinsic to multi-process sql.js + daemon kill timing; the harness assertion was over-strict. |
106
+
107
+ The first three attempts kept asking "how do we delete this row harder?" The fourth attempt was a structural simplification that was correct on its own merits (-73 LOC, removed dead code, simpler shutdown ordering) but **did not fix the flake**.
108
+
109
+ The actual root cause was outside the surface every patch had touched: the populated harness was asserting "ephemerals purged after one launcher run" when the **runtime contract is "ephemerals purged at next session-start launcher"**. The doctor's hive-mind probe writes a row intentionally; that row is supposed to live until the NEXT session purges it. The test was conflating "purge mechanism works" with "purge happens within one session" — those are different invariants and only the first is the real product behavior.
110
+
111
+ **Two lessons stack here:**
112
+
113
+ 1. **Don't pile layers** (the original lesson): four shutdown patches, each narrower than the last, none structurally sufficient.
114
+ 2. **Question the test, not just the code** (the second-order lesson): if you've been fighting a race for four PRs and the simplest in-code fix doesn't move the needle, the spec encoded in the test may be wrong. A test that is stricter than the runtime contract WILL produce flakes that look like product bugs but aren't.
115
+
116
+ Together: when a fix isn't working, ask both "what writes the bad state?" AND "is this state actually bad in the runtime contract, or only in the test's expectation?"
117
+
118
+ ---
119
+
120
+ ## When You Genuinely Need a Belt-and-Suspenders
121
+
122
+ Belts-and-suspenders are not always wrong. They are right when:
123
+
124
+ | Condition | Example |
125
+ |-----------|---------|
126
+ | The two layers protect against **different** failure modes | atomic-write tmp+fsync+rename: tmp protects partial writes; fsync protects OS cache; rename protects readers — three concerns, three mechanisms |
127
+ | The first layer's failure is **silent**, the second surfaces it | A retry that logs the first failure before re-attempting |
128
+ | Each layer has a **stated, documented reason** to stay even when the other is present | A fallback path with a comment explaining when the primary doesn't reach |
129
+
130
+ They are wrong when both layers protect against the **same** failure mode and you're hoping at least one wins. That's hope, not engineering.
131
+
132
+ ---
133
+
134
+ ## What This Means for PR Reviews
135
+
136
+ Reviewers should reject — not just question — PRs that show patch-on-patch signatures:
137
+
138
+ | Signal | Reviewer action |
139
+ |--------|----------------|
140
+ | Same file/function touched in 3+ recent commits, same bug | Ask: "is the prior fix wrong? remove it" |
141
+ | New fix adds a layer without removing one | Ask: "what was wrong with the prior layer? why does it stay?" |
142
+ | Comment in new code says "for safety" or "just in case" | Ask: "what specific failure is this preventing? cite the line that produces it" |
143
+ | The PR description says "this should fix the flake" without a deterministic repro | Ask: "what was the actual root cause? the writeup doesn't name it" |
144
+
145
+ These questions are not pedantic. They are the difference between fixing a bug and growing the surface area of bugs.
146
+
147
+ ---
148
+
149
+ ## How to Apply When You Are Stuck
150
+
151
+ If you genuinely cannot find the root cause after stepping back:
152
+
153
+ 1. **Stop fixing. Start measuring.** Add logging at every state transition. Reproduce. Read the log.
154
+ 2. **Ask the user before patching.** A two-line confirmation question costs less than a wrong fix.
155
+ 3. **File the issue with what you DO know.** Partial diagnosis with logs is more useful than a guessed fix.
156
+ 4. **Never ship "I think this might work."** That phrasing is a self-warning that the diagnosis isn't done.
157
+
158
+ It is always cheaper to admit uncertainty than to ship a layered patch that creates two new bugs.
159
+
160
+ ---
161
+
162
+ ## See Also
163
+
164
+ - `.claude/guidance/moflo-error-handling.md` — Silent failures are the prerequisite condition for most patch-on-patch sagas; fix those first
165
+ - `.claude/guidance/moflo-source-hygiene.md` — When you decide to delete redundant code, the canonical-location rules tell you what's safe to remove
166
+ - `feedback_no_layered_workarounds.md` (auto-memory) — The personal-feedback version of this rule, recorded from prior incidents
167
+ - `feedback_ci_flake_means_not_done.md` (auto-memory) — A flake that "passed on rerun" is not fixed; root-cause it under this discipline
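The guidance file above recommends test hooks such as `_resetForTest` and `getStateForTest` that observe or reset state without changing runtime behavior. A minimal sketch of that idea follows, assuming a module-level cache; `getAccessor`, `cachedAccessor`, and `initCount` are hypothetical names used only for illustration and are not taken from the MoFlo source.

```javascript
// Hypothetical module state; the names are illustrative, not MoFlo's API.
let cachedAccessor = null;
let initCount = 0;

// Runtime path: lazily builds the accessor once per process.
function getAccessor() {
  if (cachedAccessor === null) {
    initCount += 1;
    cachedAccessor = { entries: new Map() };
  }
  return cachedAccessor;
}

// Test hook: observes state without modifying it.
function getStateForTest() {
  return { initialized: cachedAccessor !== null, initCount };
}

// Test hook: restores the cold state between test cases.
function _resetForTest() {
  cachedAccessor = null;
  initCount = 0;
}
```

The runtime function `getAccessor` never calls either hook, so production behavior is the same whether or not a test uses them. That is the property the guidance asks for.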
@@ -132,6 +132,20 @@ export class FlagEmbedding {
132
132
  const session = await InferenceSession.create(modelPath, {
133
133
  executionProviders: ['cpu'],
134
134
  graphOptimizationLevel: 'all',
135
+ // Suppress ORT's WARNING-level chatter on session bring-up. ORT 1.26.0
136
+ // emits a `[W:onnxruntime ... GetPciBusId] Skipping pci_bus_id` line on
137
+ // Linux Azure VMs whose `/sys/devices/...` filenames don't match the
138
+ // `[0-9a-f]+:[0-9a-f]+:[0-9a-f]+.[0-9a-f]+` PCI pattern; the warning
139
+ // is harmless (we run on the CPU EP only) but leaks to stderr and
140
+ // confuses users into thinking moflo is broken. 0=verbose, 1=info,
141
+ // 2=warning (default), 3=error, 4=fatal — error is the right level
142
+ // because genuine session bring-up failures still surface.
143
+ //
144
+ // Re-audit when bumping fastembed or onnxruntime-node: ORT
145
+ // occasionally promotes deprecation / model-compatibility notices to
146
+ // WARNING that would now be hidden. If a model upgrade ever lands
147
+ // alongside this suppression, drop to 2 once to scan the output.
148
+ logSeverityLevel: 3,
135
149
  });
136
150
  return new FlagEmbedding(tokenizer, session);
137
151
  }
@@ -8,17 +8,30 @@
8
8
  * For `fast-all-MiniLM-L6-v2`, the URL slug is `sentence-transformers-all-MiniLM-L6-v2`
9
9
  * but the on-disk directory keeps the `fast-` prefix — verbatim from upstream.
10
10
  *
11
- * Concurrency: parallel callers downloading the same model atomic-rename the
12
- * tarball through a unique temp path, so Windows file locks during extraction
13
- * never collide. The final model dir is the synchronization point.
11
+ * Concurrency: a per-model file lock (`<cacheDir>/.<model>.download.lock`,
12
+ * created with `wx`) serializes the download/extract for any number of
13
+ * parallel processes: only one process performs the work; the rest poll for
14
+ * the completion sentinel. This was issue #1021's secondary failure mode:
15
+ * the smoke harness spawns ~12 parallel doctor + memory probes on a cold
16
+ * cache, and Windows file locking exposed the race when the in-tree
17
+ * "synchronization point" was just a shared directory write.
14
18
  */
15
19
  import { createWriteStream, existsSync, mkdirSync, renameSync, rmSync, writeFileSync, } from 'node:fs';
16
20
  import { homedir } from 'node:os';
17
21
  import { dirname, join } from 'node:path';
18
22
  import { pipeline } from 'node:stream/promises';
19
23
  import { Readable } from 'node:stream';
24
+ import { setTimeout as delay } from 'node:timers/promises';
20
25
  import { x as tarExtract } from 'tar';
21
26
  const GCS_BASE_URL = 'https://storage.googleapis.com/qdrant-fastembed';
27
+ // Lock-poll: how long a non-holder waits for the holder to finish before
28
+ // concluding the holder crashed. Cold-fetch is ~90 MB on slow CI runners, so
29
+ // a generous timeout avoids false takeovers under network back-pressure.
30
+ const LOCK_TIMEOUT_MS = 120_000;
31
+ const LOCK_POLL_INTERVAL_MS = 250;
32
+ // Standard transient-error retry per feedback_transient_retry_circuit_breaker.md:
33
+ // 50/200/800ms backoff, only on network errors and 5xx (4xx is deterministic).
34
+ const HTTP_BACKOFF_MS = [50, 200, 800];
22
35
  /**
23
36
  * Sentinel file written into the model directory only after the tarball has
24
37
  * been fully downloaded AND extracted. Cache hits without it are treated as
@@ -50,28 +63,121 @@ function gcsSlugFor(model) {
50
63
  export function resolveCacheDir(explicit, env = process.env) {
51
64
  return explicit ?? env.FASTEMBED_CACHE ?? join(homedir(), '.cache', 'fastembed');
52
65
  }
66
+ class TransientHttpError extends Error {
67
+ constructor(message) {
68
+ super(message);
69
+ this.name = 'TransientHttpError';
70
+ }
71
+ }
53
72
  /**
54
73
  * Stream the tarball to a unique temp path, then atomic-rename to the final
55
- * tarball path before extracting. The temp suffix prevents two concurrent
56
- * downloads from clobbering each other's write stream — extraction itself is
57
- * the slow step on Windows where file-lock contention shows up.
74
+ * tarball path before extracting. The temp suffix prevents the in-flight
75
+ * write stream from being observed at the final path — extraction always
76
+ * sees a complete file.
77
+ *
78
+ * Throws `TransientHttpError` on 5xx / network failure (caller retries) and
79
+ * a plain Error on 4xx (caller fails fast — retrying won't help).
58
80
  */
59
81
  async function downloadTarball(url, destPath, showProgress, deps) {
60
82
  const fetchFn = deps.fetchImpl ?? fetch;
61
83
  const tmpPath = `${destPath}.${process.pid}.tmp`;
62
84
  mkdirSync(dirname(destPath), { recursive: true });
63
- const res = await fetchFn(url);
85
+ let res;
86
+ try {
87
+ res = await fetchFn(url);
88
+ }
89
+ catch (err) {
90
+ throw new TransientHttpError(`Model download failed: GET ${url} → ${err.message}`);
91
+ }
64
92
  if (!res.ok || !res.body) {
65
- throw new Error(`Model download failed: GET ${url} → ${res.status} ${res.statusText}`);
93
+ const msg = `Model download failed: GET ${url} → ${res.status} ${res.statusText}`;
94
+ if (res.status >= 500)
95
+ throw new TransientHttpError(msg);
96
+ throw new Error(msg);
66
97
  }
67
98
  if (showProgress) {
68
99
  const total = Number(res.headers.get('content-length') ?? 0);
69
100
  const totalMb = (total / (1024 * 1024)).toFixed(1);
70
101
  process.stderr.write(`fastembed: downloading ${totalMb} MB from ${url}\n`);
71
102
  }
72
- await pipeline(Readable.fromWeb(res.body), createWriteStream(tmpPath));
103
+ try {
104
+ await pipeline(Readable.fromWeb(res.body), createWriteStream(tmpPath));
105
+ }
106
+ catch (err) {
107
+ rmSync(tmpPath, { force: true });
108
+ throw new TransientHttpError(`Model download stream failed mid-transfer (${url}): ${err.message}`);
109
+ }
73
110
  renameSync(tmpPath, destPath);
74
111
  }
112
+ async function downloadTarballWithRetry(url, destPath, showProgress, deps) {
113
+ let lastErr;
114
+ for (let attempt = 0; attempt <= HTTP_BACKOFF_MS.length; attempt++) {
115
+ try {
116
+ await downloadTarball(url, destPath, showProgress, deps);
117
+ return;
118
+ }
119
+ catch (err) {
120
+ lastErr = err;
121
+ if (!(err instanceof TransientHttpError) || attempt === HTTP_BACKOFF_MS.length)
122
+ break;
123
+ if (showProgress) {
124
+ process.stderr.write(`fastembed: download attempt ${attempt + 1} failed (${err.message}); retrying in ${HTTP_BACKOFF_MS[attempt]}ms.\n`);
125
+ }
126
+ await delay(HTTP_BACKOFF_MS[attempt]);
127
+ }
128
+ }
129
+ throw lastErr;
130
+ }
131
+ /**
132
+ * Cross-process serialization for the download/extract step. Lock holder runs
133
+ * `work`; non-holders poll for the completion sentinel and return as soon as
134
+ * it appears. If the lock holder crashes (lockfile remains but no sentinel
135
+ * after the timeout), the next caller cleans up and retries — preventing a
136
+ * permanently-stuck cache after a Ctrl+C mid-download.
137
+ */
138
+ async function withModelLock(lockPath, completionPath, work) {
139
+ try {
140
+ writeFileSync(lockPath, String(process.pid), { flag: 'wx' });
141
+ }
142
+ catch (err) {
143
+ if (err.code !== 'EEXIST')
144
+ throw err;
145
+ await waitForCompletionOrTakeover(lockPath, completionPath, work);
146
+ return;
147
+ }
148
+ try {
149
+ await work();
150
+ }
151
+ finally {
152
+ try {
153
+ rmSync(lockPath, { force: true });
154
+ }
155
+ catch { /* best effort */ }
156
+ }
157
+ }
158
+ async function waitForCompletionOrTakeover(lockPath, completionPath, work) {
159
+ const deadline = Date.now() + LOCK_TIMEOUT_MS;
160
+ while (Date.now() < deadline) {
161
+ if (existsSync(completionPath))
162
+ return;
163
+ if (!existsSync(lockPath)) {
164
+ // Holder finished without writing the sentinel (crashed). Try to take
165
+ // over the lock ourselves.
166
+ await withModelLock(lockPath, completionPath, work);
167
+ return;
168
+ }
169
+ await delay(LOCK_POLL_INTERVAL_MS);
170
+ }
171
+ // Stale lock — clear it and let the next caller (or our own retry above)
172
+ // pick up the work. Force unlinking is safer than leaving the cache
173
+ // permanently wedged.
174
+ try {
175
+ rmSync(lockPath, { force: true });
176
+ }
177
+ catch { /* best effort */ }
178
+ throw new Error(`fastembed: timed out after ${LOCK_TIMEOUT_MS}ms waiting for ${lockPath}. ` +
179
+ `Stale lock cleared — retry the operation.`);
180
+ }
75
181
  /**
76
182
  * Ensure the per-model directory exists in the cache. Returns the absolute
77
183
  * path. If already present AND the completion sentinel is in place, no
@@ -86,25 +192,34 @@ async function downloadTarball(url, destPath, showProgress, deps) {
86
192
  */
87
193
  export async function retrieveModel(model, cacheDir, showProgress, deps = {}) {
88
194
  const modelDir = join(cacheDir, model);
89
- if (existsSync(modelDir)) {
90
- if (existsSync(join(modelDir, COMPLETION_SENTINEL)))
91
- return modelDir;
92
- if (showProgress) {
93
- process.stderr.write(`fastembed: cached model at ${modelDir} is incomplete (no completion marker); redownloading.\n`);
94
- }
95
- rmSync(modelDir, { recursive: true, force: true });
96
- }
195
+ const completionPath = join(modelDir, COMPLETION_SENTINEL);
196
+ // Fast path: complete cache hit needs no lock, no fs writes.
197
+ if (existsSync(completionPath))
198
+ return modelDir;
97
199
  mkdirSync(cacheDir, { recursive: true });
200
+ const lockPath = join(cacheDir, `.${model}.download.lock`);
98
201
  const tarballPath = join(cacheDir, `${model}.tar.gz`);
99
202
  const url = `${GCS_BASE_URL}/${gcsSlugFor(model)}.tar.gz`;
100
- await downloadTarball(url, tarballPath, showProgress, deps);
101
- const extract = deps.extract ?? tarExtract;
102
- await extract({ file: tarballPath, cwd: cacheDir });
103
- rmSync(tarballPath, { force: true });
104
- if (!existsSync(modelDir)) {
105
- throw new Error(`Model archive extracted but ${modelDir} is missing — corrupt tarball?`);
106
- }
107
- writeFileSync(join(modelDir, COMPLETION_SENTINEL), '');
203
+ await withModelLock(lockPath, completionPath, async () => {
204
+ // Re-check inside the lock — another process may have completed the
205
+ // download between our fast-path check and our lock acquisition.
206
+ if (existsSync(completionPath))
207
+ return;
208
+ if (existsSync(modelDir)) {
209
+ if (showProgress) {
210
+ process.stderr.write(`fastembed: cached model at ${modelDir} is incomplete (no completion marker); redownloading.\n`);
211
+ }
212
+ rmSync(modelDir, { recursive: true, force: true });
213
+ }
214
+ await downloadTarballWithRetry(url, tarballPath, showProgress, deps);
215
+ const extract = deps.extract ?? tarExtract;
216
+ await extract({ file: tarballPath, cwd: cacheDir });
217
+ rmSync(tarballPath, { force: true });
218
+ if (!existsSync(modelDir)) {
219
+ throw new Error(`Model archive extracted but ${modelDir} is missing — corrupt tarball?`);
220
+ }
221
+ writeFileSync(completionPath, '');
222
+ });
108
223
  return modelDir;
109
224
  }
110
225
  //# sourceMappingURL=model-loader.js.map
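The retry loop in this hunk separates transient failures (retried with backoff) from deterministic ones (failed fast). The sketch below isolates that pattern under stated assumptions: `TransientError` and `withBackoff` are illustrative names, and the injected `sleep` parameter is a hypothetical addition that lets a caller or test control the waiting. It is not the code shipped in the package.

```javascript
// Marks failures worth retrying (network errors and 5xx in the hunk above).
class TransientError extends Error {}

// `sleep` is injected so callers and tests control the waiting (hypothetical signature).
async function withBackoff(work, backoffMs, sleep) {
  let lastErr;
  for (let attempt = 0; attempt <= backoffMs.length; attempt++) {
    try {
      return await work(attempt);
    } catch (err) {
      lastErr = err;
      // Deterministic failures and the final attempt end the loop immediately.
      if (!(err instanceof TransientError) || attempt === backoffMs.length) break;
      await sleep(backoffMs[attempt]);
    }
  }
  throw lastErr;
}
```

With a three-entry schedule such as `[50, 200, 800]`, the work function runs at most four times, and an error that is not a `TransientError` is rethrown after the first attempt without sleeping.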
@@ -9,7 +9,7 @@
9
9
  */
10
10
  import * as readline from 'node:readline';
11
11
  import { loadSpellEngine, } from '../services/engine-loader.js';
12
- import { createDashboardMemoryAccessor } from '../services/daemon-dashboard.js';
12
+ import { getSharedMemoryAccessor } from '../services/daemon-dashboard.js';
13
13
  /**
14
14
  * Wrap a MemoryAccessor with a write-failure counter so the [epic] summary
15
15
  * can warn when spell progress didn't reach disk (#982). Without this, a
@@ -56,17 +56,22 @@ async function promptAcceptPermissions() {
56
56
  */
57
57
  export async function runEpicSpell(yamlContent, options = {}) {
58
58
  const engine = await loadSpellEngine();
59
- // Lazily initialize a real memory accessor so execution records
60
- // are persisted and visible in the dashboard.
59
+ // Lazily wrap the process-wide shared accessor (#1020) so execution
60
+ // records are persisted and visible in the dashboard. The shared helper
61
+ // owns the warn-and-return-null degradation; we only attach the
62
+ // failed-write counter on top of a successful inner accessor.
61
63
  if (!memoryAccessor) {
62
- try {
63
- const inner = await createDashboardMemoryAccessor();
64
+ const inner = await getSharedMemoryAccessor();
65
+ if (inner) {
64
66
  memoryAccessor = trackPersistFailures(inner);
65
67
  console.log('[epic] Memory accessor ready — spell progress will be persisted');
66
68
  }
67
- catch (err) {
68
- console.warn(`[epic] Dashboard memory unavailable: ${err.message ?? err}`);
69
- console.warn('[epic] Spell executions will NOT appear in the dashboard');
69
+ else {
70
+ // The shared helper already emitted `[memory]`-prefixed warns. Add an
71
+ // `[epic]`-tagged note so a user running `flo epic` can correlate the
72
+ // missing dashboard history with this command without scanning for a
73
+ // `[memory]` line elsewhere in the output.
74
+ console.warn('[epic] ⚠ Memory unavailable — this run will not appear in the dashboard');
70
75
  }
71
76
  }
72
77
  // memoryAccessor is module-cached, so `failedWrites` is cumulative across
@@ -719,9 +719,22 @@ export const hiveMindTools = [
719
719
  workerCount,
720
720
  };
721
721
  }
722
- // Story #807: terminate coordinator-side worker records before we
723
- // wipe the hive state so swarm agent_list reflects the shutdown.
724
- // allSettled so one failed terminate doesn't strand the rest.
722
+ // #1017: detach the adapter FIRST, before any code that broadcasts
723
+ // hive-mind events. terminateAgent below sends agent_terminate
724
+ // broadcasts on the hive-mind namespace; with the adapter still
725
+ // listening, those broadcasts register fire-and-forget storeEntry
726
+ // calls that can land after clearNamespace runs. Detaching first means
727
+ // every subsequent broadcast hits a dead listener and never persists,
728
+ // so clearNamespace operates on a deterministic, unchanging set.
729
+ const adapter = _writeThroughAdapter;
730
+ if (adapter) {
731
+ adapter.detach();
732
+ _writeThroughAdapter = null;
733
+ }
734
+ // Story #807: terminate coordinator-side worker records so swarm
735
+ // agent_list reflects the shutdown. allSettled so one failed terminate
736
+ // doesn't strand the rest. Broadcasts emitted here are intentionally
737
+ // ignored by the (now-detached) adapter.
725
738
  try {
726
739
  const coordinator = await getSwarmCoordinator();
727
740
  const results = await Promise.allSettled(hiveState.workers.map(id => coordinator.terminateAgent(id, { reason: 'hive-mind_shutdown', force: true })));
@@ -734,23 +747,24 @@ export const hiveMindTools = [
734
747
  catch (err) {
735
748
  process.stderr.write(`[hive-mind_shutdown] coordinator cleanup failed: ${err.message}\n`);
736
749
  }
737
- // Clear write-through namespaces in Memory DB
738
- try {
739
- const adapter = await getWriteThroughAdapter();
740
- await adapter.clearNamespace(HIVE_NS);
741
- await adapter.clearNamespace(HIVE_MEMORY_NS);
742
- }
743
- catch {
744
- // Best-effort cleanup
750
+ // Drain whatever the adapter already had in flight at detach, then
751
+ // delete the persisted hive-mind rows. Routed through the chokepoint
752
+ // (deleteEntry daemon RPC when alive), so the daemon's in-memory
753
+ // snapshot stays consistent with disk and cannot clobber the cleanup
754
+ // on its next flush.
755
+ if (adapter) {
756
+ try {
757
+ await adapter.clearNamespace(HIVE_NS);
758
+ await adapter.clearNamespace(HIVE_MEMORY_NS);
759
+ }
760
+ catch {
761
+ // Best-effort cleanup
762
+ }
745
763
  }
746
764
  // Shutdown MessageBus for hive-mind
747
765
  try {
748
766
  const bus = await getMessageBus();
749
767
  bus.unsubscribe('hive-mind-system');
750
- if (_writeThroughAdapter) {
751
- _writeThroughAdapter.detach();
752
- _writeThroughAdapter = null;
753
- }
754
768
  }
755
769
  catch {
756
770
  // Bus may not be initialized
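The ordering change in this hunk (detach the adapter before anything broadcasts) can be illustrated with a toy model. The sketch below uses in-memory stand-ins only: `makeHarness`, `shutdown`, and the listener set are hypothetical and do not reproduce MoFlo's adapter, message bus, or daemon. It shows one way a fire-and-forget write can land after the namespace is cleared when the listener is still attached.

```javascript
// Hypothetical in-memory stand-ins; not MoFlo's actual adapter or bus.
function makeHarness() {
  const rows = new Map();
  const listeners = new Set();

  function onBroadcast(msg) {
    // Fire-and-forget write: it lands after the current synchronous code finishes.
    Promise.resolve().then(() => rows.set(msg.id, msg));
  }
  function broadcast(msg) {
    for (const listener of listeners) listener(msg);
  }
  const adapter = {
    detach() { listeners.delete(onBroadcast); },
    async clearNamespace() { rows.clear(); },
  };
  listeners.add(onBroadcast); // adapter starts attached
  return { rows, adapter, broadcast };
}

async function shutdown(harness, detachFirst) {
  const { adapter, broadcast } = harness;
  if (detachFirst) adapter.detach();
  broadcast({ id: 'agent_terminate' }); // emitted while terminating workers
  await adapter.clearNamespace();
  if (!detachFirst) adapter.detach();
  // Give any pending fire-and-forget writes a chance to settle.
  await new Promise((resolve) => setTimeout(resolve, 0));
}
```

In this model, calling `shutdown(harness, false)` leaves one stray row behind because the queued write runs after `rows.clear()`, while `shutdown(harness, true)` leaves the map empty because the broadcast reaches no listener.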
@@ -12,6 +12,7 @@ import { findProjectRoot } from '../services/project-root.js';
12
12
  import { buildGrimoire } from '../services/grimoire-builder.js';
13
13
  import { errorDetail } from '../shared/utils/error-detail.js';
14
14
  import { inferSpellTier } from '../spells/core/spell-tier.js';
15
+ import { getSharedMemoryAccessor } from '../services/daemon-dashboard.js';
15
16
  // ============================================================================
16
17
  // Constants
17
18
  // ============================================================================
@@ -53,16 +54,23 @@ function trackResult(tracked, result) {
53
54
  tracked.result = result;
54
55
  tracked.completedAt = new Date().toISOString();
55
56
  }
57
+ // Memory accessor wiring (#1016): without `getSharedMemoryAccessor()`,
58
+ // runner.storeProgress() writes go to noopMemory and The Luminarium's
59
+ // "Flo Runs" tab never sees flo run / spell_cast invocations. The shared
60
+ // accessor is the same singleton runner-adapter.ts uses for `flo epic`
61
+ // (one cold init per process — see #1020).
56
62
  /** Execute a definition via the engine with tracking and error handling. */
57
63
  async function executeAndTrack(engine, definition, args, options = {}) {
58
64
  const spellId = `sp-${Date.now()}`;
59
65
  const tracked = trackStart(spellId, definition.name, definition.description);
60
66
  try {
61
67
  const sandboxConfig = await engine.loadSandboxConfigFromProject(findProjectRoot());
68
+ const memory = await getSharedMemoryAccessor();
62
69
  const result = await engine.bridgeExecuteSpell(definition, args, {
63
70
  spellId,
64
71
  sandboxConfig,
65
72
  forceCredentialReprompt: options.forceCredentialReprompt,
73
+ ...(memory ? { memory } : {}),
66
74
  });
67
75
  trackResult(tracked, result);
68
76
  return withSpellSource(serializeResult(result), options.sourceFile, options.tier);
@@ -112,6 +112,30 @@ export async function bridgeStoreEntry(options) {
112
112
  const now = Date.now();
113
113
  const guardResult = await guardValidate(registry, 'store', { key, namespace, size: value.length });
114
114
  if (!guardResult.allowed) {
115
+ // Dedupe rejection means the same `(op, params)` write just succeeded
116
+ // — the caller's data is already durable. Look up the existing row so
117
+ // we can return its id with success:true; this matches what the
118
+ // dedupe semantically means (a no-op, not a failure). Other rejection
119
+ // reasons (rate limit, etc.) remain real failures. Match the literal
120
+ // reason string rather than a substring regex so a future rejection
121
+ // worded with "duplicate mutation" but different semantics doesn't
122
+ // get silently swallowed.
123
+ if (guardResult.reason === 'duplicate mutation within dedupe window') {
124
+ let existingId = null;
125
+ const probe = ctx.db.prepare(`SELECT id FROM memory_entries WHERE namespace = ? AND key = ? AND status = 'active' LIMIT 1`);
126
+ try {
127
+ probe.bind([namespace, key]);
128
+ if (probe.step()) {
129
+ existingId = String(probe.getAsObject().id);
130
+ }
131
+ }
132
+ finally {
133
+ probe.free();
134
+ }
135
+ if (existingId) {
136
+ return { success: true, id: existingId };
137
+ }
138
+ }
115
139
  return { success: false, id, error: `MutationGuard rejected: ${guardResult.reason}` };
116
140
  }
117
141
  const resolved = await resolveBridgeEmbedding(value, options.precomputedEmbedding, options.generateEmbeddingFlag, namespace);
@@ -120,6 +144,48 @@ export async function bridgeStoreEntry(options) {
120
144
  }
121
145
  const { json: embeddingJson, dimensions, model } = resolved;
122
146
  const embeddingResponse = embeddingResponseFrom(resolved);
147
+ // Idempotency guard, mirrors the one in `memory-initializer.ts`'s raw-
148
+ // sql.js fallback. When the daemon route just wrote this exact row but
149
+ // the client missed the ack, we land here with the row already on disk;
150
+ // a plain INSERT would trip UNIQUE and surface as `[moflo] bridge
151
+ // operation failed:` stderr noise even though the data is durable.
152
+ // Probe first so withDb never sees the throw.
153
+ //
154
+ // Limitations carried forward: only `content` is compared, not `tags`
155
+ // or `ttl`. The targeted scenario is the same caller's request being
156
+ // processed twice (daemon write + client retry), where every option is
157
+ // identical by definition — a different caller varying `tags` after a
158
+ // missed-ack would still see this as an idempotent no-op rather than
159
+ // an update. `cached: false, attested: false` because the prior writer
160
+ // already ran post-persist bookkeeping; this process's in-memory cache
161
+ // stays cold for one retrieve until the read path warms it (perf only,
162
+ // not correctness).
163
+ if (!options.upsert) {
164
+ let existingId = null;
165
+ let existingContent = null;
166
+ const probe = ctx.db.prepare(`SELECT id, content FROM memory_entries WHERE namespace = ? AND key = ? AND status = 'active' LIMIT 1`);
167
+ try {
168
+ probe.bind([namespace, key]);
169
+ if (probe.step()) {
170
+ const row = probe.getAsObject();
171
+ existingId = String(row.id);
172
+ existingContent = row.content;
173
+ }
174
+ }
175
+ finally {
176
+ probe.free();
177
+ }
178
+ if (existingId && existingContent === value) {
179
+ return {
180
+ success: true,
181
+ id: existingId,
182
+ embedding: embeddingResponse,
183
+ guarded: true,
184
+ cached: false,
185
+ attested: false,
186
+ };
187
+ }
188
+ }
123
189
  const insertSql = options.upsert
124
190
  ? `INSERT OR REPLACE INTO memory_entries (
125
191
  id, key, namespace, content, type,
@@ -1650,6 +1650,42 @@ export async function storeEntry(options) {
1650
1650
  embeddingModel = embResult.model;
1651
1651
  }
1652
1652
  }
1653
+ // Idempotency guard. By the time we reach the raw-sql.js fallback, an
1654
+ // earlier write attempt — daemon route via `tryDaemonStore`, or bridge
1655
+ // via `bridgeStoreEntry` — may have already persisted this exact row to
1656
+ // disk. If a post-persist throw escaped the bridge's inner guards (#994,
1657
+ // #982), `bridgeStoreEntry` returned null and we landed here. Re-running
1658
+ // a plain INSERT would then trip the UNIQUE constraint on `(namespace,
1659
+ // key)` and surface as `exit 1` even though the data is durable on disk
1660
+ // — exactly the cascade described in `bridge-entries.ts:205`. If the
1661
+ // existing row matches the value the caller asked us to write, treat
1662
+ // this as a successful no-op and propagate the existing id instead of
1663
+ // re-inserting. If the content differs, fall through to INSERT — the
1664
+ // UNIQUE error is then a real "key already taken with other content"
1665
+ // signal that the caller deserves to see.
1666
+ if (!upsert) {
1667
+ let existingRow = null;
1668
+ const probe = db.prepare(`SELECT id, content FROM memory_entries WHERE namespace = ? AND key = ? AND status = 'active' LIMIT 1`);
1669
+ try {
1670
+ probe.bind([namespace, key]);
1671
+ if (probe.step()) {
1672
+ existingRow = probe.getAsObject();
1673
+ }
1674
+ }
1675
+ finally {
1676
+ probe.free();
1677
+ }
1678
+ if (existingRow && existingRow.content === value) {
1679
+ db.close();
1680
+ return {
1681
+ success: true,
1682
+ id: String(existingRow.id),
1683
+ embedding: embeddingJson
1684
+ ? { dimensions: embeddingDimensions, model: embeddingModel }
1685
+ : undefined,
1686
+ };
1687
+ }
1688
+ }
1653
1689
  // Insert or update entry (upsert mode uses REPLACE)
1654
1690
  const insertSql = upsert
1655
1691
  ? `INSERT OR REPLACE INTO memory_entries (
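Reviewer note: the idempotency guard added to `bridgeStoreEntry` and `storeEntry` can be illustrated without a database. The sketch below is not the package's code. It assumes a `Map` standing in for `memory_entries` and its UNIQUE `(namespace, key)` constraint, and `storeOnce` is a hypothetical name. It only shows the decision the guard makes: identical content is a successful no-op, while different content stays a visible conflict.

```javascript
// Hypothetical in-memory table keyed by "namespace\u0000key"; it stands in
// for the UNIQUE (namespace, key) constraint on memory_entries.
const rows = new Map();
let nextId = 1;

function storeOnce({ namespace, key, value, upsert = false }) {
  const rowKey = `${namespace}\u0000${key}`;
  const existing = rows.get(rowKey);

  if (!upsert && existing) {
    if (existing.content === value) {
      // Same content is already persisted: report success with the existing id.
      return { success: true, id: existing.id, guarded: true };
    }
    // Different content: a genuine conflict that the caller should see.
    return { success: false, error: `UNIQUE constraint failed: ${namespace}/${key}` };
  }

  // Assumption for brevity: an upsert keeps the existing id.
  const id = existing ? existing.id : String(nextId++);
  rows.set(rowKey, { id, content: value });
  return { success: true, id };
}

const first = storeOnce({ namespace: 'runs', key: 'sp-1', value: 'done' });
const retry = storeOnce({ namespace: 'runs', key: 'sp-1', value: 'done' });
const clash = storeOnce({ namespace: 'runs', key: 'sp-1', value: 'other' });
console.log(first.id === retry.id, retry.guarded, clash.success); // true true false
```

As in the diff, only the content is compared, so a retry that varies other options would still be treated as a no-op in this sketch.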
@@ -16,6 +16,46 @@ import { createServer } from 'node:http';
16
16
  import { errorDetail } from '../shared/utils/error-detail.js';
17
17
  import { handleMemoryStore, handleMemoryDelete, handleMemoryBatch, matchMemoryRpcRoute, } from './daemon-memory-rpc.js';
18
18
  export const DEFAULT_DASHBOARD_PORT = 3117;
19
+ /**
20
+ * Process-wide promise for the shared MemoryAccessor. Memoized as a *promise*
21
+ * (not the resolved value) so concurrent first-callers share a single init
22
+ * — without this, two near-simultaneous calls would each kick off their own
23
+ * `createDashboardMemoryAccessor()` chain and the loser's accessor would
24
+ * leak. The race fix originated in #1016 inside `mcp-tools/spell-tools.ts`;
25
+ * #1020 lifted it into this shared helper so `epic/runner-adapter.ts` (which
26
+ * had the same latent race) and any future caller benefit from one cold
27
+ * init per process.
28
+ */
29
+ let _sharedAccessorPromise = null;
30
+ /**
31
+ * Return the process-wide MemoryAccessor, lazy-initialized on first call and
32
+ * cached as a promise thereafter. Returns `null` (with a warn log) if init
33
+ * fails so callers can degrade gracefully — the spell still runs, the user
34
+ * just doesn't see the run in The Luminarium.
35
+ */
36
+ export function getSharedMemoryAccessor() {
37
+ if (_sharedAccessorPromise)
38
+ return _sharedAccessorPromise;
39
+ _sharedAccessorPromise = (async () => {
40
+ try {
41
+ return await createDashboardMemoryAccessor();
42
+ }
43
+ catch (err) {
44
+ console.warn(`[memory] dashboard accessor unavailable: ${err.message ?? err}`);
45
+ console.warn('[memory] runs will NOT appear in The Luminarium');
46
+ return null;
47
+ }
48
+ })();
49
+ return _sharedAccessorPromise;
50
+ }
51
+ /**
52
+ * Test-only: reset the cached promise so a subsequent call re-runs init.
53
+ * Production code MUST NOT call this — leaks the previous accessor's DB
54
+ * handle if the prior init succeeded.
55
+ */
56
+ export function _resetSharedMemoryAccessorForTest() {
57
+ _sharedAccessorPromise = null;
58
+ }
19
59
  /**
20
60
  * Create a MemoryAccessor backed by the sql.js/HNSW memory database.
21
61
  * Lazy-loads memory-initializer to avoid circular deps.
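Reviewer note: the promise-memoization pattern behind `getSharedMemoryAccessor()` can be shown in isolation. In this minimal sketch, `createAccessor` and `getSharedAccessor` are hypothetical stand-ins, and the counter exists only to show that two near-simultaneous callers trigger a single init.

```javascript
// Hypothetical expensive initializer; counts how often it actually runs.
let initCount = 0;
async function createAccessor() {
  initCount += 1;
  await new Promise((resolve) => setTimeout(resolve, 5));
  return { handle: initCount };
}

let sharedPromise = null;

// Cache the promise itself so concurrent first-callers share one init.
function getSharedAccessor() {
  if (sharedPromise) return sharedPromise;
  sharedPromise = (async () => {
    try {
      return await createAccessor();
    } catch (err) {
      // Degrade to null so callers can continue without the accessor.
      console.warn(`accessor unavailable: ${err?.message ?? err}`);
      return null;
    }
  })();
  return sharedPromise;
}

const a = getSharedAccessor();
const b = getSharedAccessor();
console.log(a === b, initCount); // true 1
```

One consequence of caching the promise rather than the value: a failed init resolves to `null` and that result is cached too, so this sketch never retries within the same process.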
@@ -4,11 +4,21 @@
4
4
  * processes write to the same target concurrently.
5
5
  *
6
6
  * Pattern: write to a process-unique temp path `<target>.tmp.<pid>.<rand>`,
7
- * then rename onto `target`.
8
- * - `fs.renameSync` is atomic on POSIX.
9
- * - On Windows, Node maps it to `MoveFileExW(..., MOVEFILE_REPLACE_EXISTING)`,
10
- * which replaces the destination near-atomically; concurrent readers
11
- * always observe either the old file or the new, never a truncated one.
7
+ * **fsync the temp file**, then rename onto `target`.
8
+ * - `writeFileSync` does NOT fsync — the OS keeps data in the write cache.
9
+ * On Windows that cache isn't always coherent with what other processes
10
+ * see when they open the freshly-renamed target. Issue #1015 surfaced
11
+ * this as a flaky `memory-retrieve` race in consumer-smoke: process A
12
+ * stores via the daemon → daemon flushes via this helper → daemon
13
+ * returns → process B opens the DB and sees stale content.
14
+ * - The fix: fsync the temp fd before rename. After fsync, the data is
15
+ * durably on disk; the rename then makes that durable data visible
16
+ * atomically. Subsequent readers see the new bytes regardless of cache
17
+ * state.
18
+ * - `fs.renameSync` is atomic on POSIX. On Windows, Node maps it to
19
+ * `MoveFileExW(..., MOVEFILE_REPLACE_EXISTING)`, which replaces the
20
+ * destination near-atomically — concurrent readers always observe either
21
+ * the old file or the new, never a truncated one.
12
22
  * - The unique temp path means concurrent writers can't clobber each other's
13
23
  * in-flight bytes (#635). Last-writer-wins semantics: each rename is fully
14
24
  * atomic, so the destination always reflects exactly one writer's data.
@@ -18,16 +28,28 @@
18
28
  * On any failure, the temp file is best-effort removed and the original
19
29
  * `target` stays intact. The underlying error is always re-thrown.
20
30
  *
31
+ * Windows-only post-rename verify (#1015): on NTFS with antivirus / Defender
32
+ * scanning the freshly-renamed file, a sub-process opening the same path
33
+ * within ~1s can briefly see the file as locked. After a successful rename
34
+ * we poll-open the target until it's readable (or a 250 ms deadline passes)
35
+ * so the next reader doesn't race the AV lock window. The rename itself
36
+ * already succeeded and the data is fsynced, so the verify is best-effort:
37
+ * a timeout returns silently rather than throwing.
38
+ *
21
39
  * `fs` is injectable so the interrupt-mid-write paths can be exercised in
22
40
  * unit tests without depending on ESM-unfriendly module spies.
23
41
  *
24
42
  * @module moflo/cli/shared/utils/atomic-file-write
25
43
  */
26
44
  import * as realFs from 'node:fs';
45
+ const IS_WIN32 = process.platform === 'win32';
46
+ const VERIFY_DEADLINE_MS = 250;
47
+ const VERIFY_STEP_MS = 10;
27
48
  export function atomicWriteFileSync(targetPath, data, fs = realFs) {
28
49
  const tmpPath = `${targetPath}.tmp.${process.pid}.${Math.random().toString(36).slice(2, 8)}`;
29
50
  try {
30
51
  fs.writeFileSync(tmpPath, data);
52
+ fsyncFile(tmpPath, fs);
31
53
  fs.renameSync(tmpPath, targetPath);
32
54
  }
33
55
  catch (err) {
@@ -39,5 +61,61 @@ export function atomicWriteFileSync(targetPath, data, fs = realFs) {
39
61
  }
40
62
  throw err;
41
63
  }
64
+ if (IS_WIN32)
65
+ verifyReadableAfterRename(targetPath, fs);
66
+ }
67
+ /**
68
+ * Open the freshly-written temp file, fsync, close. Ensures the data is
69
+ * durably on disk before rename makes it visible (#1015). Best-effort: an
70
+ * fsync error is swallowed because a real filesystem failure will surface
71
+ * on the rename anyway, and we don't want to mask the more useful error.
72
+ */
73
+ function fsyncFile(tmpPath, fs) {
74
+ const openSync = fs.openSync ?? realFs.openSync;
75
+ const closeSync = fs.closeSync ?? realFs.closeSync;
76
+ const fsyncSync = fs.fsyncSync ?? realFs.fsyncSync;
77
+ let fd = null;
78
+ try {
79
+ fd = openSync(tmpPath, 'r+');
80
+ fsyncSync(fd);
81
+ }
82
+ catch {
83
+ /* fsync best-effort — see fn doc */
84
+ }
85
+ finally {
86
+ if (fd !== null) {
87
+ try {
88
+ closeSync(fd);
89
+ }
90
+ catch { /* close best-effort */ }
91
+ }
92
+ }
93
+ }
94
+ /**
95
+ * Poll-open the target until a reader can succeed, or the deadline passes.
96
+ * Closes the AV-scan settle window on NTFS (#1015). No-op everywhere else.
97
+ *
98
+ * Yields the thread between probes via `Atomics.wait` so we don't pin a CPU
99
+ * during the very contention we're waiting out (`feedback_async_by_default`).
100
+ */
101
+ function verifyReadableAfterRename(targetPath, fs) {
102
+ const openSync = fs.openSync ?? realFs.openSync;
103
+ const closeSync = fs.closeSync ?? realFs.closeSync;
104
+ const deadline = Date.now() + VERIFY_DEADLINE_MS;
105
+ while (true) {
106
+ try {
107
+ closeSync(openSync(targetPath, 'r'));
108
+ return;
109
+ }
110
+ catch {
111
+ if (Date.now() >= deadline)
112
+ return;
113
+ sleepSyncMs(VERIFY_STEP_MS);
114
+ }
115
+ }
116
+ }
117
+ const SLEEP_BUF = new Int32Array(new SharedArrayBuffer(4));
118
+ function sleepSyncMs(ms) {
119
+ Atomics.wait(SLEEP_BUF, 0, 0, ms);
42
120
  }
43
121
  //# sourceMappingURL=atomic-file-write.js.map
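Reviewer note: because `fs` is injectable, the write, fsync, rename ordering can be checked with a recording fake instead of real disk I/O. The sketch below is an illustration under assumptions, not the shipped helper: `atomicWrite` and `makeFakeFs` are hypothetical names, the `unlinkSync` cleanup is assumed from the doc comment's "best-effort removed", and unlike the diff this version lets an fsync error propagate.

```javascript
// Write to a temp path, fsync it, then rename onto the target. `fs` is any
// object exposing the Node-style sync calls used below.
function atomicWrite(targetPath, data, fs) {
  const tmpPath = `${targetPath}.tmp.${process.pid}.${Math.random().toString(36).slice(2, 8)}`;
  try {
    fs.writeFileSync(tmpPath, data);
    const fd = fs.openSync(tmpPath, 'r+');
    try {
      fs.fsyncSync(fd); // flush the temp file before it becomes visible
    } finally {
      fs.closeSync(fd);
    }
    fs.renameSync(tmpPath, targetPath); // publish the flushed bytes in one step
  } catch (err) {
    try { fs.unlinkSync(tmpPath); } catch { /* temp cleanup is best-effort */ }
    throw err;
  }
}

// Hypothetical in-memory fake that records the order of calls.
function makeFakeFs() {
  const files = new Map();
  const calls = [];
  return {
    files,
    calls,
    writeFileSync(p, d) { calls.push('write'); files.set(p, d); },
    openSync(p) { calls.push('open'); if (!files.has(p)) throw new Error('ENOENT'); return 3; },
    fsyncSync() { calls.push('fsync'); },
    closeSync() { calls.push('close'); },
    renameSync(from, to) { calls.push('rename'); files.set(to, files.get(from)); files.delete(from); },
    unlinkSync(p) { calls.push('unlink'); files.delete(p); },
  };
}

const fake = makeFakeFs();
atomicWrite('/data/memory.db', 'v2', fake);
console.log(fake.calls.join(' > ')); // write > open > fsync > close > rename
```

Passing `node:fs` in place of the fake would exercise the same sequence against a real filesystem.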
@@ -5,8 +5,10 @@
5
5
  * lifecycle. This connector adds server-pool management, lazy spawning, tool
6
6
  * discovery caching, and the SpellConnector interface adapter.
7
7
  *
8
- * The SDK is an optionalDependency and is loaded lazily on first use so
9
- * consumers that don't use the MCP connector don't need it installed.
8
+ * The SDK is a hard `dependency` (MCP is a headline integration), but it is
9
+ * loaded lazily on first use so spells that don't use the MCP connector don't
10
+ * pay its startup cost. The lazy-load also yields an actionable install hint
11
+ * if a corrupted install lost the package.
10
12
  */
11
13
  import { loadOptional } from './shared/optional-import.js';
12
14
  const MCP_INSTALL_MSG = "MCP connector requires '@modelcontextprotocol/sdk' to be installed. Run: npm i @modelcontextprotocol/sdk";
@@ -1,11 +1,18 @@
1
1
  /**
2
2
  * Lazy loader for optional SDK dependencies.
3
3
  *
4
- * Connectors wrapping heavy SDKs (imapflow, mailparser, @modelcontextprotocol/sdk)
5
- * declare them as optionalDependencies so consumers that don't use the connector
6
- * don't need to install them. This helper centralizes the lazy-import +
7
- * MODULE_NOT_FOUND translation + module-scope memoization that each connector
8
- * would otherwise re-implement.
4
+ * Connectors wrapping truly optional SDKs (imapflow, mailparser) declare them
5
+ * as `peerDependenciesMeta.optional` so consumers that don't use the connector
6
+ * don't need to install them. The `@modelcontextprotocol/sdk` is a hard
7
+ * `dependency` because the MCP connector is a headline feature, but it is still
8
+ * routed through this helper so a corrupted install still yields an actionable
9
+ * message instead of a raw MODULE_NOT_FOUND.
10
+ *
11
+ * Every specifier passed to `loadOptional()` MUST be declared in package.json
12
+ * (dependencies, optionalDependencies, or peerDependenciesMeta). The drift
13
+ * guard at `src/cli/__tests__/spells/connectors/optional-import-declared.test.ts`
14
+ * enforces this — it walks shipped connectors, extracts every specifier, and
15
+ * fails the build if one is undeclared.
9
16
  */
10
17
  const moduleCache = new Map();
11
18
  function isModuleNotFound(err) {
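Reviewer note: the diff shows only the header of this helper, so the following is a guess at the general shape of a lazy loader with memoization and a translated "module not found" error. `loadOptionalSketch` and its injectable `importer` parameter are hypothetical, and the bodies of `moduleCache` handling and `isModuleNotFound` here are assumptions rather than the package's implementation.

```javascript
const moduleCache = new Map();

// Node reports ERR_MODULE_NOT_FOUND for import() and MODULE_NOT_FOUND for require().
function isModuleNotFound(err) {
  return err?.code === 'ERR_MODULE_NOT_FOUND' || err?.code === 'MODULE_NOT_FOUND';
}

// `importer` is injectable so the sketch can be exercised without installing anything.
async function loadOptionalSketch(specifier, installHint, importer = (s) => import(s)) {
  if (moduleCache.has(specifier)) return moduleCache.get(specifier);
  try {
    const mod = await importer(specifier);
    moduleCache.set(specifier, mod); // memoize successful loads only
    return mod;
  } catch (err) {
    // Translate a missing package into an actionable message; rethrow anything else.
    if (isModuleNotFound(err)) throw new Error(installHint, { cause: err });
    throw err;
  }
}

// Simulate a missing package with a fake importer.
const missing = () =>
  Promise.reject(Object.assign(new Error('Cannot find package'), { code: 'ERR_MODULE_NOT_FOUND' }));

loadOptionalSketch('imapflow', 'Run: npm i imapflow', missing)
  .catch((err) => console.log(err.message)); // Run: npm i imapflow
```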
@@ -1,24 +1,31 @@
1
1
  /**
2
2
  * Credential Validation
3
3
  *
4
- * Lightweight, no-config shape checks applied to values pulled from the
5
- * encrypted credential store before they are promoted to `process.env`.
4
+ * Shape checks applied to values pulled from the encrypted credential store
5
+ * before they are promoted to `process.env`. Two layers:
6
6
  *
7
- * Two heuristics, both conservative: only invalidate when there is
8
- * positive evidence the stored value is bad. Anything we can't classify
9
- * passes through unchanged.
7
+ * 1. **Author-declared format** (preferred): the YAML prereq sets
8
+ * `format: jwt`, and the validator enforces JWT shape + expiry. Any
9
+ * non-JWT value (e.g. a value with no dots) is rejected outright,
10
+ * catching the failure mode where a stored value isn't even a JWT
11
+ * and the spell would otherwise fail mid-cast with a 401.
10
12
  *
11
- * - JWT-shaped values (3 base64url segments) get their `exp` claim
12
- * parsed and compared to "now". An expired JWT is reported as such.
13
- * - Env keys ending in `_URL` must parse via the WHATWG `URL`
14
- * constructor and have a non-empty host.
13
+ * 2. **Conservative heuristics** (fallback when no format is declared):
14
+ * - JWT-shaped values (3 base64url segments) get their `exp` claim
15
+ * parsed and rejected when expired.
16
+ * - Env keys ending in `_URL` must parse via the WHATWG `URL`
17
+ * constructor and have a non-empty host.
18
+ * Anything else passes through.
15
19
  *
16
- * Story #1007: avoid silently reusing stale stored credentials (e.g.
17
- * Microsoft Graph access tokens, which expire in ~1h) so the resolver
18
- * can fall through to the prompt path and the user understands why.
20
+ * Story #1007: catch expired JWTs that survived past their TTL.
21
+ * Story #1009: extend to catch values that aren't even JWT-shaped when
22
+ * the prereq has declared `format: jwt`.
19
23
  */
20
24
  const VALID_JWT_SEGMENT = /^[A-Za-z0-9_-]+$/;
21
- export function validateStoredCredential(envKey, value) {
25
+ export function validateStoredCredential(envKey, value, format) {
26
+ if (format === 'jwt') {
27
+ return validateJwtFormat(value);
28
+ }
22
29
  if (envKey.endsWith('_URL')) {
23
30
  return validateUrlValue(value);
24
31
  }
@@ -27,6 +34,15 @@ export function validateStoredCredential(envKey, value) {
27
34
  }
28
35
  return { valid: true };
29
36
  }
37
+ function validateJwtFormat(value) {
38
+ if (!looksLikeJwt(value)) {
39
+ return {
40
+ valid: false,
41
+ reason: 'stored value is not a JWT (expected three base64url segments separated by ".")',
42
+ };
43
+ }
44
+ return validateJwtExpiry(value);
45
+ }
30
46
  function validateUrlValue(value) {
31
47
  try {
32
48
  const parsed = new URL(value);
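Reviewer note: `looksLikeJwt` and `validateJwtExpiry` are not visible in this diff, so the sketch below is an assumption about what "JWT shape + expiry" could look like, not the shipped logic. `validateJwt` and `makeToken` are hypothetical names, and treating an unparseable payload as invalid is a choice made for this example only.

```javascript
const VALID_JWT_SEGMENT = /^[A-Za-z0-9_-]+$/;

// Three non-empty base64url segments separated by ".".
function looksLikeJwt(value) {
  const parts = value.split('.');
  return parts.length === 3 && parts.every((p) => VALID_JWT_SEGMENT.test(p));
}

function validateJwt(value, nowMs = Date.now()) {
  if (!looksLikeJwt(value)) {
    return { valid: false, reason: 'stored value is not a JWT' };
  }
  let payload;
  try {
    payload = JSON.parse(Buffer.from(value.split('.')[1], 'base64url').toString('utf8'));
  } catch {
    // Assumption: under a declared format, an unreadable payload is rejected.
    return { valid: false, reason: 'JWT payload is not valid JSON' };
  }
  // `exp` is seconds since the epoch (RFC 7519), so scale to milliseconds.
  if (typeof payload?.exp === 'number' && payload.exp * 1000 <= nowMs) {
    return { valid: false, reason: 'JWT is expired' };
  }
  return { valid: true };
}

// Build unsigned demo tokens; "sig" is a placeholder third segment.
const seg = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url');
const makeToken = (exp) => `${seg({ alg: 'none' })}.${seg({ exp })}.sig`;
const now = 1_700_000_000_000;

console.log(validateJwt('not-a-jwt', now).valid);              // false
console.log(validateJwt(makeToken(1_600_000_000), now).valid); // false
console.log(validateJwt(makeToken(1_800_000_000), now).valid); // true
```

The sketch checks shape and expiry only; it does not verify the signature.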
@@ -72,6 +72,7 @@ export function compilePrerequisiteSpec(spec) {
72
72
  description: spec.description,
73
73
  promptOnMissing,
74
74
  envKey,
75
+ format: spec.format,
75
76
  };
76
77
  }
77
78
  function defaultHintForDetect(spec) {
@@ -205,7 +206,7 @@ export async function resolveUnmetPrerequisites(prerequisites, options = {}) {
205
206
  const stored = await credentials.get(prereq.envKey);
206
207
  if (typeof stored !== 'string' || stored.length === 0)
207
208
  return;
208
- const validation = validateStoredCredential(prereq.envKey, stored);
209
+ const validation = validateStoredCredential(prereq.envKey, stored, prereq.format);
209
210
  if (!validation.valid) {
210
211
  rejectedFromStore.push({ envKey: prereq.envKey, reason: validation.reason });
211
212
  return;
@@ -6,6 +6,7 @@
6
6
  * deliberately small so step validation can delegate here.
7
7
  */
8
8
  const VALID_DETECT_TYPES = ['env', 'command', 'file'];
9
+ const VALID_FORMATS = ['jwt'];
9
10
  export function validatePrerequisites(prereqs, errors, path) {
10
11
  if (!Array.isArray(prereqs)) {
11
12
  errors.push({ path, message: 'prerequisites must be an array' });
@@ -39,6 +40,13 @@ export function validatePrerequisites(prereqs, errors, path) {
39
40
  if (p.promptOnMissing !== undefined && typeof p.promptOnMissing !== 'boolean') {
40
41
  errors.push({ path: `${pPath}.promptOnMissing`, message: 'promptOnMissing must be a boolean' });
41
42
  }
43
+ if (p.format !== undefined
44
+ && !VALID_FORMATS.includes(p.format)) {
45
+ errors.push({
46
+ path: `${pPath}.format`,
47
+ message: `format must be one of: ${VALID_FORMATS.join(', ')}`,
48
+ });
49
+ }
42
50
  const detect = p.detect;
43
51
  if (!detect || typeof detect !== 'object') {
44
52
  errors.push({ path: `${pPath}.detect`, message: 'detect is required and must be an object' });
@@ -66,6 +74,14 @@ export function validatePrerequisites(prereqs, errors, path) {
66
74
  errors.push({ path: `${pPath}.detect.path`, message: 'detect.path is required for file detector' });
67
75
  }
68
76
  }
77
+ // `format` only applies to stored env values — silently ignoring it on
78
+ // command/file detectors would mask author mistakes.
79
+ if (p.format !== undefined && detect.type !== 'env') {
80
+ errors.push({
81
+ path: `${pPath}.format`,
82
+ message: 'format is only valid on env-type prerequisites',
83
+ });
84
+ }
69
85
  });
70
86
  }
71
87
  //# sourceMappingURL=prerequisites.js.map
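Reviewer note: the two `format` checks added to `validatePrerequisites` can be exercised on their own. In this small sketch, `checkFormat` is a hypothetical helper that mirrors the two rules from the diff, and the prerequisite objects are made-up examples that carry only the fields the checks read.

```javascript
const VALID_FORMATS = ['jwt'];

// Returns the format-related errors for one prerequisite.
function checkFormat(prereq, path) {
  const errors = [];
  if (prereq.format === undefined) return errors;
  if (!VALID_FORMATS.includes(prereq.format)) {
    errors.push({
      path: `${path}.format`,
      message: `format must be one of: ${VALID_FORMATS.join(', ')}`,
    });
  }
  // `format` describes a stored env value, so it is rejected on other detectors.
  if (prereq.detect?.type !== 'env') {
    errors.push({
      path: `${path}.format`,
      message: 'format is only valid on env-type prerequisites',
    });
  }
  return errors;
}

const ok = checkFormat({ format: 'jwt', detect: { type: 'env' } }, 'prerequisites[0]');
const badValue = checkFormat({ format: 'uuid', detect: { type: 'env' } }, 'prerequisites[1]');
const badDetector = checkFormat({ format: 'jwt', detect: { type: 'file' } }, 'prerequisites[2]');

console.log(ok.length, badValue.length, badDetector.length); // 0 1 1
```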
@@ -2,5 +2,5 @@
2
2
  * Auto-generated by build. Do not edit manually.
3
3
  * Source of truth: root package.json → scripts/sync-version.mjs
4
4
  */
5
- export const VERSION = '4.9.30';
5
+ export const VERSION = '4.9.32';
6
6
  //# sourceMappingURL=version.js.map
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "moflo",
3
- "version": "4.9.30",
3
+ "version": "4.9.32",
4
4
  "description": "MoFlo — AI agent orchestration for Claude Code. A standalone, opinionated toolkit with semantic memory, learned routing, gates, spells, and the /flo issue-execution skill.",
5
5
  "main": "dist/src/cli/index.js",
6
6
  "type": "module",
@@ -64,6 +64,7 @@
64
64
  },
65
65
  "dependencies": {
66
66
  "@anush008/tokenizers": "^0.6.0",
67
+ "@modelcontextprotocol/sdk": "^1.0.0",
67
68
  "js-yaml": "^4.1.1",
68
69
  "lru-cache": "^11.3.5",
69
70
  "onnxruntime-node": "^1.24.3",
@@ -72,6 +73,18 @@
72
73
  "tar": "^7.5.11",
73
74
  "valibot": "^1.3.1"
74
75
  },
76
+ "peerDependencies": {
77
+ "imapflow": "^1.0.0",
78
+ "mailparser": "^3.0.0"
79
+ },
80
+ "peerDependenciesMeta": {
81
+ "imapflow": {
82
+ "optional": true
83
+ },
84
+ "mailparser": {
85
+ "optional": true
86
+ }
87
+ },
75
88
  "overrides": {
76
89
  "hono": ">=4.11.4",
77
90
  "picomatch": ">=2.3.2",
@@ -84,7 +97,7 @@
84
97
  "@typescript-eslint/eslint-plugin": "^7.18.0",
85
98
  "@typescript-eslint/parser": "^7.18.0",
86
99
  "eslint": "^8.0.0",
87
- "moflo": "^4.9.29",
100
+ "moflo": "^4.9.31",
88
101
  "tsx": "^4.21.0",
89
102
  "typescript": "^5.9.3",
90
103
  "vitest": "^4.0.0"