moflo 4.9.30 → 4.9.32
- package/.claude/guidance/shipped/moflo-root-cause-discipline.md +167 -0
- package/dist/src/cli/embeddings/fastembed-inline/index.js +14 -0
- package/dist/src/cli/embeddings/fastembed-inline/model-loader.js +140 -25
- package/dist/src/cli/epic/runner-adapter.js +13 -8
- package/dist/src/cli/mcp-tools/hive-mind-tools.js +29 -15
- package/dist/src/cli/mcp-tools/spell-tools.js +8 -0
- package/dist/src/cli/memory/bridge-entries.js +66 -0
- package/dist/src/cli/memory/memory-initializer.js +36 -0
- package/dist/src/cli/services/daemon-dashboard.js +40 -0
- package/dist/src/cli/shared/utils/atomic-file-write.js +83 -5
- package/dist/src/cli/spells/connectors/mcp-client.js +4 -2
- package/dist/src/cli/spells/connectors/shared/optional-import.js +12 -5
- package/dist/src/cli/spells/core/credential-validation.js +29 -13
- package/dist/src/cli/spells/core/prerequisite-checker.js +2 -1
- package/dist/src/cli/spells/schema/validators/prerequisites.js +16 -0
- package/dist/src/cli/version.js +1 -1
- package/package.json +15 -2
|
@@ -0,0 +1,167 @@
|
|
|
1
|
+
# Root-Cause Discipline — Measure Twice, Cut Once
|
|
2
|
+
|
|
3
|
+
**Purpose:** The MoFlo standard for fixing bugs. We do not "shoot first and ask questions later" — we measure twice and cut once. Apply this whenever you are about to write a fix, especially when a previous fix on the same surface didn't fully work.
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## The Headline Rule
|
|
8
|
+
|
|
9
|
+
**Measure twice, cut once. Step back, understand the problem holistically, then make the simplest fix that eliminates the cause.** Do not pile patch onto patch onto patch.
|
|
10
|
+
|
|
11
|
+
This is the single most important engineering posture in this project. Layered patches have produced the worst regressions, the longest debugging sessions, and the most expensive token bills. When you find yourself reaching for "another layer" — stop.
|
|
12
|
+
|
|
13
|
+
---
|
|
14
|
+
|
|
15
|
+
## Before You Write Fix N+1
|
|
16
|
+
|
|
17
|
+
Before adding a new fix on top of an existing one, you MUST answer all four:
|
|
18
|
+
|
|
19
|
+
| Question | If you can't answer | Action |
|
|
20
|
+
|----------|---------------------|--------|
|
|
21
|
+
| What exactly is the failure mode at the lowest level? (Not the symptom — the actual mechanism.) | You don't understand the bug yet | Investigate further; do not fix |
|
|
22
|
+
| Why didn't fix N work? Is it wrong, or just incomplete? | You're guessing at the gap | Read fix N's code + history; reproduce the failure |
|
|
23
|
+
| Would removing fix N + replacing with one cleaner fix simplify the surface? | You haven't considered consolidation | Try the consolidation first |
|
|
24
|
+
| What's the SIMPLEST change that makes the bug structurally impossible? | You're patching symptoms, not causes | Step back further |
|
|
25
|
+
|
|
26
|
+
If three answers are vague, you're in patch-on-patch territory. Stop and re-think.
|
|
27
|
+
|
|
28
|
+
---
|
|
29
|
+
|
|
30
|
+
## Patch-on-Patch Smoke Alarms
|
|
31
|
+
|
|
32
|
+
Stop and reconsider when you see yourself doing any of these:
|
|
33
|
+
|
|
34
|
+
| Smoke alarm | What it usually means | The right move |
|
|
35
|
+
|-------------|----------------------|----------------|
|
|
36
|
+
| Adding a "belt-and-suspenders" cleanup | The first cleanup is racing something — find what | Eliminate the race, not double-cleanup |
|
|
37
|
+
| Adding `try/catch` around code that already has `try/catch` | Outer catch is masking inner failure | Surface the inner error, don't double-wrap |
|
|
38
|
+
| Adding a `setTimeout` retry loop on top of an existing retry | Retry won't fix a logic bug | Fix the logic |
|
|
39
|
+
| Bumping a timeout because tests fail intermittently | The op is slower than expected — find why | Fix the slowness or remove the op |
|
|
40
|
+
| Adding a flag/env-var to "skip the broken path" | You're hiding the bug, not fixing it | Fix the path or delete it |
|
|
41
|
+
| Adding a workaround "until we can fix this properly" | You won't come back; "later" never happens | Fix it now or file with full context |
|
|
42
|
+
| Touching three files to fix one bug | Bug is misdiagnosed; one file usually suffices | Re-diagnose |
|
|
43
|
+
|
|
44
|
+
When **two or more** of these apply at once, the fix is almost certainly wrong. Throw it away and re-investigate.
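
As an illustration of the "surface the inner error, don't double-wrap" row, here is a minimal sketch. The function names `loadSettings` and `readPort` are hypothetical and are not part of moflo. It assumes a Node.js runtime that supports the `cause` option on `Error`.

```javascript
// Hypothetical example: one catch at the layer that can add context,
// and no second catch further out that would mask it.
function loadSettings(rawJson) {
  try {
    return JSON.parse(rawJson);
  } catch (err) {
    // Keep the original error reachable via `cause` instead of swallowing it.
    throw new Error('settings are not valid JSON', { cause: err });
  }
}

function readPort(rawJson) {
  // Deliberately no try/catch here: the inner error already carries context,
  // so a second wrapper would only hide where the failure came from.
  return loadSettings(rawJson).port;
}
```

A caller that needs the low-level detail can inspect `err.cause`. Nothing is logged twice and nothing is silently dropped.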
|
|
45
|
+
|
|
46
|
+
---
|
|
47
|
+
|
|
48
|
+
## The Holistic Step-Back
|
|
49
|
+
|
|
50
|
+
When fix N didn't work, do these in order — not in parallel, not skipping steps:
|
|
51
|
+
|
|
52
|
+
1. **Read every prior fix on this surface in full.** Not the commit message — the code. Note what each one was trying to prevent and what it actually does.
|
|
53
|
+
2. **Reproduce the failure deterministically** before touching code. If you can't reproduce it, you don't understand it.
|
|
54
|
+
3. **Trace the data flow.** Where does the bad state originate? What writes it? What reads it? What invariant got violated?
|
|
55
|
+
4. **Question the test, not just the code.** What invariant does the failing test actually encode? Does that invariant match the runtime contract, or is the test stricter? A test stricter than the contract will produce flakes that look like bugs but aren't. (See #1017 case study.)
|
|
56
|
+
5. **Identify the structural cause** — the place where the bug becomes possible, not the place where it becomes visible.
|
|
57
|
+
6. **Now consider fixes.** The cheapest fix at the structural cause beats the cleverest fix at the symptom every time. If the cause is "test asserts X, runtime contract is Y, X is stricter," the fix is in the test.
|
|
58
|
+
|
|
59
|
+
If step 6 yields a fix smaller and simpler than the existing patches, **delete the existing patches** as part of the same change. Do not stack.
|
|
60
|
+
|
|
61
|
+
---
|
|
62
|
+
|
|
63
|
+
## Code Serves the Specification, Not the Test
|
|
64
|
+
|
|
65
|
+
**Periodically ask: "Am I solving an actual problem, or am I flailing to satisfy a flawed test?"** When several attempted fixes haven't moved the needle, the test framework is a likely suspect — but the response is never to degrade production code to make the test pass.
|
|
66
|
+
|
|
67
|
+
**Never introduce substandard code to satisfy shortcomings of the testing infrastructure.** Production code expresses the runtime contract. Tests verify the contract. When they disagree:
|
|
68
|
+
|
|
69
|
+
| Disagreement | Correct response | Wrong response |
|
|
70
|
+
|--------------|------------------|----------------|
|
|
71
|
+
| Test asserts behavior the runtime never promised | Fix the test to match the contract | Add code to satisfy the test's stricter assertion |
|
|
72
|
+
| Test uses an unrealistic environment (mocks the wrong layer, races a SIGKILL'd daemon, single-session asserts on a multi-session contract) | Fix the test environment | Add retry / sleep / workaround in production code |
|
|
73
|
+
| Test framework can't observe a legitimate runtime path | Add a test hook (`_resetForTest`, `getStateForTest`) that doesn't change runtime behavior | Restructure runtime to make the test framework's observation easier |
|
|
74
|
+
| Test is flaky on one platform but the runtime works | Identify why the test, not the runtime, is sensitive | Bump timeouts / retries / sleeps in production paths |
|
|
75
|
+
|
|
76
|
+
**Code purity check before any "make the test pass" change:** would you ship this change if the test didn't exist? If no, you're degrading the code to satisfy the test. Stop. Fix the test.
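
The test-hook row in the table above can be sketched as follows. This is an assumption-laden illustration, not moflo code: `getAccessor`, `cachedAccessor`, and `initCount` are hypothetical, and only the hook names `_resetForTest` and `getStateForTest` come from the table.

```javascript
// Hypothetical module-level cache, standing in for any lazily built singleton.
let cachedAccessor = null;
let initCount = 0;

// Runtime path: builds the accessor once and reuses it afterwards.
function getAccessor() {
  if (cachedAccessor === null) {
    initCount += 1;
    cachedAccessor = { store: new Map() };
  }
  return cachedAccessor;
}

// Test hook: read-only view of internal state. The runtime path never calls it.
function getStateForTest() {
  return { initialized: cachedAccessor !== null, initCount };
}

// Test hook: clears the cache between test cases. The runtime path never calls it.
function _resetForTest() {
  cachedAccessor = null;
  initCount = 0;
}
```

The hooks observe and reset state without adding branches to `getAccessor`, so the runtime behaves the same whether or not tests exist.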
|
|
77
|
+
|
|
78
|
+
**Signals you're flailing for the test, not solving the bug:**
|
|
79
|
+
|
|
80
|
+
| Signal | What it actually says |
|
|
81
|
+
|--------|----------------------|
|
|
82
|
+
| You've tried 3+ fixes and nothing has moved the needle | The diagnosis is wrong; investigate before patching again |
|
|
83
|
+
| Each fix gets narrower / more defensive without removing the prior layer | You're piling on, not solving |
|
|
84
|
+
| The runtime works fine in real-world usage but the test fails | The test's spec doesn't match the contract — that's the bug |
|
|
85
|
+
| You'd need to add a sleep, retry, lock, or platform-special-case to make the test happy | Production code is paying for a test-environment limitation |
|
|
86
|
+
| Removing the test makes the bug "go away" | The test was right but the fix is wrong, OR the test was the bug — diagnose which |
|
|
87
|
+
|
|
88
|
+
The user said it directly: **"we never want to introduce substandard code to satisfy shortcomings of our testing infrastructure."** Tests serve the code; the code does not serve the tests.
|
|
89
|
+
|
|
90
|
+
When you find that the test is the actual problem: change the test, document why in the commit message, and (if the change weakens an invariant) add a separate test that captures the invariant the original was *trying* to encode without the false strictness.
|
|
91
|
+
|
|
92
|
+
---
|
|
93
|
+
|
|
94
|
+
## Concrete Example: #1017 Hive-Mind Shutdown
|
|
95
|
+
|
|
96
|
+
This is the canonical case study for this guidance — and it has a second-order lesson that makes it even more useful.
|
|
97
|
+
|
|
98
|
+
| Attempt | Approach | Outcome |
|
|
99
|
+
|---------|----------|---------|
|
|
100
|
+
| #1017 first try | Loop list+delete in `clearNamespace` | Race window remained — broadcasts landed mid-loop |
|
|
101
|
+
| #1024 layer 1 | Detach adapter BEFORE `clearNamespace` (after `terminateAgent`) | Race narrowed but not eliminated |
|
|
102
|
+
| #1024 layer 2 | Add `purgeHiveNamespacesDirect` raw sql.js DELETE | Looked bulletproof; actually clobber-prone vs daemon's stale snapshot (#981 single-writer) |
|
|
103
|
+
| #1024 declared green | All 6 CI checks pass once | Same flake reappeared on next PR's CI |
|
|
104
|
+
| #1027 attempt 4 | Move `adapter.detach()` BEFORE `terminateAgent`; delete `purgeHiveNamespacesDirect` | Code simplified by -73 LOC. **Same flake on macos-latest CI.** |
|
|
105
|
+
| #1027 — actual fix | Run launcher a SECOND time after doctor in the populated harness | Test passes. Race is intrinsic to multi-process sql.js + daemon kill timing; the harness assertion was over-strict. |
|
|
106
|
+
|
|
107
|
+
The first three attempts kept asking "how do we delete this row harder?" The fourth attempt was a structural simplification that was correct on its own merits (-73 LOC, removed dead code, simpler shutdown ordering) but **did not fix the flake**.
|
|
108
|
+
|
|
109
|
+
The actual root cause was outside the surface every patch had touched: the populated harness was asserting "ephemerals purged after one launcher run" when the **runtime contract is "ephemerals purged at next session-start launcher"**. The doctor's hive-mind probe writes a row intentionally; that row is supposed to live until the NEXT session purges it. The test was conflating "purge mechanism works" with "purge happens within one session" — those are different invariants and only the first is the real product behavior.
|
|
110
|
+
|
|
111
|
+
**Two lessons stack here:**
|
|
112
|
+
|
|
113
|
+
1. **Don't pile layers** (the original lesson): four shutdown patches, each narrower than the last, none structurally sufficient.
|
|
114
|
+
2. **Question the test, not just the code** (the second-order lesson): if you've been fighting a race for four PRs and the simplest in-code fix doesn't move the needle, the spec encoded in the test may be wrong. A test that is stricter than the runtime contract WILL produce flakes that look like product bugs but aren't.
|
|
115
|
+
|
|
116
|
+
Together: when a fix isn't working, ask both "what writes the bad state?" AND "is this state actually bad in the runtime contract, or only in the test's expectation?"
|
|
117
|
+
|
|
118
|
+
---
|
|
119
|
+
|
|
120
|
+
## When You Genuinely Need a Belt-and-Suspenders
|
|
121
|
+
|
|
122
|
+
Belts-and-suspenders are not always wrong. They are right when:
|
|
123
|
+
|
|
124
|
+
| Condition | Example |
|
|
125
|
+
|-----------|---------|
|
|
126
|
+
| The two layers protect against **different** failure modes | atomic-write tmp+fsync+rename: tmp protects partial writes; fsync protects OS cache; rename protects readers — three concerns, three mechanisms |
|
|
127
|
+
| The first layer's failure is **silent**, the second surfaces it | A retry that logs the first failure before re-attempting |
|
|
128
|
+
| Removing either layer has a **stated, documented reason** for keeping the other | A fallback path with a comment explaining when the primary doesn't reach |
|
|
129
|
+
|
|
130
|
+
They are wrong when both layers protect against the **same** failure mode and you're hoping at least one wins. That's hope, not engineering.
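
The tmp + fsync + rename example from the table can be sketched with Node's built-in `fs` module. This is a minimal sketch under stated assumptions, not the contents of moflo's `atomic-file-write.js`: `atomicWriteSync` and the demo paths are hypothetical, and `require` is used only so the snippet runs standalone.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: each step guards a different failure mode.
function atomicWriteSync(destPath, data) {
  // 1. Temp file: a crash mid-write never leaves a partial file at destPath.
  const tmpPath = `${destPath}.${process.pid}.tmp`;
  const fd = fs.openSync(tmpPath, 'w');
  try {
    fs.writeFileSync(fd, data);
    // 2. fsync: ask the OS to flush its cache to the storage device.
    fs.fsyncSync(fd);
  } catch (err) {
    fs.closeSync(fd);
    fs.rmSync(tmpPath, { force: true });
    throw err;
  }
  fs.closeSync(fd);
  // 3. Rename: readers see either the old file or the complete new one.
  fs.renameSync(tmpPath, destPath);
}

// Usage in a throwaway directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'atomic-demo-'));
const demoFile = path.join(demoDir, 'state.json');
atomicWriteSync(demoFile, JSON.stringify({ ok: true }));
```

Removing any one of the three steps reopens a distinct failure mode, which is what separates this from two layers guarding the same one.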
|
|
131
|
+
|
|
132
|
+
---
|
|
133
|
+
|
|
134
|
+
## What This Means for PR Reviews
|
|
135
|
+
|
|
136
|
+
Reviewers should reject — not just question — PRs that show patch-on-patch signatures:
|
|
137
|
+
|
|
138
|
+
| Signal | Reviewer action |
|
|
139
|
+
|--------|----------------|
|
|
140
|
+
| Same file/function touched in 3+ recent commits, same bug | Ask: "is the prior fix wrong? remove it" |
|
|
141
|
+
| New fix adds a layer without removing one | Ask: "what was wrong with the prior layer? why does it stay?" |
|
|
142
|
+
| Comment in new code says "for safety" or "just in case" | Ask: "what specific failure is this preventing? cite the line that produces it" |
|
|
143
|
+
| The PR description says "this should fix the flake" without a deterministic repro | Ask: "what was the actual root cause? the writeup doesn't name it" |
|
|
144
|
+
|
|
145
|
+
These questions are not pedantic. They are the difference between fixing a bug and growing the surface area of bugs.
|
|
146
|
+
|
|
147
|
+
---
|
|
148
|
+
|
|
149
|
+
## How to Apply When You Are Stuck
|
|
150
|
+
|
|
151
|
+
If you genuinely cannot find the root cause after stepping back:
|
|
152
|
+
|
|
153
|
+
1. **Stop fixing. Start measuring.** Add logging at every state transition. Reproduce. Read the log.
|
|
154
|
+
2. **Ask the user before patching.** A two-line confirmation question costs less than a wrong fix.
|
|
155
|
+
3. **File the issue with what you DO know.** Partial diagnosis with logs is more useful than a guessed fix.
|
|
156
|
+
4. **Never ship "I think this might work."** That phrasing is a self-warning that the diagnosis isn't done.
|
|
157
|
+
|
|
158
|
+
It is always cheaper to admit uncertainty than to ship a layered patch that creates two new bugs.
|
|
159
|
+
|
|
160
|
+
---
|
|
161
|
+
|
|
162
|
+
## See Also
|
|
163
|
+
|
|
164
|
+
- `.claude/guidance/moflo-error-handling.md` — Silent failures are the prerequisite condition for most patch-on-patch sagas; fix those first
|
|
165
|
+
- `.claude/guidance/moflo-source-hygiene.md` — When you decide to delete redundant code, the canonical-location rules tell you what's safe to remove
|
|
166
|
+
- `feedback_no_layered_workarounds.md` (auto-memory) — The personal-feedback version of this rule, recorded from prior incidents
|
|
167
|
+
- `feedback_ci_flake_means_not_done.md` (auto-memory) — A flake that "passed on rerun" is not fixed; root-cause it under this discipline
|
|
@@ -132,6 +132,20 @@ export class FlagEmbedding {
|
|
|
132
132
|
const session = await InferenceSession.create(modelPath, {
|
|
133
133
|
executionProviders: ['cpu'],
|
|
134
134
|
graphOptimizationLevel: 'all',
|
|
135
|
+
// Suppress ORT's WARNING-level chatter on session bring-up. ORT 1.26.0
|
|
136
|
+
// emits a `[W:onnxruntime ... GetPciBusId] Skipping pci_bus_id` line on
|
|
137
|
+
// Linux Azure VMs whose `/sys/devices/...` filenames don't match the
|
|
138
|
+
// `[0-9a-f]+:[0-9a-f]+:[0-9a-f]+.[0-9a-f]+` PCI pattern; the warning
|
|
139
|
+
// is harmless (we run on the CPU EP only) but leaks to stderr and
|
|
140
|
+
// confuses users into thinking moflo is broken. 0=verbose, 1=info,
|
|
141
|
+
// 2=warning (default), 3=error, 4=fatal — error is the right level
|
|
142
|
+
// because genuine session bring-up failures still surface.
|
|
143
|
+
//
|
|
144
|
+
// Re-audit when bumping fastembed or onnxruntime-node: ORT
|
|
145
|
+
// occasionally promotes deprecation / model-compatibility notices to
|
|
146
|
+
// WARNING that would now be hidden. If a model upgrade ever lands
|
|
147
|
+
// alongside this suppression, drop to 2 once to scan the output.
|
|
148
|
+
logSeverityLevel: 3,
|
|
135
149
|
});
|
|
136
150
|
return new FlagEmbedding(tokenizer, session);
|
|
137
151
|
}
|
|
@@ -8,17 +8,30 @@
|
|
|
8
8
|
* For `fast-all-MiniLM-L6-v2`, the URL slug is `sentence-transformers-all-MiniLM-L6-v2`
|
|
9
9
|
* but the on-disk directory keeps the `fast-` prefix — verbatim from upstream.
|
|
10
10
|
*
|
|
11
|
-
* Concurrency:
|
|
12
|
-
*
|
|
13
|
-
*
|
|
11
|
+
* Concurrency: a per-model file lock (`<cacheDir>/.<model>.download.lock`,
|
|
12
|
+
* created with `wx`) serializes the download/extract for any number of
|
|
13
|
+
* parallel processes — only one process performs the work, the rest poll for
|
|
14
|
+
* the completion sentinel. This was issue #1021's secondary failure mode:
|
|
15
|
+
* the smoke harness spawns ~12 parallel doctor + memory probes on a cold
|
|
16
|
+
* cache, and Windows file locking exposed the race when the in-tree
|
|
17
|
+
* "synchronization point" was just a shared directory write.
|
|
14
18
|
*/
|
|
15
19
|
import { createWriteStream, existsSync, mkdirSync, renameSync, rmSync, writeFileSync, } from 'node:fs';
|
|
16
20
|
import { homedir } from 'node:os';
|
|
17
21
|
import { dirname, join } from 'node:path';
|
|
18
22
|
import { pipeline } from 'node:stream/promises';
|
|
19
23
|
import { Readable } from 'node:stream';
|
|
24
|
+
import { setTimeout as delay } from 'node:timers/promises';
|
|
20
25
|
import { x as tarExtract } from 'tar';
|
|
21
26
|
const GCS_BASE_URL = 'https://storage.googleapis.com/qdrant-fastembed';
|
|
27
|
+
// Lock-poll: how long a non-holder waits for the holder to finish before
|
|
28
|
+
// concluding the holder crashed. Cold-fetch is ~90 MB on slow CI runners, so
|
|
29
|
+
// a generous timeout avoids false takeovers under network back-pressure.
|
|
30
|
+
const LOCK_TIMEOUT_MS = 120_000;
|
|
31
|
+
const LOCK_POLL_INTERVAL_MS = 250;
|
|
32
|
+
// Standard transient-error retry per feedback_transient_retry_circuit_breaker.md:
|
|
33
|
+
// 50/200/800ms backoff, only on network errors and 5xx (4xx is deterministic).
|
|
34
|
+
const HTTP_BACKOFF_MS = [50, 200, 800];
|
|
22
35
|
/**
|
|
23
36
|
* Sentinel file written into the model directory only after the tarball has
|
|
24
37
|
* been fully downloaded AND extracted. Cache hits without it are treated as
|
|
@@ -50,28 +63,121 @@ function gcsSlugFor(model) {
|
|
|
50
63
|
export function resolveCacheDir(explicit, env = process.env) {
|
|
51
64
|
return explicit ?? env.FASTEMBED_CACHE ?? join(homedir(), '.cache', 'fastembed');
|
|
52
65
|
}
|
|
66
|
+
class TransientHttpError extends Error {
|
|
67
|
+
constructor(message) {
|
|
68
|
+
super(message);
|
|
69
|
+
this.name = 'TransientHttpError';
|
|
70
|
+
}
|
|
71
|
+
}
|
|
53
72
|
/**
|
|
54
73
|
* Stream the tarball to a unique temp path, then atomic-rename to the final
|
|
55
|
-
* tarball path before extracting. The temp suffix prevents
|
|
56
|
-
*
|
|
57
|
-
*
|
|
74
|
+
* tarball path before extracting. The temp suffix prevents the in-flight
|
|
75
|
+
* write stream from being observed at the final path — extraction always
|
|
76
|
+
* sees a complete file.
|
|
77
|
+
*
|
|
78
|
+
* Throws `TransientHttpError` on 5xx / network failure (caller retries) and
|
|
79
|
+
* a plain Error on 4xx (caller fails fast — retrying won't help).
|
|
58
80
|
*/
|
|
59
81
|
async function downloadTarball(url, destPath, showProgress, deps) {
|
|
60
82
|
const fetchFn = deps.fetchImpl ?? fetch;
|
|
61
83
|
const tmpPath = `${destPath}.${process.pid}.tmp`;
|
|
62
84
|
mkdirSync(dirname(destPath), { recursive: true });
|
|
63
|
-
|
|
85
|
+
let res;
|
|
86
|
+
try {
|
|
87
|
+
res = await fetchFn(url);
|
|
88
|
+
}
|
|
89
|
+
catch (err) {
|
|
90
|
+
throw new TransientHttpError(`Model download failed: GET ${url} → ${err.message}`);
|
|
91
|
+
}
|
|
64
92
|
if (!res.ok || !res.body) {
|
|
65
|
-
|
|
93
|
+
const msg = `Model download failed: GET ${url} → ${res.status} ${res.statusText}`;
|
|
94
|
+
if (res.status >= 500)
|
|
95
|
+
throw new TransientHttpError(msg);
|
|
96
|
+
throw new Error(msg);
|
|
66
97
|
}
|
|
67
98
|
if (showProgress) {
|
|
68
99
|
const total = Number(res.headers.get('content-length') ?? 0);
|
|
69
100
|
const totalMb = (total / (1024 * 1024)).toFixed(1);
|
|
70
101
|
process.stderr.write(`fastembed: downloading ${totalMb} MB from ${url}\n`);
|
|
71
102
|
}
|
|
72
|
-
|
|
103
|
+
try {
|
|
104
|
+
await pipeline(Readable.fromWeb(res.body), createWriteStream(tmpPath));
|
|
105
|
+
}
|
|
106
|
+
catch (err) {
|
|
107
|
+
rmSync(tmpPath, { force: true });
|
|
108
|
+
throw new TransientHttpError(`Model download stream failed mid-transfer (${url}): ${err.message}`);
|
|
109
|
+
}
|
|
73
110
|
renameSync(tmpPath, destPath);
|
|
74
111
|
}
|
|
112
|
+
async function downloadTarballWithRetry(url, destPath, showProgress, deps) {
|
|
113
|
+
let lastErr;
|
|
114
|
+
for (let attempt = 0; attempt <= HTTP_BACKOFF_MS.length; attempt++) {
|
|
115
|
+
try {
|
|
116
|
+
await downloadTarball(url, destPath, showProgress, deps);
|
|
117
|
+
return;
|
|
118
|
+
}
|
|
119
|
+
catch (err) {
|
|
120
|
+
lastErr = err;
|
|
121
|
+
if (!(err instanceof TransientHttpError) || attempt === HTTP_BACKOFF_MS.length)
|
|
122
|
+
break;
|
|
123
|
+
if (showProgress) {
|
|
124
|
+
process.stderr.write(`fastembed: download attempt ${attempt + 1} failed (${err.message}); retrying in ${HTTP_BACKOFF_MS[attempt]}ms.\n`);
|
|
125
|
+
}
|
|
126
|
+
await delay(HTTP_BACKOFF_MS[attempt]);
|
|
127
|
+
}
|
|
128
|
+
}
|
|
129
|
+
throw lastErr;
|
|
130
|
+
}
|
|
131
|
+
/**
|
|
132
|
+
* Cross-process serialization for the download/extract step. Lock holder runs
|
|
133
|
+
* `work`; non-holders poll for the completion sentinel and return as soon as
|
|
134
|
+
* it appears. If the lock holder crashes (lockfile remains but no sentinel
|
|
135
|
+
* after the timeout), the next caller cleans up and retries — preventing a
|
|
136
|
+
* permanently-stuck cache after a Ctrl+C mid-download.
|
|
137
|
+
*/
|
|
138
|
+
async function withModelLock(lockPath, completionPath, work) {
|
|
139
|
+
try {
|
|
140
|
+
writeFileSync(lockPath, String(process.pid), { flag: 'wx' });
|
|
141
|
+
}
|
|
142
|
+
catch (err) {
|
|
143
|
+
if (err.code !== 'EEXIST')
|
|
144
|
+
throw err;
|
|
145
|
+
await waitForCompletionOrTakeover(lockPath, completionPath, work);
|
|
146
|
+
return;
|
|
147
|
+
}
|
|
148
|
+
try {
|
|
149
|
+
await work();
|
|
150
|
+
}
|
|
151
|
+
finally {
|
|
152
|
+
try {
|
|
153
|
+
rmSync(lockPath, { force: true });
|
|
154
|
+
}
|
|
155
|
+
catch { /* best effort */ }
|
|
156
|
+
}
|
|
157
|
+
}
|
|
158
|
+
async function waitForCompletionOrTakeover(lockPath, completionPath, work) {
|
|
159
|
+
const deadline = Date.now() + LOCK_TIMEOUT_MS;
|
|
160
|
+
while (Date.now() < deadline) {
|
|
161
|
+
if (existsSync(completionPath))
|
|
162
|
+
return;
|
|
163
|
+
if (!existsSync(lockPath)) {
|
|
164
|
+
// Holder finished without writing the sentinel (crashed). Try to take
|
|
165
|
+
// over the lock ourselves.
|
|
166
|
+
await withModelLock(lockPath, completionPath, work);
|
|
167
|
+
return;
|
|
168
|
+
}
|
|
169
|
+
await delay(LOCK_POLL_INTERVAL_MS);
|
|
170
|
+
}
|
|
171
|
+
// Stale lock — clear it and let the next caller (or our own retry above)
|
|
172
|
+
// pick up the work. Force unlinking is safer than leaving the cache
|
|
173
|
+
// permanently wedged.
|
|
174
|
+
try {
|
|
175
|
+
rmSync(lockPath, { force: true });
|
|
176
|
+
}
|
|
177
|
+
catch { /* best effort */ }
|
|
178
|
+
throw new Error(`fastembed: timed out after ${LOCK_TIMEOUT_MS}ms waiting for ${lockPath}. ` +
|
|
179
|
+
`Stale lock cleared — retry the operation.`);
|
|
180
|
+
}
|
|
75
181
|
/**
|
|
76
182
|
* Ensure the per-model directory exists in the cache. Returns the absolute
|
|
77
183
|
* path. If already present AND the completion sentinel is in place, no
|
|
@@ -86,25 +192,34 @@ async function downloadTarball(url, destPath, showProgress, deps) {
|
|
|
86
192
|
*/
|
|
87
193
|
export async function retrieveModel(model, cacheDir, showProgress, deps = {}) {
|
|
88
194
|
const modelDir = join(cacheDir, model);
|
|
89
|
-
|
|
90
|
-
|
|
91
|
-
|
|
92
|
-
|
|
93
|
-
process.stderr.write(`fastembed: cached model at ${modelDir} is incomplete (no completion marker); redownloading.\n`);
|
|
94
|
-
}
|
|
95
|
-
rmSync(modelDir, { recursive: true, force: true });
|
|
96
|
-
}
|
|
195
|
+
const completionPath = join(modelDir, COMPLETION_SENTINEL);
|
|
196
|
+
// Fast path: complete cache hit needs no lock, no fs writes.
|
|
197
|
+
if (existsSync(completionPath))
|
|
198
|
+
return modelDir;
|
|
97
199
|
mkdirSync(cacheDir, { recursive: true });
|
|
200
|
+
const lockPath = join(cacheDir, `.${model}.download.lock`);
|
|
98
201
|
const tarballPath = join(cacheDir, `${model}.tar.gz`);
|
|
99
202
|
const url = `${GCS_BASE_URL}/${gcsSlugFor(model)}.tar.gz`;
|
|
100
|
-
await
|
|
101
|
-
|
|
102
|
-
|
|
103
|
-
|
|
104
|
-
|
|
105
|
-
|
|
106
|
-
|
|
107
|
-
|
|
203
|
+
await withModelLock(lockPath, completionPath, async () => {
|
|
204
|
+
// Re-check inside the lock — another process may have completed the
|
|
205
|
+
// download between our fast-path check and our lock acquisition.
|
|
206
|
+
if (existsSync(completionPath))
|
|
207
|
+
return;
|
|
208
|
+
if (existsSync(modelDir)) {
|
|
209
|
+
if (showProgress) {
|
|
210
|
+
process.stderr.write(`fastembed: cached model at ${modelDir} is incomplete (no completion marker); redownloading.\n`);
|
|
211
|
+
}
|
|
212
|
+
rmSync(modelDir, { recursive: true, force: true });
|
|
213
|
+
}
|
|
214
|
+
await downloadTarballWithRetry(url, tarballPath, showProgress, deps);
|
|
215
|
+
const extract = deps.extract ?? tarExtract;
|
|
216
|
+
await extract({ file: tarballPath, cwd: cacheDir });
|
|
217
|
+
rmSync(tarballPath, { force: true });
|
|
218
|
+
if (!existsSync(modelDir)) {
|
|
219
|
+
throw new Error(`Model archive extracted but ${modelDir} is missing — corrupt tarball?`);
|
|
220
|
+
}
|
|
221
|
+
writeFileSync(completionPath, '');
|
|
222
|
+
});
|
|
108
223
|
return modelDir;
|
|
109
224
|
}
|
|
110
225
|
//# sourceMappingURL=model-loader.js.map
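
The exclusive-create (`wx`) lock that `withModelLock` relies on can be exercised in isolation. The sketch below is illustrative only: `tryAcquireLock`, `releaseLock`, and the temp-directory paths are hypothetical names, not part of `model-loader.js`. It uses `require` so that it runs standalone.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: returns true if this caller created the lock file,
// false if another holder already has it. Any other error propagates.
function tryAcquireLock(lockPath) {
  try {
    // 'wx' fails with EEXIST when the file already exists, so only one
    // caller can succeed in creating it.
    fs.writeFileSync(lockPath, String(process.pid), { flag: 'wx' });
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false;
    throw err;
  }
}

// Hypothetical helper: force-removing tolerates an already-missing lock file.
function releaseLock(lockPath) {
  fs.rmSync(lockPath, { force: true });
}

const lockDir = fs.mkdtempSync(path.join(os.tmpdir(), 'lock-demo-'));
const lockPath = path.join(lockDir, '.model.download.lock');
const first = tryAcquireLock(lockPath);   // creates the lock
const second = tryAcquireLock(lockPath);  // sees EEXIST
releaseLock(lockPath);
const third = tryAcquireLock(lockPath);   // lock is free again
releaseLock(lockPath);
```

In the diff above, a caller that fails to acquire the lock polls for the completion sentinel rather than failing. This sketch covers only the acquire and release steps.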
|
|
@@ -9,7 +9,7 @@
|
|
|
9
9
|
*/
|
|
10
10
|
import * as readline from 'node:readline';
|
|
11
11
|
import { loadSpellEngine, } from '../services/engine-loader.js';
|
|
12
|
-
import {
|
|
12
|
+
import { getSharedMemoryAccessor } from '../services/daemon-dashboard.js';
|
|
13
13
|
/**
|
|
14
14
|
* Wrap a MemoryAccessor with a write-failure counter so the [epic] summary
|
|
15
15
|
* can warn when spell progress didn't reach disk (#982). Without this, a
|
|
@@ -56,17 +56,22 @@ async function promptAcceptPermissions() {
|
|
|
56
56
|
*/
|
|
57
57
|
export async function runEpicSpell(yamlContent, options = {}) {
|
|
58
58
|
const engine = await loadSpellEngine();
|
|
59
|
-
// Lazily
|
|
60
|
-
// are persisted and visible in the dashboard.
|
|
59
|
+
// Lazily wrap the process-wide shared accessor (#1020) so execution
|
|
60
|
+
// records are persisted and visible in the dashboard. The shared helper
|
|
61
|
+
// owns the warn-and-return-null degradation; we only attach the
|
|
62
|
+
// failed-write counter on top of a successful inner accessor.
|
|
61
63
|
if (!memoryAccessor) {
|
|
62
|
-
|
|
63
|
-
|
|
64
|
+
const inner = await getSharedMemoryAccessor();
|
|
65
|
+
if (inner) {
|
|
64
66
|
memoryAccessor = trackPersistFailures(inner);
|
|
65
67
|
console.log('[epic] Memory accessor ready — spell progress will be persisted');
|
|
66
68
|
}
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
69
|
+
else {
|
|
70
|
+
// The shared helper already emitted `[memory]`-prefixed warns. Add an
|
|
71
|
+
// `[epic]`-tagged note so a user running `flo epic` can correlate the
|
|
72
|
+
// missing dashboard history with this command without scanning for a
|
|
73
|
+
// `[memory]` line elsewhere in the output.
|
|
74
|
+
console.warn('[epic] ⚠ Memory unavailable — this run will not appear in the dashboard');
|
|
70
75
|
}
|
|
71
76
|
}
|
|
72
77
|
// memoryAccessor is module-cached, so `failedWrites` is cumulative across
|
|
@@ -719,9 +719,22 @@ export const hiveMindTools = [
|
|
|
719
719
|
workerCount,
|
|
720
720
|
};
|
|
721
721
|
}
|
|
722
|
-
//
|
|
723
|
-
//
|
|
724
|
-
//
|
|
722
|
+
// #1017 — detach the adapter FIRST, before any code that broadcasts
|
|
723
|
+
// hive-mind events. terminateAgent below sends agent_terminate
|
|
724
|
+
// broadcasts on the hive-mind namespace; with the adapter still
|
|
725
|
+
// listening, those broadcasts register fire-and-forget storeEntry
|
|
726
|
+
// calls that can land after clearNamespace runs. Detaching first means
|
|
727
|
+
// every subsequent broadcast hits a dead listener and never persists,
|
|
728
|
+
// so clearNamespace operates on a deterministic, unchanging set.
|
|
729
|
+
const adapter = _writeThroughAdapter;
|
|
730
|
+
if (adapter) {
|
|
731
|
+
adapter.detach();
|
|
732
|
+
_writeThroughAdapter = null;
|
|
733
|
+
}
|
|
734
|
+
// Story #807: terminate coordinator-side worker records so swarm
|
|
735
|
+
// agent_list reflects the shutdown. allSettled so one failed terminate
|
|
736
|
+
// doesn't strand the rest. Broadcasts emitted here are intentionally
|
|
737
|
+
// ignored by the (now-detached) adapter.
|
|
725
738
|
try {
|
|
726
739
|
const coordinator = await getSwarmCoordinator();
|
|
727
740
|
const results = await Promise.allSettled(hiveState.workers.map(id => coordinator.terminateAgent(id, { reason: 'hive-mind_shutdown', force: true })));
|
|
@@ -734,23 +747,24 @@ export const hiveMindTools = [
|
|
|
734
747
|
catch (err) {
|
|
735
748
|
process.stderr.write(`[hive-mind_shutdown] coordinator cleanup failed: ${err.message}\n`);
|
|
736
749
|
}
|
|
737
|
-
//
|
|
738
|
-
|
|
739
|
-
|
|
740
|
-
|
|
741
|
-
|
|
742
|
-
|
|
743
|
-
|
|
744
|
-
|
|
750
|
+
// Drain whatever the adapter already had in flight at detach, then
|
|
751
|
+
// delete the persisted hive-mind rows. Routed through the chokepoint
|
|
752
|
+
// (deleteEntry → daemon RPC when alive), so the daemon's in-memory
|
|
753
|
+
// snapshot stays consistent with disk and cannot clobber the cleanup
|
|
754
|
+
// on its next flush.
|
|
755
|
+
if (adapter) {
|
|
756
|
+
try {
|
|
757
|
+
await adapter.clearNamespace(HIVE_NS);
|
|
758
|
+
await adapter.clearNamespace(HIVE_MEMORY_NS);
|
|
759
|
+
}
|
|
760
|
+
catch {
|
|
761
|
+
// Best-effort cleanup
|
|
762
|
+
}
|
|
745
763
|
}
|
|
746
764
|
// Shutdown MessageBus for hive-mind
|
|
747
765
|
try {
|
|
748
766
|
const bus = await getMessageBus();
|
|
749
767
|
bus.unsubscribe('hive-mind-system');
|
|
750
|
-
if (_writeThroughAdapter) {
|
|
751
|
-
_writeThroughAdapter.detach();
|
|
752
|
-
_writeThroughAdapter = null;
|
|
753
|
-
}
|
|
754
768
|
}
|
|
755
769
|
catch {
|
|
756
770
|
// Bus may not be initialized
|
|
@@ -12,6 +12,7 @@ import { findProjectRoot } from '../services/project-root.js';
|
|
|
12
12
|
import { buildGrimoire } from '../services/grimoire-builder.js';
|
|
13
13
|
import { errorDetail } from '../shared/utils/error-detail.js';
|
|
14
14
|
import { inferSpellTier } from '../spells/core/spell-tier.js';
|
|
15
|
+
import { getSharedMemoryAccessor } from '../services/daemon-dashboard.js';
|
|
15
16
|
// ============================================================================
|
|
16
17
|
// Constants
|
|
17
18
|
// ============================================================================
|
|
@@ -53,16 +54,23 @@ function trackResult(tracked, result) {
|
|
|
53
54
|
tracked.result = result;
|
|
54
55
|
tracked.completedAt = new Date().toISOString();
|
|
55
56
|
}
|
|
57
|
+
// Memory accessor wiring (#1016): without `getSharedMemoryAccessor()`,
|
|
58
|
+
// runner.storeProgress() writes go to noopMemory and The Luminarium's
|
|
59
|
+
// "Flo Runs" tab never sees flo run / spell_cast invocations. The shared
|
|
60
|
+
// accessor is the same singleton runner-adapter.ts uses for `flo epic`
|
|
61
|
+
// (one cold init per process — see #1020).
|
|
56
62
|
/** Execute a definition via the engine with tracking and error handling. */
|
|
57
63
|
async function executeAndTrack(engine, definition, args, options = {}) {
|
|
58
64
|
const spellId = `sp-${Date.now()}`;
|
|
59
65
|
const tracked = trackStart(spellId, definition.name, definition.description);
|
|
60
66
|
try {
|
|
61
67
|
const sandboxConfig = await engine.loadSandboxConfigFromProject(findProjectRoot());
|
|
68
|
+
const memory = await getSharedMemoryAccessor();
|
|
62
69
|
const result = await engine.bridgeExecuteSpell(definition, args, {
|
|
63
70
|
spellId,
|
|
64
71
|
sandboxConfig,
|
|
65
72
|
forceCredentialReprompt: options.forceCredentialReprompt,
|
|
73
|
+
...(memory ? { memory } : {}),
|
|
66
74
|
});
|
|
67
75
|
trackResult(tracked, result);
|
|
68
76
|
return withSpellSource(serializeResult(result), options.sourceFile, options.tier);
|
|
@@ -112,6 +112,30 @@ export async function bridgeStoreEntry(options) {
|
|
|
112
112
|
const now = Date.now();
|
|
113
113
|
const guardResult = await guardValidate(registry, 'store', { key, namespace, size: value.length });
|
|
114
114
|
if (!guardResult.allowed) {
|
|
115
|
+
// Dedupe rejection means the same `(op, params)` write just succeeded
|
|
116
|
+
// — the caller's data is already durable. Look up the existing row so
|
|
117
|
+
// we can return its id with success:true; this matches what the
|
|
118
|
+
// dedupe semantically means (a no-op, not a failure). Other rejection
|
|
119
|
+
// reasons (rate limit, etc.) remain real failures. Match the literal
|
|
120
|
+
// reason string rather than a substring regex so a future rejection
|
|
121
|
+
// worded with "duplicate mutation" but different semantics doesn't
|
|
122
|
+
// get silently swallowed.
|
|
123
|
+
if (guardResult.reason === 'duplicate mutation within dedupe window') {
|
|
124
|
+
let existingId = null;
|
|
125
|
+
const probe = ctx.db.prepare(`SELECT id FROM memory_entries WHERE namespace = ? AND key = ? AND status = 'active' LIMIT 1`);
|
|
126
|
+
try {
|
|
127
|
+
probe.bind([namespace, key]);
|
|
128
|
+
if (probe.step()) {
|
|
129
|
+
existingId = String(probe.getAsObject().id);
|
|
130
|
+
}
|
|
131
|
+
}
|
|
132
|
+
finally {
|
|
133
|
+
probe.free();
|
|
134
|
+
}
|
|
135
|
+
if (existingId) {
|
|
136
|
+
return { success: true, id: existingId };
|
|
137
|
+
}
|
|
138
|
+
}
|
|
115
139
|
return { success: false, id, error: `MutationGuard rejected: ${guardResult.reason}` };
|
|
116
140
|
}
|
|
117
141
|
const resolved = await resolveBridgeEmbedding(value, options.precomputedEmbedding, options.generateEmbeddingFlag, namespace);
|
|
@@ -120,6 +144,48 @@ export async function bridgeStoreEntry(options) {
|
|
|
120
144
|
}
|
|
121
145
|
const { json: embeddingJson, dimensions, model } = resolved;
|
|
122
146
|
const embeddingResponse = embeddingResponseFrom(resolved);
|
|
147
|
+
// Idempotency guard, mirrors the one in `memory-initializer.ts`'s raw-
|
|
148
|
+
// sql.js fallback. When the daemon route just wrote this exact row but
|
|
149
|
+
// the client missed the ack, we land here with the row already on disk;
|
|
150
|
+
// a plain INSERT would trip UNIQUE and surface as `[moflo] bridge
|
|
151
|
+
// operation failed:` stderr noise even though the data is durable.
|
|
152
|
+
// Probe first so withDb never sees the throw.
|
|
153
|
+
//
|
|
154
|
+
// Limitations carried forward: only `content` is compared, not `tags`
|
|
155
|
+
// or `ttl`. The targeted scenario is the same caller's request being
|
|
156
|
+
// processed twice (daemon write + client retry), where every option is
|
|
157
|
+
// identical by definition — a different caller varying `tags` after a
|
|
158
|
+
// missed-ack would still see this as an idempotent no-op rather than
|
|
159
|
+
// an update. `cached: false, attested: false` because the prior writer
|
|
160
|
+
// already ran post-persist bookkeeping; this process's in-memory cache
|
|
161
|
+
// stays cold for one retrieve until the read path warms it (perf only,
|
|
162
|
+
// not correctness).
|
|
163
|
+
if (!options.upsert) {
|
|
164
|
+
let existingId = null;
|
|
165
|
+
let existingContent = null;
|
|
166
|
+
const probe = ctx.db.prepare(`SELECT id, content FROM memory_entries WHERE namespace = ? AND key = ? AND status = 'active' LIMIT 1`);
|
|
167
|
+
try {
|
|
168
|
+
probe.bind([namespace, key]);
|
|
169
|
+
if (probe.step()) {
|
|
170
|
+
const row = probe.getAsObject();
|
|
171
|
+
existingId = String(row.id);
|
|
172
|
+
existingContent = row.content;
|
|
173
|
+
}
|
|
174
|
+
}
|
|
175
|
+
finally {
|
|
176
|
+
probe.free();
|
|
177
|
+
}
|
|
178
|
+
if (existingId && existingContent === value) {
|
|
179
|
+
return {
|
|
180
|
+
success: true,
|
|
181
|
+
id: existingId,
|
|
182
|
+
embedding: embeddingResponse,
|
|
183
|
+
guarded: true,
|
|
184
|
+
cached: false,
|
|
185
|
+
attested: false,
|
|
186
|
+
};
|
|
187
|
+
}
|
|
188
|
+
}
|
|
123
189
|
const insertSql = options.upsert
|
|
124
190
|
? `INSERT OR REPLACE INTO memory_entries (
|
|
125
191
|
id, key, namespace, content, type,
|
|
@@ -1650,6 +1650,42 @@ export async function storeEntry(options) {
|
|
|
1650
1650
|
embeddingModel = embResult.model;
|
|
1651
1651
|
}
|
|
1652
1652
|
}
|
|
1653
|
+
// Idempotency guard. By the time we reach the raw-sql.js fallback, an
|
|
1654
|
+
// earlier write attempt — daemon route via `tryDaemonStore`, or bridge
|
|
1655
|
+
// via `bridgeStoreEntry` — may have already persisted this exact row to
|
|
1656
|
+
// disk. If a post-persist throw escaped the bridge's inner guards (#994,
|
|
1657
|
+
// #982), `bridgeStoreEntry` returned null and we landed here. Re-running
|
|
1658
|
+
// a plain INSERT would then trip the UNIQUE constraint on `(namespace,
|
|
1659
|
+
// key)` and surface as `exit 1` even though the data is durable on disk
|
|
1660
|
+
// — exactly the cascade described in `bridge-entries.ts:205`. If the
|
|
1661
|
+
// existing row matches the value the caller asked us to write, treat
|
|
1662
|
+
// this as a successful no-op and propagate the existing id instead of
|
|
1663
|
+
// re-inserting. If the content differs, fall through to INSERT — the
|
|
1664
|
+
// UNIQUE error is then a real "key already taken with other content"
|
|
1665
|
+
// signal that the caller deserves to see.
|
|
1666
|
+
if (!upsert) {
|
|
1667
|
+
let existingRow = null;
|
|
1668
|
+
const probe = db.prepare(`SELECT id, content FROM memory_entries WHERE namespace = ? AND key = ? AND status = 'active' LIMIT 1`);
|
|
1669
|
+
try {
|
|
1670
|
+
probe.bind([namespace, key]);
|
|
1671
|
+
if (probe.step()) {
|
|
1672
|
+
existingRow = probe.getAsObject();
|
|
1673
|
+
}
|
|
1674
|
+
}
|
|
1675
|
+
finally {
|
|
1676
|
+
probe.free();
|
|
1677
|
+
}
|
|
1678
|
+
if (existingRow && existingRow.content === value) {
|
|
1679
|
+
db.close();
|
|
1680
|
+
return {
|
|
1681
|
+
success: true,
|
|
1682
|
+
id: String(existingRow.id),
|
|
1683
|
+
embedding: embeddingJson
|
|
1684
|
+
? { dimensions: embeddingDimensions, model: embeddingModel }
|
|
1685
|
+
: undefined,
|
|
1686
|
+
};
|
|
1687
|
+
}
|
|
1688
|
+
}
|
|
1653
1689
|
// Insert or update entry (upsert mode uses REPLACE)
|
|
1654
1690
|
const insertSql = upsert
|
|
1655
1691
|
? `INSERT OR REPLACE INTO memory_entries (
|
|
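The idempotency guards added to `bridgeStoreEntry` and `storeEntry` follow the same probe-before-insert pattern: look up the active row for `(namespace, key)`, return the existing id when the content matches, and let a real conflict surface otherwise. The following is a minimal sketch of that control flow only. It assumes a plain `Map` in place of the sql.js `memory_entries` table, and `storeOnce`, `nextId`, and the error text are hypothetical names, not part of moflo.

```javascript
// Hypothetical in-memory stand-in for the memory_entries table.
// A Map replaces the sql.js probe; none of these names come from moflo.
let nextId = 1;

function storeOnce(table, { namespace, key, value, upsert = false }) {
  const rowKey = JSON.stringify([namespace, key]);
  const existing = table.get(rowKey);

  if (!upsert && existing) {
    // Same content already stored: treat the retry as a successful no-op.
    if (existing.content === value) {
      return { success: true, id: existing.id };
    }
    // Different content: this is a genuine conflict the caller should see.
    return { success: false, error: 'key already taken with other content' };
  }

  // Fresh insert, or upsert replacing whatever was there.
  const id = `entry-${nextId++}`;
  table.set(rowKey, { id, content: value });
  return { success: true, id };
}
```

Calling `storeOnce` twice with identical arguments returns the same id both times, which is the behaviour the diff's comments describe for a daemon write followed by a client retry. As in the diff, only the content is compared; tags or TTL would not be.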
@@ -16,6 +16,46 @@ import { createServer } from 'node:http';
|
|
|
16
16
|
import { errorDetail } from '../shared/utils/error-detail.js';
|
|
17
17
|
import { handleMemoryStore, handleMemoryDelete, handleMemoryBatch, matchMemoryRpcRoute, } from './daemon-memory-rpc.js';
|
|
18
18
|
export const DEFAULT_DASHBOARD_PORT = 3117;
|
|
19
|
+
/**
|
|
20
|
+
* Process-wide promise for the shared MemoryAccessor. Memoized as a *promise*
|
|
21
|
+
* (not the resolved value) so concurrent first-callers share a single init
|
|
22
|
+
* — without this, two near-simultaneous calls would each kick off their own
|
|
23
|
+
* `createDashboardMemoryAccessor()` chain and the loser's accessor would
|
|
24
|
+
* leak. The race fix originated in #1016 inside `mcp-tools/spell-tools.ts`;
|
|
25
|
+
* #1020 lifted it into this shared helper so `epic/runner-adapter.ts` (which
|
|
26
|
+
* had the same latent race) and any future caller benefit from one cold
|
|
27
|
+
* init per process.
|
|
28
|
+
*/
|
|
29
|
+
let _sharedAccessorPromise = null;
|
|
30
|
+
/**
|
|
31
|
+
* Return the process-wide MemoryAccessor, lazy-initialized on first call and
|
|
32
|
+
* cached as a promise thereafter. Returns `null` (with a warn log) if init
|
|
33
|
+
* fails so callers can degrade gracefully — the spell still runs, the user
|
|
34
|
+
* just doesn't see the run in The Luminarium.
|
|
35
|
+
*/
|
|
36
|
+
export function getSharedMemoryAccessor() {
|
|
37
|
+
if (_sharedAccessorPromise)
|
|
38
|
+
return _sharedAccessorPromise;
|
|
39
|
+
_sharedAccessorPromise = (async () => {
|
|
40
|
+
try {
|
|
41
|
+
return await createDashboardMemoryAccessor();
|
|
42
|
+
}
|
|
43
|
+
catch (err) {
|
|
44
|
+
console.warn(`[memory] dashboard accessor unavailable: ${err.message ?? err}`);
|
|
45
|
+
console.warn('[memory] runs will NOT appear in The Luminarium');
|
|
46
|
+
return null;
|
|
47
|
+
}
|
|
48
|
+
})();
|
|
49
|
+
return _sharedAccessorPromise;
|
|
50
|
+
}
|
|
51
|
+
/**
|
|
52
|
+
* Test-only: reset the cached promise so a subsequent call re-runs init.
|
|
53
|
+
* Production code MUST NOT call this — leaks the previous accessor's DB
|
|
54
|
+
* handle if the prior init succeeded.
|
|
55
|
+
*/
|
|
56
|
+
export function _resetSharedMemoryAccessorForTest() {
|
|
57
|
+
_sharedAccessorPromise = null;
|
|
58
|
+
}
|
|
19
59
|
/**
|
|
20
60
|
* Create a MemoryAccessor backed by the sql.js/HNSW memory database.
|
|
21
61
|
* Lazy-loads memory-initializer to avoid circular deps.
|
|
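The doc comment on `_sharedAccessorPromise` explains why the promise itself is cached rather than the resolved value. A small sketch of that idea, assuming a hypothetical `createAccessor` stand-in for `createDashboardMemoryAccessor()` and an `initCount` counter added purely for illustration:

```javascript
// Hypothetical stand-ins: createAccessor and initCount are illustrative only.
let sharedPromise = null;
let initCount = 0;

async function createAccessor() {
  initCount += 1; // runs synchronously, before the first await
  await new Promise((resolve) => setTimeout(resolve, 5)); // simulated cold init
  return { id: initCount };
}

function getShared() {
  // Cache the promise, not the value, so concurrent first callers share one init.
  if (sharedPromise) return sharedPromise;
  sharedPromise = (async () => {
    try {
      return await createAccessor();
    } catch (err) {
      // Degrade to null so callers can continue without the accessor.
      return null;
    }
  })();
  return sharedPromise;
}
```

Two back-to-back calls to `getShared()` return the same promise object and trigger a single `createAccessor` run. Had the resolved value been cached instead, the second call would arrive before the first init finished and start a second one.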
@@ -4,11 +4,21 @@
|
|
|
4
4
|
* processes write to the same target concurrently.
|
|
5
5
|
*
|
|
6
6
|
* Pattern: write to a process-unique temp path `<target>.tmp.<pid>.<rand>`,
|
|
7
|
-
* then rename onto `target`.
|
|
8
|
-
* - `
|
|
9
|
-
*
|
|
10
|
-
*
|
|
11
|
-
*
|
|
7
|
+
* **fsync the temp file**, then rename onto `target`.
|
|
8
|
+
* - `writeFileSync` does NOT fsync — the OS keeps data in the write cache.
|
|
9
|
+
* On Windows that cache isn't always coherent with what other processes
|
|
10
|
+
* see when they open the freshly-renamed target. Issue #1015 surfaced
|
|
11
|
+
* this as a flaky `memory-retrieve` race in consumer-smoke: process A
|
|
12
|
+
* stores via the daemon → daemon flushes via this helper → daemon
|
|
13
|
+
* returns → process B opens the DB and sees stale content.
|
|
14
|
+
* - The fix: fsync the temp fd before rename. After fsync, the data is
|
|
15
|
+
* durably on disk; the rename then makes that durable data visible
|
|
16
|
+
* atomically. Subsequent readers see the new bytes regardless of cache
|
|
17
|
+
* state.
|
|
18
|
+
* - `fs.renameSync` is atomic on POSIX. On Windows, Node maps it to
|
|
19
|
+
* `MoveFileExW(..., MOVEFILE_REPLACE_EXISTING)`, which replaces the
|
|
20
|
+
* destination near-atomically — concurrent readers always observe either
|
|
21
|
+
* the old file or the new, never a truncated one.
|
|
12
22
|
* - The unique temp path means concurrent writers can't clobber each other's
|
|
13
23
|
* in-flight bytes (#635). Last-writer-wins semantics: each rename is fully
|
|
14
24
|
* atomic, so the destination always reflects exactly one writer's data.
|
|
@@ -18,16 +28,28 @@
|
|
|
18
28
|
* On any failure, the temp file is best-effort removed and the original
|
|
19
29
|
* `target` stays intact. The underlying error is always re-thrown.
|
|
20
30
|
*
|
|
31
|
+
* Windows-only post-rename verify (#1015): on NTFS with antivirus / Defender
|
|
32
|
+
* scanning the freshly-renamed file, a sub-process opening the same path
|
|
33
|
+
* within ~1s can briefly see the file as locked. After a successful rename
|
|
34
|
+
* we poll-open the target until it's readable (or a 250 ms deadline passes)
|
|
35
|
+
* so the next reader doesn't race the AV lock window. The rename itself
|
|
36
|
+
* already succeeded and the data is fsynced, so the verify is best-effort:
|
|
37
|
+
* a timeout returns silently rather than throwing.
|
|
38
|
+
*
|
|
21
39
|
* `fs` is injectable so the interrupt-mid-write paths can be exercised in
|
|
22
40
|
* unit tests without depending on ESM-unfriendly module spies.
|
|
23
41
|
*
|
|
24
42
|
* @module moflo/cli/shared/utils/atomic-file-write
|
|
25
43
|
*/
|
|
26
44
|
import * as realFs from 'node:fs';
|
|
45
|
+
const IS_WIN32 = process.platform === 'win32';
|
|
46
|
+
const VERIFY_DEADLINE_MS = 250;
|
|
47
|
+
const VERIFY_STEP_MS = 10;
|
|
27
48
|
export function atomicWriteFileSync(targetPath, data, fs = realFs) {
|
|
28
49
|
const tmpPath = `${targetPath}.tmp.${process.pid}.${Math.random().toString(36).slice(2, 8)}`;
|
|
29
50
|
try {
|
|
30
51
|
fs.writeFileSync(tmpPath, data);
|
|
52
|
+
fsyncFile(tmpPath, fs);
|
|
31
53
|
fs.renameSync(tmpPath, targetPath);
|
|
32
54
|
}
|
|
33
55
|
catch (err) {
|
|
@@ -39,5 +61,61 @@ export function atomicWriteFileSync(targetPath, data, fs = realFs) {
|
|
|
39
61
|
}
|
|
40
62
|
throw err;
|
|
41
63
|
}
|
|
64
|
+
if (IS_WIN32)
|
|
65
|
+
verifyReadableAfterRename(targetPath, fs);
|
|
66
|
+
}
|
|
67
|
+
/**
|
|
68
|
+
* Open the freshly-written temp file, fsync, close. Ensures the data is
|
|
69
|
+
* durably on disk before rename makes it visible (#1015). Best-effort: an
|
|
70
|
+
* fsync error is swallowed because a real filesystem failure will surface
|
|
71
|
+
* on the rename anyway, and we don't want to mask the more useful error.
|
|
72
|
+
*/
|
|
73
|
+
function fsyncFile(tmpPath, fs) {
|
|
74
|
+
const openSync = fs.openSync ?? realFs.openSync;
|
|
75
|
+
const closeSync = fs.closeSync ?? realFs.closeSync;
|
|
76
|
+
const fsyncSync = fs.fsyncSync ?? realFs.fsyncSync;
|
|
77
|
+
let fd = null;
|
|
78
|
+
try {
|
|
79
|
+
fd = openSync(tmpPath, 'r+');
|
|
80
|
+
fsyncSync(fd);
|
|
81
|
+
}
|
|
82
|
+
catch {
|
|
83
|
+
/* fsync best-effort — see fn doc */
|
|
84
|
+
}
|
|
85
|
+
finally {
|
|
86
|
+
if (fd !== null) {
|
|
87
|
+
try {
|
|
88
|
+
closeSync(fd);
|
|
89
|
+
}
|
|
90
|
+
catch { /* close best-effort */ }
|
|
91
|
+
}
|
|
92
|
+
}
|
|
93
|
+
}
|
|
94
|
+
/**
|
|
95
|
+
* Poll-open the target until a reader can succeed, or the deadline passes.
|
|
96
|
+
* Closes the AV-scan settle window on NTFS (#1015). No-op everywhere else.
|
|
97
|
+
*
|
|
98
|
+
* Yields the thread between probes via `Atomics.wait` so we don't pin a CPU
|
|
99
|
+
* during the very contention we're waiting out (`feedback_async_by_default`).
|
|
100
|
+
*/
|
|
101
|
+
function verifyReadableAfterRename(targetPath, fs) {
|
|
102
|
+
const openSync = fs.openSync ?? realFs.openSync;
|
|
103
|
+
const closeSync = fs.closeSync ?? realFs.closeSync;
|
|
104
|
+
const deadline = Date.now() + VERIFY_DEADLINE_MS;
|
|
105
|
+
while (true) {
|
|
106
|
+
try {
|
|
107
|
+
closeSync(openSync(targetPath, 'r'));
|
|
108
|
+
return;
|
|
109
|
+
}
|
|
110
|
+
catch {
|
|
111
|
+
if (Date.now() >= deadline)
|
|
112
|
+
return;
|
|
113
|
+
sleepSyncMs(VERIFY_STEP_MS);
|
|
114
|
+
}
|
|
115
|
+
}
|
|
116
|
+
}
|
|
117
|
+
const SLEEP_BUF = new Int32Array(new SharedArrayBuffer(4));
|
|
118
|
+
function sleepSyncMs(ms) {
|
|
119
|
+
Atomics.wait(SLEEP_BUF, 0, 0, ms);
|
|
42
120
|
}
|
|
43
121
|
//# sourceMappingURL=atomic-file-write.js.map
|
|
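The write, fsync, rename sequence described in the header comment can be sketched with Node's synchronous `fs` calls. This is a simplified illustration, not the shipped helper: `atomicWriteSketch` is a hypothetical name, it uses CommonJS `require` for brevity, it lets an fsync error propagate (the diff swallows it), and it omits the Windows-only post-rename verify.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical simplified variant of the pattern in the diff.
function atomicWriteSketch(targetPath, data) {
  const suffix = Math.random().toString(36).slice(2, 8);
  const tmpPath = `${targetPath}.tmp.${process.pid}.${suffix}`;
  try {
    fs.writeFileSync(tmpPath, data);
    // Flush the temp file to disk before the rename makes it visible.
    const fd = fs.openSync(tmpPath, 'r+');
    try {
      fs.fsyncSync(fd);
    } finally {
      fs.closeSync(fd);
    }
    fs.renameSync(tmpPath, targetPath);
  } catch (err) {
    // Best-effort cleanup; the original target is left untouched.
    try {
      fs.unlinkSync(tmpPath);
    } catch {
      /* temp file may not exist */
    }
    throw err;
  }
}
```

A caller might use it as `atomicWriteSketch(path.join(os.tmpdir(), 'state.json'), '{}')`. Because each writer gets its own temp path, two overlapping calls cannot interleave bytes; whichever rename lands last determines the final content.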
@@ -5,8 +5,10 @@
|
|
|
5
5
|
* lifecycle. This connector adds server-pool management, lazy spawning, tool
|
|
6
6
|
* discovery caching, and the SpellConnector interface adapter.
|
|
7
7
|
*
|
|
8
|
-
* The SDK is
|
|
9
|
-
*
|
|
8
|
+
* The SDK is a hard `dependency` (MCP is a headline integration), but it is
|
|
9
|
+
* loaded lazily on first use so spells that don't use the MCP connector don't
|
|
10
|
+
* pay its startup cost. The lazy-load also yields an actionable install hint
|
|
11
|
+
* if a corrupted install lost the package.
|
|
10
12
|
*/
|
|
11
13
|
import { loadOptional } from './shared/optional-import.js';
|
|
12
14
|
const MCP_INSTALL_MSG = "MCP connector requires '@modelcontextprotocol/sdk' to be installed. Run: npm i @modelcontextprotocol/sdk";
|
|
@@ -1,11 +1,18 @@
|
|
|
1
1
|
/**
|
|
2
2
|
* Lazy loader for optional SDK dependencies.
|
|
3
3
|
*
|
|
4
|
-
* Connectors wrapping
|
|
5
|
-
*
|
|
6
|
-
* don't need to install them.
|
|
7
|
-
*
|
|
8
|
-
*
|
|
4
|
+
* Connectors wrapping truly optional SDKs (imapflow, mailparser) declare them
|
|
5
|
+
* as `peerDependenciesMeta.optional` so consumers that don't use the connector
|
|
6
|
+
* don't need to install them. The `@modelcontextprotocol/sdk` is a hard
|
|
7
|
+
* `dependency` because the MCP connector is a headline feature, but it is still
|
|
8
|
+
* routed through this helper so a corrupted install still yields an actionable
|
|
9
|
+
* message instead of a raw MODULE_NOT_FOUND.
|
|
10
|
+
*
|
|
11
|
+
* Every specifier passed to `loadOptional()` MUST be declared in package.json
|
|
12
|
+
* (dependencies, optionalDependencies, or peerDependenciesMeta). The drift
|
|
13
|
+
* guard at `src/cli/__tests__/spells/connectors/optional-import-declared.test.ts`
|
|
14
|
+
* enforces this — it walks shipped connectors, extracts every specifier, and
|
|
15
|
+
* fails the build if one is undeclared.
|
|
9
16
|
*/
|
|
10
17
|
const moduleCache = new Map();
|
|
11
18
|
function isModuleNotFound(err) {
|
|
@@ -1,24 +1,31 @@
|
|
|
1
1
|
/**
|
|
2
2
|
* Credential Validation
|
|
3
3
|
*
|
|
4
|
-
*
|
|
5
|
-
*
|
|
4
|
+
* Shape checks applied to values pulled from the encrypted credential store
|
|
5
|
+
* before they are promoted to `process.env`. Two layers:
|
|
6
6
|
*
|
|
7
|
-
*
|
|
8
|
-
*
|
|
9
|
-
*
|
|
7
|
+
* 1. **Author-declared format** (preferred): the YAML prereq sets
|
|
8
|
+
* `format: jwt`, and the validator enforces JWT shape + expiry. Any
|
|
9
|
+
* non-JWT value (e.g. a value with no dots) is rejected outright,
|
|
10
|
+
* catching the failure mode where a stored value isn't even a JWT
|
|
11
|
+
* and the spell would otherwise fail mid-cast with a 401.
|
|
10
12
|
*
|
|
11
|
-
*
|
|
12
|
-
*
|
|
13
|
-
*
|
|
14
|
-
*
|
|
13
|
+
* 2. **Conservative heuristics** (fallback when no format is declared):
|
|
14
|
+
* - JWT-shaped values (3 base64url segments) get their `exp` claim
|
|
15
|
+
* parsed and rejected when expired.
|
|
16
|
+
* - Env keys ending in `_URL` must parse via the WHATWG `URL`
|
|
17
|
+
* constructor and have a non-empty host.
|
|
18
|
+
* Anything else passes through.
|
|
15
19
|
*
|
|
16
|
-
* Story #1007:
|
|
17
|
-
*
|
|
18
|
-
*
|
|
20
|
+
* Story #1007: catch expired JWTs that survived past their TTL.
|
|
21
|
+
* Story #1009: extend to catch values that aren't even JWT-shaped when
|
|
22
|
+
* the prereq has declared `format: jwt`.
|
|
19
23
|
*/
|
|
20
24
|
const VALID_JWT_SEGMENT = /^[A-Za-z0-9_-]+$/;
|
|
21
|
-
export function validateStoredCredential(envKey, value) {
|
|
25
|
+
export function validateStoredCredential(envKey, value, format) {
|
|
26
|
+
if (format === 'jwt') {
|
|
27
|
+
return validateJwtFormat(value);
|
|
28
|
+
}
|
|
22
29
|
if (envKey.endsWith('_URL')) {
|
|
23
30
|
return validateUrlValue(value);
|
|
24
31
|
}
|
|
@@ -27,6 +34,15 @@ export function validateStoredCredential(envKey, value) {
|
|
|
27
34
|
}
|
|
28
35
|
return { valid: true };
|
|
29
36
|
}
|
|
37
|
+
function validateJwtFormat(value) {
|
|
38
|
+
if (!looksLikeJwt(value)) {
|
|
39
|
+
return {
|
|
40
|
+
valid: false,
|
|
41
|
+
reason: 'stored value is not a JWT (expected three base64url segments separated by ".")',
|
|
42
|
+
};
|
|
43
|
+
}
|
|
44
|
+
return validateJwtExpiry(value);
|
|
45
|
+
}
|
|
30
46
|
function validateUrlValue(value) {
|
|
31
47
|
try {
|
|
32
48
|
const parsed = new URL(value);
|
|
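The diff shows `validateJwtFormat` delegating to `looksLikeJwt` and `validateJwtExpiry`, whose bodies are not part of this hunk. The sketch below fills them in under stated assumptions: a JWT is three base64url segments, and `exp` is a numeric claim in seconds. The helper bodies, the `nowSeconds` parameter, the `makeToken` fixture, and the expiry reason string are hypothetical and may differ from the shipped code.

```javascript
const VALID_JWT_SEGMENT = /^[A-Za-z0-9_-]+$/;

// Assumed shape check: exactly three non-empty base64url segments.
function looksLikeJwt(value) {
  const parts = value.split('.');
  return parts.length === 3 && parts.every((p) => VALID_JWT_SEGMENT.test(p));
}

// Assumed expiry check: reject only when a numeric `exp` is in the past.
function validateJwtExpiry(value, nowSeconds = Math.floor(Date.now() / 1000)) {
  let payload;
  try {
    const json = Buffer.from(value.split('.')[1], 'base64url').toString('utf8');
    payload = JSON.parse(json);
  } catch {
    return { valid: true }; // payload is not JSON: nothing to check
  }
  if (payload && typeof payload.exp === 'number' && payload.exp <= nowSeconds) {
    return { valid: false, reason: 'stored JWT has expired' };
  }
  return { valid: true };
}

function validateJwtFormat(value) {
  if (!looksLikeJwt(value)) {
    return { valid: false, reason: 'stored value is not a JWT' };
  }
  return validateJwtExpiry(value);
}

// Test fixture: builds an unsigned token with the given payload.
function makeToken(payload) {
  const enc = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url');
  return `${enc({ alg: 'none' })}.${enc(payload)}.sig`;
}
```

With `format: jwt` declared, a value such as `plain-api-key` is rejected before it reaches `process.env`, whereas the heuristic path would have let it through because it is not JWT-shaped.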
@@ -72,6 +72,7 @@ export function compilePrerequisiteSpec(spec) {
|
|
|
72
72
|
description: spec.description,
|
|
73
73
|
promptOnMissing,
|
|
74
74
|
envKey,
|
|
75
|
+
format: spec.format,
|
|
75
76
|
};
|
|
76
77
|
}
|
|
77
78
|
function defaultHintForDetect(spec) {
|
|
@@ -205,7 +206,7 @@ export async function resolveUnmetPrerequisites(prerequisites, options = {}) {
|
|
|
205
206
|
const stored = await credentials.get(prereq.envKey);
|
|
206
207
|
if (typeof stored !== 'string' || stored.length === 0)
|
|
207
208
|
return;
|
|
208
|
-
const validation = validateStoredCredential(prereq.envKey, stored);
|
|
209
|
+
const validation = validateStoredCredential(prereq.envKey, stored, prereq.format);
|
|
209
210
|
if (!validation.valid) {
|
|
210
211
|
rejectedFromStore.push({ envKey: prereq.envKey, reason: validation.reason });
|
|
211
212
|
return;
|
|
@@ -6,6 +6,7 @@
|
|
|
6
6
|
* deliberately small so step validation can delegate here.
|
|
7
7
|
*/
|
|
8
8
|
const VALID_DETECT_TYPES = ['env', 'command', 'file'];
|
|
9
|
+
const VALID_FORMATS = ['jwt'];
|
|
9
10
|
export function validatePrerequisites(prereqs, errors, path) {
|
|
10
11
|
if (!Array.isArray(prereqs)) {
|
|
11
12
|
errors.push({ path, message: 'prerequisites must be an array' });
|
|
@@ -39,6 +40,13 @@ export function validatePrerequisites(prereqs, errors, path) {
|
|
|
39
40
|
if (p.promptOnMissing !== undefined && typeof p.promptOnMissing !== 'boolean') {
|
|
40
41
|
errors.push({ path: `${pPath}.promptOnMissing`, message: 'promptOnMissing must be a boolean' });
|
|
41
42
|
}
|
|
43
|
+
if (p.format !== undefined
|
|
44
|
+
&& !VALID_FORMATS.includes(p.format)) {
|
|
45
|
+
errors.push({
|
|
46
|
+
path: `${pPath}.format`,
|
|
47
|
+
message: `format must be one of: ${VALID_FORMATS.join(', ')}`,
|
|
48
|
+
});
|
|
49
|
+
}
|
|
42
50
|
const detect = p.detect;
|
|
43
51
|
if (!detect || typeof detect !== 'object') {
|
|
44
52
|
errors.push({ path: `${pPath}.detect`, message: 'detect is required and must be an object' });
|
|
@@ -66,6 +74,14 @@ export function validatePrerequisites(prereqs, errors, path) {
|
|
|
66
74
|
errors.push({ path: `${pPath}.detect.path`, message: 'detect.path is required for file detector' });
|
|
67
75
|
}
|
|
68
76
|
}
|
|
77
|
+
// `format` only applies to stored env values — silently ignoring it on
|
|
78
|
+
// command/file detectors would mask author mistakes.
|
|
79
|
+
if (p.format !== undefined && detect.type !== 'env') {
|
|
80
|
+
errors.push({
|
|
81
|
+
path: `${pPath}.format`,
|
|
82
|
+
message: 'format is only valid on env-type prerequisites',
|
|
83
|
+
});
|
|
84
|
+
}
|
|
69
85
|
});
|
|
70
86
|
}
|
|
71
87
|
//# sourceMappingURL=prerequisites.js.map
|
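The two `format` checks added to `validatePrerequisites` can be read in isolation: the value must be in the allow-list, and it is only accepted on `env` detectors. A standalone sketch, assuming a hypothetical `checkFormat` helper that returns its errors instead of pushing onto a shared array:

```javascript
const VALID_FORMATS = ['jwt'];

// Hypothetical helper: mirrors the two format rules from the diff.
function checkFormat(prereq, pPath) {
  const errors = [];
  if (prereq.format === undefined) return errors;

  if (!VALID_FORMATS.includes(prereq.format)) {
    errors.push({
      path: `${pPath}.format`,
      message: `format must be one of: ${VALID_FORMATS.join(', ')}`,
    });
  }
  // A format on a command/file detector is reported rather than ignored.
  if (prereq.detect && prereq.detect.type !== 'env') {
    errors.push({
      path: `${pPath}.format`,
      message: 'format is only valid on env-type prerequisites',
    });
  }
  return errors;
}
```

For example, `checkFormat({ format: 'jwt', detect: { type: 'command' } }, 'prerequisites[0]')` yields one error, which matches the intent stated in the diff's comment that silently ignoring the field would mask author mistakes.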
package/dist/src/cli/version.js
CHANGED
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "moflo",
|
|
3
|
-
"version": "4.9.
|
|
3
|
+
"version": "4.9.32",
|
|
4
4
|
"description": "MoFlo — AI agent orchestration for Claude Code. A standalone, opinionated toolkit with semantic memory, learned routing, gates, spells, and the /flo issue-execution skill.",
|
|
5
5
|
"main": "dist/src/cli/index.js",
|
|
6
6
|
"type": "module",
|
|
@@ -64,6 +64,7 @@
|
|
|
64
64
|
},
|
|
65
65
|
"dependencies": {
|
|
66
66
|
"@anush008/tokenizers": "^0.6.0",
|
|
67
|
+
"@modelcontextprotocol/sdk": "^1.0.0",
|
|
67
68
|
"js-yaml": "^4.1.1",
|
|
68
69
|
"lru-cache": "^11.3.5",
|
|
69
70
|
"onnxruntime-node": "^1.24.3",
|
|
@@ -72,6 +73,18 @@
|
|
|
72
73
|
"tar": "^7.5.11",
|
|
73
74
|
"valibot": "^1.3.1"
|
|
74
75
|
},
|
|
76
|
+
"peerDependencies": {
|
|
77
|
+
"imapflow": "^1.0.0",
|
|
78
|
+
"mailparser": "^3.0.0"
|
|
79
|
+
},
|
|
80
|
+
"peerDependenciesMeta": {
|
|
81
|
+
"imapflow": {
|
|
82
|
+
"optional": true
|
|
83
|
+
},
|
|
84
|
+
"mailparser": {
|
|
85
|
+
"optional": true
|
|
86
|
+
}
|
|
87
|
+
},
|
|
75
88
|
"overrides": {
|
|
76
89
|
"hono": ">=4.11.4",
|
|
77
90
|
"picomatch": ">=2.3.2",
|
|
@@ -84,7 +97,7 @@
|
|
|
84
97
|
"@typescript-eslint/eslint-plugin": "^7.18.0",
|
|
85
98
|
"@typescript-eslint/parser": "^7.18.0",
|
|
86
99
|
"eslint": "^8.0.0",
|
|
87
|
-
"moflo": "^4.9.
|
|
100
|
+
"moflo": "^4.9.31",
|
|
88
101
|
"tsx": "^4.21.0",
|
|
89
102
|
"typescript": "^5.9.3",
|
|
90
103
|
"vitest": "^4.0.0"
|