mustflow 2.22.4 → 2.22.9

Files changed (72)
  1. package/README.md +17 -75
  2. package/dist/cli/commands/classify.js +2 -0
  3. package/dist/cli/commands/contract-lint.js +2 -2
  4. package/dist/cli/commands/dashboard.js +23 -75
  5. package/dist/cli/commands/help.js +8 -9
  6. package/dist/cli/commands/impact.js +2 -3
  7. package/dist/cli/commands/init.js +61 -5
  8. package/dist/cli/commands/run/receipt.js +1 -0
  9. package/dist/cli/commands/run.js +14 -1
  10. package/dist/cli/commands/update.js +2 -2
  11. package/dist/cli/commands/verify/evidence-input.js +269 -0
  12. package/dist/cli/commands/verify/input.js +212 -0
  13. package/dist/cli/commands/verify.js +23 -482
  14. package/dist/cli/commands/version-sources.js +2 -3
  15. package/dist/cli/i18n/en.js +5 -0
  16. package/dist/cli/i18n/es.js +5 -0
  17. package/dist/cli/i18n/fr.js +5 -0
  18. package/dist/cli/i18n/hi.js +5 -0
  19. package/dist/cli/i18n/ko.js +5 -0
  20. package/dist/cli/i18n/zh.js +5 -0
  21. package/dist/cli/lib/agent-context.js +6 -11
  22. package/dist/cli/lib/dashboard-export.js +2 -0
  23. package/dist/cli/lib/dashboard-mutations.js +79 -0
  24. package/dist/cli/lib/local-index/command-effect-index.js +25 -0
  25. package/dist/cli/lib/local-index/hashing.js +7 -0
  26. package/dist/cli/lib/local-index/index.js +127 -823
  27. package/dist/cli/lib/local-index/source-index.js +137 -0
  28. package/dist/cli/lib/local-index/verification-evidence.js +451 -0
  29. package/dist/cli/lib/local-index/workflow-documents.js +204 -0
  30. package/dist/cli/lib/mustflow-read.js +41 -0
  31. package/dist/cli/lib/project-root.js +1 -2
  32. package/dist/cli/lib/repo-map.js +65 -16
  33. package/dist/cli/lib/run-root-trust.js +27 -0
  34. package/dist/cli/lib/templates.js +124 -8
  35. package/dist/cli/lib/toml.js +6 -1
  36. package/dist/cli/lib/validation/constants.js +2 -0
  37. package/dist/cli/lib/validation/index.js +291 -22
  38. package/dist/cli/lib/validation/primitives.js +2 -2
  39. package/dist/cli/lib/validation/test-selection.js +2 -2
  40. package/dist/core/bounded-output.js +32 -7
  41. package/dist/core/change-classification-policy.js +47 -0
  42. package/dist/core/change-classification.js +10 -43
  43. package/dist/core/check-issues.js +7 -1
  44. package/dist/core/command-contract-validation.js +28 -4
  45. package/dist/core/command-env.js +1 -1
  46. package/dist/core/config-loading.js +9 -3
  47. package/dist/core/contract-lint.js +8 -3
  48. package/dist/core/correlation-id.js +16 -0
  49. package/dist/core/run-receipt.js +1 -0
  50. package/dist/core/safe-filesystem.js +11 -4
  51. package/dist/core/skill-route-alignment.js +1 -0
  52. package/dist/core/skill-route-explanation.js +9 -3
  53. package/dist/core/test-selection.js +2 -3
  54. package/dist/core/verification-scheduler.js +7 -6
  55. package/dist/core/version-sources.js +2 -3
  56. package/package.json +4 -1
  57. package/schemas/README.md +4 -0
  58. package/schemas/change-verification-report.schema.json +4 -0
  59. package/schemas/classify-report.schema.json +4 -0
  60. package/schemas/commands.schema.json +1 -0
  61. package/schemas/dashboard-export.schema.json +4 -0
  62. package/schemas/latest-run-pointer.schema.json +4 -0
  63. package/schemas/run-receipt.schema.json +4 -0
  64. package/schemas/verify-report.schema.json +4 -0
  65. package/schemas/verify-run-manifest.schema.json +4 -0
  66. package/templates/default/i18n.toml +3 -3
  67. package/templates/default/locales/en/.mustflow/skills/INDEX.md +10 -6
  68. package/templates/default/locales/en/.mustflow/skills/architecture-deepening-review/SKILL.md +25 -2
  69. package/templates/default/locales/en/.mustflow/skills/routes.toml +2 -2
  70. package/templates/default/locales/en/.mustflow/skills/security-privacy-review/SKILL.md +9 -1
  71. package/templates/default/locales/en/.mustflow/skills/test-design-guard/SKILL.md +9 -1
  72. package/templates/default/manifest.toml +1 -1
@@ -36,14 +36,18 @@ refer to `AGENTS.md` and `.mustflow/config/commands.toml` to implement the most
 
  ## Selection Convention
 
- - Choose one primary skill that best describes the main work. Prefer the most specific matching
- skill over a broad architecture or review skill.
+ - Choose one main route: a `primary` route for ordinary work or an `authoring` route for
+ mustflow authoring and maintenance work. Prefer the most specific matching skill over a broad
+ architecture or review skill.
+ - Treat `authoring` routes as selectable main routes, not adjunct routes. Use them when the task
+ creates or maintains mustflow-owned skills, context files, command contracts, route metadata, or
+ public documentation entries.
  - Add no more than two adjunct skills for secondary risks such as tests, documentation, security,
  privacy, release, or contract drift.
  - Treat event-triggered skills as inactive until the event occurs. For example, read
  `failure-triage` only after a configured command intent or verification step fails.
- - If several primary skills appear to match, choose the one tied to the files and behavior being
- changed now, then report the skipped plausible skills instead of reading every route.
+ - If several main routes appear to match, choose the one tied to the files and behavior being
+ changed now, then report the skipped plausible routes instead of reading every route.
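The selection convention in this hunk (one main route of type `primary` or `authoring`, at most two adjunct routes, event routes inactive until their event fires) can be sketched roughly as follows. This is an illustration only, not mustflow's implementation: the `Route` class, the `relevance` score, the `select_routes` function, and the `adjunct-*` route names are all hypothetical. The diff says only that the chosen main route should be the one tied to the files and behavior being changed now; a numeric score is assumed here as a stand-in for that judgment.

```python
from dataclasses import dataclass

MAIN_ROUTE_TYPES = {"primary", "authoring"}  # both count as main routes
MAX_ADJUNCTS = 2                             # "no more than two adjunct skills"


@dataclass(frozen=True)
class Route:
    name: str
    route_type: str   # "primary", "authoring", "adjunct", or "event"
    relevance: float  # hypothetical score; higher means closer to the current change


def select_routes(candidates, fired_events=frozenset()):
    """Pick one main route, at most two adjuncts, and only event routes that fired."""
    ranked = sorted(candidates, key=lambda r: r.relevance, reverse=True)
    mains = [r for r in ranked if r.route_type in MAIN_ROUTE_TYPES]
    adjuncts = [r for r in ranked if r.route_type == "adjunct"]
    events = [r for r in ranked
              if r.route_type == "event" and r.name in fired_events]
    return {
        "main": mains[0] if mains else None,
        # Plausible main routes that were not chosen are reported, not read.
        "skipped_main": mains[1:],
        "adjuncts": adjuncts[:MAX_ADJUNCTS],
        "events": events,
    }


candidates = [
    Route("security-privacy-review", "primary", 0.9),
    Route("artifact-integrity-check", "primary", 0.4),
    Route("failure-triage", "event", 1.0),
    Route("adjunct-a", "adjunct", 0.5),  # hypothetical adjunct names
    Route("adjunct-b", "adjunct", 0.7),
    Route("adjunct-c", "adjunct", 0.1),
]
selection = select_routes(candidates)
```

With no fired events, `selection["events"]` stays empty even though `failure-triage` has the highest score, which mirrors the rule that event-triggered skills stay inactive until the event occurs.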
 
  ## Classification Prefilter
 
@@ -78,8 +82,8 @@ test tasks from requiring a full read of architecture-pattern routes. Categories
 
  ## Specific Routes
 
- After choosing a category, choose one primary route and at most two adjunct routes. Event routes
- stay inactive until their event occurs.
+ After choosing a category, choose one main route (`primary` or `authoring`) and at most two adjunct
+ routes. Event routes stay inactive until their event occurs.
 
  ### Bug and Failure
 
@@ -2,7 +2,7 @@
  mustflow_doc: skill.architecture-deepening-review
  locale: en
  canonical: true
- revision: 1
+ revision: 2
  lifecycle: mustflow-owned
  authority: procedure
  name: architecture-deepening-review
@@ -37,6 +37,7 @@ This is a review-first skill. It helps decide whether code needs a deeper module
  ## Use When
 
  - The user asks for architecture review, module boundaries, structural improvement, codebase deepening, maintainability review, or testability improvement.
+ - The user asks where a design will break first as it grows, which responsibility boundary is most likely to blur, or whether a module, service, database owner, permission model, deployment unit, or failure boundary is still clear enough.
  - A file, module, service, handler, command, controller, or test suite looks broad enough that the next edit may add another responsibility.
  - Code exposes internal steps to many callers, repeats orchestration, or makes tests hard because policy, I/O, formatting, and dispatch are mixed.
  - A shallow wrapper adds naming without hiding complexity, or a helper has become a pass-through layer around many unrelated concerns.
@@ -57,6 +58,7 @@ This is a review-first skill. It helps decide whether code needs a deeper module
  - Target area, current pain, and the user-facing or maintainer-facing reason to inspect architecture.
  - Relevant source files, call sites, exports, tests, fixtures, schemas, templates, or documentation that show current behavior and ownership.
  - Local patterns for modules, boundaries, naming, errors, dependency direction, and tests.
+ - The data owner, write path, failure mode, and expected 3x, 10x, or 100x growth pressure when the review is about a design rather than only a file split.
  - Current changed-file list when the worktree is already dirty.
  - Relevant command-intent contract entries for verification.
 
@@ -88,7 +90,22 @@ This is a review-first skill. It helps decide whether code needs a deeper module
  3. Identify one to three candidate boundaries.
  - Each candidate must name the responsibility it would hide or clarify.
  - Reject candidates that only rename, wrap, or move code without lowering caller complexity or test cost.
- 4. Score each candidate from 1 to 9.
+ 4. Force the design through the ownership and failure questions before scoring.
+ - Name the first likely mixed-responsibility boundary. Common early failures are business rules leaking into controllers, repositories, external adapters, UI components, or framework-specific handlers.
+ - Name the final owner for important data. The owner is the module that protects the invariant, not necessarily the module that reads the value most often.
+ - Separate original state from cache, search index, analytics, summary, AI output, provider response, or other derived state.
+ - Identify every direct write path for high-impact fields such as status, role, permission, balance, quota, plan, entitlement, deleted state, payment state, or ownership.
+ - Ask whether a failure creates a visible failure state or silently creates false success. High-impact paths such as authorization, payment, entitlement, deletion, and destructive administration should fail closed.
+ - Ask whether duplicate requests, retries, webhook redelivery, queue replay, or worker restart can repeat a harmful effect. If yes, require an idempotency, ledger, outbox, or reconciliation boundary before calling the design safe.
+ 5. Check growth pressure in concrete stages.
+ - At 3x scale, look first for implementation-quality failures: missing indexes, N+1 reads, large responses, synchronous file or image work, repeated external calls, and insufficient connection pools.
+ - At 10x scale, look first for ownership and state failures: write hot spots, queue delay, cache invalidation, server-local files, scattered permission rules, external API rate limits, and deployment units that change for unrelated reasons.
+ - At 100x scale, look first for partitioning and operational failures: data split boundaries, tenant or region hot spots, retry storms, external dependency isolation, long deploy recovery, missing observability, and manual-only recovery paths.
+ 6. Check scaling direction without forcing premature distribution.
+ - A small team may start with one larger server or a simple server set, but request handlers should not depend on process memory, local uploads, duplicate cron execution, in-transaction external calls, or server-specific job state.
+ - Application servers should be able to become stateless. Databases may start with vertical scaling, but the design should not block read replicas, read models, queue-backed work, or future data partitioning.
+ - Horizontal scaling is only real if any server can handle the same request, workers can safely duplicate or retry work, and database writes do not all converge on an uncontrolled hot spot.
+ 7. Score each candidate from 1 to 9.
  - User value: whether the structure protects a user-visible or public contract.
  - Maintenance value: whether future changes become smaller or less error-prone.
  - Blast radius: how many callers, files, schemas, templates, or docs would change.
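The new step 4 asks whether duplicate requests, retries, or webhook redelivery can repeat a harmful effect, and whether a failure is visible rather than a silent false success. A minimal sketch of one such boundary is an idempotency key in front of a single write path. Everything here is hypothetical and for illustration only: the `PaymentLedger` class, its method names, and the in-memory dictionary, which stands in for the durable store a real design would need.

```python
class PaymentLedger:
    """Single owner and single write path for charges, keyed by idempotency key.

    Assumption: an in-memory dict stands in for a durable ledger table.
    """

    def __init__(self):
        self._results = {}        # idempotency_key -> recorded outcome
        self.charges_applied = 0  # counts real side effects, for inspection

    def charge(self, idempotency_key, amount_cents):
        # A replayed request returns the recorded outcome instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        # Fail closed: an invalid amount raises instead of reporting success.
        if amount_cents <= 0:
            raise ValueError("amount_cents must be positive")
        self.charges_applied += 1
        result = {"status": "charged", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


ledger = PaymentLedger()
first = ledger.charge("webhook-evt-1", 500)
replay = ledger.charge("webhook-evt-1", 500)  # simulated webhook redelivery
```

Here the replayed call returns the same recorded result and `charges_applied` stays at 1, so the harmful effect is not repeated. A production boundary would also have to handle concurrent first attempts and persistence across worker restarts, which this sketch does not attempt.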
@@ -111,6 +128,8 @@ This is a review-first skill. It helps decide whether code needs a deeper module
 
  - The output contains a ranked architecture candidate list or one scoped structural change.
  - Any chosen change has a named reason tied to lower change cost, lower defect risk, or better testability.
+ - Important data has a named owner, write path, original-or-derived classification, and failure behavior when the reviewed design touches durable state.
+ - Growth pressure is either checked at 3x, 10x, and 100x or explicitly marked not relevant to the current architecture decision.
  - Behavior changes are excluded or explicitly moved to a separate follow-up.
  - Verification evidence or verification gaps are reported without claiming unrun checks passed.
 
@@ -145,6 +164,10 @@ Use documentation and release checks only when the review or chosen change touch
 
  - Review target and current pain
  - Evidence inspected
+ - Data owner, write path, and original-versus-derived state when relevant
+ - Failure mode, idempotency, and recovery boundary when relevant
+ - 3x, 10x, and 100x growth pressure when relevant
+ - Vertical versus horizontal scaling direction when relevant
  - Candidate boundaries and scores
  - Selected next action
  - Narrower skill used or intentionally avoided
@@ -2,7 +2,7 @@ schema_version = "1"
 
  [routes."artifact-integrity-check"]
  category = "ui_assets"
- route_type = "adjunct"
+ route_type = "primary"
  priority = 80
  applies_to_reasons = ["package_metadata_change", "release_risk"]
 
@@ -212,7 +212,7 @@ applies_to_reasons = ["test_change"]
 
  [routes."security-privacy-review"]
  category = "security_privacy"
- route_type = "adjunct"
+ route_type = "primary"
  priority = 30
  applies_to_reasons = ["security_change", "privacy_change"]
 
@@ -2,7 +2,7 @@
  mustflow_doc: skill.security-privacy-review
  locale: en
  canonical: true
- revision: 16
+ revision: 17
  lifecycle: mustflow-owned
  authority: procedure
  name: security-privacy-review
@@ -31,6 +31,7 @@ Catch security, privacy, and disclosure risks introduced by ordinary code, docum
  ## Use When
 
  - A change touches authentication, authorization, sessions, admin behavior, tenant boundaries, personal data, secrets, tokens, credentials, API keys, or private files.
+ - A feature adds role, permission, administrator, internal-tool, feature-flag, emergency-access, support, or back-office exceptions that could make the authorization model less explicit over time.
  - A change comes from AI-generated code, vibe-coded output, copied examples, or a broad assistant patch that may have optimized for the happy path without proving abuse boundaries.
  - A change adds or modifies logging, telemetry, diagnostics, receipts, reports, caches, generated state, retention, redaction, export, or external transmission.
  - A change adds or modifies behavior analytics events, event schemas, page views, clicks, searches, impressions, scroll data, experiments, attribution, request traces, or observability data that may include personal data or sensitive context.
@@ -76,6 +77,7 @@ Catch security, privacy, and disclosure risks introduced by ordinary code, docum
  - Changed files, diff summary, and the user goal.
  - Sensitive data, actor, trust boundary, storage, logging, retention, export, or external disclosure surfaces involved.
  - Actor, resource owner, tenant boundary, server-side authorization rule, state-changing route, external network target, dependency source, and agent/tool permission surface involved.
+ - Permission model shape when authorization is involved: actor, resource, action, scope, condition, default decision, exception path, emergency-access path, and audit expectation.
  - Read, list, search, update, delete, upload, attach, download, invite, billing, and admin actions affected, including whether the server scopes each action by actor, owner, workspace, organization, team, role, or capability.
  - Cookie, JWT, OAuth, file upload, file download, business-value, database mutation, ORM bulk operation, CI/CD permission, deployment setting, or secret-source surface involved.
  - Cryptographic primitive, password hashing, random-token, secure transport, certificate validation, scanner gate, or security invariant involved.
@@ -126,6 +128,9 @@ Catch security, privacy, and disclosure risks introduced by ordinary code, docum
  - Treat client-provided actor ids, role names, workspace ids, plan names, prices, discounts, entitlement flags, and status values as untrusted input. Derive trusted actor and tenant context from server-side authentication and membership checks.
  - Check list, search, detail, attachment, export, and download paths as carefully as mutation paths. Read access is still data access.
  - Reject mass assignment. Server code should allowlist mutable fields instead of passing raw request bodies into database updates where privileged fields could be set by the client.
+ - Review permission rules as actor, resource, action, scope, and condition rather than role name alone. "Admin can do it" is not enough; the rule should say which administrator can perform which action on which resource and under which tenant or system scope.
+ - Treat growing exceptions such as `isAdmin`, hardcoded user ids, company-email suffixes, internal-tool bypasses, feature-flag bypasses, or support-only shortcuts as authorization-model decay. Replace them with explicit capabilities, scoped roles, or time-limited emergency access.
+ - Emergency access should have a reason, time limit, notification or approval path, and audit log. It should not become a permanent silent superuser branch.
  7. For high-impact admin operations, require a server-side capability or role check, actor attribution, target identity, reason or change note where useful, before/after evidence, and a rollback, preview, or recovery path proportionate to the impact.
  High-impact examples include publish/unpublish, slug change, redirect change, canonical change, robots or sitemap change, filter definition change, advertisement slot or policy change, cache purge, search reindex, ranking refresh, bulk edit, and role or permission change.
  8. For high-risk content claims, require source attribution, jurisdiction or market, effective date, verification date, risk tier, review owner, affected-content lookup, and human approval before publication when the domain is legal, privacy, finance, health, safety, eligibility, pricing, ranking, comparison, or compliance.
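The permission guidance added in this hunk (rules expressed as actor, resource, action, and scope rather than a role name alone, with emergency access that carries a reason, a time limit, and an audit record) could look roughly like the sketch below. All names are hypothetical and chosen for illustration: `Rule`, `EmergencyGrant`, `is_allowed`, and the example roles and tenants do not come from mustflow. The per-rule condition and the notification or approval path are omitted to keep the sketch short.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Rule:
    actor_role: str
    resource: str
    action: str
    scope: str  # a tenant id, or "system"


@dataclass(frozen=True)
class EmergencyGrant:
    actor_id: str
    resource: str
    action: str
    scope: str
    reason: str           # required: why the exception exists
    expires_at: datetime  # required: the exception is time-limited


def is_allowed(rules, grants, audit_log, *, actor_id, actor_role,
               resource, action, scope, now):
    """Return True only for an explicit rule or a live, reasoned emergency grant."""
    request = (resource, action, scope)
    for rule in rules:
        if (rule.actor_role, rule.resource, rule.action, rule.scope) == (actor_role, *request):
            return True
    for grant in grants:
        matches = (grant.actor_id, grant.resource, grant.action, grant.scope) == (actor_id, *request)
        if matches and grant.reason and now < grant.expires_at:
            # Emergency use is never silent: every use is recorded.
            audit_log.append({"actor_id": actor_id, "request": request,
                              "reason": grant.reason, "at": now})
            return True
    return False  # default deny


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
rules = [Rule("billing_admin", "invoice", "refund", "tenant-a")]
grant = EmergencyGrant("u2", "invoice", "refund", "tenant-b",
                       reason="incident follow-up", expires_at=now + timedelta(hours=1))
audit_log = []
```

The design choice worth noting is that there is no `isAdmin` branch: a `billing_admin` in `tenant-a` gets nothing in `tenant-b`, and the emergency grant stops working once `expires_at` passes.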
@@ -194,6 +199,8 @@ Catch security, privacy, and disclosure risks introduced by ordinary code, docum
  - Public and packaged surfaces do not include unnecessary secrets, personal data, or misleading privacy guarantees.
  - Admin operations, shared-cache behavior, generated-state rebuilds, and audit logs are treated as security-sensitive when they affect private data, permissions, public indexing, traffic, or monetization.
  - Client-side permission displays, file upload or download flows, private asset URLs, and API response fields are treated as disclosure and access-control surfaces.
+ - Permission models define actor, resource, action, scope, condition, and default-deny behavior when authorization is involved, or the missing model is reported as a risk.
+ - Administrator, support, internal-tool, feature-flag, and emergency-access exceptions are audited, time-bounded, or reported as authorization-model drift.
  - Behavior analytics, observability, and audit logs are separated by durability, retention, attribution, personal-data, and loss-tolerance expectations.
  - Core security, privacy, billing, entitlement, file, search, job, webhook, and administrator events are internally owned or explicitly reported as SaaS-only with the resulting export, retention, and incident-reconstruction risk.
  - Trace context, baggage, request ids, user ids, tenant ids, job ids, and webhook ids are reviewed for sensitive data, external propagation, retention, and backend portability when those surfaces exist.
@@ -240,6 +247,7 @@ Use a narrower configured test, build, or documentation intent when it better pr
  - Data residency, data classification, AI processing location, runtime patch, and hard-limit policy checked when relevant
  - Claim, comparison, affiliate, user-generated content, data-ownership, deletion, anonymization, export, and retention boundaries checked when relevant
  - Authorization, session, token, input, file, network, business-logic, dependency, cryptography, transport, deployment, scanner, and agent-tool boundaries checked
+ - Permission exception and emergency-access boundaries checked when relevant
  - Redaction, omission, or wording changes made
  - Related security-regression test need
  - Command intents run
@@ -2,7 +2,7 @@
  mustflow_doc: skill.test-design-guard
  locale: en
  canonical: true
- revision: 1
+ revision: 2
  lifecycle: mustflow-owned
  authority: procedure
  name: test-design-guard
@@ -31,6 +31,8 @@ Guard the design quality of new tests and new test cases. This skill prevents in
 
  This skill does not force TDD order. It requires evidence that each new or changed test proves an observable behavior contract.
 
+ Good tests prove that important assumptions fail loudly. They should protect the risky behavior, boundary, state, permission, cost, or integration condition that would matter in production rather than only proving that the happy path can be demonstrated once.
+
  <!-- mustflow-section: use-when -->
  ## Use When
 
@@ -54,6 +56,7 @@ This skill does not force TDD order. It requires evidence that each new or chang
  - Behavior contract source: user request, issue, bug report, schema, command contract, public docs, fixture, template, or current behavior.
  - Existing tests, fixtures, and helpers near the behavior.
  - Intended test objective and changed files.
+ - Risk list for the changed behavior, including money, permissions, deletion, external calls, AI cost, queues, files, data ownership, retries, timeouts, partial failure, or concurrency when those risks exist.
  - Baseline status when using a failing test as evidence.
  - Relevant command-intent contract entries.
 
@@ -78,6 +81,7 @@ This skill does not force TDD order. It requires evidence that each new or chang
 
  1. Confirm the contract and coverage.
  - Name the observable behavior being protected.
+ - Name the production risk the test is supposed to catch. If no risk can be named, prefer reusing existing coverage or reporting the idea as speculative.
  - Reuse or strengthen existing tests when they already cover the behavior.
  - Treat uncovered ideas without a contract source as suggestions, not tests.
  2. Select the smallest useful test shape.
@@ -98,6 +102,8 @@ This skill does not force TDD order. It requires evidence that each new or chang
  5. Check assertion quality.
  - Assert at least one observable result: return value, exit code, stdout or stderr, state change, file output, emitted effect, schema result, error shape, or user-visible contract.
  - Mock interaction assertions may support a test, but they must not be the only evidence of behavior unless the mock interaction itself is the public contract.
+ - For high-risk boundaries, prefer assertions over final state, stored records, rejected access, idempotency outcome, usage record, emitted event, or durable failure status rather than only asserting that a mocked collaborator was called.
+ - Treat tests that mock every database, transaction, authorization, serialization, queue, provider, or filesystem boundary as unit evidence only. Require a nearby integration, contract, fixture, or schema check when the real boundary is the risk.
  6. Choose verification by objective.
  - Use a semantic objective such as `new_behavior`, `bug_regression`, `security_negative`, `stale_test_cleanup`, `contract_sync`, `release_surface`, or `docs_or_template_contract`.
  - Start with the narrowest configured intent that proves the objective.
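The assertion-quality bullets added in step 5 prefer assertions on final state and rejected access over "a mocked collaborator was called". The sketch below contrasts the two under assumed names: `InMemoryAccounts`, `withdraw`, and both test functions are hypothetical and only illustrate the idea. The mock assertion is kept as supporting evidence, while the stored balance carries the behavior claim.

```python
from unittest.mock import Mock


class InMemoryAccounts:
    """Small fake store so tests can inspect the final state."""

    def __init__(self, balances):
        self.balances = dict(balances)


def withdraw(accounts, notifier, account_id, amount):
    balance = accounts.balances[account_id]
    # Reject invalid or overdrawn requests before any state change.
    if amount <= 0 or amount > balance:
        raise ValueError("withdrawal rejected")
    accounts.balances[account_id] = balance - amount
    notifier.send(account_id, amount)


def test_withdraw_updates_stored_balance():
    accounts = InMemoryAccounts({"acct-1": 100})
    notifier = Mock()
    withdraw(accounts, notifier, "acct-1", 30)
    # Primary evidence: the stored record changed as the contract says.
    assert accounts.balances["acct-1"] == 70
    # Supporting evidence only: the collaborator was called once.
    notifier.send.assert_called_once_with("acct-1", 30)


def test_overdraft_is_rejected_without_side_effects():
    accounts = InMemoryAccounts({"acct-1": 100})
    notifier = Mock()
    try:
        withdraw(accounts, notifier, "acct-1", 500)
    except ValueError:
        pass
    else:
        raise AssertionError("overdraft should be rejected")
    # The risk being protected: no money moves and nothing is announced.
    assert accounts.balances["acct-1"] == 100
    notifier.send.assert_not_called()
```

Both functions follow the pytest naming convention but can also be called directly. Because the store is a fake, this still counts as unit evidence in the sense of the hunk above; a real database or transaction boundary would need a nearby integration or contract check.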
@@ -110,6 +116,7 @@ This skill does not force TDD order. It requires evidence that each new or chang
  ## Postconditions
 
  - Each new or changed test has a contract source, selected test shape, and observable assertion.
+ - Each new or changed test has a named risk, or the final report explains why the change is low-risk or already covered.
  - RED evidence is classified as `behavior_red`, `api_scaffold_red`, `invalid_red`, or `not_applicable`.
  - Speculative edge cases and duplicate coverage are reported instead of silently added.
  - Verification uses configured command intents and reports any missing or skipped coverage.
@@ -142,6 +149,7 @@ Prefer the narrowest configured intent that proves the selected objective. `test
  ## Output Format
 
  - Contract source
+ - Production risk being protected
  - Verification objective
  - Selected test shape: `example`, `boundary`, `property`, `mixed`, or `not_applicable`
  - Cases reused
@@ -1,6 +1,6 @@
  id = "default"
  name = "default"
- version = "2.22.4"
+ version = "2.22.9"
  description = "Minimal workflow for LLM agents to read, edit, and verify their work in a repository."
  common_root = "common"
  locales_root = "locales"