kc-beta 0.7.5 → 0.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (81)
  1. package/README.md +47 -0
  2. package/package.json +3 -2
  3. package/src/agent/context.js +17 -1
  4. package/src/agent/engine.js +467 -100
  5. package/src/agent/llm-client.js +24 -1
  6. package/src/agent/pipelines/_advance-hints.js +92 -0
  7. package/src/agent/pipelines/_milestone-derive.js +325 -20
  8. package/src/agent/pipelines/skill-authoring.js +49 -3
  9. package/src/agent/tools/agent-tool.js +2 -2
  10. package/src/agent/tools/consult-skill.js +15 -0
  11. package/src/agent/tools/dashboard-render.js +48 -1
  12. package/src/agent/tools/document-parse.js +31 -2
  13. package/src/agent/tools/phase-advance.js +17 -13
  14. package/src/agent/tools/release.js +343 -7
  15. package/src/agent/tools/sandbox-exec.js +65 -8
  16. package/src/agent/tools/worker-llm-call.js +95 -15
  17. package/src/agent/workspace.js +25 -4
  18. package/src/cli/components.js +4 -1
  19. package/src/cli/index.js +125 -8
  20. package/src/config.js +19 -2
  21. package/src/marathon/driver.js +217 -0
  22. package/src/marathon/prompts.js +93 -0
  23. package/template/.env.template +17 -1
  24. package/template/AGENT.md +2 -2
  25. package/template/skills/en/auto-model-selection/SKILL.md +55 -35
  26. package/template/skills/en/bootstrap-workspace/SKILL.md +27 -0
  27. package/template/skills/en/compliance-judgment/SKILL.md +14 -0
  28. package/template/skills/en/confidence-system/SKILL.md +30 -8
  29. package/template/skills/en/corner-case-management/SKILL.md +53 -33
  30. package/template/skills/en/cross-document-verification/SKILL.md +88 -83
  31. package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
  32. package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  33. package/template/skills/en/data-sensibility/SKILL.md +19 -12
  34. package/template/skills/en/document-chunking/SKILL.md +99 -15
  35. package/template/skills/en/entity-extraction/SKILL.md +14 -4
  36. package/template/skills/en/quality-control/SKILL.md +23 -0
  37. package/template/skills/en/rule-extraction/SKILL.md +92 -94
  38. package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
  39. package/template/skills/en/skill-authoring/SKILL.md +85 -2
  40. package/template/skills/en/skill-creator/SKILL.md +25 -3
  41. package/template/skills/en/skill-to-workflow/SKILL.md +73 -1
  42. package/template/skills/en/task-decomposition/SKILL.md +1 -1
  43. package/template/skills/en/tree-processing/SKILL.md +1 -1
  44. package/template/skills/en/version-control/SKILL.md +15 -0
  45. package/template/skills/en/work-decomposition/SKILL.md +52 -32
  46. package/template/skills/phase_skills.yaml +5 -0
  47. package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
  48. package/template/skills/zh/bootstrap-workspace/SKILL.md +27 -0
  49. package/template/skills/zh/compliance-judgment/SKILL.md +51 -37
  50. package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
  51. package/template/skills/zh/confidence-system/SKILL.md +34 -9
  52. package/template/skills/zh/corner-case-management/SKILL.md +71 -104
  53. package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
  54. package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
  55. package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
  56. package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  57. package/template/skills/zh/data-sensibility/SKILL.md +13 -0
  58. package/template/skills/zh/document-chunking/SKILL.md +101 -18
  59. package/template/skills/zh/document-parsing/SKILL.md +65 -65
  60. package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
  61. package/template/skills/zh/entity-extraction/SKILL.md +78 -68
  62. package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
  63. package/template/skills/zh/quality-control/SKILL.md +23 -0
  64. package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
  65. package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
  66. package/template/skills/zh/rule-extraction/SKILL.md +199 -188
  67. package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
  68. package/template/skills/zh/skill-authoring/SKILL.md +136 -58
  69. package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
  70. package/template/skills/zh/skill-creator/SKILL.md +215 -201
  71. package/template/skills/zh/skill-creator/references/schemas.md +60 -60
  72. package/template/skills/zh/skill-to-workflow/SKILL.md +73 -1
  73. package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
  74. package/template/skills/zh/task-decomposition/SKILL.md +1 -1
  75. package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
  76. package/template/skills/zh/tree-processing/SKILL.md +67 -63
  77. package/template/skills/zh/version-control/SKILL.md +15 -0
  78. package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
  79. package/template/skills/zh/work-decomposition/SKILL.md +52 -30
  80. package/template/workflows/common/llm_client.py +168 -0
  81. package/template/workflows/common/utils.py +132 -0
@@ -1,80 +1,9 @@
- # Chunking Strategies for Long Documents
+ # Chunking Strategies
 
- When regulation documents exceed what you can process in a single pass, use these proven strategies to decompose them into manageable chunks while preserving semantic coherence.
+ The chunking methodology (onion-peeler + wedge fallback + balance
+ heuristics) now lives in the `document-chunking` skill. Consult that
+ skill directly when designing chunking for rule extraction or any
+ downstream processing.
 
- ## The Onion Peeler (Primary Strategy)
-
- Hierarchical header-based decomposition. Named because you peel the document layer by layer, from the outermost structure inward.
-
- ### How It Works
-
- 1. **Parse the document's header hierarchy.** Identify all headers by level (H1, H2, H3, etc. — or their equivalents in the document's formatting: "Part I", "Chapter 1", "Section 1.1", "Article 1").
- 2. **Build a tree.** Each header becomes a node. Content between headers belongs to the nearest preceding header at that level.
- 3. **Check sizes.** Walk the tree. If a node's content (including all its children) fits within your processing limit, stop — this node is a chunk.
- 4. **Split only when necessary.** If a node exceeds the limit, descend to its children. Only split when a node is too large AND has sub-headers to split on.
- 5. **Leaf nodes that are still too large** get handled by the wedge-driving fallback (see below).
-
- ### Why This Works
-
- - Respects the document's own semantic structure. A "Chapter 3: Risk Disclosure" chunk contains exactly what the author intended that chapter to contain.
- - Minimizes information loss. You never cut in the middle of a thought.
- - Produces chunks of varying size — and that is fine. A short chapter is better as one chunk than split into artificial halves.
-
- ### Pattern Discovery Shortcut
-
- Before building a full parser, explore several sample documents for structural patterns:
- - Do all chapter titles start with "Chapter X" or "第X章"?
- - Are sections numbered consistently (1.1, 1.2, 1.3)?
- - Are there visual markers (bold text, specific fonts, horizontal rules)?
-
- If you find consistent patterns, a regex-based splitter is faster and more reliable than LLM-based structure detection. For example:
- `^第[一二三四五六七八九十百]+章` for Chinese chapter headers
- `^Chapter \d+` for English chapter headers
- `^\d+\.\d+` for numbered sections
-
- Always validate the regex against multiple documents before committing to it.
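As an aside, a minimal sketch of the regex-splitter shape this removed section describes (function name and pattern choices are illustrative, not from the package):

```python
import re

# One compiled pattern per discovered convention; validate on several documents first.
CHAPTER_RE = re.compile(r"^(?:第[一二三四五六七八九十百]+章|Chapter \d+)", re.MULTILINE)

def split_by_chapter(text: str) -> list[str]:
    """Split at chapter headers, keeping each header with the body that follows it."""
    starts = [m.start() for m in CHAPTER_RE.finditer(text)]
    if not starts:
        return [text]  # no recognizable headers: fall back to other strategies
    bounds = ([0] if starts[0] != 0 else []) + starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if text[a:b].strip()]
```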
-
- ## Wedge Driving (Fallback Strategy)
-
- For content without clear headers — dense legal text, continuous prose, or leaf nodes from the onion peeler that are still too large.
-
- ### How It Works
-
- The algorithm uses a **rolling context window** to process documents of arbitrary length without loading the full text at once.
-
- **Step 1: Window the content.** Load up to MAX_TOKENS (e.g., 100K tokens — configurable) of the remaining unprocessed text into a window. If the remaining text fits in a single chunk, stop — no further splitting needed.
-
- **Step 2: Ask an LLM for cut points.** Prompt the LLM to identify 1-3 natural break points within the window where topic or subject changes. For each cut point, the LLM returns:
- `tokens_before`: ~K tokens (default K=50) immediately BEFORE the cut, copied verbatim from the text.
- `tokens_after`: ~K tokens immediately AFTER the cut, copied verbatim.
- `chunk_title`: a 5-10 word title describing the chunk that precedes the cut.
-
- Using token count (not word count) gives consistent granularity across languages — critical for Chinese text which has no whitespace-delimited words.
-
- **Step 3: Locate the cuts via fuzzy matching.** The LLM's quoted tokens will not be a perfect match to the source text (minor paraphrasing, whitespace differences, encoding artifacts). Use Levenshtein distance (edit distance) to find the best match:
- 1. Search the source text for the position that best matches `tokens_before`. Require at least 70% similarity (similarity = 1 - edit_distance / max_length).
- 2. The cut position is immediately after the matched `tokens_before` region.
- 3. Verify by checking that `tokens_after` appears near the cut position. If `tokens_after` cannot be matched, fall back to the position derived from `tokens_before` alone.
-
- **Step 4: Slide and repeat.** Create a chunk from the text before the first confirmed cut. Move the window forward: the new window starts from the last cut point. Repeat until all remaining text fits in a single chunk.
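Likewise illustrative only, a sketch of Step 3's fuzzy location, assuming the `python-Levenshtein` package; the canonical implementation now lives in the `document-chunking` skill:

```python
from typing import Optional

import Levenshtein  # pip install python-Levenshtein

def locate_cut(source: str, tokens_before: str, min_similarity: float = 0.70) -> Optional[int]:
    """Return the index just after the best fuzzy match for tokens_before, or None."""
    w = len(tokens_before)
    best_pos: Optional[int] = None
    best_sim = min_similarity
    # Naive full scan for clarity; a real implementation strides or restricts
    # the search region to keep this affordable on long windows.
    for start in range(len(source) - w + 1):
        cand = source[start:start + w]
        sim = 1 - Levenshtein.distance(cand, tokens_before) / max(len(cand), w)
        if sim > best_sim:
            best_sim, best_pos = sim, start + w
    return best_pos  # confirm tokens_after appears near this index before trusting it
```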
-
- ### Why This Works
-
- - The LLM identifies semantic boundaries, not arbitrary character counts.
- - The LLM never regenerates text — it only quotes positions. No hallucination risk.
- - K-token quoting with Levenshtein matching is language-agnostic. It works for Chinese, English, and mixed-language documents equally well.
- - The rolling window means documents of any length can be processed incrementally — the algorithm is not bounded by context window size.
- - Fuzzy matching handles the inevitable small differences between the LLM's quoted text and the actual source.
-
- ### When to Use
-
- - Only when the onion peeler cannot split further (no sub-headers available).
- - For documents with no structural markup at all.
- - Cost consideration: this requires LLM calls. Use the cheapest model that can identify topic boundaries (often TIER3 or TIER4 is sufficient).
-
- ## Practical Guidelines
-
- - **Chunk size depends on the downstream task.** For rule extraction by the coding agent, chunks can be large (100K+ tokens). For worker LLM processing, chunks must fit in 16K-32K context.
- - **Preserve context.** When splitting, include the parent header chain as context. A chunk from "Part II > Chapter 3 > Section 3.2" should include those headers so downstream processing knows where the content belongs.
- - **Cache the tree.** Once a document's structure is parsed, save the tree. Multiple rules may need content from the same document, and re-parsing is wasteful.
- - **Log your chunking decisions.** Which strategy was used, how many chunks were produced, their sizes. This helps debug downstream issues.
+ This stub remains for legacy references; new content goes to
+ `document-chunking`.
@@ -28,6 +28,8 @@ rule-skills/
 
 Not every rule needs all of these. A simple threshold check might only need SKILL.md and a script. A complex semantic rule might need detailed references and many samples. Start minimal, add as needed during testing.
 
+ **Filename case matters.** Use uppercase `SKILL.md` (matching the meta-skill convention you see in `template/skills/`). On Linux filesystems this is case-sensitive; engine path-matching, audit scripts, and downstream tooling all assume uppercase. Do not write `skill.md`, `Skill.md`, or any other case variant.
+
 ## Granularity: 1 rule = 1 skill directory (default)
 
 Default to **one rule per skill directory**. Group rules into the same file ONLY when they meet BOTH:
@@ -41,12 +43,33 @@ When grouping, name the file with the explicit range so downstream consumers (wo
 
 ### Anti-pattern: the unified runner
 
- If you find yourself writing a single `unified_qc.py` (or `batch_runner.py`, or `master_check.py`) that handles all 110 rules in one Python file, **stop**. That means your per-rule skills are wrong, not that the architecture is wrong. Fix the skills.
+ If you find yourself writing a single `unified_qc.py` (or `batch_runner.py`, or `master_check.py`) that handles all rules in one Python file, **stop**. That means your per-rule skills are wrong, not that the architecture is wrong. Fix the skills.
 
- E2E #4 demonstrated the cost: an agent wrote `unified_qc.py` to bypass 110 individual skills it didn't trust. Result was 1,150 errors out of 6,930 production checks (16.6%) and a phase counter stuck in `production_qc` while real work happened in skill_authoring. The unified runner felt productive locally and was a global mistake.
+ A failure pattern worth flagging: an agent writes a unified runner like `unified_qc.py` to bypass individual skills it doesn't trust. The result is cascading errors: a single rule's failure corrupts every other rule's verdict, easily producing 15%+ error rates across thousands of production checks. The unified runner feels productive locally and is a global mistake. It also stalls the phase model: with no individual `check.py` files landing on disk, the engine can't credit the work toward milestone completion.
 
 If individual skills aren't running cleanly, the right response is to identify which ones break and fix them, not consolidate. The whole pipeline (extraction → skill_testing → distillation → production_qc) assumes one rule = one verifiable artifact.
 
+ ### Anti-pattern: stub SKILL.md OR stub check.py
+
+ Each rule_skill folder MUST have BOTH a substantive `SKILL.md` AND a substantive `check.py` (or `check.py` that imports + calls a workflow that does the real work). One side being a stub breaks the contract.
+
+ **Variant 1**: stub `SKILL.md` (a templated ~20-line scaffold with `检查逻辑: N/A` or equivalent) paired with a real `check.py` (regex methodology embedded in code). SKILL.md is supposed to be the human-readable methodology document. A reader scanning the rule folder for "what does this verify and why" gets nothing. The methodology has been pushed entirely into `check.py` comments — works for the engine, loses the deliverable framing.
+
+ **Variant 2**: substantive `SKILL.md` (real methodology, PASS/FAIL criteria, source cross-refs) paired with stub `check.py` (a thin scaffold returning `{"verdict": "NOT_APPLICABLE", "evidence": "Check requires worker LLM execution"}`). The real check logic lives in `workflows/<rule_id>/workflow.py` — but `check.py` doesn't import or call it. A user running `python rule_skills/R01-01/check.py document.txt` gets `NOT_APPLICABLE` on every input, which is misleading.
+
+ **Variant 3 (legacy)**: stub `check.py` returning `{"pass": null, "method": "stub"}` paired with an otherwise-real SKILL.md. Methodology described but never executable.
+
+ **Variant 4 (the "monolithic verify engine" stub)**: per-rule SKILL.md is a thin 20-35 line scaffold ("see verify_engine.py"), per-rule check.py is a thin shim that imports + calls a single monolithic `rule_skills/verify_engine.py` (or similar root-level file) that holds all 15-20 rules' verification logic in one ~750-LOC file. Each per-rule check.py looks like `from rule_skills.verify_engine import check_R01_01; return check_R01_01(doc)`. This passes the "check.py is not literally a stub" surface check but inverts the canonical per-rule granularity: per-rule files contain no rule-specific reasoning, the monolith holds everything, and the read-this-skill-to-understand-this-rule workflow fails for everyone (developer user, future auditor, downstream agent). The contract says skills are KC's unit of per-rule granularity for a reason. Centralizing all check logic into one big file may look efficient but loses the per-rule auditability that's the whole point. If you find yourself writing `verify_engine.py` with 15+ check functions and stub SKILL.md/check.py per rule, stop — keep the methodology in each rule's SKILL.md (substantive) and either inline the check logic or use the canonical per-rule check.py + workflow_v1.py pattern.
+
+ **The contract**:
+ - ✓ DO: SKILL.md describes WHAT to check + WHY + WHEN to flag it. Substantive — typically 50-300 lines, not a 20-line template.
+ - ✓ DO: check.py implements the check. EITHER substantive direct logic OR `from workflows.<rule_id>.workflow_v1 import verify` + delegate. Returns concrete verdicts.
+ - ✗ DON'T: stub SKILL.md with methodology in check.py comments (variant 1).
+ - ✗ DON'T: substantive SKILL.md with check.py that returns NOT_APPLICABLE without delegating to a workflow (variant 2).
+ - ✗ DON'T: stub check.py returning null verdict (variant 3, legacy).
+
+ The engine may refuse phase advance if too many check.py files are stub-shaped. Better to author them substantively now.
+
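A minimal sketch of the valid "delegate" shape from the contract above (R01_01 is a hypothetical rule ID; underscores keep the workflow path importable as a Python module):

```python
# rule_skills/R01_01/check.py: substantive by delegation, not a stub
import json
import sys

from workflows.R01_01.workflow_v1 import verify  # the real check logic lives here

def check(document_text: str) -> dict:
    # Delegate and return a concrete verdict dict, never a null/stub placeholder.
    return verify(document_text)

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        print(json.dumps(check(f.read()), ensure_ascii=False))
```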
 ## Writing SKILL.md
 
 ### Frontmatter
@@ -102,6 +125,24 @@ Scripts should be self-contained Python files that can be imported or executed.
 
 Do not put LLM prompts in scripts. LLM interactions belong in the SKILL.md body or in the workflow (later phase).
 
+ ### Strip reviewer annotations before keyword matching
+
+ Sample documents often carry reviewer-annotation footers (`预期命中点: ...`, `标注: ...`, `Expected: ...`) that mark the ground-truth verdict for testing. If your check.py uses keyword/regex matching against the document body, these annotations will leak into the match — producing false-positive PASS on violation samples (your rule "finds" the disclosure keyword inside the annotation itself, not the actual document content).
+
+ The canonical helper ships at `workflows/common/utils.py` and is auto-populated into every workspace at engine init:
+
+ ```python
+ from workflows.common.utils import strip_annotations
+
+ def check(document_text):
+     text = strip_annotations(document_text)
+     # ... your real check logic against `text`, not document_text
+ ```
+
+ Recognized prefixes (Chinese + English variants): 预期命中点, 预期结果, 预期判定, 预期验证, 标注, 审核标注, Expected, expected, EXPECTED, Annotation, annotation. Pass `extra_prefixes=("...", "...")` if your project uses different labels.
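For orientation, a sketch of the shape such a helper might take. This is an assumption about the implementation, not the shipped `utils.py`:

```python
from typing import Iterable, Tuple

_DEFAULT_PREFIXES: Tuple[str, ...] = (
    "预期命中点", "预期结果", "预期判定", "预期验证", "标注", "审核标注",
    "Expected", "expected", "EXPECTED", "Annotation", "annotation",
)

def strip_annotations(text: str, extra_prefixes: Iterable[str] = ()) -> str:
    """Drop lines that open with a reviewer-annotation prefix (ASCII or full-width colon)."""
    prefixes = _DEFAULT_PREFIXES + tuple(extra_prefixes)
    markers = tuple(p + c for p in prefixes for c in (":", "："))
    kept = [ln for ln in text.splitlines() if not ln.lstrip().startswith(markers)]
    return "\n".join(kept)
```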
+
+ A recurring failure mode worth flagging: a non-trivial fraction of rules have standalone check.py false-positive PASS on violation samples because the regex matches the `预期命中点: ...` annotation footer instead of the document body. KC ships the helper as a template file so this trap is one import away from being avoided.
+
 ## Writing References
 
 `references/` holds content that the coding agent reads on demand:
@@ -118,6 +159,48 @@ Keep references factual and sourced. They are evidence, not instructions.
 - **samples.json**: Annotated examples. Each entry: the input (extracted text or entity), the expected result (pass/fail/missing), and the expected comment. Build this incrementally as you test.
 - **corner_cases.json**: Edge cases that the standard logic does not handle. Each entry: description, detection pattern, resolution, and confidence threshold. See the `corner-case-management` skill for the methodology.
 
+ ## Authoring methodology (from skill-creator core)
+
+ This section folds in the universal authoring patterns from Anthropic's upstream `skill-creator`. Apply them on top of the KC-specific layout above when drafting any new rule skill.
+
+ ### Capture intent before drafting
+
+ Before writing any skill, get clear on four questions:
+
+ 1. **What should this skill enable Claude (or check.py) to do?** — A single concrete capability, not a category.
+ 2. **When should it trigger?** — What user phrases / document contexts should match its description.
+ 3. **What's the expected output format?** — verdict + comment + evidence shape; or for non-check skills, the deliverable shape.
+ 4. **Do we need test samples?** — If the rule has objective pass/fail criteria (almost all KC rules do), yes. Build `assets/samples.json` incrementally as edge cases appear.
+
+ If the conversation already contains worked examples (a user pointed at a passage and said "this is non-compliant"), extract answers from history first — don't ask the user to repeat themselves.
+
+ ### Frontmatter and progressive disclosure
+
+ Skills load in three tiers; budget each tier for what it has to carry:
+
+ 1. **Metadata (name + description)** — always in the agent's context. ~100 words. This is the *primary triggering mechanism* — make it specific and slightly "pushy" (Claude tends to under-trigger skills). Include trigger keywords, rule ID, the regulation it derives from, and the document location it expects to find evidence in.
+ 2. **SKILL.md body** — loaded when the skill triggers. Target 100–300 lines for typical rules, hard ceiling 500. Explain the WHY behind the rule, not just the mechanics.
+ 3. **Bundled resources** (scripts/, references/, assets/) — loaded only when the body explicitly points to them. Big regulation excerpts and sample corpora belong here, not inline.
+
+ If SKILL.md is approaching 500 lines, that's the cue to push detail down into `references/` and leave a pointer like "See references/edge-cases.md for the full enumeration of corner cases."
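A hypothetical frontmatter illustrating the metadata tier (rule ID, regulation scope, and evidence locations are invented for the example):

```yaml
---
name: r01-01-leverage-ratio-disclosure
description: Verify the fund report discloses its leverage ratio and that it
  sits under the regulatory cap (hypothetical rule R01-01). Triggers on
  leverage / 杠杆率 / ratio-cap checks. Evidence expected in Chapter 2
  "Key Regulatory Metrics" or the page-1 summary table.
---
```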
+
+ ### Writing style
+
+ - **Imperative over passive.** "Extract the ratio" not "the ratio should be extracted."
+ - **Explain why, not just what.** Today's LLMs have good theory of mind — a one-line "the regulation flags this to protect retail investors" makes a downstream agent generalize correctly to a case you didn't enumerate.
+ - **Be wary of all-caps MUSTs and NEVERs.** If you find yourself reaching for them, that's usually a sign the underlying reasoning hasn't been made explicit. Reframe and explain.
+ - **Be specific about location.** "Look in Chapter 2, Section 'Key Regulatory Metrics' or the summary table on page 1" beats "look in the financial disclosures somewhere."
+
+ ### Test samples before scripts
+
+ After drafting SKILL.md, write 2–3 realistic sample inputs into `assets/samples.json` with their expected verdicts BEFORE you finalise `check.py`. The samples ground the script: every regex or keyword you add should be there to make a specific sample produce its expected verdict. Writing scripts in the abstract — without samples — almost always produces over-fitted code that fails on the first real document.
+
+ ### Iterate, don't perfect
+
+ A rule skill rarely lands correctly on the first draft. Plan for at least one revision after testing surfaces problems. Don't pile on defensive MUSTs to handle every edge case — generalize the methodology. If three samples each needed a different one-off fix, that's a signal the underlying rule statement is too narrow.
+
+ If your skill needs more sophisticated methodology than this section covers — formal eval loops with quantitative benchmarks, blind A/B comparison between skill versions, or description-optimization runs — consult `skill-creator`.
+
 ## Iteration
 
 Skills evolve through testing. After each test iteration:
@@ -1,10 +1,22 @@
 ---
 name: skill-creator
 tier: meta
- description: Anthropic's skill-scaffolding toolkit use for iterating/improving existing skills or running evals on them, NOT as the primary reference for building KC's per-rule verification skills. For KC rule skills, consult `skill-authoring` first (canonical folder layout + granularity rules + KC-specific check.py entry-point conventions) and `work-decomposition` for ordering + grouping decisions. This skill applies once per-rule skills exist and the agent wants to optimize their description/triggering or run formal evals.
+ description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
 ---
 
- # Skill Creator
+ # Skill creator (KC fallback)
+
+ This is the upstream Anthropic `skill-creator` methodology, vendored into
+ KC as a deep-reference for sophisticated skill authoring. KC's primary
+ skill-authoring path is the `skill-authoring` skill (which inherits the
+ core ideas from this skill). Only consult `skill-creator` directly when
+ `skill-authoring`'s guidance feels insufficient for the complexity of the
+ skill you need to write — e.g., when you want to run formal eval loops,
+ benchmark with variance analysis, or optimise the trigger description.
+ For ordering and grouping decisions about which skills to author first,
+ see `work-decomposition`.
+
+ ---
 
 A skill for creating new skills and iteratively improving them.
 
@@ -392,7 +404,7 @@ Use the model ID from your system prompt (the one powering the current session)
 
 While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.
 
- This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude with extended thinking to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.
+ This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.
 
 ### How skill triggering works
 
@@ -436,6 +448,11 @@ In Claude.ai, the core workflow is the same (draft → test → review → impro
 
 **Packaging**: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file.
 
+ **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. In this case:
+ - **Preserve the original name.** Note the skill's directory name and `name` frontmatter field -- use them unchanged. E.g., if the installed skill is `research-helper`, output `research-helper.skill` (not `research-helper-v2`).
+ - **Copy to a writeable location before editing.** The installed skill path may be read-only. Copy to `/tmp/skill-name/`, edit there, and package from the copy.
+ - **If packaging manually, stage in `/tmp/` first**, then copy to the output directory -- direct writes may fail due to permissions.
+
 ---
 
 ## Cowork-Specific Instructions
@@ -448,6 +465,7 @@ If you're in Cowork, the main things to know are:
 - Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button will download `feedback.json` as a file. You can then read it from there (you may have to request access first).
 - Packaging works — `package_skill.py` just needs Python and a filesystem.
 - Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
+ - **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. Follow the update guidance in the claude.ai section above.
 
 ---
 
@@ -478,3 +496,7 @@ Repeating one more time the core loop here for emphasis:
 Please add steps to your TodoList, if you have such a thing, to make sure you don't forget. If you're in Cowork, please specifically put "Create evals JSON and run `eval-viewer/generate_review.py` so human can review test cases" in your TodoList to make sure it happens.
 
 Good luck!
+
+ ---
+
+ *Methodology adapted from [anthropics/skills](https://github.com/anthropics/skills) (Apache 2.0).*
@@ -70,7 +70,7 @@ If you do escalate to LLM:
 - **tier2-3**: bulk extraction with simple semantic checks
 - **tier4** (cheapest): high-volume keyword-spotting that regex can't handle. Note: tier4 models on SiliconFlow are Qwen3.5 thinking-mode — `content` can return empty if `reasoning_content` consumes max_tokens. Test with realistic prompts before relying. If you see empty responses, either bump max_tokens to ≥8192, shorten your prompt, or fall back to tier1-2.
 
- Both v0.7.1 audit conductors (DS and GLM) defaulted to all-regex distillation and only added LLM escalation when the human user explicitly asked for "V2 with worker LLM". If your rule catalog has any rules where the verification is genuinely semantic, you should reach for `worker_llm_call` yourself — don't wait to be asked.
+ A recurring failure pattern worth flagging: agents default to all-regex distillation and only add LLM escalation when the human user explicitly asks for a "V2 with worker LLM". If your rule catalog has any rules where the verification is genuinely semantic, you should reach for `worker_llm_call` yourself — don't wait to be asked.
 
 ## Workflow Structure
 
@@ -148,6 +148,17 @@ All numbers here (10 documents, 5 percentage points, etc.) are recommended start
 
 This follows the same tier-transition framework as parser escalation in `document-parsing`: a quality/accuracy score drives the decision to stay, escalate, or skip.
 
+ ### Picking the model inside a tier — quick reference
+
+ The tier framework above answers "which tier is right for this step?". Inside a tier slot, "which specific model?" still matters. A few heuristics that hold today (refresh from `auto-model-selection` — specifics change in months, not years):
+
+ - **Tier 1 / Tier 2 worker workhorse**: the current-generation flagship MoE LLMs (200-400B total / ~20B activated experts) are a reasonable starting baseline. Qwen's family flagship and DeepSeek's current premium model are both in this shape; either works.
+ - **Tier 3 / Tier 4 small model**: prefer Qwen family for sub-30B options — many cheap, reliable choices. Skip the `coder` / `code` named variants at small sizes (unreliable for general worker tasks). Prefer no-thinking-mode variants when available; these tasks don't benefit from reflection.
+ - **Provider stacking**: routing conductor and worker through different providers can isolate per-model throttle / rate-limit exposure (e.g., DeepSeek for workers, SiliconFlow for conductor).
+ - **VLM / OCR**: characters / handwriting / seals → dedicated OCR model (Paddle-OCR, GLM-OCR, DeepSeek-OCR or their successors). Complex graphs / tables → larger general VLM.
+
+ For up-to-date facts (exact model names, context windows, pricing), consult `auto-model-selection` and use Context7. The heuristics above go stale fast — the *shape* (MoE flagship for workhorse, sub-30B non-thinking for cheap bulk, OCR-specific for chars) is what stays.
+
 ## Testing Against Ground Truth
 
 The coding agent's skill-based results are the ground truth. For each document in Samples/:
@@ -187,3 +198,64 @@ Worker LLMs are accessed via SiliconFlow API. Connection details are in `.env`:
 - Model names in `TIER1` through `TIER4`
 
 See `references/worker-llm-catalog.md` for current model capabilities and context window sizes.
+
+ ## Two access paths: `worker_llm_call` tool (preferred) vs direct HTTP
+
+ KC ships a `worker_llm_call` tool. Use it whenever possible — the engine sees every call, can track cost + token spend, applies rate limiting, and surfaces in audit. A batch mode is supported:
+
+ ```
+ worker_llm_call({
+   tier: "tier1",
+   prompts: ["check doc A...", "check doc B...", "check doc C..."],
+   system_prompt: "You are a compliance assistant. Reply with JSON {verdict, evidence, confidence}.",
+   concurrency: 5  // 1-10, default 5
+ })
+ ```
+
+ Returns a `{n_total, n_succeeded, n_failed, total_tokens_in, total_tokens_out, results: [...]}` summary. Partial failures don't fail the whole batch.
+
+ ### The canonical `workflows/common/llm_client.py` (shipped from template)
+
+ For a workflow that runs **standalone** (no KC session — e.g., a customer deploys the release bundle and runs `python run.py doc.pdf`), the workflow has no access to `worker_llm_call`. The canonical HTTP client shim ships as a template file and is auto-populated into every workspace's `workflows/common/llm_client.py` at engine init. **Do not write your own.** Use the file that's already there:
+
+ ```python
+ from workflows.common.llm_client import call
+
+ result = call(
+     tier="tier2",
+     prompt=user_prompt,
+     system_prompt="You are a compliance assistant. Reply with JSON.",
+     max_tokens=2048,
+ )
+ # result = {"response": "...", "model_used": "...", "tier": "tier2",
+ #           "tokens_in": N, "tokens_out": N}
+ ```
+
+ What the shim does:
+ - Reads `LLM_API_KEY` + `LLM_BASE_URL` + `TIER1..4` from `.env` (provider-agnostic — works with SiliconFlow, OpenAI, Anthropic, Aliyun, etc.)
+ - Sends OpenAI-format chat completions to the configured base URL
+ - Writes a line to `output/llm_ledger.jsonl` per call so KC audits can reconstruct cost even when worker_llm_call wasn't used
+ - Raises an explicit error if `LLM_BASE_URL` is missing (no silent fallback to a hardcoded vendor)
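For orientation only, a minimal sketch of that call path (not the shipped file, which also handles tier-to-model mapping details, retries, and richer errors), assuming `requests` is available and `.env` is already loaded into the environment:

```python
import json
import os
import time

import requests  # assumed available in the workspace sandbox

def call(tier: str, prompt: str, system_prompt: str = "", max_tokens: int = 2048) -> dict:
    base_url = os.environ["LLM_BASE_URL"]   # KeyError here is the explicit failure
    model = os.environ[tier.upper()]        # e.g. "tier2" reads the TIER2 model name
    resp = requests.post(
        f"{base_url}/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['LLM_API_KEY']}"},
        json={"model": model, "max_tokens": max_tokens, "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ]},
        timeout=120,
    )
    resp.raise_for_status()
    data = resp.json()
    result = {
        "response": data["choices"][0]["message"]["content"],
        "model_used": model,
        "tier": tier,
        "tokens_in": data["usage"]["prompt_tokens"],
        "tokens_out": data["usage"]["completion_tokens"],
    }
    ledger_row = {k: v for k, v in result.items() if k != "response"}
    with open("output/llm_ledger.jsonl", "a", encoding="utf-8") as ledger:
        ledger.write(json.dumps({"ts": time.time(), **ledger_row}) + "\n")
    return result
```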
+
+ **Don't write your own llm_client.py from scratch.** A recurring failure mode worth flagging: agents repeatedly roll their own shim — buggy (stale model IDs, hardcoded vendor URL, no ledger) and invisible to the engine. Use the canonical shim; if it's missing for some reason, copy it from `template/workflows/common/llm_client.py` in the kc-beta install (the engine also auto-populates at init — check the `workflows_common_populated` event in events.jsonl).
+
+ ### Worse anti-pattern: writing a parallel hand-rolled client AT THE SAME TIME as the canonical one
+
+ A persistent failure pattern across runs: the canonical `workflows/common/llm_client.py` is present (engine auto-populated it at init), AND the agent ALSO writes its own `workflows/llm_client.py` or `verify_engine_v2.py` with a `requests.post(...)` HTTP call. The agent then uses the hand-rolled one for all the real LLM work. Both files sit in the workspace. Engine cost-tracking sees nothing.
+
+ When this happens, three things go wrong: (1) Provider routing breaks. The hand-rolled client typically reads `LLM_BASE_URL` from workspace `.env` — which is the **conductor**'s endpoint. KC's worker routing (via `worker_llm_call` and the engine's worker_* config) gets bypassed entirely. If the operator configured workers to a separate provider (say, DeepSeek workers + SiliconFlow conductor), the hand-rolled client wrongly hits the conductor's provider with the worker's model names — producing 400s or, worse, silent wrong-model results. (2) Cost / audit visibility is lost. The engine doesn't see the calls; neither does `output/llm_ledger.jsonl` (the canonical client's ledger isn't written by the hand-rolled one). The session looks like it did zero LLM work, but the actual LLM bill exists. (3) Rate-limit / retry / timeout behavior diverges. The canonical client + worker_llm_call inherit engine-level resilience patterns (AbortSignal.timeout, withRetry on 429/5xx, etc.). A hand-rolled `requests.post` has none of that — it stalls, throws ad-hoc errors, or silently corrupts the run.
+
+ **Rule of thumb**: if the verification rule needs an LLM judgment, your two valid options are `worker_llm_call` (when running inside a KC session) or `from workflows.common.llm_client import call` (when running standalone from a release bundle). If you find yourself typing `import requests` or `urllib.request.urlopen` for an LLM call, stop. That code path will be flagged in audit as a recurring adoption miss and rewritten — save the round trip and use the right tool the first time.
+
+ ## sandbox_exec timeout for known-slow commands
+
+ Default `sandbox_exec` timeout is 120 seconds. For commands you expect to take longer — LLM batch processing, large regression runs, document parsing — pass an explicit `timeout_ms` (up to 600000ms = 10 minutes). Don't fight the default by re-batching artificially small chunks; that wastes turns and obscures intent.
+
+ ```
+ sandbox_exec({
+   command: "python scripts/v2_full_test.py",
+   timeout_ms: 480000  // 8 minutes for 14 rules × 6 docs through worker LLM
+ })
+ ```
+
+ If you're at the 10-minute ceiling and still timing out, split the work into multiple invocations OR delegate to a subagent (subagent timeouts are independent of the parent's).
@@ -126,7 +126,7 @@ Five failure modes recur across projects. Learn to recognize them early.
 
 ## Integration
 
- Task decomposition sits between rule extraction and skill authoring in the KC Reborn lifecycle. It is the bridge that translates abstract rules into concrete implementation plans.
+ Task decomposition sits between rule extraction and skill authoring in the KC lifecycle. It is the bridge that translates abstract rules into concrete implementation plans.
 
 **Input**: A rule catalog from `rule-extraction`. Each rule is an atomic, testable verification requirement. If a rule is not yet atomic, send it back to rule extraction for further decomposition before attempting task decomposition.
 
@@ -58,7 +58,7 @@ Spend time here. The patterns you find determine whether the tree builder is a s
 - This is fast, deterministic, and reliable. Prefer this when it works.
 
 **If patterns are inconsistent or absent**:
- - Use the LLM-guided wedge-driving approach (see `rule-extraction/references/chunking-strategies.md` for the full algorithm: rolling context window, K-token quoting, Levenshtein fuzzy matching).
+ - Use the LLM-guided wedge-driving approach (see the `document-chunking` skill for the full algorithm: rolling context window, K-token quoting, Levenshtein fuzzy matching).
 - This is slower and costs LLM calls, but handles unstructured documents. The rolling window means even very large unstructured leaf nodes can be chunked incrementally.
 
 **If the document has a table of contents**:
@@ -164,3 +164,18 @@ Every evolution cycle (see `evolution-loop`) should:
 3. Log the version change with the evolution iteration number.
 
 The version history, combined with the evolution logs, gives you a complete timeline of how the system evolved and why.
+
+ ## Per-rule check.py — preserve v1 before v2 rewrite
+
+ When iterating a rule's verification logic from a v1 baseline (often pure regex) to a v2 implementation (often LLM-augmented or hybrid), **copy the v1 file to a sibling before overwriting**:
+
+ ```bash
+ cp rule_skills/Rxx/check.py rule_skills/Rxx/check_v1.py
+ # now write the v2 version to check.py
+ ```
+
+ Convention:
+ - `check.py` always points at the current best version
+ - `check_v1.py`, `check_v2.py`, ... preserve prior iterations
+
+ This way the v1 lives alongside v2 in the same directory rather than relying on workspace git archaeology (`git log -- check.py` works but is friction). Engine-level `verify_engine_v1.py` / `verify_engine_v2.py` preserve the orchestrator separately; per-rule files need their own convention.
@@ -6,7 +6,7 @@ description: Decide how to decompose the rule set into TaskBoard tasks during ru
 
 # Work Decomposition
 
- KC's main agent is the conductor. The conductor decides what work to do next — and that decision is upstream of every other choice that follows. Wrong decomposition makes the rest of the run expensive: if rules are processed in the wrong order, the agent re-designs the same shape three times. If unrelated rules are bundled into one skill, the resulting check.py becomes the unified-runner anti-pattern from E2E #4. If related rules are split across separate skills, the agent re-derives the shared chunker logic 17 times.
+ KC's main agent is the conductor. The conductor decides what work to do next — and that decision is upstream of every other choice that follows. Wrong decomposition makes the rest of the run expensive: if rules are processed in the wrong order, the agent re-designs the same shape three times. If unrelated rules are bundled into one skill, the resulting check.py drifts into the unified-runner anti-pattern. If related rules are split across separate skills, the agent re-derives the shared chunker logic many times over.
 
 This skill is the conductor's playbook for that decision. It's tagged `tier: meta-meta` because work decomposition is a system-level discipline, not a per-rule technique. The complementary `task-decomposition` skill (also `tier: meta-meta`) covers the *internal* structure of one rule's check — locate, extract, normalize, judge, comment. This skill covers how the rule **set** should be split into TaskBoard items.
 
@@ -15,6 +15,14 @@ This skill is the conductor's playbook for that decision. It's tagged `tier: met
 - **Entering rule_extraction.** Read the regulation, decompose into rules, then decide how those rules will be ordered and grouped before declaring the phase done. Coverage audit + chunk refs are downstream of these decisions.
 - **Entering skill_authoring.** TaskBoard is empty (engine no longer auto-populates per-rule tasks). Read the rule list from `describeState`, decide grouping + order, then call `TaskCreate` for each unit of work.
 - **Mid-run re-decomposition.** If the TaskBoard feels wrong (rules accumulating in the wrong order, an obviously-bundled pair across two tasks), stop adding work and re-decompose. The cost of pausing 5 minutes to re-plan is recovered within 2 rules of better-shaped work.
+ - **Any phase with 3+ parallel sub-goals.** If you find yourself juggling multiple parallel sub-goals in working memory (3+ rules × docs, multiple deliverable-prep items in finalization, several QC batches in production_qc), drop them into the TaskBoard and work serially. Every phase from rule_extraction through finalization benefits from explicit tasks once parallel sub-goals appear — distillation and production_qc included.
+
+ ## Quick rule: when does the TaskBoard belong?
+
+ - Touching 3+ rules or docs in parallel? → `TaskCreate` one task per rule/doc before starting.
+ - Single 1-step request? → Skip; just do it.
+ - Subagent internal coordination? → Skip; subagents don't get the TaskBoard.
+ - Anything you'd otherwise need to remember "I'll come back to that"? → TaskCreate it now. Working memory loses partial completions over long turns.
 
 ## Locked principles
 
@@ -32,7 +40,7 @@ Pick one explicitly and write it into your first PATTERNS.md entry. "I'm going S
 
 Process the **hardest** rule first. Use the chunker, verdict shape, and worker tier that hard rule demands as the design floor. Process subsequent rules in descending difficulty, each one a degenerate case of the machinery already built.
 
- **When to pick:** the rule set has uneven complexity and you suspect a few hard rules will dictate the shape (almost always true for compliance / regulatory work). E2E #5 GLM accidentally followed this path and produced 0.6% ERROR on real LLM-driven workflows; DS started bottom-up and shipped 78% NOT_APPLICABLE.
+ **When to pick:** the rule set has uneven complexity and you suspect a few hard rules will dictate the shape (almost always true for compliance / regulatory work). When this method is followed correctly it tends to produce sub-1% ERROR rates on real LLM-driven workflows; the bottom-up alternative typically over-produces NOT_APPLICABLE verdicts because the easy rules' machinery can't handle the hard cases at the end.
 
 **Why "Huffman" not "Shannon" for the analogy:** Huffman builds optimal prefix codes by processing low-frequency symbols first. KC's analogue is the high-cost-per-rule, low-frequency rules — the R028s that dominate the design space even though there are few of them. Touch them first. The easy rules inherit the framework cheaply.
 
@@ -97,10 +105,10 @@ Keep separate when ANY of:
 - Rules apply to different document types (one applies only to public-fund reports, another only to private-fund reports)
 - One rule's failure mode is a specific failure mode of another (don't bundle parent + child rules — the child's check redundantly re-runs the parent's)
 
- The v0.6.2 D2 anti-pattern wording captures the failure case clearly:
+ The anti-pattern wording captures the failure case clearly:
 > If you find yourself writing a unified_qc.py-style monolith that bypasses individual skills, your per-rule skills are wrong. Fix them, don't replace them.
 
- That came from E2E #4 where one conductor wrote a 2,400-line `unified_qc.py` that ran all rules at once. It produced 1,150 ERROR verdicts (16.6%) because every rule's failure cascaded into every other rule's verdict. Per-rule skills are KC's unit of granularity for a reason.
+ A failure mode worth flagging: a conductor writes a 2,000+ line `unified_qc.py` that runs all rules at once. The result is cascading errors: every rule's failure corrupts every other rule's verdict, easily producing 15%+ ERROR rates on production checks. Per-rule skills are KC's unit of granularity for a reason.
 
 ### Anti-pattern: stub check.py + real workflow.py
 
@@ -144,23 +152,13 @@ iterations of the skill (changes to regulation interpretation, edge
 cases discovered in production) need a single canonical place to
 update — the skill — not N workflows that have drifted independently.
 
- E2E #6 v070 surfaced this pattern (DS bundled-skill check.py files
- all returned `{"pass": null, "method": "stub"}` deferring to
- workflows/). v0.7.1 added this anti-pattern explicitly.
+ Two failure modes worth flagging:
+
+ **The pure-stub failure:** bundled-skill check.py files all return `{"pass": null, "method": "stub"}` deferring to `workflows/`. Methodology is described in SKILL.md but never executable from the skill folder.
 
- E2E #7 v071 showed the teaching prevented the stub anti-pattern in
- both conductors (no `{"pass": null}` patterns in either run), but
- **DS still inverted the canonical-vs-distilled relationship**: DS's
- 6 thematic skill folders had SKILL.md only (no check.py), with the
- real verification code living in `workflows/<skill>/check.py`. The
- absence of stubs is good; the inversion is not — editing a rule then
- requires touching both SKILL.md (the doc) and the workflow check.py
- (the code). Single source of truth is lost.
+ **The inverted-canonical failure:** the agent avoids stubs (good) but inverts the canonical-vs-distilled relationship — thematic skill folders have only SKILL.md (no check.py), with the real verification code living inside `workflows/<skill>/check.py`. The absence of stubs is good; the inversion is not — editing a rule then requires touching both SKILL.md (the doc) and the workflow check.py (the code). Single source of truth is lost.
 
- GLM v071 by contrast landed the canonical pattern: 97/97 skills had
- both SKILL.md AND a real `check.py` (median 143 LOC of regex +
- applicability logic), and `workflows/<id>/workflow_v1.py` was a
- 50-line thin wrapper that imported and called it:
+ The canonical landing looks like this: every skill has BOTH a substantive SKILL.md AND a real `check.py` (regex + applicability logic), and `workflows/<id>/workflow_v1.py` is a thin wrapper (~50 LOC) that imports and calls it:
 
 ```python
 # workflows/D01-01/workflow_v1.py — thin wrapper, 52 LOC
@@ -177,11 +175,7 @@ def run(doc_text: str, meta: dict = None) -> dict:
     return result
 ```
 
- This is the v0.7.2+ canonical pattern: workflow is a shim that
- points at the skill's check.py. To iterate on a rule's verification,
- edit `rule_skills/<id>/check.py`. The workflow doesn't change. v0.7.2
- clarifies the teaching: avoid stubs AND keep the canonical
- relationship (skill is canonical, workflow is distilled wrapper).
+ This is the canonical pattern: workflow is a shim that points at the skill's check.py. To iterate on a rule's verification, edit `rule_skills/<id>/check.py`. The workflow doesn't change. The teaching has two parts: avoid stubs AND keep the canonical relationship (skill is canonical, workflow is distilled wrapper).
 
 ### Naming convention for grouped checks
 
@@ -347,7 +341,7 @@ When entering skill_authoring with an empty TaskBoard:
 
 ### Calling TaskCreate / TaskUpdate / TaskComplete
 
- The engine registers three task-board tools (v0.7.4):
+ The engine registers three task-board tools:
 
 - `TaskCreate({id, title, phase, ruleId?})` — adds a task to `tasks.json`. `id` must be unique within the session; pick a stable shape like `<rule_id>-<phase>` for per-rule tasks or `<group-name>-<phase>` for grouped / non-rule tasks. `phase` is the current phase the task belongs to. `ruleId` is optional — set it for per-rule tasks so the engine can credit the rule_id in milestone derivation.
 - `TaskUpdate({id, status?, summary?})` — change a task's status (`pending` / `in_progress` / `completed` / `failed`), optionally with a short summary.
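Illustrative call shapes (IDs and summaries are hypothetical; the notation mirrors the tool examples elsewhere in these skills):

```
TaskCreate({
  id: "R01-01-skill_authoring",        // <rule_id>-<phase>
  title: "Author rule skill R01-01",
  phase: "skill_authoring",
  ruleId: "R01-01"                     // lets milestone derivation credit the rule
})

TaskUpdate({ id: "R01-01-skill_authoring", status: "in_progress" })
TaskUpdate({
  id: "R01-01-skill_authoring",
  status: "completed",
  summary: "SKILL.md + check.py landed; 3 samples pass"
})
```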
@@ -355,7 +349,7 @@ The engine registers three task-board tools (v0.7.4):
 
 ### Ralph loop scope — within a phase only
 
- Important contract (changed in v0.7.4 after team feedback):
+ Important contract:
 
 - **Loop scope = current phase only.** TaskCreate populates tasks for the CURRENT phase. The Ralph loop processes them one by one within the phase.
 - **Loop exits at phase boundaries.** When all current-phase tasks complete OR the phase advances (you call `phase_advance`, or anything else changes `currentPhase`), the loop exits cleanly. Control returns to the user.
@@ -387,9 +381,9 @@ Three formats, each defensible. Pick one and stick with it:
 
 - **`rules/PATTERNS.md`** — concise, framework-only, updated as the project progresses. Best for greenfield projects with clear hypothesis-up-front structure. Capped at ~5 KB; entries are transferable shapes / project constraints / anti-patterns with rationale (see "What to write" above).
 
- - **`logs/phase_<name>_complete.md` per phase** — incremental, captures what each phase produced + decisions made + what the next phase inherits. Best for iterative discovery work where the framework crystallizes mid-run. E2E #7 GLM used this pattern across 6 phase docs and an `evolution_summary_v1.2.md`; the methodology was captured even though PATTERNS.md was never written.
+ - **`logs/phase_<name>_complete.md` per phase** — incremental, captures what each phase produced + decisions made + what the next phase inherits. Best for iterative discovery work where the framework crystallizes mid-run. A real example pattern: six phase docs plus an `evolution_summary_vN.md` capture the methodology even when PATTERNS.md is never written.
 
- - **`AGENT.md` decisions section + domain notes** — narrative-style, living document of "what we know" and "why". Best for projects with rich domain context to capture (regulations, edge cases, thresholds, sample format distributions). E2E #7 GLM's AGENT.md included regulation enforcement dates, product type taxonomies, threshold values, and sample format counts; this is fine — it's a different idiom for the same goal.
+ - **`AGENT.md` decisions section + domain notes** — narrative-style, living document of "what we know" and "why". Best for projects with rich domain context to capture (regulations, edge cases, thresholds, sample format distributions). An AGENT.md that records regulation enforcement dates, product type taxonomies, threshold values, and sample format counts works perfectly; it's a different idiom for the same goal.
 
 What you should NOT do: skip persistence and rely only on the live conversation context. By the time you have N skills authored without any persisted methodology, you've made N implicit decisions about verdict shape, chunker boundaries, and worker tier. Each rule re-derives from scratch. Refactoring requires touching N files instead of one.
 
@@ -397,8 +391,34 @@ What you should NOT do: skip persistence and rely only on the live conversation
 
 ✅ "Before each phase advance, write what I learned to whichever persistence file matches this project's idiom — even if it's tentative."
 
- E2E history:
- - E2E #6 v070 DS wrote PATTERNS.md only after a rollback. Per-skill decisions before that point had to be re-touched. v0.7.1 added "PATTERNS.md FIRST" reinforcement.
- - E2E #7 v071 neither DS nor GLM wrote PATTERNS.md, but GLM wrote 6 rich phase-completion logs and a comprehensive AGENT.md; the methodology WAS captured, just in different files. v0.7.2 blesses the broader principle: persist before you advance, format flexible.
+ Failure modes worth flagging:
+ - An agent writes PATTERNS.md only after a rollback. All the per-skill decisions made before that point have to be re-touched. "PATTERNS.md FIRST" exists because of this cost.
+ - An agent skips PATTERNS.md entirely but writes rich phase-completion logs and a comprehensive AGENT.md. The methodology IS captured, just in different files — which is fine. The broader principle: persist before you advance, format flexible.
+
+ The engine's filesystem-derived milestones verify coverage on disk regardless of how you split the work. The TaskBoard is your scratchpad; the disk is the contract; the persistence file is your project's memory.
+
+ ## Subagent batch work: rolling-window writes
+
+ When you dispatch N subagents to do batch work (regression tests, batch verification, parallel rule processing), DO NOT have them write to a shared coordination file. A failure mode worth flagging: subagents race on `tasks.json` / `rules/catalog.json` / `output/results/summary.json` — one takes the workspace lock for several minutes while the others wait silently.
+
+ The right pattern: each subagent writes to its OWN file under a known prefix. The parent aggregates after all subagents finish.
+
+ ```
+ sub_agents/
+   batch-001-regression/
+     output/results/v2_regression.json   # ❌ shared — races other subagents
+   batch-002-regression/
+     output/results/v2_regression.json   # ❌ same path, race
+
+ # vs:
+
+ output/
+   batch_regression_001.json             # ✓ each subagent owns one file
+   batch_regression_002.json             # ✓
+   batch_regression_003.json             # ✓
+ # Parent agent reads all batch_regression_*.json and writes the aggregate.
+ ```
+
+ Engine signal: if you see `lock_blocked` events in events.jsonl during subagent work, that's the symptom — the engine emits this event so the parent has visibility into contention before the subagent times out. If the pattern shows up, refactor to rolling-window writes.
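A parent-side aggregation sketch under the layout above (file names match the example; the verdict schema is hypothetical):

```python
import glob
import json

# Merge per-subagent result files once all subagents have finished.
results = []
for path in sorted(glob.glob("output/batch_regression_*.json")):
    with open(path, encoding="utf-8") as f:
        results.extend(json.load(f))  # each file: a JSON list of verdict dicts

summary = {
    "n_total": len(results),
    "n_pass": sum(1 for r in results if r.get("verdict") == "PASS"),
}
with open("output/v2_regression_aggregate.json", "w", encoding="utf-8") as f:
    json.dump({"summary": summary, "results": results}, f, ensure_ascii=False, indent=2)
```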
 
- The engine's filesystem-derived milestones (Group A v0.7.0) verify coverage on disk regardless of how you split the work. The TaskBoard is your scratchpad; the disk is the contract; the persistence file is your project's memory.
+ Don't write a "coordinate via file locking" subagent batch. The locking primitives exist for safety against accidental concurrent writes, not as a queue. Use the filesystem layout as the coordination mechanism.
@@ -23,6 +23,7 @@ phases:
   always_loaded:
     - bootstrap-workspace
   available:
+    - work-decomposition   # v0.8 P2-E
     - auto-model-selection
     - data-sensibility
     - document-parsing
@@ -60,6 +61,7 @@ phases:
   always_loaded:
     - evolution-loop
   available:
+    - work-decomposition   # v0.8 P2-E
     - skill-authoring
     - skill-to-workflow
     - tree-processing
@@ -74,6 +76,7 @@ phases:
     - skill-to-workflow
     - evolution-loop
   available:
+    - work-decomposition   # v0.8 P2-E
     - skill-authoring
     - task-decomposition
     - corner-case-management
@@ -87,6 +90,7 @@ phases:
     - quality-control
     - evolution-loop
   available:
+    - work-decomposition   # v0.8 P2-E
     - skill-authoring
     - skill-to-workflow
     - confidence-system
@@ -100,6 +104,7 @@ phases:
   always_loaded:
     - quality-control
   available:
+    - work-decomposition   # v0.8 P2-E
     - skill-authoring
     - skill-to-workflow
     - dashboard-reporting