kc-beta 0.6.2 → 0.7.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +81 -0
- package/LICENSE-COMMERCIAL.md +125 -0
- package/README.md +21 -3
- package/package.json +14 -5
- package/src/agent/context-window.js +9 -12
- package/src/agent/context.js +14 -1
- package/src/agent/document-parser.js +169 -0
- package/src/agent/engine.js +382 -19
- package/src/agent/history/event-history.js +222 -0
- package/src/agent/llm-client.js +55 -0
- package/src/agent/message-utils.js +63 -0
- package/src/agent/pipelines/_milestone-derive.js +566 -0
- package/src/agent/pipelines/base.js +21 -0
- package/src/agent/pipelines/distillation.js +28 -15
- package/src/agent/pipelines/extraction.js +130 -36
- package/src/agent/pipelines/finalization.js +178 -11
- package/src/agent/pipelines/index.js +6 -1
- package/src/agent/pipelines/initializer.js +74 -8
- package/src/agent/pipelines/production-qc.js +31 -44
- package/src/agent/pipelines/skill-authoring.js +97 -80
- package/src/agent/pipelines/skill-testing.js +106 -23
- package/src/agent/retry.js +10 -2
- package/src/agent/scheduler.js +14 -2
- package/src/agent/session-state.js +18 -1
- package/src/agent/skill-loader.js +13 -7
- package/src/agent/skill-validator.js +19 -5
- package/src/agent/task-manager.js +61 -5
- package/src/agent/tools/document-chunk.js +21 -9
- package/src/agent/tools/phase-advance.js +37 -5
- package/src/agent/tools/release.js +51 -9
- package/src/agent/tools/rule-catalog.js +11 -1
- package/src/agent/tools/workspace-file.js +32 -0
- package/src/agent/workspace.js +39 -1
- package/src/cli/components.js +64 -14
- package/src/cli/index.js +62 -3
- package/src/cli/meme.js +26 -25
- package/src/config.js +65 -22
- package/src/model-tiers.json +24 -8
- package/src/providers.js +42 -0
- package/template/release/v1/README.md.tmpl +108 -0
- package/template/release/v1/catalog.json.tmpl +4 -0
- package/template/release/v1/kc_runtime/__init__.py +11 -0
- package/template/release/v1/kc_runtime/confidence.py +63 -0
- package/template/release/v1/kc_runtime/doc_parser.py +127 -0
- package/template/release/v1/manifest.json.tmpl +11 -0
- package/template/release/v1/render_dashboard.py +117 -0
- package/template/release/v1/run.py +212 -0
- package/template/release/v1/serve.sh +17 -0
- package/template/skills/en/meta-meta/work-decomposition/SKILL.md +326 -0
- package/template/skills/en/skill-creator/SKILL.md +1 -1
- package/template/skills/zh/meta-meta/work-decomposition/SKILL.md +321 -0
- package/template/skills/zh/skill-creator/SKILL.md +1 -1

package/template/skills/en/meta-meta/work-decomposition/SKILL.md
@@ -0,0 +1,326 @@
---
name: work-decomposition
description: Decide how to decompose the rule set into TaskBoard tasks during the rule_extraction → skill_authoring transition. Covers ordering methodologies (difficulty-first / Shannon–Huffman, breadth-first, depth-first, binary partition), grouping rules (when to bundle multiple rules into one task vs. keeping them separate), three-axis difficulty estimation, and how to write PATTERNS.md project memory that stays useful across the run. Use when entering rule_extraction, when entering skill_authoring, or whenever the TaskBoard feels wrong and you want to re-decompose.
---

# Work Decomposition

KC's main agent is the conductor. The conductor decides what work to do next — and that decision is upstream of every other choice that follows. Wrong decomposition makes the rest of the run expensive: if rules are processed in the wrong order, the agent re-designs the same shape three times. If unrelated rules are bundled into one skill, the resulting check.py becomes the unified-runner anti-pattern from E2E #4. If related rules are split across separate skills, the agent re-derives the shared chunker logic 17 times.

This skill is the conductor's playbook for that decision. It ships under `meta-meta/` because work decomposition is a system-level discipline, not a per-rule technique. The complementary `task-decomposition` skill (also under `meta-meta/`) covers the *internal* structure of one rule's check — locate, extract, normalize, judge, comment. This skill covers how the rule **set** should be split into TaskBoard items.

## When to use this skill

- **Entering rule_extraction.** Read the regulation, decompose it into rules, then decide how those rules will be ordered and grouped before declaring the phase done. The coverage audit and chunk refs are downstream of these decisions.
- **Entering skill_authoring.** The TaskBoard is empty (the engine no longer auto-populates per-rule tasks). Read the rule list from `describeState`, decide grouping + order, then call `TaskCreate` for each unit of work.
- **Mid-run re-decomposition.** If the TaskBoard feels wrong (rules accumulating in the wrong order, an obviously-bundled pair split across two tasks), stop adding work and re-decompose. The cost of pausing 5 minutes to re-plan is recovered within 2 rules of better-shaped work.

## Locked principles

1. **Hard tracking, soft executing.** The engine derives milestones from disk facts (`rule_skills/<id>/SKILL.md`, `check_*.py`, `workflows/<id>/...`) regardless of how you grouped your tasks (see the sketch after this list). Coverage is engine-verified; grouping is your call. You cannot bypass the floor by clever task naming, but the floor doesn't dictate task shape.
2. **The hardest rule contains the most information.** Hard rules force the chunker, classifier, verdict shape, and worker LLM tier you'll need. Easy rules can compile down to a subset of that machinery cheaply. Encode the hard cases first; let the easy cases inherit.
3. **PATTERNS.md is the load-bearing memory.** Without an accumulating reference, every rule starts from a blank slate and you re-design the same shape repeatedly. With it, work compounds.
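
A minimal sketch of what "derived from disk facts" means, assuming the workspace layout named in principle 1 (the engine's real derivation is richer):

```python
from pathlib import Path

def covered_rule_ids(workspace: Path) -> set[str]:
    # A rule counts toward coverage once its artifacts exist on disk,
    # regardless of how the TaskBoard happened to group the work.
    return {p.parent.name for p in workspace.glob("rule_skills/*/SKILL.md")}
```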

---

## Ordering methodologies

Pick one explicitly and write it into your first PATTERNS.md entry. "I'm going Shannon–Huffman because R028's multi-party verdict shape will dictate the chunker for everyone else" is a valid decision; "I started at the top of catalog.json and kept going" is not — it's just the absence of a decision.

### Shannon–Huffman (difficulty-first) — recommended default

Process the **hardest** rule first. Use the chunker, verdict shape, and worker tier that hard rule demands as the design floor. Process subsequent rules in descending difficulty, each one a degenerate case of the machinery already built.

**When to pick:** the rule set has uneven complexity and you suspect a few hard rules will dictate the shape (almost always true for compliance / regulatory work). In E2E #5, GLM accidentally followed this path and produced 0.6% ERROR on real LLM-driven workflows; DS started bottom-up and shipped 78% NOT_APPLICABLE.

**Why "Huffman", not "Shannon", for the analogy:** Huffman builds optimal prefix codes by processing low-frequency symbols first. KC's analogue is the high-cost-per-rule, low-frequency rules — the R028s that dominate the design space even though there are few of them. Touch them first. The easy rules inherit the framework cheaply.

**The compiler-design parallel:** don't optimize the common case until you've handled the worst case correctly. The common case being fast doesn't matter if the worst case requires a redesign.

### Breadth-first (round-robin)

Process every rule to a shallow depth (skill skeleton + first regex pass), then go back and deepen each one. Useful when:

- The full set's quality matters more than per-rule depth (e.g., you need a coverage report fast)
- You don't yet know which rules are hard
- You're piloting a new methodology and want to validate the pipeline shape across many rules cheaply

**Trap:** you may declare rule_extraction done with shallow skills that never deepen. Worse than depth-first, because the gate appears satisfied from coverage alone.

### Depth-first (one rule at a time, fully done)

Process rule 1 to completion (SKILL.md + check.py + tests passing) before touching rule 2. Useful when:

- Rules are largely independent (rare in compliance work)
- The conductor model has a small context and re-loading shape between rules is cheap
- You're proving the end-to-end pipeline before scaling

**Trap:** the first rule's shape gets locked in; refactoring after rule 5 means rewriting rules 1–4. Combine with PATTERNS.md to mitigate.

### Binary partition

Split the rule set into two halves on a meaningful axis (public/private products, document type, regulation chapter), then recurse. Useful when:

- The split axis is structural (e.g., banking rules vs. trust rules) — you can build separate tools per partition
- Some partitions can be skipped entirely (the D6 applicability filter says "not applicable for this corpus")

**Trap:** premature partitioning when the axis isn't real. The agent commits to two tools that turn out to need a shared base. Validate the split with 2-3 rules per side before committing.

### "Easiest first" — what NOT to default to

Tempting, because it builds confidence and ships something visible quickly. Do not default to it for regulatory rule sets — the easy rules teach you nothing transferable about the hard ones, and the framework crystallizes around the wrong shape. Use it only when you're piloting tooling on a brand-new project and need to prove the pipeline can produce ANY output before sizing the real work.

---

## Grouping rules

The default is **one rule per task → one rule per skill directory**. This keeps coverage measurable and the TaskBoard clear. Group only when grouping reduces total work without coupling unrelated concerns.

### When to bundle

Bundle multiple rules into a single task (and a single check_r###_r###.py file) only when ALL of the following hold:

- The rules share the same source chunk(s) — they look at the same paragraph of the same regulation
- They share the same input format (e.g., a required-fields table)
- The judgment logic for one rule is a substring or close variant of the next
- A single failure typically implies multiple failures (you can't pass R013 if R015 fails)

Example: R013 / R015 / R017 all check that a specific table on page 3 of the report contains certain mandatory fields. Same chunk, same parse, same verdict shape. Bundle them as `check_r013_r015_r017.py` and create a single TaskCreate task `R013/R015/R017 — required-fields table`. The engine's filesystem-derived milestones recognize the grouped check.py and credit all three rule_ids.
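
A sketch of what such a bundled check could look like; the field names and the `extract_page3_table` helper are illustrative assumptions, not the package's actual code. The point is one shared parse feeding one verdict per bundled rule_id:

```python
REQUIRED_FIELDS = {
    "R013": ["net asset value", "management fee"],
    "R015": ["custodian name"],
    "R017": ["reporting period"],
}

def extract_page3_table(text: str) -> str:
    # Placeholder locator; a real check would slice out the page-3 table here.
    return text

def check(text: str) -> dict:
    table = extract_page3_table(text)  # shared chunk, parsed once
    results = {}
    for rule_id, fields in REQUIRED_FIELDS.items():
        missing = [f for f in fields if f not in table]
        results[rule_id] = {"rule_id": rule_id, "passed": not missing,
                            "evidence": {"missing": missing}, "method": "table-fields"}
    return results
```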

### When to keep separate

Keep rules separate when ANY of the following holds:

- The rules cite different regulation chapters — even if conceptually related (e.g., R013 disclosure-content and R028 custodian-responsibility: both about reports, but different chapters / different evidence chains)
- The rules need different worker LLM tiers (R005 needs a flagship model for nuanced judgment; R001 is regex)
- The rules apply to different document types (one applies only to public-fund reports, another only to private-fund reports)
- One rule's failure mode is a specific failure mode of another (don't bundle parent + child rules — the child's check redundantly re-runs the parent's)

The v0.6.2 D2 anti-pattern wording captures the failure case clearly:

> If you find yourself writing a unified_qc.py-style monolith that bypasses individual skills, your per-rule skills are wrong. Fix them, don't replace them.

That came from E2E #4, where one conductor wrote a 2,400-line `unified_qc.py` that ran all rules at once. It produced 1,150 ERROR verdicts (16.6%) because every rule's failure cascaded into every other rule's verdict. Per-rule skills are KC's unit of granularity for a reason.

### Anti-pattern: stub check.py + real workflow.py

Do NOT make `rule_skills/<id>/check.py` a stub that defers to `workflows/<id>/workflow.py`. KC's intent: SKILL.md + check.py is the **canonical** verification; workflow.py is the **distilled, cheaper** form (regex baseline + LLM fallback). The relationship is skill → workflow, not workflow → skill.

❌ DON'T:

```python
# rule_skills/R001/check.py — STUB, real logic elsewhere
def check(text):
    rule_ids = re.findall(r"R\d{3}", load_skill())
    return {rid: {"pass": None, "method": "stub",
                  "note": "to be implemented later"} for rid in rule_ids}
# real verification logic only in workflows/R001/workflow_v1.py
```

✅ DO:

```python
# rule_skills/R001/check.py — canonical verification
import re

def check(text):
    matches = re.findall(r"...", text)  # actual rule logic
    return {"rule_id": "R001", "passed": bool(matches),
            "evidence": matches[:3], "method": "regex"}

# workflows/R001/workflow_v1.py — distilled, cheaper form
def run(text, llm_fn=None):
    result = check(text)  # baseline from the skill
    if not result["passed"] and llm_fn:
        result = llm_verify(text, llm_fn)  # LLM fallback, defined alongside the workflow
    return result
```

Why it matters: distillation-phase consumers (the release tool, the run.py harness) load workflow.py. If check.py is a stub, the skill's methodology (SKILL.md) becomes documentation-only and the verification logic is scattered across N workflow files. Future iterations of the skill (changes in regulation interpretation, edge cases discovered in production) need a single canonical place to update — the skill — not N workflows that have drifted independently.

E2E #6 v070 surfaced this pattern (DS bundled-skill check.py files all returned `{"pass": null, "method": "stub"}`, deferring to workflows/). v0.7.1 added this anti-pattern explicitly.

### Naming convention for grouped checks

When you do bundle, name the file with the explicit range:

- `check_r013_r015_r017.py` — three specific rules
- `check_r002_r007.py` — contiguous range (R002 through R007)
- `check_r013-r017.py` — alternative spelling, also accepted

The engine's filesystem-derived milestones parse these names and credit each constituent rule_id. The grouping is documentation as much as code organization — downstream consumers (workflow-run, dashboards, the release tool) read the filename to know coverage.
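
A minimal sketch of that expansion, consistent with the three examples above (two ids read as an inclusive range, three or more as an explicit list); the engine's actual parser may differ:

```python
import re

def rule_ids_from_filename(name: str) -> list[str]:
    nums = [int(n) for n in re.findall(r"r(\d{3})", name, flags=re.I)]
    if len(nums) == 2:  # check_r002_r007.py / check_r013-r017.py: a range
        lo, hi = sorted(nums)
        nums = list(range(lo, hi + 1))
    return [f"R{n:03d}" for n in nums]

assert rule_ids_from_filename("check_r013_r015_r017.py") == ["R013", "R015", "R017"]
assert rule_ids_from_filename("check_r013-r017.py") == ["R013", "R014", "R015", "R016", "R017"]
```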

---

## Difficulty estimation — three-axis triage

Before you commit to an order, score each extracted rule on three axes. One quick worker LLM call per rule (tier3 is sufficient; this is not a deep judgment) writes a `rules/difficulty.json` that the conductor then reads when deciding TaskBoard order.

### Axis 1 — Chain-of-thought depth

How many sequential judgments does the rule require? Count the operations the agent has to chain together:

- 1: `text contains "industry-unified channel"` (regex)
- 2: classify product type, then check the channel (two-step)
- 3+: classify product type, locate the disclosure section, parse the table, compare against another section's table (multi-step)

Score: 1 / 2 / 3+ on this axis.

### Axis 2 — Module count

How many distinct sub-checks does the rule encompass? A "module" is a logically separable predicate.

- 1: single predicate ("must mention channel A")
- 2-3: a small required-fields list ("must mention A, B, C, D")
- 4+: a large checklist or conditional branch ("if public fund, then channels X+Y; if private, then channel Z; in all cases also include the manager identity")

Score: 1 / 2-3 / 4+ on this axis.

### Axis 3 — Cross-rule interaction

Does the rule reference another rule, depend on its output, or have to resolve consistency with it?

- 0: standalone (most rules)
- 1: cross-references one other rule (e.g., R007 references R013's table existence)
- 2+: tightly coupled with multiple rules, requires consistency reasoning across them

Score: 0 / 1 / 2+ on this axis.

### Total difficulty

Sum the three axes (1+1+0 = 2 minimum, 3+3+2 = 8 maximum). Sort descending. The 2-3 highest are your design-floor cases — work them first.

For a 70-rule corpus, expect a difficulty distribution of roughly:

- 10-15 hard (sum 5-8)
- 30-40 medium (sum 3-4)
- 20-30 easy (sum 2)

Don't over-engineer the triage. It's a planning aid, not a contract. If during the work you discover that a rule scored 2 was actually a 6, update PATTERNS.md and re-sort the remaining queue.
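
The bookkeeping is small enough to sketch. Assumptions: the open-ended buckets (3+, 4+, 2+) are stored as their ordinal caps, and the JSON shape below is illustrative; the skill only fixes the file name `rules/difficulty.json`:

```python
import json

def total(axes: dict) -> int:
    # axes: {"depth": 1-3, "modules": 1-3, "interaction": 0-2}
    return axes["depth"] + axes["modules"] + axes["interaction"]

def write_difficulty(per_rule: dict, path: str = "rules/difficulty.json") -> list[str]:
    ranked = sorted(per_rule.items(), key=lambda kv: total(kv[1]), reverse=True)
    with open(path, "w") as f:
        json.dump([{"rule_id": rid, **axes, "total": total(axes)}
                   for rid, axes in ranked], f, indent=2)
    return [rid for rid, _ in ranked]  # hardest-first queue

# write_difficulty({"R028": {"depth": 3, "modules": 3, "interaction": 2},
#                   "R001": {"depth": 1, "modules": 1, "interaction": 0}})
# -> ["R028", "R001"]
```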

---

## PATTERNS.md — project memory discipline

KC's main agent does not have continuous memory across phases. Every time the agent re-reads `describeState`, it sees the same rule list and the same milestones. Without an external accumulating reference, every rule's design starts from scratch.

`rules/PATTERNS.md` is that reference. The agent owns it (writes via `workspace_file`, not via any tool wrapper). The engine surfaces it in every system prompt of skill_authoring + skill_testing. It is capped at ~5 KB so the token cost stays trivial.

### What to write — patterns that transfer

A good PATTERNS.md entry captures something that will SAVE work on the next rule. Three legitimate categories:

✅ **Transferable shape** — a verdict shape, chunker granularity, or interface decision that subsequent rules will reuse.

```
R028 (custodian responsibility) needed multi-party verdict shape:
{ primary_party: PASS|FAIL, secondary_parties: [...], reasons: [...] }
Adopting as default for any rule with multiple liable entities.
Confirmed reusable on R029, R031.
```

✅ **Project-level constraint** — a fact about the corpus or environment that affects multiple rules.

```
Sample corpus has bilingual table headings (EN+ZH).
Chunker MUST split on the ZH heading boundary, not the EN one —
verified on 5 sample docs. Without this, R013 / R015 / R017 all
under-extract.
```

✅ **Anti-pattern with rationale** — a thing you tried, why it failed, what to do instead.

```
Tried tier4 for JSON-output verdicts → empty responses 80% of the time.
tier3 (Qwen3.5) hallucinates field names. Settled on tier2 (DeepSeek-V3.2)
for any structured-output rule. Tier1 reserved for verdict reasoning under
ambiguous evidence (R005, R024).
```

### What NOT to write — log-dump anti-patterns

These add token cost without adding decision value. Future-you reading PATTERNS.md is trying to figure out what to do next, not reconstruct what already happened.

❌ **Completion log** — already in tasks.json + the filesystem.

```
R001 done. R002 done. R003 partial pass. R004 done.
```

❌ **Tool history echo** — already in events.jsonl.

```
Called workspace_file to write check_R013.py. Then called sandbox_exec.
Then ran the result through worker_llm_call.
```

❌ **Filesystem-authoritative facts** — the engine derives these from disk.

```
Workflows live under workflows/R001_workflow.py. There are 28 of them.
```

❌ **Conversation summary** — neither agent nor user reads PATTERNS.md as narrative.

```
After discussing with the user, we decided to focus on banking rules first.
The user mentioned that trust products are out of scope.
```

If a project-level decision came out of conversation, write it as a constraint:

```
Trust products excluded from this run (D6 applicability NO).
Skip R078, R092, R104 — their skills exist as stubs only.
```

### When to update vs. append

- **Append** when you discover something new and transferable.
- **Update an existing entry** when work on a later rule reveals a better abstraction. Don't lock yourself into the first hard rule's shape — JIT compilers recompile when profile data invalidates the original assumption; PATTERNS.md should evolve the same way.
- **Delete an entry** when you discover it was wrong. Mark the deletion with a brief rationale at the bottom of the file:

```
[DELETED 2026-04-29] "Always use tier1 for FAIL verdicts"
Why: R005 + R007 work fine on tier2; tier1 reserved for genuinely
ambiguous evidence cases only (3 rules across the set).
```

### Sizing

Keep PATTERNS.md under ~5 KB total. If it grows past that, prune the least-actionable entries (the ones that haven't influenced any decision in the last 5 rules). Memory is for what you're using, not what you've seen.

---

## Putting it together — opening sequence

When entering skill_authoring with an empty TaskBoard:

1. **Read `describeState`.** Look at the rule list, the milestones (rules with chunk refs / coverage audited), and any existing PATTERNS.md.
2. **If PATTERNS.md is empty:** spend ~2 turns deciding the ordering methodology + the first 3-5 patterns. Write PATTERNS.md as your first artifact, before any skill code.
3. **If `rules/difficulty.json` exists:** sort the rules by difficulty descending. Group where appropriate per the rules above. Call `TaskCreate` for each unit (see the sketch after this list).
4. **If `rules/difficulty.json` doesn't exist:** decide whether to spend the worker LLM calls to triage (almost always yes for a corpus of >20 rules). Run the triage step (one tier3 call per rule, batched in groups of 10 if you want), write `rules/difficulty.json`, then proceed to step 3.
5. **Pick the first task.** Work it to completion (skill + check + at least one local test). Update PATTERNS.md with whatever you learned. Move to the next task.
6. **At task ~5 and task ~10:** stop and re-read PATTERNS.md. If the patterns suggest a refactor of earlier work, do it now (cheap) rather than later (expensive).
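
In pseudocode terms (the `task_create` callable stands in for the real `TaskCreate` tool and `group_bundles` for the bundling tests above; both names are placeholders):

```python
import json

def group_bundles(ranked: list[dict]) -> list[list[str]]:
    # Default: one rule per task; real grouping applies the "when to bundle" tests.
    return [[r["rule_id"]] for r in ranked]

def populate_taskboard(task_create, path: str = "rules/difficulty.json") -> None:
    with open(path) as f:
        ranked = json.load(f)  # hardest-first, as written by the triage step
    for bundle in group_bundles(ranked):
        task_create(title=" / ".join(bundle), rule_ids=bundle)
```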

### Why PATTERNS.md FIRST, before any skill code

If you start writing skill code (rule_skills/<id>/check.py) before PATTERNS.md exists, **stop**. Even a 200-byte initial PATTERNS.md ("decided Shannon–Huffman; first hard rule R028 will dictate the verdict shape; sample corpus has bilingual table headings") sets the framework. You'll save roughly 4× that time later by not re-deriving the same shapes for every rule.

❌ "I'll write the skills first, then PATTERNS.md when I have insights."

By the time you have N skills, you've made N implicit decisions about verdict shape, chunker boundaries, and worker tier — each rule re-derived from scratch. Refactoring then requires touching N files instead of one.

✅ "Write PATTERNS.md, even tentatively, then re-read it before each new rule. Update it when discoveries change the framework."

PATTERNS.md is your project's index card. Build it before the work, update it during the work, harvest it after.

E2E #6 v070 surfaced this too: DS only wrote PATTERNS.md after a rollback intervention; the per-skill design decisions made before that point were already locked in and had to be re-touched. v0.7.1 reinforced this guidance.

The engine's filesystem-derived milestones (Group A, v0.7.0) verify coverage on disk regardless of how you split the work. The TaskBoard is your scratchpad; the disk is the contract.

package/template/skills/en/skill-creator/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: skill-creator
-description:
+description: Anthropic's skill-scaffolding toolkit — use for iterating/improving existing skills or running evals on them, NOT as the primary reference for building KC's per-rule verification skills. For KC rule skills, read `meta-meta/skill-authoring` first (canonical folder layout + granularity rules + KC-specific check.py entry-point conventions) and `meta-meta/work-decomposition` for ordering + grouping decisions. This skill applies once per-rule skills exist and the agent wants to optimize their description/triggering or run formal evals.
 ---

 # Skill Creator

package/template/skills/zh/meta-meta/work-decomposition/SKILL.md
@@ -0,0 +1,321 @@
---
name: work-decomposition
description: Decide how to decompose the rule set into TaskBoard tasks during the rule_extraction → skill_authoring transition. Covers ordering methodologies (difficulty-first / Shannon–Huffman, breadth-first, depth-first, binary partition), grouping strategy (the criteria for bundling multiple rules into one task vs. keeping them separate), three-axis difficulty estimation, and how to write a PATTERNS.md project memory that stays useful across the whole run. Use when entering rule_extraction, when entering skill_authoring, or whenever the TaskBoard feels off and you want to re-decompose.
---

# Work Decomposition

KC's main agent is the conductor. The conductor decides what to do next — and that decision sits above every choice that follows. A wrong decomposition makes the whole session expensive: with the rules in the wrong order, the agent re-designs the same structure three times; with unrelated rules bundled into one skill, the final check.py becomes the "unified runner" anti-pattern from E2E #4; with related rules that should have been bundled scattered across skills, the agent re-derives the same chunker logic 17 times.

This skill is the conductor's operating manual for that kind of decision. It lives under `meta-meta/` because work decomposition is a system-level discipline, not a per-rule technique. The complementary `task-decomposition` skill (also under `meta-meta/`) covers the **internal** structure of a single rule's check — locate → extract → normalize → judge → comment. This skill covers how the rule **set** gets cut into TaskBoard tasks.

## When to use this skill

- **When entering rule_extraction.** After reading the regulation and extracting the rules, and before declaring the phase done, decide in what order the rules will be processed and whether they will be grouped. The coverage audit and chunk refs are downstream of those two decisions.
- **When entering skill_authoring.** The TaskBoard is empty (the engine no longer auto-generates per-rule tasks). Read the rule list from `describeState`, decide grouping and order, then call `TaskCreate` for each unit of work.
- **When the decomposition feels wrong mid-run.** If the TaskBoard keeps drifting (rules accumulating in the wrong order, two rules that clearly belong together split across two tasks), stop and re-decompose. The cost of pausing 5 minutes to re-plan is recovered within the next 2 rules of better-shaped work.

## Locked principles

1. **Hard tracking, soft executing.** However you group, the engine derives milestones from disk facts (`rule_skills/<id>/SKILL.md`, `check_*.py`, `workflows/<id>/...`). Coverage is verified by the engine; grouping is your own decision. Clever task naming cannot get around the floor, but the floor does not dictate task shape either.
2. **The hardest rule carries the most information.** A hard rule forces you to get the chunker, the classifier, the verdict shape, and the worker LLM tier all in place. Easy rules can cheaply reuse a subset of that machinery. Nail the hard cases first; the easy cases inherit naturally.
3. **PATTERNS.md is the load-bearing memory.** Without a continuously accumulating reference file, every rule starts from blank and the same shape gets designed over and over. With it, work compounds.

---

## Ordering methodologies

Pick one explicitly and write the choice itself into the first PATTERNS.md entry. "I'm using Shannon–Huffman because R028's multi-party verdict shape will determine the chunker for the other rules" is a legitimate decision; "I started at the top of catalog.json and worked down one by one" is not — that is just the absence of a decision.

### Shannon–Huffman (difficulty-first) — recommended default

Process the **hardest** rule first. Treat the chunker, verdict shape, and worker tier that this hard rule requires as the design floor. Process subsequent rules in descending difficulty, each one a degenerate form of the machinery already built.

**When to pick it:** the rule set is unevenly complex and you suspect a few hard rules will determine the overall shape (almost always the case for compliance / regulatory work). In E2E #5, GLM stumbled onto this path and ended up with 0.6% ERROR on real LLM-driven workflows; DS went bottom-up and ended up with 78% of verdicts NOT_APPLICABLE.

**Why the analogy is "Huffman" rather than "Shannon":** Huffman coding builds an optimal prefix code by processing the low-frequency symbols first. KC's counterpart is the high-cost, low-frequency rules — the R028 type, few in number but dominating the whole design space. Touch them first, and the easy rules inherit the framework cheaply.

**The compiler-design counterpart:** don't optimize the common case before the worst case is handled correctly. A fast common case is useless if the worst case forces a redesign of the whole pipeline.

### Breadth-first (round-robin)

First take every rule to a shallow depth (skill skeleton + a first regex pass), then go back and deepen them one by one. Suitable when:

- Coverage of the whole set matters more than depth on any single rule (e.g., you urgently need a coverage report)
- You don't yet know which rules are hard
- You're trying a new methodology and want to validate the pipeline shape cheaply across many rules

**Trap:** you might declare rule_extraction done on shallow skills and never go back to deepen them. Worse than depth-first, because judged by coverage alone, the gate looks passed.

### Depth-first (one rule at a time, to completion)

Finish rule 1 completely (SKILL.md + check.py + tests passing) before touching rule 2. Suitable when:

- Rules are highly independent of each other (rare in compliance work)
- The conductor model's context is small and repeatedly reloading shape between rules is not expensive
- You're proving the pipeline end-to-end and won't scale out until it produces something

**Trap:** the first rule's shape gets baked in; discovering at rule 5 that a refactor is needed means rewriting rules 1–4. Mitigate with PATTERNS.md.

### Binary partition

Split the rule set in half along a meaningful axis (public/private funds, document type, regulation chapter), then recurse. Suitable when:

- The split axis is structural (e.g., banking rules vs. trust rules) — separate tools can be built per partition
- Some partitions can be skipped wholesale (the D6 applicability judgment says "not applicable to this corpus")

**Trap:** partitioning prematurely along an axis that isn't real. The agent invests in two toolsets and then discovers they need a shared base. Validate the split with 2–3 rules on each side before committing.

### "Easiest first" — do not default to this

Very tempting, because it builds confidence and produces visible output quickly. **Do not default to it for regulatory rule sets** — easy rules teach you nothing that transfers to the hard ones, and the framework crystallizes around the wrong shape. Use it only when you're debugging the toolchain on a brand-new project and must first prove the pipeline can produce **anything** at all.

---

## Grouping rules

The default is **one rule → one task → one skill directory**. That keeps coverage measurable and the TaskBoard legible. Bundle only when bundling genuinely reduces total work without coupling unrelated concerns.

### When to bundle

Bundle multiple rules into one task (and one check_r###_r###.py file) only when **all** of the following hold:

- The rules share the same source chunk (they look at the same passage of the same regulation)
- They share the same input shape (e.g., the same required-fields table)
- One rule's judgment logic is a substring or near variant of another's
- One failure usually implies several rules failing together (R013 cannot pass while R015 fails)

Example: R013 / R015 / R017 all check whether the table on page 3 of the report contains certain required fields. Same chunk, same parse, same verdict shape. Bundle them as `check_r013_r015_r017.py` and create one TaskCreate task, `R013/R015/R017 — required-fields table`. When the engine derives milestones from the filesystem it recognizes the bundled check.py and counts all three rule_ids toward coverage.

### When to keep separate

Keep rules separate as soon as **any one** of these holds:

- The rules cite different chapters — even if conceptually related (e.g., R013's disclosure content and R028's custodian responsibility are both about "reports", but different chapters, different evidence chains)
- The rules need different worker LLM tiers (R005 needs a flagship model for fine-grained judgment; R001 is a regex)
- The rules apply to different document types (one applies only to public-fund reports, the other only to private-fund reports)
- One rule's failure mode is a special case of another's (don't bundle a parent rule with its child — the child's check would redundantly re-run the parent's)

The v0.6.2 D2 anti-pattern wording already states the failure case plainly:

> If you find yourself writing a unified_qc.py-style grab-bag that bypasses the individual rule skills, it means your per-rule skills are wrong. Go fix them, don't replace them.

That passage comes from E2E #4: a conductor model wrote 2,400 lines of `unified_qc.py` running all rules in one shot. The result was 1,150 ERROR verdicts (16.6%), because every rule's failure dragged down the verdicts of all the others. Per-rule skills are KC's unit of granularity for a reason.

### Anti-pattern: check.py as a stub while workflow.py holds the real logic

Do **not** write `rule_skills/<id>/check.py` as a placeholder that defers the real logic to `workflows/<id>/workflow.py`. KC's design intent: SKILL.md + check.py is the **canonical** verification; workflow.py is the **distilled, cheaper** form (regex first + LLM fallback). The relationship is skill → workflow, not the other way around.

❌ Don't do this:

```python
# rule_skills/R001/check.py — STUB, real logic elsewhere
def check(text):
    rule_ids = re.findall(r"R\d{3}", load_skill())
    return {rid: {"pass": None, "method": "stub",
                  "note": "to be implemented in skill_testing"} for rid in rule_ids}
# actual verification logic lives only in workflows/R001/workflow_v1.py
```

✅ Do this:

```python
# rule_skills/R001/check.py — canonical verification
import re

def check(text):
    matches = re.findall(r"...", text)  # real rule logic
    return {"rule_id": "R001", "passed": bool(matches),
            "evidence": matches[:3], "method": "regex"}

# workflows/R001/workflow_v1.py — the distilled, cheap form
def run(text, llm_fn=None):
    result = check(text)  # the skill provides the baseline
    if not result["passed"] and llm_fn:
        result = llm_verify(text, llm_fn)  # escalate to the LLM on FAIL
    return result
```

Why this matters: the downstream consumers of the distillation phase (the release tool, the run.py runner) load workflow.py. If check.py is a stub, the skill's methodology (SKILL.md) is reduced to documentation, and the verification logic is scattered across N workflow files. Later iterations of the skill (changes in how the regulation is interpreted, edge cases found in production) need one **canonical place** to update — the skill — not N workflows that have each drifted on their own.

E2E #6 v070 exposed this anti-pattern (DS wrote every bundled skill's check.py as `{"pass": null, "method": "stub"}`, pushing the work to workflows/). v0.7.1 writes the anti-pattern into the skill explicitly.

### Naming convention for bundled checks

When you do need to bundle, the filename must spell out the range:

- `check_r013_r015_r017.py` — three specific rules
- `check_r002_r007.py` — a contiguous range (R002 through R007)
- `check_r013-r017.py` — an alternative spelling, also accepted

When the engine derives milestones from the filesystem it parses these filenames and counts every rule_id in the range toward coverage. Bundling is documentation as much as code organization — downstream consumers (workflow-run, dashboards, the release tool) judge coverage from the filename.

---

## Difficulty estimation — three-axis scoring

Before fixing an order, score every extracted rule on three axes. One quick worker LLM call per rule is enough (tier3 suffices; this is not a deep judgment); write the results to `rules/difficulty.json`, which the conductor then reads when ordering the TaskBoard.

### Axis 1 — Chain-of-thought depth

How many chained judgments does the rule need? Count the operations the agent must perform in sequence:

- 1: `text contains "industry-unified information disclosure channel"` (regex)
- 2: classify the product type first, then check the channel (two steps)
- 3+: classify the product type → locate the disclosure passage → parse the table → compare with the table in another passage (multiple steps)

Score: 1 / 2 / 3+.

### Axis 2 — Module count

How many independent sub-checks does the rule contain? A "module" is a logically separable predicate.

- 1: a single predicate ("must mention channel A")
- 2-3: a small required list ("must mention A, B, C, D")
- 4+: a long checklist or a conditional branch ("if public fund, channels X+Y; if private fund, channel Z; in every case also include the manager identity")

Score: 1 / 2-3 / 4+.

### Axis 3 — Cross-rule interaction

Does the rule reference other rules, depend on their output, or have to stay consistent with them?

- 0: standalone (most rules)
- 1: cross-references one other rule (e.g., R007 references whether R013's table exists)
- 2+: tightly coupled with several rules, needs cross-rule consistency reasoning

Score: 0 / 1 / 2+.

### Total difficulty

Sum the three axes (minimum 1+1+0=2, maximum 3+3+2=8). Sort descending. The top 2-3 are your design floor — do them first.

A typical distribution for a 70-rule corpus is roughly:

- 10-15 hard (sum 5-8)
- 30-40 medium (sum 3-4)
- 20-30 easy (sum 2)

Don't overdo the triage. It's a planning aid, not a contract. If during the work you find that a rule scored 2 is really a 6, update PATTERNS.md and re-sort the remaining queue.

---

## PATTERNS.md — project memory discipline

KC's main agent has no continuous memory across phases. Every time the agent re-reads `describeState` it sees the same rule list and the same milestones. Without an external accumulating reference, every rule's design starts from zero.

`rules/PATTERNS.md` is that reference. The agent owns it (written via `workspace_file`, not through any tool wrapper). The engine attaches it to every system prompt in the skill_authoring + skill_testing phases. It is capped at about 5 KB, so the token cost is negligible.

### What to write — shapes that transfer

A good PATTERNS.md entry records something that will **save work** on the **next** rule. Three legitimate kinds of content:

✅ **A transferable shape** — a verdict shape, chunker granularity, or interface decision that later rules will reuse.

```
R028 (custodian responsibility) needed a multi-party verdict shape:
{ primary_party: PASS|FAIL, secondary_parties: [...], reasons: [...] }
Default shape for any rule with multiple liable parties.
Confirmed reusable on R029, R031.
```

✅ **A project-level constraint** — a fact about the corpus or environment that affects several rules.

```
Sample corpus has bilingual (EN+ZH) table headings.
The chunker MUST split on the ZH heading, not the EN one — verified
on 5 sample docs. Otherwise R013 / R015 / R017 all under-extract.
```

✅ **An anti-pattern plus the reason** — what you tried, why it failed, what to do instead.

```
Tried tier4 for JSON verdicts → empty responses 80% of the time.
tier3 (Qwen3.5) hallucinates field names. Settled on tier2 (DeepSeek-V3.2)
as the default for all structured-output rules. Tier1 kept only for
verdict reasoning on ambiguous evidence (R005, R024).
```

### What not to write — the log-dump anti-patterns

The following only add token cost without adding decision value. Future-you reads PATTERNS.md to figure out **what to do next**, not to replay **what already happened**.

❌ **Completion log** — already in tasks.json + the filesystem.

```
R001 done. R002 done. R003 partial pass. R004 done.
```

❌ **Tool-call echo** — already in events.jsonl.

```
Called workspace_file to write check_R013.py. Then called sandbox_exec.
Then handed the result to worker_llm_call.
```

❌ **Filesystem-authoritative facts** — the engine derives these from disk by itself.

```
Workflows all live at paths like workflows/R001_workflow.py. There are 28 of them.
```

❌ **Conversation summary** — PATTERNS.md is not a narrative document; neither the agent nor the user reads it as a story.

```
After discussing with the user, decided to do the banking rules first.
The user mentioned trust products are out of scope.
```

If a project-level decision came out of conversation, write it as a constraint, not as narration:

```
Trust products excluded from this run (D6 applicability NO).
Skip R078, R092, R104 — their skills are placeholders only.
```

### When to update, when to append

- **Append**: you discovered something new and transferable.
- **Update an existing entry**: working on a later rule revealed a better abstraction. Don't let the first hard rule's shape lock you in — JIT compilers recompile when profile data overturns the early assumptions; PATTERNS.md should evolve the same way.
- **Delete an entry**: you found that something you wrote earlier was wrong. Mark the deletion at the bottom of the file with a one-line reason:

```
[DELETED 2026-04-29] "Always use tier1 for FAIL verdicts"
Reason: R005 + R007 work fine on tier2; tier1 is kept only for genuinely
ambiguous evidence (about 3 rules in the whole set).
```

### Sizing

Keep the whole of PATTERNS.md within about 5 KB. When it overflows, prune the least actionable entries (the ones that haven't influenced any decision over the last 5 rules). Memory is for what you're using, not for what you've seen.

---

## Putting it together — the opening sequence

On entering skill_authoring with an empty TaskBoard:

1. **Read `describeState`.** Look at the rule list, the milestones (which rules have chunk refs / whether the coverage audit is done), and any existing PATTERNS.md.
2. **If PATTERNS.md is empty:** spend about 2 turns fixing the ordering methodology + the first 3-5 patterns. PATTERNS.md is the first artifact, before any skill code.
3. **If `rules/difficulty.json` exists:** sort the rules by difficulty, descending. Group where appropriate per this skill's rules. Call `TaskCreate` for each unit of work.
4. **If `rules/difficulty.json` does not exist:** decide whether to spend the worker LLM calls on triage (for a corpus of >20 rules, almost always worth it). Run the triage (one tier3 call per rule, batched in groups of 10 if you like), write `rules/difficulty.json`, then return to step 3.
5. **Pick the first task.** Take it to completion (skill + check + at least one local test). Write what you learned into PATTERNS.md. Move to the next task.
6. **Around task 5 and task 10:** stop and re-read PATTERNS.md. If the accumulated patterns suggest refactoring earlier work, do it **now** (cheap) rather than later (expensive).

### Why PATTERNS.md comes first, before any skill code

If you start writing skill code (rule_skills/<id>/check.py) while PATTERNS.md doesn't exist yet, **stop**. Even a 200-byte initial PATTERNS.md ("going Shannon–Huffman; first hard rule R028 decides the verdict shape; sample corpus has bilingual table headings") frames the work. Each later rule skips one re-derivation of the same shape, saving roughly 4× the time overall.

❌ "I'll finish the skills first and write PATTERNS.md once I have insights."

By the time you've written N skills, you've made N implicit decisions (verdict shape, chunker boundaries, worker tier), each rule derived from zero. Refactoring then means touching N files instead of one.

✅ "Write PATTERNS.md first (even a rough one), re-read it before each new rule, and go back and update it whenever a discovery changes the framework."

PATTERNS.md is the project's index card. Build it before the work, update it during the work, harvest from it afterwards.

E2E #6 v070 exposed this: DS wrote PATTERNS.md only after the user intervened with a rollback, and by then each skill's design decisions had already solidified on their own and had to be touched again. v0.7.1 makes this guidance more explicit.

The engine's filesystem-derived milestones (v0.7.0 Group A) verify coverage from disk facts no matter how you cut the work. The TaskBoard is your scratchpad; the disk is the contract.