kc-beta 0.8.1 → 0.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (63)
  1. package/package.json +1 -1
  2. package/src/agent/context.js +17 -1
  3. package/src/agent/engine.js +85 -8
  4. package/src/agent/llm-client.js +24 -1
  5. package/src/agent/pipelines/_milestone-derive.js +78 -7
  6. package/src/agent/pipelines/skill-authoring.js +19 -2
  7. package/src/agent/tools/release.js +94 -1
  8. package/src/cli/index.js +28 -7
  9. package/template/.env.template +1 -1
  10. package/template/AGENT.md +2 -2
  11. package/template/skills/en/auto-model-selection/SKILL.md +55 -35
  12. package/template/skills/en/bootstrap-workspace/SKILL.md +13 -0
  13. package/template/skills/en/compliance-judgment/SKILL.md +14 -0
  14. package/template/skills/en/confidence-system/SKILL.md +30 -8
  15. package/template/skills/en/corner-case-management/SKILL.md +53 -33
  16. package/template/skills/en/cross-document-verification/SKILL.md +88 -83
  17. package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
  18. package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  19. package/template/skills/en/data-sensibility/SKILL.md +19 -12
  20. package/template/skills/en/document-chunking/SKILL.md +99 -15
  21. package/template/skills/en/entity-extraction/SKILL.md +14 -4
  22. package/template/skills/en/quality-control/SKILL.md +14 -0
  23. package/template/skills/en/rule-extraction/SKILL.md +92 -94
  24. package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
  25. package/template/skills/en/skill-authoring/SKILL.md +52 -8
  26. package/template/skills/en/skill-creator/SKILL.md +25 -3
  27. package/template/skills/en/skill-to-workflow/SKILL.md +23 -4
  28. package/template/skills/en/task-decomposition/SKILL.md +1 -1
  29. package/template/skills/en/tree-processing/SKILL.md +1 -1
  30. package/template/skills/en/version-control/SKILL.md +15 -0
  31. package/template/skills/en/work-decomposition/SKILL.md +21 -35
  32. package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
  33. package/template/skills/zh/bootstrap-workspace/SKILL.md +13 -0
  34. package/template/skills/zh/compliance-judgment/SKILL.md +14 -0
  35. package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
  36. package/template/skills/zh/confidence-system/SKILL.md +34 -9
  37. package/template/skills/zh/corner-case-management/SKILL.md +71 -104
  38. package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
  39. package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
  40. package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
  41. package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  42. package/template/skills/zh/data-sensibility/SKILL.md +13 -0
  43. package/template/skills/zh/document-chunking/SKILL.md +96 -20
  44. package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
  45. package/template/skills/zh/entity-extraction/SKILL.md +14 -4
  46. package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
  47. package/template/skills/zh/quality-control/SKILL.md +14 -0
  48. package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
  49. package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
  50. package/template/skills/zh/rule-extraction/SKILL.md +199 -188
  51. package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
  52. package/template/skills/zh/skill-authoring/SKILL.md +108 -69
  53. package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
  54. package/template/skills/zh/skill-creator/SKILL.md +71 -61
  55. package/template/skills/zh/skill-creator/references/schemas.md +60 -60
  56. package/template/skills/zh/skill-to-workflow/SKILL.md +24 -5
  57. package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
  58. package/template/skills/zh/task-decomposition/SKILL.md +1 -1
  59. package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
  60. package/template/skills/zh/tree-processing/SKILL.md +1 -1
  61. package/template/skills/zh/version-control/SKILL.md +15 -0
  62. package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
  63. package/template/skills/zh/work-decomposition/SKILL.md +21 -33
@@ -1,107 +1,132 @@
  ---
  name: dashboard-reporting
  tier: meta-meta
- description: Generate HTML dashboards for developer users to visualize verification results, system progress, and quality metrics. Use when a testing round completes, when production batches finish processing, when the developer user wants to see the system's status, or at any point where visual reporting would help communicate progress. Dashboards should be self-contained HTML files that can be opened by double-clicking. Also use when the developer user asks about results, accuracy, or system health.
+ description: Generate HTML dashboards for developer users to visualize verification results, system progress, and quality metrics. Use when a testing round completes, when production batches finish processing, when the developer user wants visual reporting, or when they explicitly ask for it. Dashboards are self-contained HTML files. Use this skill **when there's something visual worth showing** — not as a default deliverable. For routine status updates use KC's TUI. The dashboard is a complement to direct reporting, not a substitute.
  ---

  # Dashboard Reporting

- The dashboard is the developer user's window into the system. They should not need to read logs or parse JSON to understand what is happening. Give them a clear, visual summary that leads with what matters.
+ The dashboard is one channel — and not always the most economical one — for letting the developer user see what's going on. KC already reports status directly in the TUI; the HTML dashboard exists for things that are **worth seeing visually**: distributions, timelines, heatmaps, side-by-side comparisons, drill-down tables.

- ## Dashboard Types
+ Don't treat dashboard generation as a checkbox to satisfy this skill. Treat it as a deliverable the developer user actually asked for, or where a picture genuinely saves them time over reading TUI output or JSON.

- ### Results Dashboard
- Generated after each batch of documents is processed.
+ ## Minimum vs. nice-to-have

- Key elements:
- - **Summary bar**: Total documents, pass rate, fail rate, missing rate, error rate.
- - **Per-rule breakdown**: Table showing each rule's pass/fail counts, accuracy, and average confidence.
- - **Failed cases**: List of documents that failed, with the rule, extracted value, expected value, and comment. Sortable and filterable.
- - **Confidence distribution**: Histogram showing how many results fall in each confidence band.
+ When the developer user asks for a dashboard, **start with the minimum and expand only if it adds real value**.

- ### Progress Dashboard
- Generated on demand to show the system's evolution.
+ ### Minimum

- Key elements:
- - **Lifecycle status**: Which rules are in which phase (skill testing, workflow testing, production, stable).
- - **Rule catalog**: Table of all rules with their current status, accuracy, and version.
- - **Evolution timeline**: For each rule, how many iterations it took, what was the accuracy at each step.
- - **Model tier assignments**: Which model is being used for each step of each rule.
+ A useful dashboard at the floor:
+ - A summary header: total documents, top-line pass/fail/missing counts.
+ - A per-rule table: rule_id, accuracy, pass / fail / NA counts, optional confidence column.
+ - A list of failed cases the user can click into for details (rule, extracted value, expected value, comment).

- ### Quality Dashboard
- Generated after quality control reviews.
+ That's enough to ship. If the user can answer "is this batch healthy and which rules failed" in 3 seconds, the minimum is done.

- Key elements:
- - **Accuracy over time**: Line chart showing per-rule and overall accuracy across batches.
- - **Sampling rate over time**: Showing how monitoring is decreasing (or not).
- - **Flagged issues**: Open issues that need developer user attention.
- - **Cost metrics**: LLM calls and tokens per document, per rule.
+ ### Nice-to-have

- ## Feedback Collection
+ Add these only when they're justified by the data on hand or the user's actual need:
+ - Confidence distribution histogram (useful when confidence is calibrated and the user cares about the distribution shape).
+ - Accuracy-over-time line chart (useful only when there's enough history to draw a meaningful curve).
+ - Per-product-type / per-issuer breakdown (useful when the corpus has meaningful segmentation).
+ - Cost metrics (useful when cost is a live concern; otherwise skip).
+ - Drill-down navigation (summary → rule → document).
+ - Inline feedback widgets (correction-on-click, flag-as-wrong).

- Every dashboard must include mechanisms for users to report errors and comment directly on verification results. This is not a nice-to-have — user feedback is the most valuable data source in the system.
+ Don't add a section to look thorough. An empty "Confidence distribution" chart with no calibrated data is worse than no chart.

- ### Developer User Feedback
+ ## Dashboard types (when to use which)

- Developer users see full result detail. Their feedback interface should support:
- - **Field-level correction**: Click on an extracted value, provide the correct value.
- - **Result override**: Change a pass to fail (or vice versa) with a reason.
- - **Rule re-evaluation request**: Flag a result for re-processing with a different approach.
- - **Comment**: Free-text annotation on any result.
+ ### Results dashboard
+ After a batch of documents is processed. The minimum above usually covers it.

- ### End User Feedback
+ ### Progress dashboard
+ On demand, to show the system's evolution across phases. Lifecycle status per rule, rule catalog table, evolution timeline. Mostly useful when the developer user wants a "where are we" snapshot mid-build.

- End users of the verification app see simplified results. Their interface should support:
- - **Flag as wrong**: One-click to report a result they believe is incorrect.
- - **Add comment**: Brief text explaining what they think is wrong.
- - **Severity indicator**: How impactful is this error? (Critical / Important / Minor)
+ ### Quality dashboard
+ After QC review cycles. Accuracy-over-time, sampling rate trend, flagged issues, cost. Useful when QC has accumulated enough cycles to show a trend.

- ### Feedback as Ground Truth
+ If only one of the three would actually help the developer user right now, build only that one. Don't generate all three by default.

- User-reported errors are ground truth. They override the coding agent's judgment and the worker LLM's output. The feedback data flow:
+ ## Feedback collection (optional but recommended when applicable)

- 1. User submits feedback via dashboard stored as structured record.
- 2. Record schema: `{result_id, trace_id, reporter_role, feedback_type, original_result, corrected_value, comment, timestamp}`.
- 3. Feedback records are fed into the `evolution-loop` as confirmed failures.
- 4. Dashboard surfaces feedback trends: correction rate over time, most-reported issues, rules with highest user correction rates.
+ When the dashboard is destined for an audience that's going to review the results (developer user, end user, domain expert), include feedback widgets. When the dashboard is purely for developer-user inspection mid-build, feedback widgets are usually overkill — they pretend at a workflow the user isn't going to follow.

- Build the feedback collection mechanism alongside the dashboard generation — they are not separate features. Every generated HTML dashboard should include the feedback UI, even if it initially writes to a local JSON file that the coding agent reads on the next iteration.
+ ### Developer-user feedback
+
+ Full result detail visible. Useful widgets:
+ - Field-level correction: click an extracted value, provide the right one.
+ - Result override: change pass to fail (or vice versa) with a reason.
+ - Comment: free-text annotation on any result.
+
+ ### End-user feedback
+
+ Simplified results visible. Useful widgets:
+ - Flag-as-wrong: one-click to report a result believed incorrect.
+ - Comment: brief text explanation.
+ - Severity indicator: critical / important / minor.
+
+ ### Feedback as ground truth
+
+ User-reported errors are ground truth. They override agent judgment and worker-LLM output. Flow:
+
+ 1. Submit via dashboard → stored as structured record.
+ 2. Schema: `{result_id, trace_id, reporter_role, feedback_type, original_result, corrected_value, comment, timestamp}`.
+ 3. Records feed into `evolution-loop` as confirmed failures.
+ 4. Surface feedback trends in subsequent dashboards (correction rate over time, most-reported issues, rules with highest correction rates).

  ## Technology

- Self-contained HTML with embedded CSS and JavaScript. Requirements:
- - **No external dependencies.** No CDN links, no npm packages, no server. Everything is inlined.
- - **No server required.** The developer user double-clicks the HTML file to open it in their browser.
- - **Responsive layout.** Should work on both desktop and mobile screens.
- - **Dark/light mode.** Respect the system preference or provide a toggle.
+ Self-contained HTML with embedded CSS / JavaScript.
+ - **No external dependencies.** No CDN links, no npm packages, no server. Everything inlined.
+ - **No server required.** Developer user double-clicks the HTML file.
+ - **Responsive layout.** Should work on desktop and mobile.
+ - **Dark/light mode** respect system preference or provide a toggle.

- For charts, use inline SVG or a lightweight chart library that can be embedded (e.g., Chart.js or lightweight alternatives, inlined as a script tag).
+ For charts, use inline SVG or a lightweight chart library inlined as a `<script>` tag.

- ## Data Sources
+ ## Data sources

  Dashboards read from:
  - `Output/` for verification results.
  - `logs/` for evolution and testing history.
- - `versions.json` for current system state.
- - QC review records (stored alongside Output/).
+ - `versions.json` (or git log) for current system state.
+ - QC review records (stored alongside `Output/`).

- The dashboard generation script should accept input paths and produce a single HTML file.
+ The generation script should accept input paths and produce a single HTML file.

- ## Generation Triggers
+ ## Generation triggers

- Generate dashboards automatically after:
- - Each testing round completes (skill testing or workflow testing).
- - Each production batch finishes processing.
- - Each quality control review cycle.
- - Developer user explicitly requests it.
+ Generate dashboards when:
+ - A testing round completes AND there's enough data to be worth visualizing.
+ - A production batch finishes AND the developer user wants a visual.
+ - A QC review cycle completes.
+ - The developer user explicitly requests one.

- Store generated dashboards in `Output/dashboards/` with timestamps in filenames for history.
+ Don't auto-generate on every minor event — the dashboards pile up fast and the user won't open most of them. When unsure, ask the user ("Want me to generate a dashboard?") instead of producing one unprompted.

- ## Design Principles
+ Store generated dashboards in `Output/dashboards/` with timestamped filenames for history.

- - **Lead with the summary.** The developer user should understand the system's health in 3 seconds.
- - **Drill down on demand.** Summary → rule-level → document-level. Do not overwhelm with details upfront.
+ ## Design principles
+
+ - **Lead with the summary.** Developer user should understand health in 3 seconds.
+ - **Drill down on demand.** Summary → rule-level → document-level. Don't overwhelm with details upfront.
  - **Color coding.** Green for pass/healthy, red for fail/critical, yellow for warning/attention. Simple and universal.
- - **Actionable.** Every flagged issue should suggest what to do next.
+ - **Actionable.** Every flagged issue should suggest a next step.
+
+ A starter script is available in `scripts/generate_dashboard.py`. Adapt to the specific scenario — and feel free to trim the script when half its sections wouldn't have content. A small dashboard that answers the user's question beats a comprehensive one they don't need.
+
+ ## Relationship to TUI reporting
+
+ KC's TUI already supports rich status reporting during the run. Use TUI for:
+ - Ongoing progress narration.
+ - Per-phase summaries.
+ - Quick "what just happened" updates.
+ - Anything that can be communicated in a few lines of text.
+
+ Use HTML dashboards for:
+ - Visual artifacts that wouldn't fit (distributions, charts, filterable tables).
+ - Hand-off to non-KC users (developer-user reviewing later, end-user audience).
+ - Persistent records the user wants to revisit.

- A starter script is available in `scripts/generate_dashboard.py`. Adapt it to the specific business scenario.
+ When in doubt, prefer the TUI. A short status message the user is already reading beats a dashboard they have to open.
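The record schema in step 2 of the "Feedback as ground truth" flow maps directly to code. A minimal Python sketch, using a local JSON file as the sink in the spirit of the removed paragraph above; the path and helper name are illustrative assumptions, not part of the package:

```python
import json
import time
from pathlib import Path

# Hypothetical sink: a local JSON file the coding agent can read on the next iteration.
FEEDBACK_FILE = Path("Output/feedback/feedback_records.json")

def append_feedback(result_id: str, trace_id: str, reporter_role: str,
                    feedback_type: str, original_result: str,
                    corrected_value: str | None, comment: str) -> dict:
    """Append one record using the schema listed in the feedback flow."""
    record = {
        "result_id": result_id,
        "trace_id": trace_id,
        "reporter_role": reporter_role,      # e.g. developer_user / end_user
        "feedback_type": feedback_type,      # e.g. field_correction / result_override / flag
        "original_result": original_result,
        "corrected_value": corrected_value,
        "comment": comment,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    FEEDBACK_FILE.parent.mkdir(parents=True, exist_ok=True)
    records = json.loads(FEEDBACK_FILE.read_text()) if FEEDBACK_FILE.exists() else []
    records.append(record)
    FEEDBACK_FILE.write_text(json.dumps(records, ensure_ascii=False, indent=2))
    return record
```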
@@ -107,7 +107,7 @@ def generate_html(summary: dict, per_rule: dict, failed_cases: list[dict]) -> st
  <head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
- <title>KC Reborn — Verification Dashboard</title>
+ <title>KC — Verification Dashboard</title>
  <style>
  :root {{ --bg: #1a1a2e; --surface: #16213e; --text: #e0e0e0; --accent: #4caf50; --warn: #ff9800; --err: #f44336; }}
  @media (prefers-color-scheme: light) {{
@@ -27,23 +27,17 @@ Do this for each new document type. Do it again when document sources change. 30

  After reading, answer these questions explicitly — write the answers down, not just think them:

- **What is consistent across all documents?**
- Header structure, field positions, terminology, date formats. These are your anchors. Design extraction around them.
+ **What is consistent across all documents?** Header structure, field positions, terminology, date formats. These are your anchors. Design extraction around them.

- **What varies?**
- Table layouts, section ordering, field presence, formatting conventions. These are your risk points. Every variant needs a test case.
+ **What varies?** Table layouts, section ordering, field presence, formatting conventions. These are your risk points. Every variant needs a test case.

- **What is surprising?**
- Anything you did not expect. A field that is sometimes missing. A value expressed in different units across documents. A section that appears in some templates but not others.
+ **What is surprising?** Anything you did not expect. A field that is sometimes missing. A value expressed in different units across documents. A section that appears in some templates but not others.

- **Document subtypes?**
- Are there different templates, issuers, or time periods represented? A "loan contract" from Bank A may look nothing like one from Bank B. Identify subtypes early — they often need separate extraction paths.
+ **Document subtypes?** Are there different templates, issuers, or time periods represented? A "loan contract" from Bank A may look nothing like one from Bank B. Identify subtypes early — they often need separate extraction paths.

- **Section lengths?**
- Measure them. A section that averages 200 tokens is fine for any model. A section that occasionally runs to 8,000 tokens will blow your context window budget. Plan accordingly.
+ **Section lengths?** Measure them. A section that averages 200 tokens is fine for any model. A section that occasionally runs to 8,000 tokens will blow your context window budget. Plan accordingly.

- **Encoding issues?**
- Full-width vs half-width characters (12.5% vs 12.5%). Unicode normalization problems. OCR artifacts. These cause silent extraction failures because the text looks correct to human eyes but does not match regex patterns.
+ **Encoding issues?** Full-width vs half-width characters (12.5% vs 12.5%). Unicode normalization problems. OCR artifacts. These cause silent extraction failures because the text looks correct to human eyes but does not match regex patterns.
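The encoding failure described above is easy to demonstrate. A small sketch, assuming Python and NFKC normalization as one possible pre-processing step; the sample string and pattern are invented for illustration:

```python
import re
import unicodedata

pattern = re.compile(r"\d+\.\d+%")
raw = "利率：１２．５％"                       # full-width digits, dot, and percent sign

print(pattern.search(raw))                    # None: looks right to a human, misses the regex
normalized = unicodedata.normalize("NFKC", raw)
print(pattern.search(normalized).group())     # 12.5%
```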

  ## Spot-Check Protocol

@@ -105,6 +99,19 @@ When something goes wrong — and it will — you can inspect each stage independently

  Keep intermediates for at least the current iteration. Delete old iterations only when disk space becomes a real constraint.

+ ## Looking at the corpus when it doesn't fit in your head
+
+ A foundational constraint to plan around: you have a finite context window. Reading dozens of sample documents in a row will push earlier observations out of your working memory before you finish, leaving you with the impression of having seen the corpus but not the ability to actually generalize from it.
+
+ Treat the corpus the way a statistician would treat a population: sample, summarize, and don't try to keep the population in your head. A few approaches that work in practice:
+
+ - **Use the file system as memory.** Write a `notes/data_observations.md` (or per-rule `notes/<rule_id>_observations.md`) as you scan. Note field name variants, format quirks, missing-section patterns, surprising values. Re-read the notes file next session instead of re-scanning the docs.
+ - **Per-rule notepads / memory.md.** For each rule, keep a short `memory.md` that captures "what I've seen across the sample set for this rule" — which documents trigger it, what values appear, what edge cases exist. Update incrementally rather than re-deriving it each time you look at the rule.
+ - **Dispatch subagents to explore samples.** When the corpus is large, send a subagent (via the `agent_tool`) to scan a directory and return summary statistics or a short markdown report. The subagent's full reads stay in its own context; you receive only the digest. This is the right tool when you'd otherwise spend context budget reading dozens of files for a single observation.
+ - **Statistical / meta views over individual reads.** Instead of reading 20 income certificates, run a regex over all of them and count format variants. Instead of opening every annual report, list filenames and group by issuer / year. Build the meta view first, then dive into representatives.
+
+ The principle: aim for **enough samples to characterize the distribution**, not enough samples to memorize the corpus. The former fits in your head and in your notes. The latter doesn't.
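As one concrete shape for the "statistical / meta views" idea above, a Python sketch that counts format variants across a sample directory and persists the digest to the notes file instead of holding the corpus in context. The directory name, the field being counted, and the patterns are assumptions made up for the example:

```python
import re
from collections import Counter
from pathlib import Path

SAMPLES = Path("samples/income_certificates")   # illustrative location
DATE_PATTERNS = {
    "iso":     re.compile(r"\d{4}-\d{2}-\d{2}"),
    "chinese": re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"),
    "slash":   re.compile(r"\d{4}/\d{1,2}/\d{1,2}"),
}

counts = Counter()
for doc in SAMPLES.glob("*.txt"):
    text = doc.read_text(encoding="utf-8", errors="ignore")
    for name, pattern in DATE_PATTERNS.items():
        if pattern.search(text):
            counts[name] += 1

# Persist the digest so the next session re-reads notes, not the corpus.
notes = Path("notes/data_observations.md")
notes.parent.mkdir(exist_ok=True)
lines = [f"- date format `{k}`: seen in {v} documents" for k, v in counts.most_common()]
notes.write_text("## Date format variants\n" + "\n".join(lines) + "\n")
```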
+
  ## Integration

  Feed your observations into downstream skills:
@@ -2,32 +2,116 @@
  name: document-chunking
  tier: meta
  description: >
- Fast, cheap chunking for processing batches of sample and input documents.
- Use when you need to split documents into manageable pieces for initial observation,
- data sensibility checks, or feeding to extraction workflows. Not for production
- verification chunking for that, use tree-processing to design a tailored chunking script.
+ Split documents into chunks for downstream processing. Use when batching samples
+ for observation, feeding extraction workflows, or breaking long regulation documents
+ into pieces small enough to fit a worker LLM. Covers cheap methods (page, fixed-size,
+ header-based) for quick exploration AND the onion-peeler hierarchical strategy +
+ wedge fallback for production-grade chunking of long structured documents. Also
+ covers the central balance question: chunk-too-big (information lost in a haystack)
+ vs. chunk-too-small (semantic continuity broken).
  ---

  # Document Chunking

- Split documents into pieces for downstream processing. This is the fast, cheap version — for batch processing of samples and inputs, not for precision verification workflows.
+ Split documents into pieces for downstream processing. Two regimes:

- ## Methods
+ - **Cheap chunking** — fast methods for batch observation and exploratory processing of samples.
+ - **Hierarchical chunking** — the onion-peeler strategy (borrowed from pdf2skills' methodology) for long structured documents where semantic boundaries matter, with the wedge fallback for stretches that have no headers.
+
+ The most important question across both regimes: **how big should a chunk be**? See "Finding the balance" below before settling on specific sizes.
+
+ ## Quick Methods

  **Page-level splits** — simplest. Each page is a chunk. Works for most document processing where you need to iterate over content.

- **Fixed-size chunks** — split by character/token count with overlap. Good for search and initial observation. Typical: 2000-4000 chars with 200 char overlap.
+ **Fixed-size chunks** — split by character or token count with overlap. Good for search and initial observation. Typical: a few thousand chars with modest overlap to keep cross-boundary phrases recoverable.
+
+ **Header-based splits** — detect section headers and split at boundaries. Preserves semantic units. Works when the document has a consistent header convention you can express as regex.
+
+ ## Onion Peeler — Hierarchical Strategy (primary for long structured docs)
+
+ Hierarchical, header-based decomposition. Called "onion peeler" because you peel the document layer by layer, from the outermost structure inward.
+
+ ### How it works
+
+ 1. **Parse the document's heading hierarchy.** Identify all headers at every level (H1, H2, H3 — or the document's equivalent: "Part I", "Chapter 1", "Section 1.1", "Article 1").
+ 2. **Build a tree.** Each header is a node. Content between headers belongs to the nearest ancestor.
+ 3. **Check size.** Walk the tree. If a node's content (including all descendants) fits within the processing budget, stop there — that node is one chunk.
+ 4. **Descend only when needed.** If a node is over budget, descend into its children. Only split when the node is genuinely too large AND has sub-headers available.
+ 5. **Leaf nodes still over budget** → hand off to the wedge fallback.
+
+ ### Why it works
+
+ - Respects the document's own semantic structure. "Chapter 3 — Risk Disclosure" stays as one chunk because that's how the author intended it.
+ - Minimizes information loss. Never cuts mid-meaning.
+ - Produces variable-size chunks — and that's a feature. A short chapter as one whole chunk is better than the same chapter forcibly split in half.
+
+ ### Shortcuts for pattern discovery
+
+ Before building a full parser, explore structural patterns on a few sample documents:
+ - Do all chapter headers start with "Chapter X" or "第X章"?
+ - Is section numbering consistent (1.1, 1.2, 1.3)?
+ - Are there visual markers (bold, specific font, horizontal rules)?
+
+ If you find a stable pattern, a regex-based chunker is faster and more reliable than LLM-based structure detection. Examples:
+ - `^第[一二三四五六七八九十百]+章` matches Chinese chapter headers
+ - `^Chapter \d+` matches English chapter headers
+ - `^\d+\.\d+` matches numbered subsections
+
+ Validate the regex on multiple documents before relying on it.
+
+ ## Wedge Fallback (for content without clear headers)
+
+ For dense legal text, continuous prose, or onion-peeler leaf nodes that are still too large with no sub-headers to descend into.
+
+ ### How it works
+
+ Uses a **rolling context window** so the algorithm scales to documents of arbitrary length.
+
+ 1. **Window the content.** Load up to MAX_TOKENS of unprocessed text into a window (configurable; pick a size your LLM can comfortably read).
+ 2. **Have the LLM mark cut points.** Prompt the LLM to identify 1-3 natural breakpoints in the window where topic / subject shifts. For each cut point, the LLM returns:
+ - `tokens_before`: ~K tokens (e.g., K=50) preceding the cut, quoted verbatim from the source.
+ - `tokens_after`: ~K tokens following the cut, quoted verbatim.
+ - `chunk_title`: a short title (5-10 chars) for the chunk before the cut.
+ 3. **Locate cuts via fuzzy match.** The LLM's quoted tokens won't match the source exactly (minor rewording, whitespace differences). Use Levenshtein distance to find the best position. Require a reasonable similarity threshold; fall back to `tokens_before`-only matching if `tokens_after` can't be located.
+ 4. **Slide and repeat.** Cut the text before the first confirmed breakpoint as a chunk. Slide the window to start at the cut point. Repeat until the remaining text fits in a single chunk.
+
+ ### Why it works
+
+ - LLM identifies semantic boundaries, not arbitrary character positions.
+ - LLM doesn't regenerate text — it only quotes positions. No hallucination risk.
+ - Token-quote + Levenshtein matching is language-agnostic: works on Chinese, English, mixed-language docs.
+ - Rolling window scales to any document length.
+ - Fuzzy matching handles inevitable small differences between LLM-quoted text and source.
+
+ ### When to use it
+
+ - Only when onion-peeler can't proceed (no sub-headers available).
+ - For unstructured documents with no formal markers.
+ - Cost-aware: this method calls the LLM. Pick the cheapest model that can identify topic boundaries (typically tier 3 or 4 is enough).
+
+ ## Finding the balance — when to stop splitting
+
+ The two failure modes:
+
+ - **Chunks too big**: relevant content gets buried in a haystack inside the LLM's context. Even within the LLM's window, attention spreads thin across long inputs — the longer the chunk, the more likely the actual evidence is missed.
+ - **Chunks too small**: semantic continuity breaks. A rule that needs "the company is a bank" + "the loan exceeds threshold X" to fire might see those facts split across chunks and lose the conjunction.
+
+ How to find the balance:

- **Header-based splits** detect section headers and split at boundaries. Preserves semantic units. Use regex patterns for the document's header convention.
+ 1. **Anchor on the downstream task, not the LLM's context window.** The chunk should be large enough to contain the evidence a downstream rule needs in one piece. If a rule needs to compare two clauses, those clauses must end up in the same chunk.
+ 2. **Use semantic boundaries over fixed sizes.** A chunk that ends at a section boundary is more useful than a chunk that hit a target token count mid-sentence. Onion-peeler stops where the document stops; lean on that.
+ 3. **Test with the actual downstream consumer.** Run a sample extraction or judgment on the chunked output. If the consumer misses evidence that's present in the source, your chunks are wrong shape — usually too big or split at the wrong boundary.
+ 4. **Track variance, not just average size.** A handful of giant chunks among many small ones is more of a problem than a uniform distribution at any reasonable size. The big ones are where you'd lose information.
+ 5. **Don't optimize blindly for the LLM's context window.** A 128K context model can technically swallow a 100K chunk; the attention to retrieve specific evidence from that chunk is a different question. Smaller, well-bounded chunks usually win.

- ## When to Use What
+ ## Practical Tips

- Pick the simplest method that serves the task:
- - Batch document observation → page-level
- - Full-text search index → fixed-size with overlap
- - Section-level extraction → header-based
- - Table of contents available → parse TOC for structure
+ - **Chunk size depends on the downstream task.** Rule extraction by the coding agent can take very large chunks. Worker LLM verification needs chunks that comfortably fit inside its context with room for prompt + response.
+ - **Preserve context.** When splitting, carry the parent header chain as context. A chunk from "Part II > Chapter 3 > Section 3.2" should include those headers so the downstream consumer knows where it sits.
+ - **Cache the chunk tree.** Once a document's structure is parsed, save the tree. Many rules may need the same document's content; re-parsing is waste.
+ - **Log chunking decisions.** Which strategy was used, how many chunks were produced, what the size distribution looks like. Helpful for downstream debugging.

  ## Relationship to tree-processing

- This skill is for quick, cheap chunking during exploration and batch processing. When you need production-grade chunking for verification workflows — where the chunking mechanism must be precise, consistent, and coded as a script — use `tree-processing` instead.
+ This skill covers chunking methods. `tree-processing` covers designing the precise, coded chunking script for production verification workflows — where chunking must be deterministic, reproducible, and tested. Reach for `tree-processing` when the cheap methods above don't give you enough control for the production path.
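To ground the onion-peeler walk added in this hunk, a minimal Python sketch under simplifying assumptions (markdown-style `#` headers and a character budget instead of tokens). A production chunker would plug in the corpus's own header regexes, count tokens with the worker LLM's tokenizer, and hand over-budget leaves to the wedge fallback:

```python
import re

BUDGET = 4000  # max characters per chunk (illustrative stand-in for a token budget)

def peel(text: str, level: int = 1, max_level: int = 4) -> list[str]:
    """Split text into chunks, descending into sub-headers only when a section is over budget."""
    if len(text) <= BUDGET or level > max_level:
        return [text]  # fits as-is, or no deeper structure left (wedge fallback would take over)
    header = re.compile(rf"^{'#' * level} .+$", re.MULTILINE)
    starts = [m.start() for m in header.finditer(text)]
    if not starts:
        return peel(text, level + 1, max_level)  # no headers at this level; look one level deeper
    if starts[0] != 0:
        starts = [0] + starts                    # keep any preamble before the first header
    bounds = starts + [len(text)]
    chunks: list[str] = []
    for a, b in zip(bounds, bounds[1:]):
        section = text[a:b]
        if len(section) <= BUDGET:
            chunks.append(section)               # the semantic unit stays whole
        else:
            chunks.extend(peel(section, level + 1, max_level))
    return chunks
```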
@@ -38,11 +38,9 @@ Extraction method selection is a cost-accuracy search. The goal is finding the c

  ### Available Methods

- **Regex / Python** — Cost: zero. Speed: instant. Deterministic.
- Works well for: dates, monetary amounts, percentages, identifiers, fixed phrases, any value with a predictable format.
+ **Regex / Python** — Cost: zero. Speed: instant. Deterministic. Works well for: dates, monetary amounts, percentages, identifiers, fixed phrases, any value with a predictable format.

- **Worker LLM** — Cost: API tokens. Speed: seconds. Semantic understanding.
- Works well for: contextual interpretation, conditional values, semantic matching, ambiguous structures, suggestive or misleading language detection, table interpretation, anything requiring understanding rather than pattern matching.
+ **Worker LLM** — Cost: API tokens. Speed: seconds. Semantic understanding. Works well for: contextual interpretation, conditional values, semantic matching, ambiguous structures, suggestive or misleading language detection, table interpretation, anything requiring understanding rather than pattern matching.

  Many real verification tasks require semantic understanding — "is this description misleading?", "does this clause adequately disclose risk?", "is this guarantor's business description consistent with their stated industry?" — regex cannot handle these. Use worker LLM without hesitation for such tasks.
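A small illustration of the zero-cost regex path for a value with a predictable format; the pattern and sample sentences are invented for the example, not taken from the package:

```python
import re

# Illustrative extractor for a monetary amount with a predictable format.
AMOUNT = re.compile(r"(?:人民币|RMB|¥)\s*([\d,]+(?:\.\d+)?)\s*(?:万元|元)?")

def extract_amount(text: str) -> str | None:
    match = AMOUNT.search(text)
    return match.group(0) if match else None

print(extract_amount("借款金额为人民币 1,200,000 元整"))   # 人民币 1,200,000 元
print(extract_amount("the borrower shall repay in kind"))  # None: hand this one to the worker LLM
```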
@@ -119,3 +117,15 @@ When designing extraction for worker LLM workflows:
  3. If the section exceeds available context, narrow further via tree processing.
  4. Always leave room for the model's response.
  5. Test with the actual model to verify the context fits — token counts from the coding agent may differ from the worker LLM's tokenizer.
+
+ ## Extraction has corner cases too
+
+ Extraction is **as important as judgment** for final accuracy. A common observation across projects: more than half of the final errors trace back to extraction problems, not judgment — the extractor returned the wrong value, the wrong unit, or pulled from the wrong section, and the judge faithfully concluded the wrong verdict from the wrong input.
+
+ Treat extraction with the same iteration discipline as judgment:
+
+ - **Reflection / iteration**: after running an extractor on the sample set, look at the cases where it failed. Is the failure a missing pattern (add to the prompt or regex)? A format quirk (unit conversion, locale)? A document-class issue (extractor right for class A but wrong for class B)?
+ - **Corner-case registration**: when an extraction failure can't be fixed without disproportionate cost to the standard extractor, log it as a corner case in `corner-case-management` — same registry shape as a judgment corner case, just resolution typed as `code` / `prompt` / `parser`-class transformation.
+ - **Validate the extractor independently of the judge**: an end-to-end test that fails only on the judgment side may hide a bad extractor whose outputs happen to verdict correctly *most* of the time. Use QC review to spot-check extracted values, not just final verdicts.
+
+ When you're tempted to fix accuracy by tuning the judge's prompt, first check whether the extractor is giving the judge the right input. The cheaper, more durable fix is almost always in the extractor.
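One way to act on "validate the extractor independently of the judge" is a spot-check pass over result records. A hedged Python sketch; the field names are assumptions, since the actual result schema is defined elsewhere in the package:

```python
import random

def spot_check_extraction(results: list[dict], sample_size: int = 20) -> list[dict]:
    """Report extraction mismatches even when the final verdict happened to be right."""
    sampled = random.sample(results, min(sample_size, len(results)))
    mismatches = []
    for r in sampled:
        if r["extracted_value"] != r["expected_value"]:
            mismatches.append({
                "result_id": r["result_id"],
                "extracted": r["extracted_value"],
                "expected": r["expected_value"],
                # a correct verdict on top of a wrong extraction is the case that hides longest
                "verdict_was_correct": r["verdict"] == r["expected_verdict"],
            })
    return mismatches
```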
@@ -8,6 +8,20 @@ description: Design and execute quality control for production verification work

  Quality control is the Observer role. You are watching the worker LLMs perform and deciding whether they are doing it well enough. The goal is not to review every result — that would defeat the purpose of automation. The goal is to review just enough to maintain confidence that the system is working.

+ ## How this skill cooperates with the others
+
+ Quality control is one part of a tightly-cooperating set of skills. Don't replicate content from a sibling skill here — point to it. Skills loaded together in the same phase are already accessible to the conductor; re-injecting their material into this skill just bloats both.
+
+ The relationships:
+
+ - `confidence-system` defines how confidence is composed and calibrated. When QC uses confidence to triage which results need more review, it consumes confidence — but the design of confidence belongs there.
+ - `evolution-loop` is the closed-loop machinery for turning QC findings into improvements. QC produces signals (failures, drift, recurring patterns); evolution-loop decides how to act on them.
+ - `corner-case-management` is where exceptions discovered by QC live. QC surfaces "this one didn't fit"; corner-case-management decides whether it's a corner case to register, a systemic problem to promote to mainline, or a data-quality issue to escalate.
+ - `cross-document-verification` is its own check class. QC's job is to verify those rules are running as designed, not to re-explain how to build them.
+ - `dashboard-reporting` is where QC results surface to the developer user. QC produces the data; the dashboard renders it.
+
+ Practical implication for authoring: if you find yourself writing in this file something that more naturally belongs to one of the skills above, write a one-sentence pointer here ("see `confidence-system` for how confidence is composed") and leave the depth in the right place. The conductor will have the other skill loaded when it needs the detail.
+
  ## Five-Layer QA Architecture

  Quality control is not one activity — it is five layers that build on each other. Lower layers must pass before higher layers run.