kc-beta 0.8.1 → 0.8.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/src/agent/context.js +17 -1
- package/src/agent/engine.js +85 -8
- package/src/agent/llm-client.js +24 -1
- package/src/agent/pipelines/_milestone-derive.js +78 -7
- package/src/agent/pipelines/skill-authoring.js +19 -2
- package/src/agent/tools/release.js +94 -1
- package/src/cli/index.js +28 -7
- package/template/.env.template +1 -1
- package/template/AGENT.md +2 -2
- package/template/skills/en/auto-model-selection/SKILL.md +55 -35
- package/template/skills/en/bootstrap-workspace/SKILL.md +13 -0
- package/template/skills/en/compliance-judgment/SKILL.md +14 -0
- package/template/skills/en/confidence-system/SKILL.md +30 -8
- package/template/skills/en/corner-case-management/SKILL.md +53 -33
- package/template/skills/en/cross-document-verification/SKILL.md +88 -83
- package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
- package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
- package/template/skills/en/data-sensibility/SKILL.md +19 -12
- package/template/skills/en/document-chunking/SKILL.md +99 -15
- package/template/skills/en/entity-extraction/SKILL.md +14 -4
- package/template/skills/en/quality-control/SKILL.md +14 -0
- package/template/skills/en/rule-extraction/SKILL.md +92 -94
- package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
- package/template/skills/en/skill-authoring/SKILL.md +52 -8
- package/template/skills/en/skill-creator/SKILL.md +25 -3
- package/template/skills/en/skill-to-workflow/SKILL.md +23 -4
- package/template/skills/en/task-decomposition/SKILL.md +1 -1
- package/template/skills/en/tree-processing/SKILL.md +1 -1
- package/template/skills/en/version-control/SKILL.md +15 -0
- package/template/skills/en/work-decomposition/SKILL.md +21 -35
- package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
- package/template/skills/zh/bootstrap-workspace/SKILL.md +13 -0
- package/template/skills/zh/compliance-judgment/SKILL.md +14 -0
- package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
- package/template/skills/zh/confidence-system/SKILL.md +34 -9
- package/template/skills/zh/corner-case-management/SKILL.md +71 -104
- package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
- package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
- package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
- package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
- package/template/skills/zh/data-sensibility/SKILL.md +13 -0
- package/template/skills/zh/document-chunking/SKILL.md +96 -20
- package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
- package/template/skills/zh/entity-extraction/SKILL.md +14 -4
- package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
- package/template/skills/zh/quality-control/SKILL.md +14 -0
- package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
- package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
- package/template/skills/zh/rule-extraction/SKILL.md +199 -188
- package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
- package/template/skills/zh/skill-authoring/SKILL.md +108 -69
- package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
- package/template/skills/zh/skill-creator/SKILL.md +71 -61
- package/template/skills/zh/skill-creator/references/schemas.md +60 -60
- package/template/skills/zh/skill-to-workflow/SKILL.md +24 -5
- package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
- package/template/skills/zh/task-decomposition/SKILL.md +1 -1
- package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
- package/template/skills/zh/tree-processing/SKILL.md +1 -1
- package/template/skills/zh/version-control/SKILL.md +15 -0
- package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
- package/template/skills/zh/work-decomposition/SKILL.md +21 -33
Selected hunks:

`package/template/skills/en/dashboard-reporting/SKILL.md`:

```diff
@@ -1,107 +1,132 @@
 ---
 name: dashboard-reporting
 tier: meta-meta
-description: Generate HTML dashboards for developer users to visualize verification results, system progress, and quality metrics. Use when a testing round completes, when production batches finish processing, when the developer user wants
+description: Generate HTML dashboards for developer users to visualize verification results, system progress, and quality metrics. Use when a testing round completes, when production batches finish processing, when the developer user wants visual reporting, or when they explicitly ask for it. Dashboards are self-contained HTML files. Use this skill **when there's something visual worth showing** — not as a default deliverable. For routine status updates use KC's TUI. The dashboard is a complement to direct reporting, not a substitute.
 ---

 # Dashboard Reporting

-The dashboard is the developer user's
+The dashboard is one channel — and not always the most economical one — for letting the developer user see what's going on. KC already reports status directly in the TUI; the HTML dashboard exists for things that are **worth seeing visually**: distributions, timelines, heatmaps, side-by-side comparisons, drill-down tables.

-
+Don't treat dashboard generation as a checkbox to satisfy this skill. Treat it as a deliverable the developer user actually asked for, or where a picture genuinely saves them time over reading TUI output or JSON.

-
-Generated after each batch of documents is processed.
+## Minimum vs. nice-to-have

-
-- **Summary bar**: Total documents, pass rate, fail rate, missing rate, error rate.
-- **Per-rule breakdown**: Table showing each rule's pass/fail counts, accuracy, and average confidence.
-- **Failed cases**: List of documents that failed, with the rule, extracted value, expected value, and comment. Sortable and filterable.
-- **Confidence distribution**: Histogram showing how many results fall in each confidence band.
+When the developer user asks for a dashboard, **start with the minimum and expand only if it adds real value**.

-###
-Generated on demand to show the system's evolution.
+### Minimum

-
--
--
--
-- **Model tier assignments**: Which model is being used for each step of each rule.
+A useful dashboard at the floor:
+- A summary header: total documents, top-line pass/fail/missing counts.
+- A per-rule table: rule_id, accuracy, pass / fail / NA counts, optional confidence column.
+- A list of failed cases the user can click into for details (rule, extracted value, expected value, comment).

-
-Generated after quality control reviews.
+That's enough to ship. If the user can answer "is this batch healthy and which rules failed" in 3 seconds, the minimum is done.

-
-- **Accuracy over time**: Line chart showing per-rule and overall accuracy across batches.
-- **Sampling rate over time**: Showing how monitoring is decreasing (or not).
-- **Flagged issues**: Open issues that need developer user attention.
-- **Cost metrics**: LLM calls and tokens per document, per rule.
+### Nice-to-have

-
+Add these only when they're justified by the data on hand or the user's actual need:
+- Confidence distribution histogram (useful when confidence is calibrated and the user cares about the distribution shape).
+- Accuracy-over-time line chart (useful only when there's enough history to draw a meaningful curve).
+- Per-product-type / per-issuer breakdown (useful when the corpus has meaningful segmentation).
+- Cost metrics (useful when cost is a live concern; otherwise skip).
+- Drill-down navigation (summary → rule → document).
+- Inline feedback widgets (correction-on-click, flag-as-wrong).

-
+Don't add a section to look thorough. An empty "Confidence distribution" chart with no calibrated data is worse than no chart.

-
+## Dashboard types (when to use which)

-
-
-- **Result override**: Change a pass to fail (or vice versa) with a reason.
-- **Rule re-evaluation request**: Flag a result for re-processing with a different approach.
-- **Comment**: Free-text annotation on any result.
+### Results dashboard
+After a batch of documents is processed. The minimum above usually covers it.

-###
+### Progress dashboard
+On demand, to show the system's evolution across phases. Lifecycle status per rule, rule catalog table, evolution timeline. Mostly useful when the developer user wants a "where are we" snapshot mid-build.

-
-
-- **Add comment**: Brief text explaining what they think is wrong.
-- **Severity indicator**: How impactful is this error? (Critical / Important / Minor)
+### Quality dashboard
+After QC review cycles. Accuracy-over-time, sampling rate trend, flagged issues, cost. Useful when QC has accumulated enough cycles to show a trend.

-
+If only one of the three would actually help the developer user right now, build only that one. Don't generate all three by default.

-
+## Feedback collection (optional but recommended when applicable)

-
-2. Record schema: `{result_id, trace_id, reporter_role, feedback_type, original_result, corrected_value, comment, timestamp}`.
-3. Feedback records are fed into the `evolution-loop` as confirmed failures.
-4. Dashboard surfaces feedback trends: correction rate over time, most-reported issues, rules with highest user correction rates.
+When the dashboard is destined for an audience that's going to review the results (developer user, end user, domain expert), include feedback widgets. When the dashboard is purely for developer-user inspection mid-build, feedback widgets are usually overkill — they pretend at a workflow the user isn't going to follow.

-
+### Developer-user feedback
+
+Full result detail visible. Useful widgets:
+- Field-level correction: click an extracted value, provide the right one.
+- Result override: change pass to fail (or vice versa) with a reason.
+- Comment: free-text annotation on any result.
+
+### End-user feedback
+
+Simplified results visible. Useful widgets:
+- Flag-as-wrong: one-click to report a result believed incorrect.
+- Comment: brief text explanation.
+- Severity indicator: critical / important / minor.
+
+### Feedback as ground truth
+
+User-reported errors are ground truth. They override agent judgment and worker-LLM output. Flow:
+
+1. Submit via dashboard → stored as structured record.
+2. Schema: `{result_id, trace_id, reporter_role, feedback_type, original_result, corrected_value, comment, timestamp}`.
+3. Records feed into `evolution-loop` as confirmed failures.
+4. Surface feedback trends in subsequent dashboards (correction rate over time, most-reported issues, rules with highest correction rates).

 ## Technology

-Self-contained HTML with embedded CSS
-- **No external dependencies.** No CDN links, no npm packages, no server. Everything
-- **No server required.**
-- **Responsive layout.** Should work on
-- **Dark/light mode
+Self-contained HTML with embedded CSS / JavaScript.
+- **No external dependencies.** No CDN links, no npm packages, no server. Everything inlined.
+- **No server required.** Developer user double-clicks the HTML file.
+- **Responsive layout.** Should work on desktop and mobile.
+- **Dark/light mode** — respect system preference or provide a toggle.

-For charts, use inline SVG or a lightweight chart library
+For charts, use inline SVG or a lightweight chart library inlined as a `<script>` tag.

-## Data
+## Data sources

 Dashboards read from:
 - `Output/` for verification results.
 - `logs/` for evolution and testing history.
-- `versions.json` for current system state.
-- QC review records (stored alongside Output
+- `versions.json` (or git log) for current system state.
+- QC review records (stored alongside `Output/`).

-The
+The generation script should accept input paths and produce a single HTML file.

-## Generation
+## Generation triggers

-Generate dashboards
--
--
--
--
+Generate dashboards when:
+- A testing round completes AND there's enough data to be worth visualizing.
+- A production batch finishes AND the developer user wants a visual.
+- A QC review cycle completes.
+- The developer user explicitly requests one.

-
+Don't auto-generate on every minor event — the dashboards pile up fast and the user won't open most of them. When unsure, ask the user ("Want me to generate a dashboard?") instead of producing one unprompted.

-
+Store generated dashboards in `Output/dashboards/` with timestamped filenames for history.

-
-
+## Design principles
+
+- **Lead with the summary.** Developer user should understand health in 3 seconds.
+- **Drill down on demand.** Summary → rule-level → document-level. Don't overwhelm with details upfront.
 - **Color coding.** Green for pass/healthy, red for fail/critical, yellow for warning/attention. Simple and universal.
-- **Actionable.** Every flagged issue should suggest
+- **Actionable.** Every flagged issue should suggest a next step.
+
+A starter script is available in `scripts/generate_dashboard.py`. Adapt to the specific scenario — and feel free to trim the script when half its sections wouldn't have content. A small dashboard that answers the user's question beats a comprehensive one they don't need.
+
+## Relationship to TUI reporting
+
+KC's TUI already supports rich status reporting during the run. Use TUI for:
+- Ongoing progress narration.
+- Per-phase summaries.
+- Quick "what just happened" updates.
+- Anything that can be communicated in a few lines of text.
+
+Use HTML dashboards for:
+- Visual artifacts that wouldn't fit (distributions, charts, filterable tables).
+- Hand-off to non-KC users (developer-user reviewing later, end-user audience).
+- Persistent records the user wants to revisit.

-A
+When in doubt, prefer the TUI. A short status message the user is already reading beats a dashboard they have to open.
```
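The feedback schema in the new text maps directly onto a typed record. A minimal Python sketch: the field names come from the diff, while the role and type enumerations, the example values, and the `FeedbackRecord` name are illustrative assumptions.

```python
from datetime import datetime, timezone
from typing import Optional, TypedDict


class FeedbackRecord(TypedDict):
    """One user-submitted correction; fields mirror the schema in the skill text."""
    result_id: str
    trace_id: str
    reporter_role: str      # e.g. "developer_user" | "end_user" (values assumed)
    feedback_type: str      # e.g. "field_correction" | "result_override" | "flag_as_wrong" | "comment"
    original_result: str
    corrected_value: Optional[str]  # None for pure comments / flags
    comment: str
    timestamp: str          # ISO 8601, UTC


# Example: an end user flags one result as wrong (all values invented).
record: FeedbackRecord = {
    "result_id": "r-0042",
    "trace_id": "t-0042",
    "reporter_role": "end_user",
    "feedback_type": "flag_as_wrong",
    "original_result": "pass",
    "corrected_value": "fail",
    "comment": "Disclosure clause is missing on page 3.",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}
```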
`package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py`:

```diff
@@ -107,7 +107,7 @@ def generate_html(summary: dict, per_rule: dict, failed_cases: list[dict]) -> st
 <head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
-<title>KC
+<title>KC — Verification Dashboard</title>
 <style>
 :root {{ --bg: #1a1a2e; --surface: #16213e; --text: #e0e0e0; --accent: #4caf50; --warn: #ff9800; --err: #f44336; }}
 @media (prefers-color-scheme: light) {{
```
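The doubled braces in the hunk above (`:root {{ ... }}`) are format-string escapes: `generate_html` evidently builds the page from a Python f-string, where a literal `{` or `}` must be written `{{` or `}}`. A minimal sketch of that pattern, not the actual script:

```python
def render_page(title: str, pass_rate: float) -> str:
    # In an f-string, {{ and }} emit literal braces, so CSS rules survive;
    # single braces interpolate Python values like pass_rate.
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>{title}</title>
<style>
:root {{ --bg: #1a1a2e; --text: #e0e0e0; }}
body {{ background: var(--bg); color: var(--text); }}
</style>
</head>
<body><h1>{title}</h1><p>Pass rate: {pass_rate:.1%}</p></body>
</html>"""


print(render_page("KC — Verification Dashboard", 0.937))
```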
`package/template/skills/en/data-sensibility/SKILL.md`:

```diff
@@ -27,23 +27,17 @@ Do this for each new document type. Do it again when document sources change. 30

 After reading, answer these questions explicitly — write the answers down, not just think them:

-**What is consistent across all documents?**
-Header structure, field positions, terminology, date formats. These are your anchors. Design extraction around them.
+**What is consistent across all documents?** Header structure, field positions, terminology, date formats. These are your anchors. Design extraction around them.

-**What varies?**
-Table layouts, section ordering, field presence, formatting conventions. These are your risk points. Every variant needs a test case.
+**What varies?** Table layouts, section ordering, field presence, formatting conventions. These are your risk points. Every variant needs a test case.

-**What is surprising?**
-Anything you did not expect. A field that is sometimes missing. A value expressed in different units across documents. A section that appears in some templates but not others.
+**What is surprising?** Anything you did not expect. A field that is sometimes missing. A value expressed in different units across documents. A section that appears in some templates but not others.

-**Document subtypes?**
-Are there different templates, issuers, or time periods represented? A "loan contract" from Bank A may look nothing like one from Bank B. Identify subtypes early — they often need separate extraction paths.
+**Document subtypes?** Are there different templates, issuers, or time periods represented? A "loan contract" from Bank A may look nothing like one from Bank B. Identify subtypes early — they often need separate extraction paths.

-**Section lengths?**
-Measure them. A section that averages 200 tokens is fine for any model. A section that occasionally runs to 8,000 tokens will blow your context window budget. Plan accordingly.
+**Section lengths?** Measure them. A section that averages 200 tokens is fine for any model. A section that occasionally runs to 8,000 tokens will blow your context window budget. Plan accordingly.

-**Encoding issues?**
-Full-width vs half-width characters (12.5% vs 12.5%). Unicode normalization problems. OCR artifacts. These cause silent extraction failures because the text looks correct to human eyes but does not match regex patterns.
+**Encoding issues?** Full-width vs half-width characters (12.5% vs 12.5%). Unicode normalization problems. OCR artifacts. These cause silent extraction failures because the text looks correct to human eyes but does not match regex patterns.

 ## Spot-Check Protocol

```
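The encoding bullet is worth making concrete: full-width characters look right to a human but fail ASCII regexes, and NFKC normalization usually repairs this before matching. A self-contained illustration, not code from the package:

```python
import re
import unicodedata

# Full-width "12.5%" reads correctly to a human but the dot and percent
# sign are U+FF0E and U+FF05, so an ASCII pattern silently misses it.
fullwidth = "１２．５％"
pattern = re.compile(r"\d+\.\d+%")

print(bool(pattern.search(fullwidth)))                 # False — silent miss
normalized = unicodedata.normalize("NFKC", fullwidth)  # → "12.5%"
print(bool(pattern.search(normalized)))                # True
```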
```diff
@@ -105,6 +99,19 @@ When something goes wrong — and it will — you can inspect each stage independen

 Keep intermediates for at least the current iteration. Delete old iterations only when disk space becomes a real constraint.

+## Looking at the corpus when it doesn't fit in your head
+
+A foundational constraint to plan around: you have a finite context window. Reading dozens of sample documents in a row will push earlier observations out of your working memory before you finish, leaving you with the impression of having seen the corpus but not the ability to actually generalize from it.
+
+Treat the corpus the way a statistician would treat a population: sample, summarize, and don't try to keep the population in your head. A few approaches that work in practice:
+
+- **Use the file system as memory.** Write a `notes/data_observations.md` (or per-rule `notes/<rule_id>_observations.md`) as you scan. Note field name variants, format quirks, missing-section patterns, surprising values. Re-read the notes file next session instead of re-scanning the docs.
+- **Per-rule notepads / memory.md.** For each rule, keep a short `memory.md` that captures "what I've seen across the sample set for this rule" — which documents trigger it, what values appear, what edge cases exist. Update incrementally rather than re-deriving it each time you look at the rule.
+- **Dispatch subagents to explore samples.** When the corpus is large, send a subagent (via the `agent_tool`) to scan a directory and return summary statistics or a short markdown report. The subagent's full reads stay in its own context; you receive only the digest. This is the right tool when you'd otherwise spend context budget reading dozens of files for a single observation.
+- **Statistical / meta views over individual reads.** Instead of reading 20 income certificates, run a regex over all of them and count format variants. Instead of opening every annual report, list filenames and group by issuer / year. Build the meta view first, then dive into representatives.
+
+The principle: aim for **enough samples to characterize the distribution**, not enough samples to memorize the corpus. The former fits in your head and in your notes. The latter doesn't.
+
 ## Integration

 Feed your observations into downstream skills:
```
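The "statistical / meta views" bullet translates into a few lines of Python. A sketch under stated assumptions; the directory layout and the date-format patterns are invented for illustration:

```python
import re
from collections import Counter
from pathlib import Path

# Count date-format variants across a sample directory instead of reading
# every file end to end. Paths and patterns are illustrative.
VARIANTS = {
    "iso":     re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "slashed": re.compile(r"\b\d{4}/\d{2}/\d{2}\b"),
    "cjk":     re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"),
}

counts: Counter[str] = Counter()
for path in Path("samples/income_certificates").glob("*.txt"):
    text = path.read_text(encoding="utf-8", errors="replace")
    for name, pat in VARIANTS.items():
        if pat.search(text):
            counts[name] += 1

print(counts.most_common())  # the meta view: which formats exist, roughly how often
```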
`package/template/skills/en/document-chunking/SKILL.md`:

```diff
@@ -2,32 +2,116 @@
 name: document-chunking
 tier: meta
 description: >
-
-
-
-
+  Split documents into chunks for downstream processing. Use when batching samples
+  for observation, feeding extraction workflows, or breaking long regulation documents
+  into pieces small enough to fit a worker LLM. Covers cheap methods (page, fixed-size,
+  header-based) for quick exploration AND the onion-peeler hierarchical strategy +
+  wedge fallback for production-grade chunking of long structured documents. Also
+  covers the central balance question: chunk-too-big (information lost in a haystack)
+  vs. chunk-too-small (semantic continuity broken).
 ---

 # Document Chunking

-Split documents into pieces for downstream processing.
+Split documents into pieces for downstream processing. Two regimes:

-
+- **Cheap chunking** — fast methods for batch observation and exploratory processing of samples.
+- **Hierarchical chunking** — the onion-peeler strategy (borrowed from pdf2skills' methodology) for long structured documents where semantic boundaries matter, with the wedge fallback for stretches that have no headers.
+
+The most important question across both regimes: **how big should a chunk be**? See "Finding the balance" below before settling on specific sizes.
+
+## Quick Methods

 **Page-level splits** — simplest. Each page is a chunk. Works for most document processing where you need to iterate over content.

-**Fixed-size chunks** — split by character
+**Fixed-size chunks** — split by character or token count with overlap. Good for search and initial observation. Typical: a few thousand chars with modest overlap to keep cross-boundary phrases recoverable.
+
+**Header-based splits** — detect section headers and split at boundaries. Preserves semantic units. Works when the document has a consistent header convention you can express as regex.
+
+## Onion Peeler — Hierarchical Strategy (primary for long structured docs)
+
+Hierarchical, header-based decomposition. Called "onion peeler" because you peel the document layer by layer, from the outermost structure inward.
+
+### How it works
+
+1. **Parse the document's heading hierarchy.** Identify all headers at every level (H1, H2, H3 — or the document's equivalent: "Part I", "Chapter 1", "Section 1.1", "Article 1").
+2. **Build a tree.** Each header is a node. Content between headers belongs to the nearest ancestor.
+3. **Check size.** Walk the tree. If a node's content (including all descendants) fits within the processing budget, stop there — that node is one chunk.
+4. **Descend only when needed.** If a node is over budget, descend into its children. Only split when the node is genuinely too large AND has sub-headers available.
+5. **Leaf nodes still over budget** → hand off to the wedge fallback.
+
+### Why it works
+
+- Respects the document's own semantic structure. "Chapter 3 — Risk Disclosure" stays as one chunk because that's how the author intended it.
+- Minimizes information loss. Never cuts mid-meaning.
+- Produces variable-size chunks — and that's a feature. A short chapter as one whole chunk is better than the same chapter forcibly split in half.
+
+### Shortcuts for pattern discovery
+
+Before building a full parser, explore structural patterns on a few sample documents:
+- Do all chapter headers start with "Chapter X" or "第X章"?
+- Is section numbering consistent (1.1, 1.2, 1.3)?
+- Are there visual markers (bold, specific font, horizontal rules)?
+
+If you find a stable pattern, a regex-based chunker is faster and more reliable than LLM-based structure detection. Examples:
+- `^第[一二三四五六七八九十百]+章` matches Chinese chapter headers
+- `^Chapter \d+` matches English chapter headers
+- `^\d+\.\d+` matches numbered subsections
+
+Validate the regex on multiple documents before relying on it.
+
+## Wedge Fallback (for content without clear headers)
+
+For dense legal text, continuous prose, or onion-peeler leaf nodes that are still too large with no sub-headers to descend into.
+
+### How it works
+
+Uses a **rolling context window** so the algorithm scales to documents of arbitrary length.
+
+1. **Window the content.** Load up to MAX_TOKENS of unprocessed text into a window (configurable; pick a size your LLM can comfortably read).
+2. **Have the LLM mark cut points.** Prompt the LLM to identify 1-3 natural breakpoints in the window where topic / subject shifts. For each cut point, the LLM returns:
+   - `tokens_before`: ~K tokens (e.g., K=50) preceding the cut, quoted verbatim from the source.
+   - `tokens_after`: ~K tokens following the cut, quoted verbatim.
+   - `chunk_title`: a short title (5-10 chars) for the chunk before the cut.
+3. **Locate cuts via fuzzy match.** The LLM's quoted tokens won't match the source exactly (minor rewording, whitespace differences). Use Levenshtein distance to find the best position. Require a reasonable similarity threshold; fall back to `tokens_before`-only matching if `tokens_after` can't be located.
+4. **Slide and repeat.** Cut the text before the first confirmed breakpoint as a chunk. Slide the window to start at the cut point. Repeat until the remaining text fits in a single chunk.
+
+### Why it works
+
+- LLM identifies semantic boundaries, not arbitrary character positions.
+- LLM doesn't regenerate text — it only quotes positions. No hallucination risk.
+- Token-quote + Levenshtein matching is language-agnostic: works on Chinese, English, mixed-language docs.
+- Rolling window scales to any document length.
+- Fuzzy matching handles inevitable small differences between LLM-quoted text and source.
+
+### When to use it
+
+- Only when onion-peeler can't proceed (no sub-headers available).
+- For unstructured documents with no formal markers.
+- Cost-aware: this method calls the LLM. Pick the cheapest model that can identify topic boundaries (typically tier 3 or 4 is enough).
+
+## Finding the balance — when to stop splitting
+
+The two failure modes:
+
+- **Chunks too big**: relevant content gets buried in a haystack inside the LLM's context. Even within the LLM's window, attention spreads thin across long inputs — the longer the chunk, the more likely the actual evidence is missed.
+- **Chunks too small**: semantic continuity breaks. A rule that needs "the company is a bank" + "the loan exceeds threshold X" to fire might see those facts split across chunks and lose the conjunction.
+
+How to find the balance:

-
+1. **Anchor on the downstream task, not the LLM's context window.** The chunk should be large enough to contain the evidence a downstream rule needs in one piece. If a rule needs to compare two clauses, those clauses must end up in the same chunk.
+2. **Use semantic boundaries over fixed sizes.** A chunk that ends at a section boundary is more useful than a chunk that hit a target token count mid-sentence. Onion-peeler stops where the document stops; lean on that.
+3. **Test with the actual downstream consumer.** Run a sample extraction or judgment on the chunked output. If the consumer misses evidence that's present in the source, your chunks are wrong shape — usually too big or split at the wrong boundary.
+4. **Track variance, not just average size.** A handful of giant chunks among many small ones is more of a problem than a uniform distribution at any reasonable size. The big ones are where you'd lose information.
+5. **Don't optimize blindly for the LLM's context window.** A 128K context model can technically swallow a 100K chunk; the attention to retrieve specific evidence from that chunk is a different question. Smaller, well-bounded chunks usually win.

-##
+## Practical Tips

-
--
--
--
-- Table of contents available → parse TOC for structure
+- **Chunk size depends on the downstream task.** Rule extraction by the coding agent can take very large chunks. Worker LLM verification needs chunks that comfortably fit inside its context with room for prompt + response.
+- **Preserve context.** When splitting, carry the parent header chain as context. A chunk from "Part II > Chapter 3 > Section 3.2" should include those headers so the downstream consumer knows where it sits.
+- **Cache the chunk tree.** Once a document's structure is parsed, save the tree. Many rules may need the same document's content; re-parsing is waste.
+- **Log chunking decisions.** Which strategy was used, how many chunks were produced, what the size distribution looks like. Helpful for downstream debugging.

 ## Relationship to tree-processing

-This skill
+This skill covers chunking methods. `tree-processing` covers designing the precise, coded chunking script for production verification workflows — where chunking must be deterministic, reproducible, and tested. Reach for `tree-processing` when the cheap methods above don't give you enough control for the production path.
```
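The diff describes both algorithms in prose but includes no reference implementation here. Below is a compact sketch of the onion-peeler walk and the fuzzy cut-point locator, under stated assumptions: a simple `Node` type, a budget measured in characters rather than tokens, and stdlib `difflib.SequenceMatcher` standing in for Levenshtein similarity. The wedge loop body is stubbed.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Node:                       # one heading and everything under it
    title: str
    text: str                     # body text directly under this heading
    children: list["Node"] = field(default_factory=list)

    def full_text(self) -> str:
        return self.text + "".join(c.full_text() for c in self.children)


def peel(node: Node, budget: int) -> list[str]:
    """Onion peeler: emit a node whole if it fits, else descend into children."""
    whole = node.full_text()
    if len(whole) <= budget:
        return [whole]                        # stop at the outermost layer that fits
    if not node.children:
        return wedge_fallback(whole, budget)  # leaf still over budget → wedge
    chunks = [node.text] if node.text.strip() else []
    for child in node.children:
        chunks.extend(peel(child, budget))
    return chunks


def locate_cut(source: str, quoted: str, threshold: float = 0.8) -> int | None:
    """Find where the LLM's verbatim-ish quote sits in the source (fuzzy match)."""
    best_pos, best_ratio = None, 0.0
    step = max(1, len(quoted) // 2)
    for pos in range(0, max(1, len(source) - len(quoted)), step):
        ratio = SequenceMatcher(None, source[pos:pos + len(quoted)], quoted).ratio()
        if ratio > best_ratio:
            best_pos, best_ratio = pos, ratio
    return best_pos if best_ratio >= threshold else None


def wedge_fallback(text: str, budget: int) -> list[str]:
    # Placeholder for the LLM-driven loop described above: window → ask for
    # cut-point quotes → locate_cut() → slice → slide. Naive fixed split here.
    return [text[i:i + budget] for i in range(0, len(text), budget)]
```

In a real wedge loop, `wedge_fallback` would call the worker LLM for `tokens_before` / `tokens_after` quotes and confirm each with `locate_cut` before slicing.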
`package/template/skills/en/entity-extraction/SKILL.md`:

```diff
@@ -38,11 +38,9 @@ Extraction method selection is a cost-accuracy search. The goal is finding the c

 ### Available Methods

-**Regex / Python** — Cost: zero. Speed: instant. Deterministic.
-Works well for: dates, monetary amounts, percentages, identifiers, fixed phrases, any value with a predictable format.
+**Regex / Python** — Cost: zero. Speed: instant. Deterministic. Works well for: dates, monetary amounts, percentages, identifiers, fixed phrases, any value with a predictable format.

-**Worker LLM** — Cost: API tokens. Speed: seconds. Semantic understanding.
-Works well for: contextual interpretation, conditional values, semantic matching, ambiguous structures, suggestive or misleading language detection, table interpretation, anything requiring understanding rather than pattern matching.
+**Worker LLM** — Cost: API tokens. Speed: seconds. Semantic understanding. Works well for: contextual interpretation, conditional values, semantic matching, ambiguous structures, suggestive or misleading language detection, table interpretation, anything requiring understanding rather than pattern matching.

 Many real verification tasks require semantic understanding — "is this description misleading?", "does this clause adequately disclose risk?", "is this guarantor's business description consistent with their stated industry?" — regex cannot handle these. Use worker LLM without hesitation for such tasks.

```
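As a concrete instance of the zero-cost tier, here is the kind of extractor the merged Regex / Python bullet describes. The patterns and sample text are illustrative, not from the package:

```python
import re

# Monetary amounts like "USD 1,250,000.00" or "$99.50", and ISO dates.
AMOUNT = re.compile(r"(?:USD|\$)\s?([\d,]+(?:\.\d{2})?)")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_amounts(text: str) -> list[float]:
    return [float(m.group(1).replace(",", "")) for m in AMOUNT.finditer(text)]


sample = "Facility of USD 1,250,000.00 drawn on 2024-03-15."
print(extract_amounts(sample))       # [1250000.0]
print(DATE.search(sample).group(0))  # 2024-03-15
```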
```diff
@@ -119,3 +117,15 @@ When designing extraction for worker LLM workflows:
 3. If the section exceeds available context, narrow further via tree processing.
 4. Always leave room for the model's response.
 5. Test with the actual model to verify the context fits — token counts from the coding agent may differ from the worker LLM's tokenizer.
+
+## Extraction has corner cases too
+
+Extraction is **as important as judgment** for final accuracy. A common observation across projects: more than half of the final errors trace back to extraction problems, not judgment — the extractor returned the wrong value, the wrong unit, or pulled from the wrong section, and the judge faithfully concluded the wrong verdict from the wrong input.
+
+Treat extraction with the same iteration discipline as judgment:
+
+- **Reflection / iteration**: after running an extractor on the sample set, look at the cases where it failed. Is the failure a missing pattern (add to the prompt or regex)? A format quirk (unit conversion, locale)? A document-class issue (extractor right for class A but wrong for class B)?
+- **Corner-case registration**: when an extraction failure can't be fixed without disproportionate cost to the standard extractor, log it as a corner case in `corner-case-management` — same registry shape as a judgment corner case, just resolution typed as `code` / `prompt` / `parser`-class transformation.
+- **Validate the extractor independently of the judge**: an end-to-end test that fails only on the judgment side may hide a bad extractor whose outputs happen to verdict correctly *most* of the time. Use QC review to spot-check extracted values, not just final verdicts.
+
+When you're tempted to fix accuracy by tuning the judge's prompt, first check whether the extractor is giving the judge the right input. The cheaper, more durable fix is almost always in the extractor.
```
`package/template/skills/en/quality-control/SKILL.md`:

```diff
@@ -8,6 +8,20 @@ description: Design and execute quality control for production verification work

 Quality control is the Observer role. You are watching the worker LLMs perform and deciding whether they are doing it well enough. The goal is not to review every result — that would defeat the purpose of automation. The goal is to review just enough to maintain confidence that the system is working.

+## How this skill cooperates with the others
+
+Quality control is one part of a tightly-cooperating set of skills. Don't replicate content from a sibling skill here — point to it. Skills loaded together in the same phase are already accessible to the conductor; re-injecting their material into this skill just bloats both.
+
+The relationships:
+
+- `confidence-system` defines how confidence is composed and calibrated. When QC uses confidence to triage which results need more review, it consumes confidence — but the design of confidence belongs there.
+- `evolution-loop` is the closed-loop machinery for turning QC findings into improvements. QC produces signals (failures, drift, recurring patterns); evolution-loop decides how to act on them.
+- `corner-case-management` is where exceptions discovered by QC live. QC surfaces "this one didn't fit"; corner-case-management decides whether it's a corner case to register, a systemic problem to promote to mainline, or a data-quality issue to escalate.
+- `cross-document-verification` is its own check class. QC's job is to verify those rules are running as designed, not to re-explain how to build them.
+- `dashboard-reporting` is where QC results surface to the developer user. QC produces the data; the dashboard renders it.
+
+Practical implication for authoring: if you find yourself writing in this file something that more naturally belongs to one of the skills above, write a one-sentence pointer here ("see `confidence-system` for how confidence is composed") and leave the depth in the right place. The conductor will have the other skill loaded when it needs the detail.
+
 ## Five-Layer QA Architecture

 Quality control is not one activity — it is five layers that build on each other. Lower layers must pass before higher layers run.
```
|