kc-beta 0.8.1 → 0.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (63)
  1. package/package.json +1 -1
  2. package/src/agent/context.js +17 -1
  3. package/src/agent/engine.js +85 -8
  4. package/src/agent/llm-client.js +24 -1
  5. package/src/agent/pipelines/_milestone-derive.js +78 -7
  6. package/src/agent/pipelines/skill-authoring.js +19 -2
  7. package/src/agent/tools/release.js +94 -1
  8. package/src/cli/index.js +28 -7
  9. package/template/.env.template +1 -1
  10. package/template/AGENT.md +2 -2
  11. package/template/skills/en/auto-model-selection/SKILL.md +55 -35
  12. package/template/skills/en/bootstrap-workspace/SKILL.md +13 -0
  13. package/template/skills/en/compliance-judgment/SKILL.md +14 -0
  14. package/template/skills/en/confidence-system/SKILL.md +30 -8
  15. package/template/skills/en/corner-case-management/SKILL.md +53 -33
  16. package/template/skills/en/cross-document-verification/SKILL.md +88 -83
  17. package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
  18. package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  19. package/template/skills/en/data-sensibility/SKILL.md +19 -12
  20. package/template/skills/en/document-chunking/SKILL.md +99 -15
  21. package/template/skills/en/entity-extraction/SKILL.md +14 -4
  22. package/template/skills/en/quality-control/SKILL.md +14 -0
  23. package/template/skills/en/rule-extraction/SKILL.md +92 -94
  24. package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
  25. package/template/skills/en/skill-authoring/SKILL.md +52 -8
  26. package/template/skills/en/skill-creator/SKILL.md +25 -3
  27. package/template/skills/en/skill-to-workflow/SKILL.md +23 -4
  28. package/template/skills/en/task-decomposition/SKILL.md +1 -1
  29. package/template/skills/en/tree-processing/SKILL.md +1 -1
  30. package/template/skills/en/version-control/SKILL.md +15 -0
  31. package/template/skills/en/work-decomposition/SKILL.md +21 -35
  32. package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
  33. package/template/skills/zh/bootstrap-workspace/SKILL.md +13 -0
  34. package/template/skills/zh/compliance-judgment/SKILL.md +14 -0
  35. package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
  36. package/template/skills/zh/confidence-system/SKILL.md +34 -9
  37. package/template/skills/zh/corner-case-management/SKILL.md +71 -104
  38. package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
  39. package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
  40. package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
  41. package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  42. package/template/skills/zh/data-sensibility/SKILL.md +13 -0
  43. package/template/skills/zh/document-chunking/SKILL.md +96 -20
  44. package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
  45. package/template/skills/zh/entity-extraction/SKILL.md +14 -4
  46. package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
  47. package/template/skills/zh/quality-control/SKILL.md +14 -0
  48. package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
  49. package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
  50. package/template/skills/zh/rule-extraction/SKILL.md +199 -188
  51. package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
  52. package/template/skills/zh/skill-authoring/SKILL.md +108 -69
  53. package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
  54. package/template/skills/zh/skill-creator/SKILL.md +71 -61
  55. package/template/skills/zh/skill-creator/references/schemas.md +60 -60
  56. package/template/skills/zh/skill-to-workflow/SKILL.md +24 -5
  57. package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
  58. package/template/skills/zh/task-decomposition/SKILL.md +1 -1
  59. package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
  60. package/template/skills/zh/tree-processing/SKILL.md +1 -1
  61. package/template/skills/zh/version-control/SKILL.md +15 -0
  62. package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
  63. package/template/skills/zh/work-decomposition/SKILL.md +21 -33
@@ -1,107 +1,132 @@
  ---
  name: dashboard-reporting
  tier: meta-meta
- description: Generate HTML dashboards for developer users to visualize verification results, system progress, and quality metrics. Use when a testing round completes, when production batches finish processing, when the developer user wants to see the system's status, or at any point where visual reporting would help communicate progress. Dashboards should be self-contained HTML files that can be opened by double-clicking. Also use when the developer user asks about results, accuracy, or system health.
+ description: Generate HTML dashboards for developer users to visualize verification results, system progress, and quality metrics. Use when a testing round completes, when production batches finish processing, when the developer user wants visual reporting, or when they explicitly ask for it. Dashboards are self-contained HTML files. Use this skill **when there's something visual worth showing** — not as a default deliverable. For routine status updates use KC's TUI. The dashboard is a complement to direct reporting, not a substitute.
  ---

  # Dashboard Reporting

- The dashboard is the developer user's window into the system. They should not need to read logs or parse JSON to understand what is happening. Give them a clear, visual summary that leads with what matters.
+ The dashboard is one channel — and not always the most economical one — for letting the developer user see what's going on. KC already reports status directly in the TUI; the HTML dashboard exists for things that are **worth seeing visually**: distributions, timelines, heatmaps, side-by-side comparisons, drill-down tables.

- ## Dashboard Types
+ Don't treat dashboard generation as a checkbox to satisfy this skill. Treat it as a deliverable the developer user actually asked for, or where a picture genuinely saves them time over reading TUI output or JSON.

- ### Results Dashboard
- Generated after each batch of documents is processed.
+ ## Minimum vs. nice-to-have

- Key elements:
- - **Summary bar**: Total documents, pass rate, fail rate, missing rate, error rate.
- - **Per-rule breakdown**: Table showing each rule's pass/fail counts, accuracy, and average confidence.
- - **Failed cases**: List of documents that failed, with the rule, extracted value, expected value, and comment. Sortable and filterable.
- - **Confidence distribution**: Histogram showing how many results fall in each confidence band.
+ When the developer user asks for a dashboard, **start with the minimum and expand only if it adds real value**.

- ### Progress Dashboard
- Generated on demand to show the system's evolution.
+ ### Minimum

- Key elements:
- - **Lifecycle status**: Which rules are in which phase (skill testing, workflow testing, production, stable).
- - **Rule catalog**: Table of all rules with their current status, accuracy, and version.
- - **Evolution timeline**: For each rule, how many iterations it took, what was the accuracy at each step.
- - **Model tier assignments**: Which model is being used for each step of each rule.
+ A useful dashboard at the floor:
+ - A summary header: total documents, top-line pass/fail/missing counts.
+ - A per-rule table: rule_id, accuracy, pass / fail / NA counts, optional confidence column.
+ - A list of failed cases the user can click into for details (rule, extracted value, expected value, comment).

- ### Quality Dashboard
- Generated after quality control reviews.
+ That's enough to ship. If the user can answer "is this batch healthy and which rules failed" in 3 seconds, the minimum is done.

- Key elements:
- - **Accuracy over time**: Line chart showing per-rule and overall accuracy across batches.
- - **Sampling rate over time**: Showing how monitoring is decreasing (or not).
- - **Flagged issues**: Open issues that need developer user attention.
- - **Cost metrics**: LLM calls and tokens per document, per rule.
+ ### Nice-to-have

- ## Feedback Collection
+ Add these only when they're justified by the data on hand or the user's actual need:
+ - Confidence distribution histogram (useful when confidence is calibrated and the user cares about the distribution shape).
+ - Accuracy-over-time line chart (useful only when there's enough history to draw a meaningful curve).
+ - Per-product-type / per-issuer breakdown (useful when the corpus has meaningful segmentation).
+ - Cost metrics (useful when cost is a live concern; otherwise skip).
+ - Drill-down navigation (summary → rule → document).
+ - Inline feedback widgets (correction-on-click, flag-as-wrong).

- Every dashboard must include mechanisms for users to report errors and comment directly on verification results. This is not a nice-to-have — user feedback is the most valuable data source in the system.
+ Don't add a section to look thorough. An empty "Confidence distribution" chart with no calibrated data is worse than no chart.

- ### Developer User Feedback
+ ## Dashboard types (when to use which)

- Developer users see full result detail. Their feedback interface should support:
- - **Field-level correction**: Click on an extracted value, provide the correct value.
- - **Result override**: Change a pass to fail (or vice versa) with a reason.
- - **Rule re-evaluation request**: Flag a result for re-processing with a different approach.
- - **Comment**: Free-text annotation on any result.
+ ### Results dashboard
+ After a batch of documents is processed. The minimum above usually covers it.

- ### End User Feedback
+ ### Progress dashboard
+ On demand, to show the system's evolution across phases. Lifecycle status per rule, rule catalog table, evolution timeline. Mostly useful when the developer user wants a "where are we" snapshot mid-build.

- End users of the verification app see simplified results. Their interface should support:
- - **Flag as wrong**: One-click to report a result they believe is incorrect.
- - **Add comment**: Brief text explaining what they think is wrong.
- - **Severity indicator**: How impactful is this error? (Critical / Important / Minor)
+ ### Quality dashboard
+ After QC review cycles. Accuracy-over-time, sampling rate trend, flagged issues, cost. Useful when QC has accumulated enough cycles to show a trend.

- ### Feedback as Ground Truth
+ If only one of the three would actually help the developer user right now, build only that one. Don't generate all three by default.

- User-reported errors are ground truth. They override the coding agent's judgment and the worker LLM's output. The feedback data flow:
+ ## Feedback collection (optional but recommended when applicable)

- 1. User submits feedback via dashboard stored as structured record.
- 2. Record schema: `{result_id, trace_id, reporter_role, feedback_type, original_result, corrected_value, comment, timestamp}`.
- 3. Feedback records are fed into the `evolution-loop` as confirmed failures.
- 4. Dashboard surfaces feedback trends: correction rate over time, most-reported issues, rules with highest user correction rates.
+ When the dashboard is destined for an audience that's going to review the results (developer user, end user, domain expert), include feedback widgets. When the dashboard is purely for developer-user inspection mid-build, feedback widgets are usually overkill — they pretend at a workflow the user isn't going to follow.

- Build the feedback collection mechanism alongside the dashboard generation — they are not separate features. Every generated HTML dashboard should include the feedback UI, even if it initially writes to a local JSON file that the coding agent reads on the next iteration.
+ ### Developer-user feedback
+
+ Full result detail visible. Useful widgets:
+ - Field-level correction: click an extracted value, provide the right one.
+ - Result override: change pass to fail (or vice versa) with a reason.
+ - Comment: free-text annotation on any result.
+
+ ### End-user feedback
+
+ Simplified results visible. Useful widgets:
+ - Flag-as-wrong: one-click to report a result believed incorrect.
+ - Comment: brief text explanation.
+ - Severity indicator: critical / important / minor.
+
+ ### Feedback as ground truth
+
+ User-reported errors are ground truth. They override agent judgment and worker-LLM output. Flow:
+
+ 1. Submit via dashboard → stored as structured record.
+ 2. Schema: `{result_id, trace_id, reporter_role, feedback_type, original_result, corrected_value, comment, timestamp}`.
+ 3. Records feed into `evolution-loop` as confirmed failures.
+ 4. Surface feedback trends in subsequent dashboards (correction rate over time, most-reported issues, rules with highest correction rates).

  ## Technology

- Self-contained HTML with embedded CSS and JavaScript. Requirements:
- - **No external dependencies.** No CDN links, no npm packages, no server. Everything is inlined.
- - **No server required.** The developer user double-clicks the HTML file to open it in their browser.
- - **Responsive layout.** Should work on both desktop and mobile screens.
- - **Dark/light mode.** Respect the system preference or provide a toggle.
+ Self-contained HTML with embedded CSS / JavaScript.
+ - **No external dependencies.** No CDN links, no npm packages, no server. Everything inlined.
+ - **No server required.** Developer user double-clicks the HTML file.
+ - **Responsive layout.** Should work on desktop and mobile.
+ - **Dark/light mode** respect system preference or provide a toggle.

- For charts, use inline SVG or a lightweight chart library that can be embedded (e.g., Chart.js or lightweight alternatives, inlined as a script tag).
+ For charts, use inline SVG or a lightweight chart library inlined as a `<script>` tag.

- ## Data Sources
+ ## Data sources

  Dashboards read from:
  - `Output/` for verification results.
  - `logs/` for evolution and testing history.
- - `versions.json` for current system state.
- - QC review records (stored alongside Output/).
+ - `versions.json` (or git log) for current system state.
+ - QC review records (stored alongside `Output/`).

- The dashboard generation script should accept input paths and produce a single HTML file.
+ The generation script should accept input paths and produce a single HTML file.

- ## Generation Triggers
+ ## Generation triggers

- Generate dashboards automatically after:
- - Each testing round completes (skill testing or workflow testing).
- - Each production batch finishes processing.
- - Each quality control review cycle.
- - Developer user explicitly requests it.
+ Generate dashboards when:
+ - A testing round completes AND there's enough data to be worth visualizing.
+ - A production batch finishes AND the developer user wants a visual.
+ - A QC review cycle completes.
+ - The developer user explicitly requests one.

- Store generated dashboards in `Output/dashboards/` with timestamps in filenames for history.
+ Don't auto-generate on every minor event — the dashboards pile up fast and the user won't open most of them. When unsure, ask the user ("Want me to generate a dashboard?") instead of producing one unprompted.

- ## Design Principles
+ Store generated dashboards in `Output/dashboards/` with timestamped filenames for history.

- - **Lead with the summary.** The developer user should understand the system's health in 3 seconds.
- - **Drill down on demand.** Summary → rule-level → document-level. Do not overwhelm with details upfront.
+ ## Design principles
+
+ - **Lead with the summary.** Developer user should understand health in 3 seconds.
+ - **Drill down on demand.** Summary → rule-level → document-level. Don't overwhelm with details upfront.
  - **Color coding.** Green for pass/healthy, red for fail/critical, yellow for warning/attention. Simple and universal.
- - **Actionable.** Every flagged issue should suggest what to do next.
+ - **Actionable.** Every flagged issue should suggest a next step.
+
+ A starter script is available in `scripts/generate_dashboard.py`. Adapt to the specific scenario — and feel free to trim the script when half its sections wouldn't have content. A small dashboard that answers the user's question beats a comprehensive one they don't need.
+
+ ## Relationship to TUI reporting
+
+ KC's TUI already supports rich status reporting during the run. Use TUI for:
+ - Ongoing progress narration.
+ - Per-phase summaries.
+ - Quick "what just happened" updates.
+ - Anything that can be communicated in a few lines of text.
+
+ Use HTML dashboards for:
+ - Visual artifacts that wouldn't fit (distributions, charts, filterable tables).
+ - Hand-off to non-KC users (developer-user reviewing later, end-user audience).
+ - Persistent records the user wants to revisit.

- A starter script is available in `scripts/generate_dashboard.py`. Adapt it to the specific business scenario.
+ When in doubt, prefer the TUI. A short status message the user is already reading beats a dashboard they have to open.
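The record schema in step 2 of the "Feedback as ground truth" flow maps directly to code. A minimal Python sketch, using a local JSON file as the sink in the spirit of the removed paragraph above; the path and helper name are illustrative assumptions, not part of the package:

```python
import json
import time
from pathlib import Path

# Hypothetical sink: a local JSON file the coding agent can read on the next iteration.
FEEDBACK_FILE = Path("Output/feedback/feedback_records.json")

def append_feedback(result_id: str, trace_id: str, reporter_role: str,
                    feedback_type: str, original_result: str,
                    corrected_value: str | None, comment: str) -> dict:
    """Append one record using the schema listed in the feedback flow."""
    record = {
        "result_id": result_id,
        "trace_id": trace_id,
        "reporter_role": reporter_role,      # e.g. developer_user / end_user
        "feedback_type": feedback_type,      # e.g. field_correction / result_override / flag
        "original_result": original_result,
        "corrected_value": corrected_value,
        "comment": comment,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    FEEDBACK_FILE.parent.mkdir(parents=True, exist_ok=True)
    records = json.loads(FEEDBACK_FILE.read_text()) if FEEDBACK_FILE.exists() else []
    records.append(record)
    FEEDBACK_FILE.write_text(json.dumps(records, ensure_ascii=False, indent=2))
    return record
```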
@@ -107,7 +107,7 @@ def generate_html(summary: dict, per_rule: dict, failed_cases: list[dict]) -> st
  <head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
- <title>KC Reborn — Verification Dashboard</title>
+ <title>KC — Verification Dashboard</title>
  <style>
  :root {{ --bg: #1a1a2e; --surface: #16213e; --text: #e0e0e0; --accent: #4caf50; --warn: #ff9800; --err: #f44336; }}
  @media (prefers-color-scheme: light) {{
@@ -27,23 +27,17 @@ Do this for each new document type. Do it again when document sources change. 30

  After reading, answer these questions explicitly — write the answers down, not just think them:

- **What is consistent across all documents?**
- Header structure, field positions, terminology, date formats. These are your anchors. Design extraction around them.
+ **What is consistent across all documents?** Header structure, field positions, terminology, date formats. These are your anchors. Design extraction around them.

- **What varies?**
- Table layouts, section ordering, field presence, formatting conventions. These are your risk points. Every variant needs a test case.
+ **What varies?** Table layouts, section ordering, field presence, formatting conventions. These are your risk points. Every variant needs a test case.

- **What is surprising?**
- Anything you did not expect. A field that is sometimes missing. A value expressed in different units across documents. A section that appears in some templates but not others.
+ **What is surprising?** Anything you did not expect. A field that is sometimes missing. A value expressed in different units across documents. A section that appears in some templates but not others.

- **Document subtypes?**
- Are there different templates, issuers, or time periods represented? A "loan contract" from Bank A may look nothing like one from Bank B. Identify subtypes early — they often need separate extraction paths.
+ **Document subtypes?** Are there different templates, issuers, or time periods represented? A "loan contract" from Bank A may look nothing like one from Bank B. Identify subtypes early — they often need separate extraction paths.

- **Section lengths?**
- Measure them. A section that averages 200 tokens is fine for any model. A section that occasionally runs to 8,000 tokens will blow your context window budget. Plan accordingly.
+ **Section lengths?** Measure them. A section that averages 200 tokens is fine for any model. A section that occasionally runs to 8,000 tokens will blow your context window budget. Plan accordingly.

- **Encoding issues?**
- Full-width vs half-width characters (12.5% vs 12.5%). Unicode normalization problems. OCR artifacts. These cause silent extraction failures because the text looks correct to human eyes but does not match regex patterns.
+ **Encoding issues?** Full-width vs half-width characters (12.5% vs 12.5%). Unicode normalization problems. OCR artifacts. These cause silent extraction failures because the text looks correct to human eyes but does not match regex patterns.
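The encoding failure described above is easy to demonstrate. A small sketch, assuming Python and NFKC normalization as one possible pre-processing step; the sample string and pattern are invented for illustration:

```python
import re
import unicodedata

pattern = re.compile(r"\d+\.\d+%")
raw = "利率：１２．５％"                       # full-width digits, dot, and percent sign

print(pattern.search(raw))                    # None: looks right to a human, misses the regex
normalized = unicodedata.normalize("NFKC", raw)
print(pattern.search(normalized).group())     # 12.5%
```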

  ## Spot-Check Protocol

@@ -105,6 +99,19 @@ When something goes wrong — and it will — you can inspect each stage independently

  Keep intermediates for at least the current iteration. Delete old iterations only when disk space becomes a real constraint.

+ ## Looking at the corpus when it doesn't fit in your head
+
+ A foundational constraint to plan around: you have a finite context window. Reading dozens of sample documents in a row will push earlier observations out of your working memory before you finish, leaving you with the impression of having seen the corpus but not the ability to actually generalize from it.
+
+ Treat the corpus the way a statistician would treat a population: sample, summarize, and don't try to keep the population in your head. A few approaches that work in practice:
+
+ - **Use the file system as memory.** Write a `notes/data_observations.md` (or per-rule `notes/<rule_id>_observations.md`) as you scan. Note field name variants, format quirks, missing-section patterns, surprising values. Re-read the notes file next session instead of re-scanning the docs.
+ - **Per-rule notepads / memory.md.** For each rule, keep a short `memory.md` that captures "what I've seen across the sample set for this rule" — which documents trigger it, what values appear, what edge cases exist. Update incrementally rather than re-deriving it each time you look at the rule.
+ - **Dispatch subagents to explore samples.** When the corpus is large, send a subagent (via the `agent_tool`) to scan a directory and return summary statistics or a short markdown report. The subagent's full reads stay in its own context; you receive only the digest. This is the right tool when you'd otherwise spend context budget reading dozens of files for a single observation.
+ - **Statistical / meta views over individual reads.** Instead of reading 20 income certificates, run a regex over all of them and count format variants. Instead of opening every annual report, list filenames and group by issuer / year. Build the meta view first, then dive into representatives.
+
+ The principle: aim for **enough samples to characterize the distribution**, not enough samples to memorize the corpus. The former fits in your head and in your notes. The latter doesn't.
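As one concrete shape for the "statistical / meta views" idea above, a Python sketch that counts format variants across a sample directory and persists the digest to the notes file instead of holding the corpus in context. The directory name, the field being counted, and the patterns are assumptions made up for the example:

```python
import re
from collections import Counter
from pathlib import Path

SAMPLES = Path("samples/income_certificates")   # illustrative location
DATE_PATTERNS = {
    "iso":     re.compile(r"\d{4}-\d{2}-\d{2}"),
    "chinese": re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"),
    "slash":   re.compile(r"\d{4}/\d{1,2}/\d{1,2}"),
}

counts = Counter()
for doc in SAMPLES.glob("*.txt"):
    text = doc.read_text(encoding="utf-8", errors="ignore")
    for name, pattern in DATE_PATTERNS.items():
        if pattern.search(text):
            counts[name] += 1

# Persist the digest so the next session re-reads notes, not the corpus.
notes = Path("notes/data_observations.md")
notes.parent.mkdir(exist_ok=True)
lines = [f"- date format `{k}`: seen in {v} documents" for k, v in counts.most_common()]
notes.write_text("## Date format variants\n" + "\n".join(lines) + "\n")
```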
+
  ## Integration

  Feed your observations into downstream skills:
@@ -2,32 +2,116 @@
  name: document-chunking
  tier: meta
  description: >
- Fast, cheap chunking for processing batches of sample and input documents.
- Use when you need to split documents into manageable pieces for initial observation,
- data sensibility checks, or feeding to extraction workflows. Not for production
- verification chunking for that, use tree-processing to design a tailored chunking script.
+ Split documents into chunks for downstream processing. Use when batching samples
+ for observation, feeding extraction workflows, or breaking long regulation documents
+ into pieces small enough to fit a worker LLM. Covers cheap methods (page, fixed-size,
+ header-based) for quick exploration AND the onion-peeler hierarchical strategy +
+ wedge fallback for production-grade chunking of long structured documents. Also
+ covers the central balance question: chunk-too-big (information lost in a haystack)
+ vs. chunk-too-small (semantic continuity broken).
  ---

  # Document Chunking

- Split documents into pieces for downstream processing. This is the fast, cheap version — for batch processing of samples and inputs, not for precision verification workflows.
+ Split documents into pieces for downstream processing. Two regimes:

- ## Methods
+ - **Cheap chunking** — fast methods for batch observation and exploratory processing of samples.
+ - **Hierarchical chunking** — the onion-peeler strategy (borrowed from pdf2skills' methodology) for long structured documents where semantic boundaries matter, with the wedge fallback for stretches that have no headers.
+
+ The most important question across both regimes: **how big should a chunk be**? See "Finding the balance" below before settling on specific sizes.
+
+ ## Quick Methods

  **Page-level splits** — simplest. Each page is a chunk. Works for most document processing where you need to iterate over content.

- **Fixed-size chunks** — split by character/token count with overlap. Good for search and initial observation. Typical: 2000-4000 chars with 200 char overlap.
+ **Fixed-size chunks** — split by character or token count with overlap. Good for search and initial observation. Typical: a few thousand chars with modest overlap to keep cross-boundary phrases recoverable.
+
+ **Header-based splits** — detect section headers and split at boundaries. Preserves semantic units. Works when the document has a consistent header convention you can express as regex.
+
+ ## Onion Peeler — Hierarchical Strategy (primary for long structured docs)
+
+ Hierarchical, header-based decomposition. Called "onion peeler" because you peel the document layer by layer, from the outermost structure inward.
+
+ ### How it works
+
+ 1. **Parse the document's heading hierarchy.** Identify all headers at every level (H1, H2, H3 — or the document's equivalent: "Part I", "Chapter 1", "Section 1.1", "Article 1").
+ 2. **Build a tree.** Each header is a node. Content between headers belongs to the nearest ancestor.
+ 3. **Check size.** Walk the tree. If a node's content (including all descendants) fits within the processing budget, stop there — that node is one chunk.
+ 4. **Descend only when needed.** If a node is over budget, descend into its children. Only split when the node is genuinely too large AND has sub-headers available.
+ 5. **Leaf nodes still over budget** → hand off to the wedge fallback.
+
+ ### Why it works
+
+ - Respects the document's own semantic structure. "Chapter 3 — Risk Disclosure" stays as one chunk because that's how the author intended it.
+ - Minimizes information loss. Never cuts mid-meaning.
+ - Produces variable-size chunks — and that's a feature. A short chapter as one whole chunk is better than the same chapter forcibly split in half.
+
+ ### Shortcuts for pattern discovery
+
+ Before building a full parser, explore structural patterns on a few sample documents:
+ - Do all chapter headers start with "Chapter X" or "第X章"?
+ - Is section numbering consistent (1.1, 1.2, 1.3)?
+ - Are there visual markers (bold, specific font, horizontal rules)?
+
+ If you find a stable pattern, a regex-based chunker is faster and more reliable than LLM-based structure detection. Examples:
+ - `^第[一二三四五六七八九十百]+章` matches Chinese chapter headers
+ - `^Chapter \d+` matches English chapter headers
+ - `^\d+\.\d+` matches numbered subsections
+
+ Validate the regex on multiple documents before relying on it.
+
+ ## Wedge Fallback (for content without clear headers)
+
+ For dense legal text, continuous prose, or onion-peeler leaf nodes that are still too large with no sub-headers to descend into.
+
+ ### How it works
+
+ Uses a **rolling context window** so the algorithm scales to documents of arbitrary length.
+
+ 1. **Window the content.** Load up to MAX_TOKENS of unprocessed text into a window (configurable; pick a size your LLM can comfortably read).
+ 2. **Have the LLM mark cut points.** Prompt the LLM to identify 1-3 natural breakpoints in the window where topic / subject shifts. For each cut point, the LLM returns:
+ - `tokens_before`: ~K tokens (e.g., K=50) preceding the cut, quoted verbatim from the source.
+ - `tokens_after`: ~K tokens following the cut, quoted verbatim.
+ - `chunk_title`: a short title (5-10 chars) for the chunk before the cut.
+ 3. **Locate cuts via fuzzy match.** The LLM's quoted tokens won't match the source exactly (minor rewording, whitespace differences). Use Levenshtein distance to find the best position. Require a reasonable similarity threshold; fall back to `tokens_before`-only matching if `tokens_after` can't be located.
+ 4. **Slide and repeat.** Cut the text before the first confirmed breakpoint as a chunk. Slide the window to start at the cut point. Repeat until the remaining text fits in a single chunk.
+
+ ### Why it works
+
+ - LLM identifies semantic boundaries, not arbitrary character positions.
+ - LLM doesn't regenerate text — it only quotes positions. No hallucination risk.
+ - Token-quote + Levenshtein matching is language-agnostic: works on Chinese, English, mixed-language docs.
+ - Rolling window scales to any document length.
+ - Fuzzy matching handles inevitable small differences between LLM-quoted text and source.
+
+ ### When to use it
+
+ - Only when onion-peeler can't proceed (no sub-headers available).
+ - For unstructured documents with no formal markers.
+ - Cost-aware: this method calls the LLM. Pick the cheapest model that can identify topic boundaries (typically tier 3 or 4 is enough).
+
+ ## Finding the balance — when to stop splitting
+
+ The two failure modes:
+
+ - **Chunks too big**: relevant content gets buried in a haystack inside the LLM's context. Even within the LLM's window, attention spreads thin across long inputs — the longer the chunk, the more likely the actual evidence is missed.
+ - **Chunks too small**: semantic continuity breaks. A rule that needs "the company is a bank" + "the loan exceeds threshold X" to fire might see those facts split across chunks and lose the conjunction.
+
+ How to find the balance:

- **Header-based splits** detect section headers and split at boundaries. Preserves semantic units. Use regex patterns for the document's header convention.
+ 1. **Anchor on the downstream task, not the LLM's context window.** The chunk should be large enough to contain the evidence a downstream rule needs in one piece. If a rule needs to compare two clauses, those clauses must end up in the same chunk.
+ 2. **Use semantic boundaries over fixed sizes.** A chunk that ends at a section boundary is more useful than a chunk that hit a target token count mid-sentence. Onion-peeler stops where the document stops; lean on that.
+ 3. **Test with the actual downstream consumer.** Run a sample extraction or judgment on the chunked output. If the consumer misses evidence that's present in the source, your chunks are wrong shape — usually too big or split at the wrong boundary.
+ 4. **Track variance, not just average size.** A handful of giant chunks among many small ones is more of a problem than a uniform distribution at any reasonable size. The big ones are where you'd lose information.
+ 5. **Don't optimize blindly for the LLM's context window.** A 128K context model can technically swallow a 100K chunk; the attention to retrieve specific evidence from that chunk is a different question. Smaller, well-bounded chunks usually win.

- ## When to Use What
+ ## Practical Tips

- Pick the simplest method that serves the task:
- - Batch document observation → page-level
- - Full-text search index → fixed-size with overlap
- - Section-level extraction → header-based
- - Table of contents available → parse TOC for structure
+ - **Chunk size depends on the downstream task.** Rule extraction by the coding agent can take very large chunks. Worker LLM verification needs chunks that comfortably fit inside its context with room for prompt + response.
+ - **Preserve context.** When splitting, carry the parent header chain as context. A chunk from "Part II > Chapter 3 > Section 3.2" should include those headers so the downstream consumer knows where it sits.
+ - **Cache the chunk tree.** Once a document's structure is parsed, save the tree. Many rules may need the same document's content; re-parsing is waste.
+ - **Log chunking decisions.** Which strategy was used, how many chunks were produced, what the size distribution looks like. Helpful for downstream debugging.

  ## Relationship to tree-processing

- This skill is for quick, cheap chunking during exploration and batch processing. When you need production-grade chunking for verification workflows — where the chunking mechanism must be precise, consistent, and coded as a script — use `tree-processing` instead.
+ This skill covers chunking methods. `tree-processing` covers designing the precise, coded chunking script for production verification workflows — where chunking must be deterministic, reproducible, and tested. Reach for `tree-processing` when the cheap methods above don't give you enough control for the production path.
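To ground the onion-peeler walk added in this hunk, a minimal Python sketch under simplifying assumptions (markdown-style `#` headers and a character budget instead of tokens). A production chunker would plug in the corpus's own header regexes, count tokens with the worker LLM's tokenizer, and hand over-budget leaves to the wedge fallback:

```python
import re

BUDGET = 4000  # max characters per chunk (illustrative stand-in for a token budget)

def peel(text: str, level: int = 1, max_level: int = 4) -> list[str]:
    """Split text into chunks, descending into sub-headers only when a section is over budget."""
    if len(text) <= BUDGET or level > max_level:
        return [text]  # fits as-is, or no deeper structure left (wedge fallback would take over)
    header = re.compile(rf"^{'#' * level} .+$", re.MULTILINE)
    starts = [m.start() for m in header.finditer(text)]
    if not starts:
        return peel(text, level + 1, max_level)  # no headers at this level; look one level deeper
    if starts[0] != 0:
        starts = [0] + starts                    # keep any preamble before the first header
    bounds = starts + [len(text)]
    chunks: list[str] = []
    for a, b in zip(bounds, bounds[1:]):
        section = text[a:b]
        if len(section) <= BUDGET:
            chunks.append(section)               # the semantic unit stays whole
        else:
            chunks.extend(peel(section, level + 1, max_level))
    return chunks
```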
@@ -38,11 +38,9 @@ Extraction method selection is a cost-accuracy search. The goal is finding the c

  ### Available Methods

- **Regex / Python** — Cost: zero. Speed: instant. Deterministic.
- Works well for: dates, monetary amounts, percentages, identifiers, fixed phrases, any value with a predictable format.
+ **Regex / Python** — Cost: zero. Speed: instant. Deterministic. Works well for: dates, monetary amounts, percentages, identifiers, fixed phrases, any value with a predictable format.

- **Worker LLM** — Cost: API tokens. Speed: seconds. Semantic understanding.
- Works well for: contextual interpretation, conditional values, semantic matching, ambiguous structures, suggestive or misleading language detection, table interpretation, anything requiring understanding rather than pattern matching.
+ **Worker LLM** — Cost: API tokens. Speed: seconds. Semantic understanding. Works well for: contextual interpretation, conditional values, semantic matching, ambiguous structures, suggestive or misleading language detection, table interpretation, anything requiring understanding rather than pattern matching.

  Many real verification tasks require semantic understanding — "is this description misleading?", "does this clause adequately disclose risk?", "is this guarantor's business description consistent with their stated industry?" — regex cannot handle these. Use worker LLM without hesitation for such tasks.
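A small illustration of the zero-cost regex path for a value with a predictable format; the pattern and sample sentences are invented for the example, not taken from the package:

```python
import re

# Illustrative extractor for a monetary amount with a predictable format.
AMOUNT = re.compile(r"(?:人民币|RMB|¥)\s*([\d,]+(?:\.\d+)?)\s*(?:万元|元)?")

def extract_amount(text: str) -> str | None:
    match = AMOUNT.search(text)
    return match.group(0) if match else None

print(extract_amount("借款金额为人民币 1,200,000 元整"))   # 人民币 1,200,000 元
print(extract_amount("the borrower shall repay in kind"))  # None: hand this one to the worker LLM
```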
@@ -119,3 +117,15 @@ When designing extraction for worker LLM workflows:
  3. If the section exceeds available context, narrow further via tree processing.
  4. Always leave room for the model's response.
  5. Test with the actual model to verify the context fits — token counts from the coding agent may differ from the worker LLM's tokenizer.
+
+ ## Extraction has corner cases too
+
+ Extraction is **as important as judgment** for final accuracy. A common observation across projects: more than half of the final errors trace back to extraction problems, not judgment — the extractor returned the wrong value, the wrong unit, or pulled from the wrong section, and the judge faithfully concluded the wrong verdict from the wrong input.
+
+ Treat extraction with the same iteration discipline as judgment:
+
+ - **Reflection / iteration**: after running an extractor on the sample set, look at the cases where it failed. Is the failure a missing pattern (add to the prompt or regex)? A format quirk (unit conversion, locale)? A document-class issue (extractor right for class A but wrong for class B)?
+ - **Corner-case registration**: when an extraction failure can't be fixed without disproportionate cost to the standard extractor, log it as a corner case in `corner-case-management` — same registry shape as a judgment corner case, just resolution typed as `code` / `prompt` / `parser`-class transformation.
+ - **Validate the extractor independently of the judge**: an end-to-end test that fails only on the judgment side may hide a bad extractor whose outputs happen to verdict correctly *most* of the time. Use QC review to spot-check extracted values, not just final verdicts.
+
+ When you're tempted to fix accuracy by tuning the judge's prompt, first check whether the extractor is giving the judge the right input. The cheaper, more durable fix is almost always in the extractor.
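One way to act on "validate the extractor independently of the judge" is a spot-check pass over result records. A hedged Python sketch; the field names are assumptions, since the actual result schema is defined elsewhere in the package:

```python
import random

def spot_check_extraction(results: list[dict], sample_size: int = 20) -> list[dict]:
    """Report extraction mismatches even when the final verdict happened to be right."""
    sampled = random.sample(results, min(sample_size, len(results)))
    mismatches = []
    for r in sampled:
        if r["extracted_value"] != r["expected_value"]:
            mismatches.append({
                "result_id": r["result_id"],
                "extracted": r["extracted_value"],
                "expected": r["expected_value"],
                # a correct verdict on top of a wrong extraction is the case that hides longest
                "verdict_was_correct": r["verdict"] == r["expected_verdict"],
            })
    return mismatches
```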
@@ -8,6 +8,20 @@ description: Design and execute quality control for production verification work

  Quality control is the Observer role. You are watching the worker LLMs perform and deciding whether they are doing it well enough. The goal is not to review every result — that would defeat the purpose of automation. The goal is to review just enough to maintain confidence that the system is working.

+ ## How this skill cooperates with the others
+
+ Quality control is one part of a tightly-cooperating set of skills. Don't replicate content from a sibling skill here — point to it. Skills loaded together in the same phase are already accessible to the conductor; re-injecting their material into this skill just bloats both.
+
+ The relationships:
+
+ - `confidence-system` defines how confidence is composed and calibrated. When QC uses confidence to triage which results need more review, it consumes confidence — but the design of confidence belongs there.
+ - `evolution-loop` is the closed-loop machinery for turning QC findings into improvements. QC produces signals (failures, drift, recurring patterns); evolution-loop decides how to act on them.
+ - `corner-case-management` is where exceptions discovered by QC live. QC surfaces "this one didn't fit"; corner-case-management decides whether it's a corner case to register, a systemic problem to promote to mainline, or a data-quality issue to escalate.
+ - `cross-document-verification` is its own check class. QC's job is to verify those rules are running as designed, not to re-explain how to build them.
+ - `dashboard-reporting` is where QC results surface to the developer user. QC produces the data; the dashboard renders it.
+
+ Practical implication for authoring: if you find yourself writing in this file something that more naturally belongs to one of the skills above, write a one-sentence pointer here ("see `confidence-system` for how confidence is composed") and leave the depth in the right place. The conductor will have the other skill loaded when it needs the detail.
+
  ## Five-Layer QA Architecture

  Quality control is not one activity — it is five layers that build on each other. Lower layers must pass before higher layers run.