npm - kc-beta - Versions diffs - 0.1.0 - Mend

kc-beta 0.1.0


Files changed (141)

package/template/skills/en/meta-meta/skill-authoring/SKILL.md ADDED

@@ -0,0 +1,108 @@
+---
+name: skill-authoring
+description: Write each verification rule into a Claude Code skill folder following the official skill format. Use when converting extracted rules into skill folders, when iterating on existing rule skills after testing, or when the developer user wants to capture domain knowledge as a skill. Each skill folder must be self-contained with business logic in SKILL.md, code in scripts/, regulation context in references/, and sample data in assets/. Also use the bundled skill-creator for the full eval/iterate workflow.
+---
+# Skill Authoring
+Each verification rule becomes a skill folder. The skill must be self-contained: anyone (or any agent) reading just this folder should have everything needed to verify compliance with that one rule.
+## Skill Folder Structure
+Follow the official Claude Code skill format strictly. See `references/skill-format-spec.md` for the complete specification.
+```
+rule-skills/
+  rule-001-capital-adequacy/
+    SKILL.md            # The verification logic and methodology
+    scripts/
+      check.py          # Deterministic checks (regex, calculations)
+    references/
+      regulation.md     # Original regulation text, verbatim
+      interpretation.md # Expert notes on how to interpret edge cases
+    assets/
+      samples.json      # Annotated sample extractions with expected results
+      corner_cases.json # Known edge cases with their resolutions
+```
+Not every rule needs all of these. A simple threshold check might only need SKILL.md and a script. A complex semantic rule might need detailed references and many samples. Start minimal, add as needed during testing.
+## Writing SKILL.md
+### Frontmatter
+```yaml
+---
+name: rule-001-capital-adequacy
+description: Verify that the capital adequacy ratio reported in the document meets the regulatory minimum of 8%. Use when checking capital adequacy compliance in bank financial reports. Check the capital adequacy section or table for the reported ratio and compare against the threshold.
+---
+```
+- **name**: Must match the directory name exactly. Use lowercase, hyphens, no spaces. Prefix with the rule ID from your catalog.
+- **description**: Write it as if explaining to another coding agent when they should use this skill. Be specific about what the rule checks, where to look in the document, and what constitutes pass/fail. Be pushy — include trigger keywords.
+### Body Content
+The body should cover:
+1. **What this rule checks** — one paragraph explaining the rule in plain language. Include the regulatory source and intent.
+2. **Where to look** — which section, chapter, table, or part of the document contains the relevant information. Be specific. "The capital adequacy ratio is typically found in Chapter 2, Section 'Key Regulatory Metrics' or in the summary table on page 1."
+3. **What to extract** — the specific entities needed. "Extract the reported capital adequacy ratio as a percentage." Define the expected format and any normalization needed.
+4. **How to judge** — the logic for pass/fail. "The ratio must be >= 8.0%. If the ratio is missing, flag as MISSING rather than FAIL." For semantic judgments, describe the criteria in natural language.
+5. **Edge cases** — known tricky situations. "Some reports express the ratio as a decimal (0.12) rather than a percentage (12%). Normalize before comparing."
+6. **Comment format** — what to say when the rule fails. Keep it concise and actionable. "Capital adequacy ratio is X%, which is below the regulatory minimum of 8%."
+### Length and Style
+- Keep SKILL.md under 500 lines. Most rules should be 100-200 lines.
+- Explain the WHY behind the rule, not just the mechanics. Understanding intent helps handle edge cases.
+- Write in imperative form: "Extract the ratio" not "The ratio should be extracted."
+- If detailed regulation text is long, put it in `references/regulation.md` and reference it from SKILL.md.
+## Writing Scripts
+Scripts in `scripts/` handle deterministic operations:
+- **Regex patterns** for entity extraction (dates, amounts, ratios, identifiers).
+- **Calculation logic** for threshold checks, ratio computations, cross-field validation.
+- **Format normalization** (Chinese numerals → digits, date format standardization, unit conversion).
+Scripts should be self-contained Python files that can be imported or executed. Include clear input/output documentation in the script's docstring.
+Do not put LLM prompts in scripts. LLM interactions belong in the SKILL.md body or in the workflow (later phase).
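As an illustration of the points above, here is a minimal sketch of what `scripts/check.py` might look like for the capital adequacy example. The regex, the function names, the return shape, and the heuristic that a bare value of 1 or less is a decimal fraction are all assumptions made for this sketch, not part of the package.

```python
"""Deterministic check for the capital adequacy example (illustrative sketch).

Input:  document text (str).
Output: dict with "result" ("pass" | "fail" | "missing"), "value", "comment".
"""
import re

THRESHOLD_PCT = 8.0  # regulatory minimum used in the example rule

# Assumed pattern: the phrase, a short gap, then a number with an optional "%".
RATIO_RE = re.compile(
    r"capital adequacy ratio\D{0,40}?(\d+(?:\.\d+)?)\s*(%?)", re.IGNORECASE
)


def normalize(number: str, percent_sign: str) -> float:
    """Return the ratio as a percentage; treat bare values <= 1 as decimals."""
    value = float(number)
    if not percent_sign and value <= 1.0:
        value *= 100.0
    return value


def check(text: str) -> dict:
    """Extract the ratio, normalize it, and judge it against the threshold."""
    match = RATIO_RE.search(text)
    if match is None:
        # A missing ratio is reported as "missing", not as a failure.
        return {"result": "missing", "value": None, "comment": ""}
    value = normalize(match.group(1), match.group(2))
    if value >= THRESHOLD_PCT:
        return {"result": "pass", "value": value, "comment": ""}
    comment = (
        f"Capital adequacy ratio is {value:g}%, "
        f"which is below the regulatory minimum of {THRESHOLD_PCT:g}%."
    )
    return {"result": "fail", "value": value, "comment": comment}
```

Because the module has no side effects at import time, it can be imported by a test harness or wrapped in a small command-line entry point as needed.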
+## Writing References
+`references/` holds content that the coding agent reads on demand:
+- **regulation.md**: The original regulation text, verbatim. Include the source, date, and version. This is the ground truth that the rule is derived from.
+- **interpretation.md**: Expert notes from the developer user or from the coding agent's own analysis. "When the regulation says 'adequate disclosure', in practice this means the section must be at least 2 paragraphs and cover risks A, B, and C."
+Keep references factual and sourced. They are evidence, not instructions.
+## Writing Assets
+`assets/` holds data that supports testing and edge case handling:
+- **samples.json**: Annotated examples. Each entry: the input (extracted text or entity), the expected result (pass/fail/missing), and the expected comment. Build this incrementally as you test.
+- **corner_cases.json**: Edge cases that the standard logic does not handle. Each entry: description, detection pattern, resolution, and confidence threshold. See the `corner-case-management` skill for the methodology.
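The entry layout for `samples.json` described above could be built like this. The field names `input`, `expected_result`, and `expected_comment` are hypothetical; the source only says each entry holds an input, an expected result, and an expected comment.

```python
import json

ALLOWED_RESULTS = {"pass", "fail", "missing"}


def make_sample(input_text: str, expected_result: str, expected_comment: str = "") -> dict:
    """Build one annotated sample entry; reject unknown expected results."""
    if expected_result not in ALLOWED_RESULTS:
        raise ValueError(f"unknown expected_result: {expected_result!r}")
    return {
        "input": input_text,
        "expected_result": expected_result,
        "expected_comment": expected_comment,
    }


samples = [
    make_sample("Capital adequacy ratio: 12.4%", "pass"),
    make_sample(
        "Capital adequacy ratio: 7.5%",
        "fail",
        "Capital adequacy ratio is 7.5%, which is below the regulatory minimum of 8%.",
    ),
]

# Serialized form, ready to be written to assets/samples.json.
samples_json = json.dumps(samples, ensure_ascii=False, indent=2)
```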
+## Iteration
+Skills evolve through testing. After each test iteration:
+1. Update SKILL.md if the logic needs adjustment.
+2. Add failing cases to `assets/samples.json`.
+3. Add newly discovered edge cases to `assets/corner_cases.json`.
+4. Update `references/interpretation.md` with new insights.
+5. Log what changed and why.
+Use the bundled `skill-creator` skill if you want to run the full eval/iterate workflow with quantitative benchmarks.
+## Bilingual Skills
+Write skills in the language matching the LANGUAGE setting in `.env`. If rules and documents are in Chinese, write the SKILL.md body in Chinese using proper financial/regulatory terminology. The frontmatter (name, description) stays in English for system compatibility.

package/template/skills/en/meta-meta/skill-authoring/references/skill-format-spec.md ADDED

@@ -0,0 +1,78 @@
+# Claude Code Skill Format Specification
+Distilled from the official Anthropic skill-creator. This is the authoritative reference for writing correctly formatted skill folders.
+## Skill Folder Structure
+```
+skill-name/
+├── SKILL.md          (required)  Metadata + instructions
+├── scripts/          (optional)  Executable code
+├── references/       (optional)  Detailed documentation, loaded on demand
+└── assets/           (optional)  Templates, data files, images
+```
+The directory name must match the `name` field in SKILL.md frontmatter exactly.
+## SKILL.md Format
+### Frontmatter (YAML)
+```yaml
+---
+name: skill-identifier
+description: What this skill does and when to use it.
+---
+```
+**Required fields:**
+| Field | Constraints |
+|-------|-------------|
+| `name` | Max 64 chars. Lowercase letters, numbers, hyphens only. No leading/trailing/consecutive hyphens. Must match parent directory name. |
+| `description` | Max 1024 chars. Non-empty. Describe what it does AND when to use it. |
+**Optional fields:** `license`, `compatibility`, `metadata`
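The constraints in the table can be checked mechanically. The following is a sketch of such a check, not an official validator; the function name and the error messages are made up for illustration.

```python
import re

# Groups of lowercase letters/digits separated by single hyphens:
# this rules out leading, trailing, and consecutive hyphens.
NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def validate_frontmatter(name: str, description: str, dir_name: str) -> list:
    """Return a list of constraint violations; an empty list means valid."""
    problems = []
    if len(name) > 64:
        problems.append("name exceeds 64 characters")
    if not NAME_RE.fullmatch(name):
        problems.append("name must use lowercase letters, numbers, and single hyphens")
    if name != dir_name:
        problems.append("name does not match the parent directory name")
    if not description.strip():
        problems.append("description is empty")
    if len(description) > 1024:
        problems.append("description exceeds 1024 characters")
    return problems
```

For example, `validate_frontmatter("rule-001-capital-adequacy", "Verify ...", "rule-001-capital-adequacy")` returns an empty list, while a name such as `Rule--001` is rejected by the pattern.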
+### Description Best Practices
+The description is the primary triggering mechanism — Claude uses it to decide when to invoke the skill. Make descriptions "pushy" to combat under-triggering:
+- Include both capability AND trigger contexts.
+- Use specific keywords the user might mention.
+- List concrete use cases.
+- State what NOT to use it for.
+- Aim for 100-200 words, but stay within the 1024-character limit on `description`; trim toward the lower end if needed.
+### Markdown Body
+Following the frontmatter, write instructions in Markdown. Guidelines:
+- **Under 500 lines.** If approaching this, move detail to references/.
+- **Imperative form.** "Extract the value" not "The value should be extracted."
+- **Explain the why** behind instructions, not just the what.
+- **Include examples** when they clarify the expected behavior. One or two well-chosen examples beat ten mediocre ones.
+## Progressive Disclosure
+Skills use three-level loading:
+1. **Metadata** (name + description): Always in context. ~100 tokens.
+2. **SKILL.md body**: Loaded when skill triggers. <500 lines ideal.
+3. **Bundled resources**: Loaded on demand. Unlimited size. Scripts can execute without loading.
+Reference files clearly from SKILL.md with guidance on when to read them. For large reference files (>300 lines), include a table of contents.
+## File Referencing
+Use relative paths from the skill root:
+```markdown
+See [the reference guide](references/regulation.md) for the full regulation text.
+Run the check script: `python scripts/check.py`
+```
+## Naming Conventions
+- Directory names: lowercase with hyphens (`my-skill`, not `MySkill` or `my_skill`)
+- Keep names short and descriptive
+- For rule skills, prefix with the rule ID: `rule-001-capital-adequacy`

package/template/skills/en/meta-meta/skill-to-workflow/SKILL.md ADDED

@@ -0,0 +1,150 @@
+---
+name: skill-to-workflow
+description: Distill a proven verification skill into a Python workflow with worker LLM prompts. Use when a rule skill has been tested and reaches the SKILL_ACCURACY threshold defined in .env. Covers the decision of what to implement as code vs LLM calls, prompt engineering for small context windows, model tier selection and progressive downgrade, and testing workflows against the coding agent's own results as ground truth. Also use when optimizing existing workflows for cost or speed.
+---
+# Skill to Workflow
+The skill is the ground truth. The workflow is a cheaper, faster approximation. Your job is to make the approximation as good as the original while being as cheap as possible.
+## When to Start
+A skill is ready for workflow distillation when:
+- It has been tested on all documents in Samples/.
+- Its accuracy meets or exceeds the SKILL_ACCURACY threshold in `.env`.
+- Edge cases are documented in the skill's `assets/corner_cases.json`.
+- You understand the rule well enough to explain exactly how you verify it.
+If any of these are not true, go back and iterate on the skill first.
+## The Distillation Decision
+For each step in your skill-based verification process, ask:
+### Can this be done with regex or Python? (Cost: zero)
+- Date extraction with known formats → regex
+- Numeric comparison against threshold → Python arithmetic
+- Chinese numeral conversion → Python lookup table
+- Format validation (ID numbers, codes) → regex
+- Table cell extraction from structured markdown → string manipulation
+If yes, write it as code. These are free, fast, and deterministic.
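As one example from the list above, Chinese numeral conversion with a lookup table might look like the sketch below. It is a simplified illustration that assumes plain integers written with 零 to 九, 十, 百, 千, 万, and 亿; it does not handle compound big units such as 万亿, financial-form numerals (壹, 贰, ...), or decimals.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10_000, "亿": 100_000_000}


def chinese_to_int(text: str) -> int:
    """Convert a simple Chinese numeral string to an integer."""
    total = 0    # sections already closed by a big unit
    section = 0  # value accumulated since the last big unit
    digit = 0    # most recent digit still waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as the leading 十 in 十五 means "one ten".
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in BIG_UNITS:
            total += (section + digit) * BIG_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit
```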
+### Does this require language understanding? (Cost: worker LLM call)
+- Finding the relevant section in a document → LLM
+- Extracting an entity described in natural language → LLM
+- Judging semantic adequacy ("adequate risk disclosure") → LLM
+- Resolving ambiguous references → LLM
+If yes, design a worker LLM prompt. Use the smallest model tier that maintains accuracy.
+### The hybrid approach (most common)
+Most rules are a mix: regex extracts the number, Python compares it to the threshold, LLM handles the exceptional cases. Design the workflow as a pipeline where cheap steps run first and expensive steps run only when needed.
+## Workflow Structure
+A workflow is a Python file (or small set of files) in `workflows/`:
+```
+workflows/
+  rule_001_capital_adequacy/
+    workflow_v1.py        # The main workflow script
+    prompts/
+      extract.txt         # Worker LLM prompt for extraction
+      judge.txt           # Worker LLM prompt for judgment (if needed)
+    config.json           # Model assignments, thresholds
+```
+The workflow file should have a clear entry point:
+```python
+def verify(document_text: str, config: dict) -> dict:
+    """
+    Returns:
+        {
+            "rule_id": "R001",
+            "result": "pass" | "fail" | "missing" | "error",
+            "extracted_value": ...,
+            "confidence": 0.0-1.0,
+            "comment": "..." (only when fail),
+            "model_used": "...",
+            "llm_calls": int,
+            "llm_tokens": int
+        }
+    """
+```
+This is a reference, not a rigid contract. Adapt the structure to the specific rule. The important thing is that every workflow produces a result that can be compared against the skill-based ground truth.
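To make the hybrid approach concrete, here is a minimal sketch of a `verify` implementation in which a regex runs first and a worker LLM is consulted only when the regex finds nothing. The extra `llm_extract` parameter, the config keys (`threshold_pct`, `extract_model`, `llm_confidence`), and the 0.8 default confidence are assumptions for illustration; the LLM call itself is injected as a callable so the sketch stays runnable without network access.

```python
import re
from typing import Callable, Optional, Tuple

RATIO_RE = re.compile(
    r"capital adequacy ratio\D{0,40}?(\d+(?:\.\d+)?)\s*%", re.IGNORECASE
)

# Hypothetical worker LLM wrapper: takes text, returns (value or None, tokens used).
LlmExtract = Callable[[str], Tuple[Optional[float], int]]


def verify(document_text: str, config: dict,
           llm_extract: Optional[LlmExtract] = None) -> dict:
    """Cheap regex step first; call the worker LLM only if the regex finds nothing."""
    result = {"rule_id": "R001", "result": "missing", "extracted_value": None,
              "confidence": 0.0, "model_used": None, "llm_calls": 0, "llm_tokens": 0}
    match = RATIO_RE.search(document_text)
    if match:
        value, confidence = float(match.group(1)), 1.0
    elif llm_extract is not None:
        value, tokens = llm_extract(document_text)
        result.update(llm_calls=1, llm_tokens=tokens,
                      model_used=config.get("extract_model"))
        confidence = config.get("llm_confidence", 0.8)  # placeholder value
    else:
        value, confidence = None, 0.0
    if value is None:
        return result  # stays "missing"
    result["extracted_value"] = value
    result["confidence"] = confidence
    threshold = config["threshold_pct"]
    if value >= threshold:
        result["result"] = "pass"
    else:
        result["result"] = "fail"
        result["comment"] = (f"Capital adequacy ratio is {value:g}%, which is "
                             f"below the regulatory minimum of {threshold:g}%.")
    return result
```

Injecting the LLM step as a parameter also makes it easy to swap in a stub when comparing workflow output against the skill-based ground truth.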
+## Prompt Engineering for Worker LLMs
+Worker LLMs have smaller context windows (16K-32K tokens for the lower tiers; see `references/worker-llm-catalog.md` for each model). Design prompts that:
+1. **Are self-contained.** Include everything the model needs in the prompt. Do not assume the model has context from previous calls.
+2. **Specify the output format.** "Return a JSON object with fields: value, confidence, reasoning." Structured output reduces parsing errors.
+3. **Include the narrowed context.** Do not send the entire document. Use the tree-processing pipeline (full document → relevant chapter → relevant section) to narrow the context before calling the worker LLM.
+4. **Are written in the document's language.** Chinese documents get Chinese prompts. English documents get English prompts. Do not mix languages in a single prompt.
+5. **Provide examples sparingly.** One or two examples help. Ten examples waste context window and risk overfitting.
+## Model Tier Selection
+Start with the highest tier (TIER1) for each step. Measure accuracy. Then try lower tiers:
+1. Run the workflow with TIER1 on all Samples/. Record accuracy per step.
+2. For each step, try TIER2. If accuracy stays above WORKFLOW_ACCURACY, keep TIER2.
+3. Continue downgrading per step until accuracy drops below threshold.
+4. Record the optimal tier per step in `config.json`.
+Different steps within the same workflow can use different model tiers. For example, semantic judgment might need TIER2 while simple extraction works fine with TIER3.
+### Formal Downgrade Protocol
+The basic approach above works, but a more rigorous protocol prevents premature tier commitments:
+**Direction**: Start top-down (TIER1 → TIER4) to establish the accuracy ceiling first. You need to know the best possible accuracy before trading it for cost savings.
+**Minimum test runs**: Run each candidate tier on a meaningful number of documents (e.g., min(10, total_samples)) before making a tier decision. Small samples are unreliable; a 3-document test could be misleading.
+**Accuracy delta trigger**: If a lower tier's accuracy is significantly below the higher tier's (e.g., by more than 5 percentage points), stay at the higher tier for that step. If the delta is within tolerance, use the cheaper tier.
+**Per-step independence**: Each workflow step is assessed separately. Record the optimal tier per step in `config.json`. Do not assume the whole workflow must use one tier.
+**Re-assessment trigger**: If production quality control shows a step's accuracy degrading (e.g., due to new document formats), re-run the tier assessment for that step.
+**Model-task recommendation list**: Maintain a per-project mapping of (task_type → recommended_tier) based on your testing experience. Over time, these lists can be collected across projects to build generalized tier recommendations.
+All numbers here (10 documents, 5 percentage points, etc.) are recommended starting points. The coding agent and developer user should calibrate these — or replace them entirely with a different assessment approach — based on their specific volume, accuracy requirements, and cost constraints. The pattern matters: **test at each tier → compare accuracy → commit when within tolerance → re-assess on degradation**.
+This follows the same tier-transition framework as parser escalation in `document-parsing`: a quality/accuracy score drives the decision to stay, escalate, or skip.
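The downgrade protocol can be sketched as a small selection function applied per step. This is one possible reading, in which each lower tier is compared against the TIER1 ceiling; the function name, the `floor` parameter (standing in for WORKFLOW_ACCURACY), and the accuracy figures in the usage example are assumptions.

```python
TIER_ORDER = ["TIER1", "TIER2", "TIER3", "TIER4"]  # most to least capable


def select_tier(accuracy_by_tier: dict, floor: float, max_delta: float = 0.05) -> str:
    """Walk down from TIER1 and stop before the first tier that breaks tolerance.

    accuracy_by_tier maps tier names to measured accuracy (0.0-1.0) for one step.
    A tier is rejected if it is untested, below the floor, or more than
    max_delta below the TIER1 ceiling.
    """
    ceiling = accuracy_by_tier["TIER1"]
    chosen = "TIER1"
    for tier in TIER_ORDER[1:]:
        accuracy = accuracy_by_tier.get(tier)
        if accuracy is None or accuracy < floor or ceiling - accuracy > max_delta:
            break
        chosen = tier
    return chosen


# Per-step independence: each step gets its own decision.
measured = {
    "extract": {"TIER1": 0.97, "TIER2": 0.96, "TIER3": 0.94, "TIER4": 0.85},
    "judge": {"TIER1": 0.99, "TIER2": 0.92},
}
tier_config = {step: select_tier(acc, floor=0.90) for step, acc in measured.items()}
```

The resulting mapping is the kind of per-step record that would go into `config.json`.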
+## Testing Against Ground Truth
+The coding agent's skill-based results are the ground truth. For each document in Samples/:
+1. Run the workflow.
+2. Compare the workflow's result against the skill-based result.
+3. Log discrepancies: which step failed, what was expected vs actual.
+4. Compute accuracy: `(matching results) / (total documents)`.
+5. If accuracy < WORKFLOW_ACCURACY, diagnose and fix. Use `evolution-loop` methodology.
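Steps 2 to 4 above amount to a dictionary comparison. A minimal sketch, assuming both result sets are keyed by document ID and hold the final result string (the function name and the choice to count a missing workflow result as `"error"` are illustrative):

```python
def compare_to_ground_truth(workflow_results: dict, skill_results: dict):
    """Return (accuracy, discrepancies) over the documents in skill_results."""
    discrepancies = []
    for doc_id, expected in skill_results.items():
        # A document the workflow never produced a result for counts as an error.
        actual = workflow_results.get(doc_id, "error")
        if actual != expected:
            discrepancies.append(
                {"document": doc_id, "expected": expected, "actual": actual}
            )
    total = len(skill_results)
    accuracy = (total - len(discrepancies)) / total if total else 0.0
    return accuracy, discrepancies
```

The returned accuracy is `(matching results) / (total documents)`, and the discrepancy list is the starting point for diagnosis.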
+## Versioning
+Each iteration of a workflow is a new version file: `workflow_v1.py`, `workflow_v2.py`, etc. Track which version is active in `config.json`. See `version-control` skill for the full methodology.
+## Cost Tracking
+Track the cost of each workflow run:
+- Number of LLM calls per document.
+- Total tokens consumed per document.
+- Model tier used per call.
+This data helps the developer user understand the production cost and informs further optimization.
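The three quantities listed above can be collected with a small accumulator per document run. The class and method names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CostTracker:
    """Accumulates LLM usage for one document run."""
    llm_calls: int = 0
    llm_tokens: int = 0
    calls_by_tier: Counter = field(default_factory=Counter)

    def record(self, tier: str, tokens: int) -> None:
        """Register one worker LLM call and the tokens it consumed."""
        self.llm_calls += 1
        self.llm_tokens += tokens
        self.calls_by_tier[tier] += 1


tracker = CostTracker()
tracker.record("TIER3", 850)
tracker.record("TIER2", 1200)
tracker.record("TIER3", 400)
```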
+## Worker LLM API
+Worker LLMs are accessed via SiliconFlow API. Connection details are in `.env`:
+- `SILICONFLOW_API_KEY` for authentication
+- `SILICONFLOW_BASE_URL` for the API endpoint
+- Model names in `TIER1` through `TIER4`
+See `references/worker-llm-catalog.md` for current model capabilities and context window sizes.

package/template/skills/en/meta-meta/skill-to-workflow/references/worker-llm-catalog.md ADDED

@@ -0,0 +1,50 @@
+# Worker LLM Catalog
+Models available via SiliconFlow API for worker LLM tasks. Update this catalog as models change.
+## Text Models
+| Tier | Model | Context Window | Strengths | Notes |
+|------|-------|---------------|-----------|-------|
+| TIER1 | Pro/zai-org/GLM-5 | 128K | Strong reasoning, Chinese language | Top tier for complex judgment |
+| TIER1 | Pro/moonshotai/Kimi-K2.5 | 128K | Long context, strong extraction | Good for full-document processing |
+| TIER2 | Pro/deepseek-ai/DeepSeek-V3.2 | 64K | Balanced capability/cost | Good general purpose |
+| TIER2 | Pro/MiniMaxAI/MiniMax-M2.5 | 64K | Strong Chinese, fast | Good for Chinese documents |
+| TIER2 | Qwen/Qwen3.5-397B-A17B | 32K | Large MoE, strong reasoning | Cost-effective for complex tasks |
+| TIER3 | Qwen/Qwen3.5-122B-A10B | 32K | Good accuracy, lower cost | Sweet spot for many tasks |
+| TIER4 | Qwen/Qwen3.5-35B-A3B | 16K | Fast, cheap | Best for simple extraction |
+## Vision/OCR Models
+| Tier | Model | Strengths | Notes |
+|------|-------|-----------|-------|
+| OCR_TIER1 | zai-org/GLM-4.6V | Best OCR accuracy | Use for complex tables/charts |
+| OCR_TIER2 | Qwen/Qwen3.5-397B-A17B | Good general vision | Multimodal version |
+| OCR_TIER3 | PaddlePaddle/PaddleOCR-VL-1.5 | Fast, lightweight OCR | Best for standard text |
+## Selection Guidelines
+- Start with the highest tier that fits your context window needs.
+- For extraction of simple entities (dates, amounts, names): TIER3-4 often sufficient.
+- For semantic judgment (adequacy, compliance): TIER1-2 usually needed.
+- For Chinese financial documents: prefer GLM and Qwen models over DeepSeek for domain terminology.
+- Context window constraint: if the section to process exceeds the model's window, either narrow the context further (tree processing) or use a model with a larger window.
+## API Configuration
+```python
+import os
+import openai
+# Credentials and endpoint come from the environment (see .env)
+client = openai.OpenAI(
+    api_key=os.getenv("SILICONFLOW_API_KEY"),
+    base_url=os.getenv("SILICONFLOW_BASE_URL"),
+)
+prompt = "..."  # Placeholder: the self-contained worker prompt for this step
+response = client.chat.completions.create(
+    model="Qwen/Qwen3.5-122B-A10B",  # Use the model name from the table
+    messages=[{"role": "user", "content": prompt}],
+    temperature=0.1,  # Low temperature for more consistent extraction
+)
+```
+This catalog should be maintained by the coding agent. Add new models as they become available, remove deprecated models, and update capability assessments based on testing experience.

package/template/skills/en/meta-meta/task-decomposition/SKILL.md ADDED

@@ -0,0 +1,129 @@
+---
+name: task-decomposition
+description: Decompose each verification rule into independent sub-tasks and assign the optimal method (rule, code, LLM, manual) to each. Use when converting extracted rules into implementation plans, when a rule skill is too expensive or inaccurate and needs restructuring, or when designing a multi-step verification pipeline. Covers MECE decomposition, method selection via the four-dimension decision matrix, cost-benefit analysis, and source tagging. Also use when auditing an existing workflow for cost optimization opportunities.
+---
+# Task Decomposition
+Every verification rule is composite. Even the simplest-sounding rule — "check that the invoice date is within the contract period" — decomposes into a chain of distinct operations: locate the date field, extract its value, normalize the format, compare against the contract dates, and generate a comment if it fails.
+The temptation is to throw the entire chain at an LLM. It works. It is also 100x more expensive than necessary and impossible to debug when it breaks.
+The Lancet Method is scalpel-precision decomposition. Cut each rule into the smallest sub-tasks that are methodologically homogeneous — meaning each sub-task can be solved entirely by one method. Then assign the cheapest method that works for each sub-task. The name is deliberate: a lancet, not a cleaver. Precision matters because the cut points determine everything downstream — cost, debuggability, testability, and the eventual workflow architecture.
+## MECE Decomposition
+Decompose every rule into sub-tasks that are mutually exclusive and collectively exhaustive (MECE):
+- **Mutually exclusive**: no two sub-tasks do the same work. If sub-task A extracts the invoice date, sub-task B does not also extract the invoice date.
+- **Collectively exhaustive**: the sub-tasks together cover the entire rule. Nothing falls through the cracks. If you execute all sub-tasks in sequence, the rule is fully verified.
+Each sub-task has exactly one input and one output (the exception is a converging judge step, described below, which takes one input per branch). The output of one sub-task becomes the input of the next. This creates a pipeline with clean interfaces between stages.
+Stop decomposing when a sub-task is **methodologically homogeneous** — it can be handled entirely by one method (regex, Python code, LLM call, or manual review). If a sub-task still requires two different methods, it is not yet atomic. Keep cutting.
+A practical test: describe the sub-task in one sentence. If you need "and" or "then" in the sentence, it probably needs further decomposition. "Extract the date and compare it to the threshold" is two sub-tasks. "Extract the date" is one.
+The canonical decomposition chain for most document verification rules is:
+```
+locate → extract → normalize → judge → comment
+```
+Not every rule has all five stages. Some rules skip normalization. Some rules do not need a comment on pass. But this chain is a reliable starting framework.
+For cross-document rules (e.g., "invoice amount matches contract amount"), the chain branches: two parallel locate-extract-normalize pipelines converge at a single judge step. Draw this out before implementing. The pipeline topology — linear, branching, or converging — determines how you structure the skill folder and later the workflow.
+Three common topologies:
+- **Linear**: Single document, single field. `locate → extract → normalize → judge → comment`. Most threshold checks follow this pattern.
+- **Converging**: Two fields from different documents or different sections. Two parallel locate-extract chains merge at the judge step. Cross-field validations and cross-document matching follow this pattern.
+- **Fan-out**: One rule applied to many items within a document (e.g., validating every line item in an invoice). The locate step produces N items, each of which flows through the remaining chain independently. Scale is the critical dimension here — if N is large, the method assignments must account for per-item cost.
+## The Decision Matrix
+After decomposition, assign a method to each sub-task. Do not guess. Use a structured evaluation based on four dimensions. See `references/decision-matrix.md` for the complete matrix with worked examples and a cost estimation template.
+**Certainty** — How predictable is the input format? If the date is always in `YYYY-MM-DD` at a known position, certainty is high. If the date appears in free-form prose with varying formats, certainty is low.
+**Scale** — How many items must be processed? One field per document is low scale. A thousand line items per invoice is high scale.
+**Semantic depth** — How much language understanding is required? Comparing two numbers requires none. Judging whether a risk disclosure is "adequate" requires deep understanding.
+**Cost sensitivity** — What is the budget per document? A bank processing 10,000 loan files per month has different economics than a one-time audit of 50 contracts.
+These four dimensions map to a method hierarchy. Always prefer the cheapest method that achieves the required accuracy:
+1. **Rule / Regex** — Zero cost, instant, deterministic. Use when certainty is high and semantic depth is zero.
+2. **Code / Python** — Zero cost, instant, deterministic. Use for calculations, transformations, and structured comparisons.
+3. **LLM** — Variable cost, latency, probabilistic. Use when semantic understanding is required and cheaper methods fail.
+4. **Manual** — Highest cost, highest latency, highest accuracy. Reserve for edge cases that defeat all automated methods.
+Do not skip levels. Try regex before code. Try code before LLM. Try LLM before manual. Each escalation must be justified by a failure at the lower level.
+When scoring a sub-task on these dimensions, be honest about uncertainty. If you are unsure whether a regex can handle the input variability, score certainty conservatively and test the regex on samples before committing. A wrong method assignment wastes more time than a conservative initial assignment that gets optimized later.
+Note that dimensions interact. High scale combined with high cost sensitivity pushes hard toward code-based solutions even when moderate semantic depth would normally suggest LLM. Conversely, low scale relaxes cost pressure, making LLM viable even for tasks that could theoretically be solved with complex regex. Let the combination of dimensions guide you, not any single dimension alone.
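The "cheapest method that achieves the required accuracy" rule can be sketched as a walk up the method hierarchy, testing each candidate on samples before escalating. This is an illustration of the escalation idea only, not the four-dimension matrix in `references/decision-matrix.md`; the function name and data shapes are assumptions.

```python
from typing import Callable, Dict, Sequence, Tuple

METHOD_ORDER = ["regex", "code", "llm", "manual"]  # cheapest first


def assign_method(candidates: Dict[str, Callable],
                  samples: Sequence[Tuple[object, object]],
                  required_accuracy: float) -> str:
    """Return the cheapest method whose accuracy on samples meets the requirement.

    candidates maps a method name to a callable(input) -> output.
    samples is a sequence of (input, expected_output) pairs.
    Falls back to "manual" when no automated candidate is good enough.
    """
    for method in METHOD_ORDER:
        solver = candidates.get(method)
        if solver is None:
            continue  # no implementation proposed at this level
        correct = sum(1 for x, expected in samples if solver(x) == expected)
        if samples and correct / len(samples) >= required_accuracy:
            return method
    return "manual"
```

Each escalation is then backed by a measured failure at the lower level rather than by a guess.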
+## Cost-Benefit Awareness
+Method assignment is not an academic exercise. It directly determines the cost per document in production. Every LLM call that could have been a regex is money burned. Every regex that should have been an LLM call is accuracy lost.
+Consider a real scenario: matching invoices against contracts in a large enterprise. There are 31,800 invoices and 15,940 contracts. The naive approach — send every possible pair to an LLM for comparison — means 507 million pairs. At any non-trivial cost per call, this is economically absurd.
+The Lancet Method decomposes this into layers:
+1. **Rule layer**: Match on exact supplier name and contract number. Cost: near zero. Eliminates 99.5% of pairs.
+2. **Code layer**: Fuzzy match on amount ranges and date overlap. Cost: near zero. Reduces to 12,400 candidate pairs.
+3. **LLM layer**: Semantic comparison of line-item descriptions against contract scope. Cost: moderate. Reduces to 7,652 confirmed matches.
+4. **Manual layer**: Human review of ~200 low-confidence matches where the LLM was uncertain. Cost: labor hours. Resolves the final ambiguous cases.
+The result: 200x lower cost than the naive approach. Same accuracy. Better debuggability because each layer's output is independently verifiable. And each layer can be tested, monitored, and optimized in isolation.
+The principle: **filter cheap before reasoning expensive**. Always calculate the cost per document for each sub-task. If the LLM cost for one sub-task dominates the total, that sub-task is the optimization target.
+Use the cost estimation template in `references/decision-matrix.md` to plan costs at decomposition time. Do not wait until production to discover that a workflow is too expensive. The developer user has a budget. Respect it by designing within it.
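The pair counts in the invoice-matching scenario can be reproduced with a few lines of arithmetic. The figures are taken from the scenario above; the last line counts LLM calls only and is not an estimate of total cost.

```python
invoices = 31_800
contracts = 15_940

# Naive approach: every invoice compared against every contract.
naive_pairs = invoices * contracts

# Rule layer eliminates 99.5% of pairs, so 0.5% (5 per 1000) remain.
after_rule_layer = naive_pairs * 5 // 1000

# Code layer leaves this many candidate pairs for the LLM layer.
llm_candidates = 12_400

print(f"naive pairs:      {naive_pairs:,}")
print(f"after rule layer: {after_rule_layer:,}")
print(f"LLM calls avoided per LLM call made: {naive_pairs // llm_candidates:,}")
```

The naive count comes to 506,892,000, which is the "507 million pairs" quoted in the scenario.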
+## Source Tagging
+Every output from every sub-task must carry an `extraction_method` tag. This is not optional metadata — it is load-bearing infrastructure. Without it, the system degrades into an opaque pipeline that nobody can diagnose, cost-optimize, or trust.
+Tags enable three capabilities that you cannot afford to lose:
+1. **Debugging**: When a verification result is wrong, the tag tells you which sub-task produced the error and which method was responsible. Without tags, you are debugging a black box. With tags, you can immediately narrow the investigation to one sub-task and one method.
+2. **Cost attribution**: Tags let you calculate the actual cost contribution of each method per rule and per document. This drives optimization decisions — you can identify which LLM calls are consuming the most budget and target them for replacement with cheaper methods.
+3. **Confidence calibration**: Different methods have different reliability profiles. A regex extraction is either right or wrong — binary confidence. An LLM extraction has a confidence distribution that varies by model tier and prompt quality. Tags feed directly into the `confidence-system` method prior, enabling calibrated confidence scores that reflect the actual reliability of each extraction source.
+Tag format: a simple string field on every intermediate output. Example values: `regex`, `python_calc`, `llm_tier2`, `manual_review`. Be consistent within a project. Define the tag vocabulary once at project setup and enforce it across all skills and workflows.
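One way to enforce a tag vocabulary is to validate the tag when an intermediate output is created. The sketch below uses the example tag values from the text; the `StepOutput` class and its field names other than `extraction_method` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any

# Defined once at project setup; extend deliberately, not ad hoc.
TAG_VOCABULARY = {"regex", "python_calc", "llm_tier2", "manual_review"}


@dataclass(frozen=True)
class StepOutput:
    """One intermediate output, always carrying its extraction_method tag."""
    step: str
    value: Any
    extraction_method: str

    def __post_init__(self) -> None:
        # Reject tags outside the project vocabulary at construction time.
        if self.extraction_method not in TAG_VOCABULARY:
            raise ValueError(
                f"unknown extraction_method: {self.extraction_method!r}"
            )


ratio = StepOutput(step="extract", value=12.4, extraction_method="regex")
```

Failing fast on an unknown tag keeps cost attribution and confidence calibration from silently losing data.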
+## Anti-Patterns
+Five failure modes recur across projects. Learn to recognize them early.
+**LLM-for-everything.** Sending an entire document to an LLM with "check if this complies with Rule X" works in demos. In production, it costs 100x more than a decomposed pipeline, provides no accuracy gain for deterministic checks, and is impossible to debug because you cannot tell which sub-check failed. The diagnostic signal: if a sub-task's input is fully predictable and requires zero language understanding, it does not belong in an LLM call.
+**Rule over-engineering.** Building a 500-line regex to handle every possible date format when an LLM handles normalization better. If a rule becomes brittle and requires constant maintenance, the sub-task belongs at a higher method level. The diagnostic signal: if the regex needs patching after every new document batch, the sub-task has outgrown regex.
+**Black-box pipeline.** Chaining sub-tasks without intermediate outputs. When the final result is wrong, you cannot tell where the error entered. Every sub-task must produce a logged, inspectable intermediate result. If debugging a rule requires re-running the entire pipeline end-to-end, the pipeline lacks checkpoints.
+**Monolithic end-to-end.** Running every sub-task for every document, even when an early sub-task could short-circuit the pipeline. If the locate step finds that the relevant section does not exist, skip extract, normalize, judge, and comment. Go directly to "field missing." Short-circuit logic saves both cost and time.
+**Premature optimization.** Spending days designing the optimal method assignment before testing anything. Get the decomposition right first. Assign all sub-tasks to LLM. Prove it works end-to-end on Samples/. Then optimize by pushing sub-tasks down to cheaper methods one at a time, verifying accuracy is maintained at each step. Correctness first, cost second. The decomposition itself is the hard part — method assignment can always be revised later.
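Two of the remedies above, inspectable intermediate results and short-circuiting when the locate step finds nothing, can be combined in one small pipeline runner. This is a sketch with hypothetical function names; the locate step here is a plain substring search standing in for whatever locating method the rule actually uses.

```python
from typing import Callable, List, Optional, Tuple


def locate(document: str, heading: str) -> Optional[str]:
    """Return the text after the heading, or None if the heading is absent."""
    index = document.find(heading)
    return None if index == -1 else document[index + len(heading):]


def run_pipeline(document: str, heading: str,
                 later_steps: List[Tuple[str, Callable]]) -> dict:
    """Run locate first; skip every later step when the section is missing."""
    trace = []  # one (step name, output) entry per executed step
    section = locate(document, heading)
    trace.append(("locate", section))
    if section is None:
        # Short-circuit: no extract, normalize, judge, or comment.
        return {"result": "missing", "trace": trace}
    value = section
    for name, step in later_steps:
        value = step(value)
        trace.append((name, value))
    return {"result": value, "trace": trace}


steps = [
    ("extract", lambda s: s.strip().split()[0]),
    ("normalize", lambda s: float(s.rstrip("%"))),
    ("judge", lambda v: "pass" if v >= 8.0 else "fail"),
]
```

With these steps, a document containing the heading runs all four stages, while a document without it stops after `locate` with a single trace entry.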
+## Integration
+Task decomposition sits between rule extraction and skill authoring in the KC Reborn lifecycle. It is the bridge that translates abstract rules into concrete implementation plans.
+**Input**: A rule catalog from `rule-extraction`. Each rule is an atomic, testable verification requirement. If a rule is not yet atomic, send it back to rule extraction for further decomposition before attempting task decomposition.
+**Output**: A per-rule sub-task decomposition — a list of sub-tasks, each with a defined input, output, and assigned method. This decomposition feeds directly into `skill-authoring`, where each rule's sub-tasks become the implementation plan for the skill folder. The decomposition also serves as the testing contract: each sub-task's output is independently testable.
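+One possible in-memory shape for that output, sketched with dataclasses. The class and field names (`SubTask`, `RuleDecomposition`, `input_spec`, `output_spec`) and the rule ID are hypothetical; the source only requires that each sub-task carry a defined input, output, and assigned method.
```python
from dataclasses import dataclass, field
from enum import Enum


class Method(Enum):
    REGEX = "regex"
    CODE = "code"
    LLM = "llm"
    MANUAL = "manual"


@dataclass
class SubTask:
    name: str
    input_spec: str   # what the sub-task consumes
    output_spec: str  # what it produces; this is the testing contract
    method: Method


@dataclass
class RuleDecomposition:
    rule_id: str
    sub_tasks: list[SubTask] = field(default_factory=list)

    def llm_call_count(self) -> int:
        """Count sub-tasks assigned to an LLM (a rough cost indicator)."""
        return sum(1 for t in self.sub_tasks if t.method is Method.LLM)


decomposition = RuleDecomposition(
    rule_id="rule-000-example",  # illustrative ID only
    sub_tasks=[
        SubTask("locate", "full document text", "section reference", Method.LLM),
        SubTask("extract", "located section text", "numeric value", Method.REGEX),
        SubTask("compare", "value, threshold", "pass/fail", Method.CODE),
    ],
)
```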
+Method assignments also inform tier selection in `skill-to-workflow`. When a skill is distilled into a workflow, the method assignments from decomposition become the initial workflow architecture:
+- Regex and code sub-tasks become deterministic code in `scripts/`.
+- LLM sub-tasks become worker LLM prompts in `prompts/`, with model tier selected per the `skill-to-workflow` downgrade protocol.
+- Manual sub-tasks become escalation paths in the `quality-control` layer, triggered by low confidence scores.
+The decomposition is not static. As you test and iterate via `evolution-loop`, you will discover that some method assignments were wrong. A sub-task you thought was deterministic turns out to have edge cases that need LLM handling. A sub-task you assigned to LLM turns out to be solvable with a simple regex. Update the decomposition. Track changes with `version-control`.
+A well-decomposed rule is a well-understood rule. If you struggle to decompose a rule into clean sub-tasks, that usually means you do not yet understand the rule well enough. Go back to the developer user. Ask how they verify this rule manually. Their manual process is often the best decomposition blueprint — it reveals the natural sub-task boundaries that no amount of abstract analysis will surface.

package/template/skills/en/meta-meta/task-decomposition/references/decision-matrix.md ADDED Viewed

@@ -0,0 +1,81 @@
+# Decision Matrix for Method Selection
+This reference provides the detailed decision matrix for assigning methods to sub-tasks during task decomposition. Read `task-decomposition` SKILL.md first for the philosophy; this document is the operational reference.
+## The Four Dimensions
+| Dimension | Definition | 1 (Low) | 3 (Medium) | 5 (High) |
+|---|---|---|---|---|
+| **Certainty** | Predictability of input format and location | Free-form prose, no fixed structure | Semi-structured with known sections but variable formatting | Fixed template, exact field positions |
+| **Scale** | Number of items to process per document | 1-5 items | 10-100 items | 1,000+ items |
+| **Semantic Depth** | Language understanding required | None — pure pattern or numeric | Moderate — entity recognition, simple context | Deep — judgment, adequacy assessment, intent interpretation |
+| **Cost Sensitivity** | Budget constraint per document | Unlimited (one-off audit) | Moderate (monthly batch of hundreds) | Tight (daily batch of thousands) |
+## Method Assignment Rules
+Use the highest-priority method whose requirements are met. Priority order: Rule/Regex > Code > LLM > Manual.
+| Certainty | Scale | Semantic Depth | Cost Sensitivity | Assigned Method | Rationale |
+|---|---|---|---|---|---|
+| High (4-5) | Any | Low (1-2) | Any | **Rule / Regex** | Predictable input + no language understanding = deterministic pattern matching; use for locating and extracting fixed patterns |
+| High (4-5) | Any | Low (1-2) | Any | **Code / Python** | Same profile, but the sub-task is a calculation, comparison, or transformation on structured data rather than a pattern match |
+| Medium (3) | High (4-5) | Low (1-2) | High (4-5) | **Code + Regex** | Volume demands speed; invest in parsing code to avoid per-item LLM cost |
+| Medium (3) | Low (1-2) | Medium (3) | Low (1-2) | **LLM** | Moderate understanding needed, low volume makes LLM cost acceptable |
+| Low (1-2) | Any | High (4-5) | Any | **LLM** | Deep semantic understanding has no cheaper alternative |
+| Low (1-2) | High (4-5) | High (4-5) | High (4-5) | **LLM (low tier) + sampling** | Volume + semantics + budget = use cheapest LLM, sample-verify with higher tier |
+| Any | Any | Any | — | **Manual** | Last resort when automated methods fail accuracy threshold |
+The table covers common patterns, not every combination. When a sub-task falls between categories, test both candidate methods on a sample and measure accuracy and cost. Let data decide.
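+The table can be read as an ordered lookup. The sketch below is one hedged encoding of it, not an official implementation: the function name, the `is_calculation` flag (used to separate the two rows that share the same scores), and the choice to check the more specific "low tier + sampling" row before the general LLM row are all assumptions. It returns `None` for combinations the table does not cover, which is the signal to test candidate methods on a sample. Manual is not derived from scores, since it applies only after automated methods miss the accuracy threshold.
```python
from typing import Optional


def _high(score: int) -> bool:
    return score >= 4


def _low(score: int) -> bool:
    return score <= 2


def assign_method(
    certainty: int,
    scale: int,
    depth: int,
    cost: int,
    is_calculation: bool = False,  # hypothetical flag, not in the table
) -> Optional[str]:
    """Map 1-5 dimension scores to a method, or None if no row applies."""
    if _high(certainty) and _low(depth):
        return "Code / Python" if is_calculation else "Rule / Regex"
    if certainty == 3 and _high(scale) and _low(depth) and _high(cost):
        return "Code + Regex"
    if certainty == 3 and _low(scale) and depth == 3 and _low(cost):
        return "LLM"
    if _low(certainty) and _high(depth):
        if _high(scale) and _high(cost):
            return "LLM (low tier) + sampling"
        return "LLM"
    return None  # between categories: test candidates on a sample
```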
+## Worked Example: Cross-Field Validation
+**Rule**: "The loan amount must not exceed 70% of the appraised collateral value."
+Decomposition into sub-tasks with method assignments:
+| # | Sub-task | Input | Output | Method | Rationale |
+|---|---|---|---|---|---|
+| 1 | Locate loan amount field | Full document text | Page/section reference | LLM (Tier 3) | Field position varies across document types |
+| 2 | Extract loan amount | Located section text | Numeric value (float) | Regex | Amount follows pattern: ¥/$/digits with commas |
+| 3 | Locate collateral section | Full document text | Page/section reference | LLM (Tier 3) | Section name varies: "Collateral", "Security", "Pledged Assets" |
+| 4 | Extract appraised value | Located section text | Numeric value (float) | Regex + Code | Regex extracts; code handles unit conversion (万/亿) |
+| 5 | Calculate threshold | Loan amount, collateral value | 70% threshold value | Code | Pure arithmetic: `collateral * 0.70` |
+| 6 | Compare | Loan amount, threshold | Pass/Fail | Code | Simple comparison: `loan_amount <= threshold` |
+| 7 | Generate comment | All extracted values | Comment string | Code (template) | Template: "Loan amount {X} is {above/within} 70% of collateral value {Y} (threshold: {Z})" |
+LLM calls: 2 (locate steps only). Everything else is regex or code. Total LLM cost per document: ~$0.002 at Tier 3 pricing.
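+The deterministic sub-tasks (2 and 4 through 7) might be sketched as follows, assuming the two LLM locate steps have already returned the relevant section text. The regex, the function names, and the sample strings are illustrative assumptions; the sketch takes the first number it finds in each section, which a production `check.py` would need to tighten.
```python
import re

# Optional currency symbol, digits with optional thousands separators,
# optional decimal part, optional Chinese magnitude unit.
AMOUNT_RE = re.compile(r"[¥$]?\s*(\d+(?:,\d{3})*(?:\.\d+)?)\s*(万|亿)?")
UNIT_MULTIPLIER = {None: 1, "万": 10_000, "亿": 100_000_000}


def extract_amount(section: str) -> float:
    """Sub-tasks 2 and 4: pull the first amount and apply unit conversion."""
    match = AMOUNT_RE.search(section)
    if match is None:
        raise ValueError("no amount found in section")
    number, unit = match.groups()
    return float(number.replace(",", "")) * UNIT_MULTIPLIER[unit]


def check_loan_to_value(loan_section: str, collateral_section: str) -> dict:
    loan = extract_amount(loan_section)
    collateral = extract_amount(collateral_section)
    # Sub-task 5. Floats are used for brevity; decimal.Decimal is safer
    # when a value can land exactly on the threshold.
    threshold = collateral * 0.70
    passed = loan <= threshold  # sub-task 6
    position = "within" if passed else "above"
    comment = (  # sub-task 7, template-based
        f"Loan amount {loan:,.0f} is {position} 70% of collateral value "
        f"{collateral:,.0f} (threshold: {threshold:,.0f})"
    )
    return {"pass": passed, "comment": comment}


ok = check_loan_to_value("Loan amount: ¥3,000,000", "Appraised value: 500万")
bad = check_loan_to_value("Loan amount: ¥4,000,000", "Appraised value: 500万")
```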
+## Worked Example: Large-Scale Filtering
+**Task**: Match 31,800 invoices against 15,940 contracts to find which invoices belong to which contracts.
+Naive approach: 31,800 × 15,940 ≈ 507M pairwise LLM comparisons. Estimated cost: $50,000+. Time: weeks.
+Layered decomposition:
+| Layer | Method | Input Size | Output Size | Reduction | Cost |
+|---|---|---|---|---|---|
+| 1. Exact match on supplier name + contract number | Rule/Regex | 507M pairs | 25,200 matches | 99.5% eliminated | ~$0 |
+| 2. Fuzzy match on amount range (±5%) + date overlap | Code | Remaining unmatched pairs | 12,400 candidates | 97.6% of remainder eliminated | ~$0 |
+| 3. Semantic comparison of line-item descriptions | LLM (Tier 3) | 12,400 candidates | 7,652 confirmed | Final precision filter | ~$25 |
+| 4. Manual review of low-confidence matches | Manual | ~200 uncertain | ~200 resolved | Edge cases | ~$100 (labor) |
+Total cost: ~$125. Time: hours. Same accuracy as the naive approach.
+The key insight: each layer's method is chosen because it is the cheapest method that can reliably make the distinctions required at that stage.
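+A toy sketch of the first three layers, under stated assumptions: the record fields (`supplier`, `contract_no`, `amount`, `date`, `start`, `end`, `description`) are hypothetical, "date overlap" is interpreted as the invoice date falling inside the contract period, and `semantic_match` is a stand-in callable for the LLM comparison. Layer 4 (manual review) is omitted.
```python
from datetime import date
from typing import Callable


def layered_match(
    invoices: list[dict],
    contracts: list[dict],
    semantic_match: Callable[[dict, dict], bool],
) -> tuple[dict, int]:
    """Return (invoice_id -> contract_id, number of layer-3 candidates)."""
    # Layer 1: exact match through a hash index, no pairwise comparison.
    index = {(c["supplier"], c["contract_no"]): c["id"] for c in contracts}
    matched, unmatched = {}, []
    for inv in invoices:
        key = (inv["supplier"], inv["contract_no"])
        if key in index:
            matched[inv["id"]] = index[key]
        else:
            unmatched.append(inv)
    # Layer 2: cheap numeric and date filter on the remainder only.
    candidates = [
        (inv, c)
        for inv in unmatched
        for c in contracts
        if abs(inv["amount"] - c["amount"]) <= 0.05 * c["amount"]
        and c["start"] <= inv["date"] <= c["end"]
    ]
    # Layer 3: the expensive check runs only on surviving candidates.
    for inv, c in candidates:
        if inv["id"] not in matched and semantic_match(inv, c):
            matched[inv["id"]] = c["id"]
    return matched, len(candidates)


period = {"start": date(2024, 1, 1), "end": date(2024, 12, 31)}
contracts = [
    {"id": "C1", "supplier": "Acme", "contract_no": "K-1", "amount": 1000.0,
     "description": "steel pipes", **period},
    {"id": "C2", "supplier": "Bolt", "contract_no": "K-2", "amount": 500.0,
     "description": "copper wire", **period},
]
invoices = [
    {"id": "I1", "supplier": "Acme", "contract_no": "K-1", "amount": 1000.0,
     "date": date(2024, 3, 1), "description": "steel pipes"},
    {"id": "I2", "supplier": "Bolt Ltd", "contract_no": None, "amount": 510.0,
     "date": date(2024, 5, 1), "description": "copper wire"},
    {"id": "I3", "supplier": "Zed", "contract_no": None, "amount": 9999.0,
     "date": date(2024, 6, 1), "description": "misc"},
]
# Stand-in for the LLM call: compare descriptions literally.
matched, n_candidates = layered_match(
    invoices, contracts, lambda inv, c: inv["description"] == c["description"]
)
```
+In this toy data only one invoice-contract pair reaches the expensive layer, which is the effect the layered table describes at scale.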
+## Cost Estimation Template
+Use this template during decomposition planning to estimate per-document cost.
+| Sub-task | Method | Est. Cost/Call | Calls/Document | Subtotal |
+|---|---|---|---|---|
+| Locate section | LLM Tier 3 | $0.001 | 2 | $0.002 |
+| Extract fields | Regex | $0.000 | 5 | $0.000 |
+| Normalize values | Python | $0.000 | 5 | $0.000 |
+| Cross-field comparison | Python | $0.000 | 1 | $0.000 |
+| Semantic judgment | LLM Tier 2 | $0.003 | 1 | $0.003 |
+| Comment generation | Template | $0.000 | 1 | $0.000 |
+| **Total per document** | | | | **$0.005** |
+Multiply by expected document volume to get batch cost. Compare against the developer user's budget. If total exceeds budget, optimize the most expensive sub-tasks first — usually the LLM calls with the highest per-call cost or the highest call count.
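+The template arithmetic can be reproduced in a few lines. This is a sketch using the figures from the table above; the batch size of 10,000 documents is a made-up volume for illustration, and `Decimal` is used so the small per-call costs add up exactly.
```python
from decimal import Decimal

# (sub-task, method, estimated cost per call in USD, calls per document)
rows = [
    ("Locate section", "LLM Tier 3", Decimal("0.001"), 2),
    ("Extract fields", "Regex", Decimal("0.000"), 5),
    ("Normalize values", "Python", Decimal("0.000"), 5),
    ("Cross-field comparison", "Python", Decimal("0.000"), 1),
    ("Semantic judgment", "LLM Tier 2", Decimal("0.003"), 1),
    ("Comment generation", "Template", Decimal("0.000"), 1),
]

subtotals = {name: cost * calls for name, _, cost, calls in rows}
per_document = sum(subtotals.values())

# Hypothetical volume; replace with the developer user's expected batch size.
n_documents = 10_000
batch_cost = per_document * n_documents

# The first optimization target is the sub-task with the largest subtotal.
most_expensive = max(subtotals, key=subtotals.get)

print(f"per document: ${per_document}, batch: ${batch_cost}, "
      f"optimize first: {most_expensive}")
```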