kc-beta 0.1.0

Files changed (141)
  1. package/bin/kc-beta.js +16 -0
  2. package/package.json +32 -0
  3. package/src/agent/confidence-scorer.js +120 -0
  4. package/src/agent/context.js +124 -0
  5. package/src/agent/corner-case-registry.js +119 -0
  6. package/src/agent/engine.js +224 -0
  7. package/src/agent/events.js +27 -0
  8. package/src/agent/history.js +101 -0
  9. package/src/agent/llm-client.js +131 -0
  10. package/src/agent/pipelines/base.js +14 -0
  11. package/src/agent/pipelines/distillation.js +113 -0
  12. package/src/agent/pipelines/extraction.js +92 -0
  13. package/src/agent/pipelines/index.js +23 -0
  14. package/src/agent/pipelines/initializer.js +163 -0
  15. package/src/agent/pipelines/production-qc.js +99 -0
  16. package/src/agent/pipelines/skill-authoring.js +83 -0
  17. package/src/agent/pipelines/skill-testing.js +111 -0
  18. package/src/agent/tools/agent-tool.js +100 -0
  19. package/src/agent/tools/base.js +35 -0
  20. package/src/agent/tools/dashboard-render.js +146 -0
  21. package/src/agent/tools/document-parse.js +184 -0
  22. package/src/agent/tools/document-search.js +111 -0
  23. package/src/agent/tools/evolution-cycle.js +150 -0
  24. package/src/agent/tools/qc-sample.js +94 -0
  25. package/src/agent/tools/registry.js +55 -0
  26. package/src/agent/tools/rule-catalog.js +113 -0
  27. package/src/agent/tools/sandbox-exec.js +106 -0
  28. package/src/agent/tools/tier-downgrade.js +114 -0
  29. package/src/agent/tools/worker-llm-call.js +109 -0
  30. package/src/agent/tools/workflow-run.js +138 -0
  31. package/src/agent/tools/workspace-file.js +122 -0
  32. package/src/agent/version-manager.js +130 -0
  33. package/src/agent/workspace.js +82 -0
  34. package/src/cli/components.js +164 -0
  35. package/src/cli/index.js +329 -0
  36. package/src/cli/init.js +80 -0
  37. package/src/cli/onboard.js +182 -0
  38. package/src/cli/terminal.js +143 -0
  39. package/src/config.js +93 -0
  40. package/template/.env.template +31 -0
  41. package/template/CLAUDE.md +137 -0
  42. package/template/Input/.gitkeep +0 -0
  43. package/template/Output/.gitkeep +0 -0
  44. package/template/Rules/.gitkeep +0 -0
  45. package/template/Samples/.gitkeep +0 -0
  46. package/template/skills/en/meta/compliance-judgment/SKILL.md +114 -0
  47. package/template/skills/en/meta/compliance-judgment/references/output-format.md +151 -0
  48. package/template/skills/en/meta/confidence-system/SKILL.md +117 -0
  49. package/template/skills/en/meta/corner-case-management/SKILL.md +111 -0
  50. package/template/skills/en/meta/cross-document-verification/SKILL.md +131 -0
  51. package/template/skills/en/meta/cross-document-verification/references/contradiction-taxonomy.md +73 -0
  52. package/template/skills/en/meta/data-sensibility/SKILL.md +115 -0
  53. package/template/skills/en/meta/document-parsing/SKILL.md +108 -0
  54. package/template/skills/en/meta/document-parsing/references/parser-catalog.md +40 -0
  55. package/template/skills/en/meta/entity-extraction/SKILL.md +129 -0
  56. package/template/skills/en/meta/tree-processing/SKILL.md +103 -0
  57. package/template/skills/en/meta-meta/bootstrap-workspace/SKILL.md +70 -0
  58. package/template/skills/en/meta-meta/dashboard-reporting/SKILL.md +106 -0
  59. package/template/skills/en/meta-meta/dashboard-reporting/scripts/generate_dashboard.py +178 -0
  60. package/template/skills/en/meta-meta/evolution-loop/SKILL.md +210 -0
  61. package/template/skills/en/meta-meta/evolution-loop/references/convergence-guide.md +62 -0
  62. package/template/skills/en/meta-meta/quality-control/SKILL.md +138 -0
  63. package/template/skills/en/meta-meta/quality-control/references/qa-layers.md +92 -0
  64. package/template/skills/en/meta-meta/quality-control/references/sampling-strategies.md +76 -0
  65. package/template/skills/en/meta-meta/rule-extraction/SKILL.md +100 -0
  66. package/template/skills/en/meta-meta/rule-extraction/references/chunking-strategies.md +80 -0
  67. package/template/skills/en/meta-meta/rule-graph/SKILL.md +118 -0
  68. package/template/skills/en/meta-meta/skill-authoring/SKILL.md +108 -0
  69. package/template/skills/en/meta-meta/skill-authoring/references/skill-format-spec.md +78 -0
  70. package/template/skills/en/meta-meta/skill-to-workflow/SKILL.md +150 -0
  71. package/template/skills/en/meta-meta/skill-to-workflow/references/worker-llm-catalog.md +50 -0
  72. package/template/skills/en/meta-meta/task-decomposition/SKILL.md +129 -0
  73. package/template/skills/en/meta-meta/task-decomposition/references/decision-matrix.md +81 -0
  74. package/template/skills/en/meta-meta/version-control/SKILL.md +152 -0
  75. package/template/skills/en/meta-meta/version-control/references/trace-id-spec.md +79 -0
  76. package/template/skills/en/skill-creator/LICENSE.txt +202 -0
  77. package/template/skills/en/skill-creator/SKILL.md +479 -0
  78. package/template/skills/en/skill-creator/agents/analyzer.md +274 -0
  79. package/template/skills/en/skill-creator/agents/comparator.md +202 -0
  80. package/template/skills/en/skill-creator/agents/grader.md +223 -0
  81. package/template/skills/en/skill-creator/assets/eval_review.html +146 -0
  82. package/template/skills/en/skill-creator/eval-viewer/generate_review.py +471 -0
  83. package/template/skills/en/skill-creator/eval-viewer/viewer.html +1325 -0
  84. package/template/skills/en/skill-creator/references/schemas.md +430 -0
  85. package/template/skills/en/skill-creator/scripts/__init__.py +0 -0
  86. package/template/skills/en/skill-creator/scripts/aggregate_benchmark.py +401 -0
  87. package/template/skills/en/skill-creator/scripts/generate_report.py +326 -0
  88. package/template/skills/en/skill-creator/scripts/improve_description.py +248 -0
  89. package/template/skills/en/skill-creator/scripts/package_skill.py +136 -0
  90. package/template/skills/en/skill-creator/scripts/quick_validate.py +103 -0
  91. package/template/skills/en/skill-creator/scripts/run_eval.py +310 -0
  92. package/template/skills/en/skill-creator/scripts/run_loop.py +332 -0
  93. package/template/skills/en/skill-creator/scripts/utils.py +47 -0
  94. package/template/skills/zh/meta/compliance-judgment/SKILL.md +303 -0
  95. package/template/skills/zh/meta/compliance-judgment/references/output-format.md +151 -0
  96. package/template/skills/zh/meta/confidence-system/SKILL.md +228 -0
  97. package/template/skills/zh/meta/corner-case-management/SKILL.md +235 -0
  98. package/template/skills/zh/meta/cross-document-verification/SKILL.md +241 -0
  99. package/template/skills/zh/meta/cross-document-verification/references/contradiction-taxonomy.md +73 -0
  100. package/template/skills/zh/meta/data-sensibility/SKILL.md +235 -0
  101. package/template/skills/zh/meta/document-parsing/SKILL.md +168 -0
  102. package/template/skills/zh/meta/document-parsing/references/parser-catalog.md +40 -0
  103. package/template/skills/zh/meta/entity-extraction/SKILL.md +276 -0
  104. package/template/skills/zh/meta/tree-processing/SKILL.md +233 -0
  105. package/template/skills/zh/meta-meta/bootstrap-workspace/SKILL.md +147 -0
  106. package/template/skills/zh/meta-meta/dashboard-reporting/SKILL.md +281 -0
  107. package/template/skills/zh/meta-meta/dashboard-reporting/scripts/generate_dashboard.py +178 -0
  108. package/template/skills/zh/meta-meta/evolution-loop/SKILL.md +302 -0
  109. package/template/skills/zh/meta-meta/evolution-loop/references/convergence-guide.md +62 -0
  110. package/template/skills/zh/meta-meta/quality-control/SKILL.md +269 -0
  111. package/template/skills/zh/meta-meta/quality-control/references/qa-layers.md +92 -0
  112. package/template/skills/zh/meta-meta/quality-control/references/sampling-strategies.md +76 -0
  113. package/template/skills/zh/meta-meta/rule-extraction/SKILL.md +208 -0
  114. package/template/skills/zh/meta-meta/rule-extraction/references/chunking-strategies.md +80 -0
  115. package/template/skills/zh/meta-meta/rule-graph/SKILL.md +203 -0
  116. package/template/skills/zh/meta-meta/skill-authoring/SKILL.md +235 -0
  117. package/template/skills/zh/meta-meta/skill-authoring/references/skill-format-spec.md +78 -0
  118. package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +275 -0
  119. package/template/skills/zh/meta-meta/skill-to-workflow/references/worker-llm-catalog.md +50 -0
  120. package/template/skills/zh/meta-meta/task-decomposition/SKILL.md +224 -0
  121. package/template/skills/zh/meta-meta/task-decomposition/references/decision-matrix.md +81 -0
  122. package/template/skills/zh/meta-meta/version-control/SKILL.md +284 -0
  123. package/template/skills/zh/meta-meta/version-control/references/trace-id-spec.md +79 -0
  124. package/template/skills/zh/skill-creator/LICENSE.txt +202 -0
  125. package/template/skills/zh/skill-creator/SKILL.md +479 -0
  126. package/template/skills/zh/skill-creator/agents/analyzer.md +274 -0
  127. package/template/skills/zh/skill-creator/agents/comparator.md +202 -0
  128. package/template/skills/zh/skill-creator/agents/grader.md +223 -0
  129. package/template/skills/zh/skill-creator/assets/eval_review.html +146 -0
  130. package/template/skills/zh/skill-creator/eval-viewer/generate_review.py +471 -0
  131. package/template/skills/zh/skill-creator/eval-viewer/viewer.html +1325 -0
  132. package/template/skills/zh/skill-creator/references/schemas.md +430 -0
  133. package/template/skills/zh/skill-creator/scripts/__init__.py +0 -0
  134. package/template/skills/zh/skill-creator/scripts/aggregate_benchmark.py +401 -0
  135. package/template/skills/zh/skill-creator/scripts/generate_report.py +326 -0
  136. package/template/skills/zh/skill-creator/scripts/improve_description.py +248 -0
  137. package/template/skills/zh/skill-creator/scripts/package_skill.py +136 -0
  138. package/template/skills/zh/skill-creator/scripts/quick_validate.py +103 -0
  139. package/template/skills/zh/skill-creator/scripts/run_eval.py +310 -0
  140. package/template/skills/zh/skill-creator/scripts/run_loop.py +332 -0
  141. package/template/skills/zh/skill-creator/scripts/utils.py +47 -0
@@ -0,0 +1,138 @@
1
+ ---
2
+ name: quality-control
3
+ description: Design and execute quality control for production verification workflows. Use when workflows are deployed on Input/ documents and results need to be monitored, when designing the QC sampling strategy for a rule, or when evaluating whether monitoring can be reduced. Covers LLM-as-Judge evaluation, adaptive sampling strategies, confidence-based triage, and the transition from active monitoring to stable oversight. Also use when production quality drops and you need to diagnose whether to trigger the evolution loop.
4
+ ---
5
+
6
+ # Quality Control
7
+
8
+ Quality control is the Observer role. You are watching the worker LLMs perform and deciding whether they are doing it well enough. The goal is not to review every result — that would defeat the purpose of automation. The goal is to review just enough to maintain confidence that the system is working.
9
+
10
+ ## Five-Layer QA Architecture
11
+
12
+ Quality control is not one activity — it is five layers that build on each other. Lower layers must pass before higher layers run.
13
+
14
+ | Layer | Name | What It Checks | Method |
+ |-------|------|---------------|--------|
+ | L1 | Text Integrity | Files exist, encoding correct, source text preserved after processing | Scripts (`lint_*`) |
+ | L2 | Syntax | Output format valid (JSON/CSV), required fields present, types correct | Scripts (`lint_*`) |
+ | L3 | Data Completeness | Required fields populated, values in valid domain (dates are dates, amounts are positive) | Scripts (`validate_*`) |
+ | L4 | Business Logic | Cross-field consistency, threshold compliance, sequence reasonableness | Scripts + LLM |
+ | L5 | Cross-Phase | Entities in results match extraction output, rules match catalog, workflow output matches skill ground truth | Scripts (`cross_validate_*`) + LLM |
21
+
22
+ **Key principles:**
23
+ - **Fail fast**: If L1 fails (file missing), do not run L4 (business logic). Lower layers block higher layers.
24
+ - **Code first**: L1-L3 should be pure code — cheap and deterministic. LLM-as-Judge (below) operates at L4 and L5.
25
+ - **Naming convention**: `lint_*` for L1-L2, `validate_*` for L3-L4, `cross_validate_*` for L5.
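The fail-fast principle can be sketched as a small runner that executes layers in order and stops at the first failing layer. This is a minimal illustration, not the package's implementation: `run_layers`, `lint_text_present`, `validate_amount_positive`, and the record fields `text` and `amount` are hypothetical names chosen for the example.

```python
from typing import Callable, List, Tuple

# A check returns (passed, reason). All check names here are hypothetical.
Check = Callable[[dict], Tuple[bool, str]]


def run_layers(record: dict, layers: List[Tuple[str, List[Check]]]) -> dict:
    """Run QA layers in order and stop at the first layer that fails."""
    report = {"passed": True, "failed_layer": None, "results": []}
    for layer_name, checks in layers:
        layer_ok = True
        for check in checks:
            ok, reason = check(record)
            report["results"].append(
                {"layer": layer_name, "check": check.__name__,
                 "passed": ok, "reason": reason}
            )
            layer_ok = layer_ok and ok
        if not layer_ok:
            # Fail fast: higher layers are skipped once a lower layer fails.
            report["passed"] = False
            report["failed_layer"] = layer_name
            break
    return report


def lint_text_present(record: dict) -> Tuple[bool, str]:
    """L1-style check: processed text must be non-empty."""
    ok = bool(record.get("text", "").strip())
    return ok, "ok" if ok else "empty text"


def validate_amount_positive(record: dict) -> Tuple[bool, str]:
    """L3-style check: amount must be a positive number."""
    amount = record.get("amount")
    ok = isinstance(amount, (int, float)) and amount > 0
    return ok, "ok" if ok else "amount must be a positive number"


LAYERS = [("L1", [lint_text_present]), ("L3", [validate_amount_positive])]
```

With this shape, a record with empty text produces a single L1 result and the L3 check never runs.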
26
+
27
+ **QC vs Reflection**: QC catches problems in outputs (this skill). Reflection diagnoses WHY problems occur and fixes them (see `evolution-loop`). QC feeds data to Reflection; Reflection feeds fixes back to the system.
28
+
29
+ See `references/qa-layers.md` for detailed layer specifications and example patterns.
30
+
31
+ ## LLM-as-Judge
32
+
33
+ Act as the judge yourself (the coding agent) or delegate to a high-tier worker LLM to evaluate workflow outputs. The judge receives:
34
+ - The source document (or relevant section).
35
+ - The workflow's extracted value.
36
+ - The workflow's pass/fail determination.
37
+ - The workflow's comment (if any).
38
+
39
+ The judge produces a structured verdict:
40
+
41
+ - **correct**: The extraction and judgment are both right.
42
+ - **partial**: The extraction is roughly right but imprecise, or the judgment is right but the comment is misleading.
43
+ - **incorrect**: The extraction or judgment is wrong.
44
+ - **missing**: The workflow failed to produce a result when it should have.
45
+
46
+ Field-level verdicts should include: the field name, the verdict, the expected value (what the judge thinks is correct), and brief reasoning.
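One possible shape for a field-level verdict, assuming the judge returns a dictionary per field. The four verdict values come from the list above; the class names, the dictionary keys, and `parse_field_verdict` are assumptions made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(str, Enum):
    CORRECT = "correct"
    PARTIAL = "partial"
    INCORRECT = "incorrect"
    MISSING = "missing"


@dataclass
class FieldVerdict:
    field_name: str
    verdict: Verdict
    expected_value: Optional[str]  # what the judge believes is correct
    reasoning: str


def parse_field_verdict(raw: dict) -> FieldVerdict:
    """Validate one judge response; raises ValueError on an unknown verdict."""
    return FieldVerdict(
        field_name=raw["field_name"],
        verdict=Verdict(raw["verdict"]),
        expected_value=raw.get("expected_value"),
        reasoning=raw.get("reasoning", ""),
    )
```

Rejecting unknown verdict strings at parse time keeps malformed judge output from silently entering accuracy metrics.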
47
+
48
+ ## Adaptive Sampling
49
+
50
+ Do not review everything forever. Start at high coverage and reduce as quality stabilizes:
51
+
52
+ ### Initial Deployment
53
+ - Review 100% of results for the first batch. This is your baseline.
54
+ - Establish per-rule accuracy from this batch.
55
+
56
+ ### Early Production
57
+ - Review all results that the workflow flagged as low confidence.
58
+ - Additionally, randomly sample a percentage of high-confidence results.
59
+ - The percentage depends on MONITOR_FREQUENCY in `.env`:
60
+ - `high`: 50% random sample
61
+ - `mid`: 20% random sample
62
+ - `low`: 10% random sample
63
+
64
+ ### Stable Production
65
+ - Review only low-confidence results.
66
+ - The random sampling rate drops further, to 5-10%.
67
+ - Trigger full review if random samples reveal unexpected failures.
68
+
69
+ ### The Decay Function
70
+ The sampling rate should decay based on observed accuracy, not just time. If accuracy stays above the threshold for N consecutive batches, reduce sampling. If accuracy drops, increase sampling immediately.
71
+
72
+ The exact decay curve is for you to design per rule, based on:
73
+ - How variable the documents are.
74
+ - How complex the rule is.
75
+ - How much the developer user trusts the workflow.
76
+
77
+ Discuss the sampling strategy with the developer user. They may have preferences.
78
+
79
+ ## Confidence-Based Triage
80
+
81
+ Confidence scores (from the `confidence-system` skill) determine review priority:
82
+
83
+ - **High confidence (above auto-accept threshold)**: Spot-check only. These are results the workflow is sure about.
84
+ - **Medium confidence (between thresholds)**: Sample at the rate determined by MONITOR_FREQUENCY.
85
+ - **Low confidence (below full-review threshold)**: Review every one. These are results the workflow is uncertain about.
86
+
87
+ The thresholds themselves should be calibrated per rule. Start with reasonable defaults (e.g., 0.9 for auto-accept and 0.6 for full review) and adjust based on whether the thresholds actually distinguish correct from incorrect results.
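A minimal sketch of the triage, assuming 0.9 is the auto-accept threshold and 0.6 the full-review threshold. The function names, the `confidence` key, and the treatment of values exactly on a threshold are assumptions for illustration.

```python
import random

AUTO_ACCEPT = 0.9  # example default; calibrate per rule
FULL_REVIEW = 0.6  # example default; calibrate per rule


def triage(confidence: float) -> str:
    """Map a confidence score to a review band."""
    if confidence >= AUTO_ACCEPT:
        return "high"
    if confidence < FULL_REVIEW:
        return "low"
    return "medium"


def select_for_review(results, medium_rate, spot_check_rate, rng):
    """Pick every low-confidence result plus random samples of the rest."""
    selected = []
    for result in results:
        band = triage(result["confidence"])
        if band == "low":
            selected.append(result)  # always reviewed
        elif band == "medium" and rng.random() < medium_rate:
            selected.append(result)
        elif band == "high" and rng.random() < spot_check_rate:
            selected.append(result)  # spot check
    return selected
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which helps when a QC report needs to be regenerated.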
88
+
89
+ ## Triggering Evolution
90
+
91
+ Quality control can trigger the evolution loop. Criteria:
92
+
93
+ - Average accuracy across reviewed results drops below WORKFLOW_ACCURACY.
94
+ - A single rule's accuracy drops significantly (e.g., 15+ percentage points below its historical average).
95
+ - A new pattern of failures emerges that was not seen during skill/workflow testing.
96
+
97
+ When triggering evolution, provide the quality control data as input to the diagnosis step.
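The first two criteria are mechanical and could be checked along these lines; the third (a new failure pattern) needs judgment and is not coded here. The function name and parameters are hypothetical, and accuracies are assumed to be percentages from 0 to 100.

```python
from typing import Dict, List


def evolution_triggers(
    overall_accuracy: float,
    rule_accuracy: Dict[str, float],
    historical_accuracy: Dict[str, float],
    workflow_accuracy: float,
    drop_points: float = 15.0,
) -> List[str]:
    """Return reasons to trigger the evolution loop (empty list means none)."""
    reasons = []
    # Criterion 1: overall accuracy below the WORKFLOW_ACCURACY setting.
    if overall_accuracy < workflow_accuracy:
        reasons.append(
            f"overall accuracy {overall_accuracy:.1f} is below {workflow_accuracy:.1f}"
        )
    # Criterion 2: a single rule dropped well below its own history.
    for rule_id, accuracy in rule_accuracy.items():
        baseline = historical_accuracy.get(rule_id)
        if baseline is not None and baseline - accuracy >= drop_points:
            reasons.append(
                f"rule {rule_id} dropped {baseline - accuracy:.1f} points below its average"
            )
    return reasons
```

The returned reasons can be attached to the QC data handed to the diagnosis step.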
98
+
99
+ ## Batch Processing
100
+
101
+ For production Input/ documents:
102
+
103
+ 1. Process the batch through workflows. Results go to Output/.
104
+ 2. Assign confidence scores to each result.
105
+ 3. Select the review sample based on confidence triage + random sampling.
106
+ 4. Review the selected results (LLM-as-Judge or manual review by the developer user).
107
+ 5. Compute batch accuracy from reviewed results.
108
+ 6. Log batch QC report.
109
+ 7. If accuracy is acceptable, finalize the batch. If not, trigger evolution loop.
110
+
111
+ ## Developer User Involvement
112
+
113
+ The developer user should see QC results through the dashboard (see `dashboard-reporting`). Key metrics to surface:
114
+ - Per-rule accuracy over time.
115
+ - Confidence distribution (are most results high-confidence?).
116
+ - Sampling rate (is monitoring decreasing as expected?).
117
+ - Flagged issues requiring human attention.
118
+
119
+ The developer user may also want to review specific results themselves, especially for high-stakes rules. Accommodate this by marking results for human review in the output.
120
+
121
+ ## User Feedback Collection
122
+
123
+ Build error reporting and comment mechanisms into every verification app you create. These are not optional features — they are essential data sources.
124
+
125
+ ### Two Audiences
126
+
127
+ - **Developer users**: Technical error reporting — field-level correction, rule re-evaluation requests, false positive/negative flags with context. They see the full result detail.
128
+ - **End users**: Simplified feedback — flag a result as wrong, add a comment, indicate severity. They see a clean interface without technical internals.
129
+
130
+ ### Feedback as Ground Truth
131
+
132
+ When a user reports an error on a verification result, that correction is ground truth. It overrides the coding agent's judgment and the worker LLM's output. Feed user corrections into the `evolution-loop` immediately as confirmed failures — they are higher priority than agent-detected issues.
133
+
134
+ ### Feedback Data Flow
135
+
136
+ Collect feedback via the dashboard (see `dashboard-reporting`) → Store as structured records (result_id, reporter_role, feedback_type, corrected_value, comment, timestamp) → Feed into `evolution-loop` as regression test cases → Track correction trends in QC metrics.
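A feedback record with the six fields named above might look like the following. The field names are from the text; the example values for `reporter_role` and `feedback_type`, and the helper functions, are assumptions.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional
import json


@dataclass
class FeedbackRecord:
    result_id: str
    reporter_role: str              # assumed values: "developer" or "end_user"
    feedback_type: str              # assumed values, e.g. "false_positive"
    corrected_value: Optional[str]  # None when the user only flags a result
    comment: str
    timestamp: str                  # ISO 8601, UTC


def new_feedback(result_id, reporter_role, feedback_type,
                 corrected_value=None, comment=""):
    """Create a record stamped with the current UTC time."""
    return FeedbackRecord(
        result_id, reporter_role, feedback_type, corrected_value, comment,
        datetime.now(timezone.utc).isoformat(),
    )


def to_regression_case(record: FeedbackRecord) -> dict:
    """Turn a user correction into a regression test case (hypothetical shape)."""
    return {
        "result_id": record.result_id,
        "expected_value": record.corrected_value,
        "source": "user_feedback",
    }
```

`json.dumps(asdict(record))` gives a storable form of the record.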
137
+
138
+ See `references/sampling-strategies.md` for detailed sampling patterns.
@@ -0,0 +1,92 @@
1
+ # QA Layer Specifications
2
+
3
+ Detailed specifications for the five-layer QA architecture. Each layer builds on the one below it.
4
+
5
+ ## Layer Details
6
+
7
+ ### L1: Text Integrity
8
+
9
+ - **Description**: Verify that source files exist, are readable, and that text content is preserved correctly after any processing (parsing, OCR, conversion).
10
+ - **Input**: Raw document files and their processed text output.
11
+ - **Output**: Pass/fail per file with error details.
12
+ - **Example checks**: File exists and is non-empty. Encoding is UTF-8 (or declared encoding). No null bytes in text output. Character count is within expected range for document type.
13
+ - **Common failures**: File path changed after processing. OCR produced empty output. Encoding mismatch causes garbled characters.
14
+ - **Escalation**: If L1 fails, do not proceed to higher layers. Log the failure and flag for reprocessing.
15
+
16
+ ### L2: Syntax
17
+
18
+ - **Description**: Verify that output files conform to their declared format and schema.
19
+ - **Input**: Output files (JSON, CSV, etc.) from workflows.
20
+ - **Output**: Pass/fail per file with parse errors or schema violations.
21
+ - **Example checks**: JSON is valid (parses without error). Required top-level keys exist. Array fields are arrays, not strings. Date fields match ISO 8601 format.
22
+ - **Common failures**: Trailing comma in JSON. Missing closing bracket. CSV with inconsistent column count. Unexpected null where value is required.
23
+ - **Escalation**: Syntax failures indicate a bug in the output generation code. Fix the code, not the data.
24
+
25
+ ### L3: Data Completeness
26
+
27
+ - **Description**: Verify that required data fields are populated with values in their valid domain.
28
+ - **Input**: Parsed output records.
29
+ - **Output**: Per-field validation results with reasons for any failures.
30
+ - **Example checks**: Invoice date is a valid date (not "N/A" or empty). Amount is a positive number. Entity name is non-empty and does not contain only whitespace. Enum fields contain allowed values.
31
+ - **Common failures**: Extraction returned "unable to determine" as a value. Amount includes currency symbol (string instead of number). Date extracted as partial (month and day but no year).
32
+ - **Escalation**: Completeness failures feed back to extraction prompt improvement. If a field is consistently incomplete, the extraction logic needs work.
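As a rough sketch of the L3 example checks, assuming records are dictionaries with hypothetical fields `invoice_date`, `amount`, and `entity_name`:

```python
from datetime import date


def validate_record(record: dict) -> list:
    """Return per-field failures; an empty list means the record passes L3."""
    failures = []

    # Date must parse as a full ISO date, so "N/A", "" and partial dates fail.
    try:
        date.fromisoformat(str(record.get("invoice_date", "")))
    except ValueError:
        failures.append({"field": "invoice_date", "reason": "not a valid date"})

    # Amount must be numeric and positive; "$120" is a string and fails.
    amount = record.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        failures.append({"field": "amount", "reason": "not a positive number"})

    # Entity name must contain something other than whitespace.
    name = record.get("entity_name")
    if not isinstance(name, str) or not name.strip():
        failures.append({"field": "entity_name", "reason": "empty or whitespace"})

    return failures
```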
33
+
34
+ ### L4: Business Logic
35
+
36
+ - **Description**: Verify cross-field consistency and compliance with business rules.
37
+ - **Input**: Complete, validated records from L3.
38
+ - **Output**: Per-rule validation results with reasoning.
39
+ - **Example checks**: Contract end date is after start date. Invoice date falls within contract validity period. Total amount equals sum of line items. Signatory name matches authorized personnel list.
40
+ - **Common failures**: Date comparison fails due to timezone differences. Rounding errors in amount calculations. Cross-reference lookup fails because entity names differ slightly (e.g., "ABC Corp" vs "ABC Corporation").
41
+ - **Escalation**: Business logic failures may indicate rule misunderstanding. Consult the developer user if the rule intent is ambiguous.
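Three of the L4 example checks can be sketched in code. The record keys are hypothetical, and the 0.01 tolerance is an assumption; `Decimal` is used to sidestep the rounding errors listed under common failures.

```python
from datetime import date
from decimal import Decimal


def validate_business_logic(record: dict) -> list:
    """Return a list of failure messages; empty means all checks passed."""
    failures = []
    start = date.fromisoformat(record["contract_start"])
    end = date.fromisoformat(record["contract_end"])
    invoice = date.fromisoformat(record["invoice_date"])

    if end <= start:
        failures.append("contract end date is not after start date")
    if not (start <= invoice <= end):
        failures.append("invoice date is outside the contract validity period")

    # Compare money as decimals, with a small tolerance (assumed: 0.01).
    line_total = sum(Decimal(str(item)) for item in record["line_items"])
    if abs(line_total - Decimal(str(record["total"]))) > Decimal("0.01"):
        failures.append("total does not equal the sum of line items")
    return failures
```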
42
+
43
+ ### L5: Cross-Phase
44
+
45
+ - **Description**: Verify consistency across different phases of the verification pipeline.
46
+ - **Input**: Outputs from multiple pipeline stages (extraction, verification, reporting).
47
+ - **Output**: Cross-phase consistency report.
48
+ - **Example checks**: Entities in final results match those in extraction output (nothing added or dropped). Rule IDs in results exist in the rule catalog. Workflow output for a skill matches the skill's own ground truth output. Confidence scores in results match those computed by the confidence system.
49
+ - **Common failures**: A rule was added to the catalog but the workflow was not updated to include it. Extraction found 5 entities but results only report 4. Workflow output diverges from skill ground truth on edge cases.
50
+ - **Escalation**: Cross-phase failures often indicate integration issues. Check the pipeline connections, not individual components.
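The first two example checks reduce to set comparisons. A minimal sketch, with hypothetical function names and plain lists of identifiers as input:

```python
def cross_validate_entities(extracted, reported) -> dict:
    """Check that nothing was added or dropped between extraction and results."""
    extracted_set, reported_set = set(extracted), set(reported)
    return {
        "dropped": sorted(extracted_set - reported_set),
        "added": sorted(reported_set - extracted_set),
        "passed": extracted_set == reported_set,
    }


def cross_validate_rule_ids(result_rule_ids, catalog_rule_ids) -> dict:
    """Check that every rule ID in the results exists in the rule catalog."""
    unknown = sorted(set(result_rule_ids) - set(catalog_rule_ids))
    return {"unknown_rule_ids": unknown, "passed": not unknown}
```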
51
+
52
+ ## Script Naming Convention
53
+
54
+ | Prefix | Layer | Purpose | Examples |
+ |--------|-------|---------|----------|
+ | `lint_` | L1-L2 | Fast, syntactic checks | `lint_json.py`, `lint_encoding.py`, `lint_schema.py` |
+ | `validate_` | L3-L4 | Domain and logic validation | `validate_fields.py`, `validate_dates.py`, `validate_amounts.py` |
+ | `cross_validate_` | L5 | Cross-phase consistency | `cross_validate_extraction.py`, `cross_validate_rules.py` |
59
+
60
+ Scripts should:
61
+ - Accept a file or directory path as input.
62
+ - Output structured JSON results (pass/fail per check, with reasons).
63
+ - Return exit code 0 if all checks pass, non-zero otherwise.
64
+ - Be idempotent — running twice produces the same result.
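A script following this contract might be structured as below. This is a sketch of what a `lint_json.py` could look like, not the file shipped in the package; the required keys `rule_id` and `result` are placeholders.

```python
import json
from pathlib import Path


def lint_json_file(path: Path, required_keys=("rule_id", "result")) -> dict:
    """Run L2 checks on one file and return structured results."""
    checks = []
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        checks.append({"check": "valid_json", "passed": True, "reason": ""})
    except (OSError, UnicodeDecodeError, json.JSONDecodeError) as exc:
        checks.append({"check": "valid_json", "passed": False, "reason": str(exc)})
        return {"file": str(path), "checks": checks}
    for key in required_keys:  # placeholder keys, adjust to the real schema
        present = isinstance(data, dict) and key in data
        checks.append({
            "check": f"has_{key}",
            "passed": present,
            "reason": "" if present else f"missing key: {key}",
        })
    return {"file": str(path), "checks": checks}


def main(argv) -> int:
    """Lint a file or every *.json file in a directory; return the exit code."""
    target = Path(argv[1])
    files = sorted(target.rglob("*.json")) if target.is_dir() else [target]
    reports = [lint_json_file(f) for f in files]
    print(json.dumps(reports, indent=2))
    all_passed = all(c["passed"] for r in reports for c in r["checks"])
    return 0 if all_passed else 1
```

Calling `sys.exit(main(sys.argv))` from the script's entry point yields exit code 0 only when every check passes. The function only reads files, so repeated runs give the same result.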
65
+
66
+ ## QC vs Reflection
67
+
68
+ | Dimension | QC (this skill) | Reflection (evolution-loop) |
+ |-----------|-----------------|---------------------------|
+ | **Who runs it** | Coding agent or automated scripts | Coding agent |
+ | **What triggers it** | Every batch, on schedule | QC failures, accuracy drops |
+ | **Input** | Workflow outputs | QC reports, failure logs, iteration history |
+ | **Output** | Pass/fail verdicts, accuracy metrics | Root cause diagnosis, fix proposals |
+ | **Cost** | Low (mostly scripts, some LLM at L4-L5) | Higher (deep analysis, prompt rewriting) |
+ | **When to use** | Always (every production batch) | Only when QC reveals problems |
+ | **Goal** | Detect problems | Fix problems |
77
+
78
+ QC without Reflection detects issues but cannot fix them. Reflection without QC has no data to work from. They are complementary, not alternatives.
79
+
80
+ ## Integration Points
81
+
82
+ ### With `data-sensibility`
83
+
84
+ The `data-sensibility` skill provides input validation that feeds L1-L3. If data-sensibility checks flag a document as anomalous before processing, QC can prioritize reviewing that document's outputs. Data-sensibility operates on inputs; QC operates on outputs. Together they bracket the pipeline.
85
+
86
+ ### With `cross-document-verification`
87
+
88
+ Cross-document verification enables L5 cross-document consistency checks. When multiple documents reference the same entity (e.g., same contract number across invoice and purchase order), L5 can verify that extracted values are consistent across documents. Without cross-document verification, L5 is limited to single-document cross-phase checks.
89
+
90
+ ### With `confidence-system`
91
+
92
+ QC results calibrate the confidence system. When QC reveals that high-confidence results are sometimes wrong, the confidence thresholds need adjustment. Conversely, confidence scores drive QC sampling — low-confidence results get more review. This creates a feedback loop: QC improves confidence calibration, better calibration improves QC efficiency.
@@ -0,0 +1,76 @@
1
+ # Sampling Strategies for Quality Control
2
+
3
+ ## Adaptive Sampling
4
+
5
+ The core idea: review more when you are uncertain, less when you are confident. Confidence grows with evidence — consecutive batches of high accuracy.
6
+
7
+ ### Continuous Decay Model
8
+
9
+ Rather than cliff-edge transitions between phases, use a smooth exponential decay driven by observed accuracy:
10
+
11
+ ```
+ sampling_rate = max(floor_rate, exp(-λ × consecutive_successes))
+ ```
14
+
15
+ Where:
16
+ - `consecutive_successes`: number of consecutive batches where accuracy meets or exceeds the threshold. **Resets to 0** whenever a batch's accuracy drops below the threshold. This is the self-correcting mechanism — quality drops immediately increase monitoring.
17
+ - `λ` (decay speed): controlled by MONITOR_FREQUENCY in `.env`.
18
+ - `floor_rate`: the minimum sampling rate. The rate never drops below this value.
19
+
20
+ ### MONITOR_FREQUENCY Mapping
21
+
22
+ | Setting | λ | floor_rate | Character |
+ |---------|---|------------|-----------|
+ | `high` | 0.1 | 0.10 | Slow decay, cautious; for high-stakes verification where errors are costly |
+ | `mid` | 0.2 | 0.05 | Balanced decay; standard for most scenarios |
+ | `low` | 0.3 | 0.05 | Fast decay; for well-understood domains with simple rules |
27
+
28
+ As a rough mental model of the curve shape (for `mid`):
29
+ - After 1 success: ~82% sampling
30
+ - After 3 successes: ~55%
31
+ - After 5 successes: ~37%
32
+ - After 10 successes: ~14%
33
+ - After 15 successes: ~5% (floor)
34
+
35
+ These numbers, the formula, and even the exponential shape are recommended defaults. The coding agent and developer user should discuss and calibrate based on the specific business scenario. If a different decay function (linear, sigmoid, or hand-tuned) works better, use it. The framework — accuracy-driven decay with reset on quality drop — matters more than the specific formula.
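The decay formula and the MONITOR_FREQUENCY table translate directly into code. The function names below are illustrative; the λ and floor values are the recommended defaults from the table.

```python
import math

# setting -> (decay speed λ, floor_rate), from the mapping table
MONITOR_FREQUENCY = {
    "high": (0.1, 0.10),
    "mid": (0.2, 0.05),
    "low": (0.3, 0.05),
}


def sampling_rate(consecutive_successes: int, setting: str = "mid") -> float:
    """Exponential decay in the success streak, clamped at the floor rate."""
    decay, floor_rate = MONITOR_FREQUENCY[setting]
    return max(floor_rate, math.exp(-decay * consecutive_successes))


def update_streak(consecutive_successes: int, batch_accuracy: float,
                  threshold: float) -> int:
    """Extend the streak on a good batch; reset to 0 on any drop."""
    return consecutive_successes + 1 if batch_accuracy >= threshold else 0


for n in (1, 3, 5, 10, 15):
    print(n, round(sampling_rate(n, "mid"), 2))
```

For `mid`, the loop prints 0.82, 0.55, 0.37, 0.14, and 0.05, matching the curve above. A streak of 0 gives a rate of 1.0, which corresponds to full review.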
36
+
37
+ ## Priority Sampling
38
+
39
+ Not all results are equally worth reviewing. Priority sampling ensures that the most informative results are always in the review set:
40
+
41
+ ### Always Review
42
+ - Results where the workflow reported low confidence (below the full-review threshold from `confidence-system`).
43
+ - Results where the workflow produced an error or missing result.
44
+ - Results from document types not seen during skill/workflow testing.
45
+
46
+ ### Usually Review
47
+ - Results where the workflow's confidence is in the medium band.
48
+ - Results from rules that historically have lower accuracy.
49
+ - Results from the first occurrence of a new document format or variant.
50
+
51
+ ### Spot-Check
52
+ - Results with high confidence from rules that historically have high accuracy.
53
+ - These are selected randomly from the high-confidence pool.
54
+ - The purpose is regression detection, not active improvement.
55
+
56
+ ## Stratified Sampling
57
+
58
+ When documents vary significantly in complexity or type, stratify the sample:
59
+
60
+ 1. **Group documents** by type, complexity, or any relevant characteristic.
61
+ 2. **Sample proportionally** from each group, ensuring that minority groups are represented.
62
+ 3. **Over-sample** from groups that historically have lower accuracy.
63
+
64
+ This prevents the random sample from being dominated by easy documents while missing systematic failures in hard documents.
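The three steps can be sketched as follows. The `boost` multipliers for historically weak groups and the rule of taking at least one item per group are assumptions made for this example.

```python
import math
import random
from collections import defaultdict


def stratified_sample(results, group_key, base_rate, boost=None, rng=None):
    """Sample each group at base_rate, over-sampling groups listed in boost."""
    rng = rng or random.Random()
    boost = boost or {}

    # Step 1: group results by the chosen characteristic.
    groups = defaultdict(list)
    for result in results:
        groups[result[group_key]].append(result)

    sample = []
    for name, members in groups.items():
        # Step 3: multiply the rate for groups with lower historical accuracy.
        rate = min(1.0, base_rate * boost.get(name, 1.0))
        # Step 2: proportional share, but never fewer than one per group.
        k = max(1, math.ceil(rate * len(members)))
        sample.extend(rng.sample(members, k))
    return sample
```

For instance, with 90 simple documents, 10 scanned ones, `base_rate=0.1`, and `boost={"scanned": 3.0}`, the scanned group contributes at least 3 of its 10 results rather than about 1.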
65
+
66
+ ## Confidence Calibration Check
67
+
68
+ Periodically (every N batches), run a calibration check:
69
+
70
+ 1. Take a random sample of high-confidence results.
71
+ 2. Review them (LLM-as-Judge or human).
72
+ 3. Compare: are at least 90% of "high confidence" results actually correct?
73
+ 4. If not, the confidence system needs recalibration (see `confidence-system` skill).
74
+ 5. If yes, you can safely reduce the sampling rate for high-confidence results.
75
+
76
+ This is a meta-check on the quality of the quality control system itself.
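Step 3 of the calibration check is a simple proportion. A minimal sketch, assuming the review outcomes are available as a list of booleans; the function name is hypothetical and the 0.90 target is the figure from the list above.

```python
def calibration_check(reviewed, target=0.90) -> dict:
    """reviewed: booleans, True when a sampled high-confidence result was correct."""
    if not reviewed:
        raise ValueError("no reviewed results to check")
    precision = sum(reviewed) / len(reviewed)
    return {"precision": precision, "calibrated": precision >= target}
```

A result with `calibrated` set to false points to recalibrating the confidence system before reducing any sampling rate.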
@@ -0,0 +1,100 @@
1
+ ---
2
+ name: rule-extraction
3
+ description: Extract and organize business verification rules from regulation documents into discrete, testable units. Use when processing documents in Rules/ to identify individual verification rules, when decomposing a regulation into atomic checks, or when the developer user adds new regulation files. Covers reading regulation text, identifying rule boundaries, determining granularity, handling cross-references, and producing a rule catalog. Also use when rules are provided in structured formats like xlsx or csv.
4
+ ---
5
+
6
+ # Rule Extraction
7
+
8
+ Rules are the atoms of verification. Each rule you extract will become its own skill folder, its own workflow, and its own production pipeline. The quality of your extraction determines everything downstream.
9
+
10
+ ## Philosophy
11
+
12
+ A well-extracted rule is:
13
+ - **Atomic**: it checks one thing. "The borrower's debt-to-income ratio must not exceed 50%" is one rule. "The loan agreement must comply with Regulation X" is not — it is a container for many rules.
14
+ - **Testable**: given a document, you can definitively say whether the rule passes or fails (or is not applicable).
15
+ - **Self-contained**: the rule's meaning does not require reading ten other rules to understand. Cross-references should be resolved into the rule's description.
16
+ - **Scoped**: you know WHERE in the document to look. "Chapter 3, Section 2" or "the risk disclosure section" or "the signature page."
17
+
18
+ But perfection is the enemy of progress. Extract rules at the granularity that feels right for the regulation and the business scenario. You will iterate. The developer user will tell you if rules are too coarse or too fine.
19
+
20
+ ## Rule Schema Design Principles
21
+
22
+ Individual rules should be atomic and testable (above). The rule catalog as a whole must also satisfy system-level properties:
23
+
24
+ ### Coverage Target
25
+ Extracted rules should cover at least 95% of the regulation's checkable requirements. After initial extraction, perform a coverage audit: read the source regulation end-to-end and mark which paragraphs are covered by at least one rule. Uncovered paragraphs are either non-checkable (definitions, context) or gaps to close.
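The coverage audit can be tracked with a small helper, assuming each rule records which source paragraphs it covers. The paragraph identifiers, the `rule_sources` mapping, and the function name are hypothetical.

```python
def coverage_audit(paragraphs, rule_sources, non_checkable=()) -> dict:
    """Compute the share of checkable paragraphs covered by at least one rule.

    paragraphs: all paragraph ids in the regulation, in reading order.
    rule_sources: rule id -> list of paragraph ids that rule covers.
    non_checkable: paragraph ids judged to be definitions or context.
    """
    skip = set(non_checkable)
    covered = {p for sources in rule_sources.values() for p in sources}
    checkable = [p for p in paragraphs if p not in skip]
    gaps = [p for p in checkable if p not in covered]
    coverage = 1.0 if not checkable else (len(checkable) - len(gaps)) / len(checkable)
    return {"coverage": coverage, "gaps": gaps, "meets_target": coverage >= 0.95}
```

The `gaps` list is the work queue: each entry is either reclassified as non-checkable or closed with a new rule.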
26
+
27
+ ### Atomicity Test
28
+ One rule = one pass/fail outcome. If a rule can produce two independent pass/fail results, it should be two rules. Ask: "Can this rule partially pass?" If yes, decompose further.
29
+
30
+ ### Ambiguity Minimization
31
+ No two rules should produce contradictory results on the same document. After extraction, review rule pairs that touch overlapping scope. If Rule A says pass and Rule B says fail for the same entity, their scope boundaries are unclear — fix them.
32
+
33
+ ### Downstream Anticipation
34
+ Rules will be distilled into workflows (see `skill-to-workflow`). Design with distillation in mind: clear input/output boundaries, explicit judgment criteria, minimal reliance on implicit domain knowledge. If a rule requires reading between the lines, make the interpretation explicit. Use `task-decomposition` to identify natural boundaries between rules.
35
+
36
+ ### Catalog Versioning
37
+ When rules change (additions, modifications, deprecations), version the entire rule catalog as a unit. Individual rule versions track specific rules; the catalog version tracks the coherent set. Record the catalog version in `versions.json` alongside individual rule versions.
38
+
39
+ ## Extraction Strategies
40
+
41
+ ### Strategy 1: Structured Input (Developer User Provides Rules)
42
+
43
+ When the developer user provides rules in xlsx, csv, or a structured document where each row/entry is a distinct rule with clear scope:
44
+ - Follow their structure exactly. Do not re-decompose.
45
+ - Map each row to a rule, preserving the developer user's identifiers.
46
+ - Ask clarifying questions only if entries are ambiguous.
47
+
48
+ ### Strategy 2: Hierarchical Extraction from Regulation Text
49
+
50
+ For raw regulation documents (PDF, DOCX, legal text):
51
+
52
+ 1. **Survey the document structure.** Read the table of contents or scan headers. Understand the hierarchy: parts, chapters, sections, articles, clauses.
53
+ 2. **Identify rule-bearing sections.** Not every section contains a verification rule. Some are definitions, some are procedural, some are context. Focus on sections that impose obligations, prohibitions, thresholds, or requirements.
54
+ 3. **Peel the onion.** Start at the highest structural level and work downward:
55
+ - Level 1: What major areas does the regulation cover? (e.g., capital adequacy, risk disclosure, governance)
56
+ - Level 2: Within each area, what are the specific chapters or sections?
57
+ - Level 3: Within each section, what are the individual requirements?
58
+ - Stop peeling when you reach atomic rules.
59
+ 4. **Handle cross-references.** Regulations love to say "as defined in Section X" or "subject to the conditions in Article Y." Resolve these by including the referenced content in the rule's description, not just the reference.
60
+ 5. **Handle compound rules.** "The report must include (a) risk factors, (b) financial projections, and (c) management discussion" — this is three rules, not one. Decompose unless the developer user specifically wants them grouped.
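
The compound-rule step above can be sketched as follows. This is a minimal illustration that assumes the enumeration uses lettered markers such as (a), (b), (c); the helper name `split_compound_rule` and the regex are hypothetical and not part of this skill. Real regulation text needs patterns validated against the actual documents.

```python
import re

# Hypothetical marker pattern: a single lowercase letter in parentheses.
ITEM_MARKER = re.compile(r"\(([a-z])\)\s*")

def split_compound_rule(text: str) -> list[str]:
    """Split 'X must include (a) ..., (b) ..., and (c) ...' into one rule per item."""
    # re.split with one capture group yields [stem, letter, item, letter, item, ...]
    parts = ITEM_MARKER.split(text)
    if len(parts) < 3:
        return [text.strip()]  # no enumeration found: treat as already atomic
    stem = parts[0].strip()
    rules = []
    for i in range(2, len(parts), 2):
        item = parts[i].strip().rstrip(".,;")
        # Drop a trailing ", and" left over from the list punctuation.
        item = re.sub(r"[,;]?\s+and$", "", item).strip()
        rules.append(f"{stem} {item}")
    return rules
```

Each returned string is only a candidate rule title; scope and judgment criteria still have to be written by hand, and grouped rules stay grouped if the developer user asks for that.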
61
+
62
+ For long documents (100+ pages), use the onion-peeler approach described in `references/chunking-strategies.md`. Do not try to read the entire document in one pass.
63
+
64
+ ### Strategy 3: Expert Notes
65
+
66
+ Sometimes rules come from the developer user's domain expertise rather than formal regulations:
67
+ - "We always check that the guarantor's signature matches the name on page 1."
68
+ - "If the collateral value is below 120% of the loan amount, flag it."
69
+
70
+ Capture these with the same rigor as formal regulation rules. They are equally important in the verification app.
71
+
72
+ ## Rule Catalog
73
+
74
+ Maintain a lightweight catalog of all extracted rules. This is your index, not the rules themselves (those live in skill folders). The catalog should track:
75
+
76
+ - Rule ID (simple sequential: R001, R002, ...)
77
+ - Rule title (one line)
78
+ - Source (which regulation document, which section)
79
+ - Status (extracted / skill-written / skill-tested / workflow-written / workflow-tested / production)
80
+ - Dependencies (rules whose results inform this one; metadata for ordering and analysis, since every rule must remain independently runnable)
81
+
82
+ Format: a simple markdown table or JSON file. Do not over-engineer this. The catalog exists to give you and the developer user an overview of progress.
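
One possible shape for the catalog, sketched under assumptions: the field names, the `CatalogEntry` class, and the helper functions are illustrative, not a format this skill prescribes. The status values are the ones listed above.

```python
import json
from dataclasses import dataclass, field, asdict

STATUSES = ("extracted", "skill-written", "skill-tested",
            "workflow-written", "workflow-tested", "production")

@dataclass
class CatalogEntry:
    rule_id: str                      # e.g. "R001"
    title: str                        # one line
    source: str                       # regulation document and section
    status: str = "extracted"
    dependencies: list[str] = field(default_factory=list)

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")

def next_rule_id(entries: list[CatalogEntry]) -> str:
    """Simple sequential IDs: R001, R002, ..."""
    return f"R{len(entries) + 1:03d}"

def to_markdown(entries: list[CatalogEntry]) -> str:
    lines = ["| ID | Title | Source | Status | Dependencies |",
             "|---|---|---|---|---|"]
    for e in entries:
        deps = ", ".join(e.dependencies) or "-"
        lines.append(f"| {e.rule_id} | {e.title} | {e.source} | {e.status} | {deps} |")
    return "\n".join(lines)

def to_json(entries: list[CatalogEntry]) -> str:
    return json.dumps([asdict(e) for e in entries], indent=2)
```

Either output works as the overview; the point is that the catalog stays an index and nothing more.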
83
+
84
+ ## Handling Ambiguity
85
+
86
+ Regulations are often ambiguous. When you encounter ambiguity:
87
+ 1. Extract the rule as you understand it.
88
+ 2. Note the ambiguity explicitly in the rule description.
89
+ 3. Ask the developer user for clarification.
90
+ 4. Update the rule after receiving clarification.
91
+
92
+ Do not skip ambiguous rules. They are often the most important ones.
93
+
94
+ ## When Rules Change
95
+
96
+ Regulations evolve. When the developer user adds new or updated regulation documents:
97
+ 1. Identify which existing rules are affected.
98
+ 2. Extract new rules or update existing ones.
99
+ 3. Mark affected workflows for re-testing.
100
+ 4. Use `version-control` to track the change.
@@ -0,0 +1,80 @@
1
+ # Chunking Strategies for Long Documents
2
+
3
+ When regulation documents exceed what you can process in a single pass, use these proven strategies to decompose them into manageable chunks while preserving semantic coherence.
4
+
5
+ ## The Onion Peeler (Primary Strategy)
6
+
7
+ Hierarchical header-based decomposition. Named because you peel the document layer by layer, from the outermost structure inward.
8
+
9
+ ### How It Works
10
+
11
+ 1. **Parse the document's header hierarchy.** Identify all headers by level (H1, H2, H3, etc. — or their equivalents in the document's formatting: "Part I", "Chapter 1", "Section 1.1", "Article 1").
12
+ 2. **Build a tree.** Each header becomes a node. Content between two headers belongs to the nearest preceding header, and each header is a child of the nearest preceding header at a higher level.
13
+ 3. **Check sizes.** Walk the tree. If a node's content (including all its children) fits within your processing limit, stop — this node is a chunk.
14
+ 4. **Split only when necessary.** If a node exceeds the limit, descend to its children. Only split when a node is too large AND has sub-headers to split on.
15
+ 5. **Leaf nodes that are still too large** get handled by the wedge-driving fallback (see below).
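
The five steps above can be sketched as a recursive walk. This is an illustration under assumptions: the `Node` class and `peel` function are hypothetical names, and a whitespace word count stands in for a real token count.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    text: str = ""                               # content directly under this header
    children: list["Node"] = field(default_factory=list)

    def size(self) -> int:
        # Assumption: word count approximates token count.
        return len(self.text.split()) + sum(c.size() for c in self.children)

    def full_text(self) -> str:
        return "\n".join([self.title, self.text] + [c.full_text() for c in self.children])

def peel(node: Node, limit: int, path: tuple = ()):
    """Yield (header_chain, text, needs_fallback) for each chunk."""
    chain = path + (node.title,)
    if node.size() <= limit:
        # Fits as a whole: this node, with all its children, is one chunk.
        yield chain, node.full_text(), False
    elif node.children:
        # Too large but splittable: keep any text directly under the header,
        # then descend into the sub-headers.
        if node.text.strip():
            yield chain, node.text, len(node.text.split()) > limit
        for child in node.children:
            yield from peel(child, limit, chain)
    else:
        # Oversized leaf: flag it for the wedge-driving fallback.
        yield chain, node.text, True
```

The header chain returned with each chunk is what lets downstream processing know where the content sits in the document.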
16
+
17
+ ### Why This Works
18
+
19
+ - Respects the document's own semantic structure. A "Chapter 3: Risk Disclosure" chunk contains exactly what the author intended that chapter to contain.
20
+ - Minimizes information loss. You never cut in the middle of a thought.
21
+ - Produces chunks of varying size — and that is fine. A short chapter is better as one chunk than split into artificial halves.
22
+
23
+ ### Pattern Discovery Shortcut
24
+
25
+ Before building a full parser, explore several sample documents for structural patterns:
26
+ - Do all chapter titles start with "Chapter X" or "第X章"?
27
+ - Are sections numbered consistently (1.1, 1.2, 1.3)?
28
+ - Are there visual markers (bold text, specific fonts, horizontal rules)?
29
+
30
+ If you find consistent patterns, a regex-based splitter is faster and more reliable than LLM-based structure detection. For example:
31
+ - `^第[一二三四五六七八九十百]+章` for Chinese chapter headers
32
+ - `^Chapter \d+` for English chapter headers
33
+ - `^\d+\.\d+` for numbered sections
34
+
35
+ Always validate the regex against multiple documents before committing to it.
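
A small sketch of that validation step, using the patterns listed above. The function names are hypothetical; the idea is to count header hits per sample document before trusting a pattern, then split on the matches.

```python
import re

def header_hit_rate(pattern: str, documents: list[str]) -> list[int]:
    """Number of header matches in each sample document."""
    rx = re.compile(pattern, re.MULTILINE)   # ^ must match at every line start
    return [len(rx.findall(doc)) for doc in documents]

def split_on_headers(pattern: str, text: str) -> list[str]:
    """Split text so that each chunk starts at a matched header."""
    rx = re.compile(pattern, re.MULTILINE)
    starts = [m.start() for m in rx.finditer(text)]
    if not starts:
        return [text]
    # Keep any preamble before the first header as its own chunk.
    bounds = ([0] if starts[0] > 0 else []) + starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]
```

A pattern that returns zero hits on some sample documents is a signal that those documents use a different convention and need their own pattern or the LLM-based path.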
36
+
37
+ ## Wedge Driving (Fallback Strategy)
38
+
39
+ For content without clear headers — dense legal text, continuous prose, or leaf nodes from the onion peeler that are still too large.
40
+
41
+ ### How It Works
42
+
43
+ The algorithm uses a **rolling context window** to process documents of arbitrary length without loading the full text at once.
44
+
45
+ **Step 1: Window the content.** Load up to MAX_TOKENS (e.g., 100K tokens — configurable) of the remaining unprocessed text into a window. If the remaining text fits in a single chunk, stop — no further splitting needed.
46
+
47
+ **Step 2: Ask an LLM for cut points.** Prompt the LLM to identify 1-3 natural break points within the window where topic or subject changes. For each cut point, the LLM returns:
48
+ - `tokens_before`: ~K tokens (default K=50) immediately BEFORE the cut, copied verbatim from the text.
49
+ - `tokens_after`: ~K tokens immediately AFTER the cut, copied verbatim.
50
+ - `chunk_title`: a 5-10 word title describing the chunk that precedes the cut.
51
+
52
+ Using token count (not word count) gives consistent granularity across languages — critical for Chinese text which has no whitespace-delimited words.
53
+
54
+ **Step 3: Locate the cuts via fuzzy matching.** The LLM's quoted tokens will not be a perfect match to the source text (minor paraphrasing, whitespace differences, encoding artifacts). Use Levenshtein distance (edit distance) to find the best match:
55
+ 1. Search the source text for the position that best matches `tokens_before`. Require at least 70% similarity (similarity = 1 - edit_distance / max_length).
56
+ 2. The cut position is immediately after the matched `tokens_before` region.
57
+ 3. Verify by checking that `tokens_after` appears near the cut position. If `tokens_after` cannot be matched, fall back to the position derived from `tokens_before` alone.
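
Step 3 can be sketched as below. Several things here are assumptions made for illustration: the function names, the character-level (rather than token-level) comparison, the fixed-width sliding window equal to the quote length, and the `slack` parameter for how near the cut `tokens_after` must appear. The pure-Python distance is slow on large windows; a dedicated edit-distance library would be the practical choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def locate_cut(source, tokens_before, tokens_after="", threshold=0.7, slack=20):
    """Return (cut_position, verified) or None if tokens_before cannot be matched."""
    n = len(tokens_before)
    best_end, best_sim = None, 0.0
    for start in range(0, max(1, len(source) - n + 1)):
        sim = similarity(source[start:start + n], tokens_before)
        if sim > best_sim:
            best_end, best_sim = start + n, sim
    if best_end is None or best_sim < threshold:
        return None
    verified = False
    if tokens_after:
        m = len(tokens_after)
        lo, hi = max(0, best_end - slack), min(len(source), best_end + slack)
        for start in range(lo, hi + 1):
            if similarity(source[start:start + m], tokens_after) >= threshold:
                verified = True
                break
    # Unverified cuts fall back to the position from tokens_before alone.
    return best_end, verified
```

Returning the `verified` flag separately lets the caller log how many cuts relied on the fallback position.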
58
+
59
+ **Step 4: Slide and repeat.** Create one chunk for each segment that ends at a confirmed cut, so no text between the first and last cut is dropped. Move the window forward: the new window starts at the last confirmed cut point. Repeat until all remaining text fits in a single chunk.
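
The rolling loop can be sketched as follows. This is one reading of Step 4: it emits one chunk per confirmed cut and restarts the window at the last one. The `propose_cuts` callable is hypothetical and stands for the LLM call plus fuzzy matching; character counts stand in for token counts, and the hard cut when no boundary is confirmed is an assumption the text does not specify.

```python
def wedge_chunks(text: str, propose_cuts, max_chars: int = 1000) -> list[str]:
    """Split text with a rolling window.

    propose_cuts(window) -> list of cut offsets inside the window
    (hypothetical wrapper around the LLM prompt and fuzzy matching).
    """
    chunks, pos = [], 0
    while len(text) - pos > max_chars:
        window = text[pos:pos + max_chars]
        cuts = sorted({c for c in propose_cuts(window) if 0 < c < len(window)})
        if not cuts:
            # Assumption: with no confirmed boundary, cut at the window edge
            # so the loop always makes progress.
            cuts = [len(window)]
        prev = 0
        for c in cuts:
            chunks.append(text[pos + prev:pos + c])
            prev = c
        pos += prev                      # next window starts at the last cut
    chunks.append(text[pos:])            # remainder fits in a single chunk
    return chunks
```

Because chunks are slices of the original text, joining them always reproduces the source exactly.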
60
+
61
+ ### Why This Works
62
+
63
+ - The LLM identifies semantic boundaries, not arbitrary character counts.
64
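+ - The LLM never regenerates chunk text; it only quotes short anchors that locate a position. Chunk content is always copied from the source, so a misquoted anchor can shift or lose a cut but cannot introduce hallucinated text into a chunk.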
65
+ - K-token quoting with Levenshtein matching is language-agnostic. It works for Chinese, English, and mixed-language documents equally well.
66
+ - The rolling window means documents of any length can be processed incrementally: only the current window has to fit in the model's context, not the whole document.
67
+ - Fuzzy matching handles the inevitable small differences between the LLM's quoted text and the actual source.
68
+
69
+ ### When to Use
70
+
71
+ - Only when the onion peeler cannot split further (no sub-headers available).
72
+ - For documents with no structural markup at all.
73
+ - Cost consideration: this requires LLM calls. Use the cheapest model that can identify topic boundaries (often TIER3 or TIER4 is sufficient).
74
+
75
+ ## Practical Guidelines
76
+
77
+ - **Chunk size depends on the downstream task.** For rule extraction by the coding agent, chunks can be large (100K+ tokens). For worker LLM processing, chunks must fit in 16K-32K context.
78
+ - **Preserve context.** When splitting, include the parent header chain as context. A chunk from "Part II > Chapter 3 > Section 3.2" should include those headers so downstream processing knows where the content belongs.
79
+ - **Cache the tree.** Once a document's structure is parsed, save the tree. Multiple rules may need content from the same document, and re-parsing is wasteful.
80
+ - **Log your chunking decisions.** Which strategy was used, how many chunks were produced, their sizes. This helps debug downstream issues.
@@ -0,0 +1,118 @@
1
+ ---
2
+ name: rule-graph
3
+ description: Build and maintain a graph of relationships between verification rules — shared entities, logical dependencies, and conflicts. Use when analyzing the impact of a regulation change, when optimizing extraction to avoid duplicate work, when checking rule catalog completeness, or when rolling up document-level results into a summary. Critical constraint — the graph is an overlay for analysis, NOT a prerequisite for execution. Every rule must remain independently runnable.
4
+ ---
5
+
6
+ # Rule Graph
7
+
8
+ Rules do not exist in isolation. They share entities, depend on each other's results, and sometimes contradict each other. The rule graph makes these relationships explicit so you can reason about the rule catalog as a system rather than a flat list.
9
+
10
+ But a graph can become a trap. If rules cannot run without the graph, you have built a monolith. The graph is a lens for analysis — never a gate for execution.
11
+
12
+ ## Independence First
13
+
14
+ **CRITICAL: Every rule MUST have a self-sufficient workflow.** This is not a suggestion. It is the single most important constraint in the entire rule graph design.
15
+
16
+ Enterprise systems integrate at the individual rule level. A bank's existing compliance platform calls Rule R001 to check capital adequacy, independently of any other rule. If R001 requires R003 to run first, the integration breaks. If R001 requires the graph to be loaded, the integration breaks.
17
+
18
+ The graph is an overlay. It exists for the coding agent's analysis and optimization. It does NOT create hard dependencies between rules. Any rule, extracted from the graph and dropped into a standalone system, must produce correct results with only its own inputs.
19
+
20
+ When you find a genuine dependency (Rule B only makes sense if Rule A has already been evaluated), encode it as metadata in the graph. The workflow for Rule B should handle the case where Rule A's result is unavailable — by computing it inline, requesting it as an optional input, or marking its own result as incomplete.
21
+
22
+ ## What the Graph Captures
23
+
24
+ ### Shared Entities
25
+
26
+ Rules that extract the same entity from the same document region. R001 (capital adequacy) and R004 (tier-1 capital ratio) both need the capital figures from the balance sheet. The graph records this so an optimizer can extract once and share.
27
+
28
+ But shared extraction is an optimization, not a requirement. Each rule's workflow must include its own extraction logic as the default path. The shared extraction is a fast path that the system can use when rules run together.
29
+
30
+ ### Logical Dependencies
31
+
32
+ Rule B applies only if Rule A passes. "If the borrower is classified as high-risk (R012), then enhanced due diligence documents are required (R013)." The graph captures this so that:
33
+ - When presenting results, R013 can be shown in context of R012.
34
+ - When R012's logic changes, you know R013 may be affected.
35
+
36
+ The workflow for R013 must still function if R012's result is not available — it should either compute the high-risk classification itself or flag that it cannot determine applicability.
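
A sketch of how R013's workflow could treat R012's result as an optional input. Everything specific here is hypothetical: the `risk_score` and `edd_documents` fields, the 0.8 threshold, and the status strings are placeholders chosen for illustration.

```python
from typing import Optional

def is_high_risk(document: dict) -> Optional[bool]:
    """Inline fallback for R012's classification (hypothetical field and threshold)."""
    score = document.get("risk_score")
    return None if score is None else score >= 0.8

def check_r013(document: dict, r012_result: Optional[bool] = None) -> dict:
    """Enhanced due diligence check that runs with or without R012's result."""
    # Fast path: use R012's result when the caller supplies it.
    # Default path: compute the classification inline.
    high_risk = r012_result if r012_result is not None else is_high_risk(document)
    if high_risk is None:
        return {"rule": "R013", "status": "incomplete",
                "reason": "cannot determine applicability"}
    if not high_risk:
        return {"rule": "R013", "status": "not_applicable"}
    has_docs = bool(document.get("edd_documents"))
    return {"rule": "R013", "status": "pass" if has_docs else "fail"}
```

Called alone with just a document, the function still returns a usable result, which is the property the integration at the individual rule level depends on.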
37
+
38
+ ### Conflicts
39
+
40
+ Two rules that can produce contradictory guidance. Regulation A requires disclosure of all related-party transactions; Regulation B's privacy provisions restrict disclosure of certain transaction details. The graph marks this conflict so the coding agent can escalate to the developer user rather than silently choosing one rule over the other.
41
+
42
+ ### Shared Corner Cases
43
+
44
+ Edge cases that affect multiple rules. A document with an unusual structure (merged cells in a table, non-standard date format) may cause extraction failures across several rules. The graph links these rules to the shared corner case so a fix in one propagates awareness to others.
45
+
46
+ ## Four Uses
47
+
48
+ ### 1. Impact Analysis
49
+
50
+ A regulation changes. Which rules are affected? Traverse the graph from the changed rule:
51
+ - Direct edges: rules that share entities or have dependencies.
52
+ - Second-degree edges: rules that depend on affected rules.
53
+ - Conflict edges: rules whose conflict resolution may need revisiting.
54
+
55
+ Without the graph, impact analysis requires reading every rule to check for overlap. With the graph, it is a traversal.
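
The traversal can be sketched as a breadth-first search over the edge list. The function name and the direction handling are assumptions: here `depends_on` edges propagate impact only from source to target, while the other edge types are treated as symmetric.

```python
from collections import deque

SYMMETRIC = {"shares_entity", "conflicts_with", "shares_corner_case"}

def affected_rules(edges: list[dict], changed: str, max_depth: int = 2) -> dict:
    """Map each affected rule to its distance (in edges) from the changed rule."""
    neighbors: dict[str, set] = {}
    for e in edges:
        neighbors.setdefault(e["from"], set()).add(e["to"])
        if e["type"] in SYMMETRIC:
            neighbors.setdefault(e["to"], set()).add(e["from"])
    depth = {changed: 0}
    queue = deque([changed])
    while queue:
        rule = queue.popleft()
        if depth[rule] == max_depth:
            continue                      # stop at second-degree edges by default
        for nxt in sorted(neighbors.get(rule, ())):
            if nxt not in depth:
                depth[nxt] = depth[rule] + 1
                queue.append(nxt)
    del depth[changed]
    return depth
```

Rules at distance 1 are the direct edges and rules at distance 2 are the second-degree edges described above.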
56
+
57
+ ### 2. Optimization
58
+
59
+ When processing a batch of documents against many rules, the graph identifies:
60
+ - **Shared extraction**: extract entity X once, use it in rules R001, R004, R007.
61
+ - **Execution ordering**: if R012 runs before R013, R013 can use R012's result as a shortcut (but must not require it).
62
+ - **Parallel groups**: rules with no edges between them can run in parallel.
63
+
64
+ The optimization is opportunistic. It makes batch processing faster. It must never make individual rules dependent on the batch context. A rule pulled out of the batch and run alone must still work.
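
One way to find parallel groups is to compute connected components of the graph, sketched here with a small union-find. The function name is hypothetical, and treating every edge type as a reason to keep two rules in the same group is an assumption; rules in different groups have no edges between them and can be scheduled in parallel.

```python
def parallel_groups(rule_ids: list[str], edges: list[dict]) -> list[list[str]]:
    """Group rules so that no edge crosses between two groups."""
    parent = {r: r for r in rule_ids}

    def find(r: str) -> str:
        while parent[r] != r:
            parent[r] = parent[parent[r]]   # path halving
            r = parent[r]
        return r

    for e in edges:
        # Ignore edges that mention rules outside this batch.
        if e["from"] in parent and e["to"] in parent:
            parent[find(e["from"])] = find(e["to"])

    groups: dict[str, list[str]] = {}
    for r in rule_ids:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())
```

The grouping only affects scheduling. Each rule in a group keeps its own extraction path, so removing it from the batch changes nothing about how it runs.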
65
+
66
+ ### 3. Completeness Checking
67
+
68
+ Map regulation paragraphs to rules. The graph helps identify:
69
+ - **Uncovered sections**: regulation paragraphs that no rule addresses.
70
+ - **Over-covered sections**: paragraphs addressed by multiple overlapping rules (potential redundancy).
71
+ - **Coverage target**: aim for 95% of substantive regulation paragraphs mapped to at least one rule.
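
The three checks above reduce to a small report over a paragraph-to-rule mapping. The function name and input shapes are assumptions for illustration: `paragraph_to_rules` maps a paragraph identifier to the rule IDs that cover it, and `substantive` is the set of paragraphs that are checkable at all.

```python
def coverage_report(paragraph_to_rules: dict, substantive: set, target: float = 0.95) -> dict:
    """Summarize uncovered and over-covered paragraphs against the coverage target."""
    uncovered = sorted(p for p in substantive if not paragraph_to_rules.get(p))
    over_covered = sorted(p for p in substantive
                          if len(paragraph_to_rules.get(p, [])) > 1)
    covered = len(substantive) - len(uncovered)
    # An empty set of substantive paragraphs counts as fully covered.
    ratio = covered / len(substantive) if substantive else 1.0
    return {
        "coverage": ratio,
        "meets_target": ratio >= target,
        "uncovered": uncovered,
        "over_covered": over_covered,
    }
```

Over-covered paragraphs are only candidates for redundancy; two rules may legitimately check different requirements in the same paragraph.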
72
+
73
+ ### 4. Document-Level Rollup
74
+
75
+ When presenting results for a document checked against many rules, use the graph to:
76
+ - **Group related results**: show capital adequacy rules together, show disclosure rules together.
77
+ - **Order by dependency**: show R012 (risk classification) before R013 (enhanced due diligence).
78
+ - **Highlight conflicts**: when two rules produce contradictory results, surface the conflict rather than burying it in a flat list.
79
+ - **Aggregate severity**: if five related rules all fail, that cluster may be more significant than five unrelated failures.
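
A sketch of the grouping, conflict, and severity parts of the rollup. The function name, the result strings ("pass", "fail"), and the output shape are assumptions; dependency ordering is left out to keep the example short.

```python
from collections import defaultdict

def rollup(results: dict, nodes: dict, edges: list[dict]) -> dict:
    """Group per-rule results by category and surface contradictory conflict pairs."""
    by_category = defaultdict(list)
    for rule_id, outcome in sorted(results.items()):
        by_category[nodes[rule_id]["category"]].append((rule_id, outcome))

    # Failure count per category as a simple stand-in for cluster severity.
    failures = {cat: sum(1 for _, o in rs if o == "fail")
                for cat, rs in by_category.items()}

    # A conflict is surfaced when both rules ran and their outcomes differ.
    conflicts = [(e["from"], e["to"]) for e in edges
                 if e["type"] == "conflicts_with"
                 and e["from"] in results and e["to"] in results
                 and results[e["from"]] != results[e["to"]]]

    return {"groups": dict(by_category),
            "failures_by_category": failures,
            "conflicts": conflicts}
```

The rollup reads the graph only for presentation. The per-rule results it receives were produced independently.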
80
+
81
+ ## Representation
82
+
83
+ The graph is stored as a JSON adjacency list in the rule catalog directory.
84
+
85
+ ```json
86
+ {
87
+ "nodes": {
88
+ "R001": {"name": "Capital adequacy ratio", "category": "capital"},
89
+ "R004": {"name": "Tier-1 capital ratio", "category": "capital"},
90
+ "R012": {"name": "Borrower risk classification", "category": "risk"},
91
+ "R013": {"name": "Enhanced due diligence", "category": "due_diligence"}
92
+ },
93
+ "edges": [
94
+ {"from": "R001", "to": "R004", "type": "shares_entity", "entity": "capital_figures"},
95
+ {"from": "R012", "to": "R013", "type": "depends_on", "condition": "R012 result is high-risk"},
96
+ {"from": "R008", "to": "R015", "type": "conflicts_with", "detail": "Disclosure vs privacy"}
97
+ ]
98
+ }
99
+ ```
100
+
101
+ **Edge types:**
102
+ - `shares_entity`: Both rules extract the same entity. Optimization opportunity.
103
+ - `depends_on`: Target rule's applicability depends on source rule's result. Metadata only — not a hard execution dependency.
104
+ - `conflicts_with`: Rules may produce contradictory guidance. Requires escalation logic.
105
+ - `shares_corner_case`: Both rules are affected by the same edge case pattern.
106
+
107
+ The graph is updated whenever the rule catalog changes — new rules added, rules modified, rules retired. It is visualizable through `dashboard-reporting` as a node-link diagram.
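
Because the graph is edited every time the catalog changes, a quick consistency check is useful. The sketch below assumes the JSON layout shown above; the function name `validate_graph` is hypothetical. It reports edges with an unknown type and edges that point at rules missing from `nodes`.

```python
import json

EDGE_TYPES = {"shares_entity", "depends_on", "conflicts_with", "shares_corner_case"}

def validate_graph(graph: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the graph is consistent."""
    problems = []
    nodes = graph.get("nodes", {})
    for i, edge in enumerate(graph.get("edges", [])):
        if edge.get("type") not in EDGE_TYPES:
            problems.append(f"edge {i}: unknown type {edge.get('type')!r}")
        for end in ("from", "to"):
            if edge.get(end) not in nodes:
                problems.append(f"edge {i}: {end} rule {edge.get(end)!r} is not in nodes")
    return problems

def load_and_validate(text: str) -> list[str]:
    return validate_graph(json.loads(text))
```

Running this after each catalog update catches edges left behind by retired rules before they reach the dashboard.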
108
+
109
+ ## Integration
110
+
111
+ **Input:** Rule catalog from `rule-extraction`. The graph is built after rules are extracted and updated as rules evolve.
112
+
113
+ **Enriches:**
114
+ - `skill-authoring`: When writing a new rule skill, check the graph for related rules. Reference shared entities and dependencies in the skill's SKILL.md.
115
+ - `quality-control`: When a rule's accuracy drops, check the graph for downstream rules that may also be affected.
116
+
117
+ **Feeds:**
118
+ - `dashboard-reporting`: Graph visualization, impact analysis results, coverage metrics.