kc-beta 0.7.5 → 0.8.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (81)
  1. package/README.md +47 -0
  2. package/package.json +3 -2
  3. package/src/agent/context.js +17 -1
  4. package/src/agent/engine.js +467 -100
  5. package/src/agent/llm-client.js +24 -1
  6. package/src/agent/pipelines/_advance-hints.js +92 -0
  7. package/src/agent/pipelines/_milestone-derive.js +325 -20
  8. package/src/agent/pipelines/skill-authoring.js +49 -3
  9. package/src/agent/tools/agent-tool.js +2 -2
  10. package/src/agent/tools/consult-skill.js +15 -0
  11. package/src/agent/tools/dashboard-render.js +48 -1
  12. package/src/agent/tools/document-parse.js +31 -2
  13. package/src/agent/tools/phase-advance.js +17 -13
  14. package/src/agent/tools/release.js +343 -7
  15. package/src/agent/tools/sandbox-exec.js +65 -8
  16. package/src/agent/tools/worker-llm-call.js +95 -15
  17. package/src/agent/workspace.js +25 -4
  18. package/src/cli/components.js +4 -1
  19. package/src/cli/index.js +125 -8
  20. package/src/config.js +19 -2
  21. package/src/marathon/driver.js +217 -0
  22. package/src/marathon/prompts.js +93 -0
  23. package/template/.env.template +17 -1
  24. package/template/AGENT.md +2 -2
  25. package/template/skills/en/auto-model-selection/SKILL.md +55 -35
  26. package/template/skills/en/bootstrap-workspace/SKILL.md +27 -0
  27. package/template/skills/en/compliance-judgment/SKILL.md +14 -0
  28. package/template/skills/en/confidence-system/SKILL.md +30 -8
  29. package/template/skills/en/corner-case-management/SKILL.md +53 -33
  30. package/template/skills/en/cross-document-verification/SKILL.md +88 -83
  31. package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
  32. package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  33. package/template/skills/en/data-sensibility/SKILL.md +19 -12
  34. package/template/skills/en/document-chunking/SKILL.md +99 -15
  35. package/template/skills/en/entity-extraction/SKILL.md +14 -4
  36. package/template/skills/en/quality-control/SKILL.md +23 -0
  37. package/template/skills/en/rule-extraction/SKILL.md +92 -94
  38. package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
  39. package/template/skills/en/skill-authoring/SKILL.md +85 -2
  40. package/template/skills/en/skill-creator/SKILL.md +25 -3
  41. package/template/skills/en/skill-to-workflow/SKILL.md +73 -1
  42. package/template/skills/en/task-decomposition/SKILL.md +1 -1
  43. package/template/skills/en/tree-processing/SKILL.md +1 -1
  44. package/template/skills/en/version-control/SKILL.md +15 -0
  45. package/template/skills/en/work-decomposition/SKILL.md +52 -32
  46. package/template/skills/phase_skills.yaml +5 -0
  47. package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
  48. package/template/skills/zh/bootstrap-workspace/SKILL.md +27 -0
  49. package/template/skills/zh/compliance-judgment/SKILL.md +51 -37
  50. package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
  51. package/template/skills/zh/confidence-system/SKILL.md +34 -9
  52. package/template/skills/zh/corner-case-management/SKILL.md +71 -104
  53. package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
  54. package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
  55. package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
  56. package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
  57. package/template/skills/zh/data-sensibility/SKILL.md +13 -0
  58. package/template/skills/zh/document-chunking/SKILL.md +101 -18
  59. package/template/skills/zh/document-parsing/SKILL.md +65 -65
  60. package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
  61. package/template/skills/zh/entity-extraction/SKILL.md +78 -68
  62. package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
  63. package/template/skills/zh/quality-control/SKILL.md +23 -0
  64. package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
  65. package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
  66. package/template/skills/zh/rule-extraction/SKILL.md +199 -188
  67. package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
  68. package/template/skills/zh/skill-authoring/SKILL.md +136 -58
  69. package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
  70. package/template/skills/zh/skill-creator/SKILL.md +215 -201
  71. package/template/skills/zh/skill-creator/references/schemas.md +60 -60
  72. package/template/skills/zh/skill-to-workflow/SKILL.md +73 -1
  73. package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
  74. package/template/skills/zh/task-decomposition/SKILL.md +1 -1
  75. package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
  76. package/template/skills/zh/tree-processing/SKILL.md +67 -63
  77. package/template/skills/zh/version-control/SKILL.md +15 -0
  78. package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
  79. package/template/skills/zh/work-decomposition/SKILL.md +52 -30
  80. package/template/workflows/common/llm_client.py +168 -0
  81. package/template/workflows/common/utils.py +132 -0
package/template/skills/en/corner-case-management/SKILL.md
@@ -1,22 +1,35 @@
  ---
  name: corner-case-management
  tier: meta
- description: Identify, catalog, and handle corner cases that do not fit the mainstream verification workflow. Use when the evolution loop classifies a failure as a corner case (affecting less than ~10% of documents), when adding a new edge case to the registry, or when deciding whether a corner case should be promoted to a systemic fix. Also use when designing the corner case detection mechanism for a workflow.
+ description: Identify, catalog, and handle corner cases that do not fit the mainstream verification workflow. Use AFTER several rounds of skill/workflow iteration have surfaced documents that genuinely don't fit and can't be accommodated by reasonable changes to the standard flow. Covers when something is a corner case vs. when a systemic fix is needed, how to register it, how to wire detection-and-resolve so the corner case stays out of the hot path until a similar document appears again, and when to promote a corner case to a regular rule.
  ---

  # Corner Case Management

- A good workflow handles 90% of cases cleanly. Corner cases are the other 10%. They are individually rare but collectively significant. The key insight: do NOT patch the main workflow to handle them. That leads to spaghetti logic, fragile code, and regressions.
+ A good workflow handles the bulk of cases cleanly. Corner cases are documents that don't fit: individually rare, collectively non-negligible. The key insight: do NOT patch the main workflow to handle them. That leads to spaghetti logic, fragile code, and regressions.

- Instead, maintain a separate registry. Check incoming documents against the registry before the standard workflow. Handle matches with specific resolutions.
+ Instead, maintain a separate registry. Check incoming documents against the registry; handle matches with specific resolutions; keep the main workflow lean.
+
+ ## When to reach for this skill
+
+ Corner cases are identified **after** the testing iterations, not during initial design. The typical pathway:
+
+ 1. You've built a skill/workflow for a rule.
+ 2. You've run it on the sample set. Most cases pass; some fail.
+ 3. You iterate — change code, change prompts, run back-tests.
+ 4. After several rounds, some cases stubbornly don't fit. Either:
+ - The current workflow doesn't catch them and changing the workflow to catch them would cost too much (breaks unrelated cases, adds too much complexity, requires architectural rework).
+ - The cases are swinging — one round of changes fits them, the next round breaks them, and oscillation continues.
+
+ That's when you use this skill. Not on the first iteration. Not when you can plausibly fix it in the workflow. Only when fitting the case into the standard flow costs more than the case is worth.

  ## Philosophy

- Corner cases are a fact of life in document verification. Financial documents are produced by thousands of different organizations, each with their own formatting quirks, templates, and interpretations of regulations. No workflow will handle every variant.
+ Corner cases are a fact of life in document verification. Financial documents come from thousands of different organizations, each with their own formatting quirks, templates, and interpretations of regulations. No workflow will handle every variant.

- The question is not "how do I eliminate corner cases?" but "how do I manage them efficiently?"
+ The question is not "how do I eliminate corner cases?" but "how do I manage them efficiently without polluting the main path?"

- The answer is: separate them from the main logic. Keep the workflow clean. Keep the corner cases cataloged, detectable, and resolvable.
+ The answer is: separate them. Keep the main workflow clean. Keep the corner cases cataloged, detectable, and resolvable when they recur.

  ## The Corner Case Registry

@@ -39,7 +52,7 @@ A structured file (`corner_cases.json`) in the rule skill's `assets/` directory:
  "action": "Multiply extracted value by 100 before threshold comparison",
  "code_snippet": "if value < 1.0: value *= 100"
  },
- "discovered_at": "2026-04-01",
+ "discovered_at": "<date>",
  "iteration": 3,
  "status": "active"
  }
@@ -53,60 +66,67 @@ Each entry captures:
  - **When** it was discovered and in which iteration.
  - **Status**: active, promoted (moved to main workflow), or deprecated.
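For orientation, here is a sketch of what one complete registry entry might look like as a Python literal. Only the `resolution`, `discovered_at`, `iteration`, and `status` fields appear in the snippet above; the `id`, `description`, and `detection` fields are illustrative assumptions, not a schema defined by the package.

```python
# Illustrative corner-case entry. Field names other than "resolution",
# "discovered_at", "iteration", and "status" are hypothetical.
example_entry = {
    "id": "cc-ratio-reported-as-fraction",        # hypothetical identifier
    "description": "Ratios reported as fractions (0.07) instead of percentages (7%)",
    "detection": {                                # hypothetical detection block
        "type": "regex",
        "pattern": r"ratio[^%\n]{0,40}\b0\.\d{1,4}\b",
    },
    "resolution": {
        "action": "Multiply extracted value by 100 before threshold comparison",
        "code_snippet": "if value < 1.0: value *= 100",
    },
    "discovered_at": "<date>",
    "iteration": 3,
    "status": "active",  # active | promoted | deprecated
}
```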

- ## Detection During Execution
+ ## How to handle corner cases — lazy retrieval with a high bar
+
+ Corner cases live in the registry, not in the main verification path. They're **lazily loaded** with a **high retrieval threshold**, meaning:

- Before running the standard workflow on a document:
+ - The main workflow runs first on every document.
+ - The registry is consulted only when a new incoming document resembles a previously-collected corner case strongly enough to trigger that entry.
+ - The "strongly enough" bar should be set high — false matches drag in a wrong resolution and pollute the main result. Better to miss a match (the main workflow handles it, possibly imperfectly) than to apply an irrelevant corner-case patch.

- 1. Load the corner case registry for the relevant rule.
- 2. Check the document against each active corner case's detection pattern.
- 3. If a match exceeds the confidence threshold, apply the specific resolution instead of (or in addition to) the standard workflow.
- 4. Log that a corner case was triggered.
+ Two detection patterns work for this:

- This is similar to a RAG pipeline with progressive disclosure:
- - The registry is the knowledge base.
- - Detection patterns are the retrieval queries.
- - High confidence thresholds prevent false matches.
- - Only relevant corner cases are loaded into context.
+ - **Cheap deterministic match**: regex or keyword fingerprint over document text. Fast, run on every document.
+ - **Embedding-based similarity**: compute embedding of the document's rule-relevant region, compare to embeddings of registered corner cases. Catches semantic resemblance the regex misses; costs more.

- ## When to Add a Corner Case
+ For early-stage rules with a small registry, deterministic detection is enough. For mature rules with dozens of registered cases, embedding-based similarity scales better.
+
+ When a match fires:
+ 1. Log that the corner case was triggered (audit visibility).
+ 2. Apply the registered resolution.
+ 3. Skip / supplement / override the standard workflow as the resolution prescribes.
+
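To make the lazy, high-bar lookup concrete, here is a minimal Python sketch. It assumes a `corner_cases.json` shaped like the entry sketched earlier; the function names and the commented usage are illustrative, not part of the kc-beta package.

```python
import json
import re
from pathlib import Path

def load_active_entries(registry_path: Path) -> list:
    """Load the registry and keep only active entries (promoted/deprecated are skipped)."""
    entries = json.loads(registry_path.read_text(encoding="utf-8"))
    return [e for e in entries if e.get("status") == "active"]

def match_corner_case(doc_text: str, entries: list):
    """Cheap deterministic pass: an entry fires only if its regex fingerprint hits.
    Returning None means "no match", and the standard workflow proceeds untouched."""
    for entry in entries:
        detection = entry.get("detection", {})
        if detection.get("type") == "regex" and re.search(detection["pattern"], doc_text):
            return entry
    return None

# Usage sketch: consult the registry first, then fall back to the standard workflow.
# entries = load_active_entries(Path("assets/corner_cases.json"))
# hit = match_corner_case(document_text, entries)
# if hit is not None:
#     log(f"corner case triggered: {hit['id']}")   # audit visibility
#     apply_resolution(hit["resolution"])          # registered, self-contained fix
# else:
#     run_standard_workflow(document_text)
```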
+ ## When to add a corner case

  Add a corner case when:
- - The evolution loop classifies a failure as non-systemic (affects <10% of documents).
- - The failure has a recognizable, describable pattern.
- - The resolution is clear and self-contained.
+ - Iterations of the standard workflow can't accommodate the document without disproportionate cost.
+ - The failure has a recognizable, describable pattern (a fingerprint to detect again).
+ - The resolution is clear and self-contained (not "rewrite the workflow").

  Do NOT add a corner case when:
- - The failure affects many documents (that is a systemic issue; fix the workflow).
- - The failure has no discernible pattern (that may be a data quality issue; escalate to developer user).
- - The resolution would require changing the core judgment logic (that belongs in the main workflow).
+ - The failure affects many documents: that's not a corner case, it's a pattern, and the main workflow should change.
+ - The failure has no discernible pattern: that may be a data quality issue; escalate to the developer user.
+ - The resolution would require changing the core judgment logic: that belongs in the main workflow.

- ## When to Promote a Corner Case
+ ## When to promote a corner case

  A corner case should be promoted to the main workflow (i.e., the resolution becomes part of the standard logic) when:
- - It starts appearing in >10% of documents. It is no longer a corner case — it is a pattern.
+ - It starts appearing across many documents. It's no longer a corner case — it's a pattern.
  - Multiple similar corner cases suggest a common underlying issue.
  - The developer user explicitly says "this is how it always works."

  When promoting, remove the corner case from the registry and update the workflow. Version both changes.

- ## Human Visibility
+ ## Human visibility

  The corner case registry must be readable and manageable by the developer user:
+
  - Format it clearly (JSON or a markdown table).
- - Include enough context that a domain expert can understand each case without reading the code.
+ - Include enough context that a domain expert can understand each case without reading code.
  - Report new corner cases in the dashboard.
  - Allow the developer user to add corner cases from their own expertise.

- Developer users often know about edge cases that the coding agent has not yet encountered. They should be able to add entries like:
+ Developer users often know about edge cases that the agent has not yet encountered. They should be able to add entries like:
  - "Bank XYZ always uses a different template for their Q4 reports."
  - "Mutual fund documents from before 2020 follow the old regulation format."

  These are valuable inputs that prevent future failures.

- ## Corner Case Cost
+ ## Corner case cost

  Every corner case has a runtime cost: the detection check runs on every document. Keep the registry lean:
+
  - Remove deprecated corner cases.
  - Merge similar corner cases into a single entry with a broader pattern.
- - Keep detection patterns efficient (prefer regex over LLM-based detection).
- - Monitor the registry size. If it grows beyond ~50 entries for a single rule, that suggests the workflow itself needs improvement.
+ - Keep detection patterns efficient (prefer regex over LLM-based detection where possible).
+ - Monitor the registry size. When it grows large for a single rule, that's a signal the workflow itself needs improvement — many of those "corner" cases probably want promotion to mainline handling.
package/template/skills/en/cross-document-verification/SKILL.md
@@ -1,132 +1,137 @@
  ---
  name: cross-document-verification
  tier: meta
- description: Perform case-level analysis across multiple documents for the same transaction. Use when documents do not exist in isolation: main contracts have appendices, loan applications come bundled with income certificates, bank statements, credit reports, and property appraisals. Use to build comparison matrices, detect contradictions (hard mismatches and soft implausibilities), classify severity, and flag fraud signals. Also use when user or end-user reports a cross-document inconsistency; these reports are ground truth and take priority over agent judgment.
+ description: Build cross-document verification rule-skills and workflows, i.e., rules where the verdict depends on facts that appear in MORE than one document. Use when authoring or distilling a rule that requires comparing entities/values across documents in a case (main contract + appendices, loan application + income certificate + bank statement, etc.). Covers what the rule definition must specify (source-doc entity, target-doc entity, consistency level), how the resulting check.py/workflow walks the case, contradiction types to handle, and severity classification. Distinct from `compliance-judgment` (within-doc rules); KC is the builder, not the executor.
  ---

- # Cross-Document Verification
+ # Cross-Document Verification — Building Rules That Span Documents

- Single-document verification asks: does this document comply with the rules? Cross-document verification asks a different question: do all the documents in this transaction tell a consistent story?
+ KC's job in this phase is **building** verification rules that span multiple documents — not executing cross-doc checks at runtime. The output is a rule-skill / workflow that, once shipped, can be applied by KC or downstream systems to any case.

- This is the difference between a document checker and a case analyst. A document checker reviews files one at a time. A case analyst lays them all on the table and looks for the threads that connect — or contradict — each other. When you activate cross-document verification, you are upgrading the system from checker to analyst.
+ If a rule's verdict depends on facts in more than one document, that fact has to be encoded in the rule itself. You can't build a within-doc check.py and hope the cross-doc check happens by magic at production time.

- ## The Case Concept
+ ## What the rule definition must specify

- A case is the set of documents that belong to one transaction, one borrower, or one deal. Documents within a case share entities: names, dates, amounts, identifiers. These shared entities are your anchors for comparison.
+ For a cross-doc rule, the catalog entry needs more than just "description + falsifiability_statement". It needs an **explicit cross-doc trajectory**:

- Two common case patterns:
+ 1. **Starting document class**: which document do we start from? How do we classify a document as belonging to this class? (e.g., "loan application form", "main contract", "annual disclosure report")
+ 2. **Anchor entity in the starting doc**: what do we extract first? (e.g., applicant ID, borrower name, contract reference number, reported total amount)
+ 3. **Target document class(es)**: which document(s) must we cross-check against? How do we identify them within the same case? (shared anchor key, naming pattern, explicit reference in the starting doc)
+ 4. **Target entity in each target doc**: what do we extract from each target doc to compare? It may or may not have the same field name — "monthly income" on the application vs. "average deposit" on the bank statement.
+ 5. **Consistency level for PASS**: when do we say the documents agree? Exact equality? Equality within tolerance? Plausibility check (within 1.5× of the other value)? Reciprocal-reference present?
+ 6. **What FAIL looks like**: hard mismatch, soft contradiction, or missing reciprocal reference — and what severity each carries.

- 1. **Main contract + appendices/supplements.** Same issuer, formally cross-referenced. The main contract states the total, the appendices break it down. The main contract references Appendix B; Appendix B must exist and match. The team has extensive experience with inconsistencies in this pattern — totals that do not match line item sums, referenced appendices that are missing or outdated, version mismatches between main body and supplements.
+ Without these anchors, the rule is just a wish. With them, the resulting check.py can walk the case deterministically.
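As an illustration only, a hypothetical catalog entry for an income-consistency rule could encode these points roughly as follows. Every field name here is an assumption made for the example, not a schema the package defines.

```python
# Hypothetical cross-doc rule entry; all field names are illustrative.
income_consistency_rule = {
    "rule_id": "cross_doc_income_consistency",
    "starting_doc_class": "loan_application",        # 1. where the walk starts
    "anchor_entity": "applicant_id",                 # 2. extracted first, identifies the case
    "source_entity": "declared_monthly_income",      #    value claimed in the starting doc
    "target_doc_classes": ["income_certificate", "bank_statement"],   # 3. docs to cross-check
    "target_entities": {                             # 4. what to compare, per target class
        "income_certificate": "monthly_income",
        "bank_statement": "average_monthly_deposit",
    },
    "pass_criteria": {                               # 5. consistency level for PASS
        "comparison": "plausibility",
        "max_ratio": 1.5,                            # claimed within 1.5x of observed
    },
    "fail_semantics": {                              # 6. what FAIL looks like, with severity
        "hard_mismatch": "FAIL / high",
        "soft_contradiction": "WARNING / medium",
        "missing_target_doc": "NOT_APPLICABLE",
    },
}
```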

- 2. **Multi-source bundle.** Loan application + income certificate + bank statement + credit report + property appraisal. Different issuers, different formats, same borrower. The applicant's name, ID number, income, and employment must be consistent across all documents.
+ ## How this differs from `compliance-judgment`

- In both patterns, the shared entities are the anchors. Every anchor that appears in more than one document is a candidate for cross-verification.
+ `compliance-judgment` covers within-doc rules: read one document, apply the rule, return a verdict. The judgment can be regex, deterministic Python, or LLM-augmented, but the input is one document.

- ## Building the Comparison Matrix
+ This skill covers rules where the **input is a case** (a set of related documents). The check.py for such a rule:

- The comparison matrix is the core artifact. Rows are shared fields. Columns are source documents. Each cell contains the value found in that document for that field, or is marked absent.
+ - Receives a case-level structure (list of documents + metadata), not a single doc.
+ - Performs entity extraction across the relevant documents.
+ - Builds the comparison matrix needed for that specific rule (not a generic everything-vs-everything matrix — only the fields this rule cares about).
+ - Applies the consistency check defined in the rule.
+ - Returns the verdict plus the matrix-cells that drove it (evidence).
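One plausible shape for that case-level input, sketched as Python dataclasses. The diff does not show the structure KC actually passes to check.py, so treat these names as assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class CaseDocument:
    path: str                 # where the parsed document lives
    doc_class: str            # classification metadata, e.g. "main_contract", "bank_statement"
    entities: dict = field(default_factory=dict)   # per-doc entity-extraction output

@dataclass
class Case:
    case_id: str
    documents: list = field(default_factory=list)  # list of CaseDocument

    def of_class(self, doc_class: str) -> list:
        """Lookup used when walking from the starting document to its target documents."""
        return [d for d in self.documents if d.doc_class == doc_class]
```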

- Example matrix for a loan case:
+ KC is the **builder** of this check.py / workflow, not the executor. The teaching below is about what to design, not what to do at runtime.

- | Field | Application | Income Cert | Bank Statement | Credit Report |
- |-------|-------------|-------------|----------------|---------------|
- | Applicant name | Zhang Wei | Zhang Wei | Zhang Wei | Zhang W. |
- | ID number | 310...1234 | 310...1234 | — | 310...1234 |
- | Monthly income | 85,000 | 82,000 | avg 43,000 | — |
- | Employer | ABC Corp | ABC Corp Ltd | — | ABC Corporation |
- | Employment start | 2019-03 | 2019-06 | — | 2019 |
+ ## Case patterns to design for

- Example matrix for a contract + appendices case:
+ Two patterns cover most cross-doc rules:

- | Field | Main Contract | Appendix A | Appendix B |
- |-------|---------------|------------|------------|
- | Total amount | 5,000,000 | — | 4,850,000 (sum of items) |
- | Party B name | Shenzhen XX Co. | Shenzhen XX Co., Ltd | Shenzhen XX Co. |
- | Effective date | 2024-01-15 | 2024-01-15 | 2024-02-01 |
+ 1. **Main contract + appendices/supplements.** Same issuer, formally cross-referenced. The main contract states the total, the appendices break it down. Issues to expect: totals that don't match line-item sums, referenced appendices missing or outdated, version mismatches between main body and supplements.

- Populate the matrix using output from `entity-extraction`. An empty cell means the field was not found in that document; this is absence, not necessarily an error. But absence in a field that should be present is itself a finding.
+ 2. **Multi-source bundle.** Loan application + income certificate + bank statement + credit report + property appraisal. Different issuers, different formats, same borrower/case. Issues to expect: identity drift across docs (name spellings, employer name variants), value contradictions (income claimed vs. observed), suspicious consistency (multiple docs with identical formatting/phrasing → possible fabrication).

- ## Contradiction Types
+ When authoring a rule, name the pattern it targets. The check.py shape differs between the two — the contract+appendices pattern expects formal references and reciprocal links; the multi-source pattern expects normalization (entity reconciliation across formats).

- ### Hard Contradictions
+ ## Contradiction taxonomy (what the rule may need to detect)

- Exact mismatches in fields that must be identical across documents:
+ A cross-doc rule may flag one or more of these:

- - **Identity mismatch**: Name spelled differently, ID number differs by digits, date of birth inconsistent.
- - **Amount mismatch**: Main contract states 5,000,000 but appendix line items sum to 4,850,000. Income certificate says 82,000/month but application says 85,000/month.
- - **Date mismatch**: Contract effective date in the main body differs from the date in an appendix.
+ ### Hard contradictions
+ Exact mismatches in fields that must be identical across docs.
+ - Identity field mismatch (name, ID, DoB).
+ - Amount mismatch beyond tolerance (totals, line items, reported incomes).
+ - Date mismatch on fields that should be identical (effective date, signing date).

- Hard contradictions are binary. Either the values match (within defined tolerance) or they do not.
+ Hard contradictions are binary. Either the values match (within defined tolerance) or they don't.

- ### Soft Contradictions
+ ### Soft contradictions
+ Values not directly inconsistent but **implausible** when considered together.
+ - Monthly income claimed at one level; bank statement averages significantly below it.
+ - Property appraisal close to the loan amount (high LTV — technically possible, but a signal).
+ - Employment dates with multi-month gaps across documents.

- Values that are not directly inconsistent but are implausible when considered together:
+ Soft contradictions need thresholds and judgment. They are findings, not verdicts; the rule should specify whether they trigger FAIL, WARNING, or just evidence-collection for downstream review.

- - Monthly income claimed as 100,000 but bank statement average deposits are 5,000.
- - Property appraised at 3,000,000 but loan amount is 2,800,000 (93% LTV — technically possible but suspicious).
- - Employment start date 2019-03 on the application but 2019-06 on the income certificate: a 3-month gap that could be rounding or could be fabrication.
+ ### Cross-reference failures
+ Structural integrity problems within formally linked documents.
+ - Main contract references "Appendix B Payment Schedule" but Appendix B is missing or has a different title.
+ - Appendix references a clause that doesn't exist in the main contract.
+ - Missing reciprocal reference: appendix references main contract, but main contract doesn't list appendix.
+ - Version mismatch: main contract dated one date, appendix dated a much later date with different terms.

- Soft contradictions require thresholds and judgment. They are findings, not verdicts.
+ See `references/contradiction-taxonomy.md` for the field-level taxonomy with tolerances and severity templates the rule designer can borrow.

- ### Cross-Reference Failures
+ ## Severity classification (rule-level decision)

- Structural integrity problems within formally linked documents:
+ Not all contradictions carry the same weight. The rule must specify how its findings map to severity:

- - Main contract references "Appendix B — Payment Schedule" but Appendix B is titled "Technical Specifications" or is missing entirely.
- - Appendix references a clause in the main contract that does not exist in the provided version.
- - Missing reciprocal reference: Appendix C references the main contract, but the main contract does not list Appendix C.
- - Version mismatch: main contract dated January, appendix dated March with different terms.
+ | Severity | Typical example |
+ |----------|-----------------|
+ | Critical | Identity field mismatch (different ID across docs) |
+ | High | Significant financial amount discrepancy |
+ | Medium | Date or employment-detail mismatch |
+ | Low | Formatting / abbreviation difference ("ABC Corp" vs "ABC Corporation") |

- See `references/contradiction-taxonomy.md` for the full field-level taxonomy with tolerances and severity classifications.
+ The actual thresholds are domain-specific. The developer user sets them per field; the rule's design should leave room for the configured values rather than hardcode percentages.
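A small sketch of what "leave room for configured values" can look like in practice: per-field tolerances and severities read from configuration rather than hardcoded in check.py. The field names and numbers are placeholders, not recommended defaults.

```python
# Hypothetical per-field configuration the developer user would maintain.
FIELD_TOLERANCES = {
    "total_amount":   {"relative_tolerance": 0.00, "severity": "high"},
    "monthly_income": {"relative_tolerance": 0.05, "severity": "high"},
    "effective_date": {"relative_tolerance": 0.00, "severity": "medium"},
}

def amounts_consistent(a: float, b: float, relative_tolerance: float) -> bool:
    """Hard-contradiction check: values must agree within the configured tolerance."""
    if a == b:
        return True
    baseline = max(abs(a), abs(b))
    return baseline > 0 and abs(a - b) / baseline <= relative_tolerance

# Example: 85,000 claimed vs. 82,000 certified with a 5% tolerance passes (difference ~3.5%).
# amounts_consistent(85_000, 82_000, FIELD_TOLERANCES["monthly_income"]["relative_tolerance"])
```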

- ## Severity Classification
+ ## Fraud-signal patterns the rule may surface

- Not all contradictions are equal. Classify by impact:
+ Some patterns are visible only at the case level:

- | Severity | Criteria | Example |
- |----------|----------|---------|
- | **Critical** | Identity field mismatch | ID number differs between documents |
- | **High** | Financial amount discrepancy > 10% | Income 85K vs bank avg 43K |
- | **Medium** | Date or employment detail mismatch | Employment start date 3 months apart |
- | **Low** | Formatting or abbreviation difference | "ABC Corp" vs "ABC Corp Ltd" |
+ - **Consistent small discrepancies across docs**: e.g., every doc inflated by ~5%, every date shifted by exactly one month. Consistency-in-error suggests coordinated fabrication rather than honest mistakes.
+ - **Suspicious doc consistency**: multiple docs from supposedly different issuers using identical formatting / phrasing / typos.
+ - **Values at regulatory thresholds**: LTV exactly at the limit, DTI exactly at the limit. One coincidence is harmless; a pattern across the case is a signal.
+ - **Temporal impossibilities**: income certificate issued before employment start; bank statement covering a period before account opening; appraisal dated after loan disbursement.

- The developer user sets severity thresholds per field in the project configuration. The defaults above are starting points. Adjust based on the business context: in some scenarios, a 5% amount discrepancy is critical; in others, 15% is acceptable.
+ If the rule's job includes any of these, encode them. These are **signals for escalation**, not conclusions; the rule should output "flagged" with evidence, and let the developer user or downstream review process determine the final verdict.

- ## Fraud Signal Patterns
+ ## Designing the case-walk in check.py

- Cross-document analysis reveals patterns that single-document review cannot detect. Flag these — do not accuse, flag:
+ For a cross-doc rule, a typical check.py:

- - **Consistent small discrepancies across documents.** Income inflated by exactly 5% in every document. Dates shifted by exactly one month. This consistency in error suggests coordinated fabrication rather than honest mistakes.
- - **Suspicious document consistency.** Multiple documents from supposedly different issuers use identical formatting, identical phrasing, or identical typos. Legitimate documents from different organizations look different.
- - **Values at regulatory thresholds.** LTV ratio at exactly 69.9% when the limit is 70%. DTI at exactly 49.8% when the limit is 50%. One occurrence is coincidence. A pattern across the case is a signal.
- - **Temporal impossibilities.** Income certificate issued before the employment start date. Bank statement covering a period before the account opening date. Appraisal dated after the loan disbursement.
+ 1. Receives the case (list of doc paths/handles + classification metadata).
+ 2. Selects the **starting document** of the class declared in the rule.
+ 3. Extracts the **anchor entity** (using entity-extraction).
+ 4. Locates the **target document(s)** in the case by the anchor or by formal reference.
+ 5. Extracts the **target entity** from each target doc.
+ 6. Applies the **consistency check** per the rule's PASS criteria.
+ 7. Returns:
+ - verdict (PASS / FAIL / WARNING / NOT_APPLICABLE)
+ - evidence (the cells of the comparison matrix that drove the verdict, with source-doc + page references)
+ - confidence (per `confidence-system`)

- These are signals for escalation, not conclusions. Present them with evidence and let the developer user or downstream review process make the determination.
-
- ## Workflow Sequence
-
- The recommended sequence for cross-document verification within a case:
-
- 1. **Identify the case boundary.** Which documents belong together? Use shared identifiers (borrower name, loan number, contract reference) to group documents into cases.
- 2. **Extract anchors.** Run `entity-extraction` on each document independently. Collect the shared fields.
- 3. **Build the matrix.** Populate the comparison matrix. Flag empty cells.
- 4. **Detect contradictions.** Apply hard/soft/cross-reference checks per `references/contradiction-taxonomy.md`.
- 5. **Classify severity.** Assign severity per the configured thresholds.
- 6. **Scan for fraud signals.** Run pattern checks across the matrix.
- 7. **Produce the report.** Output the case-level consistency report with all findings.
+ The comparison matrix is rule-specific: only the fields this rule needs. Building a generic "all entities × all docs" matrix is expensive and pollutes the evidence trail.
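Putting the seven steps together, a skeletal check.py could look like the sketch below. It reuses the hypothetical `Case` structure and rule entry sketched above; none of these names are defined by the package, and the plausibility comparison is just one possible consistency check.

```python
def check(case, rule):
    """Sketch of a cross-doc case-walk; `case` and `rule` follow the hypothetical shapes above."""
    starting_docs = case.of_class(rule["starting_doc_class"])       # step 2
    if not starting_docs:
        return {"verdict": "NOT_APPLICABLE", "evidence": [], "confidence": None}

    start = starting_docs[0]
    anchor = start.entities.get(rule["anchor_entity"])              # step 3
    claimed = start.entities.get(rule["source_entity"])
    evidence, verdict = [], "PASS"

    for doc_class, target_field in rule["target_entities"].items():
        for target in case.of_class(doc_class):                     # step 4
            observed = target.entities.get(target_field)            # step 5
            if claimed is None or observed is None:
                continue
            ok = observed > 0 and claimed / observed <= rule["pass_criteria"]["max_ratio"]  # step 6
            evidence.append({                                       # only the cells this rule needs
                "anchor": anchor,
                "source": start.path, "target": target.path,
                "claimed": claimed, "observed": observed,
                "consistent": ok,
            })
            if not ok:
                verdict = "FAIL"

    # step 7: verdict + evidence; confidence is filled in per the confidence-system skill
    return {"verdict": verdict, "evidence": evidence, "confidence": None}
```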

  ## Integration

- **Inputs:**
- - Entity extraction results from `entity-extraction` (the raw field values per document).
- - Compliance judgment results from `compliance-judgment` (per-document pass/fail already computed).
+ **Inputs to the rule's check.py at runtime:**
+ - Case structure (set of related documents + identifiers).
+ - Entity extraction available per-doc (skill: `entity-extraction`).
+ - Compliance judgment may have already produced per-doc verdicts (skill: `compliance-judgment`) that this rule can use as part of its inputs.

  **Outputs:**
- - Case-level consistency report: the comparison matrix, all contradictions found, severity classifications, and fraud signal flags.
+ - A case-level verdict plus the comparison-matrix slice that justifies it.

  **Feeds into:**
- - `confidence-system`: Cross-document contradictions lower confidence in affected fields.
- - `evolution-loop`: Recurring contradiction patterns trigger workflow refinement.
- - `dashboard-reporting`: Case-level view alongside document-level results.
+ - `confidence-system`: cross-doc contradictions are a strong signal, often lowering confidence in affected fields.
+ - `evolution-loop`: recurring contradiction patterns the rule didn't anticipate trigger workflow refinement.
+ - `dashboard-reporting`: case-level view alongside per-doc results.
+
+ ## Ground-truth principle

- **Ground truth principle:** User and end-user contradiction reports feed back as ground truth. When a user or end-user reports a cross-document inconsistency that the system missed, that report is prior to agent judgment. Log it, learn from it, and adjust detection thresholds accordingly. The system's job is to catch what humans catch — and then go further.
+ When the developer user or end user reports a cross-doc inconsistency that this rule missed, that report is prior to agent judgment. Log it, treat it as a corner case (`corner-case-management`) or as a trigger for rule refinement, and adjust detection thresholds accordingly. The rule's job is to catch what humans catch — and then go further.
package/template/skills/en/dashboard-reporting/SKILL.md
@@ -1,107 +1,132 @@
  ---
  name: dashboard-reporting
  tier: meta-meta
- description: Generate HTML dashboards for developer users to visualize verification results, system progress, and quality metrics. Use when a testing round completes, when production batches finish processing, when the developer user wants to see the system's status, or at any point where visual reporting would help communicate progress. Dashboards should be self-contained HTML files that can be opened by double-clicking. Also use when the developer user asks about results, accuracy, or system health.
+ description: Generate HTML dashboards for developer users to visualize verification results, system progress, and quality metrics. Use when a testing round completes, when production batches finish processing, when the developer user wants visual reporting, or when they explicitly ask for it. Dashboards are self-contained HTML files. Use this skill **when there's something visual worth showing** — not as a default deliverable. For routine status updates use KC's TUI. The dashboard is a complement to direct reporting, not a substitute.
  ---

  # Dashboard Reporting

- The dashboard is the developer user's window into the system. They should not need to read logs or parse JSON to understand what is happening. Give them a clear, visual summary that leads with what matters.
+ The dashboard is one channel — and not always the most economical one — for letting the developer user see what's going on. KC already reports status directly in the TUI; the HTML dashboard exists for things that are **worth seeing visually**: distributions, timelines, heatmaps, side-by-side comparisons, drill-down tables.

- ## Dashboard Types
+ Don't treat dashboard generation as a checkbox to satisfy this skill. Treat it as a deliverable the developer user actually asked for, or where a picture genuinely saves them time over reading TUI output or JSON.

- ### Results Dashboard
- Generated after each batch of documents is processed.
+ ## Minimum vs. nice-to-have

- Key elements:
- - **Summary bar**: Total documents, pass rate, fail rate, missing rate, error rate.
- - **Per-rule breakdown**: Table showing each rule's pass/fail counts, accuracy, and average confidence.
- - **Failed cases**: List of documents that failed, with the rule, extracted value, expected value, and comment. Sortable and filterable.
- - **Confidence distribution**: Histogram showing how many results fall in each confidence band.
+ When the developer user asks for a dashboard, **start with the minimum and expand only if it adds real value**.

- ### Progress Dashboard
- Generated on demand to show the system's evolution.
+ ### Minimum

- Key elements:
- - **Lifecycle status**: Which rules are in which phase (skill testing, workflow testing, production, stable).
- - **Rule catalog**: Table of all rules with their current status, accuracy, and version.
- - **Evolution timeline**: For each rule, how many iterations it took, what was the accuracy at each step.
- - **Model tier assignments**: Which model is being used for each step of each rule.
+ A useful dashboard at the floor:
+ - A summary header: total documents, top-line pass/fail/missing counts.
+ - A per-rule table: rule_id, accuracy, pass / fail / NA counts, optional confidence column.
+ - A list of failed cases the user can click into for details (rule, extracted value, expected value, comment).

- ### Quality Dashboard
- Generated after quality control reviews.
+ That's enough to ship. If the user can answer "is this batch healthy and which rules failed" in 3 seconds, the minimum is done.

- Key elements:
- - **Accuracy over time**: Line chart showing per-rule and overall accuracy across batches.
- - **Sampling rate over time**: Showing how monitoring is decreasing (or not).
- - **Flagged issues**: Open issues that need developer user attention.
- - **Cost metrics**: LLM calls and tokens per document, per rule.
+ ### Nice-to-have

- ## Feedback Collection
+ Add these only when they're justified by the data on hand or the user's actual need:
+ - Confidence distribution histogram (useful when confidence is calibrated and the user cares about the distribution shape).
+ - Accuracy-over-time line chart (useful only when there's enough history to draw a meaningful curve).
+ - Per-product-type / per-issuer breakdown (useful when the corpus has meaningful segmentation).
+ - Cost metrics (useful when cost is a live concern; otherwise skip).
+ - Drill-down navigation (summary → rule → document).
+ - Inline feedback widgets (correction-on-click, flag-as-wrong).

- Every dashboard must include mechanisms for users to report errors and comment directly on verification results. This is not a nice-to-have — user feedback is the most valuable data source in the system.
+ Don't add a section to look thorough. An empty "Confidence distribution" chart with no calibrated data is worse than no chart.

- ### Developer User Feedback
+ ## Dashboard types (when to use which)

- Developer users see full result detail. Their feedback interface should support:
- - **Field-level correction**: Click on an extracted value, provide the correct value.
- - **Result override**: Change a pass to fail (or vice versa) with a reason.
- - **Rule re-evaluation request**: Flag a result for re-processing with a different approach.
- - **Comment**: Free-text annotation on any result.
+ ### Results dashboard
+ After a batch of documents is processed. The minimum above usually covers it.

- ### End User Feedback
+ ### Progress dashboard
+ On demand, to show the system's evolution across phases. Lifecycle status per rule, rule catalog table, evolution timeline. Mostly useful when the developer user wants a "where are we" snapshot mid-build.

- End users of the verification app see simplified results. Their interface should support:
- - **Flag as wrong**: One-click to report a result they believe is incorrect.
- - **Add comment**: Brief text explaining what they think is wrong.
- - **Severity indicator**: How impactful is this error? (Critical / Important / Minor)
+ ### Quality dashboard
+ After QC review cycles. Accuracy-over-time, sampling rate trend, flagged issues, cost. Useful when QC has accumulated enough cycles to show a trend.

- ### Feedback as Ground Truth
+ If only one of the three would actually help the developer user right now, build only that one. Don't generate all three by default.

- User-reported errors are ground truth. They override the coding agent's judgment and the worker LLM's output. The feedback data flow:
+ ## Feedback collection (optional but recommended when applicable)

- 1. User submits feedback via dashboard; stored as structured record.
- 2. Record schema: `{result_id, trace_id, reporter_role, feedback_type, original_result, corrected_value, comment, timestamp}`.
- 3. Feedback records are fed into the `evolution-loop` as confirmed failures.
- 4. Dashboard surfaces feedback trends: correction rate over time, most-reported issues, rules with highest user correction rates.
+ When the dashboard is destined for an audience that's going to review the results (developer user, end user, domain expert), include feedback widgets. When the dashboard is purely for developer-user inspection mid-build, feedback widgets are usually overkill — they pretend at a workflow the user isn't going to follow.

- Build the feedback collection mechanism alongside the dashboard generation — they are not separate features. Every generated HTML dashboard should include the feedback UI, even if it initially writes to a local JSON file that the coding agent reads on the next iteration.
+ ### Developer-user feedback
+
+ Full result detail visible. Useful widgets:
+ - Field-level correction: click an extracted value, provide the right one.
+ - Result override: change pass to fail (or vice versa) with a reason.
+ - Comment: free-text annotation on any result.
+
+ ### End-user feedback
+
+ Simplified results visible. Useful widgets:
+ - Flag-as-wrong: one-click to report a result believed incorrect.
+ - Comment: brief text explanation.
+ - Severity indicator: critical / important / minor.
+
+ ### Feedback as ground truth
+
+ User-reported errors are ground truth. They override agent judgment and worker-LLM output. Flow:
+
+ 1. Submit via dashboard → stored as structured record.
+ 2. Schema: `{result_id, trace_id, reporter_role, feedback_type, original_result, corrected_value, comment, timestamp}`.
+ 3. Records feed into `evolution-loop` as confirmed failures.
+ 4. Surface feedback trends in subsequent dashboards (correction rate over time, most-reported issues, rules with highest correction rates).
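Purely for illustration, one such record written to a local JSON store might look like this; the storage path and the example values are assumptions, only the schema keys come from the list above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

feedback_record = {
    "result_id": "r-000123",                       # illustrative values throughout
    "trace_id": "trace-abc",
    "reporter_role": "developer_user",
    "feedback_type": "field_correction",
    "original_result": {"monthly_income": 85000},
    "corrected_value": {"monthly_income": 82000},
    "comment": "Value taken from the wrong column of the income table.",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

store = Path("Output/feedback.json")               # hypothetical local store
records = json.loads(store.read_text(encoding="utf-8")) if store.exists() else []
records.append(feedback_record)
store.parent.mkdir(parents=True, exist_ok=True)
store.write_text(json.dumps(records, indent=2), encoding="utf-8")
```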

  ## Technology

- Self-contained HTML with embedded CSS and JavaScript. Requirements:
- - **No external dependencies.** No CDN links, no npm packages, no server. Everything is inlined.
- - **No server required.** The developer user double-clicks the HTML file to open it in their browser.
- - **Responsive layout.** Should work on both desktop and mobile screens.
- - **Dark/light mode.** Respect the system preference or provide a toggle.
+ Self-contained HTML with embedded CSS / JavaScript.
+ - **No external dependencies.** No CDN links, no npm packages, no server. Everything inlined.
+ - **No server required.** Developer user double-clicks the HTML file.
+ - **Responsive layout.** Should work on desktop and mobile.
+ - **Dark/light mode.** Respect system preference or provide a toggle.

- For charts, use inline SVG or a lightweight chart library that can be embedded (e.g., Chart.js or lightweight alternatives, inlined as a script tag).
+ For charts, use inline SVG or a lightweight chart library inlined as a `<script>` tag.

- ## Data Sources
+ ## Data sources

  Dashboards read from:
  - `Output/` for verification results.
  - `logs/` for evolution and testing history.
- - `versions.json` for current system state.
- - QC review records (stored alongside Output/).
+ - `versions.json` (or git log) for current system state.
+ - QC review records (stored alongside `Output/`).

- The dashboard generation script should accept input paths and produce a single HTML file.
+ The generation script should accept input paths and produce a single HTML file.
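A minimal sketch of that contract: result paths in, one self-contained HTML file out. It renders only the "minimum" summary and per-rule table described earlier, and the result-file layout it assumes (a JSON list of records with `rule_id` and `verdict`) is an illustration, not the format `generate_dashboard.py` actually expects.

```python
import json
from pathlib import Path

def generate_dashboard(results_path: Path, out_path: Path) -> None:
    """Read verification results and write a single HTML file with no external dependencies."""
    results = json.loads(results_path.read_text(encoding="utf-8"))
    total = len(results)
    passed = sum(1 for r in results if r["verdict"] == "PASS")
    failed = sum(1 for r in results if r["verdict"] == "FAIL")

    per_rule = {}
    for r in results:
        stats = per_rule.setdefault(r["rule_id"], {"PASS": 0, "FAIL": 0})
        stats[r["verdict"]] = stats.get(r["verdict"], 0) + 1

    rows = "".join(
        f"<tr><td>{rule}</td><td>{s['PASS']}</td><td>{s['FAIL']}</td></tr>"
        for rule, s in sorted(per_rule.items())
    )
    html = (
        "<!doctype html><html><head><meta charset='utf-8'>"
        "<style>body{font-family:sans-serif}td,th{padding:4px 12px}</style></head>"
        f"<body><h1>Batch summary</h1><p>{total} results: {passed} pass, {failed} fail</p>"
        f"<table><tr><th>Rule</th><th>Pass</th><th>Fail</th></tr>{rows}</table></body></html>"
    )
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(html, encoding="utf-8")

# generate_dashboard(Path("Output/results.json"), Path("Output/dashboards/summary.html"))
```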

- ## Generation Triggers
+ ## Generation triggers

- Generate dashboards automatically after:
- - Each testing round completes (skill testing or workflow testing).
- - Each production batch finishes processing.
- - Each quality control review cycle.
- - Developer user explicitly requests it.
+ Generate dashboards when:
+ - A testing round completes AND there's enough data to be worth visualizing.
+ - A production batch finishes AND the developer user wants a visual.
+ - A QC review cycle completes.
+ - The developer user explicitly requests one.

- Store generated dashboards in `Output/dashboards/` with timestamps in filenames for history.
+ Don't auto-generate on every minor event — the dashboards pile up fast and the user won't open most of them. When unsure, ask the user ("Want me to generate a dashboard?") instead of producing one unprompted.

- ## Design Principles
+ Store generated dashboards in `Output/dashboards/` with timestamped filenames for history.

- - **Lead with the summary.** The developer user should understand the system's health in 3 seconds.
- - **Drill down on demand.** Summary → rule-level → document-level. Do not overwhelm with details upfront.
+ ## Design principles
+
+ - **Lead with the summary.** Developer user should understand health in 3 seconds.
+ - **Drill down on demand.** Summary → rule-level → document-level. Don't overwhelm with details upfront.
  - **Color coding.** Green for pass/healthy, red for fail/critical, yellow for warning/attention. Simple and universal.
- - **Actionable.** Every flagged issue should suggest what to do next.
+ - **Actionable.** Every flagged issue should suggest a next step.
+
+ A starter script is available in `scripts/generate_dashboard.py`. Adapt to the specific scenario — and feel free to trim the script when half its sections wouldn't have content. A small dashboard that answers the user's question beats a comprehensive one they don't need.
+
+ ## Relationship to TUI reporting
+
+ KC's TUI already supports rich status reporting during the run. Use TUI for:
+ - Ongoing progress narration.
+ - Per-phase summaries.
+ - Quick "what just happened" updates.
+ - Anything that can be communicated in a few lines of text.
+
+ Use HTML dashboards for:
+ - Visual artifacts that wouldn't fit (distributions, charts, filterable tables).
+ - Hand-off to non-KC users (developer-user reviewing later, end-user audience).
+ - Persistent records the user wants to revisit.

- A starter script is available in `scripts/generate_dashboard.py`. Adapt it to the specific business scenario.
+ When in doubt, prefer the TUI. A short status message the user is already reading beats a dashboard they have to open.