kc-beta 0.8.1 → 0.8.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/src/agent/context.js +17 -1
- package/src/agent/engine.js +85 -8
- package/src/agent/llm-client.js +24 -1
- package/src/agent/pipelines/_milestone-derive.js +78 -7
- package/src/agent/pipelines/skill-authoring.js +19 -2
- package/src/agent/tools/release.js +94 -1
- package/src/cli/index.js +28 -7
- package/template/.env.template +1 -1
- package/template/AGENT.md +2 -2
- package/template/skills/en/auto-model-selection/SKILL.md +55 -35
- package/template/skills/en/bootstrap-workspace/SKILL.md +13 -0
- package/template/skills/en/compliance-judgment/SKILL.md +14 -0
- package/template/skills/en/confidence-system/SKILL.md +30 -8
- package/template/skills/en/corner-case-management/SKILL.md +53 -33
- package/template/skills/en/cross-document-verification/SKILL.md +88 -83
- package/template/skills/en/dashboard-reporting/SKILL.md +91 -66
- package/template/skills/en/dashboard-reporting/scripts/generate_dashboard.py +1 -1
- package/template/skills/en/data-sensibility/SKILL.md +19 -12
- package/template/skills/en/document-chunking/SKILL.md +99 -15
- package/template/skills/en/entity-extraction/SKILL.md +14 -4
- package/template/skills/en/quality-control/SKILL.md +14 -0
- package/template/skills/en/rule-extraction/SKILL.md +92 -94
- package/template/skills/en/rule-extraction/references/chunking-strategies.md +7 -78
- package/template/skills/en/skill-authoring/SKILL.md +52 -8
- package/template/skills/en/skill-creator/SKILL.md +25 -3
- package/template/skills/en/skill-to-workflow/SKILL.md +23 -4
- package/template/skills/en/task-decomposition/SKILL.md +1 -1
- package/template/skills/en/tree-processing/SKILL.md +1 -1
- package/template/skills/en/version-control/SKILL.md +15 -0
- package/template/skills/en/work-decomposition/SKILL.md +21 -35
- package/template/skills/zh/auto-model-selection/SKILL.md +54 -33
- package/template/skills/zh/bootstrap-workspace/SKILL.md +13 -0
- package/template/skills/zh/compliance-judgment/SKILL.md +14 -0
- package/template/skills/zh/compliance-judgment/references/output-format.md +62 -62
- package/template/skills/zh/confidence-system/SKILL.md +34 -9
- package/template/skills/zh/corner-case-management/SKILL.md +71 -104
- package/template/skills/zh/cross-document-verification/SKILL.md +90 -195
- package/template/skills/zh/cross-document-verification/references/contradiction-taxonomy.md +36 -36
- package/template/skills/zh/dashboard-reporting/SKILL.md +82 -232
- package/template/skills/zh/dashboard-reporting/scripts/generate_dashboard.py +1 -1
- package/template/skills/zh/data-sensibility/SKILL.md +13 -0
- package/template/skills/zh/document-chunking/SKILL.md +96 -20
- package/template/skills/zh/document-parsing/references/parser-catalog.md +26 -26
- package/template/skills/zh/entity-extraction/SKILL.md +14 -4
- package/template/skills/zh/evolution-loop/references/convergence-guide.md +38 -38
- package/template/skills/zh/quality-control/SKILL.md +14 -0
- package/template/skills/zh/quality-control/references/qa-layers.md +65 -65
- package/template/skills/zh/quality-control/references/sampling-strategies.md +49 -49
- package/template/skills/zh/rule-extraction/SKILL.md +199 -188
- package/template/skills/zh/rule-extraction/references/chunking-strategies.md +5 -78
- package/template/skills/zh/skill-authoring/SKILL.md +108 -69
- package/template/skills/zh/skill-authoring/references/skill-format-spec.md +39 -39
- package/template/skills/zh/skill-creator/SKILL.md +71 -61
- package/template/skills/zh/skill-creator/references/schemas.md +60 -60
- package/template/skills/zh/skill-to-workflow/SKILL.md +24 -5
- package/template/skills/zh/skill-to-workflow/references/worker-llm-catalog.md +24 -24
- package/template/skills/zh/task-decomposition/SKILL.md +1 -1
- package/template/skills/zh/task-decomposition/references/decision-matrix.md +54 -54
- package/template/skills/zh/tree-processing/SKILL.md +1 -1
- package/template/skills/zh/version-control/SKILL.md +15 -0
- package/template/skills/zh/version-control/references/trace-id-spec.md +34 -34
- package/template/skills/zh/work-decomposition/SKILL.md +21 -33
package/template/skills/en/auto-model-selection/SKILL.md
@@ -2,53 +2,73 @@
 name: auto-model-selection
 tier: meta
 description: >
-  Use Context7 CLI
-
-
-
-
-  (npm i -g context7).
+  Use Context7 CLI for up-to-date model facts (size, API format, context window) and the
+  guidance below for "what kind of models are good for what part of a doc verification app".
+  Consult whenever you need to pick a model for a tier slot, decide between provider
+  alternatives, or sanity-check an existing tier assignment. Context7 gives you fresh
+  facts; this skill gives you the heuristic that maps those facts onto KC's pipeline.
+  Optional plugin (Context7 install: npm i -g context7).
 ---

-# Auto Model Selection
+# Auto Model Selection

-
+Model selection is not a frequently called skill — for most users, the tiers in workspace `.env` are already set sensibly, and a 4-tier cost-sensitive selection is overkill. This skill exists for two moments: when the conductor is bootstrapping fresh tier assignments (rare), and as a reference inside `skill-to-workflow` when a workflow needs to pick which worker LLM is right for its job.

-
-- `c7 library <query>` — search for a library/provider by name
-- `c7 docs <libraryId> <query>` — get specific documentation and code examples
+The teaching below is empirical — what the author has found works in this domain. Treat it as a starting heuristic with a 3-6 month shelf life, since model families update quickly.

-##
+## Worker LLM family — practical heuristic

--
--
--
-- Onboarding `/models` endpoint failed and curated list is stale
+- **Qwen family** — robust and cheap for general worker work. The flagship MoE in this family at any given time is usually one of the best workhorses for routine extraction / classification. Smaller sizes (3B-70B) are plentiful and reliable. Good default.
+- **DeepSeek** — excellent on more complicated tasks. Worth reaching for when the rule involves multi-step reasoning, nested judgment, or anything where Qwen's shorter-context behavior shows strain.
+- **GLM and Kimi** — also strong for the same complicated-task range as DeepSeek. Trade-off: they don't usually ship smaller variants (3B-70B), so they're tier1/tier2 only, not tier3/tier4.

-##
+## Flagship-MoE shape and the tier1 baseline

-
-2. Use `c7 library <provider-name>` to find the provider's library ID
-3. Use `c7 docs <id> "available models"` to get current model listings
-4. From the docs, identify: model names, capabilities (reasoning, coding, vision), context window sizes, pricing tiers
-5. Assign models to tiers based on capability and cost:
-   - LLM tier1: most capable (complex judgment, extraction)
-   - LLM tier2-3: mid-range (routine extraction, simple judgment)
-   - LLM tier4: cheapest (high-volume simple tasks)
-   - VLM tier1-3: vision models for document parsing/OCR
-6. Update `model-tiers.json` or workspace `.env` with assignments
+The current generation of flagship MoE LLMs has a recognizable shape: total parameter count in the 200-400B range, ~20B of activated expert size per token. Examples (these will be stale within months): Qwen's flagship MoE in the 200-400B-A20B band, DeepSeek-V4-Flash, etc.

-
+This shape is a good first-choice worker LLM — not necessarily the absolute best at tier1, but a reasonable benchmark to start from. When you're picking a tier1 model, start from one of these and only move if you have a specific reason.

-
-- Regex is tier0 — smaller than any LLM
-- Not all tiers need to be filled — blank tiers are fine if the provider lacks suitable models
-- Record what works in AGENT.md for future reference
+## Smaller LLMs — basically free below 30B

-
+Once you drop below ~30B parameters, models are extremely cheap on most providers. Qwen ships a ton of choices in this range and they work well.
+
+Two rules for picking in this range:
+- **Avoid "coder" variants** (models with `coder` / `code` in the name) at small sizes — these are mostly unreliable for general worker tasks. Use them only when the task is literally code-related.
+- **Prefer no-thinking-mode variants** when available. Tasks assigned to small workers are simple and fixed; extra thinking buys nothing and adds latency + token cost.
+
+## VLM / OCR selection
+
+First question: what's the visual task?
+
+- **Characters in scanned docs, seal stamps, handwriting** — use a dedicated OCR model. Current strong choices (subject to change): Paddle-OCR family, GLM-OCR, DeepSeek-OCR. Previous versions still work fine. No need to use a larger general VLM for plain character recognition.
+- **Graphs, complex tables, structures with strange or no frame lines** — try a larger and more expensive general VLM. The structure-understanding gap between OCR-specific models and general VLMs widens fast on these cases.
+
+When in doubt, run the cheapest OCR option first and only upgrade if the cheap path misses structure information.
+
+## Context7 — model facts on demand
+
+The heuristics above are about *which kind* of model to pick. For the specific facts (current model names, exact context window, pricing, API format), use Context7:

 ```bash
-
+c7 library <provider-name>
+c7 docs <libraryId> "available models"
 ```

-
+Two commands. The first finds the provider's library ID, the second fetches up-to-date docs and code examples. Useful when:
+
+- Workspace `model-tiers.json` looks stale (KC hasn't been updated since the last model launch)
+- User switched providers and needs model discovery
+- The `/models` endpoint on a new provider was empty or unhelpful
+- You're sanity-checking that a model name in `.env` is still served
+
+Install: `npm i -g context7`. Verify: `c7 library openai` should return results.
+
+## When tier1/tier2 are picked
+
+The tier assignment ends up driving cost. Cheapest model that meets the accuracy bar wins. Regex is the implicit "tier 0" and should be the first reach when the rule can be satisfied by pattern matching alone — see `skill-to-workflow` for when to escalate from regex to worker LLMs.
+
+Not all tiers need to be filled. Blank tier3/tier4 slots are fine if the provider doesn't ship suitable small models. Record what works in `AGENT.md` so the next session inherits the choice.
+
+## Refresh cadence
+
+This skill's heuristics will drift. Author plans to revisit every 3-6 months as new model generations land. If you find yourself applying advice here that contradicts what Context7 shows today, trust Context7 — the facts have moved.
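The sizing heuristic in the hunk above (flagship-MoE shapes for tier1/tier2, sub-30B non-coder, non-thinking models for tier3/tier4, blank tiers allowed) can be sketched as a first-pass assignment. A minimal illustration only: the candidate fields (`name`, `params_b`, `price`) and the price-as-capability shortcut are assumptions, not the package's actual `model-tiers.json` logic.

```python
def assign_tiers(candidates):
    """First-pass tier assignment sketch.

    candidates: list of dicts like {"name": str, "params_b": float, "price": float}
    (hypothetical shape; not the package's model-tiers.json schema).
    """
    def is_coder(model):
        name = model["name"].lower()
        return "coder" in name or "code" in name

    def is_thinking(model):
        return "thinking" in model["name"].lower()

    # tier1/tier2 candidates: flagship-MoE-shaped models (~200-400B total params).
    flagships = sorted((m for m in candidates if 200 <= m["params_b"] <= 400),
                       key=lambda m: m["price"])
    # tier3/tier4 candidates: sub-30B models, skipping coder and thinking-mode variants.
    small = sorted((m for m in candidates
                    if m["params_b"] < 30 and not is_coder(m) and not is_thinking(m)),
                   key=lambda m: m["price"])

    tiers = {}
    if flagships:
        tiers["tier1"] = flagships[-1]["name"]  # priciest flagship as a rough capability proxy
        tiers["tier2"] = flagships[0]["name"]   # cheapest flagship
    if small:
        tiers["tier3"] = small[-1]["name"]
        tiers["tier4"] = small[0]["name"]
    return tiers                                # missing tiers stay blank, which is fine
```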
@@ -87,6 +87,19 @@ Update AGENT.md at natural checkpoints: after the developer user gives you a sub
|
|
|
87
87
|
|
|
88
88
|
A future session resumes by reading AGENT.md first. The richer it is, the less re-explanation the developer user has to do.
|
|
89
89
|
|
|
90
|
+
### Phase-transition cadence
|
|
91
|
+
|
|
92
|
+
A recurring failure mode worth flagging: agents bootstrap AGENT.md richly, then never touch it again — many hours of phase work pass without a single AGENT.md commit. That defeats the long-term-memory purpose.
|
|
93
|
+
|
|
94
|
+
Cadence to adopt: **append a one-line decision log to AGENT.md at each phase transition**. Format:
|
|
95
|
+
|
|
96
|
+
```
|
|
97
|
+
[<timestamp> | rule_extraction → skill_authoring]
|
|
98
|
+
N rules extracted; coverage_audit complete; R03/R05/R07 flagged judgment-heavy.
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
Three lines of friction per phase transition; thirty lines of insight for the next auditor / next session. The format is loose — the cadence matters more than perfect prose.
|
|
102
|
+
|
|
90
103
|
## When to Re-Bootstrap
|
|
91
104
|
|
|
92
105
|
Return to this skill when:
|
|
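The cadence in this hunk is easy to mechanize. A minimal sketch, assuming AGENT.md lives at the workspace root; the function name and timestamp format are illustrative, not part of the package.

```python
from datetime import datetime, timezone
from pathlib import Path

def log_phase_transition(workspace: Path, from_phase: str, to_phase: str, summary: str) -> None:
    """Append one decision-log entry to AGENT.md at a phase transition (sketch)."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    entry = f"\n[{stamp} | {from_phase} → {to_phase}]\n{summary}\n"
    # Append-only: earlier history in AGENT.md is never rewritten.
    with (workspace / "AGENT.md").open("a", encoding="utf-8") as handle:
        handle.write(entry)

# Example (phase names and summary wording are hypothetical):
# log_phase_transition(Path("."), "rule_extraction", "skill_authoring",
#                      "12 rules extracted; coverage_audit complete; R03/R05/R07 flagged judgment-heavy.")
```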
package/template/skills/en/compliance-judgment/SKILL.md
@@ -81,3 +81,17 @@ Map these dependencies in the rule catalog. Execute rules in dependency order. P
 - **Multiple values**: The document contains the entity in multiple places with different values. Flag as `uncertain`. Report all found values.
 - **Conditional rules**: "If the loan exceeds 1M, then collateral is required." Check the condition before applying the rule. If the condition is not met, the rule does not apply — result is `pass` (or `not_applicable` if you add that category).
 - **Negative results**: Some rules check for absence. "The document must NOT contain guarantees to related parties." Searching for absence is harder than searching for presence. Be thorough in the search, then be confident in the negative.
+
+## Cross-document consistency of confidence bars
+
+For a rule that judges multiple documents (which is the common case), **the confidence threshold for "pass" must be the same across all documents** for that rule. A rule that requires 0.85 confidence to pass on document A and 0.75 on document B isn't a rule — it's two rules pretending to be one.
+
+When the worker LLM is the judge, set the threshold in the prompt or in postprocessing, not "let the LLM decide" per call. Random variation between LLM calls means each call lands somewhere on a distribution; your job is to set the bar that distribution must clear.
+
+Two ways to enforce:
+- **In the prompt**: explicit "output 'PASS' only if confidence ≥ 0.85" language. Cheap. Vulnerable to LLM drift between calls.
+- **In postprocessing**: ask the LLM for verdict + confidence separately, then apply the threshold in a small Python wrapper. More reliable. Engine sees the threshold as code, not prompt prose.
+
+For rules with stable patterns (formats, presence checks), prefer postprocessing. For rules with subjective judgment (adequacy, sufficiency), the prompt-level threshold is easier to write but worth auditing — sample a batch and check whether the LLM honored the bar.
+
+The `confidence-system` skill describes how to compose confidence from multiple signals; this section is about applying it consistently across documents for the same rule.
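The postprocessing option above (verdict and confidence returned separately, one bar applied in code) reduces to a small wrapper. A sketch; the JSON response shape is an assumption for illustration.

```python
import json

PASS_THRESHOLD = 0.85  # one bar per rule, identical for every document it judges

def apply_confidence_bar(llm_response: str) -> dict:
    """Apply the rule's pass threshold in code rather than in prompt prose (sketch)."""
    raw = json.loads(llm_response)  # assumed shape: {"verdict": ..., "confidence": ..., "evidence": ...}
    verdict = raw.get("verdict", "uncertain")
    confidence = float(raw.get("confidence", 0.0))

    # A "pass" below the bar is downgraded to "uncertain" so it routes to review.
    if verdict == "pass" and confidence < PASS_THRESHOLD:
        verdict = "uncertain"

    return {"verdict": verdict, "confidence": confidence, "evidence": raw.get("evidence")}
```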
package/template/skills/en/confidence-system/SKILL.md
@@ -50,21 +50,43 @@ Does this document match any known corner case pattern?
 - Near miss: slightly lower confidence.
 - No match: neutral (no adjustment).

+### Signal: Format-Pattern Conformance
+Does the extracted value match a regex-expressible format that the rule implies? Many fields have a known shape — phone numbers, ID numbers, dates, currency amounts, regulatory codes — that can be checked cheaply with a regex independent of the LLM's confidence in its own output.
+- Value matches the expected format pattern: positive signal.
+- Value violates the format pattern (e.g., a phone field with letters): strong negative signal, often a hallucination indicator.
+- No applicable format pattern for the field type: neutral.
+
+This catches a common failure: the LLM correctly identifies *where* the value lives but gets the format wrong (typos a digit, drops a country code, returns a stand-in like "see appendix").
+
+### Signal: Statistical Outlier
+For numeric values, does this value sit far from the typical range seen across other documents for the same field?
+- Within 1 standard deviation of average: neutral / mild positive.
+- 2-3 standard deviations: mild negative.
+- Beyond 3 std dev or outside a domain-plausible range: strong negative — often a unit error (yuan vs. 万元), a decimal mistake, or a hallucinated digit.
+
+Useful especially for amounts, percentages, ratios. Compute the reference range from QC-confirmed historical data; refresh periodically. For categorical fields, the analog is "value not in the observed value-set" — flag novel values for review.
+
 ### Combining Signals

-
+Combine the signals into a single confidence score. The form is usually a weighted sum:

 ```
-confidence =
+confidence = w_method * method_prior
+           + w_source * source_presence
+           + w_history * historical_accuracy
+           + w_corner * corner_case_adjustment
+           + w_format * format_conformance
+           + w_outlier * outlier_check
 ```

-
-
--
--
--
+How to set the weights is **your judgment call** per rule and per project. A few orienting principles, not prescriptions:
+
+- Historical accuracy is the most predictive signal once you have data — favor it heavily after the first few QC cycles.
+- Method prior and source presence are always available — they carry the early iterations before history exists.
+- Format conformance and outlier signals are independent of the LLM — they keep working even when the LLM is overconfident, which is exactly when you want them.
+- Corner-case adjustment is usually small but should fire decisively when a known corner case is matched.

-When historical accuracy is not yet available (early iterations), redistribute its weight to the other signals.
+When historical accuracy is not yet available (early iterations), redistribute its weight to the other signals. The right weights for this rule on this corpus will reveal themselves through calibration — that's the loop the next section describes.

 ## Threshold Bands

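The weighted sum above, together with the redistribute-missing-history rule, reads naturally as one small function. A sketch; the weights are placeholders, since the skill leaves them as a per-rule, per-project judgment call.

```python
DEFAULT_WEIGHTS = {
    "method_prior": 0.25,
    "source_presence": 0.20,
    "historical_accuracy": 0.30,
    "corner_case_adjustment": 0.05,
    "format_conformance": 0.10,
    "outlier_check": 0.10,
}

def combine_confidence(signals: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of signal values in [0, 1]; missing signals give up their weight."""
    available = {name: w for name, w in weights.items() if signals.get(name) is not None}
    total = sum(available.values())
    if not total:
        return 0.0
    # Renormalizing implements "redistribute its weight to the other signals".
    score = sum(signals[name] * (w / total) for name, w in available.items())
    return max(0.0, min(1.0, score))

# Example: a format violation (0.0) drags the score down even when the LLM is confident.
# combine_confidence({"method_prior": 0.7, "source_presence": 1.0,
#                     "historical_accuracy": None, "corner_case_adjustment": 0.5,
#                     "format_conformance": 0.0, "outlier_check": 0.5})
```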
package/template/skills/en/corner-case-management/SKILL.md
@@ -1,22 +1,35 @@
 ---
 name: corner-case-management
 tier: meta
-description: Identify, catalog, and handle corner cases that do not fit the mainstream verification workflow. Use
+description: Identify, catalog, and handle corner cases that do not fit the mainstream verification workflow. Use AFTER several rounds of skill/workflow iteration have surfaced documents that genuinely don't fit and can't be accommodated by reasonable changes to the standard flow. Covers when something is a corner case vs. a systemic fix needed, how to register it, how to wire detection-and-resolve so the corner case stays out of the hot path until a similar document appears again, and when to promote a corner case to a regular rule.
 ---

 # Corner Case Management

-A good workflow handles
+A good workflow handles the bulk of cases cleanly. Corner cases are documents that don't fit — individually rare, collectively non-negligible. The key insight: do NOT patch the main workflow to handle them. That leads to spaghetti logic, fragile code, and regressions.

-Instead, maintain a separate registry. Check incoming documents against the registry
+Instead, maintain a separate registry. Check incoming documents against the registry; handle matches with specific resolutions; keep the main workflow lean.
+
+## When to reach for this skill
+
+Corner cases are identified **after** the testing iterations, not during initial design. The typical pathway:
+
+1. You've built a skill/workflow for a rule.
+2. You've run it on the sample set. Most cases pass; some fail.
+3. You iterate — change code, change prompts, run back-tests.
+4. After several rounds, some cases stubbornly don't fit. Either:
+   - The current workflow doesn't catch them and changing the workflow to catch them would cost too much (breaks unrelated cases, adds too much complexity, requires architectural rework).
+   - The cases are swinging — one round of changes fits them, the next round breaks them, and oscillation continues.
+
+That's when you use this skill. Not on the first iteration. Not when you can plausibly fix it in the workflow. Only when fitting the case into the standard flow costs more than the case is worth.

 ## Philosophy

-Corner cases are a fact of life in document verification. Financial documents
+Corner cases are a fact of life in document verification. Financial documents come from thousands of different organizations, each with their own formatting quirks, templates, and interpretations of regulations. No workflow will handle every variant.

-The question is not "how do I eliminate corner cases?" but "how do I manage them efficiently?"
+The question is not "how do I eliminate corner cases?" but "how do I manage them efficiently without polluting the main path?"

-The answer is: separate them
+The answer is: separate them. Keep the main workflow clean. Keep the corner cases cataloged, detectable, and resolvable when they recur.

 ## The Corner Case Registry

@@ -39,7 +52,7 @@ A structured file (`corner_cases.json`) in the rule skill's `assets/` directory:
 "action": "Multiply extracted value by 100 before threshold comparison",
 "code_snippet": "if value < 1.0: value *= 100"
 },
-"discovered_at": "
+"discovered_at": "<date>",
 "iteration": 3,
 "status": "active"
 }
@@ -53,60 +66,67 @@ Each entry captures:
 - **When** it was discovered and in which iteration.
 - **Status**: active, promoted (moved to main workflow), or deprecated.

-##
+## How to handle corner cases — lazy retrieval with a high bar
+
+Corner cases live in the registry, not in the main verification path. They're **lazily loaded** with a **high retrieval threshold**, meaning:

-
+- The main workflow runs first on every document.
+- The registry is consulted only when a new incoming document resembles a previously-collected corner case strongly enough to trigger that entry.
+- The "strongly enough" bar should be set high — false matches drag in a wrong resolution and pollute the main result. Better to miss a match (the main workflow handles it, possibly imperfectly) than to apply an irrelevant corner-case patch.

-
-2. Check the document against each active corner case's detection pattern.
-3. If a match exceeds the confidence threshold, apply the specific resolution instead of (or in addition to) the standard workflow.
-4. Log that a corner case was triggered.
+Two detection patterns work for this:

-
--
-- Detection patterns are the retrieval queries.
-- High confidence thresholds prevent false matches.
-- Only relevant corner cases are loaded into context.
+- **Cheap deterministic match**: regex or keyword fingerprint over document text. Fast, run on every document.
+- **Embedding-based similarity**: compute embedding of the document's rule-relevant region, compare to embeddings of registered corner cases. Catches semantic resemblance the regex misses; costs more.

-
+For early-stage rules with a small registry, deterministic detection is enough. For mature rules with dozens of registered cases, embedding-based similarity scales better.
+
+When a match fires:
+1. Log that the corner case was triggered (audit visibility).
+2. Apply the registered resolution.
+3. Skip / supplement / override the standard workflow as the resolution prescribes.
+
+## When to add a corner case

 Add a corner case when:
--
-- The failure has a recognizable, describable pattern.
-- The resolution is clear and self-contained.
+- Iterations of the standard workflow can't accommodate the document without disproportionate cost.
+- The failure has a recognizable, describable pattern (a fingerprint to detect again).
+- The resolution is clear and self-contained (not "rewrite the workflow").

 Do NOT add a corner case when:
-- The failure affects many documents
-- The failure has no discernible pattern
-- The resolution would require changing the core judgment logic
+- The failure affects many documents — that's not a corner case, it's a pattern, and the main workflow should change.
+- The failure has no discernible pattern — that may be a data quality issue; escalate to the developer user.
+- The resolution would require changing the core judgment logic — that belongs in the main workflow.

-## When to
+## When to promote a corner case

 A corner case should be promoted to the main workflow (i.e., the resolution becomes part of the standard logic) when:
-- It starts appearing
+- It starts appearing across many documents. It's no longer a corner case — it's a pattern.
 - Multiple similar corner cases suggest a common underlying issue.
 - The developer user explicitly says "this is how it always works."

 When promoting, remove the corner case from the registry and update the workflow. Version both changes.

-## Human
+## Human visibility

 The corner case registry must be readable and manageable by the developer user:
+
 - Format it clearly (JSON or a markdown table).
-- Include enough context that a domain expert can understand each case without reading
+- Include enough context that a domain expert can understand each case without reading code.
 - Report new corner cases in the dashboard.
 - Allow the developer user to add corner cases from their own expertise.

-Developer users often know about edge cases that the
+Developer users often know about edge cases that the agent has not yet encountered. They should be able to add entries like:
 - "Bank XYZ always uses a different template for their Q4 reports."
 - "Mutual fund documents from before 2020 follow the old regulation format."

 These are valuable inputs that prevent future failures.

-## Corner
+## Corner case cost

 Every corner case has a runtime cost: the detection check runs on every document. Keep the registry lean:
+
 - Remove deprecated corner cases.
 - Merge similar corner cases into a single entry with a broader pattern.
-- Keep detection patterns efficient (prefer regex over LLM-based detection).
-- Monitor the registry size.
+- Keep detection patterns efficient (prefer regex over LLM-based detection where possible).
+- Monitor the registry size. When it grows large for a single rule, that's a signal the workflow itself needs improvement — many of those "corner" cases probably want promotion to mainline handling.
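The lazy-retrieval flow in this hunk (deterministic detection on every document, a deliberately high bar, resolution applied only on a confident match) can be sketched as follows. The `id` and `detection_pattern` field names are assumptions layered onto the registry entry shown above, and `apply_resolution` is a hypothetical helper.

```python
import json
import re
from pathlib import Path

def load_registry(assets_dir: Path) -> list:
    """Read corner_cases.json from the rule skill's assets/ directory."""
    return json.loads((assets_dir / "corner_cases.json").read_text(encoding="utf-8"))

def match_corner_case(doc_text: str, registry: list):
    """Cheap deterministic pass: return the first active entry whose fingerprint hits."""
    for case in registry:
        if case.get("status") != "active":
            continue
        pattern = case.get("detection_pattern")  # assumed field: a regex fingerprint
        if pattern and re.search(pattern, doc_text):
            return case   # high bar: an exact fingerprint hit, not fuzzy similarity
    return None           # no match: the main workflow handles the document

# Usage inside a rule's check, before the standard path (apply_resolution is hypothetical):
# case = match_corner_case(doc_text, load_registry(Path("assets")))
# if case:
#     print(f"corner case {case['id']} triggered")  # audit visibility first
#     result = apply_resolution(case, doc_text)     # then the registered resolution
```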
package/template/skills/en/cross-document-verification/SKILL.md
@@ -1,132 +1,137 @@
 ---
 name: cross-document-verification
 tier: meta
-description:
+description: Build cross-document verification rule-skills and workflows — i.e., rules where the verdict depends on facts that appear in MORE than one document. Use when authoring or distilling a rule that requires comparing entities/values across documents in a case (main contract + appendices, loan application + income certificate + bank statement, etc.). Covers what the rule definition must specify (source-doc → entity → target-doc → entity → consistency level), how the resulting check.py/workflow walks the case, contradiction types to handle, and severity classification. Distinct from `compliance-judgment` (within-doc rules); KC is the builder, not the executor.
 ---

-# Cross-Document Verification
+# Cross-Document Verification — Building Rules That Span Documents

-
+KC's job in this phase is **building** verification rules that span multiple documents — not executing cross-doc checks at runtime. The output is a rule-skill / workflow that, once shipped, can be applied by KC or downstream systems to any case.

-
+If a rule's verdict depends on facts in more than one document, that fact has to be encoded in the rule itself. You can't build a within-doc check.py and hope the cross-doc check happens by magic at production time.

-##
+## What the rule definition must specify

-
+For a cross-doc rule, the catalog entry needs more than just "description + falsifiability_statement". It needs an **explicit cross-doc trajectory**:

-
+1. **Starting document class**: which document do we start from? How do we classify a document as belonging to this class? (e.g., "loan application form", "main contract", "annual disclosure report")
+2. **Anchor entity in the starting doc**: what do we extract first? (e.g., applicant ID, borrower name, contract reference number, reported total amount)
+3. **Target document class(es)**: which document(s) must we cross-check against? How do we identify them within the same case? (shared anchor key, naming pattern, explicit reference in the starting doc)
+4. **Target entity in each target doc**: what do we extract from each target doc to compare? It may or may not have the same field name — "monthly income" on the application vs. "average deposit" on the bank statement.
+5. **Consistency level for PASS**: when do we say the documents agree? Exact equality? Equality within tolerance? Plausibility check (within 1.5× of the other value)? Reciprocal-reference present?
+6. **What FAIL looks like**: hard mismatch, soft contradiction, or missing reciprocal reference — and what severity each carries.

-
+Without these five anchors, the rule is just a wish. With them, the resulting check.py can walk the case deterministically.

-
+## How this differs from `compliance-judgment`

-
+`compliance-judgment` covers within-doc rules: read one document, apply the rule, return a verdict. The judgment can be regex, deterministic Python, or LLM-augmented — but the input is one document.

-
+This skill covers rules where the **input is a case** (a set of related documents). The check.py for such a rule:

-
+- Receives a case-level structure (list of documents + metadata), not a single doc.
+- Performs entity extraction across the relevant documents.
+- Builds the comparison matrix needed for that specific rule (not a generic everything-vs-everything matrix — only the fields this rule cares about).
+- Applies the consistency check defined in the rule.
+- Returns the verdict plus the matrix-cells that drove it (evidence).

-
+KC is the **builder** of this check.py / workflow, not the executor. The teaching below is about what to design, not what to do at runtime.

-
-|-------|-------------|-------------|----------------|---------------|
-| Applicant name | Zhang Wei | Zhang Wei | Zhang Wei | Zhang W. |
-| ID number | 310...1234 | 310...1234 | — | 310...1234 |
-| Monthly income | 85,000 | 82,000 | avg 43,000 | — |
-| Employer | ABC Corp | ABC Corp Ltd | — | ABC Corporation |
-| Employment start | 2019-03 | 2019-06 | — | 2019 |
+## Case patterns to design for

-
+Two patterns cover most cross-doc rules:

-
-|-------|---------------|------------|------------|
-| Total amount | 5,000,000 | — | 4,850,000 (sum of items) |
-| Party B name | Shenzhen XX Co. | Shenzhen XX Co., Ltd | Shenzhen XX Co. |
-| Effective date | 2024-01-15 | 2024-01-15 | 2024-02-01 |
+1. **Main contract + appendices/supplements.** Same issuer, formally cross-referenced. The main contract states the total, the appendices break it down. Issues to expect: totals that don't match line-item sums, referenced appendices missing or outdated, version mismatches between main body and supplements.

-
+2. **Multi-source bundle.** Loan application + income certificate + bank statement + credit report + property appraisal. Different issuers, different formats, same borrower/case. Issues to expect: identity drift across docs (name spellings, employer name variants), value contradictions (income claimed vs. observed), suspicious consistency (multiple docs with identical formatting/phrasing → possible fabrication).

-
+When authoring a rule, name the pattern it targets. The check.py shape differs between the two — the contract+appendices pattern expects formal references and reciprocal links; the multi-source pattern expects normalization (entity reconciliation across formats).

-
+## Contradiction taxonomy (what the rule may need to detect)

-
+A cross-doc rule may flag one or more of these:

-
-
--
+### Hard contradictions
+Exact mismatches in fields that must be identical across docs.
+- Identity field mismatch (name, ID, DoB).
+- Amount mismatch beyond tolerance (totals, line items, reported incomes).
+- Date mismatch on fields that should be identical (effective date, signing date).

-Hard contradictions are binary. Either the values match (within defined tolerance) or they
+Hard contradictions are binary. Either the values match (within defined tolerance) or they don't.

-### Soft
+### Soft contradictions
+Values not directly inconsistent but **implausible** when considered together.
+- Monthly income claimed at one level; bank statement averages significantly below it.
+- Property appraisal close to the loan amount (high LTV — technically possible, but a signal).
+- Employment dates with multi-month gaps across documents.

-
+Soft contradictions need thresholds and judgment. They are findings, not verdicts — the rule should specify whether they trigger FAIL, WARNING, or just evidence-collection for downstream review.

--
-
--
+### Cross-reference failures
+Structural integrity problems within formally linked documents.
+- Main contract references "Appendix B — Payment Schedule" but Appendix B is missing or has a different title.
+- Appendix references a clause that doesn't exist in the main contract.
+- Missing reciprocal reference: appendix references main contract, but main contract doesn't list appendix.
+- Version mismatch: main contract dated one date, appendix dated a much later date with different terms.

-
+See `references/contradiction-taxonomy.md` for the field-level taxonomy with tolerances and severity templates the rule designer can borrow.

-
+## Severity classification (rule-level decision)

-
+Not all contradictions carry the same weight. The rule must specify how its findings map to severity:

-
-
-
-
+| Severity | Typical example |
+|----------|-----------------|
+| Critical | Identity field mismatch (different ID across docs) |
+| High | Significant financial amount discrepancy |
+| Medium | Date or employment-detail mismatch |
+| Low | Formatting / abbreviation difference ("ABC Corp" vs "ABC Corporation") |

-
+The actual thresholds are domain-specific. The developer user sets them per field; the rule's design should leave room for the configured values rather than hardcode percentages.

-##
+## Fraud-signal patterns the rule may surface

-
+Some patterns are visible only at the case level:

-
-
-
-
-| **Medium** | Date or employment detail mismatch | Employment start date 3 months apart |
-| **Low** | Formatting or abbreviation difference | "ABC Corp" vs "ABC Corp Ltd" |
+- **Consistent small discrepancies across docs**: e.g., every doc inflated by ~5%, every date shifted by exactly one month. Consistency-in-error suggests coordinated fabrication rather than honest mistakes.
+- **Suspicious doc consistency**: multiple docs from supposedly different issuers using identical formatting / phrasing / typos.
+- **Values at regulatory thresholds**: LTV exactly at the limit, DTI exactly at the limit. One coincidence is harmless; a pattern across the case is a signal.
+- **Temporal impossibilities**: income certificate issued before employment start; bank statement covering a period before account opening; appraisal dated after loan disbursement.

-
+If the rule's job includes any of these, encode them. These are **signals for escalation**, not conclusions — the rule should output "flagged" with evidence, and let the developer user or downstream review process determine the final verdict.

-##
+## Designing the case-walk in check.py

-
+For a cross-doc rule, a typical check.py:

-
-
-
-
+1. Receives the case (list of doc paths/handles + classification metadata).
+2. Selects the **starting document** of the class declared in the rule.
+3. Extracts the **anchor entity** (using entity-extraction).
+4. Locates the **target document(s)** in the case by the anchor or by formal reference.
+5. Extracts the **target entity** from each target doc.
+6. Applies the **consistency check** per the rule's PASS criteria.
+7. Returns:
+   - verdict (PASS / FAIL / WARNING / NOT_APPLICABLE)
+   - evidence (the cells of the comparison matrix that drove the verdict, with source-doc + page references)
+   - confidence (per `confidence-system`)

-
-
-## Workflow Sequence
-
-The recommended sequence for cross-document verification within a case:
-
-1. **Identify the case boundary.** Which documents belong together? Use shared identifiers (borrower name, loan number, contract reference) to group documents into cases.
-2. **Extract anchors.** Run `entity-extraction` on each document independently. Collect the shared fields.
-3. **Build the matrix.** Populate the comparison matrix. Flag empty cells.
-4. **Detect contradictions.** Apply hard/soft/cross-reference checks per `references/contradiction-taxonomy.md`.
-5. **Classify severity.** Assign severity per the configured thresholds.
-6. **Scan for fraud signals.** Run pattern checks across the matrix.
-7. **Produce the report.** Output the case-level consistency report with all findings.
+The comparison matrix is rule-specific — only the fields this rule needs. Building a generic "all entities × all docs" matrix is expensive and pollutes the evidence trail.

 ## Integration

-**Inputs:**
--
--
+**Inputs to the rule's check.py at runtime:**
+- Case structure (set of related documents + identifiers).
+- Entity extraction available per-doc (skill: `entity-extraction`).
+- Compliance judgment may have already produced per-doc verdicts (skill: `compliance-judgment`) that this rule can use as part of its inputs.

 **Outputs:**
--
+- A case-level verdict plus the comparison-matrix slice that justifies it.

 **Feeds into:**
-- `confidence-system`:
-- `evolution-loop`:
-- `dashboard-reporting`:
+- `confidence-system`: cross-doc contradictions are a strong signal, often lowering confidence in affected fields.
+- `evolution-loop`: recurring contradiction patterns the rule didn't anticipate trigger workflow refinement.
+- `dashboard-reporting`: case-level view alongside per-doc results.
+
+## Ground-truth principle

-
+When the developer user or end user reports a cross-doc inconsistency that this rule missed, that report is prior to agent judgment. Log it, treat it as a corner case (`corner-case-management`) or as a trigger for rule refinement, and adjust detection thresholds accordingly. The rule's job is to catch what humans catch — and then go further.