kc-beta 0.1.0

Files changed (141)
  1. package/bin/kc-beta.js +16 -0
  2. package/package.json +32 -0
  3. package/src/agent/confidence-scorer.js +120 -0
  4. package/src/agent/context.js +124 -0
  5. package/src/agent/corner-case-registry.js +119 -0
  6. package/src/agent/engine.js +224 -0
  7. package/src/agent/events.js +27 -0
  8. package/src/agent/history.js +101 -0
  9. package/src/agent/llm-client.js +131 -0
  10. package/src/agent/pipelines/base.js +14 -0
  11. package/src/agent/pipelines/distillation.js +113 -0
  12. package/src/agent/pipelines/extraction.js +92 -0
  13. package/src/agent/pipelines/index.js +23 -0
  14. package/src/agent/pipelines/initializer.js +163 -0
  15. package/src/agent/pipelines/production-qc.js +99 -0
  16. package/src/agent/pipelines/skill-authoring.js +83 -0
  17. package/src/agent/pipelines/skill-testing.js +111 -0
  18. package/src/agent/tools/agent-tool.js +100 -0
  19. package/src/agent/tools/base.js +35 -0
  20. package/src/agent/tools/dashboard-render.js +146 -0
  21. package/src/agent/tools/document-parse.js +184 -0
  22. package/src/agent/tools/document-search.js +111 -0
  23. package/src/agent/tools/evolution-cycle.js +150 -0
  24. package/src/agent/tools/qc-sample.js +94 -0
  25. package/src/agent/tools/registry.js +55 -0
  26. package/src/agent/tools/rule-catalog.js +113 -0
  27. package/src/agent/tools/sandbox-exec.js +106 -0
  28. package/src/agent/tools/tier-downgrade.js +114 -0
  29. package/src/agent/tools/worker-llm-call.js +109 -0
  30. package/src/agent/tools/workflow-run.js +138 -0
  31. package/src/agent/tools/workspace-file.js +122 -0
  32. package/src/agent/version-manager.js +130 -0
  33. package/src/agent/workspace.js +82 -0
  34. package/src/cli/components.js +164 -0
  35. package/src/cli/index.js +329 -0
  36. package/src/cli/init.js +80 -0
  37. package/src/cli/onboard.js +182 -0
  38. package/src/cli/terminal.js +143 -0
  39. package/src/config.js +93 -0
  40. package/template/.env.template +31 -0
  41. package/template/CLAUDE.md +137 -0
  42. package/template/Input/.gitkeep +0 -0
  43. package/template/Output/.gitkeep +0 -0
  44. package/template/Rules/.gitkeep +0 -0
  45. package/template/Samples/.gitkeep +0 -0
  46. package/template/skills/en/meta/compliance-judgment/SKILL.md +114 -0
  47. package/template/skills/en/meta/compliance-judgment/references/output-format.md +151 -0
  48. package/template/skills/en/meta/confidence-system/SKILL.md +117 -0
  49. package/template/skills/en/meta/corner-case-management/SKILL.md +111 -0
  50. package/template/skills/en/meta/cross-document-verification/SKILL.md +131 -0
  51. package/template/skills/en/meta/cross-document-verification/references/contradiction-taxonomy.md +73 -0
  52. package/template/skills/en/meta/data-sensibility/SKILL.md +115 -0
  53. package/template/skills/en/meta/document-parsing/SKILL.md +108 -0
  54. package/template/skills/en/meta/document-parsing/references/parser-catalog.md +40 -0
  55. package/template/skills/en/meta/entity-extraction/SKILL.md +129 -0
  56. package/template/skills/en/meta/tree-processing/SKILL.md +103 -0
  57. package/template/skills/en/meta-meta/bootstrap-workspace/SKILL.md +70 -0
  58. package/template/skills/en/meta-meta/dashboard-reporting/SKILL.md +106 -0
  59. package/template/skills/en/meta-meta/dashboard-reporting/scripts/generate_dashboard.py +178 -0
  60. package/template/skills/en/meta-meta/evolution-loop/SKILL.md +210 -0
  61. package/template/skills/en/meta-meta/evolution-loop/references/convergence-guide.md +62 -0
  62. package/template/skills/en/meta-meta/quality-control/SKILL.md +138 -0
  63. package/template/skills/en/meta-meta/quality-control/references/qa-layers.md +92 -0
  64. package/template/skills/en/meta-meta/quality-control/references/sampling-strategies.md +76 -0
  65. package/template/skills/en/meta-meta/rule-extraction/SKILL.md +100 -0
  66. package/template/skills/en/meta-meta/rule-extraction/references/chunking-strategies.md +80 -0
  67. package/template/skills/en/meta-meta/rule-graph/SKILL.md +118 -0
  68. package/template/skills/en/meta-meta/skill-authoring/SKILL.md +108 -0
  69. package/template/skills/en/meta-meta/skill-authoring/references/skill-format-spec.md +78 -0
  70. package/template/skills/en/meta-meta/skill-to-workflow/SKILL.md +150 -0
  71. package/template/skills/en/meta-meta/skill-to-workflow/references/worker-llm-catalog.md +50 -0
  72. package/template/skills/en/meta-meta/task-decomposition/SKILL.md +129 -0
  73. package/template/skills/en/meta-meta/task-decomposition/references/decision-matrix.md +81 -0
  74. package/template/skills/en/meta-meta/version-control/SKILL.md +152 -0
  75. package/template/skills/en/meta-meta/version-control/references/trace-id-spec.md +79 -0
  76. package/template/skills/en/skill-creator/LICENSE.txt +202 -0
  77. package/template/skills/en/skill-creator/SKILL.md +479 -0
  78. package/template/skills/en/skill-creator/agents/analyzer.md +274 -0
  79. package/template/skills/en/skill-creator/agents/comparator.md +202 -0
  80. package/template/skills/en/skill-creator/agents/grader.md +223 -0
  81. package/template/skills/en/skill-creator/assets/eval_review.html +146 -0
  82. package/template/skills/en/skill-creator/eval-viewer/generate_review.py +471 -0
  83. package/template/skills/en/skill-creator/eval-viewer/viewer.html +1325 -0
  84. package/template/skills/en/skill-creator/references/schemas.md +430 -0
  85. package/template/skills/en/skill-creator/scripts/__init__.py +0 -0
  86. package/template/skills/en/skill-creator/scripts/aggregate_benchmark.py +401 -0
  87. package/template/skills/en/skill-creator/scripts/generate_report.py +326 -0
  88. package/template/skills/en/skill-creator/scripts/improve_description.py +248 -0
  89. package/template/skills/en/skill-creator/scripts/package_skill.py +136 -0
  90. package/template/skills/en/skill-creator/scripts/quick_validate.py +103 -0
  91. package/template/skills/en/skill-creator/scripts/run_eval.py +310 -0
  92. package/template/skills/en/skill-creator/scripts/run_loop.py +332 -0
  93. package/template/skills/en/skill-creator/scripts/utils.py +47 -0
  94. package/template/skills/zh/meta/compliance-judgment/SKILL.md +303 -0
  95. package/template/skills/zh/meta/compliance-judgment/references/output-format.md +151 -0
  96. package/template/skills/zh/meta/confidence-system/SKILL.md +228 -0
  97. package/template/skills/zh/meta/corner-case-management/SKILL.md +235 -0
  98. package/template/skills/zh/meta/cross-document-verification/SKILL.md +241 -0
  99. package/template/skills/zh/meta/cross-document-verification/references/contradiction-taxonomy.md +73 -0
  100. package/template/skills/zh/meta/data-sensibility/SKILL.md +235 -0
  101. package/template/skills/zh/meta/document-parsing/SKILL.md +168 -0
  102. package/template/skills/zh/meta/document-parsing/references/parser-catalog.md +40 -0
  103. package/template/skills/zh/meta/entity-extraction/SKILL.md +276 -0
  104. package/template/skills/zh/meta/tree-processing/SKILL.md +233 -0
  105. package/template/skills/zh/meta-meta/bootstrap-workspace/SKILL.md +147 -0
  106. package/template/skills/zh/meta-meta/dashboard-reporting/SKILL.md +281 -0
  107. package/template/skills/zh/meta-meta/dashboard-reporting/scripts/generate_dashboard.py +178 -0
  108. package/template/skills/zh/meta-meta/evolution-loop/SKILL.md +302 -0
  109. package/template/skills/zh/meta-meta/evolution-loop/references/convergence-guide.md +62 -0
  110. package/template/skills/zh/meta-meta/quality-control/SKILL.md +269 -0
  111. package/template/skills/zh/meta-meta/quality-control/references/qa-layers.md +92 -0
  112. package/template/skills/zh/meta-meta/quality-control/references/sampling-strategies.md +76 -0
  113. package/template/skills/zh/meta-meta/rule-extraction/SKILL.md +208 -0
  114. package/template/skills/zh/meta-meta/rule-extraction/references/chunking-strategies.md +80 -0
  115. package/template/skills/zh/meta-meta/rule-graph/SKILL.md +203 -0
  116. package/template/skills/zh/meta-meta/skill-authoring/SKILL.md +235 -0
  117. package/template/skills/zh/meta-meta/skill-authoring/references/skill-format-spec.md +78 -0
  118. package/template/skills/zh/meta-meta/skill-to-workflow/SKILL.md +275 -0
  119. package/template/skills/zh/meta-meta/skill-to-workflow/references/worker-llm-catalog.md +50 -0
  120. package/template/skills/zh/meta-meta/task-decomposition/SKILL.md +224 -0
  121. package/template/skills/zh/meta-meta/task-decomposition/references/decision-matrix.md +81 -0
  122. package/template/skills/zh/meta-meta/version-control/SKILL.md +284 -0
  123. package/template/skills/zh/meta-meta/version-control/references/trace-id-spec.md +79 -0
  124. package/template/skills/zh/skill-creator/LICENSE.txt +202 -0
  125. package/template/skills/zh/skill-creator/SKILL.md +479 -0
  126. package/template/skills/zh/skill-creator/agents/analyzer.md +274 -0
  127. package/template/skills/zh/skill-creator/agents/comparator.md +202 -0
  128. package/template/skills/zh/skill-creator/agents/grader.md +223 -0
  129. package/template/skills/zh/skill-creator/assets/eval_review.html +146 -0
  130. package/template/skills/zh/skill-creator/eval-viewer/generate_review.py +471 -0
  131. package/template/skills/zh/skill-creator/eval-viewer/viewer.html +1325 -0
  132. package/template/skills/zh/skill-creator/references/schemas.md +430 -0
  133. package/template/skills/zh/skill-creator/scripts/__init__.py +0 -0
  134. package/template/skills/zh/skill-creator/scripts/aggregate_benchmark.py +401 -0
  135. package/template/skills/zh/skill-creator/scripts/generate_report.py +326 -0
  136. package/template/skills/zh/skill-creator/scripts/improve_description.py +248 -0
  137. package/template/skills/zh/skill-creator/scripts/package_skill.py +136 -0
  138. package/template/skills/zh/skill-creator/scripts/quick_validate.py +103 -0
  139. package/template/skills/zh/skill-creator/scripts/run_eval.py +310 -0
  140. package/template/skills/zh/skill-creator/scripts/run_loop.py +332 -0
  141. package/template/skills/zh/skill-creator/scripts/utils.py +47 -0
@@ -0,0 +1,115 @@
1
+ ---
2
+ name: data-sensibility
3
+ description: Build intuition about document data before writing extraction logic. Use before designing any extraction schema or regex pattern, when onboarding a new document type, or when extraction accuracy is unexpectedly low and you suspect a data assumption is wrong. Covers systematic observation of raw documents, spot-checking extracted results, distribution analysis, and recognizing suspicious patterns. If you are about to write code that touches document data and you have not read at least five documents end-to-end, stop and use this skill first.
4
+ ---
5
+
6
+ # Data Sensibility
7
+
8
+ The most expensive errors in document verification come from assumptions about data that do not hold. You assume dates are in YYYY-MM-DD because the first three documents used it — then document four uses DD/MM/YYYY and every extraction is silently wrong. Data sensibility is the discipline of looking before coding.
9
+
10
+ ## Front-Load Observation
11
+
12
+ Read 3-5 complete documents end-to-end before writing any extraction logic. Not skim — read. Open the raw parsed text, start at page one, and go to the end.
13
+
14
+ While reading, note what surprises you:
15
+ - A field you expected in a table is buried in a paragraph.
16
+ - Amounts appear in both Chinese uppercase and digits, sometimes on the same page.
17
+ - The document has two signature pages, not one.
18
+ - Section headings use inconsistent numbering schemes.
19
+ - A "standard" field like interest rate is expressed three different ways across five documents.
20
+
21
+ Read with a text editor, not a PDF viewer. You need to see what the extraction code will see — parsed text with all its artifacts, not the pretty rendered version. If your pipeline starts with OCR, read the OCR output.
22
+
23
+ Do this for each new document type. Do it again when document sources change. 30 minutes of reading saves 3 hours of debugging extraction logic that was built on wrong assumptions.
24
+
25
+ ## Systematic Observation Checklist
26
+
27
+ After reading, answer these questions explicitly and in writing, not just in your head:
28
+
29
+ **What is consistent across all documents?**
30
+ Header structure, field positions, terminology, date formats. These are your anchors. Design extraction around them.
31
+
32
+ **What varies?**
33
+ Table layouts, section ordering, field presence, formatting conventions. These are your risk points. Every variant needs a test case.
34
+
35
+ **What is surprising?**
36
+ Anything you did not expect. A field that is sometimes missing. A value expressed in different units across documents. A section that appears in some templates but not others.
37
+
38
+ **Document subtypes?**
39
+ Are there different templates, issuers, or time periods represented? A "loan contract" from Bank A may look nothing like one from Bank B. Identify subtypes early — they often need separate extraction paths.
40
+
41
+ **Section lengths?**
42
+ Measure them. A section that averages 200 tokens is fine for any model. A section that occasionally runs to 8,000 tokens will blow your context window budget. Plan accordingly.
43
+
44
+ **Encoding issues?**
45
+ Full-width vs half-width characters (12.5% vs 12.5%). Unicode normalization problems. OCR artifacts. These cause silent extraction failures because the text looks correct to human eyes but does not match regex patterns.
46
+
47
+ ## Spot-Check Protocol
48
+
49
+ After any extraction run, pick 10 random fields and verify manually against the source document. Not the easy ones — select across the confidence spectrum: 3 high-confidence, 4 medium, 3 low.
50
+
51
+ For each field, check:
52
+ - **Value correct?** Does the extracted value match what the document says?
53
+ - **Source location correct?** Did the extraction find the value in the right place, or did it grab a similar value from somewhere else?
54
+ - **Normalization correct?** If the raw text says "伍仟万元" and your output says 50000000, is that right?
55
+ - **Type correct?** Did a percentage end up as a decimal? Did a date parse into the wrong century?
56
+
57
+ If more than 1 out of 10 is wrong, stop. Do not continue processing the batch. Investigate the failures, identify the pattern, fix the extraction, and re-run. A spot-check error rate above 10% likely understates the overall error rate: you have not seen the errors that look correct but are not.
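The 3/4/3 selection and the stop rule can be sketched as below. The confidence cut-offs (0.9 and 0.6), the `confidence` key, and the function names are assumptions made for illustration; use whatever bands the confidence system actually defines.

```python
import random

def confidence_band(confidence):
    """Map a score to a band. The 0.9 / 0.6 cut-offs are assumed, not prescribed."""
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.6:
        return "medium"
    return "low"

def pick_spot_check(fields, quotas=None, seed=None):
    """Pick a stratified random sample: 3 high, 4 medium, 3 low by default."""
    quotas = quotas or {"high": 3, "medium": 4, "low": 3}
    rng = random.Random(seed)
    buckets = {"high": [], "medium": [], "low": []}
    for field in fields:
        buckets[confidence_band(field["confidence"])].append(field)
    sample = []
    for band, quota in quotas.items():
        pool = buckets.get(band, [])
        # Take fewer than the quota if the band does not have enough fields.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample

def should_stop(num_wrong, max_wrong=1):
    """Stop the batch when more than `max_wrong` sampled fields are wrong."""
    return num_wrong > max_wrong
```

Passing a fixed `seed` makes the sample reproducible, which helps when two reviewers need to check the same ten fields.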
58
+
59
+ ## Distribution Visualization
60
+
61
+ After extraction, visualize key field distributions. You do not need fancy tools — basic Python is enough.
62
+
63
+ **Field lengths.** If a "contract number" field is usually 12-16 characters but some are 3 or 200, those outliers are almost certainly wrong extractions.
64
+
65
+ **Value frequencies.** Run `Counter(values).most_common(20)` on categorical fields. Suspicious concentrations (one value appearing 90% of the time when you expect diversity) suggest extraction is grabbing a header or default value instead of the actual data.
66
+
67
+ **Missing rates.** If a field that should be present in every document is missing in 30% of results, your extraction is failing silently. Missing data is the easiest signal to detect and the most commonly ignored.
68
+
69
+ **Numeric ranges.** Plot or bucket numeric fields. Loan amounts should be positive and within a plausible range for the business. Interest rates expressed in percent should typically fall between 0 and 30; a value like 0.03 suggests a fraction that was never multiplied by 100, and 3000 suggests a value grabbed from the wrong column. Even a quick histogram of numeric field values will reveal bimodal distributions (two document subtypes?), impossible values (negative loan amounts?), or truncation artifacts.
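The four checks above need nothing beyond the standard library. This is a minimal sketch; the function names, the 12-16 character expectation, and the sample values are illustrative assumptions.

```python
from collections import Counter

def profile_field(values, expected_len=(12, 16)):
    """Summarize one extracted field: missing rate, length outliers, top values."""
    present = [v for v in values if v not in (None, "")]
    missing_rate = 1 - len(present) / len(values) if values else 0.0
    low, high = expected_len
    length_outliers = [v for v in present if not low <= len(str(v)) <= high]
    return {
        "missing_rate": missing_rate,
        "length_outliers": length_outliers,
        "top_values": Counter(present).most_common(20),
    }

def out_of_range(numbers, low, high):
    """Return numeric values that fall outside the plausible range."""
    return [x for x in numbers if not low <= x <= high]

contract_numbers = ["HT2024-0000001", "HT2024-0000002", "", None, "ABC"]
print(profile_field(contract_numbers))
# The bounds here are assumed for the example, in percent.
print(out_of_range([4.35, 0.03, 3000, 12.0], low=0.5, high=30))
```

The plausible range for a numeric field is a business decision; the bounds in the last call are placeholders for that decision.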
70
+
71
+ ## Smelly Patterns
72
+
73
+ Some extracted values look correct individually but form suspicious patterns in aggregate:
74
+
75
+ - **Suspiciously round amounts.** Every loan is exactly 1,000,000.00 or 5,000,000.00. Possible in reality (banks do round), but worth verifying — you might be extracting a template placeholder.
76
+ - **Date clusters.** All contracts signed on the 1st of the month. All documents dated the same day. May indicate you are extracting a print date instead of a signature date.
77
+ - **Identical unique fields.** Multiple documents sharing the same contract number or borrower ID. Either a data quality issue upstream or an extraction error.
78
+ - **Values at regulatory thresholds.** Capital adequacy at exactly 8.00%. Loan-to-value at exactly 70.00%. These may be real (institutions manage to thresholds), or they may be artifacts of extracting the threshold definition instead of the actual value.
79
+ - **Perfect consistency.** Every single document passes every single rule. Real data has exceptions. If your system reports 100% compliance, your system is probably wrong.
80
+
81
+ These patterns may be real. They may be artifacts. The point is to notice them and investigate.
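A few of these aggregate checks are easy to automate so that they run after every batch. The sketch below is one possible shape; the field name `contract_number`, the rounding step, and the function names are assumptions, and a high share is a prompt to investigate rather than proof of an error.

```python
from collections import Counter

def duplicate_ids(records, key="contract_number"):
    """Values of a supposedly unique field that appear in more than one record."""
    counts = Counter(r[key] for r in records if r.get(key))
    return sorted(value for value, n in counts.items() if n > 1)

def round_amount_share(amounts, step=1_000_000):
    """Fraction of amounts that are exact multiples of `step`."""
    if not amounts:
        return 0.0
    return sum(1 for a in amounts if a % step == 0) / len(amounts)

def dominant_date_share(dates):
    """Fraction of documents that share the single most common date."""
    if not dates:
        return 0.0
    return Counter(dates).most_common(1)[0][1] / len(dates)
```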
82
+
83
+ ## Intermediate Output Materialization
84
+
85
+ Save every processing stage to disk:
86
+ 1. Raw parsed text (from document parsing).
87
+ 2. Tree-chunked sections (from tree processing).
88
+ 3. Extracted entities (from entity extraction).
89
+ 4. Judgment results (from compliance judgment).
90
+
91
+ Write intermediate results as JSON files in a `debug/` directory, named by document ID and stage:
92
+
93
+ ```
94
+ debug/
95
+ DOC001_01_raw_text.json
96
+ DOC001_02_tree_sections.json
97
+ DOC001_03_extracted_entities.json
98
+ DOC001_04_judgments.json
99
+ ```
100
+
101
+ Disk is cheap. Debugging without intermediates is guesswork.
102
+
103
+ When something goes wrong — and it will — you can inspect each stage independently. Was the section split wrong? Was the extraction correct but the judgment wrong? Was the raw text garbled by OCR? Without intermediates, you are reduced to staring at the final output and guessing.
104
+
105
+ Keep intermediates for at least the current iteration. Delete old iterations only when disk space becomes a real constraint.
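A small helper keeps the naming scheme consistent across stages. This is a sketch under the naming convention shown above; the stage keys and the function name `save_stage` are assumptions.

```python
import json
from pathlib import Path

# Stage order follows the four stages listed above.
STAGES = {
    "raw_text": 1,
    "tree_sections": 2,
    "extracted_entities": 3,
    "judgments": 4,
}

def save_stage(doc_id, stage, payload, debug_dir="debug"):
    """Write one stage's output to debug/<doc_id>_<NN>_<stage>.json and return the path."""
    out_dir = Path(debug_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{doc_id}_{STAGES[stage]:02d}_{stage}.json"
    # ensure_ascii=False keeps Chinese text readable in the debug files.
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Calling `save_stage("DOC001", "raw_text", {"text": "..."})` after each stage yields the layout shown in the listing.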
106
+
107
+ ## Integration
108
+
109
+ Feed your observations into downstream skills:
110
+ - `entity-extraction`: Your observations about data formats, field variants, and encoding issues directly inform schema design and regex patterns.
111
+ - `skill-authoring`: Edge cases and known variants you discover here become test cases and special handling in rule skills.
112
+ - `confidence-system`: Your spot-check results and distribution analysis calibrate initial confidence expectations.
113
+ - `evolution-loop`: Revisit data sensibility whenever the evolution loop reveals unexpected failures — the data may have changed, or your initial observations may have been incomplete.
114
+
115
+ Data sensibility is not a one-time activity. Every new batch, every new document source, every format change is an occasion to look before coding.
@@ -0,0 +1,108 @@
1
+ ---
2
+ name: document-parsing
3
+ description: Parse source documents into machine-readable text with maximum fidelity. Use when processing any document in Samples/ or Input/ for the first time, when parsed text quality is poor, or when tables and charts need special handling. Covers multi-level parser selection from simple text extraction to OCR and vision models. Also use when a verification rule fails due to parsing issues (garbled text, missing tables, mangled layouts) and the parser needs to be upgraded for that document type.
4
+ ---
5
+
6
+ # Document Parsing
7
+
8
+ Parsing is the foundation. If the text is wrong, everything downstream is wrong. But parsing is also a cost center — do not use expensive vision models when simple text extraction works.
9
+
10
+ ## The Minimum Viable Parser Principle
11
+
12
+ Start with the simplest parser. Escalate only when necessary. This is not only about saving money; it is also about producing the most reliable output. Simple parsers have fewer failure modes.
13
+
14
+ ### Level 1: Direct Text Extraction
15
+ - Tool: pymupdf (PyMuPDF) or similar PDF text extraction.
16
+ - When: Well-formed digital PDFs with embedded text. This covers most modern business documents.
17
+ - Output: Raw text with basic structure preserved (paragraphs, basic formatting).
18
+ - Limitations: Tables may come out as messy text. Charts and images are invisible. Scanned PDFs produce nothing.
19
+
20
+ ### Level 2: Layout-Aware Extraction
21
+ - Tool: pdfplumber or similar layout-aware parser.
22
+ - When: Level 1 produces messy table output, or when preserving spatial layout matters (forms, multi-column documents).
23
+ - Output: Text with table detection and cell-level extraction.
24
+ - Limitations: Still text-based. Cannot handle scanned content.
25
+
26
+ ### Level 3: OCR
27
+ - Tool: Vision-capable models from OCR_MODEL_TIER in `.env` (PaddleOCR-VL, GLM-4.6V, etc.).
28
+ - When: Scanned PDFs, image-based PDFs, or PDFs where Level 1-2 produce garbled/incomplete text.
29
+ - Output: Recognized text from images.
30
+ - Limitations: Slower, costs API calls, may introduce OCR errors.
31
+
32
+ ### Level 4: Vision Model Interpretation
33
+ - Tool: High-capability vision models (OCR_MODEL_TIER1).
34
+ - When: Complex tables that text extraction cannot parse correctly, charts that need data point extraction, mixed text-and-image layouts.
35
+ - Output: Structured interpretation of visual content (table as markdown, chart data as JSON).
36
+ - Limitations: Expensive, slow. Reserve for when the visual content genuinely needs interpretation.
37
+
38
+ ## Quality Detection
39
+
40
+ How to know when to escalate:
41
+
42
+ - **Low character count**: The document has pages but extracted text is very short. Likely a scanned PDF.
43
+ - **Garbled text**: Unusual character sequences, encoding errors, or meaningless text patterns.
44
+ - **Missing expected sections**: The table of contents mentions Chapter 5 but no Chapter 5 text was extracted.
45
+ - **Table artifacts**: Columns of numbers without alignment, cell content mixed with headers, or table borders appearing as characters.
46
+ - **Missing numbers in financial tables**: If a financial document's key metrics are not in the extracted text, the tables were probably not parsed.
47
+
48
+ Write a quick quality check after parsing and before proceeding. If quality is insufficient, escalate to the next parser level.
49
+
50
+ ### Parse Quality Score
51
+
52
+ Compute a quality score (0.0 to 1.0) from weighted heuristics to make escalation decisions systematic rather than ad-hoc. A recommended starting framework:
53
+
54
+ - **Character density** (weight ~0.3): actual character count / expected characters for the document's page count. A 10-page PDF that yields only 200 characters likely failed.
55
+ - **Garble ratio** (weight ~0.2): fraction of characters that are common CJK/Latin characters rather than control characters, unusual sequences, or encoding artifacts (higher is better).
56
+ - **Section completeness** (weight ~0.3): if the document has a table of contents, what fraction of TOC entries have matching content in the extracted text?
57
+ - **Table integrity** (weight ~0.2): for financial documents, are key numeric values that should appear in tables actually present in the extracted text?
58
+
59
+ **Escalation thresholds** (recommended defaults — adjust freely):
60
+ - Score >= 0.7: accept this parser level, proceed to downstream processing.
61
+ - Score from 0.4 up to (but not including) 0.7: escalate to the next parser level, re-parse, re-score.
62
+ - Score < 0.4: skip directly to Level 3 (OCR) or Level 4 (vision) depending on document characteristics.
63
+
64
+ **Lock-in**: once a parser level produces an acceptable score for a document type, record that level. Do not re-evaluate unless a downstream verification failure is traced back to a parsing issue.
65
+
66
+ These weights, thresholds, and the scoring approach itself are starting points. The coding agent should design whatever quality assessment works for the specific document types at hand — a simple pass/fail heuristic may be sufficient for some scenarios; a more nuanced scoring function may be needed for others. The important pattern is: **measure quality → compare to threshold → decide whether to escalate**.
67
+
68
+ This follows the same tier-transition pattern as model tier selection in `skill-to-workflow`: a quality/accuracy score drives the decision to stay, escalate, or skip tiers.
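The measure, compare, decide pattern can be sketched as follows, using the recommended weights and thresholds as defaults. The signal names, the 1,500 characters-per-page expectation, the `exhausted` outcome at Level 4, and the choice to jump to Level 3 (rather than Level 4) on a very low score are all assumptions made for illustration.

```python
WEIGHTS = {
    "char_density": 0.3,
    "clean_char_ratio": 0.2,
    "section_completeness": 0.3,
    "table_integrity": 0.2,
}

def char_density(char_count, page_count, expected_chars_per_page=1500):
    """Actual characters over expected characters, capped at 1.0."""
    if page_count <= 0:
        return 0.0
    return min(1.0, char_count / (page_count * expected_chars_per_page))

def parse_quality_score(signals, weights=WEIGHTS):
    """Weighted average of heuristic signals, each clamped to [0, 1]."""
    total = sum(weights.values())
    clamped = {name: min(max(signals[name], 0.0), 1.0) for name in weights}
    return sum(weights[name] * clamped[name] for name in weights) / total

def next_action(score, level, accept_at=0.7, skip_below=0.4, max_level=4):
    """Decide whether to accept the current parser level or escalate."""
    if score >= accept_at:
        return ("accept", level)
    if level >= max_level:
        return ("exhausted", level)  # nothing higher to try; caller decides
    if score >= skip_below:
        return ("escalate", level + 1)
    return ("escalate", max(level + 1, 3))  # very low score: jump to OCR or beyond
```

For a document type where a pass/fail check is enough, `parse_quality_score` can be replaced by a single signal without changing `next_action`.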
69
+
70
+ ## Table Handling
71
+
72
+ Tables are critical in financial documents (balance sheets, ratio tables, compliance metrics). They deserve special attention:
73
+
74
+ 1. **Detection**: Identify table regions. Look for grid patterns, consistent column spacing, or explicit table markers.
75
+ 2. **Extraction**: Extract cell-by-cell content. Preserve the row-column relationship.
76
+ 3. **Reconstruction**: Convert to a structured format (markdown table, JSON array of rows, or CSV).
77
+ 4. **Validation**: Spot-check that key values in the reconstructed table match what is visible in the document.
78
+
79
+ When the standard parser fails on tables, try the vision model approach: send the table image (cropped from the PDF page) to a vision model and ask it to produce a markdown table.
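The reconstruction step can be as simple as the sketch below, which assumes the extraction step produced a list of rows with the header first and `None` for empty cells (the general shape a layout-aware parser such as pdfplumber tends to return). The function name is illustrative.

```python
def rows_to_markdown(rows):
    """Convert a list of rows (first row is the header) into a markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        cells = []
        for cell in row:
            text = "" if cell is None else str(cell)
            # Escape pipes and flatten line breaks so each row stays on one line.
            cells.append(text.replace("|", "\\|").replace("\n", " ").strip())
        cells += [""] * (width - len(cells))  # pad ragged rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)

print(rows_to_markdown([["Metric", "Value"], ["资本充足率", "12.5%"]]))
```

Ragged rows are padded rather than rejected, so the validation step should still compare key values against the source page.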
80
+
81
+ ## Chart Handling
82
+
83
+ Charts (bar charts, line charts, pie charts) occasionally contain data needed for verification:
84
+
85
+ - Extract the chart image from the document.
86
+ - Send to a vision model with a prompt: "Extract the data points, labels, and values from this chart. Return as a JSON array."
87
+ - Validate the extracted data against any nearby text or table that might contain the same numbers.
88
+
89
+ This is expensive. Only do it when a verification rule specifically requires data from a chart and that data is not available in text elsewhere in the document.
90
+
91
+ ## Output Format
92
+
93
+ Parsed documents should be saved as clean markdown:
94
+
95
+ - Preserve the document's heading hierarchy (# Chapter, ## Section, ### Subsection).
96
+ - Preserve lists, numbered or bulleted.
97
+ - Convert tables to markdown table format.
98
+ - Note page boundaries if relevant (some rules reference specific pages).
99
+ - Strip noise: headers, footers, page numbers, watermarks (unless a rule specifically checks for them).
100
+
101
+ Save parsed output alongside the original document for reuse across rules.
102
+
103
+ ## Caching
104
+
105
+ Parsing is expensive (especially Level 3-4). Cache parsed output:
106
+ - Store the parsed markdown alongside the original file.
107
+ - Track which parser level produced it.
108
+ - Re-parse only when: the original file changes, a rule requires higher-quality parsing than what is cached, or a verification failure is traced back to a parsing issue.
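A content hash plus a small metadata file is enough to implement these rules. The sketch below is one possible layout; the `.parsed.md` and `.parsed.meta.json` suffixes and the function names are assumptions, not the package's actual convention.

```python
import hashlib
import json
from pathlib import Path

def file_digest(path):
    """SHA-256 of the source file, used to detect changes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def cache_paths(source):
    """Cache files live next to the original document."""
    source = Path(source)
    return (
        source.with_name(source.name + ".parsed.md"),
        source.with_name(source.name + ".parsed.meta.json"),
    )

def store_cache(source, markdown, parser_level):
    """Save parsed markdown and record which parser level produced it."""
    md_path, meta_path = cache_paths(source)
    md_path.write_text(markdown, encoding="utf-8")
    meta = {"sha256": file_digest(source), "parser_level": parser_level}
    meta_path.write_text(json.dumps(meta), encoding="utf-8")

def load_cached(source, min_level=1):
    """Return cached markdown, or None if it is missing, stale, or too low-level."""
    md_path, meta_path = cache_paths(source)
    if not (md_path.exists() and meta_path.exists()):
        return None
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    if meta["sha256"] != file_digest(source):
        return None  # original file changed
    if meta["parser_level"] < min_level:
        return None  # a rule needs higher-quality parsing than what is cached
    return md_path.read_text(encoding="utf-8")
```

When a verification failure is traced back to parsing, the caller can force a re-parse by requesting a `min_level` above the recorded one.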
@@ -0,0 +1,40 @@
1
+ # Parser Catalog
2
+
3
+ ## Text-Based Parsers (No LLM Required)
4
+
5
+ | Parser | Type | Strengths | Limitations | Install |
6
+ |--------|------|-----------|-------------|---------|
7
+ | PyMuPDF (fitz) | Text extraction | Fast, reliable, basic structure | No table awareness, no OCR | `pip install pymupdf` |
8
+ | pdfplumber | Layout-aware | Good table detection, spatial layout | Text-only, no OCR | `pip install pdfplumber` |
9
+ | python-docx | DOCX parser | Native DOCX support, preserves structure | DOCX only | `pip install python-docx` |
10
+ | openpyxl | XLSX parser | Full spreadsheet support | XLSX only | `pip install openpyxl` |
11
+ | MarkItDown | Multi-format | Handles PDF, DOCX, PPTX, XLSX → markdown | Basic parsing, may miss complex layouts | `pip install markitdown` |
12
+
13
+ ## OCR / Vision Models (Via SiliconFlow API)
14
+
15
+ | Model | Tier | Strengths | Best For |
16
+ |-------|------|-----------|----------|
17
+ | zai-org/GLM-4.6V | OCR_TIER1 | Best accuracy, strong Chinese OCR | Complex tables, mixed layouts |
18
+ | Qwen/Qwen3.5-397B-A17B | OCR_TIER2 | Good general vision, large model | Tables with context-dependent interpretation |
19
+ | PaddlePaddle/PaddleOCR-VL-1.5 | OCR_TIER3 | Fast, lightweight | Standard text, simple tables |
20
+
21
+ ## Local Deployment Options
22
+
23
+ For developer users who prefer local processing:
24
+
25
+ | Tool | Type | Notes |
26
+ |------|------|-------|
27
+ | PaddleOCR | Local OCR | Open source, supports Chinese/English |
28
+ | Surya | Local OCR | Modern OCR with table detection |
29
+ | pdf2md-local | PDF → Markdown | Reference: github.com/Ruilin-mmwa/pdf2md-local |
30
+
31
+ ## Selection Decision Tree
32
+
33
+ ```
34
+ Is the PDF text-based (not scanned)?
35
+ ├─ Yes → PyMuPDF or pdfplumber
36
+ │ └─ Are tables parsed correctly?
37
+ │ ├─ Yes → Done
38
+ │ └─ No → Try pdfplumber → If still bad → Vision model on table regions
39
+ └─ No (scanned) → OCR_TIER3 → If quality insufficient → OCR_TIER1
40
+ ```
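The tree can also be read as two escalation chains, one per branch. The sketch below encodes that reading; the chain labels are plain strings chosen for illustration and the function name is an assumption.

```python
# Escalation chains read off the decision tree above.
TEXT_CHAIN = ["pymupdf", "pdfplumber", "vision_on_table_regions"]
SCANNED_CHAIN = ["OCR_TIER3", "OCR_TIER1"]

def next_parser(is_scanned, current=None, quality_ok=False):
    """Return the next parser to try, or None when done or out of options."""
    if quality_ok:
        return None  # current output is good enough
    chain = SCANNED_CHAIN if is_scanned else TEXT_CHAIN
    if current is None:
        return chain[0]
    position = chain.index(current)
    return chain[position + 1] if position + 1 < len(chain) else None

print(next_parser(is_scanned=False))                     # first parser for a digital PDF
print(next_parser(is_scanned=True, current="OCR_TIER3")) # escalate when OCR quality is insufficient
```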
@@ -0,0 +1,129 @@
1
+ ---
2
+ name: entity-extraction
3
+ description: Extract specific entities, values, and text segments from documents as required by verification rules. Use after tree processing has located the relevant section, when a rule needs a specific number, date, name, amount, clause, or any domain-specific entity extracted. Covers extraction method selection (regex vs LLM), schema design, postprocessing, and confidence annotation. Also use when designing the extraction step of a workflow for worker LLMs.
4
+ ---
5
+
6
+ # Entity Extraction
7
+
8
+ An entity is the thing you need to check. A number, a date, a name, a clause, a percentage, a statement. The rule says what to check; extraction is how you get the value to check it against.
9
+
10
+ ## Extraction Type Taxonomy
11
+
12
+ Different extraction scenarios call for different approaches:
13
+
14
+ ### Single Entity from Single Section
15
+ The simplest case. One rule needs one value from one place.
16
+ - Example: "Extract the capital adequacy ratio from the Key Metrics table."
17
+ - Approach: Locate the section, apply regex or LLM extraction.
18
+
19
+ ### Multiple Entities from Single Section
20
+ One rule needs several related values from the same place.
21
+ - Example: "Extract the borrower's name, loan amount, interest rate, and maturity date from the loan agreement summary."
22
+ - Approach: Design a single extraction call that returns all values. More efficient than multiple calls.
23
+
24
+ ### Single Entity from Multiple Sections
25
+ One value is scattered across multiple places, or needs cross-referencing.
26
+ - Example: "Extract the total collateral value, which may be listed in the collateral section or in Appendix A."
27
+ - Approach: Collect content from all relevant sections, then extract. Note which source the value came from.
28
+
29
+ ### Entity from Full Document
30
+ The value could be anywhere, or the rule applies to the document as a whole.
31
+ - Example: "Check whether the document contains a valid signature page."
32
+ - Approach: For the coding agent, scan the full document. For worker LLM workflows, design a two-pass approach: first pass identifies the location, second pass extracts the value.
33
+
34
+ ## Method Selection
35
+
36
+ ### Regex / Python (Cost: zero, Speed: instant)
37
+
38
+ Use when the entity has a predictable format:
39
+
40
+ - **Dates**: `\d{4}[-/年]\d{1,2}[-/月]\d{1,2}[日]?` or specific patterns for the document type.
41
+ - **Monetary amounts**: `[\d,]+\.?\d*\s*(元|万元|亿元|USD|RMB)` or similar.
42
+ - **Percentages**: `\d+\.?\d*\s*%`
43
+ - **Identifiers**: Loan numbers, registration codes, ID numbers — usually fixed formats.
44
+ - **Specific phrases**: "I hereby agree" or "本人同意" — exact string matching.
45
+
46
+ Build and test the regex on sample documents. A good regex is better than a good LLM prompt for structured values — it is faster, deterministic, and free.
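As a starting point, the patterns above can be wrapped in small extraction helpers with normalization. This is a sketch: the capture groups, the unit multipliers, the ISO date output, and the sample sentence are assumptions added for illustration.

```python
import re

DATE_RE = re.compile(r"(\d{4})[-/年](\d{1,2})[-/月](\d{1,2})日?")
AMOUNT_RE = re.compile(r"([\d,]+\.?\d*)\s*(元|万元|亿元|USD|RMB)")
PERCENT_RE = re.compile(r"(\d+\.?\d*)\s*%")

# Multipliers for Chinese amount units; other units are left unscaled.
UNIT_MULTIPLIER = {"元": 1, "万元": 10_000, "亿元": 100_000_000}

def extract_dates(text):
    """Return dates normalized to YYYY-MM-DD."""
    return [f"{int(y):04d}-{int(m):02d}-{int(d):02d}" for y, m, d in DATE_RE.findall(text)]

def extract_amounts(text):
    """Return (value scaled by the unit multiplier, original unit) pairs."""
    results = []
    for number, unit in AMOUNT_RE.findall(text):
        try:
            value = float(number.replace(",", ""))
        except ValueError:
            continue  # e.g. a stray comma matched without digits
        results.append((value * UNIT_MULTIPLIER.get(unit, 1), unit))
    return results

def extract_percentages(text):
    """Return percentages as floats, in percent."""
    return [float(p) for p in PERCENT_RE.findall(text)]

sample = "合同签订日期为2024年3月5日，贷款金额5,000万元，年利率4.35%。"
print(extract_dates(sample), extract_amounts(sample), extract_percentages(sample))
```

These patterns do not validate ranges (month 13 would still match), so a range check after extraction is still worthwhile.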
47
+
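As a rough illustration of the patterns listed above, the following sketch compiles them with Python's `re` module and runs them on one sample sentence. The helper name `find_all` and the sample text are assumptions made for this example; real patterns should be tuned against the actual sample documents.

```python
import re

# Patterns taken from the list above; capture groups added for convenience.
DATE_RE = re.compile(r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?")
AMOUNT_RE = re.compile(r"([\d,]+\.?\d*)\s*(元|万元|亿元|USD|RMB)")
PERCENT_RE = re.compile(r"(\d+\.?\d*)\s*%")


def find_all(pattern, text):
    """Return (matched_text, start_offset) for every match of pattern in text."""
    return [(m.group(0), m.start()) for m in pattern.finditer(text)]


# Hypothetical sample sentence, not taken from any real document.
sample = "截至2024年3月15日,贷款余额为1,200万元,资本充足率为12.5%。"

dates = find_all(DATE_RE, sample)
amounts = find_all(AMOUNT_RE, sample)
percentages = find_all(PERCENT_RE, sample)
```

Keeping the start offset alongside the match makes it easy to fill `raw_text` and `source_location` later.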
48
+ ### LLM Extraction (Cost: API call, Speed: seconds)
49
+
50
+ Use when the entity requires understanding:
51
+
52
+ - **Named entities in context**: "the guarantor's main business" — requires understanding who the guarantor is and which text describes their business.
53
+ - **Conditional values**: "the interest rate, including any adjustments" — requires understanding what constitutes an adjustment.
54
+ - **Semantic matching**: "adequate risk disclosure" — requires judgment about what text constitutes risk disclosure.
55
+ - **Table interpretation**: When a table's structure is not uniform and regex cannot reliably extract cells.
56
+
57
+ Design the LLM prompt to:
58
+ 1. Include the narrowed context (from tree processing).
59
+ 2. Specify exactly what to extract.
60
+ 3. Define the output format (JSON with named fields).
61
+ 4. Provide one example if the extraction is non-obvious.
62
+
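The four prompt-design points above could be assembled roughly as follows. This is a minimal sketch: the function name `build_extraction_prompt`, its parameters, and the exact wording of the prompt are assumptions for illustration, not a format defined by the package.

```python
import json


def build_extraction_prompt(context, target, fields, example=None):
    """Assemble an extraction prompt from narrowed context and an output spec."""
    lines = [
        "Extract the requested value from the document excerpt below.",
        f"Target: {target}",                                   # point 2: what to extract
        "Return only a JSON object with these fields: " + ", ".join(fields),  # point 3
    ]
    if example is not None:
        # Point 4: one example, included only when the extraction is non-obvious.
        lines.append("Example output: " + json.dumps(example, ensure_ascii=False))
    # Point 1: the narrowed context, clearly delimited from the instructions.
    lines += ["--- EXCERPT ---", context, "--- END EXCERPT ---"]
    return "\n".join(lines)


prompt = build_extraction_prompt(
    context="资本充足率为12.5%",
    target="capital adequacy ratio",
    fields=["value", "unit", "raw_text"],
    example={"value": 12.5, "unit": "%", "raw_text": "资本充足率为12.5%"},
)
```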
63
+ ### Hybrid Approach
64
+
65
+ Often the best strategy:
66
+ 1. Use regex to extract candidates (fast, catches obvious matches).
67
+ 2. If regex finds a confident match, use it.
68
+ 3. If regex fails or is uncertain, fall back to LLM extraction.
69
+ 4. Use LLM to validate regex results when confidence matters.
70
+
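A minimal sketch of steps 1 to 3 of the hybrid strategy for a percentage value. The `llm_extract` callable stands in for whatever worker LLM call is available and is an assumption of this example; "confident" is interpreted here as "exactly one distinct regex candidate", which is only one possible rule. Step 4 (LLM validation of regex results) is left out for brevity.

```python
import re
from typing import Callable, Optional

PERCENT_RE = re.compile(r"(\d+\.?\d*)\s*%")


def extract_percentage(text: str, llm_extract: Callable[[str], Optional[float]]) -> dict:
    """Try regex first; fall back to the LLM when regex is silent or ambiguous."""
    candidates = set(PERCENT_RE.findall(text))
    if len(candidates) == 1:
        # Exactly one distinct candidate: treat the regex result as confident.
        value = float(candidates.pop())
        return {"value": value, "unit": "%", "extraction_method": "regex"}
    # No candidate, or several conflicting ones: defer to the LLM.
    return {"value": llm_extract(text), "unit": "%", "extraction_method": "llm"}
```

Passing the LLM call in as a parameter keeps the function testable with a stub and keeps the regex path free of API cost.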
71
+ ## Schema Design
72
+
73
+ Define the expected output for each extraction. Keep it simple and add fields just in time, only when a rule actually needs them:
74
+
75
+ ```json
76
+ {
77
+   "entity_name": "capital_adequacy_ratio",
78
+   "value": 12.5,
79
+   "unit": "%",
80
+   "raw_text": "资本充足率为12.5%",
81
+   "source_location": "Chapter 2, Table 1, Row 3",
82
+   "confidence": 0.95,
83
+   "extraction_method": "regex"
84
+ }
85
+ ```
86
+
87
+ Besides `entity_name`, which identifies what was extracted, the schema should capture:
88
+ - **value**: The extracted value, normalized.
89
+ - **unit**: If applicable (%, 元, days, etc.).
90
+ - **raw_text**: The original text fragment where the value was found. This is evidence for the judgment step.
91
+ - **source_location**: Where in the document the value was found.
92
+ - **confidence**: How sure you are (see `confidence-system`).
93
+ - **extraction_method**: What extracted it (regex, LLM-TIER2, etc.).
94
+
95
+ Do not over-engineer the schema. Add fields as needed during testing.
96
+
97
+ ## Postprocessing
98
+
99
+ Raw extracted values often need normalization:
100
+
101
+ - **Chinese numerals → digits**: 一百二十万 → 1200000
102
+ - **Date standardization**: 2024年3月15日 → 2024-03-15
103
+ - **Unit conversion**: 万元 → multiply by 10000 if comparing to a threshold in 元.
104
+ - **Whitespace and noise removal**: Strip extra spaces, line breaks, formatting artifacts.
105
+ - **Percentage normalization**: 0.125 → 12.5% or vice versa, depending on what the rule expects.
106
+
107
+ Build postprocessing as Python functions in the rule skill's `scripts/` directory. They are deterministic and reusable.
108
+
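The first three normalizations above might look like this as standalone functions. This is a sketch under stated assumptions: the function names are hypothetical, `chinese_to_int` only handles well-formed integers built from the characters in its tables, and `to_yuan` covers only the three units shown.

```python
import re
from datetime import date

DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
UNIT_TO_YUAN = {"元": 1, "万元": 10_000, "亿元": 100_000_000}


def chinese_to_int(text: str) -> int:
    """Convert a well-formed Chinese numeral such as 一百二十万 to an int."""
    total = 0    # value of completed 万/亿 groups
    section = 0  # value accumulated inside the current group
    digit = 0    # most recent digit still waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            section += (digit or 1) * SMALL_UNITS[ch]  # bare 十 means 10
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10_000
            section = digit = 0
        elif ch == "亿":
            total = (total + section + digit) * 100_000_000
            section = digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


def normalize_date(text: str) -> str:
    """Convert dates like 2024年3月15日 or 2024/3/15 to ISO format."""
    m = re.fullmatch(r"\s*(\d{4})[-/年](\d{1,2})[-/月](\d{1,2})日?\s*", text)
    if m is None:
        raise ValueError(f"unrecognized date: {text!r}")
    year, month, day = map(int, m.groups())
    return date(year, month, day).isoformat()  # also rejects impossible dates


def to_yuan(value, unit: str):
    """Scale an amount to 元 so it can be compared with a threshold in 元."""
    return value * UNIT_TO_YUAN[unit]
```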
109
+ ## Confidence Annotation
110
+
111
+ Every extraction should carry a confidence estimate:
112
+
113
+ - **Regex match, validated format**: 0.90-0.95
114
+ - **LLM extraction, high certainty**: 0.80-0.85
115
+ - **LLM extraction, some ambiguity**: 0.60-0.75
116
+ - **Fallback or inferred value**: 0.40-0.60
117
+ - **No value found**: 0.0 (flag as MISSING)
118
+
119
+ These are starting points. Calibrate based on actual accuracy (see `confidence-system`).
120
+
121
+ ## Fitting Worker LLM Context
122
+
123
+ When designing extraction for worker LLM workflows:
124
+
125
+ 1. Calculate the prompt size: system prompt + instructions + examples + output format = N tokens.
126
+ 2. Available context for document content = model's context window - N - tokens reserved for the response.
127
+ 3. If the section exceeds available context, narrow further via tree processing.
128
+ 4. Always leave room for the model's response.
129
+ 5. Test with the actual model to verify the context fits — token counts from the coding agent may differ from the worker LLM's tokenizer.
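
The budget arithmetic in steps 1 to 4 can be written down directly. The sketch below uses hypothetical function names and illustrative token counts; actual counts should come from the worker LLM's own tokenizer, as step 5 notes.

```python
def document_budget(context_window: int, prompt_tokens: int, response_reserve: int) -> int:
    """Tokens left for document content after the prompt and the reserved response."""
    budget = context_window - prompt_tokens - response_reserve
    if budget <= 0:
        raise ValueError("prompt and response reserve leave no room for document content")
    return budget


def needs_narrowing(section_tokens: int, budget: int) -> bool:
    """True when the section must be narrowed further via tree processing."""
    return section_tokens > budget


# Illustrative numbers only: a 16K window, 1,500 prompt tokens, 1,000 reserved for output.
budget = document_budget(16_000, 1_500, 1_000)
```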
@@ -0,0 +1,103 @@
1
+ ---
2
+ name: tree-processing
3
+ description: Build hierarchical document trees and navigate to specific chapters or sections required by verification rules. Use when a rule targets a specific part of a document (e.g., "Chapter 3 must contain..."), when documents are too long for a single LLM context window, or when you need to find where a specific entity lives within a large document. Implements the "onion peeler" approach for chunking. Also use for documents over 100 pages where full-document processing is impractical.
4
+ ---
5
+
6
+ # Tree Processing
7
+
8
+ Most verification rules do not need the entire document. They need a specific section, a specific table, a specific disclosure. The tree is your map for navigating large documents efficiently.
9
+
10
+ ## Why Trees
11
+
12
+ Two reasons:
13
+
14
+ 1. **Rules have scope.** "The risk disclosure in Chapter 5 must contain..." — you need to find Chapter 5, not read 1000 pages.
15
+ 2. **Worker LLMs have limits.** A 16K-32K context window cannot hold a 1000-page document. You must narrow to the relevant section.
16
+
17
+ The tree structure solves both: it tells you WHERE things are, and lets you extract JUST what you need.
18
+
19
+ ## Building the Tree
20
+
21
+ ### Step 1: Discover the Structure
22
+
23
+ Before building a tree parser, explore several sample documents to find structural patterns. Look for:
24
+
25
+ - **Header conventions**: Do chapters start with "Chapter X"? "第X章"? "Part X"? A Roman numeral?
26
+ - **Numbering systems**: "1.1.2", "Article 3", "(a)(i)", hierarchical numbering?
27
+ - **Visual markers**: Bold text, larger font, horizontal rules, page breaks before chapters?
28
+ - **Table of contents**: Most formal documents have one. It is the document's own tree.
29
+
30
+ Spend time here. The patterns you find determine whether the tree builder is a simple regex or a complex parser.
31
+
32
+ ### Step 2: Choose the Parser
33
+
34
+ **If patterns are consistent** (they usually are in regulated documents):
35
+ - Write a regex-based splitter. For example:
36
+ - `^第[一二三四五六七八九十百千]+章` for Chinese chapter headers
37
+ - `^Chapter \d+` for English
38
+ - `^\d+\.\d+(\.\d+)*\s` for numbered sections
39
+ - This is fast, deterministic, and reliable. Prefer this when it works.
40
+
41
+ **If patterns are inconsistent or absent**:
42
+ - Use the LLM-guided wedge-driving approach (see `rule-extraction/references/chunking-strategies.md` for the full algorithm: rolling context window, K-token quoting, Levenshtein fuzzy matching).
43
+ - This is slower and costs LLM calls, but handles unstructured documents. The rolling window means even very large unstructured leaf nodes can be chunked incrementally.
44
+
45
+ **If the document has a table of contents**:
46
+ - Parse the TOC first. It gives you the tree structure and page numbers for free.
47
+ - Then use the TOC-derived structure to split the document body.
48
+
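For the consistent-pattern case, a regex-based splitter could start from something like the sketch below. It reuses the three example patterns above; the function name `find_headers`, the level assignment (chapters at level 1, numbered sections at one level per dotted component), and the sample text are assumptions for illustration.

```python
import re

CHAPTER_PATTERNS = [
    re.compile(r"^第[一二三四五六七八九十百千]+章"),  # Chinese chapter headers
    re.compile(r"^Chapter \d+"),                      # English chapter headers
]
SECTION_RE = re.compile(r"^(\d+\.\d+(?:\.\d+)*)\s")   # numbered sections such as 3.1 or 3.1.2


def find_headers(text):
    """Return (level, header_text, start_offset) for each header line in text."""
    headers = []
    offset = 0
    for line in text.splitlines(keepends=True):
        stripped = line.rstrip()
        level = None
        if any(p.match(stripped) for p in CHAPTER_PATTERNS):
            level = 1
        else:
            m = SECTION_RE.match(stripped)
            if m:
                level = 1 + m.group(1).count(".")  # "3.1" -> 2, "3.1.2" -> 3
        if level is not None:
            headers.append((level, stripped, offset))
        offset += len(line)
    return headers


# Hypothetical document body used only to exercise the splitter.
text = (
    "Chapter 3 Minimum Capital\nIntro text.\n"
    "3.1 Tier 1 Capital\nDetails.\n"
    "3.1.2 Deductions\nMore.\n"
    "第五章 风险披露\n正文\n"
)
headers = find_headers(text)
```

A body line that happens to begin with something like "1.5 " would be misread as a header, so the patterns still need checking against real samples.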
49
+ ### Step 3: Build the Tree
50
+
51
+ The tree is a simple nested structure:
52
+
53
+ ```
54
+ Document
55
+ ├── Part I: General Provisions
56
+ │ ├── Chapter 1: Definitions (pages 1-15)
57
+ │ └── Chapter 2: Scope (pages 16-22)
58
+ ├── Part II: Capital Requirements
59
+ │ ├── Chapter 3: Minimum Capital (pages 23-45)
60
+ │ │ ├── Section 3.1: Tier 1 Capital
61
+ │ │ └── Section 3.2: Tier 2 Capital
62
+ │ └── Chapter 4: Risk Weighting (pages 46-78)
63
+ └── Part III: Disclosure
64
+ └── Chapter 5: Risk Disclosure (pages 79-120)
65
+ ```
66
+
67
+ Each node stores: the header text, the level, the start/end positions in the document, and the content size (in tokens or characters).
68
+
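One way to turn a flat list of detected headers into such a nested structure is a stack-based pass, sketched below. The `TreeNode` class and `build_tree` function are hypothetical names; size is measured in characters here, and the input is assumed to be `(level, header, start_offset)` tuples in document order.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TreeNode:
    header: str
    level: int
    start: int
    end: int
    children: List["TreeNode"] = field(default_factory=list)

    @property
    def size(self) -> int:
        """Content size in characters, including the node's children."""
        return self.end - self.start


def build_tree(headers, doc_length):
    """Nest (level, header, start) tuples under a root node using a stack."""
    root = TreeNode("Document", 0, 0, doc_length)
    stack = [root]
    for level, header, start in headers:
        # Close every open node at the same or a deeper level.
        while stack[-1].level >= level:
            stack.pop().end = start
        node = TreeNode(header, level, start, doc_length)
        stack[-1].children.append(node)
        stack.append(node)
    return root  # nodes still open at the end keep end == doc_length


tree = build_tree(
    [(1, "Chapter 3", 0), (2, "3.1", 10), (2, "3.2", 20), (1, "Chapter 4", 30)],
    doc_length=50,
)
```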
69
+ ### Step 4: Use the Tree
70
+
71
+ Given a rule that says "check the risk disclosure section":
72
+
73
+ 1. **Search the tree** for the relevant node. Match the rule's scope description against node headers.
74
+ - Exact match: "Chapter 5" → find node with "Chapter 5" header.
75
+ - Semantic match: "risk disclosure section" → find node whose header or content relates to risk disclosure. May need fuzzy matching or LLM classification.
76
+ 2. **Extract the content** of that node (and optionally its children).
77
+ 3. **Check the size.** If the content fits in the worker LLM's context window, use it directly. If not, descend to child nodes and find the specific subsection needed.
78
+
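The search-then-descend logic can be sketched as two small recursive functions. This example only covers header matching by case-insensitive substring; semantic matching via an LLM is out of scope. The `Node` class, the function names, and the sizes are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    header: str
    size: int  # content size in tokens
    children: List["Node"] = field(default_factory=list)


def find_node(node: Node, query: str) -> Optional[Node]:
    """Depth-first search for the first node whose header contains query."""
    if query.lower() in node.header.lower():
        return node
    for child in node.children:
        hit = find_node(child, query)
        if hit is not None:
            return hit
    return None


def narrow(node: Node, budget: int, sub_query: str) -> Optional[Node]:
    """Return node if it fits the budget, else descend to a matching subsection."""
    if node.size <= budget:
        return node
    for child in node.children:
        hit = find_node(child, sub_query)
        if hit is not None:
            return narrow(hit, budget, sub_query)
    return None  # nothing fits: fall back to LLM-guided chunking


chapter5 = Node("Chapter 5: Risk Disclosure", 40_000, [
    Node("Section 5.1: Market Risk", 9_000),
    Node("Section 5.2: Credit Risk", 12_000),
])
root = Node("Document", 120_000, [Node("Chapter 4: Risk Weighting", 30_000), chapter5])
```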
79
+ ## The Full Context → Chapter → Entity Pipeline
80
+
81
+ This is the standard narrowing funnel for extracting entities for verification:
82
+
83
+ 1. **Full context**: Use the tree to understand the document structure. Know where everything is.
84
+ 2. **Chapter**: Navigate to the specific section that the rule targets. Extract its content.
85
+ 3. **Entity**: Within the chapter content, extract the specific entity (number, text, clause) using the techniques from `entity-extraction`.
86
+
87
+ For worker LLMs with 16K-32K context:
88
+ - The chapter content + the extraction prompt must fit in the context window.
89
+ - If a chapter is too large, descend further in the tree.
90
+ - Always include the parent header chain for context: "Part II > Chapter 3 > Section 3.1" so the LLM knows where this content sits in the document.
91
+
92
+ ## Caching and Reuse
93
+
94
+ Build the tree once per document, reuse across all rules:
95
+ - Save the tree structure as JSON alongside the parsed document.
96
+ - Multiple rules may need different sections of the same document. The tree lets each rule navigate directly to its section without re-parsing.
97
+
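A minimal sketch of build-once, reuse-everywhere caching. It assumes the tree is already a JSON-serializable dict and stores it next to the parsed document under a `.tree.json` suffix; both the naming convention and the `load_or_build_tree` helper are assumptions of this example, not something the package prescribes.

```python
import json
from pathlib import Path
from typing import Callable


def tree_path_for(parsed_path: Path) -> Path:
    """Location of the cached tree, alongside the parsed document."""
    return parsed_path.with_suffix(".tree.json")


def load_or_build_tree(parsed_path: Path, build: Callable[[Path], dict]) -> dict:
    """Load the cached tree if present; otherwise build it once and save it."""
    cache = tree_path_for(parsed_path)
    if cache.exists():
        return json.loads(cache.read_text(encoding="utf-8"))
    tree = build(parsed_path)
    cache.write_text(json.dumps(tree, ensure_ascii=False, indent=2), encoding="utf-8")
    return tree
```

The cache would need invalidating if the parsed document or the tree builder changes; that bookkeeping is omitted here.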
98
+ ## Edge Cases
99
+
100
+ - **Flat documents**: Some documents have no structural hierarchy. Treat the entire document as one node. Use LLM-guided chunking if it exceeds the context window.
101
+ - **Deeply nested structures**: Some legal documents have 6+ nesting levels. Build all levels but typically only navigate 2-3 levels deep for any given rule.
102
+ - **Cross-section references**: A section might reference "as defined in Section 1.2." When extracting, you may need content from multiple tree nodes. Collect them into a single context for the LLM.
103
+ - **Appendices and annexes**: Often contain critical tables and data. Include them as top-level nodes in the tree.
@@ -0,0 +1,70 @@
1
+ ---
2
+ name: bootstrap-workspace
3
+ description: Initialize and configure a document verification workspace. Use when a developer user first opens this workspace, when .env needs configuration, or when the business scenario needs to be understood. Guides the coding agent through reading regulation documents, understanding the developer user's business context, configuring model tiers and thresholds, and establishing the working relationship. Covers initial conversation with developer user to scope the verification task, set expectations, and agree on checkpoints.
4
+ ---
5
+
6
+ # Bootstrap Workspace
7
+
8
+ You are replacing a team of business analysts, prompt engineers, and QA engineers. Before writing a single line of code, understand what that team would ask first.
9
+
10
+ ## First Actions
11
+
12
+ 1. Read everything in `Rules/`. Understand the domain, the regulation structure, the language, and the level of specificity.
13
+ 2. Scan `Samples/` to understand the document types, formats (PDF, DOCX, scans), typical length, and structural patterns.
14
+ 3. Review `.env` to understand the current configuration. Check that API keys are set and models are accessible.
15
+
16
+ ## Discuss with Developer User
17
+
18
+ Before proceeding, have a focused conversation with the developer user to align on:
19
+
20
+ ### Scope
21
+ - What type of documents are being verified? (contracts, filings, reports, etc.)
22
+ - What regulations or rules apply? Are they all in Rules/, or are some implicit domain knowledge?
23
+ - How many distinct rules are there, roughly? Dozens? Hundreds?
24
+ - Are there rules already sorted into a spreadsheet (xlsx/csv) with clear scope definitions?
25
+
26
+ ### Granularity
27
+ - How fine-grained should each rule be? A single clause? A paragraph? A section?
28
+ - Some rules are naturally atomic ("the interest rate must not exceed X%"). Others are compound ("the disclosure section must contain items A, B, C, and each must meet criteria D, E, F"). Discuss how to decompose compound rules.
29
+ - If the developer user has pre-sorted rules, follow their structure. Do not re-decompose unless they ask.
30
+
31
+ ### Expectations
32
+ - What accuracy is acceptable? The default in `.env` is 0.9 for both skills and workflows. Is this right for this business scenario?
33
+ - Are some rules more critical than others? Should critical rules have higher thresholds?
34
+ - What is the expected production volume? How many documents per batch?
35
+ - How fast do results need to be? Same-day? Real-time?
36
+
37
+ ### Checkpoints
38
+ - Explain the lifecycle to the developer user: you will first verify documents yourself (skill phase), then build automated workflows (workflow phase), then monitor in production (QC phase).
39
+ - Agree on when the developer user wants to review progress. After the first rule? After all rules? After workflow distillation?
40
+ - Establish how the developer user wants to see results (HTML dashboard, direct conversation, output files).
41
+
42
+ ## .env Configuration Guidance
43
+
44
+ Walk through `.env` parameters with the developer user:
45
+
46
+ - **TIER1-4**: Model tiers from most capable to most efficient. TIER1 is used when accuracy matters most. TIER4 is used when the task is well-understood and speed/cost matters. The coding agent decides per-task, but the developer user should confirm the available models match their API access.
47
+ - **OCR_MODEL_TIER1-3**: For document parsing when text extraction fails. Only relevant if documents include scanned pages, complex tables, or charts.
48
+ - **SKILL_ACCURACY**: The accuracy threshold before a skill is considered "proven" and ready for workflow distillation. Default 0.9. Higher means more iteration before moving on, but more reliable workflows.
49
+ - **WORKFLOW_ACCURACY**: The accuracy threshold before a workflow is deployed to production. Default 0.9. This is measured against the coding agent's own skill-based results as ground truth.
50
+ - **MONITOR_FREQUENCY**: How aggressively to sample and review production results. `high` = slower decay (review more for longer), `mid` = balanced, `low` = faster decay (trust workflows sooner).
51
+ - **MAX_ITERATIONS**: Safety valve. After this many evolution cycles on a single rule, escalate to the developer user instead of continuing to iterate.
52
+
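To make the threshold parameters concrete, the sketch below reads and validates them in Python. It assumes the `.env` values have already been loaded into the process environment (or are passed in as a mapping); the function name `load_config` and the validation rules are illustrative. The 0.9 defaults come from the descriptions above, while `MONITOR_FREQUENCY` and `MAX_ITERATIONS` are treated as required because no default is stated for them here.

```python
import os
from typing import Mapping, Optional

VALID_FREQUENCIES = {"high", "mid", "low"}


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read and validate the threshold-related parameters."""
    env = os.environ if env is None else env
    config = {
        "skill_accuracy": float(env.get("SKILL_ACCURACY", "0.9")),
        "workflow_accuracy": float(env.get("WORKFLOW_ACCURACY", "0.9")),
        "monitor_frequency": env["MONITOR_FREQUENCY"],
        "max_iterations": int(env["MAX_ITERATIONS"]),
    }
    for key in ("skill_accuracy", "workflow_accuracy"):
        if not 0.0 < config[key] <= 1.0:
            raise ValueError(f"{key} must be in (0, 1], got {config[key]}")
    if config["monitor_frequency"] not in VALID_FREQUENCIES:
        raise ValueError(f"MONITOR_FREQUENCY must be one of {sorted(VALID_FREQUENCIES)}")
    if config["max_iterations"] < 1:
        raise ValueError("MAX_ITERATIONS must be at least 1")
    return config
```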
53
+ ## Setting Up the Workspace
54
+
55
+ After the conversation:
56
+
57
+ 1. Ensure all workspace folders exist (Rules/, Samples/, Input/, Output/).
58
+ 2. Create a `logs/` directory for iteration logs.
59
+ 3. Create a `workflows/` directory for distilled Python workflows.
60
+ 4. Create a `rule-skills/` directory where individual rule skill folders will live.
61
+ 5. Initialize version tracking (a `versions.json` manifest).
62
+ 6. Log the bootstrap conversation summary for future reference.
63
+
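Steps 1 to 5 could be scripted roughly as follows. The directory names come from the list above; the function name `init_workspace` and the initial manifest content (`{"versions": []}`) are assumptions for illustration, since the real manifest schema is not described here.

```python
import json
from pathlib import Path

WORKSPACE_DIRS = ["Rules", "Samples", "Input", "Output", "logs", "workflows", "rule-skills"]


def init_workspace(root) -> Path:
    """Create missing workspace folders and an empty version manifest."""
    root = Path(root)
    for name in WORKSPACE_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)  # no-op if it already exists
    manifest = root / "versions.json"
    if not manifest.exists():
        # Hypothetical initial content; never overwrite an existing manifest.
        manifest.write_text(json.dumps({"versions": []}, indent=2), encoding="utf-8")
    return manifest
```

Because every step is idempotent, the function can be rerun safely when re-bootstrapping.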
64
+ ## When to Re-Bootstrap
65
+
66
+ Return to this skill when:
67
+ - The developer user adds new regulation documents to Rules/.
68
+ - The business scenario changes significantly.
69
+ - Model availability changes (new models added, old models deprecated).
70
+ - Thresholds need adjustment based on production experience.