kc-beta 0.3.1 → 0.3.2
- package/package.json +1 -1
- package/template/skills/en/meta/entity-extraction/SKILL.md +6 -0
- package/template/skills/en/meta-meta/rule-extraction/SKILL.md +35 -0
- package/template/skills/en/meta-meta/rule-graph/SKILL.md +16 -0
- package/template/skills/zh/meta/entity-extraction/SKILL.md +6 -0
- package/template/skills/zh/meta-meta/rule-extraction/SKILL.md +35 -0
- package/template/skills/zh/meta-meta/rule-graph/SKILL.md +16 -0
package/package.json
CHANGED
@@ -49,6 +49,12 @@ Many real verification tasks require semantic understanding — "is this descrip
 
 If a method's results fall below the accuracy threshold, try a different method or a more capable model. If regex works and meets accuracy — keep it, it's free. If regex produces results below threshold, escalate to worker LLM. If a cheap worker LLM isn't accurate enough, try a more capable tier. Record what works for each extraction type in AGENT.md for future reference.
 
+## Project Glossary
+
+The project glossary (built and maintained by `rule-extraction`, stored at `rules/glossary.json`) is a useful resource when designing extraction. It records canonical names and known aliases for entities that appear across rules. Reading it before extracting helps keep entity names schema-aligned and avoids parallel labels for the same thing.
+
+Whether the glossary becomes more than a naming convention — for instance, driving cheap pattern matching for entities with stable surface forms — is a per-project judgment. Apply the same cost-accuracy logic as elsewhere: whatever method meets the accuracy threshold for the task at hand.
+
 ## Schema Design
 
 Define the expected output for each extraction. Keep it simple and JIT:
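The "cheap pattern matching" option mentioned in the hunk above could look roughly like the following. This is a minimal sketch, not the package's implementation: the function names `build_alias_pattern` and `find_entities` are hypothetical, and the in-memory `GLOSSARY` list assumes entries shaped like the `rules/glossary.json` example (a `canonical` name plus a list of `aliases`).

```python
import re

# Hypothetical in-memory glossary; only the two fields used here are shown.
GLOSSARY = [
    {
        "canonical": "registered_capital",
        "aliases": ["注册资本", "registered capital", "实收资本"],
    },
]

def build_alias_pattern(glossary):
    """Compile one case-insensitive regex covering every alias."""
    alias_to_canonical = {}
    for entry in glossary:
        for alias in entry["aliases"]:
            alias_to_canonical[alias.lower()] = entry["canonical"]
    if not alias_to_canonical:
        return None, alias_to_canonical
    # Longest alias first, so a longer alias wins over a shorter one it contains.
    ordered = sorted(alias_to_canonical, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(a) for a in ordered), re.IGNORECASE)
    return pattern, alias_to_canonical

def find_entities(text, glossary):
    """Return (canonical_name, matched_text, start_offset) for each alias hit."""
    pattern, alias_to_canonical = build_alias_pattern(glossary)
    if pattern is None:
        return []
    return [
        (alias_to_canonical[m.group(0).lower()], m.group(0), m.start())
        for m in pattern.finditer(text)
    ]

text = "The Registered Capital (注册资本) is RMB 10 million."
hits = find_entities(text, GLOSSARY)
for canonical, matched, offset in hits:
    print(canonical, matched, offset)
```

Whether a matcher like this is good enough is still decided by the accuracy threshold: it only finds surface forms already recorded as aliases, so anything it misses is a candidate for escalation to a worker LLM.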
@@ -104,6 +104,41 @@ Maintain a lightweight catalog of all extracted rules. This is your index, not t
 
 Format: a simple markdown table or JSON file. Do not over-engineer this. The catalog exists to give you and the developer user an overview of progress.
 
+## Project Glossary
+
+Alongside the rule catalog, build a project glossary — a living vocabulary of the entities, terms, and patterns the verification system encounters. The glossary is what keeps entity names consistent across rules: without it, the same balance-sheet item might be named "注册资本", "registered capital", and "paid-in capital" by three different rule skills, breaking shared-entity matching and producing inconsistent extraction outputs.
+
+The glossary is not frozen at the end of extraction. It is a living document. Update it when you discover new aliases in samples, when a worker LLM extraction reveals a variant phrasing, when corner cases surface unfamiliar terminology. Both the coding agent and any operator can edit it.
+
+### When to seed it
+
+During rule extraction. As you decompose each rule, note the entities the rule references — capital ratios, signature pages, related-party transactions, dates, parties, monetary values. Seed the glossary with the canonical name and any aliases already visible in the source documents.
+
+### Storage and shape
+
+Save as `rules/glossary.json` next to `catalog.json`. Each entry is small:
+
+```json
+{
+  "canonical": "registered_capital",
+  "aliases": ["注册资本", "registered capital", "实收资本"],
+  "definition": "The capital amount registered with regulators",
+  "entity_type": "monetary_value",
+  "seen_in": ["rules/regulation_A.pdf:p12", "samples/annual_report_2024.pdf:p3"],
+  "status": "extracted"
+}
+```
+
+Status field tracks maturity: `extracted` (from rules), `validated` (confirmed in samples), `production` (used by deployed workflows). Add or drop fields as the project demands — same JIT philosophy as the rule schema.
+
+### How it integrates
+
+- `rule-graph` consumes the glossary so `shares_entity` edges reference canonical labels rather than free-text strings.
+- `entity-extraction` references the glossary for canonical names and known aliases when designing extraction logic.
+- Skills authored under `skill-authoring` should use canonical names in their schemas.
+
+How the glossary is used downstream is a per-project judgment. A mature glossary may enable cheap pattern-based matching for some entities; for others it just keeps naming consistent. Let the cost-accuracy logic in `entity-extraction` decide per case.
+
 ## Handling Ambiguity
 
 Regulations are often ambiguous. When you encounter ambiguity:
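Because the glossary added in this hunk is edited by both the agent and operators, a small consistency check can help catch drift. The sketch below is illustrative only: `check_glossary` is a hypothetical helper, it assumes the top level of `rules/glossary.json` is a JSON array of entries (the diff shows a single entry only), and the second sample entry (`paid_in_capital`, status `draft`) is invented to trigger both checks.

```python
import json

ALLOWED_STATUSES = {"extracted", "validated", "production"}
REQUIRED_FIELDS = {"canonical", "aliases"}  # assumed minimum; other fields stay optional

def check_glossary(entries):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    owner_of = {}  # lowercased alias -> canonical name that first claimed it
    for entry in entries:
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            name = entry.get("canonical", "?")
            problems.append(f"entry {name!r} is missing {sorted(missing)}")
            continue
        status = entry.get("status", "extracted")
        if status not in ALLOWED_STATUSES:
            problems.append(f"{entry['canonical']}: unknown status {status!r}")
        for alias in entry["aliases"]:
            first = owner_of.setdefault(alias.lower(), entry["canonical"])
            if first != entry["canonical"]:
                problems.append(
                    f"alias {alias!r} is claimed by both {first} and {entry['canonical']}"
                )
    return problems

# Inline sample instead of reading rules/glossary.json, to keep the sketch self-contained.
sample = json.loads("""
[
  {"canonical": "registered_capital",
   "aliases": ["注册资本", "registered capital", "实收资本"],
   "status": "extracted"},
  {"canonical": "paid_in_capital",
   "aliases": ["实收资本"],
   "status": "draft"}
]
""")

problems = check_glossary(sample)
for problem in problems:
    print(problem)
```

An alias claimed by two canonical names is not necessarily a mistake, but it is the kind of case where two entries may need merging or a definition may need refining.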
@@ -43,6 +43,22 @@ Two rules that can produce contradictory guidance. Regulation A requires disclos
 
 Edge cases that affect multiple rules. A document with an unusual structure (merged cells in a table, non-standard date format) may cause extraction failures across several rules. The graph links these rules to the shared corner case so a fix in one propagates awareness to others.
 
+## Project Glossary
+
+The glossary (built and owned by `rule-extraction`, stored at `rules/glossary.json`) is the canonical-label registry that makes `shares_entity` edges meaningful. Without it, two rules can target the same entity under different names and the edge between them never gets drawn.
+
+Edges that reference entities should use the glossary's canonical labels, not free-text strings copied from rule descriptions:
+
+```json
+{"from": "R001", "to": "R004", "type": "shares_entity", "entity": "registered_capital"}
+```
+
+Where `registered_capital` is the canonical name in `glossary.json`, with aliases like `注册资本` and `paid-in capital` recorded under it.
+
+When the glossary is updated — new aliases discovered in samples, two entries merged, a definition refined — revisit affected `shares_entity` edges. New aliases may surface previously hidden cross-rule connections; merged entries collapse parallel edges into one.
+
+The glossary is built and owned by rule-extraction; rule-graph just consumes it.
+
 ## Three Uses
 
 ### 1. Impact Analysis
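The "revisit affected `shares_entity` edges" step described in this hunk might be sketched as follows. The helper `canonicalize_edges` is hypothetical, the edge dictionaries follow the JSON shape shown in the hunk, and treating edges as undirected when deduplicating is an assumption rather than something the skill text specifies.

```python
GLOSSARY = [
    {"canonical": "registered_capital",
     "aliases": ["注册资本", "registered capital", "paid-in capital"]},
]

def canonicalize_edges(edges, glossary):
    """Rewrite shares_entity edges to canonical labels and drop resulting duplicates."""
    lookup = {}
    for entry in glossary:
        lookup[entry["canonical"].lower()] = entry["canonical"]
        for alias in entry["aliases"]:
            lookup[alias.lower()] = entry["canonical"]

    resolved, unresolved, seen = [], [], set()
    for edge in edges:
        if edge.get("type") != "shares_entity":
            resolved.append(edge)  # other edge types pass through untouched
            continue
        canonical = lookup.get(edge["entity"].strip().lower())
        if canonical is None:
            unresolved.append(edge)  # candidate for a new glossary entry or alias
            continue
        # Assumption: direction does not matter for shares_entity edges.
        key = (frozenset((edge["from"], edge["to"])), canonical)
        if key in seen:
            continue  # a parallel edge that collapses into one already kept
        seen.add(key)
        resolved.append({**edge, "entity": canonical})
    return resolved, unresolved

edges = [
    {"from": "R001", "to": "R004", "type": "shares_entity", "entity": "注册资本"},
    {"from": "R004", "to": "R001", "type": "shares_entity", "entity": "Paid-in Capital"},
    {"from": "R002", "to": "R003", "type": "shares_entity", "entity": "signature page"},
]
resolved, unresolved = canonicalize_edges(edges, GLOSSARY)
```

Here the first two edges collapse into a single edge labeled `registered_capital`, while the `signature page` edge is returned as unresolved so that `rule-extraction`, which owns the glossary, can decide whether to add an entry.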
@@ -87,6 +87,22 @@ description: Build and maintain a graph of relationships between verification ru
 
 The graph links these rules to the shared corner case: fix it once, and every affected rule is made aware.
 
+## Project Glossary
+
+The glossary is built and maintained by `rule-extraction` and stored at `rules/glossary.json`. It is the registry of canonical labels, and whether `shares_entity` edges can be drawn at all depends on it. Without a glossary, two rules may target the same entity under different names, and the edge between them never gets drawn.
+
+Edges that involve entities should reference the canonical labels in the glossary, not free text copied from rule descriptions:
+
+```json
+{"from": "R001", "to": "R004", "type": "shares_entity", "entity": "registered_capital"}
+```
+
+Here `registered_capital` is the canonical name in `glossary.json`, with `注册资本`, `paid-in capital`, and `实收资本` recorded as aliases under that entry.
+
+When the glossary is updated (new aliases discovered, two entries merged, a definition revised), go back and check the affected `shares_entity` edges. New aliases may surface previously hidden cross-rule connections; merged entries collapse parallel edges into one.
+
+The glossary is built and owned by rule-extraction; rule-graph is only a consumer.
+
 ## Four Uses
 
 ### 1. Impact Analysis