@cremini/skillpack 1.1.8-beta.1 → 1.1.9
- package/README.md +5 -1
- package/dist/cli.js +177 -256
- package/package.json +2 -2
- package/templates/builtin-skills/skill-creator/LICENSE.txt +202 -0
- package/templates/builtin-skills/skill-creator/SKILL.md +171 -0
- package/templates/builtin-skills/skill-creator/agents/analyzer.md +274 -0
- package/templates/builtin-skills/skill-creator/agents/comparator.md +202 -0
- package/templates/builtin-skills/skill-creator/agents/grader.md +223 -0
- package/templates/builtin-skills/skill-creator/assets/eval_review.html +146 -0
- package/templates/builtin-skills/skill-creator/eval-viewer/generate_review.py +471 -0
- package/templates/builtin-skills/skill-creator/eval-viewer/viewer.html +1325 -0
- package/templates/builtin-skills/skill-creator/references/schemas.md +430 -0
- package/templates/builtin-skills/skill-creator/scripts/__init__.py +0 -0
- package/templates/builtin-skills/skill-creator/scripts/aggregate_benchmark.py +401 -0
- package/templates/builtin-skills/skill-creator/scripts/generate_report.py +326 -0
- package/templates/builtin-skills/skill-creator/scripts/improve_description.py +247 -0
- package/templates/builtin-skills/skill-creator/scripts/package_skill.py +136 -0
- package/templates/builtin-skills/skill-creator/scripts/quick_validate.py +103 -0
- package/templates/builtin-skills/skill-creator/scripts/run_eval.py +310 -0
- package/templates/builtin-skills/skill-creator/scripts/run_loop.py +328 -0
- package/templates/builtin-skills/skill-creator/scripts/utils.py +47 -0
- package/web/js/api-key-dialog.js +5 -3
- package/web/js/chat-apps-dialog.js +10 -4
- package/web/js/chat.js +8 -8
- package/web/js/settings.js +8 -3
- package/dist/runtime/registry.js +0 -244
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
 "name": "@cremini/skillpack",
-"version": "1.1.
+"version": "1.1.9",
 "description": "Pack AI Skills into Local Agents",
 "type": "module",
 "repository": {
@@ -65,4 +65,4 @@
 "tsup": "^8.5.1",
 "typescript": "^5.9.3"
 }
-}
+}
package/templates/builtin-skills/skill-creator/LICENSE.txt
@@ -0,0 +1,202 @@
+
+Apache License
+Version 2.0, January 2004
+http://www.apache.org/licenses/
+
+TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+1. Definitions.
+
+"License" shall mean the terms and conditions for use, reproduction,
+and distribution as defined by Sections 1 through 9 of this document.
+
+"Licensor" shall mean the copyright owner or entity authorized by
+the copyright owner that is granting the License.
+
+"Legal Entity" shall mean the union of the acting entity and all
+other entities that control, are controlled by, or are under common
+control with that entity. For the purposes of this definition,
+"control" means (i) the power, direct or indirect, to cause the
+direction or management of such entity, whether by contract or
+otherwise, or (ii) ownership of fifty percent (50%) or more of the
+outstanding shares, or (iii) beneficial ownership of such entity.
+
+"You" (or "Your") shall mean an individual or Legal Entity
+exercising permissions granted by this License.
+
+"Source" form shall mean the preferred form for making modifications,
+including but not limited to software source code, documentation
+source, and configuration files.
+
+"Object" form shall mean any form resulting from mechanical
+transformation or translation of a Source form, including but
+not limited to compiled object code, generated documentation,
+and conversions to other media types.
+
+"Work" shall mean the work of authorship, whether in Source or
+Object form, made available under the License, as indicated by a
+copyright notice that is included in or attached to the work
+(an example is provided in the Appendix below).
+
+"Derivative Works" shall mean any work, whether in Source or Object
+form, that is based on (or derived from) the Work and for which the
+editorial revisions, annotations, elaborations, or other modifications
+represent, as a whole, an original work of authorship. For the purposes
+of this License, Derivative Works shall not include works that remain
+separable from, or merely link (or bind by name) to the interfaces of,
+the Work and Derivative Works thereof.
+
+"Contribution" shall mean any work of authorship, including
+the original version of the Work and any modifications or additions
+to that Work or Derivative Works thereof, that is intentionally
+submitted to Licensor for inclusion in the Work by the copyright owner
+or by an individual or Legal Entity authorized to submit on behalf of
+the copyright owner. For the purposes of this definition, "submitted"
+means any form of electronic, verbal, or written communication sent
+to the Licensor or its representatives, including but not limited to
+communication on electronic mailing lists, source code control systems,
+and issue tracking systems that are managed by, or on behalf of, the
+Licensor for the purpose of discussing and improving the Work, but
+excluding communication that is conspicuously marked or otherwise
+designated in writing by the copyright owner as "Not a Contribution."
+
+"Contributor" shall mean Licensor and any individual or Legal Entity
+on behalf of whom a Contribution has been received by Licensor and
+subsequently incorporated within the Work.
+
+2. Grant of Copyright License. Subject to the terms and conditions of
+this License, each Contributor hereby grants to You a perpetual,
+worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+copyright license to reproduce, prepare Derivative Works of,
+publicly display, publicly perform, sublicense, and distribute the
+Work and such Derivative Works in Source or Object form.
+
+3. Grant of Patent License. Subject to the terms and conditions of
+this License, each Contributor hereby grants to You a perpetual,
+worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+(except as stated in this section) patent license to make, have made,
+use, offer to sell, sell, import, and otherwise transfer the Work,
+where such license applies only to those patent claims licensable
+by such Contributor that are necessarily infringed by their
+Contribution(s) alone or by combination of their Contribution(s)
+with the Work to which such Contribution(s) was submitted. If You
+institute patent litigation against any entity (including a
+cross-claim or counterclaim in a lawsuit) alleging that the Work
+or a Contribution incorporated within the Work constitutes direct
+or contributory patent infringement, then any patent licenses
+granted to You under this License for that Work shall terminate
+as of the date such litigation is filed.
+
+4. Redistribution. You may reproduce and distribute copies of the
+Work or Derivative Works thereof in any medium, with or without
+modifications, and in Source or Object form, provided that You
+meet the following conditions:
+
+(a) You must give any other recipients of the Work or
+Derivative Works a copy of this License; and
+
+(b) You must cause any modified files to carry prominent notices
+stating that You changed the files; and
+
+(c) You must retain, in the Source form of any Derivative Works
+that You distribute, all copyright, patent, trademark, and
+attribution notices from the Source form of the Work,
+excluding those notices that do not pertain to any part of
+the Derivative Works; and
+
+(d) If the Work includes a "NOTICE" text file as part of its
+distribution, then any Derivative Works that You distribute must
+include a readable copy of the attribution notices contained
+within such NOTICE file, excluding those notices that do not
+pertain to any part of the Derivative Works, in at least one
+of the following places: within a NOTICE text file distributed
+as part of the Derivative Works; within the Source form or
+documentation, if provided along with the Derivative Works; or,
+within a display generated by the Derivative Works, if and
+wherever such third-party notices normally appear. The contents
+of the NOTICE file are for informational purposes only and
+do not modify the License. You may add Your own attribution
+notices within Derivative Works that You distribute, alongside
+or as an addendum to the NOTICE text from the Work, provided
+that such additional attribution notices cannot be construed
+as modifying the License.
+
+You may add Your own copyright statement to Your modifications and
+may provide additional or different license terms and conditions
+for use, reproduction, or distribution of Your modifications, or
+for any such Derivative Works as a whole, provided Your use,
+reproduction, and distribution of the Work otherwise complies with
+the conditions stated in this License.
+
+5. Submission of Contributions. Unless You explicitly state otherwise,
+any Contribution intentionally submitted for inclusion in the Work
+by You to the Licensor shall be under the terms and conditions of
+this License, without any additional terms or conditions.
+Notwithstanding the above, nothing herein shall supersede or modify
+the terms of any separate license agreement you may have executed
+with Licensor regarding such Contributions.
+
+6. Trademarks. This License does not grant permission to use the trade
+names, trademarks, service marks, or product names of the Licensor,
+except as required for reasonable and customary use in describing the
+origin of the Work and reproducing the content of the NOTICE file.
+
+7. Disclaimer of Warranty. Unless required by applicable law or
+agreed to in writing, Licensor provides the Work (and each
+Contributor provides its Contributions) on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+implied, including, without limitation, any warranties or conditions
+of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+PARTICULAR PURPOSE. You are solely responsible for determining the
+appropriateness of using or redistributing the Work and assume any
+risks associated with Your exercise of permissions under this License.
+
+8. Limitation of Liability. In no event and under no legal theory,
+whether in tort (including negligence), contract, or otherwise,
+unless required by applicable law (such as deliberate and grossly
+negligent acts) or agreed to in writing, shall any Contributor be
+liable to You for damages, including any direct, indirect, special,
+incidental, or consequential damages of any character arising as a
+result of this License or out of the use or inability to use the
+Work (including but not limited to damages for loss of goodwill,
+work stoppage, computer failure or malfunction, or any and all
+other commercial damages or losses), even if such Contributor
+has been advised of the possibility of such damages.
+
+9. Accepting Warranty or Additional Liability. While redistributing
+the Work or Derivative Works thereof, You may choose to offer,
+and charge a fee for, acceptance of support, warranty, indemnity,
+or other liability obligations and/or rights consistent with this
+License. However, in accepting such obligations, You may act only
+on Your own behalf and on Your sole responsibility, not on behalf
+of any other Contributor, and only if You agree to indemnify,
+defend, and hold each Contributor harmless for any liability
+incurred by, or claims asserted against, such Contributor by reason
+of your accepting any such warranty or additional liability.
+
+END OF TERMS AND CONDITIONS
+
+APPENDIX: How to apply the Apache License to your work.
+
+To apply the Apache License to your work, attach the following
+boilerplate notice, with the fields enclosed by brackets "[]"
+replaced with your own identifying information. (Don't include
+the brackets!) The text should be enclosed in the appropriate
+comment syntax for the file format. We also recommend that a
+file or class name and description of purpose be included on the
+same "printed page" as the copyright notice for easier
+identification within third-party archives.
+
+Copyright [yyyy] [name of copyright owner]
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
package/templates/builtin-skills/skill-creator/SKILL.md
@@ -0,0 +1,171 @@
+---
+name: skill-creator
+description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
+---
+
+# Skill Creator
+
+A skill for creating new skills and iteratively improving them inside this SkillPack.
+
+At a high level, the process of creating a skill goes like this:
+
+- Decide what the skill should do and when it should trigger.
+- Write a draft of the skill.
+- Create a few realistic test prompts.
+- Run the tests, review the results with the user, and improve the skill.
+- Repeat until the skill is good enough for the user's needs.
+
+Your job when using this skill is to figure out where the user is in this process and help them move forward without overcomplicating things.
+
+## Communicating with the user
+
+Adjust your language to the user's level of familiarity. Avoid unnecessary jargon. Briefly explain terms like "frontmatter", "assertion", or "benchmark" if the user does not appear comfortable with them.
+
+If the user clearly wants a lightweight collaboration rather than a full evaluation loop, keep things simple and iterate directly with them.
+
+## Pack-specific rules
+
+This SkillPack uses a fixed project-level skills directory and config file:
+
+- Skills directory: `{{SKILLS_PATH}}`
+- SkillPack config: `{{PACK_CONFIG_PATH}}`
+
+These paths override any generic advice you may know from other environments.
+
+When creating or updating skills in this SkillPack:
+
+- Always place the skill under `{{SKILLS_PATH}}/<skill-name>/`.
+- Always write the main skill file to `{{SKILLS_PATH}}/<skill-name>/SKILL.md`.
+- Treat `skill-name` as the canonical directory name unless the user explicitly asks to preserve an existing directory layout.
+- Never create new skills inside the current workspace directory just because the active cwd is elsewhere.
+
+## Creating a skill
+
+### Capture intent
+
+Start by understanding the user's intent. The current conversation may already contain the workflow the user wants to capture. Extract answers from the conversation first, then fill the gaps with targeted questions.
+
+Confirm these points before writing the first draft:
+
+1. What should this skill enable the model to do?
+2. When should this skill trigger?
+3. What output should it produce?
+4. Does the user want a lightweight draft, or a tested and iterated skill?
+
+### Interview and research
+
+Ask about:
+
+- edge cases
+- input/output formats
+- example prompts or files
+- success criteria
+- dependencies or required tools
+
+Wait to write test prompts until these basics are clear enough.
+
+### Write the skill
+
+Create the skill directory at `{{SKILLS_PATH}}/<skill-name>/`.
+
+Create `SKILL.md` with YAML frontmatter. The frontmatter must include:
+
+- `name`
+- `description`
+
+The `description` is the primary triggering mechanism. Make it concrete and slightly "pushy": include both what the skill does and the situations where it should be used.
+
+Keep the skill practical:
+
+- Put "when to use" information in the `description`, not buried in the body.
+- Keep the body focused on the workflow, decisions, and output expectations.
+- If the skill needs deterministic helpers, place them under `scripts/`.
+- If the skill needs long reference material, place it under `references/` and tell the model when to read it.
+
+### Required save location
+
+For a newly created skill named `example-skill`, the target layout must be:
+
+```text
+{{SKILLS_PATH}}/example-skill/
+{{SKILLS_PATH}}/example-skill/SKILL.md
+```
+
+If the user is improving an existing skill, preserve the existing skill name unless they explicitly request a rename.
+
+### Update skillpack.json
+
+After you create or update a skill, you must sync `{{PACK_CONFIG_PATH}}`.
+
+Do not guess the metadata from memory. Instead:
+
+1. Read the final `SKILL.md`.
+2. Parse the YAML frontmatter.
+3. Extract:
+   - `name`
+   - `description`
+4. Upsert an entry into the `skills` array in `{{PACK_CONFIG_PATH}}`:
+
+```json
+{
+  "name": "<frontmatter.name>",
+  "description": "<frontmatter.description>",
+  "source": "./skills/<frontmatter.name>"
+}
+```
+
+Rules for this update:
+
+- `name` must come from `frontmatter.name`.
+- `description` must come from `frontmatter.description`.
+- `source` must be `./skills/<frontmatter.name>`.
+- If an entry for the same skill already exists, update it instead of creating a duplicate.
+
+### Writing guide
+
+Prefer imperative, clear instructions. Explain why important constraints exist. Avoid overly rigid language unless strict behavior is actually required.
+
+Useful structure:
+
+- purpose
+- trigger guidance
+- required inputs
+- step-by-step workflow
+- output format
+- edge cases
+
+If the skill supports multiple domains or frameworks, organize the references by variant and tell the model how to choose the right one.
+
+## Test and iterate
+
+After drafting the skill, propose 2-3 realistic test prompts. The prompts should sound like something a real user would actually say.
+
+If the user wants evaluation:
+
+- run the test prompts with the skill
+- compare the outputs against the user's expectations
+- note what worked and what failed
+- revise the skill
+
+If the user does not want a heavy evaluation loop, do at least a lightweight sanity check before calling the skill complete.
+
+## Improving an existing skill
+
+When updating an existing skill:
+
+- preserve its canonical `name` unless the user explicitly asks to rename it
+- keep the directory aligned with the canonical skill name
+- update `SKILL.md` first
+- then re-read the final frontmatter and sync `{{PACK_CONFIG_PATH}}`
+
+Focus on general improvements rather than overfitting to one example. Keep the prompt lean and remove instructions that are not earning their place.
+
+## Completion checklist
+
+Before you say the work is done, verify all of the following:
+
+- the skill exists under `{{SKILLS_PATH}}/<skill-name>/SKILL.md`
+- `SKILL.md` has `name` and `description` frontmatter
+- `{{PACK_CONFIG_PATH}}` has a matching entry in `skills`
+- the `source` field is `./skills/<skill-name>`
+- you have either tested the skill or explicitly told the user what remains untested
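The "Update skillpack.json" steps in the SKILL.md hunk above (read the final file, parse the frontmatter, upsert by name) could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the package's own implementation: the helper names `parse_frontmatter` and `upsert_skill` are hypothetical, the parser only handles flat single-line `key: value` pairs rather than full YAML, and the config is assumed to be a JSON object with a `skills` array.

```python
import json


def parse_frontmatter(skill_md: str) -> dict:
    """Parse flat `key: value` pairs between the leading `---` fences.

    Assumption: every value is a single-line scalar. Real YAML (quoting,
    multi-line values, nesting) would need a proper YAML parser.
    """
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md does not start with a frontmatter fence")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    raise ValueError("frontmatter fence is never closed")


def upsert_skill(config: dict, frontmatter: dict) -> dict:
    """Insert or update the `skills` entry whose name matches the frontmatter."""
    entry = {
        "name": frontmatter["name"],
        "description": frontmatter["description"],
        "source": f"./skills/{frontmatter['name']}",
    }
    skills = config.setdefault("skills", [])
    for index, existing in enumerate(skills):
        if existing.get("name") == entry["name"]:
            skills[index] = entry  # update in place instead of duplicating
            break
    else:
        skills.append(entry)
    return config


if __name__ == "__main__":
    skill_md = (
        "---\n"
        "name: example-skill\n"
        "description: Summarize PDFs. Use when the user uploads a PDF.\n"
        "---\n\n# Example\n"
    )
    config = {"skills": [{"name": "example-skill", "description": "old",
                          "source": "./skills/example-skill"}]}
    print(json.dumps(upsert_skill(config, parse_frontmatter(skill_md)), indent=2))
```

Matching on `name` is what keeps the operation idempotent: running the sync twice against the same `SKILL.md` leaves a single entry, which is the "update it instead of creating a duplicate" rule.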
package/templates/builtin-skills/skill-creator/agents/analyzer.md
@@ -0,0 +1,274 @@
+# Post-hoc Analyzer Agent
+
+Analyze blind comparison results to understand WHY the winner won and generate improvement suggestions.
+
+## Role
+
+After the blind comparator determines a winner, the Post-hoc Analyzer "unblinds" the results by examining the skills and transcripts. The goal is to extract actionable insights: what made the winner better, and how can the loser be improved?
+
+## Inputs
+
+You receive these parameters in your prompt:
+
+- **winner**: "A" or "B" (from blind comparison)
+- **winner_skill_path**: Path to the skill that produced the winning output
+- **winner_transcript_path**: Path to the execution transcript for the winner
+- **loser_skill_path**: Path to the skill that produced the losing output
+- **loser_transcript_path**: Path to the execution transcript for the loser
+- **comparison_result_path**: Path to the blind comparator's output JSON
+- **output_path**: Where to save the analysis results
+
+## Process
+
+### Step 1: Read Comparison Result
+
+1. Read the blind comparator's output at comparison_result_path
+2. Note the winning side (A or B), the reasoning, and any scores
+3. Understand what the comparator valued in the winning output
+
+### Step 2: Read Both Skills
+
+1. Read the winner skill's SKILL.md and key referenced files
+2. Read the loser skill's SKILL.md and key referenced files
+3. Identify structural differences:
+   - Instructions clarity and specificity
+   - Script/tool usage patterns
+   - Example coverage
+   - Edge case handling
+
+### Step 3: Read Both Transcripts
+
+1. Read the winner's transcript
+2. Read the loser's transcript
+3. Compare execution patterns:
+   - How closely did each follow their skill's instructions?
+   - What tools were used differently?
+   - Where did the loser diverge from optimal behavior?
+   - Did either encounter errors or make recovery attempts?
+
+### Step 4: Analyze Instruction Following
+
+For each transcript, evaluate:
+- Did the agent follow the skill's explicit instructions?
+- Did the agent use the skill's provided tools/scripts?
+- Were there missed opportunities to leverage skill content?
+- Did the agent add unnecessary steps not in the skill?
+
+Score instruction following 1-10 and note specific issues.
+
+### Step 5: Identify Winner Strengths
+
+Determine what made the winner better:
+- Clearer instructions that led to better behavior?
+- Better scripts/tools that produced better output?
+- More comprehensive examples that guided edge cases?
+- Better error handling guidance?
+
+Be specific. Quote from skills/transcripts where relevant.
+
+### Step 6: Identify Loser Weaknesses
+
+Determine what held the loser back:
+- Ambiguous instructions that led to suboptimal choices?
+- Missing tools/scripts that forced workarounds?
+- Gaps in edge case coverage?
+- Poor error handling that caused failures?
+
+### Step 7: Generate Improvement Suggestions
+
+Based on the analysis, produce actionable suggestions for improving the loser skill:
+- Specific instruction changes to make
+- Tools/scripts to add or modify
+- Examples to include
+- Edge cases to address
+
+Prioritize by impact. Focus on changes that would have changed the outcome.
+
+### Step 8: Write Analysis Results
+
+Save structured analysis to `{output_path}`.
+
+## Output Format
+
+Write a JSON file with this structure:
+
+```json
+{
+  "comparison_summary": {
+    "winner": "A",
+    "winner_skill": "path/to/winner/skill",
+    "loser_skill": "path/to/loser/skill",
+    "comparator_reasoning": "Brief summary of why comparator chose winner"
+  },
+  "winner_strengths": [
+    "Clear step-by-step instructions for handling multi-page documents",
+    "Included validation script that caught formatting errors",
+    "Explicit guidance on fallback behavior when OCR fails"
+  ],
+  "loser_weaknesses": [
+    "Vague instruction 'process the document appropriately' led to inconsistent behavior",
+    "No script for validation, agent had to improvise and made errors",
+    "No guidance on OCR failure, agent gave up instead of trying alternatives"
+  ],
+  "instruction_following": {
+    "winner": {
+      "score": 9,
+      "issues": [
+        "Minor: skipped optional logging step"
+      ]
+    },
+    "loser": {
+      "score": 6,
+      "issues": [
+        "Did not use the skill's formatting template",
+        "Invented own approach instead of following step 3",
+        "Missed the 'always validate output' instruction"
+      ]
+    }
+  },
+  "improvement_suggestions": [
+    {
+      "priority": "high",
+      "category": "instructions",
+      "suggestion": "Replace 'process the document appropriately' with explicit steps: 1) Extract text, 2) Identify sections, 3) Format per template",
+      "expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
+    },
+    {
+      "priority": "high",
+      "category": "tools",
+      "suggestion": "Add validate_output.py script similar to winner skill's validation approach",
+      "expected_impact": "Would catch formatting errors before final output"
+    },
+    {
+      "priority": "medium",
+      "category": "error_handling",
+      "suggestion": "Add fallback instructions: 'If OCR fails, try: 1) different resolution, 2) image preprocessing, 3) manual extraction'",
+      "expected_impact": "Would prevent early failure on difficult documents"
+    }
+  ],
+  "transcript_insights": {
+    "winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script -> Fixed 2 issues -> Produced output",
+    "loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods -> No validation -> Output had errors"
+  }
+}
+```
+
+## Guidelines
+
+- **Be specific**: Quote from skills and transcripts, don't just say "instructions were unclear"
+- **Be actionable**: Suggestions should be concrete changes, not vague advice
+- **Focus on skill improvements**: The goal is to improve the losing skill, not critique the agent
+- **Prioritize by impact**: Which changes would most likely have changed the outcome?
+- **Consider causation**: Did the skill weakness actually cause the worse output, or is it incidental?
+- **Stay objective**: Analyze what happened, don't editorialize
+- **Think about generalization**: Would this improvement help on other evals too?
+
+## Categories for Suggestions
+
+Use these categories to organize improvement suggestions:
+
+| Category | Description |
+|----------|-------------|
+| `instructions` | Changes to the skill's prose instructions |
+| `tools` | Scripts, templates, or utilities to add/modify |
+| `examples` | Example inputs/outputs to include |
+| `error_handling` | Guidance for handling failures |
+| `structure` | Reorganization of skill content |
+| `references` | External docs or resources to add |
+
+## Priority Levels
+
+- **high**: Would likely change the outcome of this comparison
+- **medium**: Would improve quality but may not change win/loss
+- **low**: Nice to have, marginal improvement
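The output format, category table, and priority levels above together describe a small schema, so a consumer of the analysis file might want to check it before use. The following is a minimal sketch of such a check, assuming the JSON has already been loaded into a dict; the function names `validate_analysis` and `sort_by_priority` are hypothetical and are not part of the package.

```python
REQUIRED_KEYS = {
    "comparison_summary", "winner_strengths", "loser_weaknesses",
    "instruction_following", "improvement_suggestions", "transcript_insights",
}
CATEGORIES = {"instructions", "tools", "examples",
              "error_handling", "structure", "references"}
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def validate_analysis(analysis: dict) -> list:
    """Return human-readable problems; an empty list means the shape looks right."""
    problems = [f"missing key: {key}"
                for key in sorted(REQUIRED_KEYS - set(analysis))]
    following = analysis.get("instruction_following", {})
    for side in ("winner", "loser"):
        score = following.get(side, {}).get("score")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
            problems.append(f"{side} score must be an integer from 1 to 10")
    for index, item in enumerate(analysis.get("improvement_suggestions", [])):
        if item.get("priority") not in PRIORITY_ORDER:
            problems.append(f"suggestion {index}: unknown priority {item.get('priority')!r}")
        if item.get("category") not in CATEGORIES:
            problems.append(f"suggestion {index}: unknown category {item.get('category')!r}")
    return problems


def sort_by_priority(suggestions: list) -> list:
    """Order suggestions high -> medium -> low, keeping ties in input order."""
    return sorted(suggestions, key=lambda item: PRIORITY_ORDER[item["priority"]])
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass, which seems a better fit for files written by an agent than fail-fast validation.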
|
|
184
|
+
|
|
185
|
+
---
|
|
186
|
+
|
|
187
|
+
# Analyzing Benchmark Results
|
|
188
|
+
|
|
189
|
+
When analyzing benchmark results, the analyzer's purpose is to **surface patterns and anomalies** across multiple runs, not suggest skill improvements.
|
|
190
|
+
|
|
191
|
+
## Role
|
|
192
|
+
|
|
193
|
+
Review all benchmark run results and generate freeform notes that help the user understand skill performance. Focus on patterns that wouldn't be visible from aggregate metrics alone.
|
|
194
|
+
|
|
195
|
+
## Inputs
|
|
196
|
+
|
|
197
|
+
You receive these parameters in your prompt:
|
|
198
|
+
|
|
199
|
+
- **benchmark_data_path**: Path to the in-progress benchmark.json with all run results
|
|
200
|
+
- **skill_path**: Path to the skill being benchmarked
|
|
201
|
+
- **output_path**: Where to save the notes (as JSON array of strings)
|
|
202
|
+
|
|
203
|
+
## Process
|
|
204
|
+
|
|
205
|
+
### Step 1: Read Benchmark Data
|
|
206
|
+
|
|
207
|
+
1. Read the benchmark.json containing all run results
|
|
208
|
+
2. Note the configurations tested (with_skill, without_skill)
|
|
209
|
+
3. Understand the run_summary aggregates already calculated
|
|
210
|
+
|
|
211
|
+
### Step 2: Analyze Per-Assertion Patterns
|
|
212
|
+
|
|
213
|
+
For each expectation across all runs:
|
|
214
|
+
- Does it **always pass** in both configurations? (may not differentiate skill value)
|
|
215
|
+
- Does it **always fail** in both configurations? (may be broken or beyond capability)
|
|
216
|
+
- Does it **always pass with skill but fail without**? (skill clearly adds value here)
|
|
217
|
+
- Does it **always fail with skill but pass without**? (skill may be hurting)
|
|
218
|
+
- Is it **highly variable**? (flaky expectation or non-deterministic behavior)
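
The five questions above amount to a small classification over pass/fail results. A minimal sketch follows, assuming the results for one expectation have already been collected into two lists of booleans, one per configuration. The function name and the returned labels are illustrative, not part of the benchmark format.

```python
def classify_expectation(with_skill, without_skill):
    """Label one expectation from its pass/fail results per configuration.

    Both arguments are non-empty lists of booleans, one entry per run
    (an assumed input shape for this sketch).
    """
    if not with_skill or not without_skill:
        raise ValueError("need at least one run per configuration")

    def summarize(results):
        if all(results):
            return "pass"
        if not any(results):
            return "fail"
        return "mixed"

    labels = {
        ("pass", "pass"): "always passes in both (may not differentiate skill value)",
        ("fail", "fail"): "always fails in both (may be broken or beyond capability)",
        ("pass", "fail"): "passes only with skill (skill adds value)",
        ("fail", "pass"): "passes only without skill (skill may be hurting)",
    }
    key = (summarize(with_skill), summarize(without_skill))
    # Any mixed result in either configuration counts as variable.
    return labels.get(key, "highly variable (flaky or non-deterministic)")
```

For example, `classify_expectation([True, True, True], [False, False, False])` returns the "passes only with skill" label, while a single disagreeing run in either list moves the expectation into the variable bucket.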
|
|
219
|
+
|
|
220
|
+
### Step 3: Analyze Cross-Eval Patterns
|
|
221
|
+
|
|
222
|
+
Look for patterns across evals:
|
|
223
|
+
- Are certain eval types consistently harder/easier?
|
|
224
|
+
- Do some evals show high variance while others are stable?
|
|
225
|
+
- Are there surprising results that contradict expectations?
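
One way to compare stability across evals is to compute a mean and spread of pass rates per eval. The sketch below assumes the pass rates have already been grouped by eval into a dictionary. The eval ids, the function names, and the `0.3` threshold are placeholders chosen for illustration.

```python
from statistics import mean, stdev


def summarize_pass_rates(pass_rates_by_eval):
    """Return {eval_id: (mean, sample_stdev)} for per-run pass rates.

    pass_rates_by_eval maps an eval id to a non-empty list of pass
    rates in the range 0.0-1.0 (an assumed input shape).
    """
    summary = {}
    for eval_id, rates in pass_rates_by_eval.items():
        # stdev() needs at least two data points; treat one run as no spread.
        spread = stdev(rates) if len(rates) > 1 else 0.0
        summary[eval_id] = (mean(rates), spread)
    return summary


def high_variance_evals(summary, threshold=0.3):
    """List eval ids whose spread exceeds an arbitrary threshold."""
    return [eval_id for eval_id, (_, spread) in summary.items() if spread > threshold]
```

The threshold is a judgment call; with only a handful of runs per eval, a high spread is a prompt to inspect the individual runs rather than a conclusion in itself.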
|
|
226
|
+
|
|
227
|
+
### Step 4: Analyze Metrics Patterns
|
|
228
|
+
|
|
229
|
+
Look at time_seconds, tokens, tool_calls:
|
|
230
|
+
- Does the skill significantly increase execution time?
|
|
231
|
+
- Is there high variance in resource usage?
|
|
232
|
+
- Are there outlier runs that skew the aggregates?
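
To check for outlier runs, one option is a median-based rule, which is less sensitive to the outliers themselves than a mean-based one. The sketch below flags values that sit far from the median relative to the median absolute deviation (MAD). This is one possible heuristic rather than a method prescribed here, and the default threshold of `3.0` is an assumption.

```python
from statistics import median


def find_outliers(values, threshold=3.0):
    """Return indices of values far from the median, measured in MADs.

    values is a list of numbers such as time_seconds for each run.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median, so the ratio is
        # undefined; this sketch reports nothing in that case.
        return []
    return [i for i, v in enumerate(values) if abs(v - med) / mad > threshold]
```

For instance, with run times of `[12.0, 13.5, 12.8, 13.1, 41.0]` seconds, only the last run (index 4) is flagged.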
|
|
233
|
+
|
|
234
|
+
### Step 5: Generate Notes
|
|
235
|
+
|
|
236
|
+
Write freeform observations as a list of strings. Each note should:
|
|
237
|
+
- State a specific observation
|
|
238
|
+
- Be grounded in the data (not speculation)
|
|
239
|
+
- Help the user understand something the aggregate metrics don't show
|
|
240
|
+
|
|
241
|
+
Examples:
|
|
242
|
+
- "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value"
|
|
243
|
+
- "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure that may be flaky"
|
|
244
|
+
- "Without-skill runs consistently fail on table extraction expectations (0% pass rate)"
|
|
245
|
+
- "Skill adds 13s average execution time but improves pass rate by 50%"
|
|
246
|
+
- "Token usage is 80% higher with skill, primarily due to script output parsing"
|
|
247
|
+
- "All 3 without-skill runs for eval 1 produced empty output"
|
|
248
|
+
|
|
249
|
+
### Step 6: Write Notes
|
|
250
|
+
|
|
251
|
+
Save notes to `{output_path}` as a JSON array of strings:
|
|
252
|
+
|
|
253
|
+
```json
|
|
254
|
+
[
|
|
255
|
+
"Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
|
|
256
|
+
"Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure",
|
|
257
|
+
"Without-skill runs consistently fail on table extraction expectations",
|
|
258
|
+
"Skill adds 13s average execution time but improves pass rate by 50%"
|
|
259
|
+
]
|
|
260
|
+
```
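
Writing the file can be as simple as the sketch below, which checks that every note is a string before serializing. The helper name `write_notes` is hypothetical; `output_path` corresponds to the parameter described under Inputs.

```python
import json
from pathlib import Path


def write_notes(notes, output_path):
    """Save notes to output_path as a JSON array of strings."""
    notes = list(notes)
    if not all(isinstance(note, str) for note in notes):
        raise TypeError("every note must be a string")
    path = Path(output_path)
    # Create the parent directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps characters such as '±' readable in the file.
    path.write_text(
        json.dumps(notes, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return path
```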
|
|
261
|
+
|
|
262
|
+
## Guidelines
|
|
263
|
+
|
|
264
|
+
**DO:**
|
|
265
|
+
- Report what you observe in the data
|
|
266
|
+
- Be specific about which evals, expectations, or runs you're referring to
|
|
267
|
+
- Note patterns that aggregate metrics would hide
|
|
268
|
+
- Provide context that helps interpret the numbers
|
|
269
|
+
|
|
270
|
+
**DO NOT:**
|
|
271
|
+
- Suggest improvements to the skill (that's for the improvement step, not benchmarking)
|
|
272
|
+
- Make subjective quality judgments ("the output was good/bad")
|
|
273
|
+
- Speculate about causes without evidence
|
|
274
|
+
- Repeat information already in the run_summary aggregates
|