@buaa_smat/hometrans 0.1.7 → 0.1.8

@@ -1,163 +0,0 @@
1
- # Report Template
2
-
3
- ALWAYS use this exact structure for the evaluation report. Replace bracketed placeholders with actual content. A section that allows N/A may be skipped only when its dimension was not evaluated at all (e.g., the skill has no scripts).
4
-
5
- ---
6
-
7
- # Skill Quality Report: [skill-name]
8
-
9
- **Evaluated on:** [date]
10
- **Skill path:** [path]
11
- **Overall score:** [X.X] / 5.0 — [Rating]
12
-
13
- ---
14
-
15
- ## Executive Summary
16
-
17
- [2-3 sentences: what the skill does, its strongest quality, and the single most impactful improvement it needs.]
18
-
19
- ---
20
-
21
- ## Dimension Scores
22
-
23
- | Dimension | Score | Weight | Weighted |
24
- |-----------|-------|--------|----------|
25
- | 1. Spec Compliance | X.X | 10% | X.X |
26
- | 2. Progressive Disclosure | X.X | 15% | X.X |
27
- | 3. Content Efficiency | X.X | 20% | X.X |
28
- | 4. Instruction Quality | X.X | 25% | X.X |
29
- | 5. Description Effectiveness | X.X | 15% | X.X |
30
- | 6. Script Quality | X.X / N/A | 5% | X.X |
31
- | 7. Evaluability | X.X | 10% | X.X |
32
- | **Overall** | | | **X.X** |
33
-
34
- ---
35
-
36
- ## Dimension 1 — Specification Compliance: X.X / 5.0
37
-
38
- | # | Criterion | Score | Evidence |
39
- |---|-----------|-------|----------|
40
- | 1.1 | Frontmatter validity | X | [Specific observation] |
41
- | 1.2 | Name rules | X | [Specific observation] |
42
- | 1.3 | Description length | X | [Specific observation] |
43
- | 1.4 | Directory structure | X | [Specific observation] |
44
-
45
- **What's good:** [1 sentence]
46
- **What to improve:** [1 sentence with action]
47
-
48
- ---
49
-
50
- ## Dimension 2 — Progressive Disclosure: X.X / 5.0
51
-
52
- | # | Criterion | Score | Evidence |
53
- |---|-----------|-------|----------|
54
- | 2.1 | SKILL.md size | X | [Line count / observation] |
55
- | 2.2 | Reference usage | X | [Which references exist, are they focused?] |
56
- | 2.3 | Load triggers | X | [Does the skill say when to load each reference?] |
57
- | 2.4 | Path conventions | X | [Relative paths? Chain depth?] |
58
-
59
- **What's good:** [1 sentence]
60
- **What to improve:** [1 sentence with action]
61
-
62
- ---
63
-
64
- ## Dimension 3 — Content Efficiency: X.X / 5.0
65
-
66
- | # | Criterion | Score | Evidence |
67
- |---|-----------|-------|----------|
68
- | 3.1 | No generic filler | X | [Examples of filler found, or confirmation of none] |
69
- | 3.2 | Coherent scope | X | [Scope assessment] |
70
- | 3.3 | Appropriate detail | X | [Detail level assessment] |
71
- | 3.4 | Domain grounding | X | [What domain-specific knowledge is present?] |
72
-
73
- **What's good:** [1 sentence]
74
- **What to improve:** [1 sentence with action]
75
-
76
- ---
77
-
78
- ## Dimension 4 — Instruction Quality: X.X / 5.0
79
-
80
- | # | Criterion | Score | Evidence |
81
- |---|-----------|-------|----------|
82
- | 4.1 | Calibrated specificity | X | [Examples of good or poor calibration] |
83
- | 4.2 | Defaults over menus | X | [Where does it provide defaults? Where does it list options?] |
84
- | 4.3 | Procedural over declarative | X | [Does it teach approach or give specific answers?] |
85
- | 4.4 | Explains why | X | [Examples of reasoning in instructions] |
86
- | 4.5 | Gotchas | X | [Presence and quality of gotchas section] |
87
-
88
- **What's good:** [1 sentence]
89
- **What to improve:** [1 sentence with action]
90
-
91
- ---
92
-
93
- ## Dimension 5 — Description Effectiveness: X.X / 5.0
94
-
95
- | # | Criterion | Score | Evidence |
96
- |---|-----------|-------|----------|
97
- | 5.1 | Imperative framing | X | [Quote the description or its key phrases] |
98
- | 5.2 | Intent-focused | X | [Mechanics vs. user intent analysis] |
99
- | 5.3 | Trigger coverage | X | [What triggers are covered? What's missing?] |
100
- | 5.4 | Conciseness | X | [Character count and density assessment] |
101
- | 5.5 | Keyword specificity | X | [Would this false-trigger or miss-trigger?] |
102
-
103
- **What's good:** [1 sentence]
104
- **What to improve:** [1 sentence with action]
105
-
106
- ---
107
-
108
- ## Dimension 6 — Script Quality: [X.X / 5.0 or N/A]
109
-
110
- [If N/A: "No scripts/ directory — this dimension is not applicable."]
111
-
112
- [If scored:]
113
-
114
- | # | Criterion | Score | Evidence |
115
- |---|-----------|-------|----------|
116
- | 6.1 | Self-contained | X | [Dependency handling] |
117
- | 6.2 | Non-interactive | X | [Any TTY prompts?] |
118
- | 6.3 | Help output | X | [--help quality] |
119
- | 6.4 | Error messages | X | [Error message quality] |
120
- | 6.5 | Structured output | X | [Output format assessment] |
121
-
122
- **What's good:** [1 sentence]
123
- **What to improve:** [1 sentence with action]
124
-
125
- ---
126
-
127
- ## Dimension 7 — Evaluability: X.X / 5.0
128
-
129
- | # | Criterion | Score | Evidence |
130
- |---|-----------|-------|----------|
131
- | 7.1 | Test cases exist | X | [evals/evals.json present? How many cases?] |
132
- | 7.2 | Realistic prompts | X | [Example prompt quality] |
133
- | 7.3 | Verifiable assertions | X | [Are assertions specific and checkable?] |
134
- | 7.4 | Edge case coverage | X | [Any boundary condition tests?] |
135
-
136
- **What's good:** [1 sentence]
137
- **What to improve:** [1 sentence with action]
138
-
139
- ---
140
-
141
- ## Top 3 Strengths
142
-
143
- 1. **[Strength title]** — [1-2 sentences with evidence]
144
- 2. **[Strength title]** — [1-2 sentences with evidence]
145
- 3. **[Strength title]** — [1-2 sentences with evidence]
146
-
147
- ---
148
-
149
- ## Top 3 Improvement Areas
150
-
151
- 1. **[Issue title]** — [Why it matters] → **Fix:** [Specific, actionable change]
152
- 2. **[Issue title]** — [Why it matters] → **Fix:** [Specific, actionable change]
153
- 3. **[Issue title]** — [Why it matters] → **Fix:** [Specific, actionable change]
154
-
155
- ---
156
-
157
- ## Quick Wins
158
-
159
- [2-3 one-line fixes that would take under 5 minutes and yield immediate improvement. If none are obvious, say "No quick wins identified — improvements require structural changes."]
160
-
161
- ---
162
-
163
- *Report generated by skill-quality-evaluator*
@@ -1,269 +0,0 @@
1
- # Scoring Rubric
2
-
3
- Each sub-criterion is scored on a 1-5 scale. The anchors below define what each score means per criterion. When evidence sits between two anchors, use the intermediate score (2 or 4).
4
-
5
- ## 1. Specification Compliance
6
-
7
- ### 1.1 Frontmatter validity
8
- | Score | Anchor |
9
- |-------|--------|
10
- | 1 | YAML is malformed or missing entirely; required fields absent |
11
- | 3 | YAML parses but has minor issues (e.g., indentation warnings); required fields present but one may be empty |
12
- | 5 | YAML is well-formed; `name` and `description` both present and non-empty |
13
-
14
- ### 1.2 Name rules
15
- | Score | Anchor |
16
- |-------|--------|
17
- | 1 | Name is missing, contains uppercase, starts/ends with hyphen, or has consecutive hyphens |
18
- | 3 | Name follows most rules but has one violation (e.g., trailing hyphen, doesn't match directory) |
19
- | 5 | Name follows all rules: 1-64 chars, lowercase alphanumeric + hyphens, no leading/trailing/consecutive hyphens, matches directory |
20
-
21
- ### 1.3 Description length
22
- | Score | Anchor |
23
- |-------|--------|
24
- | 1 | Missing or over 1024 characters |
25
- | 3 | Present but too short (under ~20 chars, too vague to trigger) or uncomfortably close to the 1024 limit |
26
- | 5 | Within limit, informative length (typically 100-600 chars) |
27
-
28
- ### 1.4 Directory structure
29
- | Score | Anchor |
30
- |-------|--------|
31
- | 1 | No SKILL.md at root |
32
- | 3 | Has SKILL.md but auxiliary directories are disorganized (e.g., scripts mixed into root, references inside assets) |
33
- | 5 | Clean structure: SKILL.md at root, optional dirs named conventionally (`scripts/`, `references/`, `assets/`) |
34
-
35
- ### 1.5 No broken references
36
- | Score | Anchor |
37
- |-------|--------|
38
- | 1 | Multiple broken references — SKILL.md references files in scripts/, references/, or assets/ that don't exist; or the skill's resource tables list non-existent files |
39
- | 3 | One broken or suspicious reference (e.g., the path looks right but the file is missing, or a reference exists in prose but isn't listed in the resource table) |
40
- | 5 | Every file path mentioned in SKILL.md resolves to an actual file on disk. Every entry in scripts/references/assets tables exists. No dangling references. |
41
-
42
- ## 2. Progressive Disclosure
43
-
44
- ### 2.1 SKILL.md size
45
- | Score | Anchor |
46
- |-------|--------|
47
- | 1 | Over 1000 lines — bloated, loads massive context on every invocation |
48
- | 3 | 500-700 lines — slightly over the recommended 500, or approaching the limit with dense content |
49
- | 5 | Under 500 lines, tightly scoped to core instructions |
50
-
51
- ### 2.2 Reference usage
52
- | Score | Anchor |
53
- |-------|--------|
54
- | 1 | All detail crammed into SKILL.md; no references/ even when content is clearly reference-grade (long tables, exhaustive lists, multi-domain variants) |
55
- | 3 | Has references/ but some material that belongs there is still inline, or reference files are unfocused dumps |
56
- | 5 | Reference-grade material lives in references/; each reference file has a clear, focused topic |
57
-
58
- ### 2.3 Load triggers
59
- | Score | Anchor |
60
- |-------|--------|
61
- | 1 | References files without saying when to read them ("see references/ for details") |
62
- | 3 | Some references have conditional triggers, others are just listed |
63
- | 5 | Every reference file has a clear conditional trigger ("if X happens, read `references/Y.md`") |
64
-
65
- ### 2.4 Path conventions
66
- | Score | Anchor |
67
- |-------|--------|
68
- | 1 | Uses absolute paths or deeply nested reference chains (A → B → C → D) |
69
- | 3 | Uses relative paths but has one case of a 2+ level deep reference chain |
70
- | 5 | All paths are relative from skill root; references are at most one level deep from SKILL.md |
71
-
72
- ## 3. Content Efficiency
73
-
74
- ### 3.1 No generic filler
75
- | Score | Anchor |
76
- |-------|--------|
77
- | 1 | Large sections explain basic concepts the agent already knows (e.g., "JSON is a data format...", "Git is a version control system...") |
78
- | 3 | A few instances of unnecessary explanation mixed with useful content |
79
- | 5 | Every sentence earns its place. The "would the agent get this wrong without it?" test passes throughout. |
80
-
81
- ### 3.2 Coherent scope
82
- | Score | Anchor |
83
- |-------|--------|
84
- | 1 | Scope is either a single trivial action (too narrow) or a sprawling collection of unrelated workflows (too broad) |
85
- | 3 | Scope is generally coherent but has one notable inclusion/exclusion that doesn't fit |
86
- | 5 | Encapsulates one coherent unit of work that composes well with other skills |
87
-
88
- ### 3.3 Appropriate detail
89
- | Score | Anchor |
90
- |-------|--------|
91
- | 1 | Either a vague one-liner with no structure OR an exhaustive reference manual covering every edge case |
92
- | 3 | Generally good detail level but with sections that are too sparse or too exhaustive |
93
- | 5 | Concise, stepwise guidance with enough structure to follow reliably; leaves reasonable judgment calls to the agent |
94
-
95
- ### 3.4 Domain grounding
96
- | Score | Anchor |
97
- |-------|--------|
98
- | 1 | Entirely generic — could have been written by an LLM with no project-specific knowledge |
99
- | 3 | Some domain-specific content but also significant generic portions |
100
- | 5 | Rich with project-specific conventions, environment-specific gotchas, non-obvious patterns — things only an expert on *this* project would know |
101
-
102
- ## 4. Instruction Quality
103
-
104
- ### 4.1 Calibrated specificity
105
- | Score | Anchor |
106
- |-------|--------|
107
- | 1 | Uniformly rigid (ALWAYS/NEVER for everything) or uniformly vague ("handle this appropriately") |
108
- | 3 | Mostly well-calibrated but with a few instructions that are too rigid or too loose for their context |
109
- | 5 | Specificity matches fragility: prescriptive for brittle operations, flexible where multiple approaches work |
110
-
111
- ### 4.2 Defaults over menus
112
- | Score | Anchor |
113
- |-------|--------|
114
- | 1 | Presents undifferentiated lists of tools/approaches with no guidance on which to pick |
115
- | 3 | Provides defaults for some choices but presents menus for others |
116
- | 5 | Every choice point has a clear default with alternatives noted briefly |
117
-
118
- ### 4.3 Procedural over declarative
119
- | Score | Anchor |
120
- |-------|--------|
121
- | 1 | Gives specific answers for specific cases rather than reusable methods (e.g., exact SQL instead of a query-building approach) |
122
- | 3 | Has some reusable method descriptions but also some overly specific one-shot instructions |
123
- | 5 | Teaches *how to approach* a class of problems; instructions generalize beyond the examples |
124
-
125
- ### 4.4 Explains why
126
- | Score | Anchor |
127
- |-------|--------|
128
- | 1 | Pure directives with zero reasoning; "just do X" throughout |
129
- | 3 | Some instructions include reasoning but most are bare directives |
130
- | 5 | Non-obvious instructions consistently include the reasoning behind them |
131
-
132
- ### 4.5 Gotchas
133
- | Score | Anchor |
134
- |-------|--------|
135
- | 1 | No gotchas section, even though the domain likely has non-obvious pitfalls |
136
- | 3 | Has gotchas but they are too generic ("handle errors") or miss important ones |
137
- | 5 | Concrete, environment-specific corrections that would prevent real mistakes; every gotcha is testable |
138
-
139
- ## 5. Description Effectiveness
140
-
141
- ### 5.1 Imperative framing
142
- | Score | Anchor |
143
- |-------|--------|
144
- | 1 | Passive description ("This skill helps with PDFs") |
145
- | 3 | Mix of declarative and imperative |
146
- | 5 | Consistently imperative: "Use this skill when...", "Trigger on...", clearly instructing the agent |
147
-
148
- ### 5.2 Intent-focused
149
- | Score | Anchor |
150
- |-------|--------|
151
- | 1 | Describes internal mechanics ("This skill runs a Python script that parses XML and...") |
152
- | 3 | Mentions both mechanics and user intent |
153
- | 5 | Describes what the user wants to achieve; implementation details are absent from the description |
154
-
155
- ### 5.3 Trigger coverage
156
- | Score | Anchor |
157
- |-------|--------|
158
- | 1 | Only triggers on exact keyword matches; misses obvious related requests |
159
- | 3 | Covers common triggers but has notable gaps (e.g., covers English but not Chinese, covers formal but not casual phrasing) |
160
- | 5 | Pushy about when to trigger; covers synonyms, casual phrasing, cross-language triggers, and implicit mentions |
161
-
162
- ### 5.4 Conciseness
163
- | Score | Anchor |
164
- |-------|--------|
165
- | 1 | Either a 5-word fragment or a 1000+ character paragraph |
166
- | 3 | Reasonable length but could be tighter or slightly more expansive |
167
- | 5 | Dense with signal; covers scope fully without bloat; well under 1024 chars |
168
-
169
- ### 5.5 Keyword specificity
170
- | Score | Anchor |
171
- |-------|--------|
172
- | 1 | Keywords are so generic they'd cause false triggers on unrelated tasks, or so narrow they'd miss most real uses |
173
- | 3 | Keywords are reasonable but would either false-trigger on some adjacent domains or miss some relevant ones |
174
- | 5 | Keywords precisely carve out the skill's domain; would match relevant prompts and not match adjacent-but-different ones |
175
-
176
- ## 6. Script Quality
177
-
178
- Skip this dimension if no `scripts/` directory exists.
179
-
180
- ### 6.1 Self-contained
181
- | Score | Anchor |
182
- |-------|--------|
183
- | 1 | Scripts have undocumented external dependencies; fail on first run without manual setup |
184
- | 3 | Dependencies documented but not declared inline; requires an install step |
185
- | 5 | Dependencies declared inline (PEP 723, `package.json`, `bundler/inline`) or scripts use only stdlib |
186
-
187
- ### 6.2 Non-interactive
188
- | Score | Anchor |
189
- |-------|--------|
190
- | 1 | Scripts prompt for input interactively (TTY read, confirm dialog) |
191
- | 3 | Mostly non-interactive but one script has an interactive fallback |
192
- | 5 | All input via flags, env vars, or stdin; no TTY dependency |
193
-
194
- ### 6.3 Help output
195
- | Score | Anchor |
196
- |-------|--------|
197
- | 1 | No `--help` or equivalent |
198
- | 3 | `--help` exists but is minimal (just lists flags, no examples) |
199
- | 5 | `--help` shows usage, flag descriptions, and working examples |
200
-
201
- ### 6.4 Error messages
202
- | Score | Anchor |
203
- |-------|--------|
204
- | 1 | Opaque errors ("Error: invalid input") or silent failures |
205
- | 3 | Some errors are descriptive, others are cryptic |
206
- | 5 | Every error says what went wrong, what was expected, and what to try |
207
-
208
- ### 6.5 Structured output
209
- | Score | Anchor |
210
- |-------|--------|
211
- | 1 | Free-form text output only; no structured format available |
212
- | 3 | Has a structured output option but it's not the default, or the structured format is poorly chosen |
213
- | 5 | Structured output (JSON/CSV) is the default or easily selectable; diagnostics go to stderr |
214
-
215
- ## 7. Evaluability
216
-
217
- ### 7.1 Test cases exist
218
- | Score | Anchor |
219
- |-------|--------|
220
- | 1 | No `evals/` directory, or no `evals/evals.json` inside it |
221
- | 3 | Has evals/evals.json but with only 1 test case |
222
- | 5 | Has evals/evals.json with 2+ test cases |
223
-
224
- ### 7.2 Realistic prompts
225
- | Score | Anchor |
226
- |-------|--------|
227
- | 1 | Prompts are generic placeholders ("test the skill", "run the workflow") |
228
- | 3 | Some prompts are realistic but others are too abstract |
229
- | 5 | Prompts look like real user messages: file paths, personal context, varied formality, specific details |
230
-
231
- ### 7.3 Verifiable assertions
232
- | Score | Anchor |
233
- |-------|--------|
234
- | 1 | Assertions are vague ("output is good", "works correctly") or missing entirely |
235
- | 3 | Mix of specific and vague assertions |
236
- | 5 | Every assertion is specific, objective, and verifiable — you could write a script to check most of them |
237
-
238
- ### 7.4 Edge case coverage
239
- | Score | Anchor |
240
- |-------|--------|
241
- | 1 | All test cases are happy-path only |
242
- | 3 | Has one edge case but it's superficial |
243
- | 5 | At least one test case thoroughly probes a boundary condition (malformed input, ambiguous request, missing data) |
244
-
245
- ## Dimension Weights
246
-
247
- The overall score is a weighted average of dimension scores. Weights reflect the relative importance of each dimension to skill effectiveness:
248
-
249
- | Dimension | Weight | Rationale |
250
- |-----------|--------|-----------|
251
- | 1. Spec Compliance | 10% | Table stakes — must pass, but doesn't differentiate great from good |
252
- | 2. Progressive Disclosure | 15% | Directly affects agent context efficiency and skill composability |
253
- | 3. Content Efficiency | 20% | The core value question: does this skill add information the agent lacks? |
254
- | 4. Instruction Quality | 25% | The highest-impact dimension — how well the skill guides the agent |
255
- | 5. Description Effectiveness | 15% | Determines whether the skill gets used at all |
256
- | 6. Script Quality | 5% | Optional; important when present but many good skills have no scripts |
257
- | 7. Evaluability | 10% | Indicates maturity and maintainability |
258
-
259
- When a dimension is N/A (e.g., no scripts), redistribute its weight proportionally across the remaining dimensions.
260
-
261
- ## Overall Score Interpretation
262
-
263
- | Range | Rating | Meaning |
264
- |-------|--------|---------|
265
- | 4.5 - 5.0 | Excellent | Production-ready; minor polish at most |
266
- | 3.5 - 4.4 | Good | Solid skill; a few specific improvements would elevate it |
267
- | 2.5 - 3.4 | Adequate | Functional but has notable gaps; needs targeted work |
268
- | 1.5 - 2.4 | Weak | Major issues across multiple dimensions; significant revision needed |
269
- | 1.0 - 1.4 | Poor | Fundamentally incomplete or misaligned with best practices |
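
The weighting rule in the deleted rubric (a weighted average, with the weight of any N/A dimension redistributed proportionally) and the rating table can be sketched in a few lines of Python. This is a minimal illustration, not code from the package: the names `WEIGHTS`, `RATINGS`, `overall_score`, and `rating` are hypothetical, `None` is an assumed way to mark an N/A dimension, and rounding to one decimal is an assumption based on the `X.X` placeholders in the report template.

```python
from typing import Optional

# Weights from the "Dimension Weights" table, written as fractions of 1.
WEIGHTS = {
    "Spec Compliance": 0.10,
    "Progressive Disclosure": 0.15,
    "Content Efficiency": 0.20,
    "Instruction Quality": 0.25,
    "Description Effectiveness": 0.15,
    "Script Quality": 0.05,
    "Evaluability": 0.10,
}

# Lower bounds from the "Overall Score Interpretation" table, highest first.
RATINGS = [
    (4.5, "Excellent"),
    (3.5, "Good"),
    (2.5, "Adequate"),
    (1.5, "Weak"),
    (1.0, "Poor"),
]


def overall_score(scores: dict[str, Optional[float]]) -> float:
    """Weighted average of dimension scores; None marks an N/A dimension."""
    active = {name: s for name, s in scores.items() if s is not None}
    if not active:
        raise ValueError("at least one dimension must be scored")
    remaining_weight = sum(WEIGHTS[name] for name in active)
    weighted_sum = sum(WEIGHTS[name] * s for name, s in active.items())
    # Dividing by the remaining weight scales every active weight by the
    # same factor, which is a proportional redistribution of the N/A weight.
    return round(weighted_sum / remaining_weight, 1)  # one decimal: assumption


def rating(score: float) -> str:
    """Map an overall score to its rating label."""
    for lower_bound, label in RATINGS:
        if score >= lower_bound:
            return label
    raise ValueError(f"score {score} is below the 1.0 minimum")


# Hypothetical example: a skill with no scripts/ directory.
example = {
    "Spec Compliance": 5.0,
    "Progressive Disclosure": 4.0,
    "Content Efficiency": 3.0,
    "Instruction Quality": 4.0,
    "Description Effectiveness": 5.0,
    "Script Quality": None,
    "Evaluability": 2.0,
}
score = overall_score(example)
print(score, rating(score))
```

For these made-up scores the weighted sum is 3.65 over a remaining weight of 0.95, which gives 3.8 and the rating Good. Normalizing by the remaining weight avoids recomputing each percentage by hand when Script Quality is skipped.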