npm - @buaa_smat/hometrans - Versions diffs - 0.1.6 → 0.1.7 - Mend

@buaa_smat/hometrans 0.1.6 → 0.1.7

Files changed (12)

package/skills/skill-quality-evaluator/references/Optimizing-skill-descriptions.md ADDED

@@ -0,0 +1,196 @@
+> ## Documentation Index
+> Fetch the complete documentation index at: https://agentskills.io/llms.txt
+> Use this file to discover all available pages before exploring further.
+# Optimizing skill descriptions
+> How to improve your skill's description so it triggers reliably on relevant prompts.
+A skill only helps if it gets activated. The `description` field in your `SKILL.md` frontmatter is the primary mechanism agents use to decide whether to load a skill for a given task. An under-specified description means the skill won't trigger when it should; an over-broad description means it triggers when it shouldn't.
+This guide covers how to systematically test and improve your skill's description for triggering accuracy.
+## How skill triggering works
+Agents use [progressive disclosure](/specification#progressive-disclosure) to manage context. At startup, they load only the `name` and `description` of each available skill — just enough to decide when a skill might be relevant. When a user's task matches a description, the agent reads the full `SKILL.md` into context and follows its instructions.
+This means the description carries the entire burden of triggering. If the description doesn't convey when the skill is useful, the agent won't know to reach for it.
+One important nuance: agents typically only consult skills for tasks that require knowledge or capabilities beyond what they can handle alone. A simple, one-step request like "read this PDF" may not trigger a PDF skill even if the description matches perfectly, because the agent can handle it with basic tools. Tasks that involve specialized knowledge — an unfamiliar API, a domain-specific workflow, or an uncommon format — are where a well-written description can make the difference.
+## Writing effective descriptions
+Before testing, it helps to know what a good description looks like. A few principles:
+* **Use imperative phrasing.** Frame the description as an instruction to the agent: "Use this skill when..." rather than "This skill does..." The agent is deciding whether to act, so tell it when to act.
+* **Focus on user intent, not implementation.** Describe what the user is trying to achieve, not the skill's internal mechanics. The agent matches against what the user asked for.
+* **Err on the side of being pushy.** Explicitly list contexts where the skill applies, including cases where the user doesn't name the domain directly: "even if they don't explicitly mention 'CSV' or 'analysis.'"
+* **Keep it concise.** A few sentences to a short paragraph is usually right — long enough to cover the skill's scope, short enough that it doesn't bloat the agent's context across many skills. The [specification](/specification#description-field) enforces a hard limit of 1024 characters.
+## Designing trigger eval queries
+To test triggering, you need a set of eval queries — realistic user prompts labeled with whether they should or shouldn't trigger your skill.
+```json eval_queries.json theme={null}
+[
+  { "query": "I've got a spreadsheet in ~/data/q4_results.xlsx with revenue in col C and expenses in col D — can you add a profit margin column and highlight anything under 10%?", "should_trigger": true },
+  { "query": "whats the quickest way to convert this json file to yaml", "should_trigger": false }
+]
+```
+Aim for about 20 queries: 8-10 that should trigger and 8-10 that shouldn't.
+### Should-trigger queries
+These test whether the description captures the skill's scope. Vary them along several axes:
+* **Phrasing**: some formal, some casual, some with typos or abbreviations.
+* **Explicitness**: some name the skill's domain directly ("analyze this CSV"), others describe the need without naming it ("my boss wants a chart from this data file").
+* **Detail**: mix terse prompts with context-heavy ones — a short "analyze my sales CSV and make a chart" alongside a longer message with file paths, column names, and backstory.
+* **Complexity**: vary the number of steps and decision points. Include single-step tasks alongside multi-step workflows to test whether the agent can discern that the skill is relevant when the task it addresses is buried in a larger chain.
+The most useful should-trigger queries are ones where the skill would help but the connection isn't obvious from the query alone. These are the cases where description wording makes the difference: if the query already asks for exactly what the skill does, any reasonable description would trigger the skill.
+### Should-not-trigger queries
+The most valuable negative test cases are **near-misses** — queries that share keywords or concepts with your skill but actually need something different. These test whether the description is precise, not just broad.
+For a CSV analysis skill, weak negative examples would be:
+* `"Write a fibonacci function"` — obviously irrelevant, tests nothing.
+* `"What's the weather today?"` — no keyword overlap, too easy.
+Strong negative examples:
+* `"I need to update the formulas in my Excel budget spreadsheet"` — shares "spreadsheet" and "data" concepts, but needs Excel editing, not CSV analysis.
+* `"can you write a python script that reads a csv and uploads each row to our postgres database"` — involves CSV, but the task is database ETL, not analysis.
+### Tips for realism
+Real user prompts contain context that generic test queries lack. Include:
+* File paths (`~/Downloads/report_final_v2.xlsx`)
+* Personal context (`"my manager asked me to..."`)
+* Specific details (column names, company names, data values)
+* Casual language, abbreviations, and occasional typos
+## Testing whether a description triggers
+The basic approach: run each query through your agent with the skill installed and observe whether the agent invokes it. Make sure the skill is registered and discoverable by your agent — how this works varies by client (e.g., a skills directory, a configuration file, or a CLI flag).
+Most agent clients provide some form of observability — execution logs, tool call histories, or verbose output — that lets you see which skills were consulted during a run. Check your client's documentation for details. The skill triggered if the agent loaded your skill's `SKILL.md`; it didn't trigger if the agent proceeded without consulting it.
+A query "passes" if:
+* `should_trigger` is `true` and the skill was invoked, or
+* `should_trigger` is `false` and the skill was not invoked.
+### Running multiple times
+Model behavior is nondeterministic — the same query might trigger the skill on one run but not the next. Run each query multiple times (3 is a reasonable starting point) and compute a **trigger rate**: the fraction of runs where the skill was invoked.
+A should-trigger query passes if its trigger rate is above a threshold (0.5 is a reasonable default). A should-not-trigger query passes if its trigger rate is below that threshold.
+With 20 queries at 3 runs each, that's 60 invocations. You'll want to script this. Here's the general structure — replace the `claude` invocation and detection logic in `check_triggered` with whatever your agent client provides:
+```bash theme={null}
+#!/bin/bash
+QUERIES_FILE="${1:?Usage: $0 <queries.json>}"
+SKILL_NAME="my-skill"
+RUNS=3
+# This example uses Claude Code's JSON output to check for Skill tool calls.
+# Replace this function with detection logic for your agent client.
+# Should return 0 (success) if the skill was invoked, 1 otherwise.
+check_triggered() {
+  local query="$1"
+  claude -p "$query" --output-format json 2>/dev/null \
+    | jq -e --arg skill "$SKILL_NAME" \
+      'any(.messages[].content[]; .type == "tool_use" and .name == "Skill" and .input.skill == $skill)' \
+      > /dev/null 2>&1
+}
+count=$(jq length "$QUERIES_FILE")
+for i in $(seq 0 $((count - 1))); do
+  query=$(jq -r ".[$i].query" "$QUERIES_FILE")
+  should_trigger=$(jq -r ".[$i].should_trigger" "$QUERIES_FILE")
+  triggers=0
+  for run in $(seq 1 $RUNS); do
+    check_triggered "$query" && triggers=$((triggers + 1))
+  done
+  jq -n \
+    --arg query "$query" \
+    --argjson should_trigger "$should_trigger" \
+    --argjson triggers "$triggers" \
+    --argjson runs "$RUNS" \
+    '{query: $query, should_trigger: $should_trigger, triggers: $triggers, runs: $runs, trigger_rate: ($triggers / $runs)}'
+done | jq -s '.'
+```
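The script above emits one JSON record per query. As a rough Python sketch of how those records could be graded against the threshold, the block below applies the pass rule to each record and reports the overall pass rate. The function names `query_passes` and `pass_rate` and the sample records are illustrative only and are not part of any agent client. The comparisons are strict, so a rate exactly equal to the threshold fails for both labels; that is one reading of the rule, not something the guide spells out.

```python
def query_passes(record, threshold=0.5):
    """Apply the pass rule to one record produced by the eval script."""
    rate = record["triggers"] / record["runs"]
    if record["should_trigger"]:
        return rate > threshold  # should-trigger: rate must be above the threshold
    return rate < threshold      # should-not-trigger: rate must be below it


def pass_rate(records, threshold=0.5):
    """Fraction of queries that pass; 0.0 for an empty list."""
    if not records:
        return 0.0
    passed = sum(1 for r in records if query_passes(r, threshold))
    return passed / len(records)


# Made-up records in the same shape as the script's output.
results = [
    {"query": "a", "should_trigger": True, "triggers": 2, "runs": 3},
    {"query": "b", "should_trigger": False, "triggers": 2, "runs": 3},
]
print(pass_rate(results))  # prints 0.5
```

With an odd number of runs such as 3, the rate can never land exactly on 0.5, so the boundary case only matters if you change `RUNS` or the threshold.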
+<Tip>
+  If your agent client supports it, you can stop a run early once the outcome is clear — the agent either consulted the skill or started working without it. This can significantly reduce the time and cost of running the full eval set.
+</Tip>
+## Avoiding overfitting with train/validation splits
+If you optimize the description against all your queries, you risk overfitting — crafting a description that works for these specific phrasings but fails on new ones.
+The solution is to split your query set:
+* **Train set (\~60%)**: the queries you use to identify failures and guide improvements.
+* **Validation set (\~40%)**: queries you set aside and only use to check whether improvements generalize.
+Make sure both sets contain a proportional mix of should-trigger and should-not-trigger queries — don't accidentally put all the positives in one set. Shuffle randomly and keep the split fixed across iterations so you're comparing apples to apples.
+If you're using a script like the one [above](#running-multiple-times), you can split your queries into two files — `train_queries.json` and `validation_queries.json` — and run the script against each one separately.
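One way to produce such a split is sketched below in Python, assuming the queries are already loaded as a list of dicts with a boolean `should_trigger` key, as in `eval_queries.json`. The helper name `split_queries`, the default seed, and the rounding choice are assumptions made for illustration. Shuffling each label group separately keeps the mix proportional, and the fixed seed keeps the split stable across iterations.

```python
import random


def split_queries(queries, train_fraction=0.6, seed=0):
    """Split labeled queries into train/validation lists, stratified by label."""
    rng = random.Random(seed)  # fixed seed so the split is the same on every run
    train, validation = [], []
    for label in (True, False):
        group = [q for q in queries if bool(q["should_trigger"]) == label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        validation.extend(group[cut:])
    return train, validation


# Placeholder queries: 10 that should trigger and 10 that should not.
queries = [{"query": f"q{i}", "should_trigger": i % 2 == 0} for i in range(20)]
train, validation = split_queries(queries)
print(len(train), len(validation))  # prints "12 8"
```

Each list can then be written out with `json.dump` to `train_queries.json` and `validation_queries.json`.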
+## The optimization loop
+1. **Evaluate** the current description on both *train and validation sets*. The train results guide your changes; the validation results tell you whether those changes are generalizing.
+2. **Identify failures** in the *train set*: which should-trigger queries didn't trigger? Which should-not-trigger queries did?
+   * Only use train set failures to guide your changes — whether you're revising the description yourself or prompting an LLM, keep validation set results out of the process.
+3. **Revise the description.** Focus on generalizing:
+   * If should-trigger queries are failing, the description may be too narrow. Broaden the scope or add context about when the skill is useful.
+   * If should-not-trigger queries are false-triggering, the description may be too broad. Add specificity about what the skill does *not* do, or clarify the boundary between this skill and adjacent capabilities.
+   * Avoid adding specific keywords from failed queries — that's overfitting. Instead, find the general category or concept those queries represent and address that.
+   * If you're stuck after several iterations, try a structurally different approach to the description rather than incremental tweaks. A different framing or sentence structure may break through where refinement can't.
+   * Check that the description stays under the 1024-character limit — descriptions tend to grow during optimization.
+4. **Repeat** steps 1-3 until all *train set* queries pass or you stop seeing meaningful improvement.
+5. **Select the best iteration** by its validation pass rate — the fraction of queries in the *validation set* that passed. Note that the best description may not be the last one you produced; an earlier iteration might have a higher validation pass rate than later ones that overfit to the train set.
+Five iterations is usually enough. If performance isn't improving, the issue may be with the queries (too easy, too hard, or poorly labeled) rather than the description.
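The selection in step 5 can be sketched in a few lines of Python, assuming you keep one record per iteration. The record keys, the `select_best` name, and the sample numbers below are hypothetical. Ties go to the earliest iteration, a choice made here for determinism rather than something the guide prescribes.

```python
def select_best(iterations):
    """Return the iteration record with the highest validation pass rate."""
    if not iterations:
        raise ValueError("no iterations to choose from")
    # max() returns the first maximal element, so ties go to the earliest entry.
    return max(iterations, key=lambda it: it["validation_pass_rate"])


# Made-up history: the last iteration fits the train set best but generalizes worse.
history = [
    {"iteration": 1, "train_pass_rate": 0.5, "validation_pass_rate": 0.625},
    {"iteration": 2, "train_pass_rate": 0.75, "validation_pass_rate": 0.875},
    {"iteration": 3, "train_pass_rate": 1.0, "validation_pass_rate": 0.75},
]
best = select_best(history)
print(best["iteration"])  # prints 2
```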
+<Tip>
+  The [`skill-creator`](https://github.com/anthropics/skills/tree/main/skills/skill-creator) Skill automates this loop end-to-end: it splits the eval set, evaluates trigger rates in parallel, proposes description improvements using Claude, and generates a live HTML report you can watch as it runs.
+</Tip>
+## Applying the result
+Once you've selected the best description:
+1. Update the `description` field in your `SKILL.md` frontmatter.
+2. Verify the description is under the [1024-character limit](/specification#description-field).
+3. Verify the description triggers as expected. Try a few prompts manually as a quick sanity check. For a more rigorous test, write 5-10 fresh queries (a mix of should-trigger and should-not-trigger) and run them through the eval script — since these queries were never part of the optimization process, they give you an honest check on whether the description generalizes.
+Before and after:
+```yaml theme={null}
+# Before
+description: Process CSV files.
+# After
+description: >
+  Analyze CSV and tabular data files — compute summary statistics,
+  add derived columns, generate charts, and clean messy data. Use this
+  skill when the user has a CSV, TSV, or Excel file and wants to
+  explore, transform, or visualize the data, even if they don't
+  explicitly mention "CSV" or "analysis."
+```
+The improved description is more specific about what the skill does (summary stats, derived columns, charts, cleaning) and broader about when it applies (CSV, TSV, Excel; even without explicit keywords).
+## Next steps
+Once your skill triggers reliably, you'll want to evaluate whether it produces good outputs. See [Evaluating skill output quality](/skill-creation/evaluating-skills) for how to set up test cases, grade results, and iterate.

package/skills/skill-quality-evaluator/references/Specification.md ADDED

@@ -0,0 +1,272 @@
+> ## Documentation Index
+> Fetch the complete documentation index at: https://agentskills.io/llms.txt
+> Use this file to discover all available pages before exploring further.
+# Specification
+> The complete format specification for Agent Skills.
+## Directory structure
+A skill is a directory containing, at minimum, a `SKILL.md` file:
+```
+skill-name/
+├── SKILL.md          # Required: metadata + instructions
+├── scripts/          # Optional: executable code
+├── references/       # Optional: documentation
+├── assets/           # Optional: templates, resources
+└── ...               # Any additional files or directories
+```
+## `SKILL.md` format
+The `SKILL.md` file must contain YAML frontmatter followed by Markdown content.
+### Frontmatter
+| Field           | Required | Constraints                                                                                                       |
+| --------------- | -------- | ----------------------------------------------------------------------------------------------------------------- |
+| `name`          | Yes      | Max 64 characters. Lowercase letters, numbers, and hyphens only. Must not start or end with a hyphen.             |
+| `description`   | Yes      | Max 1024 characters. Non-empty. Describes what the skill does and when to use it.                                 |
+| `license`       | No       | License name or reference to a bundled license file.                                                              |
+| `compatibility` | No       | Max 500 characters. Indicates environment requirements (intended product, system packages, network access, etc.). |
+| `metadata`      | No       | Arbitrary key-value mapping for additional metadata.                                                              |
+| `allowed-tools` | No       | Space-separated string of pre-approved tools the skill may use. (Experimental)                                    |
+<Card>
+  **Minimal example:**
+  ```markdown SKILL.md theme={null}
+  ---
+  name: skill-name
+  description: A description of what this skill does and when to use it.
+  ---
+  ```
+  **Example with optional fields:**
+  ```markdown SKILL.md theme={null}
+  ---
+  name: pdf-processing
+  description: Extract PDF text, fill forms, merge files. Use when handling PDFs.
+  license: Apache-2.0
+  metadata:
+    author: example-org
+    version: "1.0"
+  ---
+  ```
+</Card>
+#### `name` field
+The required `name` field:
+* Must be 1-64 characters
+* May only contain Unicode lowercase alphanumeric characters (`a-z`, `0-9`) and hyphens (`-`)
+* Must not start or end with a hyphen (`-`)
+* Must not contain consecutive hyphens (`--`)
+* Must match the parent directory name
+<Card>
+  **Valid examples:**
+  ```yaml theme={null}
+  name: pdf-processing
+  ```
+  ```yaml theme={null}
+  name: data-analysis
+  ```
+  ```yaml theme={null}
+  name: code-review
+  ```
+  **Invalid examples:**
+  ```yaml theme={null}
+  name: PDF-Processing  # uppercase not allowed
+  ```
+  ```yaml theme={null}
+  name: -pdf  # cannot start with hyphen
+  ```
+  ```yaml theme={null}
+  name: pdf--processing  # consecutive hyphens not allowed
+  ```
+</Card>
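The `name` rules above can be checked mechanically. The following Python sketch is not the `skills-ref` implementation; the `name_errors` helper and its messages are hypothetical, and it assumes that "lowercase alphanumeric" means the ASCII ranges `a-z` and `0-9` shown in parentheses.

```python
import re

# One or more runs of a-z/0-9 joined by single hyphens. This covers three rules
# at once: allowed characters, no leading or trailing hyphen, no "--".
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def name_errors(name, parent_dir=None):
    """Return a list of rule violations for a skill name (empty if valid)."""
    errors = []
    if not 1 <= len(name) <= 64:
        errors.append("must be 1-64 characters")
    if not NAME_PATTERN.fullmatch(name):
        errors.append("only a-z, 0-9, and single hyphens, not at the start or end")
    if parent_dir is not None and name != parent_dir:
        errors.append("must match the parent directory name")
    return errors


print(name_errors("pdf-processing"))        # prints []
print(bool(name_errors("pdf--processing")))  # prints True
```

Returning a list of messages rather than a boolean makes it easier to report every problem in one pass.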
+#### `description` field
+The required `description` field:
+* Must be 1-1024 characters
+* Should describe both what the skill does and when to use it
+* Should include specific keywords that help agents identify relevant tasks
+<Card>
+  **Good example:**
+  ```yaml theme={null}
+  description: Extracts text and tables from PDF files, fills PDF forms, and merges multiple PDFs. Use when working with PDF documents or when the user mentions PDFs, forms, or document extraction.
+  ```
+  **Poor example:**
+  ```yaml theme={null}
+  description: Helps with PDFs.
+  ```
+</Card>
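The length constraint is easy to check before publishing. The sketch below is a hypothetical helper, not part of the spec or of `skills-ref`. It counts Python string characters (code points), which is an assumption, since the spec does not say how characters are counted. It also treats a whitespace-only description as empty, which is another assumption.

```python
def description_errors(description, limit=1024):
    """Return a list of rule violations for a description (empty if valid)."""
    errors = []
    if not description.strip():
        errors.append("description must be non-empty")
    if len(description) > limit:
        errors.append(
            f"description is {len(description)} characters; the limit is {limit}"
        )
    return errors


print(description_errors("Helps with PDFs."))  # prints []
```

A check like this only enforces the hard constraints; "Helps with PDFs." passes it even though it is the poor example above.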
+#### `license` field
+The optional `license` field:
+* Specifies the license applied to the skill
+* We recommend keeping it short (either the name of a license or the name of a bundled license file)
+<Card>
+  **Example:**
+  ```yaml theme={null}
+  license: Proprietary. LICENSE.txt has complete terms
+  ```
+</Card>
+#### `compatibility` field
+The optional `compatibility` field:
+* Must be 1-500 characters if provided
+* Should only be included if your skill has specific environment requirements
+* Can indicate intended product, required system packages, network access needs, etc.
+<Card>
+  **Examples:**
+  ```yaml theme={null}
+  compatibility: Designed for Claude Code (or similar products)
+  ```
+  ```yaml theme={null}
+  compatibility: Requires git, docker, jq, and access to the internet
+  ```
+  ```yaml theme={null}
+  compatibility: Requires Python 3.14+ and uv
+  ```
+</Card>
+<Note>
+  Most skills do not need the `compatibility` field.
+</Note>
+#### `metadata` field
+The optional `metadata` field:
+* A map from string keys to string values
+* Clients can use this to store additional properties not defined by the Agent Skills spec
+* We recommend making your key names reasonably unique to avoid accidental conflicts
+<Card>
+  **Example:**
+  ```yaml theme={null}
+  metadata:
+    author: example-org
+    version: "1.0"
+  ```
+</Card>
+#### `allowed-tools` field
+The optional `allowed-tools` field:
+* A space-separated string of tools that are pre-approved to run
+* Experimental. Support for this field may vary between agent implementations
+<Card>
+  **Example:**
+  ```yaml theme={null}
+  allowed-tools: Bash(git:*) Bash(jq:*) Read
+  ```
+</Card>
+### Body content
+The Markdown body after the frontmatter contains the skill instructions. There are no format restrictions. Write whatever helps agents perform the task effectively.
+Recommended sections:
+* Step-by-step instructions
+* Examples of inputs and outputs
+* Common edge cases
+Note that the agent will load this entire file once it's decided to activate a skill. Consider splitting longer `SKILL.md` content into referenced files.
+## Optional directories
+### `scripts/`
+Contains executable code that agents can run. Scripts should:
+* Be self-contained or clearly document dependencies
+* Include helpful error messages
+* Handle edge cases gracefully
+Supported languages depend on the agent implementation. Common options include Python, Bash, and JavaScript.
+### `references/`
+Contains additional documentation that agents can read when needed:
+* `REFERENCE.md` - Detailed technical reference
+* `FORMS.md` - Form templates or structured data formats
+* Domain-specific files (`finance.md`, `legal.md`, etc.)
+Keep individual [reference files](#file-references) focused. Agents load these on demand, so smaller files consume less context.
+### `assets/`
+Contains static resources:
+* Templates (document templates, configuration templates)
+* Images (diagrams, examples)
+* Data files (lookup tables, schemas)
+## Progressive disclosure
+Agents load skills *progressively*, pulling in more detail only as a task calls for it. Skills should be structured to take advantage of this:
+1. **Metadata** (\~100 tokens): The `name` and `description` fields are loaded at startup for all skills
+2. **Instructions** (\< 5000 tokens recommended): The full `SKILL.md` body is loaded when the skill is activated
+3. **Resources** (as needed): Files (e.g. those in `scripts/`, `references/`, or `assets/`) are loaded only when required
+Keep your main `SKILL.md` under 500 lines. Move detailed reference material to separate files.
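The 500-line guideline can be checked with a short script. This Python sketch assumes the skill directory layout shown earlier and UTF-8 encoding; the function names are illustrative. It does not estimate tokens, since that depends on the tokenizer your agent uses.

```python
from pathlib import Path


def skill_md_line_count(skill_dir):
    """Count the lines in <skill_dir>/SKILL.md, frontmatter included."""
    text = Path(skill_dir, "SKILL.md").read_text(encoding="utf-8")
    return len(text.splitlines())


def over_line_budget(skill_dir, max_lines=500):
    """True if SKILL.md is not under max_lines."""
    return skill_md_line_count(skill_dir) >= max_lines
```

For example, `over_line_budget("./my-skill")` returning `True` suggests moving detailed material into `references/`.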
+## File references
+When referencing other files in your skill, use relative paths from the skill root:
+```markdown SKILL.md theme={null}
+See [the reference guide](references/REFERENCE.md) for details.
+Run the extraction script:
+scripts/extract.py
+```
+Keep file references one level deep from `SKILL.md`. Avoid deeply nested reference chains.
+## Validation
+Use the [skills-ref](https://github.com/agentskills/agentskills/tree/main/skills-ref) reference library to validate your skills:
+```bash theme={null}
+skills-ref validate ./my-skill
+```
+This checks that your `SKILL.md` frontmatter is valid and follows all naming conventions.