@alexanderzzlatkov/skilleval 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +186 -0
- package/dist/config.d.ts +63 -0
- package/dist/config.js +42 -0
- package/dist/context-builder.d.ts +5 -0
- package/dist/context-builder.js +121 -0
- package/dist/evaluator.d.ts +4 -0
- package/dist/evaluator.js +239 -0
- package/dist/index.d.ts +2 -0
- package/dist/index.js +113 -0
- package/dist/parser.d.ts +2 -0
- package/dist/parser.js +152 -0
- package/dist/providers.d.ts +4 -0
- package/dist/providers.js +33 -0
- package/dist/reporter.d.ts +5 -0
- package/dist/reporter.js +80 -0
- package/dist/runner.d.ts +2 -0
- package/dist/runner.js +53 -0
- package/dist/test-generator.d.ts +3 -0
- package/dist/test-generator.js +109 -0
- package/package.json +58 -0
package/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 skilleval contributors
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
package/README.md
ADDED
|
@@ -0,0 +1,186 @@
|
|
|
1
|
+
# skilleval
|
|
2
|
+
|
|
3
|
+
Evaluate how well AI models understand and respond to [Agent Skills](https://agentskills.io/home) (SKILL.md files).
|
|
4
|
+
|
|
5
|
+
Skill authors write a SKILL.md and have zero idea whether it works on any model besides the one they tested with. `skilleval` fixes that - it simulates how agents like [OpenClaw](https://openclaw.ai/) and Claude Code inject skills into prompts, following the [OpenSkills](https://github.com/numman-ali/openskills) specification, then tests whether various LLM models correctly trigger and follow the skill's instructions.
|
|
6
|
+
|
|
7
|
+
## Prerequisites
|
|
8
|
+
|
|
9
|
+
You need an [OpenRouter](https://openrouter.ai) account. Create a free API key at [openrouter.ai/keys](https://openrouter.ai/keys).
|
|
10
|
+
|
|
11
|
+
OpenRouter is used for test prompt generation and evaluation judging, and as the default provider for testing models. Even if you use a different provider (Anthropic, OpenAI, Google) for the models being tested, an OpenRouter key is still required for the generator and judge unless you supply custom prompts via `--prompts`.
|
|
12
|
+
|
|
13
|
+
**Free model limitations:** OpenRouter's free models (those ending in `:free`) are subject to upstream rate limits and may be temporarily unavailable. If you encounter rate limit errors, you can:
|
|
14
|
+
- Wait and retry - free model availability fluctuates
|
|
15
|
+
- Use paid models instead (remove the `:free` suffix, e.g. `meta-llama/llama-3.3-70b-instruct`)
|
|
16
|
+
- Provide your own test prompts with `--prompts` to skip the generator model entirely
|
|
17
|
+
|
|
18
|
+
## Quick Start
|
|
19
|
+
|
|
20
|
+
```bash
|
|
21
|
+
# Set your OpenRouter API key
|
|
22
|
+
export OPENROUTER_API_KEY=sk-or-...
|
|
23
|
+
|
|
24
|
+
# Evaluate a local skill
|
|
25
|
+
npx skilleval ./my-skill/SKILL.md
|
|
26
|
+
|
|
27
|
+
# Evaluate a skill from a GitHub repo (like skills.sh)
|
|
28
|
+
npx skilleval owner/repo
|
|
29
|
+
|
|
30
|
+
# Evaluate a specific skill within a repo
|
|
31
|
+
npx skilleval owner/repo --skill skill-name
|
|
32
|
+
|
|
33
|
+
# Evaluate from a GitHub URL
|
|
34
|
+
npx skilleval https://github.com/user/repo/blob/main/skills/my-skill/SKILL.md
|
|
35
|
+
```
|
|
36
|
+
|
|
37
|
+
## How It Works
|
|
38
|
+
|
|
39
|
+
`skilleval` follows the [OpenSkills](https://github.com/numman-ali/openskills) specification - a universal skills format based on Anthropic's SKILL.md system. It's compatible with skills built for agents like Claude Code, [OpenClaw](https://openclaw.ai/), and any agent that uses the SKILL.md format. It simulates how these agents inject skills into system prompts using `<available_skills>` XML blocks - the same format used in production. This means the evaluation reflects real-world skill behavior, not synthetic benchmarks.
|
|
40
|
+
|
|
41
|
+
1. **Parse** - Reads the SKILL.md, extracts name, description, and instructions.
|
|
42
|
+
2. **Build context** - The model is presented as a helpful AI agent with access to multiple skills. The system prompt uses `<available_skills>` XML injection where your skill is mixed in with 3 fake distractor skills (e.g. "git-commit-helper", "api-documentation", "test-generator"). This tests whether the model can identify the right skill from a realistic list rather than in isolation.
|
|
43
|
+
3. **Generate test prompts** - A generator model creates 5 positive prompts (should trigger) and 5 negative prompts (should not).
|
|
44
|
+
4. **Run trigger tests** - Sends each prompt to each target model with the skill-injected system prompt.
|
|
45
|
+
5. **Evaluate** - A judge model assesses trigger accuracy and, for correctly triggered prompts, runs a compliance test against the full skill instructions. If the skill references tools (e.g. `WebFetch`, `BraveSearch`, `Read`, `Write`, `Edit`, `Bash`, `Grep`, `Glob`), `skilleval` automatically provides mock tool definitions so the model can make real structured tool calls instead of fabricating results in text. The judge evaluates whether the model called the right tools with the right parameters, not the quality of the mock results.
|
|
46
|
+
6. **Report** - Prints a compatibility matrix to the terminal.
|
|
47
|
+
|
|
48
|
+
See [AGENTS.md](./AGENTS.md) for detailed pipeline internals.
|
|
49
|
+
|
|
50
|
+
## Output
|
|
51
|
+
|
|
52
|
+
```
|
|
53
|
+
skilleval v0.1.0
|
|
54
|
+
Skill: pdf-processing
|
|
55
|
+
Description: Extract text and tables from PDF files
|
|
56
|
+
Provider: openrouter
|
|
57
|
+
Models: 5
|
|
58
|
+
|
|
59
|
+
┌───────────────────────────────────────────┬──────────────┬────────────────┬─────────┐
|
|
60
|
+
│ Model │ Trigger │ Compliance │ Overall │
|
|
61
|
+
├───────────────────────────────────────────┼──────────────┼────────────────┼─────────┤
|
|
62
|
+
│ qwen/qwen3-235b-a22b:free │ 10/10 │ 5/5 (92) │ 98% │
|
|
63
|
+
│ meta-llama/llama-3.3-70b-instruct:free │ 9/10 │ 4/5 (85) │ 82% │
|
|
64
|
+
│ deepseek/deepseek-r1:free │ 9/10 │ 4/5 (80) │ 81% │
|
|
65
|
+
│ google/gemma-3-27b-it:free │ 8/10 │ 3/5 (70) │ 72% │
|
|
66
|
+
│ mistralai/mistral-small-3.1-24b:free │ 7/10 │ 3/5 (65) │ 66% │
|
|
67
|
+
└───────────────────────────────────────────┴──────────────┴────────────────┴─────────┘
|
|
68
|
+
|
|
69
|
+
Best model: qwen/qwen3-235b-a22b:free (98%)
|
|
70
|
+
Worst model: mistralai/mistral-small-3.1-24b:free (66%)
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
## Usage
|
|
74
|
+
|
|
75
|
+
```
|
|
76
|
+
skilleval <skill> [options]
|
|
77
|
+
|
|
78
|
+
Arguments:
|
|
79
|
+
skill Path, URL, or GitHub shorthand (owner/repo)
|
|
80
|
+
|
|
81
|
+
Options:
|
|
82
|
+
-p, --provider <provider> Provider: openrouter, anthropic, openai, google (default: openrouter)
|
|
83
|
+
-m, --models <models> Comma-separated model IDs
|
|
84
|
+
-s, --skill <name> Skill name within the repo (looks for skills/<name>/SKILL.md)
|
|
85
|
+
-k, --key <key> API key (or use provider-specific env var)
|
|
86
|
+
--generator-model <model> Model for test prompt generation (comma-separated for fallbacks)
|
|
87
|
+
--judge-model <model> Model for evaluation judging (comma-separated for fallbacks)
|
|
88
|
+
--json Output results as JSON
|
|
89
|
+
--verbose Show detailed per-prompt results
|
|
90
|
+
-n, --count <number> Number of positive+negative test prompts (default: 5, so 5+5=10 total)
|
|
91
|
+
--prompts <path> Path to JSON file with custom test prompts
|
|
92
|
+
-V, --version Output the version number
|
|
93
|
+
-h, --help Display help
|
|
94
|
+
```
|
|
95
|
+
|
|
96
|
+
### Model Roles
|
|
97
|
+
|
|
98
|
+
`skilleval` uses three types of models, each with a different role in the pipeline:
|
|
99
|
+
|
|
100
|
+
| Role | Flag | Default | Description |
|
|
101
|
+
|---|---|---|---|
|
|
102
|
+
| **Test models** | `-m, --models` | 5 free OpenRouter models | The models being evaluated. These receive the skill-injected prompt and are scored on how well they trigger and follow the skill. |
|
|
103
|
+
| **Generator models** | `--generator-model` | 3 free OpenRouter models (with fallback) | Generate test prompts (positive + negative) from the skill definition. Count configurable via `-n`. You can provide comma-separated model IDs for fallback. |
|
|
104
|
+
| **Judge models** | `--judge-model` | 3 free OpenRouter models (with fallback) | Evaluate each test model's response - did it correctly trigger the skill? Did it follow instructions? You can provide comma-separated model IDs for fallback. |
|
|
105
|
+
|
|
106
|
+
The generator and judge models always run through OpenRouter (even if you set a different `--provider`). Only the test models use your specified provider.
|
|
107
|
+
|
|
108
|
+
### Providers
|
|
109
|
+
|
|
110
|
+
All providers use the [Vercel AI SDK](https://ai-sdk.dev) under the hood.
|
|
111
|
+
|
|
112
|
+
| Provider | Flag | Env Var | Notes |
|
|
113
|
+
|---|---|---|---|
|
|
114
|
+
| OpenRouter | `--provider openrouter` | `OPENROUTER_API_KEY` | Default. Access 300+ models including free ones. |
|
|
115
|
+
| Anthropic | `--provider anthropic` | `ANTHROPIC_API_KEY` | Direct API access to Claude models. |
|
|
116
|
+
| OpenAI | `--provider openai` | `OPENAI_API_KEY` | Direct API access to GPT models. |
|
|
117
|
+
| Google | `--provider google` | `GOOGLE_GENERATIVE_AI_API_KEY` | Direct API access to Gemini models. |
|
|
118
|
+
|
|
119
|
+
### Examples
|
|
120
|
+
|
|
121
|
+
```bash
|
|
122
|
+
# Test against default free OpenRouter models
|
|
123
|
+
npx skilleval ./SKILL.md
|
|
124
|
+
|
|
125
|
+
# Test against specific models via OpenRouter
|
|
126
|
+
npx skilleval ./SKILL.md --models "anthropic/claude-sonnet-4-20250514,openai/gpt-4o"
|
|
127
|
+
|
|
128
|
+
# Test directly against Anthropic
|
|
129
|
+
npx skilleval ./SKILL.md --provider anthropic --models claude-sonnet-4-20250514
|
|
130
|
+
|
|
131
|
+
# Use a smarter judge model
|
|
132
|
+
npx skilleval ./SKILL.md --judge-model "qwen/qwen3-235b-a22b:free"
|
|
133
|
+
|
|
134
|
+
# Provide your own test prompts
|
|
135
|
+
npx skilleval ./SKILL.md --prompts ./my-test-prompts.json
|
|
136
|
+
|
|
137
|
+
# Machine-readable output
|
|
138
|
+
npx skilleval ./SKILL.md --json
|
|
139
|
+
|
|
140
|
+
# Quick test with fewer prompts (1 positive + 1 negative)
|
|
141
|
+
npx skilleval ./SKILL.md -n 1
|
|
142
|
+
|
|
143
|
+
# Detailed per-prompt breakdown
|
|
144
|
+
npx skilleval ./SKILL.md --verbose
|
|
145
|
+
```
|
|
146
|
+
|
|
147
|
+
### Custom Test Prompts
|
|
148
|
+
|
|
149
|
+
Create a JSON file with your own test prompts:
|
|
150
|
+
|
|
151
|
+
```json
|
|
152
|
+
[
|
|
153
|
+
{"text": "Help me extract text from this PDF", "type": "positive"},
|
|
154
|
+
{"text": "Merge these two PDF files together", "type": "positive"},
|
|
155
|
+
{"text": "Convert this PDF to Word", "type": "positive"},
|
|
156
|
+
{"text": "Fill out this PDF form", "type": "positive"},
|
|
157
|
+
{"text": "Extract tables from the PDF report", "type": "positive"},
|
|
158
|
+
{"text": "What's the weather today?", "type": "negative"},
|
|
159
|
+
{"text": "Write me a Python script", "type": "negative"},
|
|
160
|
+
{"text": "Help me debug this CSS", "type": "negative"},
|
|
161
|
+
{"text": "Create a git commit message", "type": "negative"},
|
|
162
|
+
{"text": "Summarize this article for me", "type": "negative"}
|
|
163
|
+
]
|
|
164
|
+
```
|
|
165
|
+
|
|
166
|
+
## Scoring
|
|
167
|
+
|
|
168
|
+
Each model is scored on two dimensions:
|
|
169
|
+
|
|
170
|
+
- **Trigger accuracy** (50% of overall): Did the model correctly identify when to use the skill (positive prompts) and when to ignore it (negative prompts)?
|
|
171
|
+
- **Compliance** (50% of overall): For positive prompts where the skill was triggered, did the model follow the skill's instructions? Split into pass/fail (30%) and quality score 0-100 (20%).
|
|
172
|
+
|
|
173
|
+
Exit code is `0` if all models score >= 50%, `1` otherwise - useful for CI.
|
|
174
|
+
|
|
175
|
+
## Development
|
|
176
|
+
|
|
177
|
+
```bash
|
|
178
|
+
git clone https://github.com/zlatkov/skilleval.git
|
|
179
|
+
cd skilleval
|
|
180
|
+
npm install
|
|
181
|
+
npm run dev -- ./path/to/SKILL.md
|
|
182
|
+
```
|
|
183
|
+
|
|
184
|
+
## License
|
|
185
|
+
|
|
186
|
+
MIT
|
package/dist/config.d.ts
ADDED
|
@@ -0,0 +1,63 @@
|
|
|
1
|
+
import type { LanguageModel } from 'ai';
/** Identifiers of the providers the CLI can route model calls through. */
export declare const PROVIDER_NAMES: readonly ["openrouter", "anthropic", "openai", "google"];
/** Union of supported provider names, derived from PROVIDER_NAMES. */
export type ProviderName = (typeof PROVIDER_NAMES)[number];
/** Environment variable that holds the API key for each provider. */
export declare const PROVIDER_ENV_VARS: Record<ProviderName, string>;
/** A parsed SKILL.md file. */
export interface SkillDefinition {
    /** Skill name as declared in the SKILL.md. */
    name: string;
    /** Short description used for skill selection by the model. */
    description: string;
    /** The skill's instruction body (content below the metadata). */
    body: string;
    /** The unparsed SKILL.md content. */
    rawContent: string;
}
/** One test prompt sent to a model under evaluation. */
export interface TestPrompt {
    text: string;
    /** 'positive' should trigger the skill; 'negative' should not. */
    type: 'positive' | 'negative';
}
/** Raw outcome of sending one prompt to one model. */
export interface TestResult {
    modelId: string;
    prompt: TestPrompt;
    response: string;
    /** Wall-clock latency of the model call, in milliseconds. */
    latencyMs: number;
    /** Set when the call failed instead of producing a response. */
    error?: string;
}
/** Judge verdict on whether the model triggered the skill appropriately. */
export interface TriggerEval {
    /** Whether the model used the skill. */
    triggered: boolean;
    /** Whether triggering (or not) was the right call for this prompt type. */
    correct: boolean;
    reason: string;
}
/** Judge verdict on how well the response followed the skill instructions. */
export interface ComplianceEval {
    compliant: boolean;
    /** Quality score, 0-100. */
    score: number;
    reason: string;
}
/** Combined judgement for one (model, prompt) pair. */
export interface EvalResult {
    modelId: string;
    prompt: TestPrompt;
    response: string;
    trigger: TriggerEval;
    /** Only present for prompts where compliance was evaluated. */
    compliance?: ComplianceEval;
}
/** Aggregated scores for one model across all prompts. */
export interface EvalReport {
    modelId: string;
    triggerScore: {
        correct: number;
        total: number;
    };
    complianceScore: {
        correct: number;
        total: number;
        /** Mean of the 0-100 compliance quality scores. */
        avgScore: number;
    };
    /** Overall percentage (0-100) combining trigger and compliance. */
    overall: number;
}
/** Models evaluated when no --models flag is given. */
export declare const DEFAULT_FREE_MODELS: string[];
/** Fallback chain for the test-prompt generator model. */
export declare const DEFAULT_GENERATOR_MODELS: string[];
/** Fallback chain for the judge model. */
export declare const DEFAULT_JUDGE_MODELS: string[];
/** Per-request timeout for model calls, in milliseconds. */
export declare const REQUEST_TIMEOUT_MS = 30000;
/** Distractor skills mixed into the trigger-test prompt. */
export declare const DUMMY_SKILLS: {
    name: string;
    description: string;
}[];
/** A language model paired with the ID it is reported under. */
export interface ModelWithId {
    model: LanguageModel;
    modelId: string;
}
|
package/dist/config.js
ADDED
|
@@ -0,0 +1,42 @@
|
|
|
1
|
+
// --- Provider Types ---
// Providers the CLI can route model calls through.
export const PROVIDER_NAMES = ['openrouter', 'anthropic', 'openai', 'google'];
// Environment variable consulted for each provider's API key.
export const PROVIDER_ENV_VARS = {
    openrouter: 'OPENROUTER_API_KEY',
    anthropic: 'ANTHROPIC_API_KEY',
    openai: 'OPENAI_API_KEY',
    google: 'GOOGLE_GENERATIVE_AI_API_KEY',
};
// --- Constants ---
// Models evaluated by default when the user passes no --models flag.
export const DEFAULT_FREE_MODELS = [
    'meta-llama/llama-3.3-70b-instruct:free',
    'mistralai/mistral-small-3.1-24b-instruct:free',
    'google/gemma-3-27b-it:free',
    'qwen/qwen3-coder:free',
    'nousresearch/hermes-3-llama-3.1-405b:free',
];
// Generator and judge use the same free-model fallback chain; each export
// gets its own copy so one list can be replaced without affecting the other.
const HELPER_MODEL_FALLBACKS = [
    'meta-llama/llama-3.3-70b-instruct:free',
    'google/gemma-3-27b-it:free',
    'mistralai/mistral-small-3.1-24b-instruct:free',
];
export const DEFAULT_GENERATOR_MODELS = [...HELPER_MODEL_FALLBACKS];
export const DEFAULT_JUDGE_MODELS = [...HELPER_MODEL_FALLBACKS];
// Per-request timeout for all model calls: 30 seconds.
export const REQUEST_TIMEOUT_MS = 30000;
// Distractor skills injected alongside the skill under test so the model
// must pick the right skill from a realistic list.
export const DUMMY_SKILLS = [
    {
        name: 'git-commit-helper',
        description: 'Helps create well-formatted git commits with conventional commit messages.',
    },
    {
        name: 'api-documentation',
        description: 'Generates API documentation from code comments and type definitions.',
    },
    {
        name: 'test-generator',
        description: 'Creates unit tests for functions and classes based on their signatures and behavior.',
    },
];
//# sourceMappingURL=config.js.map
|
|
@@ -0,0 +1,5 @@
|
|
|
1
|
+
import { type CoreTool } from 'ai';
import { type SkillDefinition } from './config.js';
/** System prompt embedding the skill (plus distractor skills) for trigger tests. */
export declare function buildTriggerSystemPrompt(skill: SkillDefinition): string;
/** Mock implementations of common agent tools, for compliance testing. */
export declare function buildMockTools(): Record<string, CoreTool>;
/** System prompt containing the full skill instructions, for compliance tests. */
export declare function buildComplianceSystemPrompt(skill: SkillDefinition): string;
|
|
@@ -0,0 +1,121 @@
|
|
|
1
|
+
import { z } from 'zod';
|
|
2
|
+
import { tool } from 'ai';
|
|
3
|
+
import { DUMMY_SKILLS } from './config.js';
|
|
4
|
+
// Tools provided to the model during compliance testing, matching what real agent hosts offer
// Each entry pairs a human-readable description with a zod schema that
// buildMockTools turns into a structured tool definition. Names mirror
// the tool names agents commonly expose (Read, Write, Bash, ...), so a
// skill that references them by name maps to a matching mock tool here.
const KNOWN_TOOLS = {
    WebFetch: {
        description: 'Fetch content from a URL',
        parameters: z.object({ url: z.string().describe('The URL to fetch') }),
    },
    WebSearch: {
        description: 'Search the web for information',
        parameters: z.object({ query: z.string().describe('The search query') }),
    },
    BraveSearch: {
        description: 'Search the web using Brave Search',
        parameters: z.object({ query: z.string().describe('The search query') }),
    },
    Read: {
        description: 'Read a file from the filesystem',
        parameters: z.object({ file_path: z.string().describe('The file path to read') }),
    },
    Write: {
        description: 'Write content to a file',
        parameters: z.object({
            file_path: z.string().describe('The file path to write'),
            content: z.string().describe('The content to write'),
        }),
    },
    Edit: {
        description: 'Edit a file by replacing text',
        parameters: z.object({
            file_path: z.string().describe('The file path to edit'),
            old_string: z.string().describe('The text to replace'),
            new_string: z.string().describe('The replacement text'),
        }),
    },
    Bash: {
        description: 'Execute a shell command',
        parameters: z.object({ command: z.string().describe('The command to execute') }),
    },
    Grep: {
        description: 'Search file contents using regex',
        parameters: z.object({
            pattern: z.string().describe('The regex pattern to search for'),
            path: z.string().optional().describe('The directory to search in'),
        }),
    },
    Glob: {
        description: 'Find files matching a pattern',
        parameters: z.object({ pattern: z.string().describe('The glob pattern') }),
    },
    NotebookEdit: {
        description: 'Edit a Jupyter notebook cell',
        parameters: z.object({
            notebook: z.string().describe('The notebook file path'),
            cell: z.number().describe('The cell index'),
            new_source: z.string().describe('The new cell content'),
        }),
    },
    code_interpreter: {
        description: 'Execute Python code in a sandbox',
        parameters: z.object({ code: z.string().describe('The Python code to execute') }),
    },
    browser: {
        description: 'Browse a web page',
        parameters: z.object({ url: z.string().describe('The URL to browse') }),
    },
};
|
|
69
|
+
const BASE_SYSTEM_PROMPT = `You are a helpful AI agent. You have access to various skills that can help you complete tasks. When a user's request matches a skill's description, you should use that skill by following its instructions.
|
|
70
|
+
|
|
71
|
+
When you determine a skill is relevant to the user's request:
|
|
72
|
+
1. Announce that you are using the skill
|
|
73
|
+
2. Follow the skill's instructions carefully
|
|
74
|
+
|
|
75
|
+
When no skill matches the user's request, respond normally without mentioning any skills.`;
|
|
76
|
+
/**
 * Render one <skill> entry for an <available_skills> block.
 * When no explicit location is given, falls back to the conventional
 * skills/<name>/SKILL.md path.
 */
function buildSkillXml(name, description, location) {
    const resolvedLocation = location ?? `skills/${name}/SKILL.md`;
    return [
        '<skill>',
        `<name>${name}</name>`,
        `<description>${description}</description>`,
        `<location>${resolvedLocation}</location>`,
        '</skill>',
    ].join('\n');
}
|
|
84
|
+
/**
 * Build the system prompt for trigger testing.
 * The skill under test is embedded between distractor skills (third of
 * four entries) so the model must pick it out of a realistic list
 * rather than seeing it in isolation.
 */
export function buildTriggerSystemPrompt(skill) {
    const [firstDummy, secondDummy, thirdDummy] = DUMMY_SKILLS;
    const entries = [firstDummy, secondDummy].map((d) => buildSkillXml(d.name, d.description));
    entries.push(buildSkillXml(skill.name, skill.description));
    entries.push(buildSkillXml(thirdDummy.name, thirdDummy.description));
    return `${BASE_SYSTEM_PROMPT}

<available_skills>
${entries.join('\n')}
</available_skills>`;
}
|
|
97
|
+
/**
 * Wrap every KNOWN_TOOLS entry in an AI-SDK tool whose execute() returns
 * a canned placeholder string. This lets the model make real structured
 * tool calls during compliance testing without performing any side
 * effects.
 */
export function buildMockTools() {
    return Object.fromEntries(
        Object.entries(KNOWN_TOOLS).map(([toolName, spec]) => [
            toolName,
            tool({
                description: spec.description,
                parameters: spec.parameters,
                execute: async () => `[Mock result from ${toolName}]`,
            }),
        ]),
    );
}
|
|
108
|
+
// Build the system prompt for compliance testing. Unlike the trigger
// prompt, only the skill under test is injected (no distractors), and
// its full body is included as <instructions> so the response can be
// judged on how well it follows them.
export function buildComplianceSystemPrompt(skill) {
    return `${BASE_SYSTEM_PROMPT}

<available_skills>
<skill>
<name>${skill.name}</name>
<description>${skill.description}</description>
<instructions>
${skill.body}
</instructions>
</skill>
</available_skills>`;
}
|
|
121
|
+
//# sourceMappingURL=context-builder.js.map
|
|
@@ -0,0 +1,4 @@
|
|
|
1
|
+
import { type LanguageModel } from 'ai';
import type { EvalReport, EvalResult, ModelWithId, SkillDefinition, TestResult } from './config.js';
/**
 * Judge the raw test results: for each (model, prompt) result, produce a
 * trigger verdict and — where applicable — a compliance verdict against
 * the skill's instructions. `judgeModels` is a fallback chain of judges.
 */
export declare function evaluateResults(skill: SkillDefinition, testResults: TestResult[], judgeModels: LanguageModel[], models: ModelWithId[], verbose: boolean): Promise<EvalResult[]>;
/** Aggregate per-prompt evaluations into one score report per model ID. */
export declare function computeReport(evalResults: EvalResult[], modelIds: string[]): EvalReport[];
|