agent-tool-forge 0.4.7 → 0.4.9
- package/lib/config-schema.js +3 -3
- package/lib/forge-service.js +7 -4
- package/lib/hitl-engine.d.ts +3 -3
- package/lib/index.js +2 -3
- package/lib/init.js +1 -1
- package/lib/sidecar.d.ts +4 -2
- package/package.json +2 -1
- package/skills/forge-eval/SKILL.md +69 -0
- package/skills/forge-eval/references/assertion-patterns.md +265 -0
- package/skills/forge-eval/references/eval-types.md +262 -0
- package/skills/forge-eval/references/overlap-map.md +89 -0
- package/skills/forge-mcp/SKILL.md +62 -0
- package/skills/forge-mcp/references/mcp-templates.md +302 -0
- package/skills/forge-mcp/references/tool-to-mcp-mapping.md +108 -0
- package/skills/forge-tool/SKILL.md +112 -0
- package/skills/forge-tool/references/description-contract.md +102 -0
- package/skills/forge-tool/references/extension-points.md +120 -0
- package/skills/forge-tool/references/pending-spec.md +53 -0
- package/skills/forge-tool/references/tool-shape.md +106 -0
- package/skills/forge-verifier/SKILL.md +78 -0
- package/skills/forge-verifier/references/output-groups.md +39 -0
- package/skills/forge-verifier/references/verifier-pattern.md +83 -0
- package/skills/forge-verifier/references/verifier-stubs.md +147 -0
package/skills/forge-eval/references/eval-types.md
@@ -0,0 +1,262 @@

# Eval Case Interfaces

## GoldenEvalCase

Sends a known prompt through the full agent loop. Asserts on single-tool selection + response content. The routing sanity check.

```
GoldenEvalCase {
  id: string                        // Format: "gs-<toolname>-NNN"
  description: string               // What this case tests
  input: {
    message: string                 // The user prompt to send
  }
  stubs?: {
    [toolName: string]: object      // Stub response for each tool (activates multi-turn mode)
  }
  maxTurns?: number                 // Max LLM loop iterations in stub mode (default 5)
  expect: {
    toolsCalled: string[]           // Exact tools that must be called
    noToolErrors: boolean           // In stub mode: all tools must have stub entries
    responseNonEmpty: boolean       // Agent produced a response
    responseContains?: string[]     // ALL must appear (exact values)
    responseContainsAny?: string[][] // At least one from EACH group
    responseNotContains?: string[]  // NONE may appear
    maxLatencyMs?: number           // Response time budget
  }
}
```

### Field Semantics

- **toolsCalled** — Exact match. If you expect `["get_weather"]`, the agent must call exactly `get_weather` and no other tools.
- **stubs** — Map of tool name → stub return object. When present, activates stub-based multi-turn mode: the runner feeds stub results back to the model after each tool call and continues until the model produces a final text response. Response assertions (`responseContains`, etc.) check the **final** response, not first-turn text. Required for `noToolErrors` and response content assertions to be meaningful.
- **maxTurns** — Safety cap on the stub-based loop. Defaults to 5. The runner stops after this many LLM turns even if the model keeps calling tools.
- **noToolErrors** — In stub mode: fails if the model calls a tool that has no entry in `stubs`. In routing-only mode (no `stubs`): no-op.
- **responseContains** — Substring match, case-sensitive. Use for exact values that prove the tool returned real data (dollar amounts, names, IDs). Meaningful only in stub mode.
- **responseContainsAny** — Each inner array is a synonym group. At least one member from each group must appear. Use for domain terms where multiple phrasings are acceptable.
- **responseNotContains** — Substring match, case-sensitive. Use to catch: cop-outs ("I don't know"), raw JSON leaks ("fetchedAt"), sensitive data ("API_KEY").
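
A minimal golden case for a hypothetical `get_weather` tool (the stub payload and values are illustrative):

```json
{
  "id": "gs-get_weather-001",
  "description": "Basic weather lookup routes to get_weather and surfaces the stubbed temperature",
  "input": { "message": "What's the weather in Paris?" },
  "stubs": {
    "get_weather": { "city": "Paris", "tempC": 18, "conditions": "cloudy" }
  },
  "expect": {
    "toolsCalled": ["get_weather"],
    "noToolErrors": true,
    "responseNonEmpty": true,
    "responseContains": ["18"],
    "responseContainsAny": [["cloudy", "overcast"]]
  }
}
```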

---

## LabeledEvalCase

Sends a message through the full agent loop. Tests tool routing under ambiguity, edge cases, and adversarial inputs.

```
LabeledEvalCase {
  id: string                       // Format: "ls-<toolname>-NNN"
  description: string              // What this case tests
  difficulty: 'straightforward' | 'ambiguous' | 'edge'
  input: {
    message: string                // The user prompt to send
  }
  expect: {
    // Tool ROUTING assertions (use one of toolsCalled OR toolsAcceptable)
    toolsCalled?: string[]         // Exact tools that must appear
    toolsAcceptable?: string[][]   // Any of these tool sets is valid
    toolsNotCalled?: string[]      // Tools that must NOT be called

    noToolErrors?: boolean

    // Parameter assertions (did the model pass the right arguments?)
    toolParams?: ToolParamAssertion[]

    // Response quality assertions (all deterministic)
    responseNonEmpty?: boolean
    responseContains?: string[]
    responseContainsAny?: string[][]
    responseNotContains?: string[]
    responseMatches?: string[]     // Regex patterns
    maxLatencyMs?: number
    maxTokens?: number             // Response token count ceiling
  }
}
```

### toolsCalled vs toolsAcceptable

- **toolsCalled** — Use for straightforward cases where the exact tool set is known. `["get_weather", "get_forecast"]` means both must be called.
- **toolsAcceptable** — Use for ambiguous cases where multiple strategies are valid. `[["get_weather"], ["get_weather", "get_forecast"]]` means either set is acceptable.
- Never use both on the same case.
- For edge cases where no tools should be called, use `toolsAcceptable: [["__none__"]]`.
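
For example, a hypothetical ambiguous case where either routing strategy is acceptable (id and values illustrative):

```json
{
  "id": "ls-get_weather-007",
  "description": "Vague outdoor-conditions question may route to current weather alone or weather plus forecast",
  "difficulty": "ambiguous",
  "input": { "message": "Should I bring a jacket in Paris today?" },
  "expect": {
    "toolsAcceptable": [["get_weather"], ["get_weather", "get_forecast"]],
    "responseNonEmpty": true,
    "responseNotContains": ["I don't know"]
  }
}
```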

### Difficulty Tiers

- **straightforward** — Clear multi-tool tasks. Obvious which tools to use and in what combination.
- **ambiguous** — Multiple valid interpretations or tool combinations. Tests the agent's judgment.
- **edge** — Adversarial inputs, prompt injection, off-topic, contradictions. Tests robustness.

---

## ToolParamAssertion

Asserts that the model passed correct arguments to a tool call. This catches a failure class that routing assertions miss: the model calls the right tool but passes wrong, missing, or hallucinated parameters.

```
ToolParamAssertion {
  tool: string              // Which tool call to check
  paramName: string         // Which parameter to assert on
  assertion: 'equals' | 'contains' | 'oneOf' | 'exists' | 'notExists' | 'matches'
  value?: string | string[] // Expected value(s) — not needed for exists/notExists
}
```

### Assertion Types

- **equals** — Exact match (case-sensitive). `{ tool: "get_weather", paramName: "city", assertion: "equals", value: "Paris" }`
- **contains** — Substring match. `{ paramName: "query", assertion: "contains", value: "weather" }`
- **oneOf** — Value matches any in the list. `{ paramName: "units", assertion: "oneOf", value: ["metric", "imperial"] }`
- **exists** — Parameter was provided (any value). `{ paramName: "city", assertion: "exists" }`
- **notExists** — Parameter was NOT provided (catches hallucinated params). `{ paramName: "country_code", assertion: "notExists" }`
- **matches** — Regex match. `{ paramName: "date", assertion: "matches", value: "^\\d{4}-\\d{2}-\\d{2}$" }`

### When to Use

- **Golden evals:** Assert the obvious parameter. "What's the weather in Paris?" → `city` equals `"Paris"` (or contains `"Paris"`).
- **Labeled straightforward:** Assert parameters across multiple tool calls. "Weather and air quality in Beijing" → `get_weather.city` contains `"Beijing"` AND `get_air_quality.city` contains `"Beijing"`.
- **Labeled ambiguous:** Use `oneOf` for parameters with valid alternatives. Units could be `"metric"` or `"imperial"` depending on user locale.
- **Edge cases:** Use `notExists` to catch hallucinated parameters the schema doesn't define.

### Rules

- Parameter assertions check the arguments the model SENT to the tool, not the tool's response.
- Use `contains` over `equals` when the model might normalize the input (e.g., `"Paris"` vs `"Paris, FR"` vs `"paris"`).
- Use `oneOf` when multiple values are acceptable (enum fields, locale-dependent defaults).
- Use template tokens (`{{seed:*}}`) in values for data-dependent parameters.
- If the tool wasn't called (routing assertion already failed), parameter assertions are skipped.
- When `toolsAcceptable` is used, parameter assertions apply only if the tool was actually called.
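
Putting it together, a hypothetical straightforward case asserting both routing and parameters (values illustrative):

```json
{
  "id": "ls-get_weather-012",
  "description": "Two-tool request passes the same city to both tools",
  "difficulty": "straightforward",
  "input": { "message": "Weather and air quality in Beijing, please" },
  "expect": {
    "toolsCalled": ["get_weather", "get_air_quality"],
    "toolParams": [
      { "tool": "get_weather", "paramName": "city", "assertion": "contains", "value": "Beijing" },
      { "tool": "get_air_quality", "paramName": "city", "assertion": "contains", "value": "Beijing" }
    ],
    "responseNonEmpty": true
  }
}
```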

---

## RegressionEvalCase

Captures a specific bug that was found and fixed. Immutable once created — never modified, only archived when no longer applicable.

```
RegressionEvalCase {
  id: string                       // Format: "rg-<toolname>-NNN"
  description: string              // "regression — <what the bug was>"
  difficulty: 'regression'
  createdAt: string                // ISO 8601 — when this case was created
  bugRef: string                   // Issue number or commit hash of the fix
  input: {
    message: string                // The exact prompt that triggered the bug
  }
  expect: {
    toolsCalled?: string[]
    toolsAcceptable?: string[][]
    noToolErrors?: boolean
    responseNonEmpty?: boolean
    responseContains?: string[]    // Use seed-stable values only (no snapshots)
    responseContainsAny?: string[][]
    responseNotContains?: string[]
    maxLatencyMs?: number
  }
}
```

### Key Properties

- **Immutable** — Never edited after creation. If the tool changes so much the case no longer applies, archive it.
- **Cheap to run** — Seed-stable assertions only (no `{{snapshot:*}}` tokens). Suitable for CI.
- **Not scaled** — No formula. One case per real bug, created ad hoc.
- **Self-documenting** — `bugRef` links to the fix, so a failure immediately tells you what regressed.
- **Separate file** — Stored in `<toolname>.regression.json`, separate from golden and labeled.
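
A hypothetical example (the bug, `bugRef`, and values are illustrative):

```json
{
  "id": "rg-get_weather-001",
  "description": "regression — agent leaked raw JSON field names into the response",
  "difficulty": "regression",
  "createdAt": "2025-01-15T09:30:00Z",
  "bugRef": "a1b2c3d",
  "input": { "message": "What's the weather in Paris?" },
  "expect": {
    "toolsCalled": ["get_weather"],
    "responseNonEmpty": true,
    "responseNotContains": ["fetchedAt", "tempC"]
  }
}
```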

---

## RubricEvalCase (Stub)

Reserved for scored multi-dimensional evaluation. Use for synthesis quality testing where deterministic assertions are insufficient — e.g., "did the agent coherently combine data from 3 tools into a useful response?"

```
RubricEvalCase {
  id: string
  description: string
  input: { message: string }
  referenceData?: Record<string, unknown> // Ground truth tool output for the judge
  rubric: {
    dimension: string              // e.g., "accuracy", "completeness", "safety"
    maxScore: number               // Use 1 for binary (pass/fail), 3-5 for graded
    criteria: string               // What each score level means
  }[]
}
```

### When to Use

- Multi-tool synthesis: "Did the response accurately integrate data from all called tools?"
- Response quality: "Is the response helpful and well-structured?" (not testable with substring matching)
- Safety nuance: "Did the response appropriately hedge on investment advice?"

### Implementation Notes

- Binary scoring (`maxScore: 1`) is more defensible than graded rubrics.
- The `referenceData` field provides ground truth, so the judge compares against real data rather than its own knowledge.
- Keep rubric evals in a separate tier — run deterministic evals first, and run rubric evals only when the deterministic tier passes.
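
A hypothetical binary-scored case (the `rb-` id prefix, dimension names, and values are illustrative, since this interface is still a stub):

```json
{
  "id": "rb-get_weather-001",
  "description": "Synthesis of weather and air quality into one coherent answer",
  "input": { "message": "Tell me everything about conditions in Beijing today" },
  "referenceData": {
    "get_weather": { "tempC": 31, "conditions": "hazy" },
    "get_air_quality": { "aqi": 162 }
  },
  "rubric": [
    {
      "dimension": "accuracy",
      "maxScore": 1,
      "criteria": "1 if every figure in the response matches referenceData; 0 if any figure is wrong or invented"
    }
  ]
}
```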

---

## EvalCaseResult

The result of running a single eval case.

```
EvalCaseResult {
  id: string
  description: string
  passed: boolean
  durationMs: number
  error?: string                   // Failure reason
  details?: Record<string, any>    // Extra info: tools called, response, etc.
}
```

## EvalSuiteResult

The result of running an entire eval suite.

```
EvalSuiteResult {
  runId: string                    // UUID — unique identifier for this run
  timestamp: string                // ISO 8601
  tier: 'golden' | 'labeled' | 'regression'
  toolName: string

  // Current state at run time (for staleness comparison)
  metadata: {
    toolVersion: string
    descriptionHash: string        // sha256(description)[0:12]
    registrySize: number
    evalFileHash: string           // sha256(eval JSON file)[0:12]
  }

  stalenessWarnings: string[]      // Warnings from staleness checks

  cases: EvalCaseResult[]

  summary: {
    totalCases: number
    passed: number
    failed: number
    skippedAssertions: number
    totalDurationMs: number
    estimatedCostUsd?: number
  }

  // Baseline diffing (optional — requires a previous runId)
  baselineRunId?: string           // Previous run to compare against
  regressions: string[]            // Case IDs that passed before, fail now
  newPasses: string[]              // Case IDs that failed before, pass now
}
```

### Baseline Diffing

When `baselineRunId` is provided, the runner loads the previous `EvalSuiteResult` and compares:
- **Regression:** Passed in baseline, fails now → add to `regressions[]`
- **New pass:** Failed in baseline, passes now → add to `newPasses[]`
- **Unchanged:** No note
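
A minimal sketch of that comparison, assuming both suite results have been loaded as plain objects (the helper name is illustrative):

```js
// Sketch: diff a current run against a baseline EvalSuiteResult.
function diffAgainstBaseline(current, baseline) {
  // Map of case id → passed for the baseline run
  const before = new Map(baseline.cases.map((c) => [c.id, c.passed]));

  const regressions = [];
  const newPasses = [];

  for (const c of current.cases) {
    if (!before.has(c.id)) continue;               // case didn't exist in baseline
    const was = before.get(c.id);
    if (was && !c.passed) regressions.push(c.id);  // passed before, fails now
    if (!was && c.passed) newPasses.push(c.id);    // failed before, passes now
  }

  return { regressions, newPasses };
}
```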

This is the minimum viable A/B comparison. It answers: "Did my change break anything that was working?"
package/skills/forge-eval/references/overlap-map.md
@@ -0,0 +1,89 @@

# Tool Overlap Map — Format and Usage

The overlap map declares which tools are **close neighbors** — tools that could plausibly be confused by a natural language prompt. It serves two purposes:

1. **Eval generation** — The eval factory reads the map to target ambiguous cases at real overlaps
2. **Coverage gap detection** — Cross-reference the map with existing evals to find untested overlaps

---

## Structure

```json
{
  "get_weather": {
    "overlaps": ["get_forecast", "get_air_quality"],
    "clusters": [
      ["get_weather", "get_forecast", "get_air_quality"]
    ],
    "reason": "Weather queries are broad — 'what's it like outside' could route to current conditions, forecast, or air quality"
  },
  "get_forecast": {
    "overlaps": ["get_weather"],
    "clusters": [
      ["get_weather", "get_forecast", "get_air_quality"]
    ],
    "reason": "Future-oriented weather questions could route to forecast or current weather with extrapolation"
  }
}
```

### Fields

- **`overlaps`** — Pairwise close neighbors. Tools that could be confused 1:1.
- **`clusters`** — Groups of 3+ tools that frequently appear together in complex prompts. A cluster means "these tools naturally co-occur in multi-step tasks."
- **`reason`** — Human-readable explanation of why these tools overlap. Useful for onboarding and debugging.

---

## Rules

### When to declare an overlap
Two tools overlap when a natural language prompt could reasonably route to either one.
- `get_weather` and `get_forecast` overlap on "what's the weather like"
- `get_weather` and `delete_account` do NOT overlap

### Clusters capture natural groupings
When a user asks a broad question ("tell me everything about the weather"), which tools naturally get called together? That's a cluster. Cluster members don't need to be pairwise confusable — they just co-occur in multi-step tasks.

### Sparse, not exhaustive
At 50 tools, most tools should have 2-4 overlaps and 0-2 clusters. Hub tools may have more. The map is NOT an NxN matrix.

### Symmetric by convention
If A lists B as an overlap, B should list A.

### The tool factory adds the entry
When a new tool is created, the factory identifies close neighbors and clusters, then updates the overlap map. The eval factory reads it.

### Coverage check surfaces gaps
A coverage check should report:
- (a) tools with no golden evals
- (b) tools with no labeled evals
- (c) declared overlaps with no ambiguous eval testing both tools together (a sketch follows this list)
- (d) declared clusters with no labeled eval exercising the full group
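
A minimal sketch of check (c), assuming `overlapMap` matches the structure above and `labeledCases` is a flattened array of LabeledEvalCase objects (how they're loaded is left out):

```js
// Sketch: find declared overlap pairs with no ambiguous eval covering both tools.
function untestedOverlaps(overlapMap, labeledCases) {
  // Every tool set an ambiguous case accepts, from toolsAcceptable or toolsCalled
  const toolSets = labeledCases
    .filter((c) => c.difficulty === 'ambiguous')
    .flatMap((c) => c.expect.toolsAcceptable ?? [c.expect.toolsCalled ?? []]);

  const covers = (a, b) =>
    toolSets.some((set) => set.includes(a) && set.includes(b));

  const gaps = [];
  for (const [tool, entry] of Object.entries(overlapMap)) {
    for (const neighbor of entry.overlaps) {
      // tool < neighbor deduplicates the symmetric pair
      if (tool < neighbor && !covers(tool, neighbor)) gaps.push([tool, neighbor]);
    }
  }
  return gaps; // e.g., [["get_air_quality", "get_weather"]]
}
```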

---

## When NOT to Add an Overlap

- Tools in different categories serving obviously different purposes (read vs write)
- Tools whose descriptions have zero shared trigger conditions
- Tools where confusion would require a badly malformed prompt

---

## Growth Management

As the registry scales, periodically review the map for stale entries. If two tools' descriptions have been refined to eliminate confusion, remove the overlap. The coverage check confirms they're no longer tested together — which is correct if they genuinely don't overlap.

---

## How the Eval Factory Uses the Map

1. **Read the map** before generating labeled evals
2. **Count overlaps (O) and clusters (C)** for the scaling formula
3. **Target ambiguous cases at declared overlaps** — every overlap pair gets at least one ambiguous eval
4. **Use clusters for multi-tool cases** — clusters drive 3-tool and 4+ tool labeled cases
5. **After generation**, verify every declared overlap has coverage

The map turns ambiguous eval generation from guesswork into a directed process.

package/skills/forge-mcp/SKILL.md
@@ -0,0 +1,62 @@

# /forge-mcp — Generate MCP Server Scaffold

Generate a Model Context Protocol (MCP) server scaffold from one or more ToolDefinition objects. The generated server can be registered with any MCP-compatible client (Claude, Cursor, etc.).

---

## Step 1 — Identify Tools to Expose

Ask the user which tools to expose via MCP, or default to all tools in `tools/index.js`.

For each tool, read its ToolDefinition to extract (example shape below):
- `name`, `description`, `schema` (parameters)
- `mcpRouting` (HTTP endpoint and method)
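
For orientation, a hypothetical ToolDefinition excerpt showing just the fields this skill reads (the exact field names inside `mcpRouting` are assumptions):

```json
{
  "name": "get_weather",
  "description": "Get current weather conditions for a city",
  "schema": {
    "type": "object",
    "properties": {
      "city": { "type": "string", "description": "City name" }
    },
    "required": ["city"]
  },
  "mcpRouting": { "endpoint": "/api/tools/get_weather", "method": "POST" }
}
```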

---

## Step 2 — Generate MCP Server

Generate `mcp-server.js` (or the user's preferred filename) using `@modelcontextprotocol/sdk`:

```js
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { CallToolRequestSchema, ListToolsRequestSchema } from '@modelcontextprotocol/sdk/types.js';
```

The server must (see the sketch after this list):
1. Implement `tools/list` — return all tools with their JSON schemas
2. Implement `tools/call` — route to the correct ToolDefinition's `mcpRouting` HTTP endpoint
3. Handle auth headers from environment (`FORGE_API_KEY` or tool-specific env vars)
4. Return structured errors on tool failure
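
A minimal sketch of the two handlers, assuming `toolDefs` is the array of ToolDefinitions from Step 1 and `mcpRouting` carries `endpoint`/`method` fields (both assumptions; the server name and auth scheme are also illustrative):

```js
const server = new Server(
  { name: 'forge-tools', version: '0.1.0' },  // server identity (illustrative)
  { capabilities: { tools: {} } }
);

// tools/list — advertise every ToolDefinition with its JSON schema
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: toolDefs.map((t) => ({
    name: t.name,
    description: t.description,
    inputSchema: t.schema,
  })),
}));

// tools/call — forward the arguments to the tool's HTTP endpoint
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;
  const def = toolDefs.find((t) => t.name === name);
  if (!def) {
    return { isError: true, content: [{ type: 'text', text: `Unknown tool: ${name}` }] };
  }
  const res = await fetch(`${process.env.API_BASE_URL}${def.mcpRouting.endpoint}`, {
    method: def.mcpRouting.method,
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.FORGE_API_KEY}`, // auth scheme assumed
    },
    body: JSON.stringify(args),
  });
  if (!res.ok) {
    // Structured error back to the client rather than a thrown exception
    return { isError: true, content: [{ type: 'text', text: `HTTP ${res.status}: ${await res.text()}` }] };
  }
  return { content: [{ type: 'text', text: await res.text() }] };
});

await server.connect(new StdioServerTransport());
```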

---

## Step 3 — Generate Transport Config

Generate an `mcp.json` config snippet the user can paste into their Claude Desktop or Cursor config:

```json
{
  "mcpServers": {
    "<project-name>": {
      "command": "node",
      "args": ["./mcp-server.js"],
      "env": {
        "API_BASE_URL": "http://localhost:3000",
        "FORGE_API_KEY": "<your-api-key>"
      }
    }
  }
}
```

---

## Step 4 — Summary

Print:
- MCP server file path
- Number of tools exposed
- Config snippet location
- Instructions for registering with the MCP client
|