npm - @opencode_weave/weave - Versions diffs - 0.7.1 → 0.7.4-preview.1 - Mend

@opencode_weave/weave 0.7.1 → 0.7.4-preview.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (36)

package/README.md +3 -196
package/dist/agents/tapestry/prompt-composer.d.ts +3 -1
package/dist/config/schema.d.ts +3 -0
package/dist/features/analytics/generate-metrics-report.d.ts +4 -4
package/dist/features/analytics/index.d.ts +4 -3
package/dist/features/analytics/plan-token-aggregator.d.ts +24 -1
package/dist/features/analytics/quality-score.d.ts +30 -0
package/dist/features/analytics/session-tracker.d.ts +5 -0
package/dist/features/analytics/types.d.ts +51 -14
package/dist/features/evals/evaluators/trajectory-assertion.d.ts +2 -0
package/dist/features/evals/executors/github-models-api.d.ts +13 -0
package/dist/features/evals/executors/model-response.d.ts +6 -1
package/dist/features/evals/executors/prompt-renderer.d.ts +1 -1
package/dist/features/evals/executors/trajectory-run.d.ts +3 -0
package/dist/features/evals/index.d.ts +8 -5
package/dist/features/evals/loader.d.ts +2 -1
package/dist/features/evals/reporter.d.ts +1 -0
package/dist/features/evals/runner.d.ts +1 -1
package/dist/features/evals/schema.d.ts +65 -16
package/dist/features/evals/storage.d.ts +2 -0
package/dist/features/evals/types.d.ts +43 -2
package/dist/features/skill-loader/loader.d.ts +2 -0
package/dist/features/workflow/context.d.ts +2 -1
package/dist/features/workflow/discovery.d.ts +6 -3
package/dist/features/workflow/hook.d.ts +2 -0
package/dist/hooks/compaction-todo-preserver.d.ts +20 -0
package/dist/hooks/create-hooks.d.ts +4 -0
package/dist/hooks/index.d.ts +6 -0
package/dist/hooks/todo-continuation-enforcer.d.ts +25 -0
package/dist/hooks/todo-description-override.d.ts +18 -0
package/dist/hooks/todo-writer.d.ts +17 -0
package/dist/index.js +755 -254
package/dist/plugin/types.d.ts +1 -1
package/dist/shared/resolve-safe-path.d.ts +14 -0
package/package.json +10 -8
package/dist/features/analytics/suggestions.d.ts +0 -10

package/README.md CHANGED

@@ -6,38 +6,6 @@
 Weave is a lean OpenCode plugin with multi-agent orchestration. It provides a cohesive framework for weaving agents, tools, and skills into structured workflows. By delegating complex tasks to specialized agents and monitoring execution state through hooks, Weave ensures reliable and efficient project development.
-## Table of Contents
-- [Overview](#overview)
-- [Documentation](#documentation)
-- [Agents](#agents)
-  - [Agent Modes](#agent-modes)
-  - [Agent Details](#agent-details)
-- [Workflow](#workflow)
-  - [When the Full Workflow Is Used](#when-the-full-workflow-is-used)
-  - [1. Plan](#1-plan)
-  - [2. Review (Optional)](#2-review-optional)
-  - [3. Execute](#3-execute)
-  - [Resuming Interrupted Work](#resuming-interrupted-work)
-  - [Quick Tasks (No Plan Needed)](#quick-tasks-no-plan-needed)
-- [Installation](#installation)
-  - [Prerequisites](#prerequisites)
-  - [Step 1: Add to opencode.json](#step-1-add-to-opencodejson)
-  - [Step 2: Restart OpenCode](#step-2-restart-opencode)
-  - [Troubleshooting](#troubleshooting)
-- [Uninstalling](#uninstalling)
-- [Configuration](#configuration)
-  - [Example Configuration](#example-configuration)
-  - [Configuration Fields](#configuration-fields)
-- [Features](#features)
-  - [Hooks](#hooks)
-  - [Skills](#skills)
-  - [Background Agents](#background-agents)
-  - [Tool Permissions](#tool-permissions)
-- [Development](#development)
-- [Acknowledgments](#acknowledgments)
-- [License](#license)
 ## Overview
 - **8 specialized agents** with weaving-themed names designed for specific roles in the development lifecycle.
@@ -50,7 +18,9 @@ Weave is a lean OpenCode plugin with multi-agent orchestration. It provides a co
 ## Documentation
-Visit [tryweave.io](https://tryweave.io) for more information, or head straight to the [documentation](https://tryweave.io/docs/) for detailed guides on setup, configuration, and usage.
+For detailed guides on configuration, workflows, agents, features, and more, visit the **[Weave documentation](https://tryweave.io/docs/)**.
+For agent routing eval trends and dashboards, see the **[Eval Dashboard](https://tryweave.io/evals/)**.
 ## Agents
@@ -65,87 +35,6 @@ Visit [tryweave.io](https://tryweave.io) for more information, or head straight
 | **Weft** | reviewer/auditor | subagent | Reviews completed work and plans with a critical but fair eye, rejecting only for true blocking issues. |
 | **Warp** | security auditor | subagent | Audits code changes for security vulnerabilities and specification compliance with a skeptical bias. |
-### Agent Modes
-- `primary`: Respects the user-selected model in the OpenCode UI.
-- `subagent`: Uses its own model or fallback chain, ignoring UI selection for predictable specialization.
-- `all`: Available in both primary and subagent contexts.
-### Agent Details
-**Loom** is the central orchestrator and the default entry point for every request. It breaks down complex problems into tasks, decides which agents to delegate to, and tracks progress obsessively via todo lists. Loom never implements code directly — it plans and delegates. For quick fixes it acts immediately; for complex work it kicks off the plan → review → execute workflow.
-**Pattern** is the strategic planner. When a task requires 5+ steps or involves architectural decisions, Loom delegates to Pattern, which researches the codebase (via Thread) and external docs (via Spindle), then produces a structured implementation plan saved to `.weave/plans/{name}.md`. Plans use `- [ ]` checkboxes for every actionable task. Pattern never writes code — only plans.
-**Weft** is the reviewer and auditor. It validates plans before execution and reviews completed work after implementation. Weft is approval-biased and only rejects for true blocking issues (max 3 per review). It checks that file references are correct, tasks have sufficient context, implementations match requirements, and no stubs or TODOs are left behind. Weft is read-only.
-**Warp** is the security and specification compliance auditor. It reviews code changes for security vulnerabilities (injection, auth bypass, token handling, crypto weaknesses) and verifies compliance with standards like OAuth2, OIDC, WebAuthn, and JWT. Warp has a skeptical bias — unlike Weft, it rejects by default when security patterns are detected. It self-triages to fast-exit on non-security changes, and can webfetch RFCs for verification. Warp is read-only.
-**Tapestry** is the execution engine. Activated by the `/start-work` command, it reads a plan from `.weave/plans/` and works through tasks sequentially — writing code, running commands, verifying output, and marking checkboxes as it goes. Tapestry cannot spawn subagents; it focuses on heads-down implementation. If interrupted, it resumes from the first unchecked task.
-**Thread** is the fast codebase explorer. Loom delegates to Thread whenever it needs to understand code structure, find files, or answer questions about the repository. Thread uses grep, glob, and read tools with zero creativity (temperature 0.0) to return precise, factual answers with file paths and line numbers. Thread is read-only.
-**Spindle** is the external researcher. When Loom needs documentation for a library, API reference, or any information outside the codebase, Spindle fetches URLs, reads docs, and synthesizes findings with source citations. Spindle is read-only.
-**Shuttle** is the domain specialist. When work falls into a specific category (e.g., visual engineering, data processing), Loom dispatches Shuttle with full tool access to execute the task. Shuttle's model and configuration can be overridden per-category for domain-optimized performance.
-## Workflow
-Weave uses a structured **Plan → Review → Execute** workflow for complex tasks. Simple requests are handled directly by Loom without the full cycle.
-### When the Full Workflow Is Used
-- Tasks requiring 5+ steps or architectural decisions
-- Multi-file refactors or new feature implementations
-- Work that benefits from a reviewable plan before execution
-### 1. Plan
-Loom delegates to **Pattern**, which researches the codebase and produces a detailed implementation plan:
-```
-User Request → Loom (assesses complexity) → Pattern (researches + plans)
-                                              ↓
-                                     .weave/plans/{name}.md
-```
-The plan includes clear objectives, deliverables, and atomic tasks marked with `- [ ]` checkboxes. Pattern never writes code.
-### 2. Review (Optional)
-For high-stakes or complex plans, Loom delegates to **Weft** to validate the plan before execution:
-```
-.weave/plans/{name}.md → Weft (validates) → APPROVE or REJECT
-```
-Weft checks that referenced files exist, tasks have sufficient context, and there are no contradictions. If rejected, issues are sent back to Pattern for revision.
-### 3. Execute
-The user runs `/start-work` to begin execution:
-```
-/start-work [plan-name] → creates .weave/state.json → switches to Tapestry
-```
-**Tapestry** reads the plan and executes tasks sequentially:
-1. Find the first unchecked `- [ ]` task
-2. Implement the task (write code, run commands, create files)
-3. Verify completion (read files, run tests, check acceptance criteria)
-4. Mark the checkbox `- [x]`
-5. Move to the next unchecked task
-6. When all tasks are complete, report a final summary
-### Resuming Interrupted Work
-If a session is interrupted, running `/start-work` again resumes from the first unchecked task — no re-planning or restarting. The work state is persisted in `.weave/state.json`, so progress is never lost.
-### Quick Tasks (No Plan Needed)
-For simple requests — single-file fixes, quick questions, small edits — Loom handles the work directly or delegates to the appropriate agent without creating a formal plan.
 ## Installation
 This package is published on [npm](https://www.npmjs.com/package/@opencode_weave/weave).
@@ -211,88 +100,6 @@ If you no longer use Weave in any project, remove the global configuration:
 rm -f ~/.config/opencode/weave-opencode.jsonc ~/.config/opencode/weave-opencode.json
 ```
-## Configuration
-Weave searches for configuration files in the following locations, merging them in order (user config → project config → defaults):
-- **Project**: `.opencode/weave-opencode.jsonc` or `.opencode/weave-opencode.json`
-- **User**: `~/.config/opencode/weave-opencode.jsonc` or `~/.config/opencode/weave-opencode.json`
-The configuration uses JSONC format, allowing for comments and trailing commas.
-### Example Configuration
-```jsonc
-{
-  // Override agent models and parameters
-  "agents": {
-    "loom": {
-      "model": "anthropic/claude-3-5-sonnet",
-      "temperature": 0.1
-    },
-    "thread": {
-      "model": "openai/gpt-4o-mini"
-    }
-  },
-  // Category-based dispatch overrides
-  "categories": {
-    "visual-engineering": {
-      "model": "google/gemini-2-pro"
-    }
-  },
-  // Selective feature toggling
-  "disabled_hooks": [],
-  "disabled_agents": [],
-  "disabled_tools": [],
-  "disabled_skills": [],
-  // Background agent concurrency limits
-  "background": {
-    "defaultConcurrency": 5
-  }
-}
-```
-### Configuration Fields
-- `agents` — Override model, temperature, prompt_append, tools, and skills per agent.
-- `categories` — Custom model and tool configurations for category-based dispatch.
-- `disabled_hooks` / `disabled_agents` / `disabled_tools` / `disabled_skills` — Selective feature disabling.
-- `background` — Concurrency limits and timeouts for parallel background agents.
-- `tmux` — Terminal multiplexer layout settings for TUI integration.
-- `skills` — Custom skill discovery paths and recursion settings.
-- `experimental` — Plugin load timeouts and context window threshold adjustments.
-## Features
-### Hooks
-Weave includes 5 built-in hooks that monitor and modify agent behavior:
-- `context-window-monitor` — Warns when token usage approaches limits and suggests recovery strategies.
-- `write-existing-file-guard` — Tracks file reads to prevent agents from overwriting files they haven't examined.
-- `rules-injector` — Automatically injects contextual rules when agents enter directories containing AGENTS.md.
-- `first-message-variant` — Applies specific prompt variants on session start for consistent behavior.
-- `keyword-detector` — Detects keywords in messages to trigger behavioral changes or agent switches.
-All hooks are enabled by default and can be disabled via the `disabled_hooks` configuration.
-### Skills
-Skills are injectable prompt expertise loaded from markdown files (SKILL.md). They modify agent behavior by prepending domain-specific instructions to the agent's system prompt.
-Skills are discovered across three scopes:
-- `builtin` — Provided by the Weave plugin.
-- `user` — Located in the user's global configuration directory.
-- `project` — Located in the current project's `.opencode/skills/` directory.
-### Background Agents
-Weave supports parallel asynchronous sub-agent management via the BackgroundManager. This allows Loom to spawn multiple agents simultaneously to handle independent tasks, with configurable concurrency limits to manage API rate limits.
-### Tool Permissions
-Tool access is controlled per-agent to ensure safety and specialized focus. For example, **Thread** and **Spindle** are strictly read-only; they are denied access to write, edit, and task management tools. These permissions can be customized globally or per-agent in the configuration.
 ## Development
 - **Build**: `bun run build`

package/dist/agents/tapestry/prompt-composer.d.ts CHANGED

@@ -9,13 +9,15 @@ export interface TapestryPromptOptions {
     /** Set of disabled agent names (lowercase config keys) */
     disabledAgents?: Set<string>;
 }
-export declare function buildTapestryRoleSection(): string;
+export declare function buildTapestryRoleSection(disabled?: Set<string>): string;
 export declare function buildTapestryDisciplineSection(): string;
 export declare function buildTapestrySidebarTodosSection(): string;
 export declare function buildTapestryPlanExecutionSection(disabled?: Set<string>): string;
 export declare function buildTapestryVerificationSection(): string;
+export declare function buildTapestryVerificationGateSection(): string;
 export declare function buildTapestryPostExecutionReviewSection(disabled: Set<string>): string;
 export declare function buildTapestryExecutionSection(): string;
+export declare function buildTapestryDebuggingSection(): string;
 export declare function buildTapestryStyleSection(): string;
 /**
  * Compose the full Tapestry system prompt from sections.

package/dist/config/schema.d.ts CHANGED

@@ -161,6 +161,7 @@ export declare const AnalyticsConfigSchema: z.ZodObject<{
 }, z.core.$strip>;
 export declare const WorkflowConfigSchema: z.ZodObject<{
     disabled_workflows: z.ZodOptional<z.ZodArray<z.ZodString>>;
+    directories: z.ZodOptional<z.ZodArray<z.ZodString>>;
 }, z.core.$strip>;
 export declare const WeaveConfigSchema: z.ZodObject<{
     $schema: z.ZodOptional<z.ZodString>;
@@ -233,6 +234,7 @@ export declare const WeaveConfigSchema: z.ZodObject<{
     disabled_tools: z.ZodOptional<z.ZodArray<z.ZodString>>;
     disabled_agents: z.ZodOptional<z.ZodArray<z.ZodString>>;
     disabled_skills: z.ZodOptional<z.ZodArray<z.ZodString>>;
+    skill_directories: z.ZodOptional<z.ZodArray<z.ZodString>>;
     background: z.ZodOptional<z.ZodObject<{
         defaultConcurrency: z.ZodOptional<z.ZodNumber>;
         providerConcurrency: z.ZodOptional<z.ZodRecord<z.ZodString, z.ZodNumber>>;
@@ -261,6 +263,7 @@ export declare const WeaveConfigSchema: z.ZodObject<{
     }, z.core.$strip>>;
     workflows: z.ZodOptional<z.ZodObject<{
         disabled_workflows: z.ZodOptional<z.ZodArray<z.ZodString>>;
+        directories: z.ZodOptional<z.ZodArray<z.ZodString>>;
     }, z.core.$strip>>;
 }, z.core.$strip>;
 export type AgentOverrideConfig = z.infer<typeof AgentOverrideConfigSchema>;
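
Both fields added in this hunk are optional arrays of strings: `skill_directories` at the top level and `directories` under `workflows`. A minimal sketch of a config object that uses them follows. The `WeaveConfigSubset` type and the path values are illustrative assumptions, and the diff does not show how the paths are resolved.

```typescript
// Hypothetical subset of the config shape, covering only the fields this hunk touches.
interface WeaveConfigSubset {
  disabled_skills?: string[];
  skill_directories?: string[]; // added in this diff
  workflows?: {
    disabled_workflows?: string[];
    directories?: string[]; // added in this diff
  };
}

// Example values are made up for illustration.
const exampleConfig: WeaveConfigSubset = {
  skill_directories: ["./team-skills"],
  workflows: {
    directories: ["./team-workflows"],
  },
};
```

Because both fields are optional in the schema, a config that omits them should still validate.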

package/dist/features/analytics/generate-metrics-report.d.ts CHANGED

@@ -1,17 +1,17 @@
 import type { WorkState } from "../work-state/types";
 import type { MetricsReport } from "./types";
 /**
- * Generate a Phase 1 metrics report for a completed plan.
+ * Generate a metrics report for a completed plan.
  *
  * Orchestrates:
  * 1. Extract planned files from the plan markdown
  * 2. Get actual changed files via git diff (startSha..HEAD)
  * 3. Calculate adherence (coverage, precision)
- * 4. Aggregate token usage across all sessions for this plan
+ * 4. Aggregate token usage (with per-session and model detail) across all sessions
  * 5. Compute total duration from session summaries
- * 6. Write the report to metrics-reports.jsonl
+ * 6. Calculate quality score (composite of adherence, task completion, efficiency)
+ * 7. Write the report to metrics-reports.jsonl
  *
- * In Phase 1, `quality` and `gaps` are undefined.
  * Returns the report if successful, null on error.
  */
 export declare function generateMetricsReport(directory: string, state: WorkState): MetricsReport | null;
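
Step 3 of the doc comment above (coverage and precision) can be sketched as a set overlap between planned and changed files. The definitions follow the wording used elsewhere in this diff: coverage is the fraction of planned files actually changed, and precision is the fraction of actual changes that were planned. `sketchAdherence` is a hypothetical helper, and the handling of empty inputs is an assumption; the package's own `calculateAdherence` may differ.

```typescript
// Hypothetical helper; not the package's calculateAdherence.
function sketchAdherence(
  planned: string[],
  actual: string[],
): { coverage: number; precision: number } {
  const plannedUnique = Array.from(new Set(planned));
  const actualSet = new Set(actual);
  // Files that appear in both the plan and the git diff.
  const overlap = plannedUnique.filter((file) => actualSet.has(file)).length;
  // Assumption: an empty denominator counts as a perfect score.
  const coverage = plannedUnique.length === 0 ? 1 : overlap / plannedUnique.length;
  const precision = actualSet.size === 0 ? 1 : overlap / actualSet.size;
  return { coverage, precision };
}

// Two planned files, three changed files, one in common.
const adherence = sketchAdherence(
  ["src/a.ts", "src/b.ts"],
  ["src/a.ts", "src/c.ts", "src/d.ts"],
);
```

Here half of the planned files were touched (coverage 0.5) and one of the three changes was planned (precision about 0.33).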

package/dist/features/analytics/index.d.ts CHANGED

@@ -1,16 +1,17 @@
-export type { ToolUsageEntry, DelegationEntry, SessionSummary, TokenUsage, MetricsTokenUsage, AdherenceReport, MetricsReport, DetectedStack, ProjectFingerprint, Suggestion, InFlightToolCall, TrackedSession, } from "./types";
+export type { ToolUsageEntry, DelegationEntry, SessionSummary, TokenUsage, MetricsTokenUsage, AdherenceReport, MetricsReport, QualityReport, SessionTokenBreakdown, DetectedStack, ProjectFingerprint, InFlightToolCall, TrackedSession, } from "./types";
 export { ANALYTICS_DIR, SESSION_SUMMARIES_FILE, FINGERPRINT_FILE, METRICS_REPORTS_FILE, MAX_METRICS_ENTRIES, zeroTokenUsage, } from "./types";
 export { ensureAnalyticsDir, appendSessionSummary, readSessionSummaries, writeFingerprint, readFingerprint, writeMetricsReport, readMetricsReports, } from "./storage";
 export { detectStack, detectPackageManager, detectMonorepo, detectPrimaryLanguage, generateFingerprint, fingerprintProject, getOrCreateFingerprint, } from "./fingerprint";
 export { SessionTracker, createSessionTracker } from "./session-tracker";
-export { generateSuggestions, getSuggestionsForProject } from "./suggestions";
 export { generateTokenReport, getTokenReport } from "./token-report";
 export { formatMetricsMarkdown } from "./format-metrics";
 export { generateMetricsReport } from "./generate-metrics-report";
 export { extractPlannedFiles } from "./plan-parser";
 export { getChangedFiles } from "./git-diff";
 export { calculateAdherence } from "./adherence";
-export { aggregateTokensForPlan } from "./plan-token-aggregator";
+export { aggregateTokensForPlan, aggregateTokensDetailed } from "./plan-token-aggregator";
+export type { DetailedTokenAggregation } from "./plan-token-aggregator";
+export { calculateQualityScore, BASELINE_TOKENS_PER_TASK } from "./quality-score";
 import type { SessionTracker } from "./session-tracker";
 import type { ProjectFingerprint } from "./types";
 /** Return value of createAnalytics — bundles tracker + fingerprint */

package/dist/features/analytics/plan-token-aggregator.d.ts CHANGED

@@ -1,4 +1,4 @@
-import type { MetricsTokenUsage } from "./types";
+import type { MetricsTokenUsage, SessionTokenBreakdown } from "./types";
 /**
  * Aggregate token usage for a plan by summing across matching session summaries.
  *
@@ -9,3 +9,26 @@ import type { MetricsTokenUsage } from "./types";
  * Maps from session TokenUsage (inputTokens/outputTokens) to MetricsTokenUsage (input/output).
  */
 export declare function aggregateTokensForPlan(directory: string, sessionIds: string[]): MetricsTokenUsage;
+/** Result of detailed token aggregation across sessions for a plan */
+export interface DetailedTokenAggregation {
+    /** Total token usage across all sessions */
+    total: MetricsTokenUsage;
+    /** Total dollar cost across all sessions */
+    totalCost: number;
+    /** Per-session breakdowns */
+    sessions: SessionTokenBreakdown[];
+    /** Per-model aggregation (grouped by model ID, "(unknown)" for sessions without model) */
+    modelBreakdown: Array<{
+        model: string;
+        tokens: MetricsTokenUsage;
+        cost: number;
+        sessionCount: number;
+    }>;
+}
+/**
+ * Aggregate token usage for a plan with per-session and per-model detail.
+ *
+ * The existing `aggregateTokensForPlan()` is unchanged for backward compatibility.
+ * This function adds per-session breakdowns and model attribution.
+ */
+export declare function aggregateTokensDetailed(directory: string, sessionIds: string[]): DetailedTokenAggregation;
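
The `modelBreakdown` field groups sessions by model ID and falls back to `"(unknown)"` when a session has no model. A rough sketch of that grouping is below. The local types are stand-ins for the package types, only the `input` and `output` token fields are modelled, and `groupByModel` is a hypothetical name rather than anything exported by the package.

```typescript
// Local stand-ins; the real MetricsTokenUsage may carry more fields.
interface TokenUsageSketch { input: number; output: number; }
interface SessionSketch { sessionId: string; model?: string; tokens: TokenUsageSketch; cost?: number; }
interface ModelRow { model: string; tokens: TokenUsageSketch; cost: number; sessionCount: number; }

function groupByModel(sessions: SessionSketch[]): ModelRow[] {
  const rows = new Map<string, ModelRow>();
  for (const session of sessions) {
    // Sessions without a model are bucketed under "(unknown)".
    const key = session.model ?? "(unknown)";
    const row = rows.get(key) ?? { model: key, tokens: { input: 0, output: 0 }, cost: 0, sessionCount: 0 };
    row.tokens.input += session.tokens.input;
    row.tokens.output += session.tokens.output;
    row.cost += session.cost ?? 0; // missing cost is treated as zero (assumption)
    row.sessionCount += 1;
    rows.set(key, row);
  }
  return Array.from(rows.values());
}

const modelRows = groupByModel([
  { sessionId: "s1", model: "model-a", tokens: { input: 100, output: 20 }, cost: 0.5 },
  { sessionId: "s2", model: "model-a", tokens: { input: 50, output: 10 }, cost: 0.25 },
  { sessionId: "s3", tokens: { input: 5, output: 1 } },
]);
```

The two `model-a` sessions collapse into one row, and the session without a model lands in its own `"(unknown)"` row.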

package/dist/features/analytics/quality-score.d.ts ADDED

@@ -0,0 +1,30 @@
+import type { AdherenceReport, QualityReport } from "./types";
+/**
+ * Baseline tokens-per-task for efficiency scoring.
+ * A plan consuming this many tokens per task gets an efficiency score of 0.5.
+ * Plans below baseline score above 0.5; plans above baseline score below 0.5.
+ * Exported for future configurability and test reference.
+ */
+export declare const BASELINE_TOKENS_PER_TASK = 50000;
+/**
+ * Calculate a composite quality score for a completed plan.
+ *
+ * Inputs:
+ * - adherence: coverage and precision from the adherence report
+ * - totalTasks / completedTasks: from getPlanProgress()
+ * - totalTokens: sum of input + output + reasoning tokens across all sessions
+ *
+ * Component weights:
+ * - adherenceCoverage (30%): fraction of planned files actually changed
+ * - adherencePrecision (25%): fraction of actual changes that were planned
+ * - taskCompletion (30%): fraction of tasks marked [x]
+ * - efficiency (15%): inverse of normalized tokens-per-task (sigmoid-like)
+ *
+ * Pure function — no I/O.
+ */
+export declare function calculateQualityScore(params: {
+    adherence: AdherenceReport;
+    totalTasks: number;
+    completedTasks: number;
+    totalTokens: number;
+}): QualityReport;
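
The weights in the doc comment above (30% coverage, 25% precision, 30% task completion, 15% efficiency) can be sketched as a weighted sum. The declaration only describes the efficiency curve as "sigmoid-like" with a score of 0.5 at the baseline, so the formula in `sketchEfficiency` is an assumption chosen to satisfy that property, not the package's implementation. Both function names are hypothetical.

```typescript
const BASELINE = 50000; // mirrors BASELINE_TOKENS_PER_TASK from the declaration

// Assumed curve: 0.5 at the baseline, above 0.5 below it, below 0.5 above it.
function sketchEfficiency(totalTokens: number, totalTasks: number): number {
  if (totalTasks <= 0) return 0; // zero-task handling is an assumption
  const tokensPerTask = totalTokens / totalTasks;
  return 1 / (1 + tokensPerTask / BASELINE);
}

function sketchComposite(p: {
  coverage: number;
  precision: number;
  totalTasks: number;
  completedTasks: number;
  totalTokens: number;
}): number {
  const taskCompletion = p.totalTasks > 0 ? p.completedTasks / p.totalTasks : 0;
  const efficiency = sketchEfficiency(p.totalTokens, p.totalTasks);
  // Component weights as listed in the doc comment.
  return 0.3 * p.coverage + 0.25 * p.precision + 0.3 * taskCompletion + 0.15 * efficiency;
}

// 200000 tokens over 4 tasks is exactly the baseline, so efficiency is 0.5.
const score = sketchComposite({
  coverage: 1,
  precision: 0.8,
  totalTasks: 4,
  completedTasks: 3,
  totalTokens: 200000,
});
```

With these inputs the components contribute 0.3, 0.2, 0.225, and 0.075, so the composite comes out at about 0.8.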

package/dist/features/analytics/session-tracker.d.ts CHANGED

@@ -28,6 +28,11 @@ export declare class SessionTracker {
      * Set the agent name for a session. Only sets on first call (captures primary agent).
      */
     setAgentName(sessionId: string, agentName: string): void;
+    /**
+     * Set the model ID for a session. Only sets on first call (captures primary model).
+     * Safe to call for untracked sessions (no-op, no throw).
+     */
+    trackModel(sessionId: string, modelId: string): void;
     /**
      * Accumulate dollar cost from a message into the session total.
      */
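
The doc comment on `trackModel` describes two behaviours: the first call for a session wins, and calls for untracked sessions are silently ignored. A small stand-in class illustrates both. `ModelTrackerSketch`, `startSession`, and `getModel` are hypothetical names for this sketch, and the real `SessionTracker` holds much more state.

```typescript
// Minimal stand-in for the tracker; not the package's SessionTracker.
class ModelTrackerSketch {
  private sessions = new Map<string, { model?: string }>();

  startSession(sessionId: string): void {
    if (!this.sessions.has(sessionId)) this.sessions.set(sessionId, {});
  }

  trackModel(sessionId: string, modelId: string): void {
    const session = this.sessions.get(sessionId);
    if (!session) return; // untracked session: no-op, no throw
    if (session.model === undefined) session.model = modelId; // first call wins
  }

  getModel(sessionId: string): string | undefined {
    return this.sessions.get(sessionId)?.model;
  }
}

const tracker = new ModelTrackerSketch();
tracker.startSession("s1");
tracker.trackModel("s1", "model-a");
tracker.trackModel("s1", "model-b"); // ignored: a model is already recorded
tracker.trackModel("missing", "model-a"); // ignored: session is not tracked
```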

package/dist/features/analytics/types.d.ts CHANGED

@@ -59,6 +59,8 @@ export interface SessionSummary {
     totalDelegations: number;
     /** Display name of the agent that ran this session (e.g., "Loom (Main Orchestrator)") */
     agentName?: string;
+    /** Model ID used in this session (e.g., "claude-sonnet-4-20250514") */
+    model?: string;
     /** Total dollar cost accumulated across all messages */
     totalCost?: number;
     /** Aggregated token usage across all messages (absent for old entries or sessions with no messages) */
@@ -92,16 +94,45 @@ export interface ProjectFingerprint {
     /** Weave version that generated this fingerprint (e.g., "0.6.3") */
     weaveVersion?: string;
 }
-/** A suggestion generated from session analytics */
-export interface Suggestion {
-    /** Unique identifier for deduplication */
-    id: string;
-    /** Human-readable suggestion text */
-    text: string;
-    /** Category of suggestion */
-    category: "tool-usage" | "delegation" | "workflow" | "token-usage";
-    /** Confidence level */
-    confidence: "high" | "medium" | "low";
+/** Composite quality score for a completed plan */
+export interface QualityReport {
+    /** Composite quality score (0-1) — weighted average of components */
+    composite: number;
+    /** Component scores (each 0-1) */
+    components: {
+        /** Fraction of planned files that were actually changed */
+        adherenceCoverage: number;
+        /** Fraction of actual changes that were planned */
+        adherencePrecision: number;
+        /** Fraction of plan tasks marked as complete ([x]) */
+        taskCompletion: number;
+        /** Efficiency score — inverse of normalized tokens-per-task */
+        efficiency: number;
+    };
+    /** Raw data used to compute efficiency (for transparency) */
+    efficiencyData: {
+        /** Total tokens consumed */
+        totalTokens: number;
+        /** Number of tasks in the plan */
+        totalTasks: number;
+        /** Tokens per task */
+        tokensPerTask: number;
+    };
+}
+/** Per-session token breakdown within a plan's metrics report */
+export interface SessionTokenBreakdown {
+    /** Session ID */
+    sessionId: string;
+    /** Model ID used in this session */
+    model?: string;
+    /** Display name of the agent */
+    agentName?: string;
+    /** Token usage for this session */
+    tokens: MetricsTokenUsage;
+    /** Dollar cost for this session */
+    cost?: number;
+    /** Duration in milliseconds */
+    durationMs: number;
 }
 /** File name for metrics reports (JSONL format) */
 export declare const METRICS_REPORTS_FILE = "metrics-reports.jsonl";
@@ -147,10 +178,8 @@ export interface MetricsReport {
     generatedAt: string;
     /** Adherence metrics */
     adherence: AdherenceReport;
-    /** Code quality score (Phase 2 — undefined in Phase 1) */
-    quality?: unknown;
-    /** Quality gaps (Phase 2 — undefined in Phase 1) */
-    gaps?: unknown;
+    /** Composite quality score for the plan */
+    quality?: QualityReport;
     /** Token usage across all sessions */
     tokenUsage: MetricsTokenUsage;
     /** Total duration of all sessions in milliseconds */
@@ -163,6 +192,12 @@ export interface MetricsReport {
     endSha?: string;
     /** Session IDs that contributed to this report */
     sessionIds: string[];
+    /** Deduplicated list of model IDs used across all sessions */
+    modelsUsed?: string[];
+    /** Total dollar cost across all sessions */
+    totalCost?: number;
+    /** Per-session token breakdown */
+    sessionBreakdown?: SessionTokenBreakdown[];
 }
 /** Tracks in-flight tool calls for duration measurement */
 export interface InFlightToolCall {
@@ -187,6 +222,8 @@ export interface TrackedSession {
     inFlight: Record<string, InFlightToolCall>;
     /** Display name of the agent running this session */
     agentName?: string;
+    /** Model ID used in this session (e.g., "claude-sonnet-4-20250514") */
+    model?: string;
     /** Accumulated dollar cost across all messages */
     totalCost: number;
     /** Cumulative token usage across all messages */

package/dist/features/evals/evaluators/trajectory-assertion.d.ts ADDED

@@ -0,0 +1,2 @@
+import type { AssertionResult, EvalArtifacts, TrajectoryAssertionEvaluator } from "../types";
+export declare function runTrajectoryAssertionEvaluator(spec: TrajectoryAssertionEvaluator, artifacts: EvalArtifacts): AssertionResult[];

package/dist/features/evals/executors/github-models-api.d.ts ADDED

@@ -0,0 +1,13 @@
+/**
+ * GitHub Models API caller for live eval execution.
+ *
+ * Provides a fetch-based approach for calling GitHub Models API
+ * in Phase 2 live eval harness. Uses only built-in fetch() — no new dependencies.
+ */
+export declare const GITHUB_MODELS_API_URL = "https://models.inference.ai.azure.com/chat/completions";
+export declare const DELAY_BETWEEN_CALLS_MS = 1000;
+export interface GitHubModelsResponse {
+    content: string;
+    durationMs: number;
+}
+export declare function callGitHubModels(systemPrompt: string, userMessage: string, model: string, token: string): Promise<GitHubModelsResponse>;
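
The declaration above takes a system prompt, a user message, a model, and a token, and says the call uses only built-in `fetch()`. A sketch of how such a request might be assembled is below. The request body shape (`model` plus `messages`) and the Bearer authorization header are assumptions based on the common chat-completions convention, since the diff shows only the signature. `buildChatRequest` is a hypothetical helper that builds the arguments without sending anything.

```typescript
const API_URL = "https://models.inference.ai.azure.com/chat/completions"; // from the declaration

// Hypothetical helper: returns the fetch() arguments, performs no network I/O.
function buildChatRequest(
  systemPrompt: string,
  userMessage: string,
  model: string,
  token: string,
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    url: API_URL,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`, // assumed auth scheme
      },
      // Assumed body shape: one system message followed by one user message.
      body: JSON.stringify({
        model,
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: userMessage },
        ],
      }),
    },
  };
}

// Placeholder values; no real model name or token is implied.
const chatRequest = buildChatRequest("You are a router.", "Fix the login bug", "example-model", "example-token");
```

A caller would pass `chatRequest.url` and `chatRequest.init` to `fetch()` and could measure `durationMs` by recording `Date.now()` before and after the call.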

package/dist/features/evals/executors/model-response.d.ts CHANGED

@@ -1,2 +1,7 @@
 import type { EvalArtifacts, ExecutionContext, ModelResponseExecutor, ResolvedTarget } from "../types";
-export declare function executeModelResponse(resolvedTarget: ResolvedTarget, executor: ModelResponseExecutor, _context: ExecutionContext): EvalArtifacts;
+/**
+ * Executes a model-response eval case by calling the GitHub Models API.
+ *
+ * Phase 2 is live-only — requires GITHUB_TOKEN env var.
+ */
+export declare function executeModelResponse(resolvedTarget: ResolvedTarget, executor: ModelResponseExecutor, context: ExecutionContext): Promise<EvalArtifacts>;

package/dist/features/evals/executors/prompt-renderer.d.ts CHANGED

@@ -1,2 +1,2 @@
 import type { EvalArtifacts, ExecutionContext, ExecutorSpec, ResolvedTarget } from "../types";
-export declare function executePromptRender(resolvedTarget: ResolvedTarget, executor: ExecutorSpec, _context: ExecutionContext): EvalArtifacts;
+export declare function executePromptRender(resolvedTarget: ResolvedTarget, executor: ExecutorSpec, _context: ExecutionContext): Promise<EvalArtifacts>;

package/dist/features/evals/executors/trajectory-run.d.ts ADDED

@@ -0,0 +1,3 @@
+import type { EvalArtifacts, ExecutionContext, ResolvedTarget, TrajectoryRunExecutor } from "../types";
+export declare function detectDelegation(response: string): string | null;
+export declare function executeTrajectoryRun(resolvedTarget: ResolvedTarget, executor: TrajectoryRunExecutor, context: ExecutionContext): Promise<EvalArtifacts>;

package/dist/features/evals/index.d.ts CHANGED

@@ -9,16 +9,19 @@
  *
  * Promptfoo, if adopted later, should plug in behind executor/judge adapters.
  */
-export type { EvalPhase, EvalTarget, ExecutorSpec, EvaluatorSpec, EvalSuiteManifest, EvalCase, LoadedEvalCase, LoadedEvalSuiteManifest, EvalArtifacts, AssertionResult, EvalCaseResult, EvalRunResult, EvalRunSummary, RunEvalSuiteOptions, RunnerFilters, } from "./types";
-export { EvalCaseSchema, EvalSuiteManifestSchema, EvalRunResultSchema } from "./schema";
-export { EvalConfigError, loadEvalSuiteManifest, loadEvalCasesForSuite, resolveSuitePath } from "./loader";
+export type { EvalPhase, EvalTarget, ExecutorSpec, EvaluatorSpec, EvalSuiteManifest, EvalCase, LoadedEvalCase, LoadedEvalSuiteManifest, EvalArtifacts, AssertionResult, EvalCaseResult, EvalRunResult, EvalRunSummary, RunEvalSuiteOptions, RunnerFilters, TrajectoryScenario, TrajectoryTurn, TrajectoryTrace, TrajectoryTurnResult, TrajectoryAssertionEvaluator, } from "./types";
+export { isTrajectoryTrace } from "./types";
+export { EvalCaseSchema, EvalSuiteManifestSchema, EvalRunResultSchema, TrajectoryScenarioSchema, TrajectoryTurnSchema, TrajectoryAssertionEvaluatorSchema, } from "./schema";
+export { EvalConfigError, loadEvalSuiteManifest, loadEvalCasesForSuite, resolveSuitePath, loadTrajectoryScenario, } from "./loader";
 export { resolveBuiltinAgentTarget } from "./targets/builtin-agent-target";
 export { executePromptRender } from "./executors/prompt-renderer";
 export { executeModelResponse } from "./executors/model-response";
+export { executeTrajectoryRun, detectDelegation } from "./executors/trajectory-run";
 export { runDeterministicEvaluator } from "./evaluators/deterministic";
 export { runLlmJudgeEvaluator } from "./evaluators/llm-judge";
+export { runTrajectoryAssertionEvaluator } from "./evaluators/trajectory-assertion";
 export { deriveDeterministicBaseline, readDeterministicBaseline, compareDeterministicBaseline } from "./baseline";
-export { ensureEvalStorageDir, getDefaultEvalRunPath, writeEvalRunResult } from "./storage";
-export { formatEvalSummary } from "./reporter";
+export { ensureEvalStorageDir, getDefaultEvalRunPath, writeEvalRunResult, getDefaultJsonlPath, appendEvalRunJsonl } from "./storage";
+export { formatEvalSummary, formatJobSummaryMarkdown } from "./reporter";
 export type { RunEvalSuiteOutput } from "./runner";
 export { runEvalSuite } from "./runner";

package/dist/features/evals/loader.d.ts CHANGED

@@ -1,4 +1,4 @@
-import type { LoadedEvalCase, LoadedEvalSuiteManifest } from "./types";
+import type { LoadedEvalCase, LoadedEvalSuiteManifest, TrajectoryScenario } from "./types";
 export declare class EvalConfigError extends Error {
     constructor(message: string);
 }
@@ -6,3 +6,4 @@ export declare function resolveSuitePath(directory: string, suite: string): stri
 export declare function loadEvalSuiteManifest(directory: string, suite: string): LoadedEvalSuiteManifest;
 export declare function loadEvalCaseFile(directory: string, filePath: string): LoadedEvalCase;
 export declare function loadEvalCasesForSuite(directory: string, suite: LoadedEvalSuiteManifest): LoadedEvalCase[];
+export declare function loadTrajectoryScenario(directory: string, scenarioRef: string): TrajectoryScenario;

package/dist/features/evals/reporter.d.ts CHANGED

@@ -1,2 +1,3 @@
 import type { EvalRunResult } from "./types";
 export declare function formatEvalSummary(result: EvalRunResult): string;
+export declare function formatJobSummaryMarkdown(result: EvalRunResult): string;

package/dist/features/evals/runner.d.ts CHANGED

@@ -4,4 +4,4 @@ export interface RunEvalSuiteOutput {
     artifactPath: string;
     consoleSummary: string;
 }
-export declare function runEvalSuite(options: RunEvalSuiteOptions): RunEvalSuiteOutput;
+export declare function runEvalSuite(options: RunEvalSuiteOptions): Promise<RunEvalSuiteOutput>;
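
The return type change from `RunEvalSuiteOutput` to `Promise<RunEvalSuiteOutput>` is a breaking change for callers: code that read fields off the result synchronously now has to `await` it. The same applies to `executePromptRender` and `executeModelResponse` earlier in this diff. The sketch below uses a hypothetical stub in place of the real function, and `RunOutputSketch` models only the two fields visible in this hunk.

```typescript
// Partial stand-in: only the two fields visible in this hunk.
interface RunOutputSketch { artifactPath: string; consoleSummary: string; }

// Hypothetical stub with the new async shape; the real runEvalSuite takes RunEvalSuiteOptions.
async function runEvalSuiteStub(suite: string): Promise<RunOutputSketch> {
  return { artifactPath: `evals/${suite}.json`, consoleSummary: `${suite}: ok` };
}

// With the synchronous 0.7.1 signature a caller could read consoleSummary directly.
// With the Promise-returning signature the caller must await before touching the fields.
async function printSummary(suite: string): Promise<string> {
  const output = await runEvalSuiteStub(suite);
  return output.consoleSummary;
}

const pendingSummary = printSummary("demo");
```

Callers that cannot become `async` themselves can chain `.then()` on the returned promise instead.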