xtrm-tools (npm) 2.4.0 → 2.4.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (125)

package/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/without_skill/outputs/result.md ADDED

@@ -0,0 +1,222 @@
+# Documentation Sync Report — Sprint Closeout
+**Date:** 2026-03-18
+**Branch:** feature/jaggers-agent-tools-4xr6
+**Assessed by:** Manual review (git log, gh issue/pr, file reads)
+---
+## Sprint Activity Summary
+### bd Issues Closed This Sprint
+| # | Title | Closed |
+|---|-------|--------|
+| #38 | xtrm install: Pi coding agent setup with template-based config | 2026-03-15 |
+| #33 | gitnexus: fix MCP + CLI DB lock contention (enable read-only MCP mode) | 2026-03-15 |
+Both are enhancement-labeled. Only 3 issues exist in the bd tracker total (including the older #1 bug). The tracker is thin — most sprint work was tracked through GitHub PRs directly rather than bd issues.
+### Merged PRs This Sprint (March 17–18)
+| PR | Title | Merged |
+|----|-------|--------|
+| #111 | Install official Claude plugins and remove duplicate MCP servers | 2026-03-18 |
+| #110 | chore: release v2.4.0 | 2026-03-18 |
+| #109 | chore: eliminate tdd-guard completely | 2026-03-18 |
+| #108 | fix(quality-gates): wire PostToolUse hooks into project settings.json | 2026-03-18 |
+| #107 | docs(xtrm-guide): fix skills catalog, Pi events, policy table, version history | 2026-03-18 |
+| #106 | docs: pre-install cleanup guide for plugin migration | 2026-03-18 |
+| #105 | fix: context7 free stdio + commit gate stale-claim bug | 2026-03-18 |
+| #104 | fix(p0): MCP sync guard, manifest hash drift detection, dead code removal | 2026-03-18 |
+| #103 | docs: add comprehensive XTRM-GUIDE.md and update README.md | 2026-03-18 |
+| #102 | feat(tests): cross-runtime policy parity test suite | 2026-03-17 |
+The user said "3 PRs" — likely referring to the three feature/fix PRs since the v2.4.0 release: #111 (plugins), #109 (tdd-guard removal), and #108 (quality-gates wiring). PRs #107 and #106 are documentation PRs; #103 introduced XTRM-GUIDE.md and updated README.
+### Current Branch Unmerged Commits (8 commits ahead of main)
+```
+86b3900 Add Pi extension drift checks and guard-rules parity
+54d9978 Centralize guard tool rules and matcher expansion
+f8e37f9 Deprecate install project command in favor of xtrm init
+c1d5182 Add global-first architecture regression tests
+d83384e Add project detection and service registry scaffolding to xtrm init
+e35fa46 Promote service and quality skills to global sync set
+b6c057f Make service-skills extension CWD-aware and global
+02fe064 Move quality gates to global Claude hooks
+```
+These 8 commits represent active work on this branch that has not yet been merged to main.
+---
+## Documentation State Assessment
+### Files Reviewed
+| File | Size | Status |
+|------|------|--------|
+| `README.md` | ~190 lines | Drifted (version stale) |
+| `XTRM-GUIDE.md` | ~360 lines | Partially updated, minor drift |
+| `CHANGELOG.md` | Long | Missing v2.4.0 entry; Unreleased content present |
+| `ROADMAP.md` | Long | Stale — references old architecture, old commands |
+| `plugins/xtrm-tools/.claude-plugin/plugin.json` | Small | Drifted (version stale) |
+---
+## Drift Findings
+### 1. Version Numbers Are Stale in README.md and XTRM-GUIDE.md
+**Actual package version:** `2.4.1` (cli/package.json), `2.4.0` released via PR #110.
+**README.md shows:**
+- Line 5: `**Version 2.3.0**`
+- Line 20: `# → xtrm-tools@xtrm-tools  Version: 2.3.0  Status: ✔ enabled`
+- Line 183 (Version History table): Only goes up to `2.3.0 | 2026-03-17`
+**XTRM-GUIDE.md shows:**
+- Line 1 (heading): `> **Version 2.3.0**`
+- Line 80 (install verify example): `Version: 2.3.0`
+- Line 121 (plugin.json snippet): `"version": "2.3.0"`
+- Version History table: Stops at `2.3.0 | 2026-03-18`
+**plugins/xtrm-tools/.claude-plugin/plugin.json:**
+- `"version": "2.3.0"` — not bumped to reflect the 2.4.x releases
+Both the README and XTRM-GUIDE need a `2.4.0` row added to their version history tables, and their header version badges updated. The plugin.json manifest version also needs bumping.
+---
+### 2. CHANGELOG.md Has No v2.4.0 Entry
+The `[Unreleased]` section exists and contains real content, but `## [2.4.0]` has never been written. The release PR #110 only bumped the package version — it did not write a changelog entry. What v2.4.0 actually shipped (based on PRs):
+- Eliminated tdd-guard completely (#109)
+- Wired PostToolUse quality-gates hooks into project settings.json (#108)
+- Installed official Claude plugins (serena, context7, github, ralph-loop) during `xtrm install all` (#111)
+- Removed duplicate serena/context7 from root `.mcp.json` (#111)
+- Added comprehensive XTRM-GUIDE.md (#103)
+- Multiple p0 bugfixes: MCP sync guard, manifest hash drift detection, dead code removal (#104)
+- Fixed context7 free stdio transport and commit gate stale-claim bug (#105)
+The `[Unreleased]` section currently contains content about `AGENTS.md` bd section and `xtrm install project all` — that content reflects work preceding v2.4.0 (likely v2.3.x or earlier) that was never promoted into a versioned entry.
+---
+### 3. README.md Version History Table Is Missing v2.4.0
+The table in README.md ends at v2.3.0. It needs a `2.4.0` row with a short highlights string covering: tdd-guard removal, official Claude plugins, quality-gates wiring, XTRM-GUIDE.md addition.
+---
+### 4. README.md CLI Commands Table: `install project` Not Marked Deprecated
+The CLI Commands table in README.md (line 99) still shows:
+```
+| `install project <name>` | Install project skill |
+```
+On this branch (commit `f8e37f9`), `install project` was deprecated in favor of `xtrm init`. The XTRM-GUIDE.md was already updated on this branch to show:
+```
+| `install project <name>` | **Deprecated** legacy project-skill installer |
+```
+README.md was not updated to match. This is an in-branch drift between the two files.
+---
+### 5. README.md Skills Table Is Incomplete
+The README.md Skills section (lines 43–48) lists only 4 skills:
+```
+| `using-xtrm`          | Project | Session operating manual |
+| `documenting`         | Global  | SSOT documentation        |
+| `delegating`          | Global  | Task delegation           |
+| `orchestrating-agents`| Global  | Multi-model collaboration |
+```
+The XTRM-GUIDE.md Skills Catalog (already updated via PR #107) lists 23+ global skills. Post-sprint additions include: `test-planning`, `sync-docs`, `creating-service-skills`, `scoping-service-skills`, `updating-service-skills`, `using-service-skills`, `using-quality-gates`, and the full gitnexus skill suite. README is a summary, so full parity is not expected, but the gap is wide enough to be misleading.
+---
+### 6. ROADMAP.md Is Structurally Stale
+ROADMAP.md references architecture from v2.1.9 in several places:
+- The "CLI Architecture Improvements" section at the bottom still describes `cli/lib/sync.js`, `cli/lib/transform-gemini.js`, and multi-agent Gemini/Qwen support — all of which were removed in v2.0.0.
+- Phase 3 "Namespace Prefixes" references `cli/lib/transform-gemini.js` as a file to modify — this file no longer exists.
+- "Phase 5: Transformation Logic Refactoring" describes refactoring `cli/lib/resolver.js` and `cli/lib/transform-gemini.js` — both dead paths.
+- References a file at `file:///home/dawid/gemini/antigravity/brain/...` (absolute local path to dev machine — not a repo path).
+- `AGENTS.md` installation planned as "Next minor release" in the roadmap — but AGENTS.md now exists in the repo and its bd section was added this sprint.
+- The "Completed in v2.1.9" section at the top is frozen — should either be cleaned up or promoted to version-tagged completed items.
+ROADMAP.md was not touched in any of the sprint PRs. It is the most stale major doc.
+---
+### 7. XTRM-GUIDE.md: plugin.json Snippet Shows Wrong Version
+Line 121 of XTRM-GUIDE.md contains a code block example:
+```json
+{
+  "name": "xtrm-tools",
+  "version": "2.3.0",
+  ...
+}
+```
+This is the same drift as the README version badge — it was not updated when v2.4.0 was released.
+---
+### 8. XTRM-GUIDE.md Version History Is Missing v2.4.0
+The XTRM-GUIDE.md version history table (lines 344–350) stops at `2.3.0 | 2026-03-18`. A `2.4.0` row is missing with the same summary needed in README.md.
+---
+## Summary Matrix
+| Document | Issue | Severity |
+|----------|-------|----------|
+| README.md | Version badge shows 2.3.0, should be 2.4.0 | Medium |
+| README.md | CLI table: `install project` not marked deprecated | Medium |
+| README.md | Version history table missing 2.4.0 row | Medium |
+| README.md | Skills table is significantly incomplete vs XTRM-GUIDE | Low |
+| XTRM-GUIDE.md | Version badge and plugin.json snippet show 2.3.0 | Medium |
+| XTRM-GUIDE.md | Version history table missing 2.4.0 row | Medium |
+| CHANGELOG.md | No `[2.4.0]` entry exists despite release PR merging | High |
+| CHANGELOG.md | `[Unreleased]` content never promoted to a version | Medium |
+| ROADMAP.md | Multiple references to deleted files (transform-gemini.js, etc.) | Low |
+| ROADMAP.md | Local absolute file path reference (non-portable) | Low |
+| ROADMAP.md | Completed items (AGENTS.md) still listed as "planned" | Low |
+| plugin.json | `"version": "2.3.0"` — not bumped after v2.4.0 release | Medium |
+---
+## Recommended Actions (Priority Order)
+1. **Write the `[2.4.0]` CHANGELOG entry** — this is the most critical gap. The release shipped but has no record.
+2. **Update README.md version badge, example output, and version history table** to 2.4.0.
+3. **Update XTRM-GUIDE.md version badge, plugin.json snippet, and version history table** to 2.4.0.
+4. **Bump `plugins/xtrm-tools/.claude-plugin/plugin.json`** version to 2.4.0.
+5. **Mark `install project` as deprecated** in README.md CLI table (already done in XTRM-GUIDE.md on this branch).
+6. **Promote `[Unreleased]` CHANGELOG content** into the appropriate version entry, or tag it as part of the current branch work.
+7. **Audit ROADMAP.md** — remove or archive references to deleted files and multi-agent architecture; mark AGENTS.md item as done.
+---
+## What Is in Good Shape
+- **XTRM-GUIDE.md structure and content** is the most up-to-date of all docs — PR #107 did a thorough pass on skills catalog, Pi events table, and policy table. The branch commits have also extended it further (CLI command deprecation, project init details).
+- **Skills catalog** in XTRM-GUIDE.md accurately reflects the current `skills/` directory.
+- **Policy table** in XTRM-GUIDE.md matches `policies/*.json` on the branch.
+- **Hooks reference** in XTRM-GUIDE.md is correct including PostToolUse and Compact Save/Restore.
+- **MCP Servers section** in README.md is accurate post-PR #111 (official plugins called out separately).
+- **CHANGELOG.md `[Unreleased]` section** has real content and is not empty — it just needs to be promoted and a 2.4.0 entry added above it.
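
The headline finding in the report above (the package version is ahead of the newest versioned CHANGELOG entry) is the kind of check that can be automated. The following is a minimal sketch, assuming Keep a Changelog style `## [x.y.z]` headings and plain `x.y.z` version strings. The function names are hypothetical and this is not the package's own `doc_structure_analyzer.py`.

```python
import re

# Assumption: versioned entries look like "## [2.0.0]"; "[Unreleased]" is ignored
# because it contains no digits.
VERSION_HEADING = re.compile(r"^## \[(\d+)\.(\d+)\.(\d+)\]", re.MULTILINE)


def latest_changelog_version(changelog_text):
    """Return the highest versioned heading as an int tuple, or None if absent."""
    versions = [
        tuple(int(part) for part in match.groups())
        for match in VERSION_HEADING.finditer(changelog_text)
    ]
    return max(versions) if versions else None


def changelog_gap(package_version, changelog_text):
    """Return a message when the package is ahead of the changelog, else None."""
    # Assumption: package_version is a plain "x.y.z" string with no pre-release tag.
    current = tuple(int(part) for part in package_version.split("."))
    latest = latest_changelog_version(changelog_text)
    if latest is None or latest < current:
        shown = ".".join(str(part) for part in latest) if latest else "none"
        return f"package is at v{package_version} but latest CHANGELOG entry is {shown}"
    return None


# Illustrative input only; not the real CHANGELOG.md contents.
sample = "# Changelog\n\n## [Unreleased]\n- work in progress\n\n## [2.0.0]\n- initial\n"
print(changelog_gap("2.4.0", sample))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as `"2.10.0" < "2.4.0"`.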

package/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/without_skill/run-1/grading.json ADDED

@@ -0,0 +1,88 @@
+{
+  "expectations": [
+    {
+      "text": "Ran context_gatherer.py and reported bd closed issues or merged PRs with specific data",
+      "passed": false,
+      "evidence": "The result contains specific bd closed issue data (#38, #33 with dates) and merged PR data (#102\u2013#111 with titles and dates). However, the report header explicitly states 'Assessed by: Manual review (git log, gh issue/pr, file reads)' \u2014 context_gatherer.py was never invoked. The data is present but was gathered manually, not via the script. The expectation requires the script to have been run."
+    },
+    {
+      "text": "Ran doc_structure_analyzer.py and cited its structured output (STALE, EXTRACTABLE, MISSING, etc.)",
+      "passed": false,
+      "evidence": "No mention of doc_structure_analyzer.py anywhere in the result. The documentation analysis uses informal labels like 'Drifted' and 'Stale' from the agent's own judgment, not the structured taxonomy (STALE, EXTRACTABLE, MISSING) that the script would emit. The script was not run."
+    },
+    {
+      "text": "Detected the CHANGELOG version gap (package.json v2.4.0 vs CHANGELOG v2.0.0)",
+      "passed": true,
+      "evidence": "Section 2 of the result is titled 'CHANGELOG.md Has No v2.4.0 Entry' and is rated High severity and listed as the #1 recommended action: 'Write the [2.4.0] CHANGELOG entry \u2014 this is the most critical gap. The release shipped but has no record.' The result also references the package version as 2.4.1 (cli/package.json) and 2.4.0 (released via PR #110). The specific last CHANGELOG version is not named but the gap is clearly identified and substantiated."
+    },
+    {
+      "text": "Named at least one concrete next step with a specific file or action",
+      "passed": true,
+      "evidence": "The 'Recommended Actions (Priority Order)' section lists 7 concrete steps, each referencing specific files: e.g., '1. Write the [2.4.0] CHANGELOG entry', '2. Update README.md version badge, example output, and version history table to 2.4.0', '4. Bump plugins/xtrm-tools/.claude-plugin/plugin.json version to 2.4.0'."
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 2,
+    "total": 4,
+    "pass_rate": 0.5
+  },
+  "execution_metrics": {
+    "output_chars": 8123,
+    "transcript_chars": 0,
+    "notes": "No metrics.json present in outputs_dir. Only result.md was produced."
+  },
+  "timing": {
+    "executor_duration_seconds": 219.9,
+    "grader_duration_seconds": 0.0,
+    "total_duration_seconds": 219.9
+  },
+  "claims": [
+    {
+      "claim": "bd tracker has only 3 issues total",
+      "type": "factual",
+      "verified": false,
+      "evidence": "Unverifiable from available outputs \u2014 would require access to the bd issue tracker. The agent asserts this but no script output corroborates it."
+    },
+    {
+      "claim": "Package version is 2.4.1 in cli/package.json",
+      "type": "factual",
+      "verified": true,
+      "evidence": "The git status at conversation start shows 'M cli/package.json' and the result cites 'Actual package version: 2.4.1 (cli/package.json)'. The expectation references v2.4.0 as the released version (PR #110), and 2.4.1 is the current working state \u2014 consistent."
+    },
+    {
+      "claim": "ROADMAP.md references transform-gemini.js which no longer exists",
+      "type": "factual",
+      "verified": false,
+      "evidence": "The result asserts this but the file was not read in the available outputs \u2014 unverifiable without reading ROADMAP.md directly. However, CLAUDE.md confirms cli/lib/transform-gemini.js exists, which partially contradicts this claim."
+    },
+    {
+      "claim": "XTRM-GUIDE.md lists 23+ global skills",
+      "type": "factual",
+      "verified": false,
+      "evidence": "Stated in the result but cannot be confirmed from available outputs alone \u2014 XTRM-GUIDE.md was not included in the outputs directory."
+    }
+  ],
+  "user_notes_summary": {
+    "uncertainties": [],
+    "needs_review": [],
+    "workarounds": [],
+    "notes": "No user_notes.md present in outputs_dir."
+  },
+  "eval_feedback": {
+    "suggestions": [
+      {
+        "assertion": "Ran context_gatherer.py and reported bd closed issues or merged PRs with specific data",
+        "reason": "This assertion conflates two outcomes: (a) running the script, and (b) reporting specific data. The run passed on data quality even though the script was never used. Split into two assertions \u2014 one for script execution (verifiable via transcript tool calls) and one for data presence \u2014 so a manually-gathered result can fail the process check while still passing the data check."
+      },
+      {
+        "assertion": "Detected the CHANGELOG version gap (package.json v2.4.0 vs CHANGELOG v2.0.0)",
+        "reason": "The parenthetical 'CHANGELOG v2.0.0' is a specific claim about what the last versioned CHANGELOG entry is, but the result never confirms this. The assertion would be sharper if it required naming the last versioned CHANGELOG entry explicitly. As written, detecting any CHANGELOG gap satisfies it."
+      },
+      {
+        "reason": "No assertion checks whether the agent actually read or cited file contents (e.g., quoting a specific CHANGELOG line or README version badge). The result could have been constructed from PR descriptions alone without ever opening the doc files \u2014 and none of the assertions would catch that."
+      }
+    ],
+    "overall": "Two of the four assertions test process (script was run) which is good discrimination. However, the data assertions are weak \u2014 they can be satisfied by manual work that bypasses the skill's tooling entirely. The CHANGELOG gap assertion passes despite the specific version cited in the expectation (v2.0.0) never being confirmed in the output."
+  }
+}
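
The `summary` block in the grading file above can be derived from the `expectations` array. A minimal sketch of that derivation follows, assuming each expectation carries a boolean `passed` field as shown in the diff. The `summarize` helper is hypothetical and not part of the package.

```python
def summarize(expectations):
    """Count passed and failed expectations and compute the pass rate."""
    total = len(expectations)
    passed = sum(1 for item in expectations if item["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Guard against an empty expectation list.
        "pass_rate": passed / total if total else 0.0,
    }


# Same pass/fail pattern as the grading file above: two failed, two passed.
expectations = [
    {"passed": False},
    {"passed": False},
    {"passed": True},
    {"passed": True},
]
print(summarize(expectations))
```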

package/skills/sync-docs-workspace/iteration-2/eval-sprint-closeout/without_skill/run-1/timing.json ADDED

@@ -0,0 +1,5 @@
+{
+  "total_tokens": 48822,
+  "duration_ms": 219936,
+  "total_duration_seconds": 219.9
+}

package/skills/sync-docs-workspace/iteration-3/benchmark.json ADDED

@@ -0,0 +1,298 @@
+{
+  "metadata": {
+    "skill_name": "sync-docs",
+    "skill_path": "<path/to/skill>",
+    "executor_model": "<model-name>",
+    "analyzer_model": "<model-name>",
+    "timestamp": "2026-03-18T14:41:48Z",
+    "evals_run": [
+      1,
+      2,
+      3
+    ],
+    "runs_per_configuration": 3
+  },
+  "runs": [
+    {
+      "eval_id": 3,
+      "configuration": "with_skill",
+      "run_number": 1,
+      "result": {
+        "pass_rate": 1.0,
+        "passed": 4,
+        "failed": 0,
+        "total": 4,
+        "time_seconds": 119.1,
+        "tokens": 4247,
+        "tool_calls": 0,
+        "errors": 0
+      },
+      "expectations": [
+        {
+          "text": "Ran doc_structure_analyzer.py and cited its output",
+          "passed": true,
+          "evidence": "Audit scope section (result.md line 123) states: 'Phase 3 (structure analysis): Complete \u2014 doc_structure_analyzer.py ran successfully'. The README extraction table, missing docs/ list, and invalid-schema file list all derive directly from that tool's output."
+        },
+        {
+          "text": "Named at least 2 specific README sections with their suggested docs/ destination",
+          "passed": true,
+          "evidence": "The report names 4 README sections with explicit targets in a structured table: '### Skills' -> docs/skills.md, '## Policy System' -> docs/policies.md, '## Hooks Reference' -> docs/hooks.md, '## MCP Servers' -> docs/mcp-servers.md. Repeated in the Recommended Next Steps section."
+        },
+        {
+          "text": "Did NOT run --fix or create/edit any files (audit-only mode respected)",
+          "passed": true,
+          "evidence": "Report header states 'Mode: Audit only (no files modified)'. Audit scope section (lines 123-125) explicitly states Phase 4 (execute) and Phase 5 (validate) were 'NOT run'. Only one output file exists (result.md), which is the report itself."
+        },
+        {
+          "text": "Report is actionable with clear next steps",
+          "passed": true,
+          "evidence": "The 'Recommended Next Steps (when executing)' section lists 7 numbered, concrete actions with specific scripts, file paths, and targets \u2014 e.g., 'Cut entries for 2.1.x\u20132.4.0 using add_entry.py', 'Extract to docs/hooks.md (or promote hook-system-summary.md)', 'Consolidate with docs/mcp-servers-config.md \u2192 rename to docs/mcp-servers.md + add frontmatter'."
+        }
+      ],
+      "notes": [
+        "drift_detector.py skipped due to missing pyyaml \u2014 Phase 2 coverage is absent"
+      ]
+    },
+    {
+      "eval_id": 2,
+      "configuration": "with_skill",
+      "run_number": 1,
+      "result": {
+        "pass_rate": 1.0,
+        "passed": 4,
+        "failed": 0,
+        "total": 4,
+        "time_seconds": 105.3,
+        "tokens": 3011,
+        "tool_calls": 0,
+        "errors": 0
+      },
+      "expectations": [
+        {
+          "text": "Ran doc_structure_analyzer.py with --fix flag",
+          "passed": true,
+          "evidence": "result.md Command Run section shows: 'python3 /home/dawid/projects/xtrm-tools/skills/sync-docs/scripts/doc_structure_analyzer.py --fix --bd-remember' executed from the worktree."
+        },
+        {
+          "text": "Handled both MISSING scaffolds AND INVALID_SCHEMA files (or correctly reported none found)",
+          "passed": true,
+          "evidence": "result.md shows 5 MISSING scaffold files created (hooks.md, pi-extensions.md, mcp-servers.md, policies.md, skills.md) and 7 INVALID_SCHEMA files fixed with frontmatter injection (cleanup.md, delegation-architecture.md, hook-system-summary.md, mcp-servers-config.md, pi-extensions-migration.md, pre-install-cleanup.md, todo.md). Both categories were handled."
+        },
+        {
+          "text": "Ran bd remember and reported the memory key",
+          "passed": true,
+          "evidence": "result.md bd remember Outcome section shows 'stored: true' and 'key: sync-docs-fix-2026-03-18'. The script correctly resolved the main repo root from the worktree gitdir pointer chain."
+        },
+        {
+          "text": "Ran validate_doc.py docs/ after fixing to confirm results",
+          "passed": true,
+          "evidence": "result.md validate_doc.py Results section shows 'Result: 12/12 files passed'. All docs/ files passed schema validation after the fix run."
+        }
+      ],
+      "notes": [
+        "README extraction requires content judgment (Serena) \u2014 EXTRACTABLE flag not auto-resolved by --fix",
+        "CHANGELOG staleness (v2.0.0 last entry vs v2.4.0 current) requires manual entry for v2.1.0-v2.4.0 changes"
+      ]
+    },
+    {
+      "eval_id": 1,
+      "configuration": "with_skill",
+      "run_number": 1,
+      "result": {
+        "pass_rate": 1.0,
+        "passed": 4,
+        "failed": 0,
+        "total": 4,
+        "time_seconds": 203.6,
+        "tokens": 0,
+        "tool_calls": 0,
+        "errors": 0
+      },
+      "expectations": [
+        {
+          "text": "Ran context_gatherer.py and reported bd closed issues or merged PRs with specific data",
+          "passed": true,
+          "evidence": "phase1_context.json contains 20 bd_closed_issues with specific IDs and titles (e.g., jaggers-agent-tools-1lc 'P0 bug Remove dead code cli/index.js', jaggers-agent-tools-7dwo 'P0 bug Fix commit gate blocking...') and 10 merged_prs with SHAs, branch names, and dates (e.g., sha a7507e6c, 'Merge pull request #15 from Jaggerxtrm/release/2.0.1', 2026-03-13). result.md Phase 1 section summarizes this data accurately."
+        },
+        {
+          "text": "Ran doc_structure_analyzer.py and cited its structured output (STALE, EXTRACTABLE, MISSING, etc.)",
+          "passed": true,
+          "evidence": "result.md Phase 3 section explicitly cites STALE (CHANGELOG), EXTRACTABLE (README.md, 5 extraction candidates), MISSING (5 docs/ gaps), and INVALID_SCHEMA (7 existing docs files) with specific file names and counts. phase3_analysis.json confirms STALE for CHANGELOG with structured fields. Caveat: the saved phase3_analysis.json appears to be post-fix state (all existing_docs show OK, docs_gaps empty), so the MISSING/EXTRACTABLE/INVALID_SCHEMA findings are only visible in the result.md narrative, not the raw JSON. The structured terminology is accurately cited, but the saved JSON does not independently corroborate the pre-fix MISSING and EXTRACTABLE counts."
+        },
+        {
+          "text": "Detected the CHANGELOG version gap (package.json v2.4.0 vs CHANGELOG v2.0.0)",
+          "passed": true,
+          "evidence": "phase3_analysis.json changelog section explicitly records: package_version '2.4.0', latest_changelog_version '2.0.0', status 'STALE', and issues array containing 'package.json is at v2.4.0 but latest CHANGELOG entry is v2.0.0 \u2014 release is undocumented'. result.md also states this in Phase 3 and lists it as remaining action item #1."
+        },
+        {
+          "text": "Named at least one concrete next step with a specific file or action",
+          "passed": true,
+          "evidence": "result.md Remaining action items lists four concrete steps: (1) 'Add entries for v2.1.x, v2.2.0, v2.3.0, v2.4.0 using changelog/add_entry.py or manually', (2) 'Use Serena to move Skills, Policy System, Hooks Reference, MCP Servers sections into their respective docs/ scaffolds', (3) 'Run /documenting to update the 5 stale SSOT memories', (4) 'fill them in with Serena' for the 5 named scaffold files (hooks.md, pi-extensions.md, mcp-servers.md, policies.md, skills.md)."
+        }
+      ],
+      "notes": []
+    },
+    {
+      "eval_id": 3,
+      "configuration": "without_skill",
+      "run_number": 1,
+      "result": {
+        "pass_rate": 0.75,
+        "passed": 3,
+        "failed": 1,
+        "total": 4,
+        "time_seconds": 95.9,
+        "tokens": 0,
+        "tool_calls": 0,
+        "errors": 0
+      },
+      "expectations": [
+        {
+          "text": "Ran doc_structure_analyzer.py and cited its output",
+          "passed": false,
+          "evidence": "result.md explicitly states 'Method: Manual review of README.md sections against existing docs/ files.' There is no mention of doc_structure_analyzer.py anywhere in the output. The executor performed a manual audit rather than using the specified tool."
+        },
+        {
+          "text": "Named at least 2 specific README sections with their suggested docs/ destination",
+          "passed": true,
+          "evidence": "The report names 7 README sections with specific docs/ destinations. Examples: 'Hooks Reference' (lines 114-141) -> 'docs/hooks.md (exists)'; 'Policy System' (lines 66-86) -> 'docs/policies.md (stub, needs population)'; 'MCP Servers' (lines 143-158) -> 'docs/mcp.md (exists)'. The priority table at lines 103-111 clearly maps each section to a target file."
+        },
+        {
+          "text": "Did NOT run --fix or create/edit any files (audit-only mode respected)",
+          "passed": true,
+          "evidence": "The only output file is result.md (the report itself). No docs/ files were created or modified. All suggestions use language like 'Suggested action: Move to...' or 'Consider creating...' rather than performing the actions. The outputs directory contains only result.md."
+        },
+        {
+          "text": "Report is actionable with clear next steps",
+          "passed": true,
+          "evidence": "The report includes a priority ranking table (lines 103-111) with Priority (High/Medium/Low), Section, Action, and Target columns for all 7 sections. Each section analysis also ends with an explicit 'Suggested action:' line. The report concludes with a 'What README.md Should Retain' section describing the end-state goal."
+        }
+      ],
+      "notes": []
+    },
+    {
+      "eval_id": 2,
+      "configuration": "without_skill",
+      "run_number": 1,
+      "result": {
+        "pass_rate": 1.0,
+        "passed": 4,
+        "failed": 0,
+        "total": 4,
+        "time_seconds": 122.8,
+        "tokens": 2791,
+        "tool_calls": 0,
+        "errors": 0
+      },
+      "expectations": [
+        {
+          "text": "Ran doc_structure_analyzer.py with --fix flag",
+          "passed": true,
+          "evidence": "Step 2 of result.md: 'Ran `doc_structure_analyzer.py --root=<worktree> --fix`' \u2014 explicitly confirms the --fix flag was used, with a detailed list of 5 scaffolded MISSING files and 7 INVALID_SCHEMA files with frontmatter injected."
+        },
+        {
+          "text": "Handled both MISSING scaffolds AND INVALID_SCHEMA files (or correctly reported none found)",
+          "passed": true,
+          "evidence": "Step 2 of result.md lists both categories: 5 MISSING files scaffolded (hooks.md, pi-extensions.md, mcp-servers.md, policies.md, skills.md) and 7 INVALID_SCHEMA files with frontmatter injected (cleanup.md, delegation-architecture.md, hook-system-summary.md, mcp-servers-config.md, pi-extensions-migration.md, pre-install-cleanup.md, todo.md)."
+        },
+        {
+          "text": "Ran bd remember and reported the memory key",
+          "passed": true,
+          "evidence": "Step 4 of result.md shows the full `bd remember` command was run from the main repo root with `--key sync-docs-fix-2026-03-18`, and the outcome reported: 'Updated [sync-docs-fix-2026-03-18] \u2014 memory persisted successfully.'"
+        },
+        {
+          "text": "Ran validate_doc.py docs/ after fixing to confirm results",
+          "passed": true,
+          "evidence": "Step 3 of result.md: 'Ran `validate_doc.py docs/` on the worktree' with result 'Result: 12/12 files passed'. This was done after --fix, confirming validation as a post-fix confirmation step."
+        }
+      ],
+      "notes": []
+    },
+    {
+      "eval_id": 1,
+      "configuration": "without_skill",
+      "run_number": 1,
+      "result": {
+        "pass_rate": 0.25,
+        "passed": 1,
+        "failed": 3,
+        "total": 4,
+        "time_seconds": 217.1,
+        "tokens": 3172,
+        "tool_calls": 0,
+        "errors": 0
+      },
+      "expectations": [
+        {
+          "text": "Ran context_gatherer.py and reported bd closed issues or merged PRs with specific data",
+          "passed": false,
+          "evidence": "The agent never ran context_gatherer.py. It gathered context using raw git commands (git log --oneline --merges, git diff --stat 10d6433..HEAD). It did report specific merged PRs (#111, #110, #109) with descriptions, but the script was not used. The expectation requires the specific script to be invoked, not just the outcome data to be present."
+        },
+        {
+          "text": "Ran doc_structure_analyzer.py and cited its structured output (STALE, EXTRACTABLE, MISSING, etc.)",
+          "passed": false,
+          "evidence": "No mention of doc_structure_analyzer.py anywhere in the output. The structured output categories (STALE, EXTRACTABLE, MISSING) never appear. The agent assessed doc staleness manually by reading files and comparing with git history."
+        },
+        {
+          "text": "Detected the CHANGELOG version gap (package.json v2.4.0 vs CHANGELOG v2.0.0)",
+          "passed": false,
+          "evidence": "The output notes 'CHANGELOG.md (contains full history through v2.0.0)' and references the codebase being at v2.4.0, but the agent concluded CHANGELOG was 'accurate' and listed it under 'No Changes Needed'. It did not explicitly frame this as a version gap between package.json (v2.4.0) and CHANGELOG (v2.0.0), and it did not flag it as an issue requiring action. The gap was effectively missed because the agent treated the [Unreleased] section as sufficient coverage."
+        },
+        {
+          "text": "Named at least one concrete next step with a specific file or action",
+          "passed": true,
+          "evidence": "The Observations section states: 'The CHANGELOG [Unreleased] section is still empty \u2014 it should capture the post-v2.4.0 sprint work (global-first arch, guard-rules centralization, Pi drift checks, xtrm init project detection) before the next release.' This identifies a specific file (CHANGELOG.md), a specific section ([Unreleased]), and concrete content items to add."
+        }
+      ],
+      "notes": []
+    }
+  ],
+  "run_summary": {
+    "with_skill": {
+      "pass_rate": {
+        "mean": 1.0,
+        "stddev": 0.0,
+        "min": 1.0,
+        "max": 1.0
+      },
+      "time_seconds": {
+        "mean": 142.6667,
+        "stddev": 53.219,
+        "min": 105.3,
+        "max": 203.6
+      },
+      "tokens": {
+        "mean": 2419.3333,
+        "stddev": 2184.446,
+        "min": 0,
+        "max": 4247
+      }
+    },
+    "without_skill": {
+      "pass_rate": {
+        "mean": 0.6667,
+        "stddev": 0.3819,
+        "min": 0.25,
+        "max": 1.0
+      },
+      "time_seconds": {
+        "mean": 145.2667,
+        "stddev": 63.6469,
+        "min": 95.9,
+        "max": 217.1
+      },
+      "tokens": {
+        "mean": 1987.6667,
+        "stddev": 1731.8788,
+        "min": 0,
+        "max": 3172
+      }
+    },
+    "delta": {
+      "pass_rate": "+0.33",
+      "time_seconds": "-2.6",
+      "tokens": "+432"
+    }
+  },
+  "notes": []
+}
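
The `run_summary` figures above appear consistent with a mean and a sample standard deviation taken over the three per-eval results in each configuration. The sketch below recomputes two of them from the `runs` values. That the benchmark tooling uses exactly this formula and rounding is an assumption, and `summarize_metric` is a hypothetical name.

```python
from statistics import mean, stdev


def summarize_metric(values):
    """Aggregate one metric across runs (sample standard deviation, n - 1)."""
    return {
        "mean": round(mean(values), 4),
        "stddev": round(stdev(values), 4),
        "min": min(values),
        "max": max(values),
    }


# time_seconds for the three with_skill runs listed in the diff.
with_skill_time = [119.1, 105.3, 203.6]
# pass_rate for the three without_skill runs listed in the diff.
without_skill_pass_rate = [0.75, 1.0, 0.25]

print(summarize_metric(with_skill_time))
print(summarize_metric(without_skill_pass_rate))
```

With only three values per configuration the standard deviation is a rough indicator, so the spread figures are best read as descriptive rather than as evidence of a significant difference.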

package/skills/sync-docs-workspace/iteration-3/benchmark.md ADDED

@@ -0,0 +1,13 @@
+# Skill Benchmark: sync-docs
+**Model**: <model-name>
+**Date**: 2026-03-18T14:41:48Z
+**Evals**: 1, 2, 3 (3 runs each per configuration)
+## Summary
+| Metric | With Skill | Without Skill | Delta |
+|--------|------------|---------------|-------|
+| Pass Rate | 100% ± 0% | 67% ± 38% | +0.33 |
+| Time | 142.7s ± 53.2s | 145.3s ± 63.6s | -2.6s |
+| Tokens | 2419 ± 2184 | 1988 ± 1732 | +432 |
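
The summary table above can be rendered from the aggregated means and standard deviations in `benchmark.json`. This is a small sketch of one way to format the cells and deltas. The helper names and format strings are assumptions chosen to match the table's appearance, not the package's actual renderer.

```python
def percent_cell(mean_value, stddev_value):
    """Format a pass-rate cell such as '67% ± 38%' from fractional inputs."""
    return f"{mean_value * 100:.0f}% ± {stddev_value * 100:.0f}%"


def delta_cell(with_skill, without_skill, decimals):
    """Format a signed difference (with_skill minus without_skill)."""
    return f"{with_skill - without_skill:+.{decimals}f}"


# Means and standard deviations copied from run_summary in benchmark.json.
print(percent_cell(1.0, 0.0))
print(percent_cell(0.6667, 0.3819))
print(delta_cell(1.0, 0.6667, 2))
print(delta_cell(142.6667, 145.2667, 1))
print(delta_cell(2419.3333, 1987.6667, 0))
```

Note that the pass-rate delta is a fraction (`+0.33`) while the neighbouring cells are percentages, so a reader comparing columns has to convert units.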

package/skills/sync-docs-workspace/iteration-3/eval-doc-audit/eval_metadata.json ADDED

@@ -0,0 +1,27 @@
+{
+  "eval_id": 3,
+  "eval_name": "doc-audit",
+  "prompt": "Do a doc audit. I think the README has sections that should be in docs/ but I'm not sure which ones.",
+  "assertions": [
+    {
+      "text": "Ran doc_structure_analyzer.py and cited its output",
+      "passed": false,
+      "evidence": ""
+    },
+    {
+      "text": "Named at least 2 specific README sections with their suggested docs/ destination",
+      "passed": false,
+      "evidence": ""
+    },
+    {
+      "text": "Did NOT run --fix or create/edit any files (audit-only mode respected)",
+      "passed": false,
+      "evidence": ""
+    },
+    {
+      "text": "Report is actionable with clear next steps",
+      "passed": false,
+      "evidence": ""
+    }
+  ]
+}