npm - xtrm-tools - Versions diffs - 0.5.0 - Mend

xtrm-tools 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (333)

package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/without_skill/run-1/grading.json ADDED

@@ -0,0 +1,90 @@
+{
+  "expectations": [
+    {
+      "text": "Ran context_gatherer.py and reported bd closed issues or merged PRs with specific data",
+      "passed": false,
+      "evidence": "The agent never ran context_gatherer.py. It gathered context using raw git commands (git log --oneline --merges, git diff --stat 10d6433..HEAD). It did report specific merged PRs (#111, #110, #109) with descriptions, but the script was not used. The expectation requires the specific script to be invoked, not just the outcome data to be present."
+    },
+    {
+      "text": "Ran doc_structure_analyzer.py and cited its structured output (STALE, EXTRACTABLE, MISSING, etc.)",
+      "passed": false,
+      "evidence": "No mention of doc_structure_analyzer.py anywhere in the output. The structured output categories (STALE, EXTRACTABLE, MISSING) never appear. The agent assessed doc staleness manually by reading files and comparing with git history."
+    },
+    {
+      "text": "Detected the CHANGELOG version gap (package.json v2.4.0 vs CHANGELOG v2.0.0)",
+      "passed": false,
+      "evidence": "The output notes 'CHANGELOG.md (contains full history through v2.0.0)' and references the codebase being at v2.4.0, but the agent concluded CHANGELOG was 'accurate' and listed it under 'No Changes Needed'. It did not explicitly frame this as a version gap between package.json (v2.4.0) and CHANGELOG (v2.0.0), and it did not flag it as an issue requiring action. The gap was effectively missed because the agent treated the [Unreleased] section as sufficient coverage."
+    },
+    {
+      "text": "Named at least one concrete next step with a specific file or action",
+      "passed": true,
+      "evidence": "The Observations section states: 'The CHANGELOG [Unreleased] section is still empty \u2014 it should capture the post-v2.4.0 sprint work (global-first arch, guard-rules centralization, Pi drift checks, xtrm init project detection) before the next release.' This identifies a specific file (CHANGELOG.md), a specific section ([Unreleased]), and concrete content items to add."
+    }
+  ],
+  "summary": {
+    "passed": 1,
+    "failed": 3,
+    "total": 4,
+    "pass_rate": 0.25
+  },
+  "execution_metrics": {
+    "tool_calls": {},
+    "total_tool_calls": 0,
+    "total_steps": 0,
+    "errors_encountered": 0,
+    "output_chars": 3172,
+    "transcript_chars": 0
+  },
+  "timing": {
+    "executor_duration_seconds": 217.1,
+    "grader_duration_seconds": 0.0,
+    "total_duration_seconds": 217.1
+  },
+  "claims": [
+    {
+      "claim": "3 PRs merged in the most recent sprint: #111, #110, #109",
+      "type": "factual",
+      "verified": true,
+      "evidence": "Consistent with git log output cited in the result and with the repo's commit history (PR #111 referenced in CLAUDE.md recent commits section)"
+    },
+    {
+      "claim": "CHANGELOG.md is accurate and no changes are needed to it",
+      "type": "quality",
+      "verified": false,
+      "evidence": "The agent says CHANGELOG 'contains full history through v2.0.0' and the codebase is at v2.4.0. This means v2.1.0 through v2.4.0 entries are missing from CHANGELOG \u2014 a significant gap that contradicts the 'accurate' verdict. The [Unreleased] section does not substitute for missing versioned entries."
+    },
+    {
+      "claim": "XTRM-GUIDE.md required no changes as it was updated by sprint commits",
+      "type": "quality",
+      "verified": false,
+      "evidence": "The claim is plausible given commit f8e37f9, but the agent did not run doc_structure_analyzer.py or any systematic staleness check against XTRM-GUIDE.md \u2014 it relied on reading the file and comparing manually. Cannot fully verify without the script output."
+    },
+    {
+      "claim": "README was 'about 1.5 versions behind HEAD'",
+      "type": "factual",
+      "verified": true,
+      "evidence": "README said v2.3.0 while codebase was at v2.4.0 with unreleased post-v2.4.0 work on top \u2014 the characterization is reasonable given the 8 changes fixed."
+    }
+  ],
+  "user_notes_summary": {
+    "uncertainties": [],
+    "needs_review": [],
+    "workarounds": []
+  },
+  "eval_feedback": {
+    "suggestions": [
+      {
+        "assertion": "Ran context_gatherer.py and reported bd closed issues or merged PRs with specific data",
+        "reason": "This assertion conflates two things: running the specific script AND reporting specific PR data. An agent that skips the script but manually finds the same PR data would fail on process but produce similar outputs. The eval would be stronger if split: one assertion for script invocation (verifiable from transcript tool calls) and one for PR data quality (verifiable from output content)."
+      },
+      {
+        "assertion": "Detected the CHANGELOG version gap (package.json v2.4.0 vs CHANGELOG v2.0.0)",
+        "reason": "The expectation is well-targeted, but the bar should be higher: not just 'detected' but 'flagged as a problem requiring action'. The agent did notice CHANGELOG goes to v2.0.0 while the code is at v2.4.0, yet concluded it was accurate. An assertion that checks whether the gap was identified as a documentation deficiency (not just noted in passing) would be more discriminating."
+      },
+      {
+        "reason": "No assertion covers output quality for the README edits that were actually made \u2014 the primary work product of this run. The agent claims to have fixed 8 categories of README issues, but no expectation checks whether those changes are correct, complete, or even present in the file. This is the largest unguarded outcome."
+      }
+    ],
+    "overall": "The evals focus on process steps (run script X, detect gap Y) but miss the primary output (README changes). The CHANGELOG gap assertion is good but needs tighter framing. The script-invocation assertions are fragile without transcript access to verify tool calls."
+  }
+}
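The `summary` block in grading.json follows directly from the `passed` flags of the four expectations. A minimal sketch of that arithmetic, assuming the pass rate is simply passed divided by total (the `summarize` helper is hypothetical and not part of the package):

```python
import json

def summarize(expectations):
    """Recompute a summary block from a list of graded expectations."""
    total = len(expectations)
    passed = sum(1 for e in expectations if e["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Assumption: pass_rate is passed / total, rounded to two decimals.
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }

# Pass/fail flags copied from the four expectations in grading.json.
expectations = [
    {"passed": False},
    {"passed": False},
    {"passed": False},
    {"passed": True},
]
summary = summarize(expectations)
print(json.dumps(summary, indent=2))
```

The result agrees with the recorded summary: 1 passed, 3 failed, 4 total, pass rate 0.25.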

package/skills/sync-docs-workspace/iteration-3/eval-sprint-closeout/without_skill/run-1/timing.json ADDED

@@ -0,0 +1,5 @@
+{
+  "total_tokens": 61815,
+  "duration_ms": 217061,
+  "total_duration_seconds": 217.1
+}

package/skills/test-planning/SKILL.md ADDED

@@ -0,0 +1,208 @@
+---
+name: test-planning
+description: "Plans and creates test issues alongside implementation work using bd issue tracker. Activates at two points: (1) when creating an issue board from a spec/plan — classifies each issue by code layer and attaches the right testing strategy as a companion issue or AC gate, and (2) when closing an implementation issue — checks whether adequate test coverage was planned, improves existing test issues if needed. Use this skill PROACTIVELY whenever you see implementation issues being created without test coverage, when an epic is being broken into tasks, when closing/reviewing implementation work, or when the user asks about testing strategy for a set of issues. Also activate when you see bd create, bd children, bd close in a planning context."
+---
+# Test Planning
+This skill ensures every implementation issue has appropriate test coverage planned — not as an afterthought, but wired into the issue board from the start. It does NOT write test code; it classifies what needs testing, picks the right strategy, and creates bd issues that another agent (or human) will implement.
+## When This Fires
+### Trigger 1: Planning phase (issue board creation)
+When breaking a spec or plan into bd issues — typically during epic decomposition or `bd create --parent` sequences — scan each implementation issue and create companion test issues.
+### Trigger 2: Closure gate (implementation complete)
+When an implementation issue is being closed (`bd close`), check whether:
+- A test issue already exists for it (created in Trigger 1)
+- The test issue needs updating based on what was actually built (scope may have shifted)
+- Test coverage gaps appeared during implementation (new edge cases, API quirks discovered)
+If a test issue exists, review and improve it. If none exists, create one before or alongside closure.
+## Layer Detection
+Read the issue title, description, and any code paths mentioned to classify which architectural layer the work touches. This determines the testing strategy.
+### Core layer — pure domain logic
+Code that transforms data, computes values, or manages state, with no I/O. Examples:
+- Config parsing/merging
+- Data formatting (output renderers, serializers)
+- Computation (implied rates from prices, spread calculations)
+- State machines (session tracking, log rotation)
+- Validators, parsers, transformers
+**Signals**: "implement", "compute", "format", "parse", "validate", functions that take data and return data, no HTTP/DB/filesystem in the description.
+### Boundary layer — I/O interfaces and service contracts
+Code that crosses a system boundary: HTTP clients, API routes, database queries, file I/O, message queues. Examples:
+- API client methods (async_client, REST wrappers)
+- API route handlers
+- Database query functions
+- File readers/writers
+- External service integrations
+**Signals**: "endpoint", "API", "client", "route", "fetch", "query", URLs, ports, service names, request/response shapes mentioned.
+### Shell layer — orchestration and wiring
+Code that glues core + boundary together into user-facing features. Examples:
+- CLI commands that call a client, transform data, then output
+- Pipeline orchestrators
+- Command handlers
+- Workflow coordinators
+**Signals**: "command", "CLI", "subcommand", "mercury <verb>", user-facing behavior described, combines multiple components.
+## Testing Strategy Selection
+### By layer
+| Layer | Primary strategy | What to assert | Mock policy |
+|---|---|---|---|
+| Core | Unit + property tests | Input/output correctness, edge cases, invariants | No mocking needed — pure functions |
+| Boundary | Contract tests (live preferred) | Response schemas, field presence/types, status codes, param behavior | Live > contract > mock (see preference order below) |
+| Shell | Integration tests | Exit codes, output format validity, end-to-end wiring, error messages | Test the real thing via subprocess or function call |
+### By situation (override layer default when applicable)
+| Situation | Strategy | When to pick it |
+|---|---|---|
+| Interface unclear/evolving | TDD | Spec is vague, requirements shifting — tests define the contract |
+| Contract known up front | Spec-first | API routes documented, response shapes defined — write schema assertions |
+| Parsers/transforms/invariants | Property-based | The function should hold for any valid input, not just examples |
+| Service/API boundaries | Contract testing | Testing the seam between systems — assert schemas, not implementations |
+| Legacy code being wrapped | Characterization tests | Capture current behavior before changing it |
+| Simple CRUD paths | Example-based | Straightforward input→output, a few examples suffice |
+### Live-first preference
+When services are accessible, prefer this order:
+1. **Live tests** — hit real services, assert real responses. No mocking. This catches actual bugs: wrong URLs, changed schemas, auth issues, network edge cases. Mark with `@pytest.mark.live`.
+2. **Contract tests with recorded fixtures** — if live access is intermittent, record responses once and replay. Still validates schema, but won't catch drift.
+3. **Mocked tests** — last resort, only when no service access exists or for pure unit logic that has no I/O.
+The rationale: mocks encode your assumptions about the system, so a mocked test cannot tell you when those assumptions are wrong. Live tests validate reality.
+## Creating Test Issues
+### Naming convention
+Test issues are children of the same parent epic as the implementation issue. Name pattern:
+```
+Test: <what's being tested> — <strategy>
+```
+Examples:
+- "Test: rates/candles/stir/curve commands — CLI integration + contract tests"
+- "Test: config system — unit tests for load/save/override/env precedence"
+- "Test: async_client URL routing — live contract tests against all services"
+### Issue structure
+When creating a test issue with `bd create`:
+```
+bd create "Test: <description>" \
+  -t task -p <same or +1 from impl issue> \
+  --parent <same parent epic> \
+  -l testing,<layer>,<phase> \
+  --deps "blocks:<next-phase-issue-id>" \
+  -d "<structured description>"
+```
+The description should contain:
+1. **What implementation it covers** — reference the impl issue ID(s)
+2. **Layer classification** — which layer and why
+3. **Strategy chosen** — which testing approach and why
+4. **Test file structure** — where tests go in the project
+5. **What to assert** — specific assertions, not vague "test that it works"
+6. **AC** — when is this test issue done
+### Batching
+Don't create one test issue per implementation issue — that's overhead. Batch by layer and phase:
+- Group all core-layer issues from the same phase into one test issue
+- Group all boundary-layer issues into one contract test issue
+- Group all shell-layer issues into one integration test issue
+Example: if a phase ships 4 CLI commands + 1 client change + 1 config change:
+- 1 test issue for core (config unit tests)
+- 1 test issue for boundary (client contract tests)
+- 1 test issue for shell (CLI integration tests for all 4 commands)
+### Gating
+Test issues should gate the next phase of work. Use bd dependencies, or document the gate in the issue description:
+```
+This issue gates: .17 (analyze runner), .18 (spread), .19 (charts)
+Do not start Phase 3 until these tests pass.
+```
+## Closure Gate Behavior
+When an implementation issue is closed, check:
+1. **Does a test issue exist?** Run `bd children <parent>` and look for test issues that reference this impl issue.
+2. **Is the test issue still accurate?** Implementation often diverges from plan. Compare what was built (read the commit, check the code) against what the test issue specifies. Common drift:
+   - New subcommands added that aren't in the test plan
+   - API response shape different from what was expected
+   - Edge cases discovered during implementation
+   - Dependencies changed (a service turned out to be local-only)
+3. **Update if needed.** Use `bd update <test-issue-id>` or `bd comments add <test-issue-id>` to add new assertions, remove obsolete ones, or note discovered quirks.
+4. **If no test issue exists**, create one. Classify the layer, pick the strategy, write the assertions. This is the safety net for work that was done without planning tests upfront.
+## Examples
+### Planning phase — epic decomposition
+Given an epic with these children:
+```
+.10 Scaffold CLI project structure
+.11 Implement logging system
+.12 Implement config system
+.13 Implement output formatting
+.14 Implement async HTTP client
+```
+Create:
+```
+bd create "Test: P1 core — unit tests for config, log, session, output" \
+  -t task -p 1 --parent <epic> -l testing,core,phase-1 \
+  -d "Unit + property tests for pure domain logic...
+  Covers: .11, .12, .13
+  Strategy: unit tests (core layer, pure logic, no I/O)
+  ..."
+bd create "Test: P1 boundary — live contract tests for async client" \
+  -t task -p 1 --parent <epic> -l testing,boundary,phase-1 \
+  -d "Contract tests against live services...
+  Covers: .14
+  Strategy: contract tests, live-first (boundary layer, HTTP I/O)
+  ..."
+```
+### Closure gate — implementation done, test issue exists
+Agent closes `.15` (data commands: rates, candles, stir, curve). Finds existing test issue `.26`. Reads `.26` description, compares against what `.15` actually built:
+- `.15` added `rates iorb` subcommand not in original test plan → update `.26` to include IORB assertion
+- `.15` discovered STIR implied rates are client-side computation → add property test: `implied_rate == 100 - price` for any valid price
+- Update `.26` with `bd update` or add a comment
+### Closure gate — no test issue exists
+Agent closes a feature issue that was done ad-hoc. No test issue found. Agent:
+1. Reads the implementation to classify the layer
+2. Picks strategy
+3. Creates test issue as child of same parent
+4. Documents what to assert based on the actual code
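The layer detection described in this SKILL.md can be approximated with keyword matching on the issue title. The sketch below is only an illustration: the signal words come from the "Signals" lines of the skill, but the `classify_layer` function, the regexes, and the precedence (shell before boundary, core as the fallback) are assumptions, not something the skill specifies.

```python
import re

# Signal words taken from the "Signals" lines in the skill text.
# The precedence order (shell, then boundary, then core) is an assumption.
SHELL_SIGNALS = re.compile(r"\b(?:cli|command|subcommand)s?\b", re.IGNORECASE)
BOUNDARY_SIGNALS = re.compile(
    r"\b(?:endpoint|api|client|route|fetch|query)s?\b", re.IGNORECASE
)

def classify_layer(title: str) -> str:
    """Return 'shell', 'boundary', or 'core' for an issue title (hypothetical helper)."""
    if SHELL_SIGNALS.search(title):
        return "shell"
    if BOUNDARY_SIGNALS.search(title):
        return "boundary"
    # No I/O or orchestration signals found: treat as pure domain logic.
    return "core"

# Titles from the epic decomposition example in the skill.
titles = [
    "Scaffold CLI project structure",
    "Implement logging system",
    "Implement config system",
    "Implement output formatting",
    "Implement async HTTP client",
]
layers = {title: classify_layer(title) for title in titles}
for title, layer in layers.items():
    print(f"{layer:9} {title}")
```

A keyword pass like this will misclassify titles that lack any signal word, which is why the skill asks the agent to read the description and code paths as well.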

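The closure-gate example in the test-planning skill calls for a property test on `implied_rate == 100 - price`. A minimal sketch of such a check, using only the standard library's `random` module rather than a property-testing library; the `implied_rate` function, the sampled price range, and the chosen invariants are all assumptions made for illustration:

```python
import random

def implied_rate(price: float) -> float:
    """Hypothetical stand-in for the client-side computation: rate = 100 - price."""
    return 100.0 - price

def check_implied_rate_properties(samples: int = 1000, seed: int = 0) -> int:
    """Check invariants on randomly sampled prices; returns the number of samples checked."""
    rng = random.Random(seed)
    checked = 0
    for _ in range(samples):
        # Assumed range of valid prices; adjust to the real instrument.
        low = rng.uniform(90.0, 109.0)
        high = low + rng.uniform(0.01, 1.0)
        # Invariant 1: converting the rate back recovers the price.
        assert abs((100.0 - implied_rate(low)) - low) < 1e-9
        # Invariant 2: a higher price always implies a lower rate.
        assert implied_rate(high) < implied_rate(low)
        checked += 1
    # Invariant 3: a price of exactly 100 implies a zero rate.
    assert implied_rate(100.0) == 0.0
    return checked

checked = check_implied_rate_properties()
print(f"checked {checked} sampled prices")
```

In a real suite the same invariants would more likely be expressed with a property-testing library, which adds input shrinking and better failure reports.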
package/skills/test-planning/evals/evals.json ADDED

@@ -0,0 +1,23 @@
+{
+  "skill_name": "test-planning",
+  "evals": [
+    {
+      "id": 1,
+      "prompt": "I've got an epic for a new notification service (notif-3a). I just created 8 child issues: .1 is the Postgres schema migration, .2 is the async message consumer (reads from RabbitMQ), .3 is the template renderer (Jinja2, pure python), .4 is the delivery client (calls Twilio/SendGrid APIs), .5 is the REST API for managing preferences, .6 is the CLI tool for ops to send test notifications, .7 is the retry/dead-letter handler, .8 is the rate limiter. Break down what testing each of these needs and create the bd issues.",
+      "expected_output": "Should detect layers: .1 (boundary/DB), .2 (boundary/MQ), .3 (core/pure), .4 (boundary/external API), .5 (boundary/API), .6 (shell/CLI), .7 (core/state machine), .8 (core/algorithm). Should batch into ~3 test issues: core tests (.3, .7, .8), boundary/contract tests (.1, .2, .4, .5), shell integration (.6). Should use property-based for rate limiter, contract tests for external APIs, characterization or spec-first for DB schema.",
+      "files": []
+    },
+    {
+      "id": 2,
+      "prompt": "I just finished implementing the data ingestion pipeline (issue data-pipe-9f.4) — it reads CSVs from S3, validates schemas, transforms column types, and writes to Postgres. The parent epic is data-pipe-9f. Can you close it for me? bd close data-pipe-9f.4 --reason 'pipeline implemented and deployed'",
+      "expected_output": "Should trigger closure gate behavior: check bd children data-pipe-9f for existing test issues. Since none exist, should create a test issue covering all layers: unit tests for schema validation and column transforms (core), contract tests for S3 reads and Postgres writes (boundary), integration test for end-to-end pipeline (shell). Should NOT just close the issue without checking for test coverage.",
+      "files": []
+    },
+    {
+      "id": 3,
+      "prompt": "We have an epic tracker-7b with 5 implementation issues done and 1 test issue (tracker-7b.6) that was created during planning. But during implementation of .3 (the websocket price feed handler), we discovered the feed sometimes sends malformed JSON that the parser needs to handle gracefully, and .5 (the position calculator) ended up also doing margin calculations which weren't in the original plan. Can you review tracker-7b.6 and update it?",
+      "expected_output": "Should read tracker-7b.6, identify drift: (1) malformed JSON handling in websocket parser is a new edge case — add property-based tests for parser robustness, (2) margin calculations in position calculator are new core logic — add unit tests. Should update the existing test issue via bd update or bd comments, not create a new one. Should note that the parser needs characterization tests if there's existing behavior to preserve.",
+      "files": []
+    }
+  ]
+}
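Eval 1 expects the eight child issues to be batched into roughly three test issues, one per layer. The grouping step can be sketched as follows; the layer assignments are copied from the eval's `expected_output`, while the `batch_by_layer` helper is hypothetical and ignores the phase dimension that the skill also batches on:

```python
from collections import defaultdict

# Layer per issue, as listed in the expected_output of eval 1.
issue_layers = {
    ".1": "boundary",
    ".2": "boundary",
    ".3": "core",
    ".4": "boundary",
    ".5": "boundary",
    ".6": "shell",
    ".7": "core",
    ".8": "core",
}

def batch_by_layer(layers):
    """Group issue IDs by layer so each layer gets a single test issue."""
    batches = defaultdict(list)
    for issue_id, layer in layers.items():
        batches[layer].append(issue_id)
    return dict(batches)

batches = batch_by_layer(issue_layers)
for layer, issue_ids in sorted(batches.items()):
    print(f"Test issue for {layer}: covers {', '.join(issue_ids)}")
```

This yields the three batches named in the eval: core (.3, .7, .8), boundary (.1, .2, .4, .5), and shell (.6).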

package/skills/updating-service-skills/SKILL.md ADDED

@@ -0,0 +1,136 @@
+---
+name: updating-service-skills
+description: >-
+  Detect implementation drift and sync expert persona documentation.
+  Activates automatically via PostToolUse hook when files in a registered
+  service territory are modified. Use when a skill's documentation has
+  fallen behind the actual implementation.
+hooks:
+  PostToolUse:
+    - matcher: "Write|Edit"
+      hooks:
+        - type: command
+          command: "python3 \"$CLAUDE_PROJECT_DIR/.claude/skills/updating-service-skills/scripts/drift_detector.py\" check-hook"
+          timeout: 10
+allowed-tools: Read, Write, Grep, Glob
+---
+# Updating Service Skills
+## Role: The Librarian
+You are the **Service Skills Librarian**. Your job is to keep expert persona
+documentation in sync with the actual implementation as the codebase evolves.
+---
+## Automatic Drift Detection
+After any `Write` or `Edit` operation, the `PostToolUse` hook runs
+`drift_detector.py check-hook`. The script reads the modified file path from
+the JSON on stdin and checks whether it falls within a registered service
+territory.
+If drift is detected, you will see this in your context:
+```
+[Skill Sync]: Implementation drift detected in 'db-expert'.
+File 'src/db/users.ts' was modified.
+Use '/updating-service-skills' to sync the Database Expert documentation.
+```
+---
+## Manual Sync Process
+### Step 1 — Scan for all drift
+```bash
+python3 "$CLAUDE_PROJECT_DIR/.claude/skills/updating-service-skills/scripts/drift_detector.py" scan
+```
+### Step 2 — Read the current skill
+```
+Read: .claude/skills/<service-id>/SKILL.md
+```
+### Step 3 — Analyze changes using Serena tools
+Use Serena LSP tools (not raw file reads) to understand what changed:
+```
+get_symbols_overview(<modified-file>, depth=1)
+find_symbol(<changed-function>, include_body=True)
+search_for_pattern("<new-pattern>")
+```
+### Step 4 — Update the skill documentation
+- Add new patterns or conventions discovered
+- Update the Failure Modes table if new exception handlers were added
+- Update log patterns in `scripts/log_hunter.py` if new log strings were found
+- Update territory patterns in `service-registry.json` if the scope expanded
+- Preserve `<!-- SEMANTIC_START --> ... <!-- SEMANTIC_END -->` blocks
+### Step 5 — Mark as synced
+```bash
+python3 "$CLAUDE_PROJECT_DIR/.claude/skills/updating-service-skills/scripts/drift_detector.py" \
+  sync <service-id>
+```
+---
+## Drift Scenarios
+### New error pattern added to codebase
+1. Run `search_for_pattern("raise.*New.*Error|logger.error.*new")` to find it
+2. Add it to the PATTERNS list in `scripts/log_hunter.py` with the correct severity
+3. Update the Troubleshooting table in SKILL.md
+### Territory expanded (new directory added)
+1. Check whether the current glob patterns in `service-registry.json` cover the new files
+2. If not, update the `territory` array in `service-registry.json`
+3. Sync the timestamp (`drift_detector.py sync <service-id>`)
+### Major refactor changes conventions
+1. Run `get_symbols_overview` on all changed files
+2. Rewrite the relevant Guidelines section in SKILL.md
+3. Update `health_probe.py` if the table structure or ports changed
+---
+## Tool Restrictions
+Write to:
+- ✅ `.claude/skills/*/SKILL.md` — skill documentation updates
+- ✅ `.claude/skills/service-registry.json` — territory and sync timestamp updates
+Do not:
+- ❌ Modify source code (read-only access to service territories)
+- ❌ Delete skills or registry entries
+---
+## Sync Output Format
+```
+✅ Skill Synced: `<service-id>`
+Updated:
+- log_hunter.py: added 2 new patterns from exception handlers
+- SKILL.md: Failure Modes table updated with OAuth expiry scenario
+- Territory: unchanged
+Next sync: triggers on next modification to <territory-patterns>
+```
+---
+## Related Skills
+- `/using-service-skills` — Discover and activate expert personas
+- `/creating-service-skills` — Scaffold new expert personas
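The territory check that `drift_detector.py check-hook` performs can be sketched as glob matching of the modified path against each service's patterns. Everything below is an assumption made for illustration: the registry layout, the `tool_input.file_path` payload shape, and the `service_for_path` helper are hypothetical, since neither the script nor `service-registry.json` appears in this diff.

```python
import json
from fnmatch import fnmatch

# Hypothetical registry layout; the real service-registry.json schema is not shown here.
REGISTRY = {
    "db-expert": {"territory": ["src/db/*"]},
}

def service_for_path(file_path, registry):
    """Return the first service whose territory globs match the path, else None."""
    for service_id, entry in registry.items():
        if any(fnmatch(file_path, pattern) for pattern in entry["territory"]):
            return service_id
    return None

# Assumed payload shape; a real hook would read this with json.load(sys.stdin).
payload = json.loads('{"tool_input": {"file_path": "src/db/users.ts"}}')
modified = payload["tool_input"]["file_path"]
service = service_for_path(modified, REGISTRY)
if service is not None:
    print(f"[Skill Sync]: Implementation drift detected in '{service}'.")
    print(f"File '{modified}' was modified.")
```

Note that `fnmatch` lets `*` match across `/`, so `src/db/*` also covers nested files; a stricter matcher may be preferable if territories must not overlap.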