agentic-qe 3.4.1 → 3.4.2

Files changed (496)

package/v3/assets/skills/.validation/skill-validation-mcp-integration.md ADDED

@@ -0,0 +1,250 @@
+# Skill Validation MCP Integration Specification
+**Version**: 1.0.0
+**Created**: 2026-02-02
+**Status**: Active
+## Overview
+All skill validation components MUST use AQE MCP tool calls for shared learning. This ensures validation patterns, outcomes, and insights are stored in the ReasoningBank and shared across the QE agent fleet.
+## Required MCP Tool Calls
+### 1. Pattern Storage
+Skills MUST store successful validation patterns for future reference:
+```typescript
+// After successful validation
+mcp__agentic-qe__memory_store({
+  key: `skill-validation-${skillName}-${timestamp}`,
+  value: {
+    skillName: string,
+    trustTier: number,
+    validationResult: ValidationResult,
+    model: string,
+    passRate: number,
+    patterns: LearnedPattern[]
+  },
+  namespace: "skill-validation"
+})
+```
+### 2. Pattern Query
+Before validation, query for existing patterns:
+```typescript
+// Query learned patterns
+const existingPatterns = await mcp__agentic-qe__memory_query({
+  pattern: `skill-validation-${skillName}-*`,
+  namespace: "skill-validation",
+  limit: 10
+})
+// Use patterns to inform validation expectations
+```
+### 3. Outcome Tracking
+Track all validation outcomes for the learning feedback loop:
+```typescript
+// Record validation outcome
+mcp__agentic-qe__test_outcome_track({
+  testId: `skill-${skillName}-${evalId}`,
+  generatedBy: agentId,
+  patternId: usedPatternId,
+  passed: boolean,
+  coverage: {
+    lines: number,
+    branches: number,
+    functions: number
+  },
+  executionTime: number,
+  flaky: false
+})
+```
+### 4. Cross-Agent Learning Share
+Share validation insights with the learning coordinator:
+```typescript
+// Share learning with fleet
+mcp__agentic-qe__memory_share({
+  sourceAgentId: currentAgentId,
+  targetAgentIds: ["qe-learning-coordinator", "qe-queen-coordinator"],
+  knowledgeDomain: "skill-validation",
+  data: {
+    skillName: string,
+    insights: ValidationInsight[],
+    recommendations: string[]
+  }
+})
+```
+### 5. Quality Gate Integration
+Update skill quality scores via quality assessment:
+```typescript
+// After validation completes
+mcp__agentic-qe__quality_assess({
+  target: `skill:${skillName}`,
+  metrics: {
+    passRate: number,
+    schemaCompliance: boolean,
+    validatorPassed: boolean,
+    evalSuiteScore: number
+  },
+  updateQualityScore: true
+})
+```
+## Memory Namespace Structure
+```
+aqe/skill-validation/
+├── patterns/
+│   ├── security-testing/*       - Security validation patterns
+│   ├── accessibility-testing/*  - A11y validation patterns
+│   └── {skill-name}/*          - Per-skill patterns
+├── outcomes/
+│   ├── by-skill/               - Outcomes grouped by skill
+│   ├── by-model/               - Outcomes grouped by model
+│   └── by-date/                - Outcomes grouped by date
+├── insights/
+│   ├── cross-model/            - Cross-model behavior differences
+│   ├── regressions/            - Detected regressions
+│   └── improvements/           - Improvement recommendations
+└── confidence/
+    └── {skill-name}/           - Confidence scores per skill
+```
+## Eval Runner Integration
+The `scripts/run-skill-eval.ts` evaluation runner MUST:
+1. **Before running evals**: Query ReasoningBank for learned patterns
+2. **During evals**: Track each test case outcome
+3. **After evals**: Store patterns and share learning
+4. **On regression**: Alert via quality gate
+```typescript
+// Evaluation runner pseudocode
+async function runSkillEval(skill: string, model: string) {
+  // 1. Query existing patterns
+  const patterns = await mcp__agentic-qe__memory_query({
+    pattern: `skill-validation-${skill}-*`,
+    namespace: "skill-validation"
+  });
+  // 2. Run evaluation test cases
+  const results = await runTestCases(skill, model, patterns);
+  // 3. Track outcomes
+  for (const result of results) {
+    await mcp__agentic-qe__test_outcome_track({
+      testId: result.id,
+      passed: result.passed,
+      // ...
+    });
+  }
+  // 4. Store new patterns
+  await mcp__agentic-qe__memory_store({
+    key: `skill-validation-${skill}-${Date.now()}`,
+    value: { results, patterns: extractPatterns(results) },
+    namespace: "skill-validation"
+  });
+  // 5. Share learning
+  await mcp__agentic-qe__memory_share({
+    sourceAgentId: "eval-runner",
+    targetAgentIds: ["qe-learning-coordinator"],
+    knowledgeDomain: "skill-validation",
+    data: summarizeResults(results)
+  });
+  // 6. Update quality gate
+  await mcp__agentic-qe__quality_assess({
+    target: `skill:${skill}`,
+    metrics: calculateMetrics(results),
+    updateQualityScore: true
+  });
+  return results;
+}
+```
+## Validator Script Integration
+Bash validators should call the MCP tools via the CLI wrapper:
+```bash
+# In validate.sh after validation
+store_validation_result() {
+  local skill="$1"
+  local result="$2"
+  npx aqe memory store \
+    --key "skill-validation-${skill}-$(date +%s)" \
+    --value "$result" \
+    --namespace skill-validation
+}
+track_outcome() {
+  local test_id="$1"
+  local passed="$2"
+  npx aqe feedback track \
+    --test-id "$test_id" \
+    --passed "$passed"
+}
+```
+## CI Pipeline Integration
+GitHub Actions workflow MUST use MCP tools:
+```yaml
+- name: Query Baseline Patterns
+  run: |
+    npx aqe memory query \
+      --pattern "skill-validation-${{ matrix.skill }}-*" \
+      --namespace skill-validation \
+      --limit 5 \
+      --output baseline-patterns.json
+- name: Run Validation
+  run: |
+    npx ts-node scripts/run-skill-eval.ts \
+      --skill "${{ matrix.skill }}" \
+      --model "${{ matrix.model }}" \
+      --use-mcp-learning
+- name: Share Results with Fleet
+  run: |
+    npx aqe memory share \
+      --source "ci-validator" \
+      --targets "qe-learning-coordinator" \
+      --domain "skill-validation" \
+      --data-file validation-results.json
+```
+## Success Criteria
+- [ ] All validators call `memory_store` after validation
+- [ ] Eval runner queries patterns before running
+- [ ] Outcomes tracked via `test_outcome_track`
+- [ ] Learning shared with coordinator
+- [ ] Quality gate updated with validation metrics
+- [ ] CI pipeline uses MCP tools for learning
+## References
+- [ADR-056: Deterministic Skill Validation System](../v3/implementation/adrs/ADR-056-skill-validation-system.md)
+- [ADR-021: QE ReasoningBank](../v3/implementation/adrs/v3-adrs.md#adr-021)
+- [ADR-023: Quality Feedback Loop](../v3/implementation/adrs/v3-adrs.md#adr-023)
+- [AQE MCP Tools Reference](../reference/aqe-fleet.md)
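
Note on the key scheme in the added spec: Pattern Storage writes keys shaped like `skill-validation-${skillName}-${timestamp}`, and Pattern Query reads them back with the glob `skill-validation-${skillName}-*`. The following is a minimal sketch of that store/query pairing, not the package's implementation. `InMemoryPatternStore`, `buildValidationKey`, and `globToRegExp` are hypothetical names, and the assumption that `*` matches any run of characters is mine; the real `memory_query` tool may match differently.

```typescript
// Hypothetical in-memory stand-in for the memory_store / memory_query tools
// named in the spec. The real AQE MCP tools may behave differently.
type StoredEntry = { key: string; namespace: string; value: unknown };

const NAMESPACE = "skill-validation";

// Builds a key in the `skill-validation-${skillName}-${timestamp}` shape from the spec.
function buildValidationKey(skillName: string, timestamp: number): string {
  return `skill-validation-${skillName}-${timestamp}`;
}

// Converts a glob with `*` wildcards into an anchored RegExp.
// Assumption: `*` matches any sequence of characters, including hyphens.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

class InMemoryPatternStore {
  private entries: StoredEntry[] = [];

  store(key: string, value: unknown, namespace: string = NAMESPACE): void {
    this.entries.push({ key, namespace, value });
  }

  // Returns up to `limit` matching entries, most recently stored first.
  query(pattern: string, namespace: string = NAMESPACE, limit = 10): StoredEntry[] {
    const matcher = globToRegExp(pattern);
    return this.entries
      .filter((e) => e.namespace === namespace && matcher.test(e.key))
      .reverse()
      .slice(0, limit);
  }
}
```

Under this assumed matching rule, a query for `skill-validation-security-*` would also return keys stored for a skill named `security-testing`, because one skill name is a prefix of the other. If the real tool behaves the same way, a delimiter that cannot appear in skill names would avoid the overlap.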

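Note on the Eval Runner Integration block: tool names such as `mcp__agentic-qe__memory_query` contain a hyphen, so they cannot be called as bare TypeScript identifiers, and the spec labels its own block as pseudocode. One way the six steps could be made executable is to route the calls through an injected client, sketched below. The `McpClient` interface, its `callTool` method, the short tool names, and the `runSkillEvalWith` helper are all hypothetical and not taken from the package; only the step order and argument fields follow the spec.

```typescript
// Hypothetical client interface. Tool names and argument fields follow the
// spec; the wrapper and its signatures are assumptions for illustration.
type ToolArgs = Record<string, unknown>;

interface McpClient {
  callTool(name: string, args: ToolArgs): Promise<unknown>;
}

interface EvalResult {
  id: string;
  passed: boolean;
}

// Runs the spec's six steps in order against an injected client.
async function runSkillEvalWith(
  client: McpClient,
  skill: string,
  runTestCases: (patterns: unknown) => Promise<EvalResult[]>,
): Promise<EvalResult[]> {
  const namespace = "skill-validation";

  // 1. Query existing patterns
  const patterns = await client.callTool("memory_query", {
    pattern: `skill-validation-${skill}-*`,
    namespace,
  });

  // 2. Run evaluation test cases
  const results = await runTestCases(patterns);

  // 3. Track each outcome
  for (const r of results) {
    await client.callTool("test_outcome_track", { testId: r.id, passed: r.passed });
  }

  // Fraction of passing cases; defined as 0 for an empty run.
  const passRate =
    results.length === 0 ? 0 : results.filter((r) => r.passed).length / results.length;

  // 4. Store new patterns
  await client.callTool("memory_store", {
    key: `skill-validation-${skill}-${Date.now()}`,
    value: { results },
    namespace,
  });

  // 5. Share learning
  await client.callTool("memory_share", {
    sourceAgentId: "eval-runner",
    targetAgentIds: ["qe-learning-coordinator"],
    knowledgeDomain: "skill-validation",
    data: { skill, passRate },
  });

  // 6. Update quality gate
  await client.callTool("quality_assess", {
    target: `skill:${skill}`,
    metrics: { passRate },
    updateQualityScore: true,
  });

  return results;
}
```

Injecting the client keeps the step order testable with a recording fake, without a live MCP server.
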
package/v3/assets/skills/.validation/templates/eval.template.yaml ADDED

@@ -0,0 +1,366 @@
+# =============================================================================
+# AQE Skill Evaluation Test Suite Template
+# Copy this template to: .claude/skills/{skill-name}/evals/{skill-name}.yaml
+# =============================================================================
+#
+# This evaluation suite validates skill behavior through:
+# 1. Input/expected-output test cases
+# 2. Multi-model consistency testing
+# 3. Semantic validation of reasoning quality
+# 4. AQE MCP integration for shared learning (NEW)
+# 5. ReasoningBank pattern storage (NEW)
+#
+# Schema: docs/schemas/skill-eval.schema.json
+# MCP Spec: docs/specs/skill-validation-mcp-integration.md
+# Runner: scripts/run-skill-eval.ts
+#
+# For a comprehensive example, see: docs/templates/security-testing-eval.template.yaml
+# =============================================================================
+skill: REPLACE_WITH_SKILL_NAME
+version: 1.0.0
+description: >
+  Evaluation test suite for REPLACE_WITH_SKILL_NAME skill.
+  Tests core functionality across multiple models to ensure consistent,
+  high-quality output.
+# =============================================================================
+# Multi-Model Configuration
+# =============================================================================
+# Test across multiple models to ensure consistent behavior and identify
+# model-specific quirks. Results are compared to detect variance.
+models_to_test:
+  - claude-3.5-sonnet    # Primary model (high accuracy expected)
+  - claude-3-haiku       # Fast model (ensure it meets minimum quality)
+  # - gpt-4o             # Optional: Cross-vendor validation
+# =============================================================================
+# MCP Integration Configuration (NEW in v1.4)
+# =============================================================================
+# Per docs/specs/skill-validation-mcp-integration.md
+# Enable to integrate with AQE ReasoningBank for shared learning.
+mcp_integration:
+  enabled: true
+  namespace: skill-validation
+  # Query existing patterns before running evals
+  query_patterns: true
+  # Track each test outcome for the learning feedback loop
+  track_outcomes: true
+  # Store successful patterns after evals complete
+  store_patterns: true
+  # Share learning with fleet coordinator agents
+  share_learning: true
+  # Update quality gate with validation metrics
+  update_quality_gate: true
+  # Agents to share learning with
+  target_agents:
+    - qe-learning-coordinator
+    - qe-queen-coordinator
+# =============================================================================
+# ReasoningBank Learning Configuration (NEW in v1.4)
+# =============================================================================
+learning:
+  store_success_patterns: true
+  store_failure_patterns: true
+  pattern_ttl_days: 90
+  min_confidence_to_store: 0.7
+  cross_model_comparison: true
+# =============================================================================
+# Result Format Configuration (NEW in v1.4)
+# =============================================================================
+result_format:
+  json_output: true
+  markdown_report: false
+  include_raw_output: false
+  include_timing: true
+  include_token_usage: true
+# =============================================================================
+# Environment Setup
+# =============================================================================
+setup:
+  required_tools: []
+  # - tool_name
+  environment_variables: {}
+  # ENV_VAR: value
+  fixtures: []
+  # - name: fixture_name
+  #   path: fixtures/fixture_name.json
+  #   content: |
+  #     { "key": "value" }
+# =============================================================================
+# Test Cases
+# =============================================================================
+# Each test case validates a specific behavior or scenario.
+# IDs follow the pattern: tc{NNN}_{short_description}
+# =============================================================================
+test_cases:
+  # -------------------------------------------------------------------------
+  # Basic Functionality Tests
+  # -------------------------------------------------------------------------
+  - id: tc001_basic_invocation
+    description: "Skill responds to basic invocation with valid output"
+    category: basic
+    priority: critical
+    input:
+      prompt: |
+        Analyze the following code for issues:
+        ```javascript
+        function hello() {
+          console.log("Hello, World!");
+        }
+        ```
+      context:
+        language: javascript
+    expected_output:
+      must_contain:
+        - "function"
+        - "hello"
+      must_not_contain:
+        - "error"
+        - "unable to analyze"
+      # finding_count:
+      #   min: 0
+      #   max: 5
+    validation:
+      schema_check: true
+      keyword_match_threshold: 0.8
+      reasoning_quality_min: 0.6
+  - id: tc002_handles_empty_input
+    description: "Skill handles empty or minimal input gracefully"
+    category: edge_cases
+    priority: high
+    input:
+      prompt: "Analyze this code:"
+      context:
+        language: unknown
+    expected_output:
+      must_contain:
+        - "no code"
+        - "provide"
+      must_not_contain:
+        - "exception"
+        - "crash"
+    validation:
+      schema_check: true
+      allow_partial: true
+  # -------------------------------------------------------------------------
+  # Core Functionality Tests - CUSTOMIZE THESE
+  # -------------------------------------------------------------------------
+  - id: tc003_core_feature_1
+    description: "DESCRIBE WHAT THIS TEST VALIDATES"
+    category: core
+    priority: high
+    input:
+      code: |
+        // Replace with relevant code sample
+        const example = "test";
+      context:
+        language: javascript
+        framework: nodejs
+    expected_output:
+      must_contain:
+        - "EXPECTED_KEYWORD_1"
+        - "EXPECTED_KEYWORD_2"
+      severity_classification: medium
+    validation:
+      schema_check: true
+      keyword_match_threshold: 0.8
+  # -------------------------------------------------------------------------
+  # Negative Tests (Should NOT find issues)
+  # -------------------------------------------------------------------------
+  - id: tc010_no_false_positives
+    description: "Skill does not flag clean code as problematic"
+    category: negative
+    priority: high
+    input:
+      code: |
+        // Well-written, clean code
+        function add(a, b) {
+          return a + b;
+        }
+      context:
+        language: javascript
+    expected_output:
+      must_contain:
+        - "no issues"
+        # OR
+        # - "clean"
+        # - "good"
+      must_not_contain:
+        - "critical"
+        - "vulnerability"
+        - "error"
+    validation:
+      schema_check: true
+      finding_count:
+        max: 1  # Allow at most 1 minor finding
+  # -------------------------------------------------------------------------
+  # Edge Cases
+  # -------------------------------------------------------------------------
+  - id: tc020_large_input
+    description: "Skill handles large input without truncation issues"
+    category: edge_cases
+    priority: medium
+    skip: false
+    input:
+      file_path: fixtures/large_sample.js
+      context:
+        language: javascript
+    expected_output:
+      must_contain:
+        - "analyzed"
+    validation:
+      schema_check: true
+    timeout_ms: 60000  # Longer timeout for large files
+  - id: tc021_special_characters
+    description: "Skill handles special characters in input"
+    category: edge_cases
+    priority: medium
+    input:
+      code: |
+        const emoji = "Hello 🌍!";
+        const unicode = "日本語テスト";
+        const escape = "Line1\nLine2\tTabbed";
+      context:
+        language: javascript
+    expected_output:
+      must_not_contain:
+        - "encoding error"
+        - "parse error"
+    validation:
+      schema_check: true
+  # -------------------------------------------------------------------------
+  # Multi-Language Support (if applicable)
+  # -------------------------------------------------------------------------
+  - id: tc030_python_support
+    description: "Skill correctly analyzes Python code"
+    category: language_support
+    priority: medium
+    skip: true  # Enable if skill supports Python
+    input:
+      code: |
+        def hello():
+            print("Hello, World!")
+      context:
+        language: python
+    expected_output:
+      must_contain:
+        - "def"
+        - "function"
+    validation:
+      schema_check: true
+  # -------------------------------------------------------------------------
+  # Integration Scenarios
+  # -------------------------------------------------------------------------
+  - id: tc040_with_context
+    description: "Skill uses provided context appropriately"
+    category: integration
+    priority: medium
+    input:
+      code: |
+        app.get('/api/users', (req, res) => {
+          const users = db.query('SELECT * FROM users');
+          res.json(users);
+        });
+      context:
+        language: javascript
+        framework: express
+        environment: production
+      options:
+        detailed: true
+    expected_output:
+      must_contain:
+        - "express"
+        - "api"
+    validation:
+      schema_check: true
+      grading_rubric:
+        completeness: 0.4
+        accuracy: 0.4
+        actionability: 0.2
+# =============================================================================
+# Success Criteria
+# =============================================================================
+# Define what constitutes a passing evaluation suite run.
+# =============================================================================
+success_criteria:
+  # Minimum percentage of tests that must pass
+  pass_rate: 0.9
+  # Critical tests must have 100% pass rate
+  critical_pass_rate: 1.0
+  # Average reasoning quality across all tests
+  avg_reasoning_quality: 0.7
+  # Maximum time for entire suite (5 minutes)
+  max_execution_time_ms: 300000
+  # Maximum variance between different models (0.1 = 10%)
+  cross_model_variance: 0.15
+# =============================================================================
+# Metadata
+# =============================================================================
+metadata:
+  author: "@your-github-handle"
+  created: "2026-02-02"
+  last_updated: "2026-02-02"
+  coverage_target: "Core functionality and common edge cases"
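
Note on how the template's checks might be evaluated: the diff does not include the runner, so the semantics of `must_contain`, `must_not_contain`, `keyword_match_threshold`, `pass_rate`, and `critical_pass_rate` are not confirmed here. The sketch below assumes case-insensitive substring matching, and assumes `keyword_match_threshold` is the minimum fraction of `must_contain` keywords that must appear. `checkOutput` and `meetsSuccessCriteria` are hypothetical helper names.

```typescript
// Hypothetical types mirroring fields from the template. The real runner's
// types and matching rules are not shown in this diff.
interface ExpectedOutput {
  must_contain?: string[];
  must_not_contain?: string[];
}

interface CaseResult {
  id: string;
  priority: "critical" | "high" | "medium";
  passed: boolean;
}

// Assumption: case-insensitive substring matching; `threshold` is the minimum
// fraction of must_contain keywords that must be present in the output.
function checkOutput(output: string, expected: ExpectedOutput, threshold = 1.0): boolean {
  const text = output.toLowerCase();
  const required = expected.must_contain ?? [];
  const forbidden = expected.must_not_contain ?? [];
  const hits = required.filter((k) => text.includes(k.toLowerCase())).length;
  const ratio = required.length === 0 ? 1 : hits / required.length;
  const hasForbidden = forbidden.some((k) => text.includes(k.toLowerCase()));
  return ratio >= threshold && !hasForbidden;
}

// Applies pass_rate to all cases and critical_pass_rate to critical cases only.
function meetsSuccessCriteria(
  results: CaseResult[],
  passRate = 0.9,
  criticalPassRate = 1.0,
): boolean {
  const rate = (rs: CaseResult[]): number =>
    rs.length === 0 ? 1 : rs.filter((r) => r.passed).length / rs.length;
  const critical = results.filter((r) => r.priority === "critical");
  return rate(results) >= passRate && rate(critical) >= criticalPassRate;
}
```

Under plain substring matching, a response such as "No errors found" would trip the `error` entry in `must_not_contain` for `tc010_no_false_positives`, so anyone copying the template may want to confirm how the real runner matches keywords.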