npm - gsd-trae - Versions diffs - 1.0.0 → 1.0.2 - Mend

gsd-trae 1.0.0 → 1.0.2

This diff shows the content of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their respective public registries.

Files changed (763)

package/refs/vbenchmark/openspec/changes/init-vibecodingbench/design.md DELETED

@@ -1,111 +0,0 @@
-# Design: VibeCodingBench Architecture
-## Context
-Building a comprehensive benchmark for coding agents (Claude Code, Gemini, Codex, DeepSeek, etc.) that measures real-world developer task performance. Must support both local execution and hosted leaderboard.
-## Goals
-- Reproducible evaluation across different agents
-- Fair comparison with isolated Docker execution
-- Multi-dimensional scoring (not just pass/fail)
-- Easy task contribution workflow
-- Support polyglot: TypeScript, Python, Go, Rust, Java
-## Non-Goals
-- Real-time collaboration features
-- IDE integrations (agents run headless)
-- Training data generation
-## Decisions
-### 1. Monorepo Structure
-```
-vibecodingbench/
-├── packages/
-│   ├── cli/              # Task runner CLI
-│   ├── evaluator/        # Scoring engine
-│   └── leaderboard/      # Web service
-├── tasks/
-│   ├── saas-core/        # 30% weight
-│   ├── glue-code/        # 20% weight
-│   ├── ai-integration/   # 20% weight
-│   ├── frontend/         # 15% weight
-│   └── api-integrations/ # 15% weight
-├── templates/            # Starter codebases
-│   ├── nextjs-supabase/
-│   ├── fastapi-postgres/
-│   ├── go-fiber/
-│   └── rust-axum/
-└── docker/               # Base images
-```
-**Rationale:** Single repo simplifies versioning, CI, and contributions.
-### 2. Task Definition Format
-Each task is a directory:
-```
-tasks/saas-core/auth/supabase-oauth/
-├── task.yaml           # Metadata, prompt, constraints
-├── docker-compose.yaml # Services (DB, mock APIs)
-├── template/           # Starter code (optional)
-├── tests/              # Evaluation tests
-│   ├── functional/     # Must pass
-│   ├── security/       # OWASP checks
-│   └── visual/         # Screenshot diff (frontend)
-└── golden/             # Reference implementation
-```
-**Rationale:** Self-contained, versionable, easy to add.
-### 3. Execution Model
-```
-┌─────────────┐     ┌──────────────┐     ┌─────────────┐
-│   CLI       │────▶│  Task Env    │────▶│  Evaluator  │
-│  (host)     │     │  (Docker)    │     │  (Docker)   │
-└─────────────┘     └──────────────┘     └─────────────┘
-      │                    │                    │
-      │   mount workspace  │   run agent        │   run tests
-      │   inject prompt    │   capture output   │   compute scores
-      └────────────────────┴────────────────────┘
-```
-- Agent runs inside container with network access (for package installs)
-- Evaluation runs in separate container (no agent access)
-- Time/token limits enforced by CLI
-### 4. Scoring Dimensions
-| Dimension | Weight | Method |
-|-----------|--------|--------|
-| Functional | 40% | Test pass rate |
-| Code Quality | 20% | ESLint/Ruff + complexity metrics |
-| Security | 20% | Semgrep OWASP rules |
-| Efficiency | 20% | Tokens used + wall time |
-### 5. Agent Interface
-Agents connect via stdio or HTTP:
-```yaml
-# task.yaml
-agent_interface:
-  type: stdio  # or http
-  prompt_file: PROMPT.md
-  workspace: /workspace
-  timeout: 300s
-  token_limit: 100000
-```
-## Alternatives Considered
-### Task Registry (rejected)
-- Pros: Smaller local footprint
-- Cons: More infrastructure, harder offline use
-- Decision: Start monorepo, can extract registry later
-### VM-per-task (rejected)
-- Pros: Better isolation
-- Cons: 10x cost, slower iteration
-- Decision: Docker sufficient, VMs for hosted tier only
-## Risks & Mitigations
-| Risk | Mitigation |
-|------|------------|
-| Task contamination in training data | Version tasks, rotate variants |
-| Agent gaming metrics | Multiple equivalent tasks per category |
-| Unfair time comparisons | Normalize by model speed tier |
-| Docker escape | Rootless containers, seccomp profiles |
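
The "Scoring Dimensions" table in the deleted design.md (Functional 40%, Code Quality 20%, Security 20%, Efficiency 20%) describes a weighted combination. The sketch below is illustrative only and is not part of the package: it assumes each dimension has already been normalized to a 0-100 scale, and the names `DESIGN_WEIGHTS` and `weighted_score` are hypothetical.

```python
# Weights copied from the "Scoring Dimensions" table in design.md.
# The dictionary keys are hypothetical names chosen for this sketch.
DESIGN_WEIGHTS = {
    "functional": 0.40,
    "code_quality": 0.20,
    "security": 0.20,
    "efficiency": 0.20,
}


def weighted_score(scores: dict[str, float],
                   weights: dict[str, float] = DESIGN_WEIGHTS) -> float:
    """Combine per-dimension scores (assumed 0-100) into one weighted score."""
    # Assumption: weights are fractions that must sum to 1.0.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in weights.items())
```

For example, scores of 100 (functional), 50 (code quality), 100 (security), and 0 (efficiency) combine to roughly 70 under these weights.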

package/refs/vbenchmark/openspec/changes/init-vibecodingbench/proposal.md DELETED

@@ -1,15 +0,0 @@
-# Change: Initialize VibeCodingBench
-## Why
-Existing coding benchmarks (HumanEval, SWE-bench) focus on algorithmic puzzles or isolated bug fixes. Real developers spend 40% of time on SaaS boilerplate, integrations, and glue code. We need a benchmark that measures what coding agents actually do in production.
-## What Changes
-- Create monorepo structure with task runner CLI
-- Define task specification format (YAML + Docker)
-- Implement multi-dimensional evaluation (functional, quality, security, efficiency)
-- Build leaderboard service for hosted evaluation
-- Add 200+ tasks across 5 categories: SaaS, Glue Code, AI Integration, Frontend, API
-## Impact
-- Affected specs: task-runner, task-definition, evaluation, leaderboard (all new)
-- Affected code: Greenfield project

package/refs/vbenchmark/openspec/changes/init-vibecodingbench/specs/evaluation/spec.md DELETED

@@ -1,105 +0,0 @@
-## ADDED Requirements
-### Requirement: Multi-Dimensional Scoring
-The system SHALL compute scores across five dimensions with configurable weights.
-#### Scenario: Default weights
-- **WHEN** no custom weights specified
-- **THEN** system uses: Functional 40%, Visual 20%, Quality 20%, Cost 10%, Speed 10%
-#### Scenario: Custom weights
-- **WHEN** user specifies `--weights func=50,visual=0,quality=30,cost=10,speed=10`
-- **THEN** system applies custom weight distribution
-### Requirement: Functional Correctness (Pass@k)
-The system SHALL measure functional correctness via execution-based testing.
-#### Scenario: Pass@1
-- **WHEN** test suite runs once and passes
-- **THEN** functional score = 100%
-#### Scenario: Pass@n with retries
-- **WHEN** task allows n attempts and any attempt passes
-- **THEN** functional score = 100% but efficiency penalty applied
-#### Scenario: Fail-to-Pass validation
-- **WHEN** task is bug-fix type
-- **THEN** system verifies agent's test fails before fix and passes after
-### Requirement: Visual Fidelity
-The system SHALL measure UI accuracy via screenshot comparison.
-#### Scenario: Pixel diff scoring
-- **WHEN** task has `reference.png` in golden/
-- **THEN** system captures screenshot and computes pixel match percentage
-#### Scenario: Responsive breakpoints
-- **WHEN** task specifies `breakpoints: [375, 768, 1440]`
-- **THEN** system tests at each width and averages scores
-#### Scenario: Tolerance threshold
-- **WHEN** pixel mismatch < 5%
-- **THEN** visual score = 100% (allows font rendering variance)
-### Requirement: Code Quality
-The system SHALL measure code hygiene via static analysis.
-#### Scenario: Linter errors
-- **WHEN** generated code has linter errors
-- **THEN** quality score reduced by error count (max -50 points)
-#### Scenario: Cyclomatic complexity
-- **WHEN** average complexity > 10
-- **THEN** quality score reduced proportionally
-#### Scenario: Security scan
-- **WHEN** Semgrep finds Critical/High vulnerabilities
-- **THEN** task auto-fails regardless of other scores
-### Requirement: Hallucination Detection
-The system SHALL detect fabricated dependencies.
-#### Scenario: Import validation
-- **WHEN** agent imports package not in npm/PyPI/Go modules
-- **THEN** hallucination flag raised, quality score -20
-### Requirement: Cost Efficiency
-The system SHALL track token usage and compute costs.
-#### Scenario: Token tracking
-- **WHEN** task completes
-- **THEN** system records input_tokens, output_tokens, total_cost
-#### Scenario: Cost per solved task (CPST)
-- **WHEN** computing leaderboard
-- **THEN** CPST = total_cost / passed_tasks
-#### Scenario: Context pollution rate
-- **WHEN** agent reads files
-- **THEN** pollution_rate = (files_read - files_edited) / files_read
-### Requirement: Speed Metrics
-The system SHALL track execution time and reasoning efficiency.
-#### Scenario: Wall-clock time
-- **WHEN** task completes
-- **THEN** system records start_time, end_time, duration_seconds
-#### Scenario: Step efficiency
-- **WHEN** agent completes task
-- **THEN** system counts LLM round-trips (fewer = better)
-#### Scenario: Self-correction rate
-- **WHEN** agent encounters error and retries
-- **THEN** system tracks retry_count (target < 2)
-### Requirement: Final Score Calculation
-The system SHALL compute weighted final score with penalties.
-#### Scenario: Score formula
-- **WHEN** all dimensions computed
-- **THEN** final_score = (func * w1) + (visual * w2) + (quality * w3) - (cost_penalty) - (speed_penalty)
-#### Scenario: Leaderboard ranking
-- **WHEN** displaying results
-- **THEN** rank by final_score descending, show all dimensions in spider chart
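
The formulas in the deleted evaluation spec (custom `--weights`, the final score, CPST, the context pollution rate, and the 5% visual tolerance) can be written out directly. This is a minimal sketch, not the package's implementation. Several points are assumptions the spec does not state: that weights must sum to 100, that penalties arrive as precomputed numbers, how zero denominators are handled, and how a mismatch above the tolerance maps to a score. All function names are hypothetical.

```python
# Default weights from the "Default weights" scenario, as fractions.
DEFAULT_WEIGHTS = {"func": 0.40, "visual": 0.20, "quality": 0.20,
                   "cost": 0.10, "speed": 0.10}


def parse_weights(spec: str) -> dict[str, float]:
    """Parse a string such as 'func=50,visual=0,quality=30,cost=10,speed=10'."""
    weights = {}
    for item in spec.split(","):
        name, _, value = item.partition("=")
        weights[name.strip()] = float(value) / 100.0
    # Assumption: percentages must add up to 100.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 100")
    return weights


def final_score(func: float, visual: float, quality: float,
                weights: dict[str, float] = DEFAULT_WEIGHTS,
                cost_penalty: float = 0.0,
                speed_penalty: float = 0.0) -> float:
    """final_score = func*w1 + visual*w2 + quality*w3 - cost_penalty - speed_penalty."""
    return (func * weights["func"]
            + visual * weights["visual"]
            + quality * weights["quality"]
            - cost_penalty
            - speed_penalty)


def cost_per_solved_task(total_cost: float, passed_tasks: int) -> float:
    """CPST = total_cost / passed_tasks."""
    if passed_tasks == 0:
        return float("inf")  # Assumption: no solved tasks means unbounded cost.
    return total_cost / passed_tasks


def pollution_rate(files_read: int, files_edited: int) -> float:
    """pollution_rate = (files_read - files_edited) / files_read."""
    if files_read == 0:
        return 0.0  # Assumption: nothing read means nothing polluted.
    return (files_read - files_edited) / files_read


def visual_score(mismatch_fraction: float) -> float:
    """Score 100 when pixel mismatch is under 5%, per the tolerance scenario."""
    if mismatch_fraction < 0.05:
        return 100.0
    # Assumption: above the tolerance, the score is the pixel match percentage.
    return (1.0 - mismatch_fraction) * 100.0
```

With the custom weights from the spec's example, a run scoring 100 functional, 80 visual, and 90 quality with no penalties comes out at about 77, because the visual dimension is weighted at zero.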

package/refs/vbenchmark/openspec/changes/init-vibecodingbench/specs/leaderboard/spec.md DELETED

@@ -1,68 +0,0 @@
-## ADDED Requirements
-### Requirement: Submission API
-The system SHALL accept evaluation submissions via REST API.
-#### Scenario: Submit run results
-- **WHEN** POST /api/submissions with run results JSON
-- **THEN** system validates, stores, and queues for leaderboard update
-#### Scenario: Agent identification
-- **WHEN** submission includes `agent_id` and `model_version`
-- **THEN** system groups results by agent for comparison
-### Requirement: Leaderboard Display
-The system SHALL display ranked agents with multi-dimensional scores.
-#### Scenario: Overall leaderboard
-- **WHEN** GET /api/leaderboard
-- **THEN** system returns agents ranked by final_score with all dimension breakdowns
-#### Scenario: Category leaderboard
-- **WHEN** GET /api/leaderboard?category=saas-core
-- **THEN** system returns agents ranked by performance in that category only
-#### Scenario: Spider chart data
-- **WHEN** GET /api/leaderboard/:agent_id/chart
-- **THEN** system returns 5-axis radar chart data (func, visual, quality, cost, speed)
-### Requirement: Historical Tracking
-The system SHALL track agent performance over time.
-#### Scenario: Version comparison
-- **WHEN** same agent submits new model version
-- **THEN** system shows delta vs previous version
-#### Scenario: Trend graphs
-- **WHEN** viewing agent detail page
-- **THEN** system displays score trends over last 30 days
-### Requirement: Live Demo Dashboard
-The system SHALL provide real-time task execution viewing.
-#### Scenario: Active runs
-- **WHEN** tasks are running
-- **THEN** dashboard shows live terminal streams and browser recordings
-#### Scenario: Replay recordings
-- **WHEN** user selects completed run
-- **THEN** system plays back asciinema recording synced with browser video
-#### Scenario: Side-by-side comparison
-- **WHEN** user selects 2+ agents for same task
-- **THEN** system shows parallel playback of each agent's execution
-### Requirement: Fairness Controls
-The system SHALL enforce fair comparison conditions.
-#### Scenario: Docker isolation
-- **WHEN** submitting results
-- **THEN** system verifies run was in fresh Docker container (via attestation)
-#### Scenario: Held-out validation
-- **WHEN** task is marked `held_out: true`
-- **THEN** system only accepts submissions from last 14 days (prevents training contamination)
-#### Scenario: Standardized scaffolding
-- **WHEN** displaying leaderboard
-- **THEN** system shows which agent tooling was used (raw API vs Claude Code CLI vs Codex CLI)
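
Two rules from the deleted leaderboard spec are simple enough to sketch: ranking by `final_score` in descending order, and the 14-day window for tasks marked `held_out: true`. The code below is one possible reading, not the package's implementation. It assumes the window is measured from a submission timestamp to the current time, and the names `accepts_submission` and `rank_agents` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Window from the "Held-out validation" scenario.
HELD_OUT_WINDOW = timedelta(days=14)


def accepts_submission(held_out: bool, submitted_at: datetime,
                       now: datetime) -> bool:
    """Accept held-out submissions only if made within the last 14 days."""
    if not held_out:
        return True
    age = now - submitted_at
    # Assumption: timestamps in the future are rejected as well.
    return timedelta(0) <= age <= HELD_OUT_WINDOW


def rank_agents(results: list[dict]) -> list[dict]:
    """Return results ordered by final_score, highest first."""
    return sorted(results, key=lambda r: r["final_score"], reverse=True)


if __name__ == "__main__":
    now = datetime(2025, 1, 31, tzinfo=timezone.utc)
    recent = now - timedelta(days=3)
    print(accepts_submission(True, recent, now))
```

Passing `now` explicitly keeps the check deterministic and easy to test; a real service would more likely read the clock once per request.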

package/refs/vbenchmark/openspec/changes/init-vibecodingbench/specs/task-definition/spec.md DELETED

@@ -1,45 +0,0 @@
-## ADDED Requirements
-### Requirement: Task Schema
-The system SHALL validate task definitions against a JSON Schema.
-#### Scenario: Valid task.yaml
-- **WHEN** task.yaml contains all required fields (id, name, category, prompt, timeout)
-- **THEN** system loads task without errors
-#### Scenario: Invalid task.yaml
-- **WHEN** task.yaml is missing required fields
-- **THEN** system reports validation errors with line numbers
-### Requirement: Task Structure
-Each task SHALL be a self-contained directory with standardized layout.
-#### Scenario: Minimal task
-- **WHEN** task directory contains `task.yaml` and `tests/`
-- **THEN** system can execute and evaluate the task
-#### Scenario: Full task with template
-- **WHEN** task directory contains `task.yaml`, `template/`, `tests/`, `golden/`
-- **THEN** system uses template as starter code and golden for reference comparison
-### Requirement: Task Metadata
-Task definitions SHALL include metadata for filtering and scoring.
-#### Scenario: Category and weight
-- **WHEN** task.yaml specifies `category: saas-core` and `weight: 1.5`
-- **THEN** system applies weight multiplier to final score
-#### Scenario: Difficulty level
-- **WHEN** task.yaml specifies `difficulty: hard`
-- **THEN** system adjusts timeout and token limits accordingly
-### Requirement: Prompt Specification
-Tasks SHALL define agent prompts with clear success criteria.
-#### Scenario: Prompt file
-- **WHEN** task.yaml specifies `prompt_file: PROMPT.md`
-- **THEN** system reads prompt from that file with variable substitution
-#### Scenario: Inline prompt
-- **WHEN** task.yaml contains `prompt:` field directly
-- **THEN** system uses inline prompt text
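
The "Task Schema" requirement in the deleted task-definition spec lists five required fields (id, name, category, prompt, timeout). The spec calls for JSON Schema validation with line numbers. The sketch below swaps in a plain required-field check on an already-parsed dictionary so that it stays standard-library only, and it does not report line numbers. The function name `validate_task` and the error message format are hypothetical.

```python
# Required fields from the "Valid task.yaml" scenario.
REQUIRED_FIELDS = ("id", "name", "category", "prompt", "timeout")


def validate_task(task: dict) -> list[str]:
    """Return one error message per missing required field (empty if valid)."""
    return [f"missing required field: {field}"
            for field in REQUIRED_FIELDS
            if field not in task]


if __name__ == "__main__":
    # Hypothetical task content, shaped like a parsed task.yaml.
    task = {"id": "supabase-oauth", "name": "Supabase OAuth",
            "category": "saas-core", "timeout": "300s"}
    print(validate_task(task))
```

The demo task omits `prompt`, so the check reports exactly that one field.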

package/refs/vbenchmark/openspec/changes/init-vibecodingbench/specs/task-runner/spec.md DELETED

@@ -1,49 +0,0 @@
-## ADDED Requirements
-### Requirement: Task Discovery
-The system SHALL discover tasks from the `tasks/` directory by scanning for `task.yaml` files.
-#### Scenario: List all tasks
-- **WHEN** user runs `vibecodingbench list`
-- **THEN** system displays all tasks grouped by category with metadata
-#### Scenario: Filter by category
-- **WHEN** user runs `vibecodingbench list --category saas-core`
-- **THEN** system displays only tasks in that category
-### Requirement: Task Execution
-The system SHALL execute tasks in isolated Docker containers with configurable timeouts.
-#### Scenario: Run single task
-- **WHEN** user runs `vibecodingbench run <task-id> --agent claude-code`
-- **THEN** system spawns Docker container, injects prompt, captures agent output
-#### Scenario: Timeout enforcement
-- **WHEN** agent exceeds task timeout (default 300s)
-- **THEN** system kills container and records timeout failure
-#### Scenario: Token limit enforcement
-- **WHEN** agent exceeds token limit (default 100k)
-- **THEN** system stops agent and records token limit failure
-### Requirement: Agent Interface
-The system SHALL support multiple agent connection methods.
-#### Scenario: Stdio agent
-- **WHEN** task.yaml specifies `agent_interface.type: stdio`
-- **THEN** system communicates via stdin/stdout pipes
-#### Scenario: HTTP agent
-- **WHEN** task.yaml specifies `agent_interface.type: http`
-- **THEN** system communicates via REST API on localhost:8080
-### Requirement: Live Demo Mode
-The system SHALL support live streaming of task execution for demos.
-#### Scenario: Stream execution
-- **WHEN** user runs `vibecodingbench run <task-id> --live`
-- **THEN** system streams agent actions, terminal output, and browser (if applicable) to web UI
-#### Scenario: Record session
-- **WHEN** user runs `vibecodingbench run <task-id> --record`
-- **THEN** system saves asciinema recording and browser video to `results/<run-id>/`
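
The "Timeout enforcement" scenario in the deleted task-runner spec (kill the run and record a timeout failure after the default 300s) can be illustrated with a local subprocess standing in for the Docker container. This is a sketch under that assumption, not the package's runner. The function name `run_with_timeout` and the shape of the returned dictionary are hypothetical.

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_s: float = 300.0) -> dict:
    """Run cmd, stopping it and recording a timeout once timeout_s elapses."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "returncode": None, "stdout": ""}
    return {"status": "completed", "returncode": proc.returncode,
            "stdout": proc.stdout}


if __name__ == "__main__":
    # A short command standing in for an agent run.
    result = run_with_timeout([sys.executable, "-c", "print('hello')"],
                              timeout_s=10)
    print(result["status"])
```

A container-based runner would stop the container rather than a child process, but the control flow (wait, time out, record the failure) would look much the same.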