gsd-trae 1.0.0 → 1.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +40 -0
- package/README.md +7 -76
- package/assets/screenshot.png +0 -0
- package/package.json +12 -3
- package/.claude/settings.local.json +0 -8
- package/.gitmodules +0 -6
- package/.trae/project_rules.md +0 -56
- package/.trae/rules/project_rules.md +0 -56
- package/.vscode/code-counter/code-counter.db +0 -0
- package/.vscode/settings.json +0 -5
- package/refs/gsd/.github/CODEOWNERS +0 -2
- package/refs/gsd/.github/FUNDING.yml +0 -1
- package/refs/gsd/.github/ISSUE_TEMPLATE/bug_report.yml +0 -59
- package/refs/gsd/.github/ISSUE_TEMPLATE/feature_request.yml +0 -37
- package/refs/gsd/.github/pull_request_template.md +0 -24
- package/refs/gsd/.github/workflows/auto-label-issues.yml +0 -21
- package/refs/gsd/CHANGELOG.md +0 -1520
- package/refs/gsd/LICENSE +0 -21
- package/refs/gsd/README.md +0 -704
- package/refs/gsd/SECURITY.md +0 -33
- package/refs/gsd/agents/gsd-codebase-mapper.md +0 -764
- package/refs/gsd/agents/gsd-debugger.md +0 -1246
- package/refs/gsd/agents/gsd-executor.md +0 -469
- package/refs/gsd/agents/gsd-integration-checker.md +0 -443
- package/refs/gsd/agents/gsd-phase-researcher.md +0 -546
- package/refs/gsd/agents/gsd-plan-checker.md +0 -690
- package/refs/gsd/agents/gsd-planner.md +0 -1275
- package/refs/gsd/agents/gsd-project-researcher.md +0 -621
- package/refs/gsd/agents/gsd-research-synthesizer.md +0 -239
- package/refs/gsd/agents/gsd-roadmapper.md +0 -642
- package/refs/gsd/agents/gsd-verifier.md +0 -573
- package/refs/gsd/assets/gsd-logo-2000-transparent.png +0 -0
- package/refs/gsd/assets/gsd-logo-2000-transparent.svg +0 -17
- package/refs/gsd/assets/gsd-logo-2000.png +0 -0
- package/refs/gsd/assets/gsd-logo-2000.svg +0 -21
- package/refs/gsd/assets/terminal.svg +0 -68
- package/refs/gsd/bin/install.js +0 -2090
- package/refs/gsd/commands/gsd/add-phase.md +0 -43
- package/refs/gsd/commands/gsd/add-tests.md +0 -41
- package/refs/gsd/commands/gsd/add-todo.md +0 -47
- package/refs/gsd/commands/gsd/audit-milestone.md +0 -36
- package/refs/gsd/commands/gsd/check-todos.md +0 -45
- package/refs/gsd/commands/gsd/cleanup.md +0 -18
- package/refs/gsd/commands/gsd/complete-milestone.md +0 -136
- package/refs/gsd/commands/gsd/debug.md +0 -167
- package/refs/gsd/commands/gsd/discuss-phase.md +0 -83
- package/refs/gsd/commands/gsd/execute-phase.md +0 -41
- package/refs/gsd/commands/gsd/health.md +0 -22
- package/refs/gsd/commands/gsd/help.md +0 -22
- package/refs/gsd/commands/gsd/insert-phase.md +0 -32
- package/refs/gsd/commands/gsd/join-discord.md +0 -18
- package/refs/gsd/commands/gsd/list-phase-assumptions.md +0 -46
- package/refs/gsd/commands/gsd/map-codebase.md +0 -71
- package/refs/gsd/commands/gsd/new-milestone.md +0 -44
- package/refs/gsd/commands/gsd/new-project.md +0 -42
- package/refs/gsd/commands/gsd/new-project.md.bak +0 -1041
- package/refs/gsd/commands/gsd/pause-work.md +0 -38
- package/refs/gsd/commands/gsd/plan-milestone-gaps.md +0 -34
- package/refs/gsd/commands/gsd/plan-phase.md +0 -45
- package/refs/gsd/commands/gsd/progress.md +0 -24
- package/refs/gsd/commands/gsd/quick.md +0 -41
- package/refs/gsd/commands/gsd/reapply-patches.md +0 -110
- package/refs/gsd/commands/gsd/remove-phase.md +0 -31
- package/refs/gsd/commands/gsd/research-phase.md +0 -189
- package/refs/gsd/commands/gsd/resume-work.md +0 -40
- package/refs/gsd/commands/gsd/set-profile.md +0 -34
- package/refs/gsd/commands/gsd/settings.md +0 -36
- package/refs/gsd/commands/gsd/update.md +0 -37
- package/refs/gsd/commands/gsd/verify-work.md +0 -38
- package/refs/gsd/docs/USER-GUIDE.md +0 -471
- package/refs/gsd/docs/context-monitor.md +0 -96
- package/refs/gsd/get-shit-done/bin/gsd-tools.cjs +0 -585
- package/refs/gsd/get-shit-done/bin/lib/commands.cjs +0 -553
- package/refs/gsd/get-shit-done/bin/lib/config.cjs +0 -162
- package/refs/gsd/get-shit-done/bin/lib/core.cjs +0 -411
- package/refs/gsd/get-shit-done/bin/lib/frontmatter.cjs +0 -299
- package/refs/gsd/get-shit-done/bin/lib/init.cjs +0 -710
- package/refs/gsd/get-shit-done/bin/lib/milestone.cjs +0 -215
- package/refs/gsd/get-shit-done/bin/lib/phase.cjs +0 -870
- package/refs/gsd/get-shit-done/bin/lib/roadmap.cjs +0 -298
- package/refs/gsd/get-shit-done/bin/lib/state.cjs +0 -521
- package/refs/gsd/get-shit-done/bin/lib/template.cjs +0 -222
- package/refs/gsd/get-shit-done/bin/lib/verify.cjs +0 -772
- package/refs/gsd/get-shit-done/references/checkpoints.md +0 -776
- package/refs/gsd/get-shit-done/references/continuation-format.md +0 -249
- package/refs/gsd/get-shit-done/references/decimal-phase-calculation.md +0 -65
- package/refs/gsd/get-shit-done/references/git-integration.md +0 -248
- package/refs/gsd/get-shit-done/references/git-planning-commit.md +0 -38
- package/refs/gsd/get-shit-done/references/model-profile-resolution.md +0 -34
- package/refs/gsd/get-shit-done/references/model-profiles.md +0 -92
- package/refs/gsd/get-shit-done/references/phase-argument-parsing.md +0 -61
- package/refs/gsd/get-shit-done/references/planning-config.md +0 -196
- package/refs/gsd/get-shit-done/references/questioning.md +0 -145
- package/refs/gsd/get-shit-done/references/tdd.md +0 -263
- package/refs/gsd/get-shit-done/references/ui-brand.md +0 -160
- package/refs/gsd/get-shit-done/references/verification-patterns.md +0 -612
- package/refs/gsd/get-shit-done/templates/DEBUG.md +0 -164
- package/refs/gsd/get-shit-done/templates/UAT.md +0 -247
- package/refs/gsd/get-shit-done/templates/VALIDATION.md +0 -76
- package/refs/gsd/get-shit-done/templates/codebase/architecture.md +0 -255
- package/refs/gsd/get-shit-done/templates/codebase/concerns.md +0 -310
- package/refs/gsd/get-shit-done/templates/codebase/conventions.md +0 -307
- package/refs/gsd/get-shit-done/templates/codebase/integrations.md +0 -280
- package/refs/gsd/get-shit-done/templates/codebase/stack.md +0 -186
- package/refs/gsd/get-shit-done/templates/codebase/structure.md +0 -285
- package/refs/gsd/get-shit-done/templates/codebase/testing.md +0 -480
- package/refs/gsd/get-shit-done/templates/config.json +0 -37
- package/refs/gsd/get-shit-done/templates/context.md +0 -283
- package/refs/gsd/get-shit-done/templates/continue-here.md +0 -78
- package/refs/gsd/get-shit-done/templates/debug-subagent-prompt.md +0 -91
- package/refs/gsd/get-shit-done/templates/discovery.md +0 -146
- package/refs/gsd/get-shit-done/templates/milestone-archive.md +0 -123
- package/refs/gsd/get-shit-done/templates/milestone.md +0 -115
- package/refs/gsd/get-shit-done/templates/phase-prompt.md +0 -569
- package/refs/gsd/get-shit-done/templates/planner-subagent-prompt.md +0 -117
- package/refs/gsd/get-shit-done/templates/project.md +0 -184
- package/refs/gsd/get-shit-done/templates/requirements.md +0 -231
- package/refs/gsd/get-shit-done/templates/research-project/ARCHITECTURE.md +0 -204
- package/refs/gsd/get-shit-done/templates/research-project/FEATURES.md +0 -147
- package/refs/gsd/get-shit-done/templates/research-project/PITFALLS.md +0 -200
- package/refs/gsd/get-shit-done/templates/research-project/STACK.md +0 -120
- package/refs/gsd/get-shit-done/templates/research-project/SUMMARY.md +0 -170
- package/refs/gsd/get-shit-done/templates/research.md +0 -552
- package/refs/gsd/get-shit-done/templates/retrospective.md +0 -54
- package/refs/gsd/get-shit-done/templates/roadmap.md +0 -202
- package/refs/gsd/get-shit-done/templates/state.md +0 -176
- package/refs/gsd/get-shit-done/templates/summary-complex.md +0 -59
- package/refs/gsd/get-shit-done/templates/summary-minimal.md +0 -41
- package/refs/gsd/get-shit-done/templates/summary-standard.md +0 -48
- package/refs/gsd/get-shit-done/templates/summary.md +0 -248
- package/refs/gsd/get-shit-done/templates/user-setup.md +0 -311
- package/refs/gsd/get-shit-done/templates/verification-report.md +0 -322
- package/refs/gsd/get-shit-done/workflows/add-phase.md +0 -111
- package/refs/gsd/get-shit-done/workflows/add-tests.md +0 -350
- package/refs/gsd/get-shit-done/workflows/add-todo.md +0 -157
- package/refs/gsd/get-shit-done/workflows/audit-milestone.md +0 -297
- package/refs/gsd/get-shit-done/workflows/check-todos.md +0 -176
- package/refs/gsd/get-shit-done/workflows/cleanup.md +0 -152
- package/refs/gsd/get-shit-done/workflows/complete-milestone.md +0 -763
- package/refs/gsd/get-shit-done/workflows/diagnose-issues.md +0 -219
- package/refs/gsd/get-shit-done/workflows/discovery-phase.md +0 -289
- package/refs/gsd/get-shit-done/workflows/discuss-phase.md +0 -542
- package/refs/gsd/get-shit-done/workflows/execute-phase.md +0 -449
- package/refs/gsd/get-shit-done/workflows/execute-plan.md +0 -448
- package/refs/gsd/get-shit-done/workflows/health.md +0 -156
- package/refs/gsd/get-shit-done/workflows/help.md +0 -489
- package/refs/gsd/get-shit-done/workflows/insert-phase.md +0 -129
- package/refs/gsd/get-shit-done/workflows/list-phase-assumptions.md +0 -178
- package/refs/gsd/get-shit-done/workflows/map-codebase.md +0 -315
- package/refs/gsd/get-shit-done/workflows/new-milestone.md +0 -382
- package/refs/gsd/get-shit-done/workflows/new-project.md +0 -1116
- package/refs/gsd/get-shit-done/workflows/pause-work.md +0 -122
- package/refs/gsd/get-shit-done/workflows/plan-milestone-gaps.md +0 -274
- package/refs/gsd/get-shit-done/workflows/plan-phase.md +0 -569
- package/refs/gsd/get-shit-done/workflows/progress.md +0 -381
- package/refs/gsd/get-shit-done/workflows/quick.md +0 -453
- package/refs/gsd/get-shit-done/workflows/remove-phase.md +0 -154
- package/refs/gsd/get-shit-done/workflows/research-phase.md +0 -73
- package/refs/gsd/get-shit-done/workflows/resume-project.md +0 -306
- package/refs/gsd/get-shit-done/workflows/set-profile.md +0 -80
- package/refs/gsd/get-shit-done/workflows/settings.md +0 -213
- package/refs/gsd/get-shit-done/workflows/transition.md +0 -544
- package/refs/gsd/get-shit-done/workflows/update.md +0 -219
- package/refs/gsd/get-shit-done/workflows/verify-phase.md +0 -242
- package/refs/gsd/get-shit-done/workflows/verify-work.md +0 -569
- package/refs/gsd/hooks/gsd-check-update.js +0 -62
- package/refs/gsd/hooks/gsd-context-monitor.js +0 -122
- package/refs/gsd/hooks/gsd-statusline.js +0 -108
- package/refs/gsd/package.json +0 -50
- package/refs/gsd/scripts/build-hooks.js +0 -43
- package/refs/gsd/tests/commands.test.cjs +0 -661
- package/refs/gsd/tests/helpers.cjs +0 -40
- package/refs/gsd/tests/init.test.cjs +0 -205
- package/refs/gsd/tests/milestone.test.cjs +0 -98
- package/refs/gsd/tests/phase.test.cjs +0 -1241
- package/refs/gsd/tests/roadmap.test.cjs +0 -265
- package/refs/gsd/tests/state.test.cjs +0 -302
- package/refs/gsd/tests/verify.test.cjs +0 -80
- package/refs/vbenchmark/.agent/agents/codebase-explorer.md +0 -224
- package/refs/vbenchmark/.agent/agents/debugger.md +0 -180
- package/refs/vbenchmark/.agent/agents/documenter.md +0 -166
- package/refs/vbenchmark/.agent/agents/implementer.md +0 -70
- package/refs/vbenchmark/.agent/agents/orchestrator.md +0 -212
- package/refs/vbenchmark/.agent/agents/researcher.md +0 -80
- package/refs/vbenchmark/.agent/agents/reviewer.md +0 -184
- package/refs/vbenchmark/.agent/agents/tester.md +0 -170
- package/refs/vbenchmark/.agent/commands/commit.md +0 -29
- package/refs/vbenchmark/.agent/commands/debug.md +0 -59
- package/refs/vbenchmark/.agent/commands/document.md +0 -52
- package/refs/vbenchmark/.agent/commands/gather-context.md +0 -58
- package/refs/vbenchmark/.agent/commands/init.md +0 -56
- package/refs/vbenchmark/.agent/commands/preset-help.md +0 -50
- package/refs/vbenchmark/.agent/commands/refactor.md +0 -71
- package/refs/vbenchmark/.agent/commands/research.md +0 -37
- package/refs/vbenchmark/.agent/commands/review.md +0 -38
- package/refs/vbenchmark/.agent/commands/test.md +0 -61
- package/refs/vbenchmark/.agent/rules/01-code-quality.md +0 -33
- package/refs/vbenchmark/.agent/rules/02-typescript-go.md +0 -46
- package/refs/vbenchmark/.agent/rules/03-security-git.md +0 -34
- package/refs/vbenchmark/.agent/rules/04-architecture.md +0 -40
- package/refs/vbenchmark/.agent/sync.js +0 -536
- package/refs/vbenchmark/.agent/workflows/commit.md +0 -29
- package/refs/vbenchmark/.agent/workflows/debug.md +0 -59
- package/refs/vbenchmark/.agent/workflows/document.md +0 -52
- package/refs/vbenchmark/.agent/workflows/gather-context.md +0 -58
- package/refs/vbenchmark/.agent/workflows/init.md +0 -56
- package/refs/vbenchmark/.agent/workflows/preset-help.md +0 -50
- package/refs/vbenchmark/.agent/workflows/refactor.md +0 -71
- package/refs/vbenchmark/.agent/workflows/research.md +0 -37
- package/refs/vbenchmark/.agent/workflows/review.md +0 -38
- package/refs/vbenchmark/.agent/workflows/test.md +0 -61
- package/refs/vbenchmark/.claude/commands/agentic-dev/apply.md +0 -222
- package/refs/vbenchmark/.claude/commands/agentic-dev/done.md +0 -166
- package/refs/vbenchmark/.claude/commands/agentic-dev/proposal.md +0 -220
- package/refs/vbenchmark/.claude/commands/openspec/apply.md +0 -23
- package/refs/vbenchmark/.claude/commands/openspec/archive.md +0 -27
- package/refs/vbenchmark/.claude/commands/openspec/proposal.md +0 -28
- package/refs/vbenchmark/.clinerules/01-rules.md +0 -73
- package/refs/vbenchmark/.clinerules/02-agents.md +0 -34
- package/refs/vbenchmark/.cursor/commands/commit.md +0 -29
- package/refs/vbenchmark/.cursor/commands/debug.md +0 -59
- package/refs/vbenchmark/.cursor/commands/document.md +0 -52
- package/refs/vbenchmark/.cursor/commands/gather-context.md +0 -58
- package/refs/vbenchmark/.cursor/commands/init.md +0 -56
- package/refs/vbenchmark/.cursor/commands/preset-help.md +0 -50
- package/refs/vbenchmark/.cursor/commands/refactor.md +0 -71
- package/refs/vbenchmark/.cursor/commands/research.md +0 -37
- package/refs/vbenchmark/.cursor/commands/review.md +0 -38
- package/refs/vbenchmark/.cursor/commands/test.md +0 -61
- package/refs/vbenchmark/.cursor/rules/agents.mdc +0 -1357
- package/refs/vbenchmark/.factory/droids/codebase-explorer.md +0 -224
- package/refs/vbenchmark/.factory/droids/debugger.md +0 -180
- package/refs/vbenchmark/.factory/droids/documenter.md +0 -166
- package/refs/vbenchmark/.factory/droids/implementer.md +0 -70
- package/refs/vbenchmark/.factory/droids/orchestrator.md +0 -212
- package/refs/vbenchmark/.factory/droids/researcher.md +0 -80
- package/refs/vbenchmark/.factory/droids/reviewer.md +0 -184
- package/refs/vbenchmark/.factory/droids/tester.md +0 -170
- package/refs/vbenchmark/.gemini/workflows/commit.md +0 -29
- package/refs/vbenchmark/.gemini/workflows/debug.md +0 -59
- package/refs/vbenchmark/.gemini/workflows/document.md +0 -52
- package/refs/vbenchmark/.gemini/workflows/gather-context.md +0 -58
- package/refs/vbenchmark/.gemini/workflows/init.md +0 -56
- package/refs/vbenchmark/.gemini/workflows/preset-help.md +0 -50
- package/refs/vbenchmark/.gemini/workflows/refactor.md +0 -71
- package/refs/vbenchmark/.gemini/workflows/research.md +0 -37
- package/refs/vbenchmark/.gemini/workflows/review.md +0 -38
- package/refs/vbenchmark/.gemini/workflows/test.md +0 -61
- package/refs/vbenchmark/.github/CODEOWNERS +0 -20
- package/refs/vbenchmark/.github/FUNDING.yml +0 -4
- package/refs/vbenchmark/.github/ISSUE_TEMPLATE/bug-report.yml +0 -76
- package/refs/vbenchmark/.github/ISSUE_TEMPLATE/new-task.yml +0 -106
- package/refs/vbenchmark/.github/PULL_REQUEST_TEMPLATE.md +0 -38
- package/refs/vbenchmark/.github/copilot-instructions.md +0 -73
- package/refs/vbenchmark/.github/workflows/ci.yaml +0 -33
- package/refs/vbenchmark/.github/workflows/vercel-auto-pr.yml +0 -478
- package/refs/vbenchmark/.github/workflows/vercel-deploy.yaml +0 -487
- package/refs/vbenchmark/.github/workflows/vercel-pr-command.yaml +0 -337
- package/refs/vbenchmark/.github/workflows/vercel-project-init.yaml +0 -208
- package/refs/vbenchmark/.opencode/agent/codebase-explorer.md +0 -224
- package/refs/vbenchmark/.opencode/agent/debugger.md +0 -180
- package/refs/vbenchmark/.opencode/agent/documenter.md +0 -166
- package/refs/vbenchmark/.opencode/agent/implementer.md +0 -70
- package/refs/vbenchmark/.opencode/agent/orchestrator.md +0 -212
- package/refs/vbenchmark/.opencode/agent/researcher.md +0 -80
- package/refs/vbenchmark/.opencode/agent/reviewer.md +0 -184
- package/refs/vbenchmark/.opencode/agent/tester.md +0 -170
- package/refs/vbenchmark/.opencode/command/commit.md +0 -29
- package/refs/vbenchmark/.opencode/command/debug.md +0 -59
- package/refs/vbenchmark/.opencode/command/document.md +0 -52
- package/refs/vbenchmark/.opencode/command/gather-context.md +0 -58
- package/refs/vbenchmark/.opencode/command/init.md +0 -56
- package/refs/vbenchmark/.opencode/command/preset-help.md +0 -50
- package/refs/vbenchmark/.opencode/command/refactor.md +0 -71
- package/refs/vbenchmark/.opencode/command/research.md +0 -37
- package/refs/vbenchmark/.opencode/command/review.md +0 -38
- package/refs/vbenchmark/.opencode/command/test.md +0 -61
- package/refs/vbenchmark/.trae/project_rules.md +0 -73
- package/refs/vbenchmark/.windsurf/rules/rules.md +0 -85
- package/refs/vbenchmark/AGENTS.md +0 -73
- package/refs/vbenchmark/CONTRIBUTING.md +0 -332
- package/refs/vbenchmark/Caddyfile +0 -3
- package/refs/vbenchmark/LICENSE +0 -47
- package/refs/vbenchmark/README.md +0 -354
- package/refs/vbenchmark/docker-compose.prod.yaml +0 -35
- package/refs/vbenchmark/docker-compose.yaml +0 -53
- package/refs/vbenchmark/docs/TASK_EXPANSION_PLAN.md +0 -211
- package/refs/vbenchmark/docs/THESIS.md +0 -441
- package/refs/vbenchmark/docs/categories/code-evolution.md +0 -138
- package/refs/vbenchmark/openspec/changes/init-vibecodingbench/design.md +0 -111
- package/refs/vbenchmark/openspec/changes/init-vibecodingbench/proposal.md +0 -15
- package/refs/vbenchmark/openspec/changes/init-vibecodingbench/specs/evaluation/spec.md +0 -105
- package/refs/vbenchmark/openspec/changes/init-vibecodingbench/specs/leaderboard/spec.md +0 -68
- package/refs/vbenchmark/openspec/changes/init-vibecodingbench/specs/task-definition/spec.md +0 -45
- package/refs/vbenchmark/openspec/changes/init-vibecodingbench/specs/task-runner/spec.md +0 -49
- package/refs/vbenchmark/openspec/changes/init-vibecodingbench/tasks.md +0 -413
- package/refs/vbenchmark/package.json +0 -51
- package/refs/vbenchmark/packages/cli/eslint.config.js +0 -16
- package/refs/vbenchmark/packages/cli/package.json +0 -35
- package/refs/vbenchmark/packages/cli/src/agents/index.ts +0 -655
- package/refs/vbenchmark/packages/cli/src/commands/eval.ts +0 -197
- package/refs/vbenchmark/packages/cli/src/commands/list.ts +0 -63
- package/refs/vbenchmark/packages/cli/src/commands/run.ts +0 -147
- package/refs/vbenchmark/packages/cli/src/evaluator.ts +0 -125
- package/refs/vbenchmark/packages/cli/src/index.ts +0 -21
- package/refs/vbenchmark/packages/cli/src/lib/task-variation.ts +0 -153
- package/refs/vbenchmark/packages/cli/src/loader.ts +0 -258
- package/refs/vbenchmark/packages/cli/src/reporter.ts +0 -222
- package/refs/vbenchmark/packages/cli/src/runtime/docker.ts +0 -385
- package/refs/vbenchmark/packages/cli/tsconfig.json +0 -8
- package/refs/vbenchmark/packages/dashboard/Dockerfile +0 -42
- package/refs/vbenchmark/packages/dashboard/index.html +0 -21
- package/refs/vbenchmark/packages/dashboard/package.json +0 -29
- package/refs/vbenchmark/packages/dashboard/postcss.config.js +0 -6
- package/refs/vbenchmark/packages/dashboard/public/favicon.svg +0 -24
- package/refs/vbenchmark/packages/dashboard/public/logo.png +0 -0
- package/refs/vbenchmark/packages/dashboard/public/logo.svg +0 -39
- package/refs/vbenchmark/packages/dashboard/src/App.tsx +0 -1468
- package/refs/vbenchmark/packages/dashboard/src/data/category-performance.json +0 -1
- package/refs/vbenchmark/packages/dashboard/src/data/leaderboard.json +0 -1
- package/refs/vbenchmark/packages/dashboard/src/data/task-results.json +0 -1
- package/refs/vbenchmark/packages/dashboard/src/data/tasks.json +0 -1
- package/refs/vbenchmark/packages/dashboard/src/index.css +0 -3
- package/refs/vbenchmark/packages/dashboard/src/main.tsx +0 -13
- package/refs/vbenchmark/packages/dashboard/src/vite-env.d.ts +0 -9
- package/refs/vbenchmark/packages/dashboard/tailwind.config.js +0 -11
- package/refs/vbenchmark/packages/dashboard/tsconfig.json +0 -21
- package/refs/vbenchmark/packages/dashboard/tsconfig.node.json +0 -11
- package/refs/vbenchmark/packages/dashboard/vercel.json +0 -6
- package/refs/vbenchmark/packages/dashboard/vite.config.ts +0 -28
- package/refs/vbenchmark/packages/evaluator/eslint.config.js +0 -16
- package/refs/vbenchmark/packages/evaluator/package.json +0 -24
- package/refs/vbenchmark/packages/evaluator/src/index.ts +0 -15
- package/refs/vbenchmark/packages/evaluator/src/runners/functional.ts +0 -88
- package/refs/vbenchmark/packages/evaluator/src/runners/quality.ts +0 -140
- package/refs/vbenchmark/packages/evaluator/src/runners/security.ts +0 -94
- package/refs/vbenchmark/packages/evaluator/src/runners/visual.ts +0 -108
- package/refs/vbenchmark/packages/evaluator/src/types.d.ts +0 -19
- package/refs/vbenchmark/packages/evaluator/tsconfig.json +0 -8
- package/refs/vbenchmark/packages/leaderboard/Dockerfile +0 -38
- package/refs/vbenchmark/packages/leaderboard/drizzle.config.ts +0 -10
- package/refs/vbenchmark/packages/leaderboard/eslint.config.js +0 -16
- package/refs/vbenchmark/packages/leaderboard/fly.toml +0 -29
- package/refs/vbenchmark/packages/leaderboard/package.json +0 -36
- package/refs/vbenchmark/packages/leaderboard/src/app.ts +0 -29
- package/refs/vbenchmark/packages/leaderboard/src/components/BrowserPreview.tsx +0 -190
- package/refs/vbenchmark/packages/leaderboard/src/components/ComparisonView.tsx +0 -205
- package/refs/vbenchmark/packages/leaderboard/src/components/LeaderboardTable.tsx +0 -150
- package/refs/vbenchmark/packages/leaderboard/src/components/LiveRunCard.tsx +0 -133
- package/refs/vbenchmark/packages/leaderboard/src/components/SubmissionForm.tsx +0 -406
- package/refs/vbenchmark/packages/leaderboard/src/components/SubmitForm.tsx +0 -293
- package/refs/vbenchmark/packages/leaderboard/src/components/TerminalStream.tsx +0 -111
- package/refs/vbenchmark/packages/leaderboard/src/config/pricing.ts +0 -206
- package/refs/vbenchmark/packages/leaderboard/src/db/index.ts +0 -31
- package/refs/vbenchmark/packages/leaderboard/src/db/schema.ts +0 -125
- package/refs/vbenchmark/packages/leaderboard/src/index.ts +0 -13
- package/refs/vbenchmark/packages/leaderboard/src/lib/websocket.ts +0 -124
- package/refs/vbenchmark/packages/leaderboard/src/routes/leaderboard.ts +0 -698
- package/refs/vbenchmark/packages/leaderboard/src/routes/live.ts +0 -175
- package/refs/vbenchmark/packages/leaderboard/src/routes/submissions.ts +0 -183
- package/refs/vbenchmark/packages/leaderboard/src/routes/tasks.ts +0 -215
- package/refs/vbenchmark/packages/leaderboard/tests/api.test.ts +0 -228
- package/refs/vbenchmark/packages/leaderboard/tsconfig.json +0 -9
- package/refs/vbenchmark/scripts/deploy.sh +0 -70
- package/refs/vbenchmark/tasks/ai-integration/advanced/context-management/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/ai-integration/advanced/context-management/task.yaml +0 -16
- package/refs/vbenchmark/tasks/ai-integration/advanced/evaluation-framework/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/ai-integration/advanced/evaluation-framework/task.yaml +0 -16
- package/refs/vbenchmark/tasks/ai-integration/advanced/guardrails-safety/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/ai-integration/advanced/guardrails-safety/task.yaml +0 -16
- package/refs/vbenchmark/tasks/ai-integration/advanced/memory-system/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/ai-integration/advanced/memory-system/task.yaml +0 -16
- package/refs/vbenchmark/tasks/ai-integration/advanced/model-routing/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/ai-integration/advanced/model-routing/task.yaml +0 -16
- package/refs/vbenchmark/tasks/ai-integration/advanced/multi-agent-system/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/ai-integration/advanced/multi-agent-system/task.yaml +0 -16
- package/refs/vbenchmark/tasks/ai-integration/advanced/prompt-optimization/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/ai-integration/advanced/prompt-optimization/task.yaml +0 -16
- package/refs/vbenchmark/tasks/ai-integration/advanced/reasoning-chain/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/ai-integration/advanced/reasoning-chain/task.yaml +0 -16
- package/refs/vbenchmark/tasks/ai-integration/advanced/streaming-pipeline/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/ai-integration/advanced/streaming-pipeline/task.yaml +0 -16
- package/refs/vbenchmark/tasks/ai-integration/advanced/tool-use-orchestration/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/ai-integration/advanced/tool-use-orchestration/task.yaml +0 -16
- package/refs/vbenchmark/tasks/ai-integration/agents/code-review-agent/PROMPT.md +0 -64
- package/refs/vbenchmark/tasks/ai-integration/agents/code-review-agent/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/agents/research-agent/PROMPT.md +0 -61
- package/refs/vbenchmark/tasks/ai-integration/agents/research-agent/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/agents/web-scraper-agent/PROMPT.md +0 -57
- package/refs/vbenchmark/tasks/ai-integration/agents/web-scraper-agent/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/embeddings/duplicate-detection/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/ai-integration/embeddings/duplicate-detection/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/embeddings/recommendation-engine/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/ai-integration/embeddings/recommendation-engine/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/embeddings/semantic-search/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/ai-integration/embeddings/semantic-search/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/fine-tuning/classification-model/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/ai-integration/fine-tuning/classification-model/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/function-calling/api-orchestrator/PROMPT.md +0 -60
- package/refs/vbenchmark/tasks/ai-integration/function-calling/api-orchestrator/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/function-calling/calendar-assistant/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/ai-integration/function-calling/calendar-assistant/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/function-calling/database-query/PROMPT.md +0 -62
- package/refs/vbenchmark/tasks/ai-integration/function-calling/database-query/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/multimodal/chart-interpreter/PROMPT.md +0 -60
- package/refs/vbenchmark/tasks/ai-integration/multimodal/chart-interpreter/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/multimodal/image-captioning/PROMPT.md +0 -49
- package/refs/vbenchmark/tasks/ai-integration/multimodal/image-captioning/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/rag-chatbot/code-assistant/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/ai-integration/rag-chatbot/code-assistant/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/rag-chatbot/doc-search/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/ai-integration/rag-chatbot/doc-search/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/rag-chatbot/pdf-qa/PROMPT.md +0 -76
- package/refs/vbenchmark/tasks/ai-integration/rag-chatbot/pdf-qa/docker-compose.yaml +0 -30
- package/refs/vbenchmark/tasks/ai-integration/rag-chatbot/pdf-qa/task.yaml +0 -30
- package/refs/vbenchmark/tasks/ai-integration/rag-chatbot/pdf-qa/tests/functional/qa.test.py +0 -146
- package/refs/vbenchmark/tasks/ai-integration/rag-chatbot/support-bot/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/ai-integration/rag-chatbot/support-bot/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/structured-output/contract-analyzer/PROMPT.md +0 -67
- package/refs/vbenchmark/tasks/ai-integration/structured-output/contract-analyzer/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/structured-output/invoice-parser/PROMPT.md +0 -61
- package/refs/vbenchmark/tasks/ai-integration/structured-output/invoice-parser/task.yaml +0 -27
- package/refs/vbenchmark/tasks/ai-integration/structured-output/receipt-scanner/PROMPT.md +0 -65
- package/refs/vbenchmark/tasks/ai-integration/structured-output/receipt-scanner/task.yaml +0 -24
- package/refs/vbenchmark/tasks/ai-integration/structured-output/resume-parser/PROMPT.md +0 -70
- package/refs/vbenchmark/tasks/ai-integration/structured-output/resume-parser/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/advanced/api-analytics/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/api-integrations/advanced/api-analytics/task.yaml +0 -16
- package/refs/vbenchmark/tasks/api-integrations/advanced/api-gateway/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/api-integrations/advanced/api-gateway/task.yaml +0 -16
- package/refs/vbenchmark/tasks/api-integrations/advanced/api-mocking/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/api-integrations/advanced/api-mocking/task.yaml +0 -16
- package/refs/vbenchmark/tasks/api-integrations/advanced/contract-testing/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/api-integrations/advanced/contract-testing/task.yaml +0 -16
- package/refs/vbenchmark/tasks/api-integrations/advanced/graphql-federation/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/api-integrations/advanced/graphql-federation/task.yaml +0 -16
- package/refs/vbenchmark/tasks/api-integrations/advanced/grpc-gateway/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/api-integrations/advanced/grpc-gateway/task.yaml +0 -16
- package/refs/vbenchmark/tasks/api-integrations/advanced/rate-limiter/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/api-integrations/advanced/rate-limiter/task.yaml +0 -16
- package/refs/vbenchmark/tasks/api-integrations/advanced/request-validator/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/api-integrations/advanced/request-validator/task.yaml +0 -16
- package/refs/vbenchmark/tasks/api-integrations/advanced/sdk-generator/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/api-integrations/advanced/sdk-generator/task.yaml +0 -16
- package/refs/vbenchmark/tasks/api-integrations/advanced/webhook-processor/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/api-integrations/advanced/webhook-processor/task.yaml +0 -16
- package/refs/vbenchmark/tasks/api-integrations/analytics/mixpanel-events/PROMPT.md +0 -42
- package/refs/vbenchmark/tasks/api-integrations/analytics/mixpanel-events/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/analytics/segment-tracking/PROMPT.md +0 -42
- package/refs/vbenchmark/tasks/api-integrations/analytics/segment-tracking/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/auth-provider/oauth2-github/PROMPT.md +0 -42
- package/refs/vbenchmark/tasks/api-integrations/auth-provider/oauth2-github/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/auth-provider/okta-integration/PROMPT.md +0 -44
- package/refs/vbenchmark/tasks/api-integrations/auth-provider/okta-integration/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/auth-provider/saml-sso/PROMPT.md +0 -42
- package/refs/vbenchmark/tasks/api-integrations/auth-provider/saml-sso/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/communication/discord-webhook/PROMPT.md +0 -44
- package/refs/vbenchmark/tasks/api-integrations/communication/discord-webhook/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/communication/slack-bot/PROMPT.md +0 -42
- package/refs/vbenchmark/tasks/api-integrations/communication/slack-bot/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/communication/twilio-sms/PROMPT.md +0 -42
- package/refs/vbenchmark/tasks/api-integrations/communication/twilio-sms/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/email/transactional/PROMPT.md +0 -82
- package/refs/vbenchmark/tasks/api-integrations/email/transactional/task.yaml +0 -27
- package/refs/vbenchmark/tasks/api-integrations/maps/google-maps-geocoding/PROMPT.md +0 -41
- package/refs/vbenchmark/tasks/api-integrations/maps/google-maps-geocoding/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/maps/mapbox-directions/PROMPT.md +0 -41
- package/refs/vbenchmark/tasks/api-integrations/maps/mapbox-directions/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/payment/crypto-payments/PROMPT.md +0 -43
- package/refs/vbenchmark/tasks/api-integrations/payment/crypto-payments/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/payment/paypal-integration/PROMPT.md +0 -41
- package/refs/vbenchmark/tasks/api-integrations/payment/paypal-integration/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/social/twitter-api/PROMPT.md +0 -41
- package/refs/vbenchmark/tasks/api-integrations/social/twitter-api/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/storage/cloudinary-upload/PROMPT.md +0 -43
- package/refs/vbenchmark/tasks/api-integrations/storage/cloudinary-upload/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/storage/gcs-streaming/PROMPT.md +0 -43
- package/refs/vbenchmark/tasks/api-integrations/storage/gcs-streaming/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/storage/s3-presigned-urls/PROMPT.md +0 -41
- package/refs/vbenchmark/tasks/api-integrations/storage/s3-presigned-urls/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/stripe/checkout-session/PROMPT.md +0 -41
- package/refs/vbenchmark/tasks/api-integrations/stripe/checkout-session/task.yaml +0 -24
- package/refs/vbenchmark/tasks/api-integrations/stripe/payment-webhook/PROMPT.md +0 -60
- package/refs/vbenchmark/tasks/api-integrations/stripe/payment-webhook/docker-compose.yaml +0 -38
- package/refs/vbenchmark/tasks/api-integrations/stripe/payment-webhook/task.yaml +0 -31
- package/refs/vbenchmark/tasks/api-integrations/stripe/payment-webhook/tests/webhook.test.ts +0 -193
- package/refs/vbenchmark/tasks/api-integrations/stripe/subscription-portal/PROMPT.md +0 -41
- package/refs/vbenchmark/tasks/api-integrations/stripe/subscription-portal/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/advanced/api-deprecation/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/code-evolution/advanced/api-deprecation/task.yaml +0 -16
- package/refs/vbenchmark/tasks/code-evolution/advanced/ast-refactoring/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/code-evolution/advanced/ast-refactoring/task.yaml +0 -16
- package/refs/vbenchmark/tasks/code-evolution/advanced/concurrency-fix/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/code-evolution/advanced/concurrency-fix/task.yaml +0 -16
- package/refs/vbenchmark/tasks/code-evolution/advanced/database-schema-migration/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/code-evolution/advanced/database-schema-migration/task.yaml +0 -16
- package/refs/vbenchmark/tasks/code-evolution/advanced/dead-code-elimination/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/code-evolution/advanced/dead-code-elimination/task.yaml +0 -16
- package/refs/vbenchmark/tasks/code-evolution/advanced/dependency-upgrade/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/code-evolution/advanced/dependency-upgrade/task.yaml +0 -16
- package/refs/vbenchmark/tasks/code-evolution/advanced/memory-optimization/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/code-evolution/advanced/memory-optimization/task.yaml +0 -16
- package/refs/vbenchmark/tasks/code-evolution/advanced/monorepo-extraction/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/code-evolution/advanced/monorepo-extraction/task.yaml +0 -16
- package/refs/vbenchmark/tasks/code-evolution/advanced/performance-profiling/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/code-evolution/advanced/performance-profiling/task.yaml +0 -16
- package/refs/vbenchmark/tasks/code-evolution/advanced/type-migration/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/code-evolution/advanced/type-migration/task.yaml +0 -16
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/callback-to-async/PROMPT.md +0 -47
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/callback-to-async/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/express-to-fastify/PROMPT.md +0 -49
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/express-to-fastify/base-code/src/app.ts +0 -22
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/express-to-fastify/task.yaml +0 -37
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/express-to-fastify/tests/api.test.ts +0 -70
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/flask-to-fastapi/PROMPT.md +0 -46
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/flask-to-fastapi/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/java-to-kotlin/PROMPT.md +0 -45
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/java-to-kotlin/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/jquery-to-react/PROMPT.md +0 -47
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/jquery-to-react/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/rest-to-grpc/PROMPT.md +0 -47
- package/refs/vbenchmark/tasks/code-evolution/legacy-migration/rest-to-grpc/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/performance/async-refactor/PROMPT.md +0 -47
- package/refs/vbenchmark/tasks/code-evolution/performance/async-refactor/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/performance/memory-leak-fix/PROMPT.md +0 -47
- package/refs/vbenchmark/tasks/code-evolution/performance/memory-leak-fix/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/performance/query-optimization/PROMPT.md +0 -49
- package/refs/vbenchmark/tasks/code-evolution/performance/query-optimization/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/refactoring/class-to-hooks/PROMPT.md +0 -96
- package/refs/vbenchmark/tasks/code-evolution/refactoring/class-to-hooks/task.yaml +0 -27
- package/refs/vbenchmark/tasks/code-evolution/refactoring/dependency-injection/PROMPT.md +0 -47
- package/refs/vbenchmark/tasks/code-evolution/refactoring/dependency-injection/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/refactoring/error-handling/PROMPT.md +0 -48
- package/refs/vbenchmark/tasks/code-evolution/refactoring/error-handling/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/refactoring/monolith-to-modules/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/code-evolution/refactoring/monolith-to-modules/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/refactoring/orm-migration/PROMPT.md +0 -47
- package/refs/vbenchmark/tasks/code-evolution/refactoring/orm-migration/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/security/secrets-rotation/PROMPT.md +0 -49
- package/refs/vbenchmark/tasks/code-evolution/security/secrets-rotation/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/security/sql-injection-fix/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/code-evolution/security/sql-injection-fix/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/security/xss-prevention/PROMPT.md +0 -47
- package/refs/vbenchmark/tasks/code-evolution/security/xss-prevention/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/testing/add-unit-tests/PROMPT.md +0 -48
- package/refs/vbenchmark/tasks/code-evolution/testing/add-unit-tests/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/testing/e2e-playwright/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/code-evolution/testing/e2e-playwright/task.yaml +0 -24
- package/refs/vbenchmark/tasks/code-evolution/testing/pytest-fixtures/PROMPT.md +0 -47
- package/refs/vbenchmark/tasks/code-evolution/testing/pytest-fixtures/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/accessibility/keyboard-shortcuts/PROMPT.md +0 -44
- package/refs/vbenchmark/tasks/frontend/accessibility/keyboard-shortcuts/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/accessibility/screen-reader-nav/PROMPT.md +0 -44
- package/refs/vbenchmark/tasks/frontend/accessibility/screen-reader-nav/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/advanced/canvas-editor/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/frontend/advanced/canvas-editor/task.yaml +0 -16
- package/refs/vbenchmark/tasks/frontend/advanced/micro-frontend/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/frontend/advanced/micro-frontend/task.yaml +0 -16
- package/refs/vbenchmark/tasks/frontend/advanced/offline-first/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/frontend/advanced/offline-first/task.yaml +0 -16
- package/refs/vbenchmark/tasks/frontend/advanced/realtime-collab/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/frontend/advanced/realtime-collab/task.yaml +0 -16
- package/refs/vbenchmark/tasks/frontend/advanced/service-worker/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/frontend/advanced/service-worker/task.yaml +0 -16
- package/refs/vbenchmark/tasks/frontend/advanced/state-machine/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/frontend/advanced/state-machine/task.yaml +0 -16
- package/refs/vbenchmark/tasks/frontend/advanced/virtual-list/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/frontend/advanced/virtual-list/task.yaml +0 -16
- package/refs/vbenchmark/tasks/frontend/advanced/wasm-integration/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/frontend/advanced/wasm-integration/task.yaml +0 -16
- package/refs/vbenchmark/tasks/frontend/advanced/web-worker/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/frontend/advanced/web-worker/task.yaml +0 -16
- package/refs/vbenchmark/tasks/frontend/advanced/webgl-visualization/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/frontend/advanced/webgl-visualization/task.yaml +0 -16
- package/refs/vbenchmark/tasks/frontend/animation/page-transitions/PROMPT.md +0 -44
- package/refs/vbenchmark/tasks/frontend/animation/page-transitions/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/components/data-grid/PROMPT.md +0 -59
- package/refs/vbenchmark/tasks/frontend/components/data-grid/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/components/date-range-picker/PROMPT.md +0 -57
- package/refs/vbenchmark/tasks/frontend/components/date-range-picker/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/components/file-uploader/PROMPT.md +0 -55
- package/refs/vbenchmark/tasks/frontend/components/file-uploader/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/components/form-builder/PROMPT.md +0 -96
- package/refs/vbenchmark/tasks/frontend/components/form-builder/task.yaml +0 -28
- package/refs/vbenchmark/tasks/frontend/components/rich-text-editor/PROMPT.md +0 -45
- package/refs/vbenchmark/tasks/frontend/components/rich-text-editor/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/figma-to-code/dashboard-layout/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/frontend/figma-to-code/dashboard-layout/task.yaml +0 -25
- package/refs/vbenchmark/tasks/frontend/figma-to-code/landing-page/PROMPT.md +0 -49
- package/refs/vbenchmark/tasks/frontend/figma-to-code/landing-page/task.yaml +0 -25
- package/refs/vbenchmark/tasks/frontend/figma-to-code/mobile-app-screen/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/frontend/figma-to-code/mobile-app-screen/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/figma-to-code/pricing-card/PROMPT.md +0 -93
- package/refs/vbenchmark/tasks/frontend/figma-to-code/pricing-card/docker-compose.yaml +0 -23
- package/refs/vbenchmark/tasks/frontend/figma-to-code/pricing-card/task.yaml +0 -30
- package/refs/vbenchmark/tasks/frontend/figma-to-code/pricing-card/tests/visual/diff.test.ts +0 -107
- package/refs/vbenchmark/tasks/frontend/figma-to-code/pricing-card/tests/visual/interaction.test.ts +0 -88
- package/refs/vbenchmark/tasks/frontend/performance/image-lazy-load/PROMPT.md +0 -43
- package/refs/vbenchmark/tasks/frontend/performance/image-lazy-load/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/performance/infinite-scroll/PROMPT.md +0 -44
- package/refs/vbenchmark/tasks/frontend/performance/infinite-scroll/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/state-management/collaborative-editor/PROMPT.md +0 -44
- package/refs/vbenchmark/tasks/frontend/state-management/collaborative-editor/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/state-management/shopping-cart/PROMPT.md +0 -53
- package/refs/vbenchmark/tasks/frontend/state-management/shopping-cart/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/visualization/chart-dashboard/PROMPT.md +0 -83
- package/refs/vbenchmark/tasks/frontend/visualization/chart-dashboard/task.yaml +0 -28
- package/refs/vbenchmark/tasks/frontend/visualization/gantt-chart/PROMPT.md +0 -57
- package/refs/vbenchmark/tasks/frontend/visualization/gantt-chart/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/visualization/map-dashboard/PROMPT.md +0 -44
- package/refs/vbenchmark/tasks/frontend/visualization/map-dashboard/task.yaml +0 -24
- package/refs/vbenchmark/tasks/frontend/visualization/realtime-charts/PROMPT.md +0 -43
- package/refs/vbenchmark/tasks/frontend/visualization/realtime-charts/task.yaml +0 -24
- package/refs/vbenchmark/tasks/glue-code/advanced/blue-green-deploy/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/glue-code/advanced/blue-green-deploy/task.yaml +0 -16
- package/refs/vbenchmark/tasks/glue-code/advanced/canary-release/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/glue-code/advanced/canary-release/task.yaml +0 -16
- package/refs/vbenchmark/tasks/glue-code/advanced/change-data-capture/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/glue-code/advanced/change-data-capture/task.yaml +0 -16
- package/refs/vbenchmark/tasks/glue-code/advanced/config-management/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/glue-code/advanced/config-management/task.yaml +0 -16
- package/refs/vbenchmark/tasks/glue-code/advanced/data-pipeline/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/glue-code/advanced/data-pipeline/task.yaml +0 -16
- package/refs/vbenchmark/tasks/glue-code/advanced/distributed-tracing/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/glue-code/advanced/distributed-tracing/task.yaml +0 -16
- package/refs/vbenchmark/tasks/glue-code/advanced/log-aggregation/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/glue-code/advanced/log-aggregation/task.yaml +0 -16
- package/refs/vbenchmark/tasks/glue-code/advanced/schema-registry/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/glue-code/advanced/schema-registry/task.yaml +0 -16
- package/refs/vbenchmark/tasks/glue-code/advanced/secret-rotation/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/glue-code/advanced/secret-rotation/task.yaml +0 -16
- package/refs/vbenchmark/tasks/glue-code/advanced/stream-processing/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/glue-code/advanced/stream-processing/task.yaml +0 -16
- package/refs/vbenchmark/tasks/glue-code/api-sync/rest-to-graphql/PROMPT.md +0 -66
- package/refs/vbenchmark/tasks/glue-code/api-sync/rest-to-graphql/task.yaml +0 -27
- package/refs/vbenchmark/tasks/glue-code/caching/redis-cache/PROMPT.md +0 -82
- package/refs/vbenchmark/tasks/glue-code/caching/redis-cache/task.yaml +0 -27
- package/refs/vbenchmark/tasks/glue-code/data-transform/avro-schema-evolution/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/glue-code/data-transform/avro-schema-evolution/task.yaml +0 -24
- package/refs/vbenchmark/tasks/glue-code/data-transform/csv-normalizer/PROMPT.md +0 -49
- package/refs/vbenchmark/tasks/glue-code/data-transform/csv-normalizer/task.yaml +0 -24
- package/refs/vbenchmark/tasks/glue-code/data-transform/excel-to-json/PROMPT.md +0 -67
- package/refs/vbenchmark/tasks/glue-code/data-transform/excel-to-json/task.yaml +0 -28
- package/refs/vbenchmark/tasks/glue-code/data-transform/excel-to-json/tests/transform.test.py +0 -137
- package/refs/vbenchmark/tasks/glue-code/data-transform/json-to-xml/PROMPT.md +0 -45
- package/refs/vbenchmark/tasks/glue-code/data-transform/json-to-xml/task.yaml +0 -24
- package/refs/vbenchmark/tasks/glue-code/data-transform/protobuf-converter/PROMPT.md +0 -44
- package/refs/vbenchmark/tasks/glue-code/data-transform/protobuf-converter/task.yaml +0 -24
- package/refs/vbenchmark/tasks/glue-code/etl/cdc-pipeline/PROMPT.md +0 -52
- package/refs/vbenchmark/tasks/glue-code/etl/cdc-pipeline/task.yaml +0 -27
- package/refs/vbenchmark/tasks/glue-code/etl/database-sync/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/glue-code/etl/database-sync/task.yaml +0 -24
- package/refs/vbenchmark/tasks/glue-code/etl/s3-to-warehouse/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/glue-code/etl/s3-to-warehouse/task.yaml +0 -24
- package/refs/vbenchmark/tasks/glue-code/file-processing/image-resizer/PROMPT.md +0 -52
- package/refs/vbenchmark/tasks/glue-code/file-processing/image-resizer/task.yaml +0 -24
- package/refs/vbenchmark/tasks/glue-code/file-processing/pdf-merger/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/glue-code/file-processing/pdf-merger/task.yaml +0 -24
- package/refs/vbenchmark/tasks/glue-code/file-processing/video-transcoder/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/glue-code/file-processing/video-transcoder/task.yaml +0 -27
- package/refs/vbenchmark/tasks/glue-code/migration/data-backfill/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/glue-code/migration/data-backfill/task.yaml +0 -24
- package/refs/vbenchmark/tasks/glue-code/migration/database-versioning/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/glue-code/migration/database-versioning/task.yaml +0 -24
- package/refs/vbenchmark/tasks/glue-code/queue/kafka-producer/PROMPT.md +0 -49
- package/refs/vbenchmark/tasks/glue-code/queue/kafka-producer/task.yaml +0 -27
- package/refs/vbenchmark/tasks/glue-code/queue/rabbitmq-consumer/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/glue-code/queue/rabbitmq-consumer/task.yaml +0 -27
- package/refs/vbenchmark/tasks/glue-code/queue/sqs-batch-processor/PROMPT.md +0 -47
- package/refs/vbenchmark/tasks/glue-code/queue/sqs-batch-processor/task.yaml +0 -24
- package/refs/vbenchmark/tasks/glue-code/scheduler/cron-job-manager/PROMPT.md +0 -52
- package/refs/vbenchmark/tasks/glue-code/scheduler/cron-job-manager/task.yaml +0 -27
- package/refs/vbenchmark/tasks/glue-code/scheduler/delayed-tasks/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/glue-code/scheduler/delayed-tasks/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/advanced/api-versioning/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/saas-core/advanced/api-versioning/task.yaml +0 -16
- package/refs/vbenchmark/tasks/saas-core/advanced/circuit-breaker/PROMPT.md +0 -13
- package/refs/vbenchmark/tasks/saas-core/advanced/circuit-breaker/task.yaml +0 -16
- package/refs/vbenchmark/tasks/saas-core/advanced/compliance-gdpr/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/saas-core/advanced/compliance-gdpr/task.yaml +0 -16
- package/refs/vbenchmark/tasks/saas-core/advanced/cqrs-pattern/PROMPT.md +0 -13
- package/refs/vbenchmark/tasks/saas-core/advanced/cqrs-pattern/task.yaml +0 -16
- package/refs/vbenchmark/tasks/saas-core/advanced/data-encryption/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/saas-core/advanced/data-encryption/task.yaml +0 -16
- package/refs/vbenchmark/tasks/saas-core/advanced/distributed-locking/PROMPT.md +0 -46
- package/refs/vbenchmark/tasks/saas-core/advanced/distributed-locking/task.yaml +0 -24
- package/refs/vbenchmark/tasks/saas-core/advanced/event-sourcing/PROMPT.md +0 -23
- package/refs/vbenchmark/tasks/saas-core/advanced/event-sourcing/task.yaml +0 -16
- package/refs/vbenchmark/tasks/saas-core/advanced/feature-flags-ab/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/saas-core/advanced/feature-flags-ab/task.yaml +0 -16
- package/refs/vbenchmark/tasks/saas-core/advanced/saga-orchestration/PROMPT.md +0 -13
- package/refs/vbenchmark/tasks/saas-core/advanced/saga-orchestration/task.yaml +0 -16
- package/refs/vbenchmark/tasks/saas-core/advanced/webhook-delivery/PROMPT.md +0 -15
- package/refs/vbenchmark/tasks/saas-core/advanced/webhook-delivery/task.yaml +0 -16
- package/refs/vbenchmark/tasks/saas-core/audit/activity-logging/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/saas-core/audit/activity-logging/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/auth/jwt-refresh-tokens/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/saas-core/auth/jwt-refresh-tokens/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/auth/magic-link-email/PROMPT.md +0 -53
- package/refs/vbenchmark/tasks/saas-core/auth/magic-link-email/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/auth/mfa-totp/PROMPT.md +0 -79
- package/refs/vbenchmark/tasks/saas-core/auth/mfa-totp/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/auth/rbac-permissions/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/saas-core/auth/rbac-permissions/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/auth/session-management/PROMPT.md +0 -52
- package/refs/vbenchmark/tasks/saas-core/auth/session-management/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/auth/supabase-oauth/PROMPT.md +0 -45
- package/refs/vbenchmark/tasks/saas-core/auth/supabase-oauth/docker-compose.yaml +0 -47
- package/refs/vbenchmark/tasks/saas-core/auth/supabase-oauth/task.yaml +0 -32
- package/refs/vbenchmark/tasks/saas-core/auth/supabase-oauth/tests/auth.test.ts +0 -59
- package/refs/vbenchmark/tasks/saas-core/billing/invoice-generation/PROMPT.md +0 -53
- package/refs/vbenchmark/tasks/saas-core/billing/invoice-generation/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/billing/stripe-subscriptions/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/saas-core/billing/stripe-subscriptions/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/billing/usage-metering/PROMPT.md +0 -52
- package/refs/vbenchmark/tasks/saas-core/billing/usage-metering/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/crud/dashboard-table/PROMPT.md +0 -48
- package/refs/vbenchmark/tasks/saas-core/crud/dashboard-table/task.yaml +0 -28
- package/refs/vbenchmark/tasks/saas-core/multi-tenant/org-isolation/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/saas-core/multi-tenant/org-isolation/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/multi-tenant/subdomain-routing/PROMPT.md +0 -50
- package/refs/vbenchmark/tasks/saas-core/multi-tenant/subdomain-routing/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/notifications/email-queue/PROMPT.md +0 -53
- package/refs/vbenchmark/tasks/saas-core/notifications/email-queue/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/notifications/in-app-alerts/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/saas-core/notifications/in-app-alerts/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/notifications/push-notifications/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/saas-core/notifications/push-notifications/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/realtime/websocket-chat/PROMPT.md +0 -80
- package/refs/vbenchmark/tasks/saas-core/realtime/websocket-chat/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/search/full-text-search/PROMPT.md +0 -51
- package/refs/vbenchmark/tasks/saas-core/search/full-text-search/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/security/rate-limiter/PROMPT.md +0 -99
- package/refs/vbenchmark/tasks/saas-core/security/rate-limiter/task.yaml +0 -27
- package/refs/vbenchmark/tasks/saas-core/settings/user-preferences/PROMPT.md +0 -78
- package/refs/vbenchmark/tasks/saas-core/settings/user-preferences/task.yaml +0 -27
- package/refs/vbenchmark/templates/fastapi-postgres/docker-compose.yaml +0 -36
- package/refs/vbenchmark/templates/fastapi-postgres/pyproject.toml +0 -34
- package/refs/vbenchmark/templates/fastapi-postgres/src/__init__.py +0 -0
- package/refs/vbenchmark/templates/fastapi-postgres/src/config.py +0 -12
- package/refs/vbenchmark/templates/fastapi-postgres/src/database.py +0 -15
- package/refs/vbenchmark/templates/fastapi-postgres/src/main.py +0 -51
- package/refs/vbenchmark/templates/fastapi-postgres/src/models.py +0 -12
- package/refs/vbenchmark/templates/fastapi-postgres/src/schemas.py +0 -20
- package/refs/vbenchmark/templates/go-fiber/docker-compose.yaml +0 -34
- package/refs/vbenchmark/templates/go-fiber/go.mod +0 -33
- package/refs/vbenchmark/templates/go-fiber/go.sum +0 -68
- package/refs/vbenchmark/templates/go-fiber/main.go +0 -98
- package/refs/vbenchmark/templates/nextjs-supabase/.env.example +0 -3
- package/refs/vbenchmark/templates/nextjs-supabase/docker-compose.yaml +0 -68
- package/refs/vbenchmark/templates/nextjs-supabase/src/app/globals.css +0 -13
- package/refs/vbenchmark/templates/nextjs-supabase/src/app/layout.tsx +0 -19
- package/refs/vbenchmark/templates/nextjs-supabase/src/app/page.tsx +0 -38
- package/refs/vbenchmark/templates/nextjs-supabase/src/lib/supabase/client.ts +0 -8
- package/refs/vbenchmark/templates/nextjs-supabase/src/lib/supabase/server.ts +0 -32
- package/refs/vbenchmark/templates/rust-axum/Cargo.lock +0 -2371
- package/refs/vbenchmark/templates/rust-axum/Cargo.toml +0 -16
- package/refs/vbenchmark/templates/rust-axum/docker-compose.yaml +0 -34
- package/refs/vbenchmark/templates/rust-axum/migrations/20240101000000_init.sql +0 -20
- package/refs/vbenchmark/templates/rust-axum/src/main.rs +0 -121
- package/refs/vbenchmark/tsconfig.base.json +0 -18
- package/refs/vbenchmark/turbo.json +0 -23
- package/refs/vbenchmark/vercel.json +0 -10
|
@@ -1,441 +0,0 @@
|
|
|
1
|
-
# VibeCodingBench: Benchmark Vibe Coding Models for Fun
|
|
2
|
-
|
|
3
|
-
**A Position Paper on Benchmark Methodology**
|
|
4
|
-
|
|
5
|
-
---
|
|
6
|
-
|
|
7
|
-
## Abstract
|
|
8
|
-
|
|
9
|
-
Current benchmarks for evaluating AI coding agents suffer from a fundamental ecological validity crisis. While models achieve impressive scores on SWE-bench (70%+) and HumanEval (90%+), these metrics fail to predict real-world utility. We present VibeCodingBench, a benchmark grounded in empirical data about actual developer workflows, designed to measure what matters: the ability to accelerate production software development. Our approach addresses three critical gaps in existing evaluation: (1) task distribution mismatch, (2) single-dimension scoring, and (3) isolation from production concerns.
|
|
10
|
-
|
|
11
|
-
---
|
|
12
|
-
|
|
13
|
-
## 1. The Evaluation Crisis
|
|
14
|
-
|
|
15
|
-
### 1.1 The Disconnect Between Benchmarks and Reality
|
|
16
|
-
|
|
17
|
-
The AI coding agent landscape exhibits a troubling paradox: benchmark performance continues to climb while practical utility remains uncertain. Consider the evidence:
|
|
18
|
-
|
|
19
|
-
- **SWE-bench Verified**: Top models score 70%+ on bug fixes, yet these "represent a substantial proportion of relatively trivial problems (161 out of 500) that require only one- to two-line modifications" ([Runloop AI, 2025](https://runloop.ai/blog/swe-bench-deep-dive-unmasking-the-limitations-of-a-popular-benchmark))
|
|
20
|
-
|
|
21
|
-
- **SWE-bench Pro Performance Gap**: When evaluated on multi-file, long-horizon tasks, the same models drop to ~23%—a 47-point decline ([Scale AI Leaderboard](https://scale.com/leaderboard/swe_bench_pro_public))
|
|
22
|
-
|
|
23
|
-
- **SWE-EVO Findings**: "Even the best-performing model (GPT-5) resolves only 21% of SWE-EVO tasks compared to 65% on SWE-Bench Verified" ([arxiv:2512.18470](https://arxiv.org/abs/2512.18470))
|
|
24
|
-
|
|
25
|
-
- **Developer Trust**: Stack Overflow 2025 reports that 66% of developers cite "AI solutions that are almost right, but not quite" as their biggest frustration ([Stack Overflow Developer Survey 2025](https://survey.stackoverflow.co/2025/))
|
|
26
|
-
|
|
27
|
-
### 1.2 Why Current Benchmarks Fail
|
|
28
|
-
|
|
29
|
-
We identify four structural limitations:
|
|
30
|
-
|
|
31
|
-
| Limitation | Evidence | Impact |
|
|
32
|
-
|------------|----------|--------|
|
|
33
|
-
| **Task Triviality** | 63.75% of SWE-Agent patches are "suspicious" due to weak tests | Overstates capability |
|
|
34
|
-
| **Data Contamination** | Training corpora include benchmark repositories | Inflates scores |
|
|
35
|
-
| **Narrow Scope** | Focus on Python bug fixes in 12 repositories | Misses language/domain breadth |
|
|
36
|
-
| **Single Metric** | Pass rate alone ignores quality, security, cost | Hides production risks |
|
|
37
|
-
|
|
38
|
-
As one analysis notes: "Even when LLM-generated code passes functional performance benchmarks, it is not free of underlying quality defects—for example, Claude Sonnet 4 (a top performer on pass rate) averaged 2.11 issues per passing task" ([Qodo State of AI Code Quality 2025](https://www.qodo.ai/reports/state-of-ai-code-quality/)).
|
|
39
|
-
|
|
40
|
-
---
|
|
41
|
-
|
|
42
|
-
## 2. What Developers Actually Do
|
|
43
|
-
|
|
44
|
-
### 2.1 Time Distribution in Software Engineering
|
|
45
|
-
|
|
46
|
-
To build a valid benchmark, we must first understand developer workflows. Empirical research reveals:
|
|
47
|
-
|
|
48
|
-
**Daily Activity Breakdown** ([Sonar Research](https://www.sonarsource.com/blog/how-much-time-do-developers-spend-actually-writing-code/)):
|
|
49
|
-
- Writing new code: **32%**
|
|
50
|
-
- Code maintenance: **19%**
|
|
51
|
-
- Testing: **12%**
|
|
52
|
-
- Meetings/operations: **23%**
|
|
53
|
-
- Security issues: **4%**
|
|
54
|
-
|
|
55
|
-
**Code Time Statistics** ([Software.com Report](https://www.software.com/reports/code-time-report)):
|
|
56
|
-
- Median active coding: **52 minutes/day**
|
|
57
|
-
- Top quartile focus time: **5.8 hours/day**
|
|
58
|
-
- Reading vs. writing ratio: **7:1 to 200:1**
|
|
59
|
-
|
|
60
|
-
**Microsoft Research** found that developers consider workdays "good" when they achieve balance between coding and collaboration, with "efficient use of time" cited by 54% as key ([Microsoft TSE 2019](https://www.microsoft.com/en-us/research/wp-content/uploads/2019/04/devtime-preprint-TSE19.pdf)).
|
|
61
|
-
|
|
62
|
-
**Microsoft Time Warp Study (2024)** surveyed 484 developers and found "significant deviations between a developer's ideal workweek and their actual workweek, with a clear correlation: as the gap between these two workweeks widens, there is a decline in both productivity and satisfaction" ([Microsoft Research 2024](https://www.microsoft.com/en-us/research/wp-content/uploads/2024/11/Time-Warp-Developer-Productivity-Study.pdf)).
|
|
63
|
-
|
|
64
|
-
**Atlassian 2025 DevEx Survey** of 3,500 developers found: "Developers only spend **16% of their time coding**, and coding is not a friction point for developers." Additionally, "63% of developers now say leaders don't understand their pain points, up sharply from 44% last year" ([Atlassian 2025](https://www.atlassian.com/blog/developer/developer-experience-report-2025)).
|
|
65
|
-
|
|
66
|
-
**JetBrains 2025** reports that 85% of developers regularly use AI tools, with 62% relying on at least one AI coding assistant. However, "developers prefer to stay in charge of creative and complex tasks, like debugging or designing application logic" ([JetBrains DevEco 2025](https://devecosystem-2025.jetbrains.com/)).
|
|
67
|
-
|
|
68
|
-
### 2.2 The Boilerplate Burden
|
|
69
|
-
|
|
70
|
-
A critical insight emerges: developers spend enormous time on repetitive, well-understood patterns:
|
|
71
|
-
|
|
72
|
-
> "As time spent on boilerplate code is reduced, developers focus more on improving the software's essential features." ([Iterators HQ](https://www.iteratorshq.com/blog/boilerplate-code-productivity-and-consistency-in-software-development/))
|
|
73
|
-
|
|
74
|
-
GitHub Octoverse 2025 confirms this pattern:
|
|
75
|
-
- **2.4M repositories** now use Notebooks (+75% YoY)
|
|
76
|
-
- **1.9M repositories** use Dockerfiles (+120% YoY)
|
|
77
|
-
- This growth is "fueled by the need to sandbox agents and LLMs" ([GitHub Octoverse 2025](https://github.blog/news-insights/octoverse/))
|
|
78
|
-
|
|
79
|
-
The implication is clear: **an agent that excels at boilerplate tasks (auth, CRUD, integrations) would deliver outsized value**.
|
|
80
|
-
|
|
81
|
-
### 2.3 Common Task Categories
|
|
82
|
-
|
|
83
|
-
Cross-referencing GitHub activity, Stack Overflow surveys, and developer productivity research, we identify six high-impact task categories:
|
|
84
|
-
|
|
85
|
-
| Category | % of Dev Work | Examples | Current Benchmark Coverage |
|
|
86
|
-
|----------|---------------|----------|---------------------------|
|
|
87
|
-
| **SaaS Features** | 25% | Auth, billing, settings | ❌ Minimal |
|
|
88
|
-
| **Glue Code** | 20% | Data transforms, API adapters | ❌ Minimal |
|
|
89
|
-
| **AI Integration** | 15% | RAG, function calling, embeddings | ❌ Minimal |
|
|
90
|
-
| **Frontend** | 15% | Components, forms, dashboards | ⚠️ HumanEval UI only |
|
|
91
|
-
| **API Integrations** | 15% | Webhooks, OAuth, third-party SDKs | ❌ Minimal |
|
|
92
|
-
| **Code Evolution** | 10% | Migrations, refactoring, upgrades | ⚠️ SWE-EVO partial |
|
|
93
|
-
|
|
94
|
-
---
|
|
95
|
-
|
|
96
|
-
## 3. VibeCodingBench Methodology
|
|
97
|
-
|
|
98
|
-
### 3.1 Design Principles
|
|
99
|
-
|
|
100
|
-
We propose five principles for ecological validity:
|
|
101
|
-
|
|
102
|
-
1. **Representative Tasks**: Mirror actual developer work distribution, not algorithmic puzzles
|
|
103
|
-
2. **Multi-Dimensional Scoring**: Evaluate quality, security, cost, and speed alongside correctness
|
|
104
|
-
3. **Production Realism**: Include Docker environments, real APIs, and security constraints
|
|
105
|
-
4. **Language Diversity**: Cover TypeScript, Python, Go, and Rust—not just Python
|
|
106
|
-
5. **Hot-Reload Extensibility**: Allow community task contributions without benchmark rebuilds
|
|
107
|
-
|
|
108
|
-
### 3.2 Evaluation Framework
|
|
109
|
-
|
|
110
|
-
Unlike pass-rate-only benchmarks, VibeCodingBench computes a weighted composite score:
|
|
111
|
-
|
|
112
|
-
```
|
|
113
|
-
Final Score = (Functional × 0.40) + (Visual × 0.20) + (Quality × 0.20)
|
|
114
|
-
- Cost_Penalty - Speed_Penalty
|
|
115
|
-
|
|
116
|
-
If Security_Critical_Fail: Final Score = 0
|
|
117
|
-
```
|
|
118
|
-
|
|
119
|
-
**Dimension Details**:
|
|
120
|
-
|
|
121
|
-
| Dimension | Weight | Measurement | Rationale |
|
|
122
|
-
|-----------|--------|-------------|-----------|
|
|
123
|
-
| Functional | 40% | Playwright E2E tests, Pass@k | Core correctness |
|
|
124
|
-
| Visual | 20% | Pixel diff vs reference design | UI fidelity matters |
|
|
125
|
-
| Quality | 20% | ESLint + Semgrep + complexity | Maintainability |
|
|
126
|
-
| Cost | 10% (max penalty) | Tokens used, API calls | Efficiency |
|
|
127
|
-
| Speed | 10% (max penalty) | Wall-clock time | Practical latency |
|
|
128
|
-
|
|
129
|
-
**Security Gate**: Any OWASP Top 10 vulnerability triggers automatic failure. As [Walturn research](https://www.walturn.com/insights/quantitative-evaluation-of-ai-code-generation-tools) notes: "Organizations should analyze security incident rates for AI-generated versus human code using CWE vulnerability classification."
|
|
130
|
-
|
|
131
|
-
### 3.3 Evaluation Methodology: Real Test Execution
|
|
132
|
-
|
|
133
|
-
VibeCodingBench uses **real test execution**, not pattern matching or fixed expected outputs. This approach reflects production reality where there are multiple valid implementations for any given requirement.
|
|
134
|
-
|
|
135
|
-
#### How It Works
|
|
136
|
-
|
|
137
|
-
1. **Task Submission**: Each task provides the agent with:
|
|
138
|
-
- `PROMPT.md`: Natural language requirements (what a human developer would receive)
|
|
139
|
-
- `manifest.yaml`: Task metadata, dependencies, and evaluation criteria
|
|
140
|
-
- Starter template code (one of 4 stacks: React/Vite, Next.js, Express, FastAPI)
|
|
141
|
-
|
|
142
|
-
2. **Agent Execution**: The agent writes code to satisfy the requirements. Unlike academic benchmarks with single correct answers, agents can choose their own:
|
|
143
|
-
- Implementation patterns (functional vs class-based, hooks vs context)
|
|
144
|
-
- Library choices (within template constraints)
|
|
145
|
-
- Code organization and file structure
|
|
146
|
-
|
|
147
|
-
3. **Real Test Execution**: The evaluator runs actual tests against the generated code:
|
|
148
|
-
|
|
149
|
-
```
|
|
150
|
-
┌─────────────────────────────────────────────────────────────────────┐
|
|
151
|
-
│ EVALUATION PIPELINE │
|
|
152
|
-
├─────────────────────────────────────────────────────────────────────┤
|
|
153
|
-
│ │
|
|
154
|
-
│ Agent Output ──► Build ──► Test Runners ──► Score Aggregation │
|
|
155
|
-
│ │ │
|
|
156
|
-
│ ▼ │
|
|
157
|
-
│ ┌─────────────────────────────────────────────────┐ │
|
|
158
|
-
│ │ PARALLEL TEST RUNNERS │ │
|
|
159
|
-
│ ├─────────────────────────────────────────────────┤ │
|
|
160
|
-
│ │ FunctionalRunner │ Playwright E2E tests │ │
|
|
161
|
-
│ │ QualityRunner │ ESLint + complexity │ │
|
|
162
|
-
│ │ VisualRunner │ Pixel diff vs reference │ │
|
|
163
|
-
│ │ SecurityRunner │ Semgrep OWASP Top 10 │ │
|
|
164
|
-
│ │ CostRunner │ Token usage tracking │ │
|
|
165
|
-
│ │ SpeedRunner │ Wall-clock timing │ │
|
|
166
|
-
│ └─────────────────────────────────────────────────┘ │
|
|
167
|
-
│ │
|
|
168
|
-
└─────────────────────────────────────────────────────────────────────┘
|
|
169
|
-
```
|
|
170
|
-
|
|
171
|
-
#### Test Execution Details
|
|
172
|
-
|
|
173
|
-
**Functional Tests (40% weight)**
|
|
174
|
-
- Real Playwright browser automation
|
|
175
|
-
- Tests verify observable behavior, not implementation details
|
|
176
|
-
- Example: "User can log in" tests that login works, not that you used a specific auth library
|
|
177
|
-
- Pass rate calculated as: (passed tests / total tests) × 100
|
|
178
|
-
|
|
179
|
-
**Quality Analysis (20% weight)**
|
|
180
|
-
- ESLint with strict configuration
|
|
181
|
-
- Cyclomatic complexity analysis
|
|
182
|
-
- Code duplication detection
|
|
183
|
-
- TypeScript strict mode compliance (for TS tasks)
|
|
184
|
-
|
|
185
|
-
**Visual Regression (20% weight)**
|
|
186
|
-
- Screenshot comparison against reference designs
|
|
187
|
-
- Pixel-diff with configurable tolerance (default 5%)
|
|
188
|
-
- Layout validation for responsive breakpoints
|
|
189
|
-
|
|
190
|
-
**Security Gate (Binary)**
|
|
191
|
-
- Semgrep with OWASP Top 10 ruleset
|
|
192
|
-
- Critical vulnerabilities (SQL injection, XSS, command injection) → automatic 0 score
|
|
193
|
-
- Warnings logged but don't fail the task
|
|
194
|
-
|
|
195
|
-
#### Flexible vs Deterministic
|
|
196
|
-
|
|
197
|
-
VibeCodingBench is **intentionally flexible** on implementation while **deterministic on outcomes**:
|
|
198
|
-
|
|
199
|
-
| Aspect | Flexible | Deterministic |
|
|
200
|
-
|--------|----------|---------------|
|
|
201
|
-
| Code style | ✅ Agent's choice | - |
|
|
202
|
-
| Library selection | ✅ Within constraints | - |
|
|
203
|
-
| File organization | ✅ Agent's choice | - |
|
|
204
|
-
| Functional behavior | - | ✅ Must pass tests |
|
|
205
|
-
| Security compliance | - | ✅ Must pass scan |
|
|
206
|
-
| Visual fidelity | - | ✅ Must match reference |
|
|
207
|
-
|
|
208
|
-
This mirrors real development: multiple valid solutions exist, but acceptance criteria are fixed.
|
|
209
|
-
|
|
210
|
-
#### Task Structure Example
|
|
211
|
-
|
|
212
|
-
```yaml
|
|
213
|
-
# manifest.yaml
|
|
214
|
-
name: stripe-webhook-handler
|
|
215
|
-
category: api-integrations
|
|
216
|
-
difficulty: medium
|
|
217
|
-
template: express-ts
|
|
218
|
-
timeout: 300s
|
|
219
|
-
|
|
220
|
-
evaluation:
|
|
221
|
-
functional:
|
|
222
|
-
test_file: tests/webhook.spec.ts
|
|
223
|
-
pass_threshold: 0.8
|
|
224
|
-
quality:
|
|
225
|
-
eslint_config: .eslintrc.strict.json
|
|
226
|
-
max_complexity: 10
|
|
227
|
-
security:
|
|
228
|
-
ruleset: owasp-top-10
|
|
229
|
-
fail_on: [critical, high]
|
|
230
|
-
visual: null # No visual tests for API tasks
|
|
231
|
-
|
|
232
|
-
dependencies:
|
|
233
|
-
- stripe: ^14.0.0
|
|
234
|
-
- express: ^4.18.0
|
|
235
|
-
```
|
|
236
|
-
|
|
237
|
-
The evaluator reads this manifest, spins up the appropriate test environment, executes all runners, and aggregates scores using the weighted formula.
|
|
238
|
-
|
|
239
|
-
### 3.4 Task Taxonomy
|
|
240
|
-
|
|
241
|
-
VibeCodingBench organizes tasks into six categories informed by the developer-workflow data in Section 2.3 (category weights are adjusted slightly from the raw work distribution to emphasize emerging AI-integration work):
|
|
242
|
-
|
|
243
|
-
**SaaS Core (25% weight)**
|
|
244
|
-
- OAuth integration (Google, GitHub, Microsoft)
|
|
245
|
-
- Multi-factor authentication (TOTP)
|
|
246
|
-
- CRUD dashboards with filtering/pagination
|
|
247
|
-
- User preference systems
|
|
248
|
-
|
|
249
|
-
**Glue Code (20% weight)**
|
|
250
|
-
- Data format transformations (Excel → JSON, CSV → SQL)
|
|
251
|
-
- API adapters (REST → GraphQL)
|
|
252
|
-
- Message queue consumers
|
|
253
|
-
- Cron job implementations
|
|
254
|
-
|
|
255
|
-
**AI Integration (20% weight)**
|
|
256
|
-
- RAG chatbot with PDF ingestion
|
|
257
|
-
- Structured output extraction (invoices, receipts)
|
|
258
|
-
- Function calling implementations
|
|
259
|
-
- Embedding-based search
|
|
260
|
-
|
|
261
|
-
**Frontend (15% weight)**
|
|
262
|
-
- Figma-to-code translation
|
|
263
|
-
- Dynamic form builders
|
|
264
|
-
- Analytics dashboards with charts
|
|
265
|
-
- Responsive component libraries
|
|
266
|
-
|
|
267
|
-
**API Integrations (10% weight)**
|
|
268
|
-
- Payment webhooks (Stripe, Paddle)
|
|
269
|
-
- Email service integration (SendGrid, Resend)
|
|
270
|
-
- Cloud storage (S3, GCS)
|
|
271
|
-
- Third-party OAuth consumers
|
|
272
|
-
|
|
273
|
-
**Code Evolution (10% weight)**
|
|
274
|
-
- Framework migrations (Express → Fastify)
|
|
275
|
-
- Class components → Hooks refactoring
|
|
276
|
-
- Monolith decomposition
|
|
277
|
-
- Dependency upgrades
|
|
278
|
-
|
|
279
|
-
---
|
|
280
|
-
|
|
281
|
-
## 4. Addressing Benchmark Contamination
|
|
282
|
-
|
|
283
|
-
A persistent concern with AI benchmarks is data contamination—models may have seen solutions during training. VibeCodingBench mitigates this through:
|
|
284
|
-
|
|
285
|
-
1. **Novel Task Design**: Tasks are synthesized from common patterns, not extracted from existing repositories
|
|
286
|
-
2. **Template Variation**: Each task can be instantiated with different naming, structure, and requirements
|
|
287
|
-
3. **Temporal Freshness**: New tasks can be added post-training cutoff
|
|
288
|
-
4. **Private Test Suites**: Full test implementations are withheld from public repository
|
|
289
|
-
|
|
290
|
-
As OpenAI acknowledged regarding SWE-bench: "existing benchmarks are often contaminated by LLM pretraining data" ([OpenAI SWE-bench Verified](https://openai.com/index/introducing-swe-bench-verified/)).
|
|
291
|
-
|
|
292
|
-
---
|
|
293
|
-
|
|
294
|
-
## 5. Why This Matters
|
|
295
|
-
|
|
296
|
-
### 5.1 The AI Productivity Paradox
|
|
297
|
-
|
|
298
|
-
Multiple studies reveal a striking paradox between AI promise and reality:
|
|
299
|
-
|
|
300
|
-
**GitHub Octoverse 2025**:
|
|
301
|
-
> "Developers expect 24% productivity gains but experience 19% slowdowns in controlled conditions. Less than 44% of AI-generated code is accepted without modification."
|
|
302
|
-
|
|
303
|
-
**METR Randomized Controlled Trial (2025)**: A rigorous study of 16 experienced developers completing 246 tasks found that "allowing AI actually increased completion time by 19%"—despite developers forecasting a 24% reduction ([METR 2025](https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/)).
|
|
304
|
-
|
|
305
|
-
**McKinsey Research (2025)**: While optimistic about AI potential, McKinsey found that "true impact comes when adoption is deep and organization-wide. Companies with 80-100% developer adoption saw gains of more than 110%." However, partial adoption shows minimal gains ([McKinsey 2025](https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/unleashing-developer-productivity-with-generative-ai)).
|
|
306
|
-
|
|
307
|
-
**Stack Overflow 2025** corroborates the frustration:
|
|
308
|
-
> "70% of agent users agree that agents have reduced time spent on specific development tasks"—but 66% cite "AI solutions that are almost right, but not quite" as their biggest frustration, and 45% say "debugging AI code is more time-consuming."
|
|
309
|
-
|
|
310
|
-
The gap between expectation and reality stems from **evaluation-deployment mismatch**: we optimize for the wrong signals.
|
|
311
|
-
|
|
312
|
-
### 5.2 The Path Forward
|
|
313
|
-
|
|
314
|
-
VibeCodingBench aims to close this gap by:
|
|
315
|
-
|
|
316
|
-
1. **Measuring what matters**: Tasks that occupy developer time, not academic exercises
|
|
317
|
-
2. **Penalizing hidden costs**: A fast solution with security flaws scores zero
|
|
318
|
-
3. **Enabling comparison**: Standardized multi-agent evaluation with transparent methodology
|
|
319
|
-
4. **Supporting iteration**: Hot-reload task addition for continuous benchmark evolution
|
|
320
|
-
|
|
321
|
-
---
|
|
322
|
-
|
|
323
|
-
## 6. Initial Results (January 2026)
|
|
324
|
-
|
|
325
|
-
We evaluated 15 leading AI coding models across all 180 VibeCodingBench tasks. The results reveal important insights about the current state of AI coding agents.
|
|
326
|
-
|
|
327
|
-
### 6.1 Leaderboard Summary
|
|
328
|
-
|
|
329
|
-
| Rank | Model | Final Score | Pass Rate | Cost | Avg Time |
|
|
330
|
-
|------|-------|-------------|-----------|------|----------|
|
|
331
|
-
| #1 | Claude Opus 4.5 | 89.2% | 100.0% (180/180) | $12.31 | 44s |
|
|
332
|
-
| #2 | Claude Haiku 4.5 | 89.0% | 99.4% (179/180) | $3.03 | 22s |
|
|
333
|
-
| #3 | Grok 4 Fast | 88.8% | 98.9% (178/180) | $0.21 | 70s |
|
|
334
|
-
| #4 | OpenAI GPT-5.2 | 88.8% | 98.3% (177/180) | $5.01 | 28s |
|
|
335
|
-
| #5 | Qwen3 Max | 88.6% | 100.0% (180/180) | $5.42 | 45s |
|
|
336
|
-
| #6 | Claude Sonnet 4.5 | 88.6% | 98.3% (177/180) | $6.98 | 42s |
|
|
337
|
-
| #7 | GLM 4-Plus | 88.2% | 98.9% (178/180) | $0.93 | 96s |
|
|
338
|
-
| #8 | DeepSeek v3.2 | 88.2% | 98.3% (177/180) | $0.50 | 90s |
|
|
339
|
-
| #9 | Grok 4 | 88.0% | 97.8% (176/180) | $5.47 | 75s |
|
|
340
|
-
| #10 | MiniMax M2.1 | 87.4% | 99.4% (179/180) | $2.40 | 165s |
|
|
341
|
-
| #11 | Grok 4.1 Fast | 86.8% | 97.2% (175/180) | $0.24 | 89s |
|
|
342
|
-
| #12 | Gemini 3 Pro Preview | 85.8% | 95.0% (171/180) | $10.34 | 32s |
|
|
343
|
-
| #13 | GLM-4.7 | 83.9% | 85.6% (154/180) | $0.73 | 57s |
|
|
344
|
-
| #14 | GLM 4.7 Flash | 83.8% | 92.8% (167/180) | $1.11 | 45s |
|
|
345
|
-
| #15 | Gemini 3 Flash | 83.4% | 92.2% (166/180) | $0.86 | 28s |
|
|
346
|
-
|
|
347
|
-
### 6.2 Key Findings
|
|
348
|
-
|
|
349
|
-
**1. Tight Competition at the Top**
|
|
350
|
-
The top 10 models are clustered within 2 percentage points (87.4%–89.2%). This narrow spread suggests frontier models have converged on similar capabilities for production coding tasks.
|
|
351
|
-
|
|
352
|
-
**2. Cost-Performance Tradeoffs**
|
|
353
|
-
Cost varies by nearly 60× between models at similar performance levels:
|
|
354
|
-
- **Best value**: Grok 4 Fast ($0.21 total, 88.8% score) and DeepSeek v3.2 ($0.50, 88.2%)
|
|
355
|
-
- **Premium tier**: Claude Opus 4.5 ($12.31) and Gemini 3 Pro ($10.34)
|
|
356
|
-
- **Middle ground**: Claude Haiku 4.5 ($3.03, 89.0%) offers excellent cost-performance balance
|
|
357
|
-
|
|
358
|
-
**3. Speed vs Quality**
|
|
359
|
-
Model speed shows no consistent relationship with final score:
|
|
360
|
-
- Claude Haiku 4.5: 22s average, 89.0% score
|
|
361
|
-
- Gemini 3 Flash: 28s average, 83.4% score
|
|
362
|
-
- MiniMax M2.1: 165s average, 87.4% score
|
|
363
|
-
|
|
364
|
-
**4. Pass Rate ≠ Final Score**
|
|
365
|
-
Our multi-dimensional scoring reveals quality differences hidden by pass rate:
|
|
366
|
-
- Qwen3 Max: 100% pass rate but 88.6% final score (quality/cost penalties)
|
|
367
|
-
- Claude Opus 4.5: 100% pass rate and 89.2% final score (better quality metrics)
|
|
368
|
-
|
|
369
|
-
**5. All Models Pass Security Gate**
|
|
370
|
-
Every evaluated model achieved 100% on security checks—a positive signal that modern LLMs avoid obvious OWASP vulnerabilities in generated code.
|
|
371
|
-
|
|
372
|
-
### 6.3 Dimensional Breakdown
|
|
373
|
-
|
|
374
|
-
| Model | Functional | Quality | Visual | Cost Score | Speed Score |
|
|
375
|
-
|-------|------------|---------|--------|------------|-------------|
|
|
376
|
-
| Claude Opus 4.5 | 85.0% | 80.0% | 80.0% | 70% | 80% |
|
|
377
|
-
| Claude Haiku 4.5 | 84.5% | 79.6% | 80.0% | 88% | 95% |
|
|
378
|
-
| Grok 4 Fast | 84.1% | 80.0% | 80.0% | 94% | 72% |
|
|
379
|
-
| Qwen3 Max | 85.0% | 80.0% | 80.0% | 90% | 75% |
|
|
380
|
-
| DeepSeek v3.2 | 83.6% | 80.0% | 80.0% | 96% | 65% |
|
|
381
|
-
| GLM-4.7 | 72.7% | 79.6% | 80.0% | 94% | 82% |
|
|
382
|
-
| Gemini 3 Flash | 78.4% | 75.1% | 80.0% | 95% | 90% |
|
|
383
|
-
|
|
384
|
-
**Observations**:
|
|
385
|
-
- Functional scores range from 72.7% to 85.0%—a 12-point spread revealing real capability differences
|
|
386
|
-
- Quality scores cluster around 79-80%, suggesting similar code hygiene across models
|
|
387
|
-
- Visual compliance is uniformly high (80%), indicating models handle UI requirements well
|
|
388
|
-
- Cost efficiency varies dramatically based on token usage and pricing
|
|
389
|
-
|
|
390
|
-
---
|
|
391
|
-
|
|
392
|
-
## 7. Conclusion
|
|
393
|
-
|
|
394
|
-
The current generation of AI coding agent benchmarks optimizes for the wrong objective. High SWE-bench scores do not translate to production readiness. Pass rates ignore quality debt. Single-repository evaluations miss the polyglot reality of modern development.
|
|
395
|
-
|
|
396
|
-
VibeCodingBench proposes a return to first principles: measure what developers actually do, evaluate what production actually requires, and score what businesses actually value. By grounding our benchmark in empirical workflow data and multi-dimensional evaluation, we aim to create a signal that predicts real-world utility—not just benchmark leaderboard position.
|
|
397
|
-
|
|
398
|
-
The ultimate test of an AI coding agent is not whether it can fix a bug in scikit-learn. It is whether it can help a developer ship a feature their users need, safely and efficiently, before the sprint ends.
|
|
399
|
-
|
|
400
|
-
---
|
|
401
|
-
|
|
402
|
-
## References
|
|
403
|
-
|
|
404
|
-
### Industry Reports
|
|
405
|
-
|
|
406
|
-
1. GitHub. (2025). *Octoverse 2025: The State of Open Source*. https://octoverse.github.com/
|
|
407
|
-
2. GitHub Blog. (2025). *What 986 million code pushes say about the developer workflow in 2025*. https://github.blog/news-insights/octoverse/what-986-million-code-pushes-say-about-the-developer-workflow-in-2025/
|
|
408
|
-
3. Stack Overflow. (2025). *2025 Developer Survey*. https://survey.stackoverflow.co/2025/
|
|
409
|
-
4. Stack Overflow Blog. (2025). *Developers remain willing but reluctant to use AI*. https://stackoverflow.blog/2025/12/29/developers-remain-willing-but-reluctant-to-use-ai-the-2025-developer-survey-results-are-here
|
|
410
|
-
5. Qodo. (2025). *State of AI Code Quality in 2025*. https://www.qodo.ai/reports/state-of-ai-code-quality/
|
|
411
|
-
6. Worklytics. (2025). *Software Engineering Productivity Benchmarks 2025*. https://www.worklytics.co/resources/software-engineering-productivity-benchmarks-2025-good-scores
|
|
412
|
-
|
|
413
|
-
### Academic Research
|
|
414
|
-
|
|
415
|
-
7. Meyer, A., et al. (2019). *Today was a Good Day: The Daily Life of Software Developers*. IEEE Transactions on Software Engineering. https://www.microsoft.com/en-us/research/wp-content/uploads/2019/04/devtime-preprint-TSE19.pdf
|
|
416
|
-
8. SWE-EVO Authors. (2025). *SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios*. arXiv:2512.18470. https://arxiv.org/abs/2512.18470
|
|
417
|
-
9. SWE-bench Pro Authors. (2025). *SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?*. arXiv:2509.16941. https://arxiv.org/abs/2509.16941
|
|
418
|
-
10. Security Analysis. (2025). *Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis*. arXiv:2508.14727. https://arxiv.org/abs/2508.14727
|
|
419
|
-
|
|
420
|
-
### Benchmark Analysis
|
|
421
|
-
|
|
422
|
-
11. OpenAI. (2024). *Introducing SWE-bench Verified*. https://openai.com/index/introducing-swe-bench-verified/
|
|
423
|
-
12. Runloop AI. (2025). *SWE-Bench Deep Dive: Unmasking the Limitations of a Popular Benchmark*. https://runloop.ai/blog/swe-bench-deep-dive-unmasking-the-limitations-of-a-popular-benchmark
|
|
424
|
-
13. Scale AI. (2025). *SWE-Bench Pro Leaderboard*. https://scale.com/leaderboard/swe_bench_pro_public
|
|
425
|
-
14. Amazon AWS. (2025). *Amazon introduces SWE-PolyBench*. https://aws.amazon.com/blogs/devops/amazon-introduces-swe-polybench-a-multi-lingual-benchmark-for-ai-coding-agents/
|
|
426
|
-
|
|
427
|
-
### Developer Productivity
|
|
428
|
-
|
|
429
|
-
15. Sonar. (2025). *How much time do developers spend actually writing code?*. https://www.sonarsource.com/blog/how-much-time-do-developers-spend-actually-writing-code/
|
|
430
|
-
16. Software.com. (2025). *Code Time Report*. https://www.software.com/reports/code-time-report
|
|
431
|
-
17. DX. (2025). *Measuring AI code assistants and agents*. https://getdx.com/research/measuring-ai-code-assistants-and-agents/
|
|
432
|
-
18. Augment Code. (2025). *How to Test AI Coding Assistants: 7 Enterprise Benchmarks*. https://www.augmentcode.com/guides/how-to-test-ai-coding-assistants-7-enterprise-benchmarks
|
|
433
|
-
19. Microsoft Research. (2024). *Time Warp: The Gap Between Developers' Ideal vs Actual Workweek*. https://www.microsoft.com/en-us/research/wp-content/uploads/2024/11/Time-Warp-Developer-Productivity-Study.pdf
|
|
434
|
-
20. JetBrains. (2025). *State of Developer Ecosystem 2025*. https://devecosystem-2025.jetbrains.com/
|
|
435
|
-
21. Atlassian. (2025). *2025 State of Developer Experience Report*. https://www.atlassian.com/blog/developer/developer-experience-report-2025
|
|
436
|
-
22. METR. (2025). *Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity*. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
|
|
437
|
-
23. McKinsey. (2025). *Unleashing Developer Productivity with Generative AI*. https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/unleashing-developer-productivity-with-generative-ai
|
|
438
|
-
|
|
439
|
-
---
|
|
440
|
-
|
|
441
|
-
*VibeCodingBench is a project by Alt Research. Contributions welcome at https://github.com/alt-research/vibe-coding-benchmark-public*
|
|
@@ -1,138 +0,0 @@
|
|
|
1
|
-
# Code Evolution Category
|
|
2
|
-
|
|
3
|
-
Research shows 80% of real engineering work is maintenance/evolution, not greenfield development. The SWE-EVO benchmark demonstrates that agents score only 21% on evolution tasks vs 65% on simple fixes, highlighting a critical gap in current benchmarks.
|
|
4
|
-
|
|
5
|
-
## Motivation
|
|
6
|
-
|
|
7
|
-
Most existing benchmarks focus on:
|
|
8
|
-
- Bug fixes with clear error messages
|
|
9
|
-
- Greenfield implementations with no constraints
|
|
10
|
-
- Isolated changes without broader context
|
|
11
|
-
|
|
12
|
-
Real-world developers primarily:
|
|
13
|
-
- Migrate legacy codebases to modern frameworks
|
|
14
|
-
- Extend features in existing, complex systems
|
|
15
|
-
- Refactor code while maintaining backwards compatibility
|
|
16
|
-
- Review PRs and suggest improvements
|
|
17
|
-
|
|
18
|
-
## Subcategories
|
|
19
|
-
|
|
20
|
-
### 1. Legacy Migration (`legacy-migration/`)
|
|
21
|
-
|
|
22
|
-
Tasks involving upgrading dependencies or migrating between frameworks while preserving functionality.
|
|
23
|
-
|
|
24
|
-
**Examples:**
|
|
25
|
-
- jQuery → React migration
|
|
26
|
-
- Express → Fastify migration
|
|
27
|
-
- Class components → Hooks
|
|
28
|
-
- CommonJS → ESM
|
|
29
|
-
- Upgrade from deprecated APIs
|
|
30
|
-
|
|
31
|
-
**Evaluation Criteria:**
|
|
32
|
-
- All existing tests pass
|
|
33
|
-
- No functional regressions
|
|
34
|
-
- Migration completeness (no legacy remnants)
|
|
35
|
-
- Code quality improvement
|
|
36
|
-
|
|
37
|
-
### 2. Feature Extension (`feature-extension/`)
|
|
38
|
-
|
|
39
|
-
Adding new features to existing codebases, not greenfield. The agent must understand and integrate with existing architecture.
|
|
40
|
-
|
|
41
|
-
**Examples:**
|
|
42
|
-
- Add pagination to existing list endpoint
|
|
43
|
-
- Add caching layer to existing service
|
|
44
|
-
- Add audit logging to existing operations
|
|
45
|
-
- Add rate limiting to existing API
|
|
46
|
-
|
|
47
|
-
**Evaluation Criteria:**
|
|
48
|
-
- Feature works correctly
|
|
49
|
-
- Integrates with existing patterns
|
|
50
|
-
- Doesn't break existing functionality
|
|
51
|
-
- Follows codebase conventions
|
|
52
|
-
|
|
53
|
-
### 3. Refactor & Preserve (`refactor-preserve/`)
|
|
54
|
-
|
|
55
|
-
Restructuring code while maintaining all existing tests and behavior. Emphasizes understanding invariants.
|
|
56
|
-
|
|
57
|
-
**Examples:**
|
|
58
|
-
- Extract service from monolith
|
|
59
|
-
- Convert sync to async operations
|
|
60
|
-
- Improve error handling patterns
|
|
61
|
-
- Reduce cyclomatic complexity
|
|
62
|
-
|
|
63
|
-
**Evaluation Criteria:**
|
|
64
|
-
- All tests still pass
|
|
65
|
-
- Same API surface (unless explicitly changing)
|
|
66
|
-
- Measurable improvement (complexity, coverage, etc.)
|
|
67
|
-
- No behavioral changes
|
|
68
|
-
|
|
69
|
-
### 4. Code Review (`code-review/`)
|
|
70
|
-
|
|
71
|
-
Given a PR diff, identify issues, suggest fixes, and potentially implement corrections.
|
|
72
|
-
|
|
73
|
-
**Examples:**
|
|
74
|
-
- Security vulnerability detection
|
|
75
|
-
- Performance issue identification
|
|
76
|
-
- Logic bug detection
|
|
77
|
-
- Style/pattern consistency review
|
|
78
|
-
|
|
79
|
-
**Evaluation Criteria:**
|
|
80
|
-
- Issues correctly identified
|
|
81
|
-
- False positive rate
|
|
82
|
-
- Fix suggestions are correct
|
|
83
|
-
- Explanation quality
|
|
84
|
-
|
|
85
|
-
## Task Structure
|
|
86
|
-
|
|
87
|
-
Each task in code-evolution provides:
|
|
88
|
-
|
|
89
|
-
```yaml
|
|
90
|
-
name: "Migration/Feature Name"
|
|
91
|
-
category: code-evolution
|
|
92
|
-
subcategory: legacy-migration | feature-extension | refactor-preserve | code-review
|
|
93
|
-
difficulty: easy | medium | hard
|
|
94
|
-
baseCode: path/to/existing/codebase
|
|
95
|
-
tests:
|
|
96
|
-
existing: tests/ # Must continue to pass
|
|
97
|
-
new: tests/new/ # New tests for added functionality
|
|
98
|
-
constraints:
|
|
99
|
-
- "Must not modify public API"
|
|
100
|
-
- "All existing tests must pass"
|
|
101
|
-
- "No new dependencies unless approved"
|
|
102
|
-
```
|
|
103
|
-
|
|
104
|
-
## Scoring
|
|
105
|
-
|
|
106
|
-
Evolution tasks use weighted scoring:
|
|
107
|
-
|
|
108
|
-
| Dimension | Weight | Description |
|
|
109
|
-
|-----------|--------|-------------|
|
|
110
|
-
| Existing Tests | 40% | All pre-existing tests must pass |
|
|
111
|
-
| New Functionality | 30% | New feature/migration works correctly |
|
|
112
|
-
| Code Quality | 15% | Maintainability, complexity reduction |
|
|
113
|
-
| Integration | 15% | Follows existing patterns, conventions |
|
|
114
|
-
|
|
115
|
-
## Example Task: Express to Fastify Migration
|
|
116
|
-
|
|
117
|
-
```yaml
|
|
118
|
-
name: Express to Fastify Migration
|
|
119
|
-
category: code-evolution
|
|
120
|
-
subcategory: legacy-migration
|
|
121
|
-
difficulty: medium
|
|
122
|
-
baseCode: examples/express-api/
|
|
123
|
-
stack: nodejs
|
|
124
|
-
|
|
125
|
-
description: |
|
|
126
|
-
Migrate a REST API from Express.js to Fastify while
|
|
127
|
-
maintaining all existing functionality and API contracts.
|
|
128
|
-
|
|
129
|
-
constraints:
|
|
130
|
-
- All existing integration tests must pass
|
|
131
|
-
- API endpoints must remain identical
|
|
132
|
-
- Response formats must match exactly
|
|
133
|
-
- Error handling patterns must be preserved
|
|
134
|
-
|
|
135
|
-
tests:
|
|
136
|
-
existing: tests/integration/
|
|
137
|
-
new: tests/fastify-specific/
|
|
138
|
-
```
|