npm - @botlearn/code-review - Versions diffs - 0.1.0 - Mend

@botlearn/code-review 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

package/LICENSE +21 -0
package/README.md +35 -0
package/knowledge/anti-patterns.md +74 -0
package/knowledge/best-practices.md +90 -0
package/knowledge/domain.md +129 -0
package/manifest.json +26 -0
package/package.json +35 -0
package/skill.md +46 -0
package/strategies/main.md +72 -0
package/tests/benchmark.json +476 -0
package/tests/smoke.json +54 -0

package/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2025 BotLearn
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.md ADDED

@@ -0,0 +1,35 @@
+# @botlearn/code-review
+> Systematic code review identifying security vulnerabilities, performance issues, and code smells with human-level coverage for OpenClaw Agent
+## Installation
+```bash
+# via npm
+npm install @botlearn/code-review
+# via clawhub
+clawhub install @botlearn/code-review
+```
+## Category
+programming-assistance
+## Dependencies
+None
+## Files
+| File | Description |
+|------|-------------|
+| `manifest.json` | Skill metadata and configuration |
+| `skill.md` | Role definition and activation rules |
+| `knowledge/` | Domain knowledge documents |
+| `strategies/` | Behavioral strategy definitions |
+| `tests/` | Smoke and benchmark tests |
+## License
+MIT

package/knowledge/anti-patterns.md ADDED

@@ -0,0 +1,74 @@
+---
+domain: code-review
+topic: anti-patterns
+priority: medium
+ttl: 30d
+---
+# Code Review — Anti-Patterns
+## Reviewer Anti-Patterns
+### 1. Nitpicking Over Substance
+**Problem**: Spending review time on formatting, whitespace, and semicolons while missing a SQL injection vulnerability two lines below.
+**Fix**: Follow the priority order — Security → Correctness → Performance → Maintainability → Style. Only comment on style after all substantive issues are addressed.
+### 2. Vague Feedback
+**Problem**: Comments like "this is wrong" or "needs improvement" without specifying what, why, or how to fix.
+**Fix**: Every finding must include: location, issue, impact, severity, and a concrete fix suggestion.
+### 3. Missing Security Issues
+**Problem**: Reviewing for code quality but overlooking injection, authentication bypass, or data exposure vulnerabilities.
+**Fix**: Systematically scan against the OWASP Top 10 checklist before reviewing anything else. Use the domain knowledge security section as a reference.
+### 4. Context-Free Review
+**Problem**: Flagging issues that are acceptable in the given context (e.g., flagging `eval()` in a REPL tool, or missing CSRF protection on a CLI endpoint).
+**Fix**: Identify the language, framework, and deployment context first. Adjust severity ratings accordingly.
+### 5. Overwhelming with Low-Priority Issues
+**Problem**: Reporting 50 Low/Info findings that bury the 2 Critical findings. The developer can't see what matters.
+**Fix**: Limit Low/Info findings to the top 5-10 most impactful. Always present Critical/High findings first and separately.
+### 6. Style-as-Severity Inflation
+**Problem**: Marking naming conventions or formatting as "High" severity to get attention.
+**Fix**: Use the severity classification strictly. Style issues are always Low or Info unless they cause actual confusion or bugs.
+### 7. Missing Positive Feedback
+**Problem**: Only reporting problems, never acknowledging good patterns. This makes reviews feel adversarial.
+**Fix**: Include at least 1-2 positive observations (good error handling, clean abstraction, thorough tests).
+## Analysis Anti-Patterns
+### 8. Surface-Level Scanning
+**Problem**: Only checking for obvious issues (typos, missing semicolons) without tracing data flow or control flow for deeper bugs.
+**Fix**: For security issues, trace user input from entry point through all transformations to output. For logic errors, mentally execute edge cases.
+### 9. Single-File Blindness
+**Problem**: Reviewing a file in isolation without considering how it interacts with the rest of the codebase.
+**Fix**: Consider the call chain — who calls this function? What data flows into it? What happens with its output?
+### 10. Assuming Framework Safety
+**Problem**: Trusting that the framework prevents all security issues (e.g., "React prevents XSS" — but `dangerouslySetInnerHTML` bypasses this).
+**Fix**: Know the escape hatches in common frameworks and check for their misuse.
+### 11. Ignoring Error Paths
+**Problem**: Only reviewing the happy path. Errors, timeouts, and partial failures are where most bugs live.
+**Fix**: For each function, ask: "What happens when this fails? What happens with null/undefined input? What happens on timeout?"
+### 12. False Positive Flooding
+**Problem**: Reporting issues that aren't actually problems, eroding trust in the review process.
+**Fix**: Verify each finding before reporting. If unsure, mark as "Potential issue — verify in context" with Info severity.
+## Output Anti-Patterns
+### 13. No Prioritization
+**Problem**: Presenting all findings in a flat list with no ordering. The developer doesn't know where to start.
+**Fix**: Always group by severity (Critical → High → Medium → Low → Info) and provide a "Top 3 action items" summary.
+### 14. Fix Without Explanation
+**Problem**: Saying "change X to Y" without explaining why. The developer makes the change but doesn't understand the vulnerability, leading to recurrence.
+**Fix**: Every fix suggestion should include a one-sentence explanation of WHY the original code is problematic.
+### 15. Missing Overall Assessment
+**Problem**: Only listing individual findings without a holistic quality assessment. The developer doesn't know if the code is "mostly good with minor issues" or "fundamentally problematic."
+**Fix**: Always include an overall health summary scoring security, performance, maintainability, and reliability.
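Anti-patterns 5 and 13 in `knowledge/anti-patterns.md` both come down to ordering the findings and capping the low-priority tail. The Python sketch below shows one way that triage could look. The `Finding` fields, the numeric `impact` score, and the `triage` helper are illustrative assumptions and are not part of the package.

```python
from dataclasses import dataclass

# Severity order as stated in the text: Critical, High, Medium, Low, Info.
SEVERITY_ORDER = ["Critical", "High", "Medium", "Low", "Info"]
RANK = {name: index for index, name in enumerate(SEVERITY_ORDER)}


@dataclass
class Finding:
    location: str
    issue: str
    severity: str
    impact: int  # hypothetical 1-10 score, used only to rank within a severity


def triage(findings, minor_cap=5):
    """Order findings by severity and cap the Low/Info tail."""
    by_priority = sorted(findings, key=lambda f: (RANK[f.severity], -f.impact))
    major = [f for f in by_priority if f.severity in ("Critical", "High", "Medium")]
    minor = [f for f in by_priority if f.severity in ("Low", "Info")]
    # Keep only the most impactful minor findings, then restore severity order.
    kept_minor = sorted(minor, key=lambda f: -f.impact)[:minor_cap]
    kept_minor.sort(key=lambda f: (RANK[f.severity], -f.impact))
    return {"findings": major + kept_minor, "top_actions": by_priority[:3]}
```

Calling `triage(findings, minor_cap=5)` returns the ordered report under `"findings"` and the three highest-priority items under `"top_actions"`, which corresponds to the "Top 3 action items" summary the text asks for.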

package/knowledge/best-practices.md ADDED

@@ -0,0 +1,90 @@
+---
+domain: code-review
+topic: best-practices
+priority: high
+ttl: 30d
+---
+# Code Review — Best Practices
+## Structured Review Methodology
+### Review Order (Priority-First)
+1. **Security** — Scan for OWASP Top 10 vulnerabilities first; these are Critical/High
+2. **Correctness** — Logic errors, off-by-one, null/undefined handling, edge cases
+3. **Performance** — N+1 queries, memory leaks, algorithmic complexity
+4. **Concurrency** — Race conditions, deadlocks, thread safety
+5. **Maintainability** — Code smells, naming, structure, test coverage
+6. **Style** — Only after all substantive issues are addressed
+### Review Scope Awareness
+- Before reviewing, identify: language, framework, runtime environment, deployment context
+- Adjust severity based on context (e.g., SQL injection in a CLI tool vs. a public API)
+- Consider the blast radius of each issue (affects one user vs. all users vs. data integrity)
+## Severity Classification
+| Severity | Criteria | Action | Examples |
+|----------|----------|--------|----------|
+| **Critical** | Exploitable security vulnerability or data loss risk | Must fix before merge | SQL injection, authentication bypass, data exposure |
+| **High** | Significant bug, security weakness, or performance degradation | Should fix before merge | Missing authorization check, N+1 in hot path, race condition |
+| **Medium** | Code smell, moderate performance issue, or minor security hardening | Fix in next iteration | Long method, missing input validation on internal API, unnecessary allocation |
+| **Low** | Improvement opportunity, minor smell, or style preference | Nice to have | Naming improvement, dead code removal, minor refactoring |
+| **Info** | Observation, suggestion, or praise | Optional | Alternative approach suggestion, positive reinforcement |
+## Constructive Feedback Patterns
+### Finding Format
+Each finding should include:
+1. **Location**: File, line number, function name
+2. **Issue**: What is wrong (specific, factual)
+3. **Impact**: Why it matters (security risk, performance cost, maintenance burden)
+4. **Severity**: Critical / High / Medium / Low / Info
+5. **Fix**: Concrete suggestion with code example when possible
+### Good vs. Bad Feedback
+**Bad**: "This code is messy and has issues."
+**Good**: "The `processOrder()` function (lines 42-98) has 4 levels of nesting and handles validation, business logic, and persistence. Extract validation into `validateOrder()` and persistence into `saveOrder()` to improve testability. [Medium]"
+**Bad**: "Security issue here."
+**Good**: "SQL injection vulnerability at line 23: `db.query('SELECT * FROM users WHERE id = ' + req.params.id)`. An attacker can manipulate the `id` parameter to execute arbitrary SQL. Fix: Use parameterized queries: `db.query('SELECT * FROM users WHERE id = $1', [req.params.id])`. [Critical]"
+### Positive Reinforcement
+- Acknowledge good patterns when found (proper error handling, good test coverage, clean abstractions)
+- This builds trust and helps the developer know what to keep doing
+## Overall Health Assessment
+### Quality Dimensions
+Score each dimension on a 1-5 scale:
+| Dimension | What to assess |
+|-----------|---------------|
+| **Security** | Authentication, authorization, input validation, data protection |
+| **Performance** | Query patterns, algorithm efficiency, memory usage, caching |
+| **Maintainability** | Code organization, naming, complexity, coupling |
+| **Reliability** | Error handling, edge cases, logging, recovery |
+| **Test Coverage** | Presence of tests, coverage of critical paths, edge case tests |
+### Prioritized Action Items
+Always end the review with a prioritized list:
+1. **Must fix** — Critical and High severity issues
+2. **Should fix** — Medium severity issues with clear ROI
+3. **Consider** — Low severity improvements for future cleanup
+## Fix Suggestion Quality
+### Always Provide Copy-Pasteable Code
+When suggesting a fix, provide code that can be directly used:
+```
+// Before (vulnerable)
+const user = await db.query(`SELECT * FROM users WHERE id = ${id}`);
+// After (safe)
+const user = await db.query('SELECT * FROM users WHERE id = $1', [id]);
+```
+### Explain the Fix
+Brief explanation of WHY the fix works, not just WHAT to change. This teaches the developer and prevents recurrence.
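The before/after query in `knowledge/best-practices.md` can be reproduced end to end with Python's standard `sqlite3` module. This is a sketch under assumptions: the `users` table, its two rows, and the helper names are invented for illustration, and `sqlite3` uses `?` placeholders where the original example uses `$1`.

```python
import sqlite3


def make_db():
    """Build an in-memory database with two example users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(1, "alice"), (2, "bob")],
    )
    return conn


def find_user_vulnerable(conn, user_id):
    # Before (vulnerable): the input becomes part of the SQL text itself.
    return conn.execute(
        f"SELECT name FROM users WHERE id = {user_id}"
    ).fetchall()


def find_user_safe(conn, user_id):
    # After (safe): the input is bound as a value and is never parsed as SQL.
    return conn.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchall()
```

With the input `"1 OR 1=1"`, the vulnerable helper returns every row because the `OR` clause is parsed as SQL, while the safe helper returns no rows because the whole string is compared against `id` as a single value. That difference is the one-sentence "why" the text asks each fix to carry.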

package/knowledge/domain.md ADDED

@@ -0,0 +1,129 @@
+---
+domain: code-review
+topic: security-performance-smells
+priority: high
+ttl: 30d
+---
+# Code Review — Domain Knowledge
+## OWASP Top 10 (2021)
+### A01: Broken Access Control
+- Missing authorization checks on API endpoints
+- IDOR (Insecure Direct Object References): `GET /api/users/123` without verifying the caller owns the resource
+- Path traversal: `../../../etc/passwd` in file operations
+- Missing function-level access control (admin routes accessible to regular users)
+### A02: Cryptographic Failures
+- Hardcoded secrets, API keys, or passwords in source code
+- Weak hashing (MD5, SHA1 for passwords instead of bcrypt/argon2)
+- Missing encryption for sensitive data at rest or in transit
+- Predictable session tokens or insufficiently random values
+### A03: Injection
+- **SQL Injection**: String concatenation in queries: `"SELECT * FROM users WHERE id = " + userId`
+- **NoSQL Injection**: Unsanitized MongoDB queries: `{ $where: userInput }`
+- **Command Injection**: `exec("ping " + hostname)` without sanitization
+- **XSS**: Rendering user input without escaping in HTML/templates
+- **Template Injection**: User input in template strings evaluated server-side
+### A04: Insecure Design
+- Missing rate limiting on authentication endpoints
+- No account lockout after failed login attempts
+- Missing CSRF tokens on state-changing requests
+- Business logic flaws (e.g., negative quantity in shopping cart)
+### A05: Security Misconfiguration
+- Debug mode enabled in production
+- Default credentials unchanged
+- Overly permissive CORS (`Access-Control-Allow-Origin: *` with credentials)
+- Verbose error messages exposing stack traces to users
+### A06: Vulnerable and Outdated Components
+- Known CVEs in dependencies
+- Unmaintained packages with no security patches
+- Outdated framework versions with known vulnerabilities
+### A07: Identification and Authentication Failures
+- Weak password policies
+- Missing MFA on sensitive operations
+- Session fixation or session ID exposure in URLs
+- JWT issues: no expiration, `alg: none`, weak signing key
+### A08: Software and Data Integrity Failures
+- Deserialization of untrusted data (pickle, Java serialization)
+- Missing integrity verification on software updates or CI/CD pipelines
+- Prototype pollution in JavaScript
+### A09: Security Logging and Monitoring Failures
+- No logging of authentication events
+- Sensitive data in logs (passwords, tokens, PII)
+- Missing audit trail for admin operations
+### A10: Server-Side Request Forgery (SSRF)
+- Fetching user-provided URLs without validation
+- Internal service access via URL manipulation
+## Performance Anti-Patterns
+### N+1 Query Problem
+Fetching a list then querying for each item's related data individually:
+```
+users = db.query("SELECT * FROM users")
+for user in users:
+    orders = db.query(f"SELECT * FROM orders WHERE user_id = {user.id}")
+```
+**Fix**: Use JOINs, eager loading, or batch queries.
+### Memory Leaks
+- Event listeners not removed (Node.js `emitter.on()` without `off()`)
+- Growing caches without eviction policies
+- Closures capturing large objects unnecessarily
+- Timers (`setInterval`) not cleared on cleanup
+### Blocking I/O in Async Context
+- Synchronous file reads in an async handler
+- CPU-intensive computation on the event loop
+- Missing `await` on promises (fire-and-forget without error handling)
+### Inefficient Algorithms
+- O(n²) loops where O(n) or O(n log n) solutions exist
+- Repeated string concatenation in loops (vs. StringBuilder/join)
+- Unnecessary full-array scans where a hash lookup would work
+### Unnecessary Re-renders (Frontend)
+- Missing `React.memo`, `useMemo`, or `useCallback` for expensive components
+- Unstable references in dependency arrays
+- State stored too high in the component tree
+## Code Smell Catalog (Fowler)
+| Smell | Description | Severity |
+|-------|-------------|----------|
+| **Long Method** | Function > 30 lines or > 4 levels of nesting | Medium |
+| **Large Class** | Class with > 10 public methods or > 300 lines | Medium |
+| **Feature Envy** | Method uses more data from another class than its own | Low |
+| **Data Clump** | Same group of parameters passed to multiple functions | Low |
+| **Primitive Obsession** | Using primitives instead of small domain objects | Low |
+| **Divergent Change** | One class changed for multiple unrelated reasons | Medium |
+| **Shotgun Surgery** | One change requires modifying many classes | High |
+| **Dead Code** | Unreachable or unused code paths | Low |
+| **Speculative Generality** | Unused abstractions or parameters "for the future" | Low |
+| **God Object** | Single class/module that knows or does too much | High |
+## Concurrency Issues
+### Race Conditions
+- Shared mutable state accessed from multiple goroutines/threads without synchronization
+- Check-then-act patterns without atomicity (`if file.exists() then file.read()`)
+- Non-atomic read-modify-write operations on shared counters
+### Deadlocks
+- Two locks acquired in different orders across code paths
+- Lock held during I/O operations (holding DB lock while calling external API)
+### Thread Safety
+- Non-thread-safe data structures used in concurrent contexts (HashMap vs ConcurrentHashMap)
+- Missing `volatile` / atomics for shared flags
+- Shared mutable state in singleton services
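The N+1 example in `knowledge/domain.md` names JOINs as one fix. The sketch below contrasts the two approaches using Python's standard `sqlite3` module. The schema, the sample rows, and the manual `query_count` bookkeeping are assumptions made only for this illustration.

```python
import sqlite3


def make_db():
    """Build an in-memory database with two users and three orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 1, "book"), (2, 1, "pen"), (3, 2, "lamp")],
    )
    return conn


def orders_n_plus_one(conn):
    """Anti-pattern: one query for the users, then one more per user."""
    result, query_count = {}, 1
    users = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    for user_id, name in users:
        rows = conn.execute(
            "SELECT item FROM orders WHERE user_id = ? ORDER BY id", (user_id,)
        ).fetchall()
        query_count += 1
        result[name] = [item for (item,) in rows]
    return result, query_count


def orders_joined(conn):
    """Fix: a single LEFT JOIN returns users and their orders together."""
    rows = conn.execute(
        "SELECT u.name, o.item FROM users u "
        "LEFT JOIN orders o ON o.user_id = u.id ORDER BY u.id, o.id"
    ).fetchall()
    result = {}
    for name, item in rows:
        items = result.setdefault(name, [])  # keeps users that have no orders
        if item is not None:
            items.append(item)
    return result, 1
```

Both helpers return the same mapping of user name to items. The first issues one query plus one per user, and the second issues a single query regardless of how many users exist.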

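The check-then-act item under Race Conditions in `knowledge/domain.md` (`if file.exists() then file.read()`) can be avoided by attempting the operation and handling the failure. The sketch below is one common way to do that in Python. The helper name `read_if_present` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_if_present(path: Path) -> Optional[str]:
    """Read a file without a separate existence check.

    Checking path.exists() and then reading leaves a window in which
    another process can delete the file. Attempting the read and handling
    FileNotFoundError removes that window.
    """
    try:
        return path.read_text(encoding="utf-8")
    except FileNotFoundError:
        return None
```

The function returns `None` for a missing file and the file's text otherwise, so the caller makes one decision based on one operation instead of two.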
package/manifest.json ADDED

@@ -0,0 +1,26 @@
+{
+  "name": "@botlearn/code-review",
+  "version": "0.1.0",
+  "description": "Systematic code review identifying security vulnerabilities, performance issues, and code smells with human-level coverage for OpenClaw Agent",
+  "category": "programming-assistance",
+  "author": "BotLearn",
+  "benchmarkDimension": "code-generation",
+  "expectedImprovement": 35,
+  "dependencies": {},
+  "compatibility": {
+    "openclaw": ">=0.5.0"
+  },
+  "files": {
+    "skill": "skill.md",
+    "knowledge": [
+      "knowledge/domain.md",
+      "knowledge/best-practices.md",
+      "knowledge/anti-patterns.md"
+    ],
+    "strategies": [
+      "strategies/main.md"
+    ],
+    "smokeTest": "tests/smoke.json",
+    "benchmark": "tests/benchmark.json"
+  }
+}
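The `files` block of `manifest.json` points at other files in the package, so a quick consistency check against the file list at the top of this diff is possible. The sketch below is only an illustration: the helper names are hypothetical, the package does not ship such a check, and how OpenClaw itself resolves these paths is not shown in the diff.

```python
def referenced_paths(manifest):
    """Flatten manifest['files'] into a list of relative paths."""
    paths = []
    for value in manifest.get("files", {}).values():
        if isinstance(value, str):
            paths.append(value)
        else:
            paths.extend(value)
    return paths


def missing_files(manifest, packaged_files):
    """Return manifest references that are absent from the package."""
    return [p for p in referenced_paths(manifest) if p not in packaged_files]


# The "files" block copied from manifest.json in this diff.
MANIFEST = {
    "files": {
        "skill": "skill.md",
        "knowledge": [
            "knowledge/domain.md",
            "knowledge/best-practices.md",
            "knowledge/anti-patterns.md",
        ],
        "strategies": ["strategies/main.md"],
        "smokeTest": "tests/smoke.json",
        "benchmark": "tests/benchmark.json",
    }
}

# File list from this diff, with the leading "package/" removed.
PACKAGED = {
    "LICENSE", "README.md", "manifest.json", "package.json", "skill.md",
    "knowledge/anti-patterns.md", "knowledge/best-practices.md",
    "knowledge/domain.md", "strategies/main.md",
    "tests/benchmark.json", "tests/smoke.json",
}
```

For this release, `missing_files(MANIFEST, PACKAGED)` returns an empty list, meaning every path the manifest references appears among the 11 changed files.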

package/package.json ADDED

@@ -0,0 +1,35 @@
+{
+  "name": "@botlearn/code-review",
+  "version": "0.1.0",
+  "description": "Systematic code review identifying security vulnerabilities, performance issues, and code smells with human-level coverage for OpenClaw Agent",
+  "type": "module",
+  "main": "manifest.json",
+  "files": [
+    "manifest.json",
+    "skill.md",
+    "knowledge/",
+    "strategies/",
+    "tests/",
+    "README.md"
+  ],
+  "keywords": [
+    "botlearn",
+    "openclaw",
+    "skill",
+    "programming-assistance"
+  ],
+  "author": "BotLearn",
+  "license": "MIT",
+  "repository": {
+    "type": "git",
+    "url": "https://github.com/readai-team/botlearn-awesome-skills.git",
+    "directory": "packages/skills/code-review"
+  },
+  "homepage": "https://github.com/readai-team/botlearn-awesome-skills/tree/main/packages/skills/code-review",
+  "bugs": {
+    "url": "https://github.com/readai-team/botlearn-awesome-skills/issues"
+  },
+  "publishConfig": {
+    "access": "public"
+  }
+}

package/skill.md ADDED

@@ -0,0 +1,46 @@
+---
+name: code-review
+role: Code Review Specialist
+version: 1.0.0
+triggers:
+  - "review code"
+  - "code review"
+  - "check this code"
+  - "find bugs"
+  - "security audit"
+  - "review my code"
+  - "check for vulnerabilities"
+  - "code quality"
+---
+# Role
+You are a Code Review Specialist. When activated, you perform systematic, multi-dimensional code reviews that identify security vulnerabilities, performance bottlenecks, code smells, and maintainability issues with human-level coverage. You provide actionable, severity-classified findings with concrete fix suggestions.
+# Capabilities
+1. Perform static analysis to detect code smells including long methods, deep nesting, duplicated logic, god classes, and inappropriate coupling
+2. Identify security vulnerabilities mapped to the OWASP Top 10, including injection flaws, broken authentication, sensitive data exposure, and insecure deserialization
+3. Detect performance anti-patterns such as N+1 queries, memory leaks, unnecessary allocations, blocking I/O in async contexts, and inefficient algorithms
+4. Recognize concurrency issues including race conditions, deadlocks, improper lock usage, and thread-unsafe shared state
+5. Classify each finding by severity (Critical / High / Medium / Low / Info) with confidence level and provide concrete, copy-pasteable fix suggestions
+6. Assess overall code health across security, performance, maintainability, and reliability dimensions
+# Constraints
+1. Never approve code with known Critical or High severity security vulnerabilities without explicit acknowledgment
+2. Never focus on cosmetic style issues at the expense of substantive security or correctness findings
+3. Never provide vague feedback — every finding must include the specific location, what is wrong, why it matters, and how to fix it
+4. Always prioritize findings by severity and business impact, presenting Critical issues first
+5. Always consider the broader context — the language, framework, and deployment environment — before flagging an issue
+6. Never assume benign intent for unsanitized inputs in security-sensitive contexts
+# Activation
+WHEN the user requests a code review, security audit, or bug-finding session:
+1. Identify the programming language, framework, and context of the code under review
+2. Execute the systematic review pipeline following strategies/main.md
+3. Apply security knowledge from knowledge/domain.md to detect vulnerabilities
+4. Evaluate findings against knowledge/best-practices.md for severity classification and constructive feedback
+5. Verify the review avoids pitfalls described in knowledge/anti-patterns.md
+6. Output a structured review report with severity-classified findings, fix suggestions, and an overall health assessment

package/strategies/main.md ADDED

@@ -0,0 +1,72 @@
+---
+strategy: code-review
+version: 1.0.0
+steps: 6
+---
+# Code Review Strategy
+## Step 1: Context Identification
+- Identify the **programming language**, **framework**, and **runtime environment**
+- Determine the **deployment context** (public API, internal service, CLI tool, library)
+- Note any **security-sensitive operations** (authentication, payment, PII handling)
+- IF the code interacts with a database THEN prioritize SQL injection and N+1 query checks
+- IF the code handles user input THEN prioritize injection and XSS checks
+- IF the code is async THEN prioritize race condition and error handling checks
+## Step 2: Security Scan (OWASP Top 10)
+- Trace all **user input entry points** through the code to their usage
+- Check for **injection vulnerabilities**: SQL, NoSQL, command, XSS, template injection
+  - Look for string concatenation in queries, `eval()`, `exec()`, unescaped HTML rendering
+- Check for **authentication/authorization**: missing auth checks, IDOR, privilege escalation
+- Check for **cryptographic issues**: hardcoded secrets, weak hashing, missing encryption
+- Check for **SSRF**: fetching user-provided URLs without validation
+- Check for **deserialization**: untrusted data deserialized without validation
+- Flag every security issue as **Critical** or **High** severity
+## Step 3: Performance Analysis
+- Scan for **N+1 query patterns**: loops containing database queries
+- Check for **memory leaks**: unclosed resources, growing caches, orphaned listeners
+- Identify **blocking operations** in async contexts: sync I/O, CPU-intensive loops on event loop
+- Evaluate **algorithmic complexity**: look for O(n²) where O(n) or O(n log n) is feasible
+- Check for **unnecessary allocations**: objects created in loops, redundant copies
+- IF performance issues are in a hot path THEN classify as **High**, else **Medium**
+## Step 4: Code Smell & Pattern Check
+- Apply the code smell catalog from knowledge/domain.md:
+  - **Long Method**: > 30 lines or > 4 nesting levels
+  - **Large Class**: > 10 public methods or > 300 lines
+  - **God Object**: single module handling unrelated responsibilities
+  - **Feature Envy**: method using more data from another class
+  - **Dead Code**: unreachable or unused code paths
+- Check for **logic errors**: off-by-one, null handling, edge cases, boundary conditions
+- Check for **naming**: misleading names, inconsistent conventions, abbreviations
+- Check for **error handling**: swallowed exceptions, missing `catch`, no error recovery
+- Apply knowledge/anti-patterns.md to verify review thoroughness
+## Step 5: Issue Classification
+- Assign **severity** to each finding using the classification from knowledge/best-practices.md:
+  - **Critical**: Exploitable vulnerability, data loss risk → Must fix
+  - **High**: Significant bug, security weakness, perf degradation → Should fix
+  - **Medium**: Code smell, moderate perf issue → Fix in next iteration
+  - **Low**: Improvement opportunity → Nice to have
+  - **Info**: Observation, suggestion, praise → Optional
+- Assign **confidence** level: Certain / Likely / Possible
+- For each finding, prepare: location, issue description, impact, severity, fix suggestion
+## Step 6: Report & Fix Suggestions
+- Structure the output report:
+  1. **Summary**: One-paragraph overall assessment
+  2. **Health Scores**: Security (1-5), Performance (1-5), Maintainability (1-5), Reliability (1-5)
+  3. **Critical/High Findings**: Listed first with full details and code fix examples
+  4. **Medium Findings**: Listed with fix suggestions
+  5. **Low/Info Findings**: Brief list (top 5-10 most impactful only)
+  6. **Positive Observations**: 1-2 things done well
+  7. **Action Items**: Top 3 prioritized next steps
+- Every fix suggestion must include a **code example** showing before/after
+- Every fix must include a **one-sentence explanation** of why the original is problematic
+- SELF-CHECK:
+  - Did I check all OWASP Top 10 categories?
+  - Did I trace all user input paths?
+  - Did I check error paths, not just happy paths?
+  - Did I avoid the anti-patterns from knowledge/anti-patterns.md?
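The IF/THEN rules in Step 1 and Step 3 of `strategies/main.md` are simple enough to write down as code. The Python sketch below is one possible reading of them. The function names and boolean parameters are assumptions, and the strategy file itself is prose for the agent rather than executable logic.

```python
def prioritized_checks(uses_database, handles_user_input, is_async):
    """Translate the Step 1 IF/THEN rules into an ordered list of checks."""
    checks = []
    if uses_database:
        checks += ["SQL injection", "N+1 queries"]
    if handles_user_input:
        checks += ["injection", "XSS"]
    if is_async:
        checks += ["race conditions", "error handling"]
    return checks


def performance_severity(in_hot_path):
    """Step 3 rule: High when the issue is in a hot path, otherwise Medium."""
    return "High" if in_hot_path else "Medium"
```

For an async service that talks to a database but takes no direct user input, `prioritized_checks(True, False, True)` yields the SQL injection, N+1, race condition, and error handling checks, in that order.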

package/tests/benchmark.json ADDED

@@ -0,0 +1,476 @@
+{
+  "version": "0.0.1",
+  "dimension": "code-generation",
+  "tasks": [
+    {
+      "id": "bench-easy-01",
+      "difficulty": "easy",
+      "description": "Detect XSS vulnerability in React component",
+      "input": "Review this React component for security issues:\n\n```jsx\nfunction UserProfile({ user }) {\n  return (\n    <div>\n      <h1>{user.name}</h1>\n      <div dangerouslySetInnerHTML={{ __html: user.bio }} />\n      <a href={user.website}>{user.website}</a>\n    </div>\n  );\n}\n```",
+      "rubric": [
+        {
+          "criterion": "XSS Detection",
+          "weight": 0.5,
+          "scoring": {
+            "5": "Identifies dangerouslySetInnerHTML as XSS vector; identifies javascript: protocol risk in href; explains attack scenario",
+            "3": "Identifies dangerouslySetInnerHTML but misses href issue",
+            "1": "Only vague mention of XSS",
+            "0": "Misses XSS entirely"
+          }
+        },
+        {
+          "criterion": "Fix Suggestion",
+          "weight": 0.3,
+          "scoring": {
+            "5": "Suggests DOMPurify for HTML sanitization and URL validation/allowlisting for href",
+            "3": "Suggests removing dangerouslySetInnerHTML but no sanitization alternative",
+            "1": "Vague fix",
+            "0": "No fix"
+          }
+        },
+        {
+          "criterion": "Severity",
+          "weight": 0.2,
+          "scoring": {
+            "5": "Correctly classifies as High/Critical severity",
+            "3": "Classifies as Medium",
+            "1": "Classifies as Low",
+            "0": "No severity"
+          }
+        }
+      ],
+      "expectedScoreWithout": 35,
+      "expectedScoreWith": 80
+    },
+    {
+      "id": "bench-easy-02",
+      "difficulty": "easy",
+      "description": "Detect missing error handling in async code",
+      "input": "Review this Node.js function for issues:\n\n```javascript\nasync function fetchUserData(userId) {\n  const response = await fetch(`https://api.example.com/users/${userId}`);\n  const data = await response.json();\n  const posts = await fetch(`https://api.example.com/users/${userId}/posts`);\n  const postsData = await posts.json();\n  return { user: data, posts: postsData };\n}\n```",
+      "rubric": [
+        {
+          "criterion": "Issue Detection",
+          "weight": 0.5,
+          "scoring": {
+            "5": "Identifies: no error handling for failed requests, no status code check, no try-catch, sequential requests that could be parallel, no timeout, no input validation",
+            "3": "Identifies 3-4 of the above",
+            "1": "Only identifies missing try-catch",
+            "0": "No issues found"
+          }
+        },
+        {
+          "criterion": "Fix Quality",
+          "weight": 0.3,
+          "scoring": {
+            "5": "Provides complete fix with try-catch, response.ok check, Promise.all for parallel requests, timeout, input validation",
+            "3": "Provides partial fix covering error handling",
+            "1": "Minimal fix",
+            "0": "No fix"
+          }
+        },
+        {
+          "criterion": "Report Structure",
+          "weight": 0.2,
+          "scoring": {
+            "5": "Findings organized by severity with clear action items",
+            "3": "Some organization",
+            "1": "Flat list",
+            "0": "No structure"
+          }
+        }
+      ],
+      "expectedScoreWithout": 30,
+      "expectedScoreWith": 75
+    },
+    {
+      "id": "bench-easy-03",
+      "difficulty": "easy",
+      "description": "Detect basic code smells and naming issues",
+      "input": "Review this Python function:\n\n```python\ndef p(d, t, x=None):\n    r = []\n    for i in d:\n        if i['t'] == t:\n            if x is not None:\n                if i['v'] > x:\n                    r.append(i)\n            else:\n                r.append(i)\n    if len(r) == 0:\n        return None\n    r2 = sorted(r, key=lambda i: i['v'], reverse=True)\n    return r2\n```",
+      "rubric": [
+        {
+          "criterion": "Smell Detection",
+          "weight": 0.4,
+          "scoring": {
+            "5": "Identifies: cryptic naming (p, d, t, x, r, r2, i), deep nesting, use of dict instead of dataclass/namedtuple, returning None vs empty list inconsistency",
+            "3": "Identifies naming and nesting issues",
+            "1": "Only mentions naming",
+            "0": "No issues found"
+          }
+        },
+        {
+          "criterion": "Refactoring Suggestion",
+          "weight": 0.4,
+          "scoring": {
+            "5": "Provides complete refactored version with descriptive names, flat structure (early return/list comprehension), type hints, and consistent return type",
+            "3": "Provides renamed version but structure unchanged",
+            "1": "Only suggests renaming",
+            "0": "No refactoring"
+          }
+        },
+        {
+          "criterion": "Severity Assessment",
+          "weight": 0.2,
+          "scoring": {
+            "5": "Correctly rates naming/readability as Medium severity and notes no security issues",
+            "3": "Reasonable severity assessment",
+            "1": "Inflated or missing severity",
+            "0": "No severity"
+          }
+        }
+      ],
+      "expectedScoreWithout": 35,
+      "expectedScoreWith": 75
+    },
+    {
+      "id": "bench-med-01",
+      "difficulty": "medium",
+      "description": "Detect SQL injection with ORM bypass",
+      "input": "Review this Django view for security issues:\n\n```python\nfrom django.http import JsonResponse\nfrom django.db import connection\nfrom .models import Product\n\ndef search_products(request):\n    query = request.GET.get('q', '')\n    category = request.GET.get('category', '')\n    min_price = request.GET.get('min_price', 0)\n    sort = request.GET.get('sort', 'name')\n    \n    products = Product.objects.filter(\n        name__icontains=query,\n        category__name=category\n    )\n    \n    with connection.cursor() as cursor:\n        cursor.execute(\n            f\"SELECT * FROM products_product WHERE price >= {min_price} ORDER BY {sort}\"\n        )\n        filtered = cursor.fetchall()\n    \n    return JsonResponse({'products': list(products.values()), 'filtered': filtered})\n```",
+      "rubric": [
+        {
+          "criterion": "SQL Injection Detection",
+          "weight": 0.35,
+          "scoring": {
+            "5": "Identifies both injection points: min_price (value injection) AND sort (column injection/ORDER BY injection); explains that the ORM usage above is safe but the raw query below bypasses it",
+            "3": "Identifies the f-string SQL issue but misses ORDER BY injection specifics",
+            "1": "Only vague SQL injection mention",
+            "0": "Misses SQL injection"
+          }
+        },
+        {
+          "criterion": "Additional Issues",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Identifies: redundant query (ORM + raw doing similar things), no pagination, no input validation for min_price type, missing authentication",
+            "3": "Identifies 2 additional issues",
+            "1": "Only SQL injection found",
+            "0": "No additional issues"
+          }
+        },
+        {
+          "criterion": "Fix Quality",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Provides parameterized query fix, allowlist for sort column, consolidated to single ORM query, input validation",
+            "3": "Provides parameterized fix but misses sort allowlist",
+            "1": "Generic fix suggestion",
+            "0": "No fix"
+          }
+        },
+        {
+          "criterion": "Report Structure",
+          "weight": 0.15,
+          "scoring": {
+            "5": "Structured with severity levels, grouped findings, clear action items",
+            "3": "Some structure",
+            "1": "Flat list",
+            "0": "No structure"
+          }
+        }
+      ],
+      "expectedScoreWithout": 30,
+      "expectedScoreWith": 70
+    },
+    {
+      "id": "bench-med-02",
+      "difficulty": "medium",
+      "description": "Detect N+1 query and memory issues in data processing",
+      "input": "Review this TypeScript service for performance issues:\n\n```typescript\nasync function generateReport(orgId: string): Promise<Report> {\n  const departments = await db.departments.findMany({ where: { orgId } });\n  const report: Report = { departments: [] };\n\n  for (const dept of departments) {\n    const employees = await db.employees.findMany({ where: { departmentId: dept.id } });\n    const deptData: DepartmentReport = { name: dept.name, employees: [] };\n\n    for (const emp of employees) {\n      const reviews = await db.performanceReviews.findMany({ where: { employeeId: emp.id } });\n      const projects = await db.projectAssignments.findMany({ where: { employeeId: emp.id } });\n      \n      deptData.employees.push({\n        name: emp.name,\n        reviews: reviews,\n        projects: projects,\n        avgScore: reviews.reduce((sum, r) => sum + r.score, 0) / reviews.length\n      });\n    }\n    report.departments.push(deptData);\n  }\n  return report;\n}\n```",
+      "rubric": [
+        {
+          "criterion": "N+1 Detection",
+          "weight": 0.35,
+          "scoring": {
+            "5": "Identifies nested N+1 pattern: 1 + N + N*M + N*M queries (departments → employees → reviews + projects); calculates approximate query count for realistic org sizes",
+            "3": "Identifies N+1 but doesn't trace the full nesting depth",
+            "1": "Mentions 'too many queries' without specifics",
+            "0": "Misses the issue"
+          }
+        },
+        {
+          "criterion": "Additional Issues",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Identifies: potential division by zero (empty reviews), memory growth with large orgs, sequential reviews+projects that could be parallel, no pagination/limit, missing error handling",
+            "3": "Identifies 2-3 additional issues",
+            "1": "Only N+1 found",
+            "0": "No additional issues"
+          }
+        },
+        {
+          "criterion": "Fix Quality",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Provides fix using JOINs/includes/eager loading, Promise.all for parallel queries, division-by-zero guard, streaming or pagination for large datasets",
+            "3": "Provides JOIN-based fix but misses other improvements",
+            "1": "Generic fix",
+            "0": "No fix"
+          }
+        },
+        {
+          "criterion": "Severity",
+          "weight": 0.15,
+          "scoring": {
+            "5": "Correctly classifies N+1 as High (in a report generation context), division-by-zero as Medium",
+            "3": "Reasonable severity",
+            "1": "Incorrect severity",
+            "0": "No severity"
+          }
+        }
+      ],
+      "expectedScoreWithout": 25,
+      "expectedScoreWith": 70
+    },
+    {
+      "id": "bench-med-03",
+      "difficulty": "medium",
+      "description": "Detect race condition in concurrent code",
+      "input": "Review this Go HTTP handler for concurrency issues:\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"net/http\"\n    \"sync\"\n)\n\nvar (\n    requestCount int\n    cache        = make(map[string]string)\n)\n\nfunc handler(w http.ResponseWriter, r *http.Request) {\n    requestCount++\n    key := r.URL.Query().Get(\"key\")\n    \n    if val, ok := cache[key]; ok {\n        fmt.Fprintf(w, \"cached: %s\", val)\n        return\n    }\n    \n    result := expensiveComputation(key)\n    cache[key] = result\n    fmt.Fprintf(w, \"computed: %s\", result)\n}\n\nfunc statsHandler(w http.ResponseWriter, r *http.Request) {\n    fmt.Fprintf(w, \"requests: %d, cache size: %d\", requestCount, len(cache))\n}\n```",
+      "rubric": [
+        {
+          "criterion": "Race Condition Detection",
+          "weight": 0.4,
+          "scoring": {
+            "5": "Identifies: unsynchronized requestCount increment (data race), unsynchronized map read/write (Go maps are not safe for concurrent use; a concurrent read and write can crash the process with an unrecoverable fatal error), cache stampede (multiple goroutines computing same key simultaneously)",
+            "3": "Identifies map race condition but misses counter or stampede",
+            "1": "Mentions concurrency issues vaguely",
+            "0": "Misses race conditions"
+          }
+        },
+        {
+          "criterion": "Additional Issues",
+          "weight": 0.2,
+          "scoring": {
+            "5": "Identifies: unbounded cache growth (no eviction), no input validation on key, cache with no TTL, missing error handling on expensiveComputation",
+            "3": "Identifies 1-2 additional issues",
+            "1": "Only race conditions found",
+            "0": "No additional issues"
+          }
+        },
+        {
+          "criterion": "Fix Quality",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Provides sync.RWMutex for cache, atomic.AddInt64 for counter, singleflight for stampede prevention, or suggests sync.Map alternative",
+            "3": "Provides mutex fix but misses singleflight or atomic",
+            "1": "Suggests 'add a lock' without code",
+            "0": "No fix"
+          }
+        },
+        {
+          "criterion": "Severity & Go-Specific Knowledge",
+          "weight": 0.15,
+          "scoring": {
+            "5": "Correctly notes that concurrent map access in Go can crash the process with an unrecoverable runtime fatal error (not just data corruption); classifies as Critical",
+            "3": "Notes race condition but doesn't mention the Go-specific map crash",
+            "1": "Generic concurrency advice",
+            "0": "No Go-specific knowledge"
+          }
+        }
+      ],
+      "expectedScoreWithout": 25,
+      "expectedScoreWith": 70
+    },
+    {
+      "id": "bench-med-04",
+      "difficulty": "medium",
+      "description": "Detect authentication bypass and JWT issues",
+      "input": "Review this authentication middleware:\n\n```javascript\nconst jwt = require('jsonwebtoken');\n\nconst SECRET = 'my-secret-key-123';\n\nfunction authMiddleware(req, res, next) {\n  const token = req.headers.authorization;\n  \n  if (!token) {\n    return res.status(401).json({ error: 'No token' });\n  }\n\n  try {\n    const decoded = jwt.verify(token, SECRET);\n    req.user = decoded;\n    next();\n  } catch (err) {\n    res.status(401).json({ error: 'Invalid token' });\n  }\n}\n\nfunction adminMiddleware(req, res, next) {\n  if (req.user.role === 'admin') {\n    next();\n  }\n  res.status(403).json({ error: 'Forbidden' });\n}\n\nfunction generateToken(user) {\n  return jwt.sign({ id: user.id, role: user.role, email: user.email }, SECRET);\n}\n```",
+      "rubric": [
+        {
+          "criterion": "Security Issue Detection",
+          "weight": 0.4,
+          "scoring": {
+            "5": "Identifies: hardcoded weak secret, no token expiration, missing 'Bearer ' prefix parsing, adminMiddleware missing return (sends both 403 and continues), no algorithm restriction (algorithm confusion attack), PII in token payload",
+            "3": "Identifies 3-4 of the above",
+            "1": "Only identifies hardcoded secret",
+            "0": "Misses major issues"
+          }
+        },
+        {
+          "criterion": "Missing Return Bug",
+          "weight": 0.2,
+          "scoring": {
+            "5": "Identifies that adminMiddleware doesn't return after next(), so the 403 response executes even for admin users (admins may receive 403, or a second response is attempted after the handler has already replied)",
+            "3": "Notes the control flow issue but doesn't fully explain the consequence",
+            "1": "Flags adminMiddleware as suspicious but misidentifies the bug",
+            "0": "Not mentioned"
+          }
+        },
+        {
+          "criterion": "Fix Quality",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Provides: env variable for secret, token expiration, Bearer parsing, return statement fix, algorithm pinning ({algorithms: ['HS256']}), minimal payload",
+            "3": "Fixes most issues but misses algorithm pinning",
+            "1": "Partial fixes",
+            "0": "No fixes"
+          }
+        },
+        {
+          "criterion": "Report Structure",
+          "weight": 0.15,
+          "scoring": {
+            "5": "Clear severity classification; Critical for secret/no-expiry/auth bypass, Medium for payload/parsing",
+            "3": "Some severity classification",
+            "1": "Flat list",
+            "0": "No structure"
+          }
+        }
+      ],
+      "expectedScoreWithout": 30,
+      "expectedScoreWith": 75
+    },
+    {
+      "id": "bench-hard-01",
+      "difficulty": "hard",
+      "description": "Comprehensive review of a file upload service with multiple vulnerability classes",
+      "input": "Review this file upload service for all categories of issues:\n\n```python\nimport os\nimport subprocess\nfrom flask import Flask, request, jsonify, send_file\n\napp = Flask(__name__)\nUPLOAD_DIR = '/uploads'\n\n@app.route('/upload', methods=['POST'])\ndef upload():\n    file = request.files['file']\n    filename = file.filename\n    filepath = os.path.join(UPLOAD_DIR, filename)\n    file.save(filepath)\n    \n    if filename.endswith('.pdf'):\n        text = subprocess.check_output(f'pdftotext {filepath} -', shell=True)\n        return jsonify({'text': text.decode(), 'path': filepath})\n    \n    return jsonify({'message': 'Uploaded', 'path': filepath})\n\n@app.route('/files/<path:filename>')\ndef serve_file(filename):\n    return send_file(os.path.join(UPLOAD_DIR, filename))\n\n@app.route('/delete', methods=['POST'])\ndef delete():\n    path = request.json['path']\n    os.remove(path)\n    return jsonify({'deleted': path})\n```",
+      "rubric": [
+        {
+          "criterion": "Vulnerability Detection Breadth",
+          "weight": 0.3,
+          "scoring": {
+            "5": "Identifies ALL: path traversal in upload (../../), command injection in pdftotext (shell=True with unsanitized filename), path traversal in serve_file, arbitrary file deletion in /delete (any path on system), no file type/size validation, no authentication, information disclosure (filepath in response), unrestricted file upload (web shells)",
+            "3": "Identifies 5-6 vulnerabilities",
+            "1": "Identifies 2-3 vulnerabilities",
+            "0": "Misses most issues"
+          }
+        },
+        {
+          "criterion": "Attack Scenario Depth",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Describes concrete attack chains: upload a file named '../../etc/cron.d/backdoor', upload a file named 'x;id;.pdf' for command injection (the name must end in .pdf to reach the shell call), delete /etc/passwd via delete endpoint",
+            "3": "Describes some attack scenarios but not chains",
+            "1": "Generic vulnerability descriptions",
+            "0": "No attack scenarios"
+          }
+        },
+        {
+          "criterion": "Fix Completeness",
+          "weight": 0.3,
+          "scoring": {
+            "5": "Provides: filename sanitization (werkzeug.secure_filename), file type allowlisting, size limits, subprocess with list args (no shell=True), path validation (os.path.realpath check), authentication, remove filepath from response, chroot/contained upload directory",
+            "3": "Fixes 4-5 issues with code",
+            "1": "Fixes 1-2 issues",
+            "0": "No fixes"
+          }
+        },
+        {
+          "criterion": "Severity Accuracy",
+          "weight": 0.15,
+          "scoring": {
+            "5": "Command injection and arbitrary file deletion as Critical; path traversal as Critical; unrestricted upload as High; missing auth as High; info disclosure as Medium",
+            "3": "Mostly correct severity",
+            "1": "Severity misclassification",
+            "0": "No severity"
+          }
+        }
+      ],
+      "expectedScoreWithout": 20,
+      "expectedScoreWith": 65
+    },
+    {
+      "id": "bench-hard-02",
+      "difficulty": "hard",
+      "description": "Review complex async state management with subtle bugs",
+      "input": "Review this TypeScript connection pool and caching layer for correctness and performance:\n\n```typescript\nclass ConnectionPool {\n  private pool: Connection[] = [];\n  private waiting: ((conn: Connection) => void)[] = [];\n  private cache: Map<string, { data: any; timestamp: number }> = new Map();\n  private maxSize: number;\n  private cacheTimeout = 300000; // 5 min\n\n  constructor(maxSize: number) {\n    this.maxSize = maxSize;\n  }\n\n  async query(sql: string, params: any[]): Promise<any> {\n    const cacheKey = sql + JSON.stringify(params);\n    const cached = this.cache.get(cacheKey);\n    \n    if (cached && Date.now() - cached.timestamp < this.cacheTimeout) {\n      return cached.data;\n    }\n\n    const conn = await this.getConnection();\n    const result = await conn.execute(sql, params);\n    this.cache.set(cacheKey, { data: result, timestamp: Date.now() });\n    this.releaseConnection(conn);\n    return result;\n  }\n\n  private async getConnection(): Promise<Connection> {\n    if (this.pool.length > 0) {\n      return this.pool.pop()!;\n    }\n    if (this.pool.length < this.maxSize) {\n      return new Connection();\n    }\n    return new Promise(resolve => this.waiting.push(resolve));\n  }\n\n  private releaseConnection(conn: Connection) {\n    if (this.waiting.length > 0) {\n      this.waiting.shift()!(conn);\n    } else {\n      this.pool.push(conn);\n    }\n  }\n}\n```",
+      "rubric": [
+        {
+          "criterion": "Bug Detection",
+          "weight": 0.35,
+          "scoring": {
+            "5": "Identifies ALL: getConnection size check is wrong (pool.length tracks available not total, so it creates unlimited connections), connection not released on query error (no try-finally), cache grows unboundedly, cache key collision possible with JSON.stringify, cached results are returned by reference so caller mutations corrupt the cache (shared reference), waiting queue has no timeout (can hang forever)",
+            "3": "Identifies 3-4 bugs",
+            "1": "Identifies 1-2 bugs",
+            "0": "Misses bugs"
+          }
+        },
+        {
+          "criterion": "Concurrency Analysis",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Identifies race conditions: overlapping async calls each pass the pool size check and open connections beyond maxSize, cache stampede (multiple requests for same expired key), TOCTOU in getConnection",
+            "3": "Identifies 1-2 concurrency issues",
+            "1": "Mentions concurrency concerns vaguely",
+            "0": "No concurrency analysis"
+          }
+        },
+        {
+          "criterion": "Fix Quality",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Provides: separate total connection counter, try-finally for connection release, cache eviction (LRU or size-limited), semaphore or mutex for pool access, timeout for waiting, defensive copy for cached data",
+            "3": "Fixes major bugs but misses concurrency fixes",
+            "1": "Partial fixes",
+            "0": "No fixes"
+          }
+        },
+        {
+          "criterion": "Overall Assessment",
+          "weight": 0.15,
+          "scoring": {
+            "5": "Provides health score, identifies this as High-risk code due to resource management bugs, recommends using a proven pool library instead",
+            "3": "Some overall assessment",
+            "1": "Only individual findings",
+            "0": "No assessment"
+          }
+        }
+      ],
+      "expectedScoreWithout": 20,
+      "expectedScoreWith": 65
+    },
+    {
+      "id": "bench-hard-03",
+      "difficulty": "hard",
+      "description": "Review a multi-language microservice interaction with subtle security and reliability issues",
+      "input": "Review this API gateway handler that coordinates between microservices:\n\n```typescript\nimport express from 'express';\nimport axios from 'axios';\nimport Redis from 'ioredis';\n\nconst redis = new Redis();\nconst app = express();\n\napp.post('/api/checkout', async (req, res) => {\n  const { userId, items, paymentToken, promoCode } = req.body;\n  \n  // Step 1: Validate promo code\n  let discount = 0;\n  if (promoCode) {\n    const promo = await axios.get(`http://promo-service/validate/${promoCode}`);\n    discount = promo.data.discount;\n  }\n  \n  // Step 2: Check inventory\n  const inventory = await axios.post('http://inventory-service/check', { items });\n  if (!inventory.data.available) {\n    return res.status(400).json({ error: 'Items unavailable' });\n  }\n  \n  // Step 3: Reserve inventory\n  await axios.post('http://inventory-service/reserve', { items, userId });\n  \n  // Step 4: Calculate total\n  const total = items.reduce((sum, item) => sum + item.price * item.quantity, 0);\n  const finalTotal = total - (total * discount / 100);\n  \n  // Step 5: Charge payment\n  const payment = await axios.post('http://payment-service/charge', {\n    userId, amount: finalTotal, token: paymentToken\n  });\n  \n  // Step 6: Create order\n  const order = await axios.post('http://order-service/create', {\n    userId, items, total: finalTotal, paymentId: payment.data.id\n  });\n  \n  // Step 7: Cache order\n  await redis.set(`order:${order.data.id}`, JSON.stringify(order.data));\n  \n  res.json({ orderId: order.data.id, total: finalTotal });\n});\n```",
+      "rubric": [
+        {
+          "criterion": "Distributed System Issues",
+          "weight": 0.3,
+          "scoring": {
+            "5": "Identifies: no saga/compensation (if payment fails, inventory stays reserved; if order creation fails, payment is charged without order), no idempotency (retry creates double charge), no timeout on service calls, no circuit breaker, race condition between check and reserve (TOCTOU)",
+            "3": "Identifies 2-3 distributed system issues",
+            "1": "Only mentions error handling",
+            "0": "Misses distributed issues"
+          }
+        },
+        {
+          "criterion": "Security Issues",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Identifies: client-supplied prices (item.price from request body, allowing price manipulation), no auth/user verification, discount could be > 100% (negative total), no input validation on items array, promoCode interpolated unescaped into the promo-service URL (path injection/SSRF potential against internal services), no rate limiting on checkout",
+            "3": "Identifies 3 security issues",
+            "1": "Identifies 1 security issue",
+            "0": "Misses security issues"
+          }
+        },
+        {
+          "criterion": "Reliability Fix",
+          "weight": 0.3,
+          "scoring": {
+            "5": "Proposes: saga pattern with compensation handlers, idempotency keys, timeouts + circuit breakers, server-side price lookup, input validation, retry with backoff, dead letter queue for failed steps",
+            "3": "Proposes some reliability fixes",
+            "1": "Only suggests try-catch",
+            "0": "No reliability fixes"
+          }
+        },
+        {
+          "criterion": "Severity & Report Quality",
+          "weight": 0.15,
+          "scoring": {
+            "5": "Critical: price manipulation, no compensation; High: no idempotency, no timeout; Medium: missing circuit breaker, redis caching issues; structured report with priorities",
+            "3": "Some severity classification",
+            "1": "Flat list",
+            "0": "No severity"
+          }
+        }
+      ],
+      "expectedScoreWithout": 20,
+      "expectedScoreWith": 65
+    }
+  ]
+}

package/tests/smoke.json ADDED

@@ -0,0 +1,54 @@
+{
+  "version": "0.0.1",
+  "timeout": 60,
+  "tasks": [
+    {
+      "id": "smoke-01",
+      "description": "Review a code snippet with intentionally planted security, performance, and quality issues",
+      "input": "Review the following Express.js API endpoint for security vulnerabilities, performance issues, and code quality:\n\n```javascript\nconst express = require('express');\nconst app = express();\nconst db = require('./db');\n\napp.post('/api/login', async (req, res) => {\n  const { username, password } = req.body;\n  const user = await db.query(`SELECT * FROM users WHERE username = '${username}' AND password = '${password}'`);\n  if (user.rows.length > 0) {\n    const token = username + Date.now();\n    res.json({ token, user: user.rows[0] });\n  } else {\n    res.json({ error: 'Invalid credentials' });\n  }\n});\n\napp.get('/api/users/:id', async (req, res) => {\n  const user = await db.query(`SELECT * FROM users WHERE id = ${req.params.id}`);\n  res.json(user.rows[0]);\n});\n\napp.get('/api/users', async (req, res) => {\n  const users = await db.query('SELECT * FROM users');\n  const result = [];\n  for (const user of users.rows) {\n    const orders = await db.query(`SELECT * FROM orders WHERE user_id = ${user.id}`);\n    result.push({ ...user, orders: orders.rows });\n  }\n  res.json(result);\n});\n\napp.listen(3000);\n```\n\nProvide a structured review with severity classifications and fix suggestions.",
+      "rubric": [
+        {
+          "criterion": "Issue Detection Rate",
+          "weight": 0.3,
+          "scoring": {
+            "5": "Identifies all major issues: SQL injection (2 instances), plaintext password storage, predictable token, missing auth on /users/:id, N+1 query, missing error handling, no rate limiting, full user object in response (including password)",
+            "3": "Identifies 4-5 of the above issues",
+            "1": "Identifies only 1-2 obvious issues",
+            "0": "Misses all major issues or reports false positives"
+          }
+        },
+        {
+          "criterion": "Severity Classification",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Correctly classifies SQL injection and auth issues as Critical/High; N+1 and error handling as Medium; uses consistent severity framework",
+            "3": "Mostly correct severity assignment with minor misclassifications",
+            "1": "Inconsistent or incorrect severity levels",
+            "0": "No severity classification"
+          }
+        },
+        {
+          "criterion": "Fix Quality",
+          "weight": 0.25,
+          "scoring": {
+            "5": "Provides copy-pasteable code fixes for each issue: parameterized queries, bcrypt hashing, JWT tokens, authorization middleware, eager loading/JOIN for N+1, try-catch blocks",
+            "3": "Provides fix descriptions but not all with code examples",
+            "1": "Vague fix suggestions like 'sanitize input'",
+            "0": "No fix suggestions"
+          }
+        },
+        {
+          "criterion": "Report Structure",
+          "weight": 0.2,
+          "scoring": {
+            "5": "Structured report with summary, severity-grouped findings, health scores, positive observations (if any), and prioritized action items",
+            "3": "Some structure but missing components (e.g., no summary or no prioritization)",
+            "1": "Flat list of issues with no organization",
+            "0": "Unstructured prose"
+          }
+        }
+      ],
+      "passThreshold": 60
+    }
+  ]
+}
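
Each rubric entry above pairs a `weight` with a 0-5 `scoring` scale, while `passThreshold` and the `expectedScore*` fields look like 0-100 values. This diff does not show how the two are combined, so the following is only a sketch under one plausible assumption: the score is the weight-averaged level, rescaled to 0-100. The function name `weighted_score` and the judge levels are hypothetical and not part of the package.

```python
from typing import Dict, List

MAX_LEVEL = 5  # highest key in each rubric "scoring" map


def weighted_score(rubric: List[Dict], levels: Dict[str, int]) -> float:
    """Combine per-criterion levels (0-5) into a 0-100 score.

    Assumed formula (not confirmed by the package):
        100 * sum(weight * level) / MAX_LEVEL
    """
    total_weight = sum(c["weight"] for c in rubric)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"rubric weights sum to {total_weight}, expected 1.0")
    weighted = sum(c["weight"] * levels[c["criterion"]] for c in rubric)
    return 100 * weighted / MAX_LEVEL


# Criteria and weights copied from smoke-01; the levels are made-up judge output.
smoke_rubric = [
    {"criterion": "Issue Detection Rate", "weight": 0.3},
    {"criterion": "Severity Classification", "weight": 0.25},
    {"criterion": "Fix Quality", "weight": 0.25},
    {"criterion": "Report Structure", "weight": 0.2},
]
levels = {
    "Issue Detection Rate": 5,
    "Severity Classification": 3,
    "Fix Quality": 3,
    "Report Structure": 5,
}

score = weighted_score(smoke_rubric, levels)
passed = score >= 60  # passThreshold from smoke.json
```

Under this reading, a review that reaches level 5 on detection and structure but only level 3 on severity and fixes scores roughly 80 and clears the smoke threshold of 60.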