@botlearn/debugger 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 BotLearn
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,35 @@
+ # @botlearn/debugger
+
+ > Root cause analysis, bug diagnosis, and fix suggestion for OpenClaw Agent — improves debugging efficiency 5x with systematic hypothesis-driven investigation
+
+ ## Installation
+
+ ```bash
+ # via npm
+ npm install @botlearn/debugger
+
+ # via clawhub
+ clawhub install @botlearn/debugger
+ ```
+
+ ## Category
+
+ programming-assistance
+
+ ## Dependencies
+
+ `@botlearn/code-review`
+
+ ## Files
+
+ | File | Description |
+ |------|-------------|
+ | `manifest.json` | Skill metadata and configuration |
+ | `skill.md` | Role definition and activation rules |
+ | `knowledge/` | Domain knowledge documents |
+ | `strategies/` | Behavioral strategy definitions |
+ | `tests/` | Smoke and benchmark tests |
+
+ ## License
+
+ MIT
package/knowledge/anti-patterns.md ADDED
@@ -0,0 +1,74 @@
+ ---
+ domain: debugger
+ topic: anti-patterns
+ priority: medium
+ ttl: 30d
+ ---
+
+ # Debugging — Anti-Patterns
+
+ ## Investigation Anti-Patterns
+
+ ### 1. Symptom Fixing (Patch-and-Pray)
+ - **Problem**: Applying a surface-level fix that silences the error without understanding the root cause. Example: wrapping a NullPointerException in a try-catch that returns a default value
+ - **Why it's harmful**: The underlying defect remains. It will resurface in a different form, often harder to diagnose, or cause silent data corruption
+ - **Fix**: Always trace the causal chain from symptom to root cause before writing any fix code. Ask: "Why is this value null?" not "How do I handle null?"
+
+ ### 2. Shotgun Debugging
+ - **Problem**: Making multiple simultaneous changes hoping one of them fixes the bug, without understanding which change (if any) actually addresses the root cause
+ - **Why it's harmful**: Even if the bug disappears, you don't know why. You may have introduced side effects, masked the real issue, or created new bugs. You cannot write a meaningful regression test
+ - **Fix**: Change exactly ONE variable at a time. Observe the result. Revert if it didn't help. Proceed methodically
+
+ ### 3. Ignoring Error Messages
+ - **Problem**: Glancing at an error and immediately forming a theory without reading the full message, stack trace, and context
+ - **Why it's harmful**: Error messages are the most direct diagnostic evidence. Ignoring them leads to investigating the wrong hypothesis entirely. Developers frequently report spending hours debugging only to discover the answer was in the error message
+ - **Fix**: Read the ENTIRE error message. Read the ENTIRE stack trace. Parse every field: exception type, message string, file, line, column. Then hypothesize
+
+ ### 4. Assuming the Bug Is Somewhere Else
+ - **Problem**: Blaming the framework, library, compiler, or OS without evidence. "It must be a React bug" or "The database is broken"
+ - **Why it's harmful**: Widely-used libraries have been tested by millions of users. The bug is almost certainly in your code. Blaming external components wastes time investigating the wrong system
+ - **Fix**: Assume the bug is in YOUR code until you have strong evidence otherwise. Only escalate to library/framework investigation after ruling out your own code with concrete evidence
+
+ ### 5. Not Reading the Documentation
+ - **Problem**: Using an API based on assumptions about its behavior instead of reading the official documentation
+ - **Why it's harmful**: APIs frequently have non-obvious semantics: nullable return values, specific error conditions, required initialization order, thread-safety guarantees (or lack thereof)
+ - **Fix**: Before debugging an API integration issue, re-read the relevant documentation section. Check for known issues, version-specific behavior changes, and migration guides
+
+ ## Process Anti-Patterns
+
+ ### 6. Debugging Without Reproducing
+ - **Problem**: Attempting to fix a bug based solely on a bug report or stack trace without first reproducing it locally
+ - **Why it's harmful**: Without reproduction, you cannot verify your fix works. You may fix a different bug or introduce a regression. You have no way to write a regression test
+ - **Fix**: ALWAYS reproduce the bug before attempting a fix. If reproduction is difficult, invest time in creating a minimal reproduction case. If the bug is intermittent, increase the probability of occurrence (e.g., stress test, increase parallelism)
+
+ ### 7. Not Using Version Control During Debugging
+ - **Problem**: Making changes to investigate a bug without committing or stashing first, leading to a tangled mix of investigation code and attempted fixes
+ - **Why it's harmful**: You cannot cleanly revert to a known state. You lose track of what you changed. You may accidentally commit debug code
+ - **Fix**: Always stash or commit your work before starting to debug. Create a debug branch. Make each investigation step a separate commit that can be reverted. Use `git bisect` when the bug is a regression
+
+ ### 8. Premature Optimization During Bug Fix
+ - **Problem**: While fixing a bug, simultaneously refactoring or optimizing the surrounding code
+ - **Why it's harmful**: Conflates two different changes. If the fix introduces a new bug, it's harder to isolate. Code review becomes more difficult. The optimization may not be needed
+ - **Fix**: Fix the bug in the smallest possible change. Commit. Then refactor or optimize in a separate commit if warranted
+
+ ### 9. Debugging in Production
+ - **Problem**: Adding debug logging, print statements, or experimental fixes directly in the production environment
+ - **Why it's harmful**: Risk of breaking production for all users. Debug output may expose sensitive data. Changes are not tracked in version control
+ - **Fix**: Reproduce the bug in a development or staging environment. If production-only debugging is unavoidable, use observability tools (structured logging, APM, distributed tracing) instead of code changes
+
+ ### 10. Ignoring Intermittent Failures
+ - **Problem**: Dismissing a test or error that "only fails sometimes" as a flaky test or transient issue without investigation
+ - **Why it's harmful**: Intermittent failures are often concurrency bugs, race conditions, or timing-dependent issues — the hardest and most dangerous class of bugs. They tend to worsen under load
+ - **Fix**: Treat intermittent failures as HIGH priority. They often indicate a real concurrency or state-management bug. Run the test in a loop (100-1000 iterations) to increase reproduction probability. Add logging at synchronization points
+
+ ## Output Anti-Patterns
+
+ ### 11. Vague Bug Reports
+ - **Problem**: Describing the bug as "it doesn't work" or "it's broken" without specifying: what was expected, what actually happened, steps to reproduce, environment details
+ - **Why it's harmful**: Forces the recipient to guess and ask follow-up questions, wasting time. May lead to investigating the wrong issue
+ - **Fix**: Every bug report should include: (1) steps to reproduce, (2) expected behavior, (3) actual behavior, (4) environment details, (5) relevant error output
+
+ ### 12. Fix Without Regression Test
+ - **Problem**: Fixing a bug without adding a test that would catch it if reintroduced
+ - **Why it's harmful**: Without a regression test, the same bug can (and often does) come back in a future change. The team has no automated safety net for this specific failure mode
+ - **Fix**: For every bug fix, write at least one test that: (1) fails without the fix applied, (2) passes with the fix applied, (3) covers the specific input/state that triggered the original bug
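The three-part regression-test rule above can be sketched in Python. This is a hypothetical illustration — `last_n` and the off-by-one slice bug are invented for the example, not taken from this package:

```python
# Hypothetical regression test for an off-by-one fix in a `last_n` helper.
# The (invented) buggy original sliced items[-n - 1:], returning one extra
# element whenever n < len(items).

def last_n(items, n):
    """Return the last n items (fixed version)."""
    if n <= 0:
        return []
    return items[-n:]

def test_last_n_regression():
    # (1)/(2) This assertion fails without the fix and passes with it:
    # the buggy slice returned [2, 3, 4] for n=2.
    assert last_n([1, 2, 3, 4], 2) == [3, 4]
    # (3) Cover the boundary inputs around the original trigger.
    assert last_n([1, 2, 3, 4], 0) == []
    assert last_n([], 3) == []
    assert last_n([1], 5) == [1]

test_last_n_regression()
```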
package/knowledge/best-practices.md ADDED
@@ -0,0 +1,162 @@
+ ---
+ domain: debugger
+ topic: debugging-methodologies-and-strategies
+ priority: high
+ ttl: 30d
+ ---
+
+ # Debugging — Best Practices
+
+ ## The Scientific Debugging Method
+
+ Debugging is hypothesis-driven investigation. Apply the scientific method rigorously:
+
+ ### 1. Observe
+ - Collect all available evidence: error messages, stack traces, logs, user reports, screenshots
+ - Record the exact steps to reproduce, the expected behavior, and the actual behavior
+ - Note the environment: OS, language version, framework version, configuration
+
+ ### 2. Hypothesize
+ - Based on observed symptoms, formulate 2-3 ranked hypotheses for the root cause
+ - Each hypothesis must be **falsifiable** — you must be able to design a test that would disprove it
+ - Rank by likelihood using bug pattern knowledge (see knowledge/domain.md)
+
+ ### 3. Predict
+ - For each hypothesis, predict what you would observe if it were true
+ - Example: "If the bug is a race condition, adding a sleep(1) before the read should make it pass consistently"
+
+ ### 4. Test
+ - Design the smallest experiment that distinguishes between hypotheses
+ - Change exactly ONE variable at a time
+ - Record the result: did the prediction hold?
+
+ ### 5. Conclude
+ - If the prediction held: the hypothesis is supported (but gather more evidence if possible)
+ - If the prediction failed: discard the hypothesis and promote the next one
+ - Repeat until root cause is confirmed with high confidence
+
+ ## Binary Search / Bisection Debugging
+
+ The most powerful technique for narrowing down bugs in large codebases or long histories.
+
+ ### Code Bisection (Runtime)
+ 1. Identify a known-good state and a known-bad state
+ 2. Insert a diagnostic check at the midpoint of the code path between them
+ 3. Determine which half contains the bug
+ 4. Repeat, halving the search space each time
+ 5. **Efficiency**: Finds the bug in O(log n) steps instead of O(n)
+
+ ### Git Bisect (Historical)
+ ```bash
+ git bisect start
+ git bisect bad # Current commit is broken
+ git bisect good abc1234 # This older commit was working
+ # Git checks out the midpoint — test it
+ git bisect good # or git bisect bad
+ # Repeat until the first bad commit is found
+ git bisect reset
+ ```
+ - **When to use**: Bug exists now but worked before; unclear when it was introduced
+ - **Automate**: `git bisect run ./test_script.sh` for fully automated bisection
+
+ ### Data Bisection
+ - For bugs triggered by specific input data, bisect the input:
+ 1. Split the input in half
+ 2. Test each half separately
+ 3. Recurse on the half that triggers the bug
+ 4. Identify the minimal triggering input
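The data-bisection steps above can be sketched as a small Python helper. `triggers_bug` is a stand-in predicate for "does this input still reproduce the failure?"; when the trigger spans both halves, plain halving stalls and full delta debugging (ddmin) is needed:

```python
# Minimal data-bisection sketch: keep the half of the input that still
# reproduces the bug, halving until no smaller half triggers it.

def bisect_input(data, triggers_bug):
    while len(data) > 1:
        mid = len(data) // 2
        left, right = data[:mid], data[mid:]
        if triggers_bug(left):
            data = left
        elif triggers_bug(right):
            data = right
        else:
            # Trigger needs elements from both halves; stop halving here.
            break
    return data

# Example: the "bug" fires whenever the payload contains a zero byte.
minimal = bisect_input(list(b"abc\x00def"), lambda d: 0 in d)
# minimal is [0] — the single triggering byte.
```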
+
+ ## Strategic Logging
+
+ ### Log Placement Strategy
+
+ | Placement | Purpose | Example |
+ |-----------|---------|---------|
+ | Function entry | Verify function is called with expected args | `log.debug("processOrder called", {orderId, items})` |
+ | Before external call | Verify outbound request data | `log.debug("Calling payment API", {payload})` |
+ | After external call | Verify response data | `log.debug("Payment API response", {status, body})` |
+ | Branch points | Verify which code path executes | `log.debug("Using cache path")` or `log.debug("Using DB path")` |
+ | Loop iterations | Track iteration state for off-by-one / infinite loops | `log.debug("Loop iteration", {i, current, total})` |
+ | Catch blocks | Always log caught exceptions with full context | `log.error("Failed to process", {error, context})` |
+
+ ### Structured Logging for Debugging
+ - Use structured key-value pairs, not string concatenation
+ - Include correlation IDs to trace requests across services
+ - Log the **state** (variable values) not just the **event** (what happened)
+ - Use log levels appropriately: DEBUG for investigation, ERROR for failures, WARN for recoverable issues
+
+ ### Temporary Debug Logging Pattern
+ ```
+ // DEBUG-START: investigating issue #1234
+ console.log('[DEBUG-1234] state at checkpoint:', JSON.stringify(state));
+ // DEBUG-END
+ ```
+ - Always tag temporary logging with a ticket/issue number
+ - Always remove before committing (or use a lint rule to catch it)
+
+ ## Rubber Duck Debugging
+
+ When stuck, explain the problem out loud (or in writing) step by step:
+
+ 1. State the expected behavior clearly
+ 2. State the actual behavior clearly
+ 3. Walk through the code line by line, explaining what each line does
+ 4. The act of explaining often reveals the incorrect assumption
+
+ **Why it works**: Forces you to examine each assumption explicitly rather than glossing over them mentally.
+
+ ## Minimal Reproduction
+
+ ### Why Minimize?
+ - Removes noise from unrelated code/data
+ - Makes the bug easier to understand and communicate
+ - Confirms you understand what triggers the bug
+ - Provides a ready-made regression test
+
+ ### Minimization Process
+ 1. Start with the full failing scenario
+ 2. Remove components one at a time, checking if the bug persists
+ 3. Simplify input data to the smallest triggering case
+ 4. Remove configuration, middleware, and dependencies that are not involved
+ 5. The result should be the **smallest code + input that reproduces the bug**
+
+ ### Reproduction Environment Checklist
+ - [ ] Same language/runtime version
+ - [ ] Same dependency versions (check lock files)
+ - [ ] Same OS or container environment
+ - [ ] Same configuration / environment variables
+ - [ ] Same data state (database, cache, files)
+
+ ## Debugging by Error Category
+
+ ### For Null/Undefined Errors
+ 1. Trace the variable backwards from the crash point to where it was assigned
+ 2. Identify which code path leads to the null/undefined assignment
+ 3. Common sources: missing API response field, failed database query, uninitialized state
+
+ ### For Async/Promise Errors
+ 1. Map the async execution flow (draw it if needed)
+ 2. Check for missing `await`, unhandled rejections, or callback error parameters
+ 3. Verify execution order — async code may not run in the order it appears
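The missing-`await` check has the same shape in Python's asyncio: calling an async function without awaiting it yields a coroutine object, not the result. A minimal sketch (the function names are illustrative):

```python
# Sketch of the "missing await" bug: without `await`, you get a coroutine
# object instead of the resolved value.
import asyncio

async def fetch_total():
    return 42

async def main():
    broken = fetch_total()        # BUG: coroutine object, not 42
    fixed = await fetch_total()   # correct: awaited result
    assert asyncio.iscoroutine(broken)
    assert fixed == 42
    broken.close()  # silence the "coroutine was never awaited" warning
    return fixed

result = asyncio.run(main())
```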
+
+ ### For Performance Bugs
+ 1. Profile first, optimize second — never guess at bottlenecks
+ 2. Check algorithmic complexity: O(n^2) in a loop over large data is a common culprit
+ 3. Look for N+1 query patterns in database-backed code
+ 4. Check for unnecessary re-renders in frontend frameworks
+
+ ### For Concurrency Bugs
+ 1. Identify shared mutable state
+ 2. Map the order of lock acquisitions across threads
+ 3. Use thread-safe data structures or synchronization primitives
+ 4. Test with increased parallelism to amplify timing-sensitive bugs
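Steps 1 and 3 above can be sketched with Python threads: the shared counter is the mutable state, and a lock serializes the read-modify-write. Without the lock the increments can interleave and updates are lost intermittently, which is exactly why such bugs "pass sometimes":

```python
# Sketch of a lost-update fix: protect a read-modify-write on shared state
# with a lock so the final count is deterministic.
import threading

counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        with lock:  # serialize the read-modify-write
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert counter == 40_000  # deterministic only because of the lock
```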
+
+ ## Fix Verification Checklist
+
+ After implementing a fix:
+ - [ ] The original bug is no longer reproducible
+ - [ ] No new failures introduced (run full test suite)
+ - [ ] Edge cases covered (empty input, null, boundary values, concurrent access)
+ - [ ] A regression test exists that would catch this bug if reintroduced
+ - [ ] The fix addresses the root cause, not just the symptom
+ - [ ] Code review completed (leverage @botlearn/code-review)
package/knowledge/domain.md ADDED
@@ -0,0 +1,180 @@
+ ---
+ domain: debugger
+ topic: common-bug-patterns-and-error-taxonomy
+ priority: high
+ ttl: 30d
+ ---
+
+ # Debugging — Common Bug Patterns, Error Types & Stack Trace Anatomy
+
+ ## Bug Classification Taxonomy
+
+ ### 1. Logic Errors
+ Bugs where the code executes without crashing but produces incorrect results.
+
+ | Pattern | Description | Common Languages | Example |
+ |---------|-------------|-----------------|---------|
+ | Off-by-one | Loop boundary or index shifted by 1 | All | `for (i = 0; i <= arr.length; i++)` — reads past array end |
+ | Incorrect operator | Wrong comparison or arithmetic operator | All | `if (a = b)` instead of `if (a == b)` in C/JS |
+ | Wrong boolean logic | Inverted or miscomposed conditions | All | `if (!a && !b)` instead of `!(a && b)` (De Morgan violation) |
+ | Missing edge case | Fails on empty input, zero, negative, max values | All | No check for empty array before accessing `arr[0]` |
+ | Incorrect algorithm | Right structure, wrong logic in transformation | All | Sorting comparator returns wrong sign |
+ | Integer overflow | Arithmetic exceeds type range silently | C, C++, Java, Rust | `int sum = 2_000_000_000 + 2_000_000_000` wraps negative |
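The off-by-one row can be made concrete in Python, where walking one index past the end raises an `IndexError` rather than silently reading past the buffer as in C:

```python
# Sketch of the classic off-by-one: the buggy loop visits index len(arr).

def sum_buggy(arr):
    total = 0
    for i in range(len(arr) + 1):  # BUG: range should stop at len(arr)
        total += arr[i]
    return total

def sum_fixed(arr):
    total = 0
    for i in range(len(arr)):      # correct bounds: 0 .. len(arr) - 1
        total += arr[i]
    return total

try:
    sum_buggy([1, 2, 3])
except IndexError as exc:
    print("buggy version crashed:", exc)

assert sum_fixed([1, 2, 3]) == 6
```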
+
+ ### 2. Null / Undefined Reference Errors
+ Accessing members or methods on a null/undefined/nil value.
+
+ | Language | Error Message Pattern | Common Cause |
+ |----------|----------------------|-------------|
+ | JavaScript | `TypeError: Cannot read properties of undefined (reading 'X')` | Accessing nested property on uninitialized object |
+ | JavaScript | `TypeError: X is not a function` | Calling undefined method, wrong import |
+ | Python | `AttributeError: 'NoneType' object has no attribute 'X'` | Function returns None unexpectedly |
+ | Java | `NullPointerException` | Uninitialized object reference, failed Optional unwrap |
+ | C# | `NullReferenceException` | Uninitialized reference type |
+ | Rust | `unwrap()` on `None` | `Option::unwrap()` called on `None` value |
+ | Go | `panic: runtime error: invalid memory address or nil pointer dereference` | Nil pointer method call |
+
+ ### 3. Type Errors
+ Type mismatches at runtime or compile time.
+
+ | Language | Error Pattern | Common Cause |
+ |----------|--------------|-------------|
+ | Python | `TypeError: unsupported operand type(s)` | String + int without conversion |
+ | TypeScript | `Type 'X' is not assignable to type 'Y'` | Interface mismatch, missing property |
+ | Java | `ClassCastException` | Unsafe downcast, generic type erasure |
+ | Go | `cannot use X (type Y) as type Z` | Interface not satisfied |
+
+ ### 4. Concurrency Bugs
+ Non-deterministic failures caused by parallel execution.
+
+ | Pattern | Description | Symptoms |
+ |---------|-------------|----------|
+ | Race condition | Two threads access shared state without synchronization | Intermittent wrong results, passes sometimes |
+ | Deadlock | Two+ threads wait for each other's locks | Application hangs permanently |
+ | Livelock | Threads keep retrying but never make progress | High CPU, no progress, no hang |
+ | Starvation | Low-priority thread never gets CPU time | Some requests never complete |
+ | Lost update | Concurrent writes overwrite each other | Data disappears or reverts |
+ | Double-checked locking | Broken singleton pattern without volatile/atomic | Partially constructed object visible |
+
+ ### 5. Resource & Memory Errors
+
+ | Pattern | Language(s) | Symptoms |
+ |---------|------------|----------|
+ | Memory leak | C, C++, Java (listener leaks), JS (closures, event listeners) | Gradual memory growth, eventual OOM |
+ | Use-after-free | C, C++ | Crash, corrupted data, security vulnerability |
+ | Buffer overflow | C, C++ | Crash, security vulnerability, corrupted adjacent memory |
+ | File descriptor leak | All | "Too many open files" error after extended operation |
+ | Connection pool exhaustion | All (database/HTTP clients) | Timeouts, connection refused after sustained load |
+ | Stack overflow | All (deep recursion) | `Maximum call stack size exceeded` (JS), `StackOverflowError` (Java) |
+
+ ### 6. Async / Promise Errors
+
+ | Pattern | Language | Error/Symptom |
+ |---------|---------|---------------|
+ | Unhandled promise rejection | JavaScript | `UnhandledPromiseRejectionWarning`, silent failure |
+ | Missing await | JavaScript/TypeScript | Function returns Promise object instead of resolved value |
+ | Callback hell / error swallowing | JavaScript | Errors caught silently in nested callbacks |
+ | Async deadlock | C# | `.Result` or `.Wait()` on async in sync context blocks forever |
+ | Event loop blocking | Node.js | Server stops responding during CPU-intensive sync operation |
+
+ ### 7. Import / Module Errors
+
+ | Language | Error Pattern | Common Cause |
+ |----------|--------------|-------------|
+ | Python | `ModuleNotFoundError: No module named 'X'` | Missing package, wrong virtualenv, typo |
+ | JavaScript | `SyntaxError: Cannot use import statement outside a module` | CommonJS/ESM mismatch |
+ | JavaScript | `Module not found: Can't resolve 'X'` | Missing dependency, wrong path |
+ | Java | `ClassNotFoundException` | Missing JAR, wrong classpath |
+ | Go | `cannot find package "X"` | Missing `go get`, wrong module path |
+
+ ## Stack Trace Anatomy
+
+ ### General Structure
+ ```
+ ExceptionType: Error message describing what went wrong
+ at function_name (file_path:line:column) ← Immediate failure point
+ at caller_function (file_path:line:column) ← Who called it
+ at higher_caller (file_path:line:column) ← Chain continues up
+ ...
+ at entry_point (file_path:line:column) ← Program/request entry
+ ```
+
+ ### Reading Stack Traces — Key Principles
+
+ 1. **Read top-down**: The top frame is where the error occurred; lower frames show the call chain
+ 2. **Find YOUR code**: Skip framework/library frames; focus on frames in your source files
+ 3. **Identify the boundary**: The transition from your code to library code often reveals the API misuse
+ 4. **Check the error message first**: It often tells you exactly what went wrong (null value, type mismatch, missing key)
+ 5. **Look for "Caused by"**: In Java/C#, chained exceptions reveal the original root cause at the bottom
+
+ ### Language-Specific Stack Trace Formats
+
+ #### JavaScript / Node.js
+ ```
+ TypeError: Cannot read properties of undefined (reading 'map')
+ at UserList (/app/components/UserList.jsx:15:22)
+ at renderWithHooks (/app/node_modules/react-dom/...js:16305:18)
+ at mountIndeterminateComponent (/app/node_modules/react-dom/...js:20069:13)
+ ```
+ - **Key**: First frame in YOUR source tree (not `node_modules`) is the likely bug location
+
+ #### Python
+ ```
+ Traceback (most recent call last):
+ File "/app/main.py", line 42, in process_data
+ result = transform(data)
+ File "/app/transform.py", line 18, in transform
+ return data["key"].strip()
+ TypeError: 'NoneType' object has no attribute 'strip'
+ ```
+ - **Key**: Python traces read **bottom-up** — the last frame + error message is the failure point
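The same "last frame is the failure point" rule can be applied programmatically with the standard library's `traceback` module. This sketch reuses the `process_data`/`transform` names from the trace above (reconstructed here as plain functions for illustration):

```python
# Sketch: extract the failure frame from a Python traceback.
# traceback.extract_tb lists frames oldest-first, so the last entry is
# the frame where the exception was raised.
import traceback

def transform(data):
    return data["key"].strip()  # fails when the value is None

def process_data(data):
    return transform(data)

try:
    process_data({"key": None})
except AttributeError as exc:
    frames = traceback.extract_tb(exc.__traceback__)
    failure = frames[-1]   # last frame = the failure point
    message = str(exc)
    print(f"{type(exc).__name__} in {failure.name} at line {failure.lineno}: {message}")
```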
+
+ #### Java
+ ```
+ java.lang.NullPointerException: Cannot invoke "String.length()" because "str" is null
+ at com.app.service.Parser.parse(Parser.java:45)
+ at com.app.controller.ApiController.handleRequest(ApiController.java:112)
+ Caused by: java.io.IOException: Connection refused
+ at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:162)
+ ```
+ - **Key**: "Caused by" chains reveal the original root cause
+
+ #### Go
+ ```
+ goroutine 1 [running]:
+ main.processData(0x0, 0x0)
+ /app/main.go:25 +0x3a
+ main.main()
+ /app/main.go:10 +0x25
+ ```
+ - **Key**: Goroutine ID and state help diagnose concurrency issues
+
+ ## Error Message Interpretation Guide
+
+ ### HTTP Status Codes as Bug Signals
+
+ | Code | Meaning | Likely Bug |
+ |------|---------|-----------|
+ | 400 | Bad Request | Malformed request body, missing required field, invalid JSON |
+ | 401 | Unauthorized | Expired/missing auth token, wrong credentials |
+ | 403 | Forbidden | Insufficient permissions, CORS policy violation |
+ | 404 | Not Found | Wrong URL path, resource deleted, routing misconfiguration |
+ | 409 | Conflict | Duplicate key, optimistic locking failure, stale data |
+ | 422 | Unprocessable Entity | Validation failure, business rule violation |
+ | 429 | Too Many Requests | Rate limiting triggered, missing backoff logic |
+ | 500 | Internal Server Error | Unhandled exception in server code |
+ | 502 | Bad Gateway | Downstream service crashed or unreachable |
+ | 503 | Service Unavailable | Server overloaded, deployment in progress |
+ | 504 | Gateway Timeout | Downstream service too slow, query timeout |
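The "missing backoff logic" cell for 429 can be sketched as exponential backoff. `send` is a hypothetical stand-in for the real HTTP call; production code should also honor a `Retry-After` header when the server sends one:

```python
# Sketch of retry-with-exponential-backoff for 429 responses.
import time

def request_with_backoff(send, max_retries=5, base_delay=0.1):
    for attempt in range(max_retries):
        status, body = send()
        if status != 429:
            return status, body
        time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    raise RuntimeError(f"rate limited after {max_retries} retries")

# Simulated server: rate-limits the first two calls, then succeeds.
calls = {"n": 0}
def fake_send():
    calls["n"] += 1
    return (429, "") if calls["n"] <= 2 else (200, "ok")

status, body = request_with_backoff(fake_send, base_delay=0.001)
assert (status, body) == (200, "ok")
```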
+
+ ### Database Error Patterns
+
+ | Error Pattern | Likely Cause |
+ |--------------|-------------|
+ | `duplicate key value violates unique constraint` | Inserting row with existing unique value |
+ | `deadlock detected` | Concurrent transactions locking same rows in different order |
+ | `relation "X" does not exist` | Missing table, wrong schema, migration not run |
+ | `column "X" of relation "Y" does not exist` | Schema mismatch, missing migration |
+ | `connection refused` / `too many connections` | DB server down or connection pool exhausted |
+ | `lock wait timeout exceeded` | Long-running transaction blocking others |
+ | `value too long for type character varying(N)` | Input exceeds column width constraint |
package/manifest.json ADDED
@@ -0,0 +1,28 @@
+ {
+ "name": "@botlearn/debugger",
+ "version": "0.1.0",
+ "description": "Root cause analysis, bug diagnosis, and fix suggestion for OpenClaw Agent — improves debugging efficiency 5x with systematic hypothesis-driven investigation",
+ "category": "programming-assistance",
+ "author": "BotLearn",
+ "benchmarkDimension": "code-generation",
+ "expectedImprovement": 500,
+ "dependencies": {
+ "@botlearn/code-review": "^1.0.0"
+ },
+ "compatibility": {
+ "openclaw": ">=0.5.0"
+ },
+ "files": {
+ "skill": "skill.md",
+ "knowledge": [
+ "knowledge/domain.md",
+ "knowledge/best-practices.md",
+ "knowledge/anti-patterns.md"
+ ],
+ "strategies": [
+ "strategies/main.md"
+ ],
+ "smokeTest": "tests/smoke.json",
+ "benchmark": "tests/benchmark.json"
+ }
+ }
package/package.json ADDED
@@ -0,0 +1,38 @@
+ {
+ "name": "@botlearn/debugger",
+ "version": "0.1.0",
+ "description": "Root cause analysis, bug diagnosis, and fix suggestion for OpenClaw Agent — improves debugging efficiency 5x with systematic hypothesis-driven investigation",
+ "type": "module",
+ "main": "manifest.json",
+ "files": [
+ "manifest.json",
+ "skill.md",
+ "knowledge/",
+ "strategies/",
+ "tests/",
+ "README.md"
+ ],
+ "keywords": [
+ "botlearn",
+ "openclaw",
+ "skill",
+ "programming-assistance"
+ ],
+ "author": "BotLearn",
+ "license": "MIT",
+ "dependencies": {
+ "@botlearn/code-review": "0.1.0"
+ },
+ "repository": {
+ "type": "git",
+ "url": "https://github.com/readai-team/botlearn-awesome-skills.git",
+ "directory": "packages/skills/debugger"
+ },
+ "homepage": "https://github.com/readai-team/botlearn-awesome-skills/tree/main/packages/skills/debugger",
+ "bugs": {
+ "url": "https://github.com/readai-team/botlearn-awesome-skills/issues"
+ },
+ "publishConfig": {
+ "access": "public"
+ }
+ }
package/skill.md ADDED
@@ -0,0 +1,48 @@
+ ---
+ name: debugger
+ role: Debugging Specialist
+ version: 1.0.0
+ triggers:
+ - "debug"
+ - "fix bug"
+ - "why is this failing"
+ - "error"
+ - "stack trace"
+ - "exception"
+ - "not working"
+ - "unexpected behavior"
+ - "crash"
+ - "broken"
+ ---
+
+ # Role
+
+ You are a Debugging Specialist. When activated, you systematically diagnose software bugs through hypothesis-driven investigation, root cause analysis, and evidence-based fix suggestions. You correctly identify root causes at least 60% of the time, improving debugging efficiency by 5x compared to unstructured debugging.
+
+ # Capabilities
+
+ 1. Analyze error messages, stack traces, and exception hierarchies to identify the failure point and its upstream causes
+ 2. Classify bugs by category (logic error, state corruption, race condition, resource leak, type mismatch, off-by-one, null reference, etc.) to narrow the investigation
+ 3. Formulate ranked hypotheses for the root cause based on symptom patterns, code context, and common bug taxonomies
+ 4. Design minimal reproduction steps that isolate the bug from unrelated system behavior
+ 5. Propose targeted fixes with reasoning, including regression test suggestions to prevent recurrence
+ 6. Leverage @botlearn/code-review capabilities to analyze code structure and identify defect-prone patterns before deep investigation
+
+ # Constraints
+
+ 1. Never suggest a fix without first identifying the root cause -- symptom-level patches create technical debt
+ 2. Never skip the hypothesis phase -- jumping to conclusions leads to incorrect fixes and wasted effort
+ 3. Never ignore error messages or stack traces -- they contain critical diagnostic information
+ 4. Always consider side effects of a proposed fix -- verify it does not introduce new bugs
+ 5. Always suggest at least one regression test for every fix to prevent recurrence
+ 6. Never assume the first hypothesis is correct -- validate with evidence before recommending a fix
+
+ # Activation
+
+ WHEN the user reports a bug, error, unexpected behavior, or requests debugging assistance:
+ 1. Collect symptom data: error messages, stack traces, expected vs. actual behavior, environment context
+ 2. Classify the bug category using knowledge/domain.md
+ 3. Apply the 7-step debugging strategy from strategies/main.md
+ 4. Cross-reference with knowledge/best-practices.md for methodology guidance
+ 5. Verify the approach against knowledge/anti-patterns.md to avoid common debugging mistakes
+ 6. Output: root cause analysis, ranked fix suggestions, and regression test recommendations
package/strategies/main.md ADDED
@@ -0,0 +1,109 @@
+ ---
+ strategy: debugger
+ version: 1.0.0
+ steps: 7
+ ---
+
+ # Debugging Strategy
+
+ ## Step 1: Symptom Analysis
+ - Collect ALL available diagnostic data: error messages, stack traces, logs, screenshots, user-reported steps
+ - Parse the error message completely — identify: exception type, message body, file, line, column
+ - Read the FULL stack trace — identify: the failure frame, the boundary between your code and library code, and the call chain
+ - Classify the symptom using knowledge/domain.md bug taxonomy:
+ - Logic error / Null reference / Type error / Concurrency / Resource / Async / Import / Other
+ - Record: **expected behavior** vs. **actual behavior** vs. **environment context** (OS, runtime version, configuration)
+ - IF the symptom report is incomplete THEN ask for: exact error message, steps to reproduce, environment details
+
+ ## Step 2: Hypothesis Generation
+ - Based on the symptom classification, generate 2-4 ranked hypotheses for the root cause
+ - For each hypothesis, specify:
+ - **What**: The specific defect (e.g., "variable `user` is null because the API returns 404 when the user is not found")
+ - **Where**: The file and approximate code region
+ - **Why**: What makes this hypothesis plausible given the symptoms
+ - **Test**: How to confirm or disprove this hypothesis
+ - Rank hypotheses by:
+ 1. **Consistency** with ALL observed symptoms (must explain every symptom, not just one)
+ 2. **Probability** based on common bug patterns (knowledge/domain.md) — null references and off-by-one errors are more likely than compiler bugs
+ 3. **Testability** — prefer hypotheses that can be quickly confirmed or disproved
+ - IF the code is available THEN leverage @botlearn/code-review to identify defect-prone patterns (deep nesting, missing error handling, unvalidated inputs) that support or refute hypotheses
30
+
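The What/Where/Why/Test records and the three ranking criteria above can be sketched as plain data. This is a minimal illustration, not a required format; all hypothesis contents and scores here are hypothetical.

```javascript
// Hypothetical Step 2 output: each hypothesis is a What/Where/Why/Test record,
// scored 1-3 on consistency, probability, and testability.
const hypotheses = [
  {
    what: "the API returns 404 for deleted users, so `user` is null",
    where: "services/userService.js, fetchUser()",
    why: "stack trace points at user.email; only deleted users crash",
    test: "request a deleted user id and log the raw response",
    consistency: 3, probability: 3, testability: 3,
  },
  {
    what: "the response body is parsed before the request completes",
    where: "services/userService.js, fetchUser()",
    why: "would also explain intermittent failures",
    test: "log the raw response immediately before parsing",
    consistency: 2, probability: 2, testability: 3,
  },
];

// Rank: consistency with ALL symptoms dominates, then probability, then testability.
hypotheses.sort((a, b) =>
  (b.consistency - a.consistency) ||
  (b.probability - a.probability) ||
  (b.testability - a.testability));

console.log(hypotheses[0].what); // the top-ranked hypothesis is tested first
```

Keeping hypotheses as explicit records makes it harder to skip the hypothesis phase and easier to show the user why a given root cause was investigated first.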
31
+ ## Step 3: Reproduction
32
+ - Design the **minimal reproduction case**: the smallest input + code + configuration that triggers the bug
33
+ - Follow the minimization process from knowledge/best-practices.md:
34
+ 1. Start with the full failing scenario
35
+ 2. Remove components one at a time, verifying the bug persists after each removal
36
+ 3. Simplify input data to the smallest triggering case
37
+ 4. Document the exact steps: "Given X, when Y, then Z happens instead of W"
38
+ - IF the bug is intermittent THEN:
39
+ - Increase parallelism or load to amplify the timing window
40
+ - Add logging at synchronization points
41
+ - Run the reproduction in a loop (100+ iterations) with state logging
42
+ - IF the bug cannot be reproduced locally THEN:
43
+ - Verify environment parity (versions, configuration, data state)
44
+ - Check for environment-specific factors: timezone, locale, file system permissions, network latency
45
+ - Consider using production observability (traces, metrics) if available
46
+
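For the intermittent case, the looped reproduction can be sketched as follows. `reproduce` is a stand-in that fails on the third run so the sketch is self-contained; in practice it would invoke your minimal reproduction (e.g. spawn the failing command with state logging) and return false on failure.

```javascript
// Loop the reproduction and stop at the first failure, keeping its logs.
// `reproduce` is a placeholder for running the actual minimal reproduction.
function reproduce(iteration) {
  return iteration < 3; // stand-in: succeeds twice, fails on iteration 3
}

let failedAt = 0;
for (let i = 1; i <= 100; i++) {
  if (!reproduce(i)) {
    failedAt = i; // record which iteration failed for correlation with logs
    break;
  }
}
console.log(`first failure at iteration ${failedAt}`); // → first failure at iteration 3
```

The iteration count at first failure is itself diagnostic data: a failure rate of roughly 1-in-N suggests how much load amplification is needed to reproduce reliably.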
47
+ ## Step 4: Root Cause Isolation
48
+ - Test the top-ranked hypothesis first:
49
+ - Change exactly ONE variable and observe the result
50
+ - IF the prediction from Step 2 holds THEN the hypothesis is supported — gather one more piece of confirming evidence
51
+ - IF the prediction fails THEN discard the hypothesis and test the next one
52
+ - Use bisection techniques from knowledge/best-practices.md to narrow the search space:
53
+ - **Code bisection**: Insert diagnostic checks at the midpoint of the suspect code path
54
+ - **Git bisect**: If the bug is a regression, identify the first bad commit in O(log n) steps
55
+ - **Data bisection**: If triggered by specific input, bisect the input to find the minimal trigger
56
+ - Verify anti-patterns from knowledge/anti-patterns.md are not present in your investigation:
57
+ - Are you changing multiple things at once? (shotgun debugging)
58
+ - Are you ignoring the error message? (ignoring error messages)
59
+ - Are you blaming the framework without evidence? (assuming bug is elsewhere)
60
+ - The step is complete when you can state: "The root cause is [specific defect] in [specific location] because [evidence]"
61
+
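The data-bisection tactic can be sketched as a greedy input minimizer (a simplified delta-debugging loop). `triggersBug` is a hypothetical failure predicate standing in for "running the program on this input fails".

```javascript
// Hypothetical predicate: any input containing -1 triggers the bug.
function triggersBug(input) {
  return input.includes(-1);
}

// Greedy minimization: repeatedly drop chunks of the input while the bug
// still reproduces, halving the chunk size whenever no drop succeeds.
function minimizeInput(input, failing) {
  let current = input;
  let chunk = Math.ceil(current.length / 2);
  while (chunk >= 1) {
    let reduced = false;
    for (let start = 0; start < current.length; start += chunk) {
      // Candidate = current input with one chunk removed.
      const candidate = current.slice(0, start).concat(current.slice(start + chunk));
      if (candidate.length > 0 && failing(candidate)) {
        current = candidate; // the smaller input still triggers the bug: keep it
        reduced = true;
        break;
      }
    }
    if (!reduced) chunk = Math.floor(chunk / 2); // nothing droppable: refine
  }
  return current;
}

console.log(minimizeInput([3, 7, -1, 4, 9, 2], triggersBug)); // → [ -1 ]
```

The same loop shape applies to code bisection (enable/disable halves of a change set) and is what `git bisect` automates over commit history.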
62
+ ## Step 5: Fix Design
63
+ - Design the fix to address the ROOT CAUSE, not the symptom:
64
+ - IF the root cause is a missing null check THEN add validation at the source of the null, not a try-catch at the crash point
65
+ - IF the root cause is a race condition THEN add proper synchronization, not a retry/sleep workaround
66
+ - IF the root cause is a logic error THEN correct the logic, not add a special-case branch
67
+ - Evaluate fix options if multiple exist:
68
+ - **Correctness**: Does it fully resolve the root cause?
69
+ - **Scope**: Is the change minimal and focused? (avoid premature optimization — knowledge/anti-patterns.md #8)
70
+ - **Side effects**: Could the fix break other code paths? Check callers and dependents
71
+ - **Consistency**: Does it follow the codebase's existing patterns and conventions?
72
+ - Request @botlearn/code-review on the proposed fix before implementation if the change is non-trivial
73
+
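The missing-null-check rule above can be illustrated with a small sketch; `findUser`, `users`, and the surrounding scenario are hypothetical, not from any real codebase.

```javascript
// Root-cause fix: handle the "user not found" case at the source of the null,
// instead of wrapping the eventual crash site in a try/catch.
function findUser(users, id) {
  const user = users.find((u) => u.id === id);
  if (user === undefined) {
    // Fail fast with a diagnosable error where the null originates.
    throw new Error(`user ${id} not found`);
  }
  return user;
}

const users = [{ id: 1, email: "a@example.com" }];
console.log(findUser(users, 1).email); // → a@example.com

// Symptom-level patch to AVOID: `try { sendEmail(user.email) } catch {}` at
// the crash point hides the real defect and leaves the bad state in place.
```

The guard lives where the invalid value is produced, so every downstream caller benefits, and the error message names the actual defect rather than a generic TypeError.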
74
+ ## Step 6: Regression Test Design
75
+ - Write at least ONE test that:
76
+ 1. **Fails** without the fix applied (reproduces the original bug)
77
+ 2. **Passes** with the fix applied
78
+ 3. Covers the specific input/state/sequence that triggered the bug
79
+ - Consider edge case tests:
80
+ - Boundary values (0, -1, MAX_INT, empty string, empty array, null)
81
+ - Concurrent access scenarios (if the bug was concurrency-related)
82
+ - Error handling paths (if the bug was in exception handling)
83
+ - Name the test descriptively to document the bug:
84
+ - `test_processOrder_returnsError_whenItemQuantityIsZero`
85
+ - `test_userList_rendersEmpty_whenApiReturnsEmptyArray`
86
+ - IF the codebase has no test infrastructure THEN provide the test as a standalone script with clear pass/fail output
87
+
88
+ ## Step 7: Verification
89
+ - Apply the fix and run the regression test — confirm it passes
90
+ - Run the FULL test suite — confirm no new failures introduced
91
+ - Re-test the original reproduction case from Step 3 — confirm the bug is resolved
92
+ - Verify edge cases from Step 6 — confirm they pass
93
+ - SELF-CHECK against knowledge/best-practices.md Fix Verification Checklist:
94
+ - [ ] Original bug no longer reproducible
95
+ - [ ] No new failures introduced
96
+ - [ ] Edge cases covered
97
+ - [ ] Regression test exists and is meaningful
98
+ - [ ] Fix addresses root cause, not symptom
99
+ - [ ] Code review completed
100
+ - IF any check fails THEN loop back to the appropriate step:
101
+ - New failures → Step 5 (revise fix design)
102
+ - Edge case failures → Step 6 (add more tests, adjust fix)
103
+ - Root cause not actually fixed → Step 4 (re-investigate)
104
+ - Output the final deliverable:
105
+ - **Root Cause**: One-sentence description of the defect
106
+ - **Evidence**: Key diagnostic findings that confirmed the root cause
107
+ - **Fix**: Description of the change with code diff
108
+ - **Regression Test**: The test(s) added
109
+ - **Risk Assessment**: Any residual risk or areas to monitor
@@ -0,0 +1,466 @@
1
+ {
2
+ "version": "0.0.1",
3
+ "dimension": "code-generation",
4
+ "tasks": [
5
+ {
6
+ "id": "bench-easy-01",
7
+ "difficulty": "easy",
8
+ "description": "Debug an off-by-one error in a Python loop",
9
+ "input": "My Python function is supposed to return the sum of all elements in a list, but it's returning the wrong value for some inputs.\n\n```python\ndef sum_list(numbers):\n total = 0\n for i in range(1, len(numbers)):\n total += numbers[i]\n return total\n```\n\nExample: `sum_list([10, 20, 30])` returns `50` instead of `60`. What's wrong?",
10
+ "rubric": [
11
+ {
12
+ "criterion": "Root Cause Identification",
13
+ "weight": 0.4,
14
+ "scoring": {
15
+ "5": "Correctly identifies that range(1, len(numbers)) skips index 0, so the first element is never added; explains that range() is exclusive of the start in this context (starts at 1, not 0)",
16
+ "3": "Identifies the loop starts at wrong index but explanation is imprecise",
17
+ "1": "Mentions off-by-one but doesn't pinpoint the exact issue",
18
+ "0": "Incorrect root cause"
19
+ }
20
+ },
21
+ {
22
+ "criterion": "Fix Quality",
23
+ "weight": 0.3,
24
+ "scoring": {
25
+ "5": "Suggests changing to range(0, len(numbers)) or range(len(numbers)) or using a for-each loop; may also mention sum() as the Pythonic alternative",
26
+ "3": "Correct fix but no alternatives or explanation of why it works",
27
+ "1": "Fix works but is overly complex",
28
+ "0": "Incorrect fix"
29
+ }
30
+ },
31
+ {
32
+ "criterion": "Regression Test",
33
+ "weight": 0.3,
34
+ "scoring": {
35
+ "5": "Suggests tests including: empty list, single element, multiple elements, negative numbers; provides test code",
36
+ "3": "Suggests at least one test case",
37
+ "1": "Mentions testing generally but no specific cases",
38
+ "0": "No test suggestion"
39
+ }
40
+ }
41
+ ],
42
+ "expectedScoreWithout": 50,
43
+ "expectedScoreWith": 90
44
+ },
45
+ {
46
+ "id": "bench-easy-02",
47
+ "difficulty": "easy",
48
+ "description": "Debug an unhandled null reference in JavaScript",
49
+ "input": "My Express.js API endpoint crashes sometimes with this error:\n\n```\nTypeError: Cannot read properties of undefined (reading 'email')\n at /app/routes/users.js:15:28\n```\n\nHere's the code:\n\n```javascript\napp.post('/api/users', (req, res) => {\n const email = req.body.email;\n const name = req.body.name;\n // ... save to database\n res.json({ success: true });\n});\n```\n\nIt works when I test with Postman but crashes when the frontend form is submitted.",
50
+ "rubric": [
51
+ {
52
+ "criterion": "Root Cause Identification",
53
+ "weight": 0.4,
54
+ "scoring": {
55
+ "5": "Identifies that req.body is undefined because the JSON body parser middleware (express.json()) is missing or not applied before this route; explains why Postman works (may set Content-Type differently or there's a route ordering issue)",
56
+ "3": "Identifies req.body is undefined but doesn't explain the middleware issue",
57
+ "1": "Says body is missing but doesn't identify why",
58
+ "0": "Incorrect root cause"
59
+ }
60
+ },
61
+ {
62
+ "criterion": "Fix Quality",
63
+ "weight": 0.35,
64
+ "scoring": {
65
+ "5": "Suggests adding app.use(express.json()) before the route; also suggests adding input validation (check req.body exists, check email/name are present); mentions Content-Type header requirement",
66
+ "3": "Suggests adding the middleware but no input validation",
67
+ "1": "Suggests a workaround like optional chaining without addressing the root cause",
68
+ "0": "Incorrect fix"
69
+ }
70
+ },
71
+ {
72
+ "criterion": "Debugging Process",
73
+ "weight": 0.25,
74
+ "scoring": {
75
+ "5": "Reads the stack trace, identifies the line, considers the Postman vs frontend difference as a diagnostic clue, forms hypothesis about middleware/headers",
76
+ "3": "Some analysis shown but incomplete",
77
+ "1": "Jumps to conclusion without analysis",
78
+ "0": "No process shown"
79
+ }
80
+ }
81
+ ],
82
+ "expectedScoreWithout": 40,
83
+ "expectedScoreWith": 85
84
+ },
85
+ {
86
+ "id": "bench-easy-03",
87
+ "difficulty": "easy",
88
+ "description": "Debug a Python ModuleNotFoundError",
89
+ "input": "I just cloned a Python project and ran it, but I get this error:\n\n```\nTraceback (most recent call last):\n File \"/app/main.py\", line 3, in <module>\n from app.services.email_service import send_notification\nModuleNotFoundError: No module named 'app.services.email_service'\n```\n\nThe file structure is:\n```\nproject/\n app/\n __init__.py\n main.py\n services/\n email_service.py\n```\n\nI'm running `python app/main.py` from the `project/` directory.",
90
+ "rubric": [
91
+ {
92
+ "criterion": "Root Cause Identification",
93
+ "weight": 0.4,
94
+ "scoring": {
95
+ "5": "Identifies that running `python app/main.py` sets `app/` as the script directory, so `app.services.email_service` is not resolvable; explains Python's module resolution (sys.path) and that the services/ directory may also be missing __init__.py",
96
+ "3": "Identifies the import path issue but doesn't fully explain the sys.path mechanism",
97
+ "1": "Mentions import issue vaguely",
98
+ "0": "Incorrect root cause"
99
+ }
100
+ },
101
+ {
102
+ "criterion": "Fix Quality",
103
+ "weight": 0.35,
104
+ "scoring": {
105
+ "5": "Suggests multiple fixes: (1) run with `python -m app.main` from project/, (2) add __init__.py to services/ if missing, (3) adjust the import to relative import; explains which approach is best and why",
106
+ "3": "Suggests one correct fix",
107
+ "1": "Fix is partially correct",
108
+ "0": "Incorrect fix"
109
+ }
110
+ },
111
+ {
112
+ "criterion": "Debugging Process",
113
+ "weight": 0.25,
114
+ "scoring": {
115
+ "5": "Analyzes the traceback, checks directory structure against the import path, considers multiple possible causes (missing __init__.py, wrong working directory, wrong invocation)",
116
+ "3": "Some analysis but doesn't consider all possibilities",
117
+ "1": "Minimal analysis",
118
+ "0": "No process"
119
+ }
120
+ }
121
+ ],
122
+ "expectedScoreWithout": 40,
123
+ "expectedScoreWith": 85
124
+ },
125
+ {
126
+ "id": "bench-med-01",
127
+ "difficulty": "medium",
128
+ "description": "Debug a race condition in a Node.js caching layer",
129
+ "input": "Our Node.js API has a caching layer that sometimes returns stale data. Users report that after updating their profile, the old profile data is returned for a random amount of time (sometimes seconds, sometimes minutes). Here's the caching code:\n\n```javascript\nconst cache = new Map();\n\nasync function getUserProfile(userId) {\n if (cache.has(userId)) {\n return cache.get(userId);\n }\n const profile = await db.query('SELECT * FROM users WHERE id = $1', [userId]);\n cache.set(userId, profile);\n return profile;\n}\n\nasync function updateUserProfile(userId, data) {\n await db.query('UPDATE users SET name=$1, email=$2 WHERE id=$3', [data.name, data.email, userId]);\n // Invalidate cache after update\n cache.delete(userId);\n return { success: true };\n}\n```\n\nWe're running 4 Node.js worker processes behind a load balancer.",
130
+ "rubric": [
131
+ {
132
+ "criterion": "Root Cause Identification",
133
+ "weight": 0.35,
134
+ "scoring": {
135
+ "5": "Identifies TWO issues: (1) In-memory cache is per-process, so updating on worker 1 doesn't invalidate cache on workers 2-4; (2) Even within one process there's a TOCTOU race: between cache.has() check and db.query(), another request could populate stale data. Explains load balancer routing as the amplifier",
136
+ "3": "Identifies the multi-process cache issue but misses the single-process race condition",
137
+ "1": "Mentions caching issue vaguely without pinpointing the multi-process or race condition aspect",
138
+ "0": "Incorrect root cause"
139
+ }
140
+ },
141
+ {
142
+ "criterion": "Fix Quality",
143
+ "weight": 0.3,
144
+ "scoring": {
145
+ "5": "Suggests replacing in-memory Map with a shared cache (Redis/Memcached) with TTL; optionally suggests cache-aside with invalidation broadcast or write-through caching pattern; discusses TTL as a safety net",
146
+ "3": "Suggests Redis but doesn't address the race condition or TTL strategy",
147
+ "1": "Suggests adding a sleep or retry, or only addresses one of the two issues",
148
+ "0": "Incorrect fix"
149
+ }
150
+ },
151
+ {
152
+ "criterion": "Debugging Process",
153
+ "weight": 0.2,
154
+ "scoring": {
155
+ "5": "Notes the 'random amount of time' symptom as a key clue pointing to multi-process behavior; considers the load balancer routing; systematically analyzes the read and write paths",
156
+ "3": "Some systematic analysis but misses key diagnostic clues",
157
+ "1": "Minimal analysis",
158
+ "0": "No process shown"
159
+ }
160
+ },
161
+ {
162
+ "criterion": "Regression Test",
163
+ "weight": 0.15,
164
+ "scoring": {
165
+ "5": "Suggests a test that simulates concurrent read-after-write across multiple processes/connections and verifies fresh data is returned; or suggests integration test with Redis",
166
+ "3": "Suggests a basic read-after-write test",
167
+ "1": "Mentions testing generally",
168
+ "0": "No test suggestion"
169
+ }
170
+ }
171
+ ],
172
+ "expectedScoreWithout": 25,
173
+ "expectedScoreWith": 70
174
+ },
175
+ {
176
+ "id": "bench-med-02",
177
+ "difficulty": "medium",
178
+ "description": "Debug a database connection pool exhaustion issue",
179
+ "input": "Our Java Spring Boot application starts timing out on database queries after running for about 30 minutes under load. The error we see is:\n\n```\norg.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection;\n nested exception is java.sql.SQLTransientConnectionException:\n HikariPool-1 - Connection is not available, request timed out after 30000ms.\n```\n\nConnection pool config:\n```yaml\nspring:\n datasource:\n hikari:\n maximum-pool-size: 10\n connection-timeout: 30000\n```\n\nThe suspicious service method:\n```java\n@Service\npublic class ReportService {\n @Autowired\n private JdbcTemplate jdbcTemplate;\n\n public Report generateReport(Long reportId) {\n Connection conn = DataSourceUtils.getConnection(jdbcTemplate.getDataSource());\n try {\n // Multiple queries using conn directly\n PreparedStatement ps1 = conn.prepareStatement(\"SELECT * FROM reports WHERE id = ?\");\n ps1.setLong(1, reportId);\n ResultSet rs1 = ps1.executeQuery();\n // ... process results\n\n PreparedStatement ps2 = conn.prepareStatement(\"SELECT * FROM report_items WHERE report_id = ?\");\n ps2.setLong(1, reportId);\n ResultSet rs2 = ps2.executeQuery();\n // ... process results\n\n return buildReport(rs1, rs2);\n } catch (SQLException e) {\n throw new RuntimeException(e);\n }\n }\n}\n```\n\nThe endpoint handling about 50 requests/minute.",
180
+ "rubric": [
181
+ {
182
+ "criterion": "Root Cause Identification",
183
+ "weight": 0.35,
184
+ "scoring": {
185
+ "5": "Identifies that DataSourceUtils.getConnection() borrows a connection from the pool but the code never releases it back (no finally block, no try-with-resources, no DataSourceUtils.releaseConnection()); connections leak on every call until the pool of 10 is exhausted; explains the 30-minute timeline based on request rate vs pool size",
186
+ "3": "Identifies the connection leak but doesn't explain the pool exhaustion timeline or the specific missing release mechanism",
187
+ "1": "Mentions connection issue but doesn't identify the leak",
188
+ "0": "Incorrect root cause (e.g., suggests increasing pool size)"
189
+ }
190
+ },
191
+ {
192
+ "criterion": "Fix Quality",
193
+ "weight": 0.3,
194
+ "scoring": {
195
+ "5": "Suggests adding a finally block with DataSourceUtils.releaseConnection(conn, dataSource); or refactoring to use JdbcTemplate directly (which manages connections automatically); or using try-with-resources; discusses PreparedStatement/ResultSet closing too",
196
+ "3": "Suggests one correct fix approach",
197
+ "1": "Suggests increasing pool size (treats symptom) or incomplete fix",
198
+ "0": "Incorrect fix"
199
+ }
200
+ },
201
+ {
202
+ "criterion": "Debugging Process",
203
+ "weight": 0.2,
204
+ "scoring": {
205
+ "5": "Analyzes the error message (pool timeout), notes the gradual onset pattern as characteristic of resource leaks, examines connection lifecycle in the code, identifies the missing release",
206
+ "3": "Some analysis but misses key diagnostic patterns",
207
+ "1": "Minimal analysis",
208
+ "0": "No process shown"
209
+ }
210
+ },
211
+ {
212
+ "criterion": "Regression Test",
213
+ "weight": 0.15,
214
+ "scoring": {
215
+ "5": "Suggests a test that calls generateReport() more times than the pool size and verifies no timeout occurs; or suggests monitoring active/idle connection counts",
216
+ "3": "Suggests basic testing approach",
217
+ "1": "Mentions testing generally",
218
+ "0": "No test suggestion"
219
+ }
220
+ }
221
+ ],
222
+ "expectedScoreWithout": 30,
223
+ "expectedScoreWith": 75
224
+ },
225
+ {
226
+ "id": "bench-med-03",
227
+ "difficulty": "medium",
228
+ "description": "Debug an incorrect API response caused by async/await misuse",
229
+ "input": "My Node.js Express endpoint is supposed to return enriched product data with reviews, but the reviews array is always empty even though I can see reviews in the database.\n\n```javascript\napp.get('/api/products/:id', async (req, res) => {\n try {\n const product = await db.products.findById(req.params.id);\n if (!product) return res.status(404).json({ error: 'Not found' });\n\n // Enrich with reviews\n product.reviews = getReviewsForProduct(product.id);\n\n // Enrich with recommendations\n product.recommendations = getRecommendations(product.category);\n\n res.json(product);\n } catch (err) {\n res.status(500).json({ error: err.message });\n }\n});\n\nasync function getReviewsForProduct(productId) {\n const reviews = await db.reviews.find({ productId });\n return reviews.map(r => ({\n author: r.authorName,\n rating: r.rating,\n text: r.content,\n date: r.createdAt\n }));\n}\n\nasync function getRecommendations(category) {\n const items = await db.products.find({ category, limit: 5 });\n return items.map(i => ({ id: i.id, name: i.name }));\n}\n```\n\nWhen I log `product.reviews` right before `res.json(product)`, it shows `Promise { <pending> }`.",
230
+ "rubric": [
231
+ {
232
+ "criterion": "Root Cause Identification",
233
+ "weight": 0.35,
234
+ "scoring": {
235
+ "5": "Identifies that getReviewsForProduct() and getRecommendations() are async functions but are called WITHOUT await, so they return Promise objects instead of resolved values; the log showing Promise { <pending> } is the definitive clue; both calls are affected",
236
+ "3": "Identifies the missing await for reviews but doesn't mention recommendations is also affected",
237
+ "1": "Mentions async issue vaguely",
238
+ "0": "Incorrect root cause"
239
+ }
240
+ },
241
+ {
242
+ "criterion": "Fix Quality",
243
+ "weight": 0.3,
244
+ "scoring": {
245
+ "5": "Adds await to both calls; also suggests using Promise.all() for parallel execution since the two enrichments are independent, improving performance; shows the corrected code",
246
+ "3": "Adds await to both calls but doesn't suggest parallel optimization",
247
+ "1": "Only fixes one of the two calls",
248
+ "0": "Incorrect fix"
249
+ }
250
+ },
251
+ {
252
+ "criterion": "Debugging Process",
253
+ "weight": 0.2,
254
+ "scoring": {
255
+ "5": "Uses the Promise { <pending> } log output as the key diagnostic clue, traces the function signatures to confirm they are async, identifies the pattern as a common async anti-pattern",
256
+ "3": "Identifies the issue but doesn't leverage the log output as evidence",
257
+ "1": "Minimal analysis",
258
+ "0": "No process shown"
259
+ }
260
+ },
261
+ {
262
+ "criterion": "Regression Test",
263
+ "weight": 0.15,
264
+ "scoring": {
265
+ "5": "Suggests a test that calls the endpoint and asserts product.reviews is an array of review objects (not a Promise); verifies both reviews and recommendations are populated",
266
+ "3": "Suggests a basic endpoint test",
267
+ "1": "Mentions testing generally",
268
+ "0": "No test suggestion"
269
+ }
270
+ }
271
+ ],
272
+ "expectedScoreWithout": 35,
273
+ "expectedScoreWith": 80
274
+ },
275
+ {
276
+ "id": "bench-med-04",
277
+ "difficulty": "medium",
278
+ "description": "Debug a CSS/React rendering issue with conditional class application",
279
+ "input": "I have a React component that should highlight table rows red when an order is overdue, but the styling never applies even though I can confirm the `isOverdue` flag is true.\n\n```jsx\nfunction OrderTable({ orders }) {\n return (\n <table>\n <tbody>\n {orders.map(order => (\n <tr\n key={order.id}\n className={order.isOverdue && 'overdue-row'}\n >\n <td>{order.id}</td>\n <td>{order.customerName}</td>\n <td>{order.dueDate}</td>\n <td>{order.status}</td>\n </tr>\n ))}\n </tbody>\n </table>\n );\n}\n```\n\n```css\n.overdue-row {\n background-color: #ffcccc;\n font-weight: bold;\n}\n```\n\nWhen I inspect the DOM, the `<tr>` elements have `class=\"overdue-row\"` set correctly. I also verified the CSS file is imported. But the rows still look completely normal. Other CSS classes in the same file work fine.",
280
+ "rubric": [
281
+ {
282
+ "criterion": "Root Cause Identification",
283
+ "weight": 0.4,
284
+ "scoring": {
285
+ "5": "Identifies that the CSS class IS being applied (DOM shows it), so the issue is CSS specificity or inheritance, not React logic; specifically, <tr> background-color is often overridden by <td> or browser default table styles; or a more specific selector elsewhere overrides .overdue-row; mentions CSS specificity as the root cause category",
286
+ "3": "Identifies a CSS specificity issue but doesn't explain why tr background is overridden by td",
287
+ "1": "Incorrectly focuses on the React conditional logic which is working correctly",
288
+ "0": "Incorrect root cause"
289
+ }
290
+ },
291
+ {
292
+ "criterion": "Fix Quality",
293
+ "weight": 0.35,
294
+ "scoring": {
295
+ "5": "Suggests targeting the <td> elements instead: `.overdue-row td { background-color: #ffcccc; }` or using !important as a quick test to confirm specificity is the issue; may also suggest inspecting computed styles in DevTools to see which rule wins",
296
+ "3": "Suggests a working fix but doesn't explain the specificity mechanism",
297
+ "1": "Suggests unrelated fixes (e.g., changing the React conditional logic)",
298
+ "0": "Incorrect fix"
299
+ }
300
+ },
301
+ {
302
+ "criterion": "Debugging Process",
303
+ "weight": 0.25,
304
+ "scoring": {
305
+ "5": "Systematically rules out React (DOM shows class is applied), rules out CSS import (other classes work), narrows to CSS specificity/cascade issue; suggests using DevTools computed styles tab to trace the cascade",
306
+ "3": "Some systematic narrowing but incomplete",
307
+ "1": "Minimal analysis",
308
+ "0": "No process shown"
309
+ }
310
+ }
311
+ ],
312
+ "expectedScoreWithout": 25,
313
+ "expectedScoreWith": 70
314
+ },
315
+ {
316
+ "id": "bench-hard-01",
317
+ "difficulty": "hard",
318
+ "description": "Debug a subtle memory leak in a Node.js long-running service",
319
+ "input": "Our Node.js microservice's memory usage grows from 150MB to 1.2GB over 24 hours, then crashes with OOM. We've taken heap snapshots at 1h, 12h, and 23h. The top retained objects at 23h are:\n\n```\nRetained Size | Object\n 412 MB | (array) in EventEmitter._events.data\n 298 MB | (array) in Map (connectionHandlers)\n 89 MB | (string) in Set (processedIds)\n```\n\nHere's the relevant code:\n\n```javascript\nclass MessageProcessor extends EventEmitter {\n constructor() {\n super();\n this.connectionHandlers = new Map();\n this.processedIds = new Set();\n }\n\n registerConnection(connectionId, socket) {\n const handler = (data) => {\n this.processedIds.add(data.messageId);\n this.emit('data', { connectionId, payload: data });\n // ... process message\n };\n this.connectionHandlers.set(connectionId, handler);\n socket.on('message', handler);\n }\n\n handleDisconnect(connectionId) {\n this.connectionHandlers.delete(connectionId);\n console.log(`Connection ${connectionId} cleaned up`);\n }\n}\n\n// In the server setup:\nconst processor = new MessageProcessor();\n\nwsServer.on('connection', (socket) => {\n const connId = generateId();\n processor.registerConnection(connId, socket);\n\n socket.on('close', () => {\n processor.handleDisconnect(connId);\n });\n\n processor.on('data', (event) => {\n metrics.record(event);\n });\n});\n```\n\nWe handle about 500 connections/hour, with average session duration of 10 minutes.",
320
+ "rubric": [
321
+ {
322
+ "criterion": "Root Cause Identification",
323
+ "weight": 0.35,
324
+ "scoring": {
325
+ "5": "Identifies ALL THREE leaks: (1) processor.on('data') inside the connection handler adds a NEW listener for every connection but never removes it (listeners accumulate on the EventEmitter — the 412MB); (2) socket.on('message', handler) — the handler is removed from connectionHandlers Map but the socket listener is never removed with socket.removeListener/off (partial cleanup — the 298MB may relate to closures retained by these orphaned listeners); (3) processedIds Set grows unboundedly since messageIds are never purged (the 89MB). Correlates each to the heap snapshot data",
326
+ "3": "Identifies 2 of the 3 leaks",
327
+ "1": "Identifies 1 leak",
328
+ "0": "Incorrect analysis or no leaks identified"
329
+ }
330
+ },
331
+ {
332
+ "criterion": "Fix Quality",
333
+ "weight": 0.3,
334
+ "scoring": {
335
+ "5": "Fixes all three: (1) move processor.on('data') outside the connection loop or use a single shared listener; (2) in handleDisconnect, also call socket.removeListener('message', handler) or socket.off(); (3) add TTL-based pruning or a max-size cap to processedIds; discusses setMaxListeners warning as a clue that was likely ignored",
336
+ "3": "Fixes 2 of 3 leaks correctly",
337
+ "1": "Fixes 1 leak",
338
+ "0": "Incorrect fixes"
339
+ }
340
+ },
341
+ {
342
+ "criterion": "Debugging Process",
343
+ "weight": 0.2,
344
+ "scoring": {
345
+ "5": "Uses heap snapshot data as primary evidence; correlates retained sizes to specific data structures in the code; calculates expected growth rate (500 conn/hr * 10min avg = ~83 concurrent, but listener count grows monotonically); explains why handleDisconnect is insufficient",
346
+ "3": "Uses heap snapshot data but analysis is incomplete",
347
+ "1": "Mentions memory leak patterns generally without connecting to the specific evidence",
348
+ "0": "No diagnostic process shown"
349
+ }
350
+ },
351
+ {
352
+ "criterion": "Regression Test",
353
+ "weight": 0.15,
354
+ "scoring": {
355
+ "5": "Suggests a test that creates N connections, disconnects them all, then checks: EventEmitter listener count is back to baseline, connectionHandlers Map is empty, processedIds has bounded size; suggests heap snapshot comparison in CI",
356
+ "3": "Suggests monitoring memory or checking listener count",
357
+ "1": "Mentions testing generally",
358
+ "0": "No test suggestion"
359
+ }
360
+ }
361
+ ],
362
+ "expectedScoreWithout": 15,
363
+ "expectedScoreWith": 60
364
+ },
365
+ {
366
+ "id": "bench-hard-02",
367
+ "difficulty": "hard",
368
+ "description": "Debug a distributed system consistency issue with eventual consistency and message ordering",
369
+ "input": "We have an e-commerce system with an Order Service and an Inventory Service communicating via a message queue (RabbitMQ). Occasionally, customers can purchase items that are actually out of stock. The flow is:\n\n1. Order Service receives purchase request\n2. Order Service publishes `OrderCreated` event to queue\n3. Inventory Service consumes the event and decrements stock\n4. If stock goes below 0, Inventory Service publishes `StockDepleted` event\n5. Order Service consumes `StockDepleted` and should cancel the order\n\n```javascript\n// Order Service\nasync function createOrder(req, res) {\n const order = await db.orders.create({\n userId: req.body.userId,\n productId: req.body.productId,\n quantity: req.body.quantity,\n status: 'confirmed'\n });\n await messageQueue.publish('order.created', {\n orderId: order.id,\n productId: order.productId,\n quantity: order.quantity\n });\n return res.json({ order, message: 'Order confirmed' });\n}\n\n// Inventory Service\nasync function handleOrderCreated(event) {\n const product = await db.products.findById(event.productId);\n product.stock -= event.quantity;\n await product.save();\n if (product.stock < 0) {\n await messageQueue.publish('stock.depleted', {\n productId: event.productId,\n orderId: event.orderId\n });\n }\n}\n```\n\nWe see this issue primarily during flash sales when many orders come in simultaneously for the same product. The product might have stock=5 but 20 orders get confirmed.",
370
+ "rubric": [
371
+ {
372
+ "criterion": "Root Cause Identification",
373
+ "weight": 0.35,
374
+ "scoring": {
375
+ "5": "Identifies MULTIPLE interacting issues: (1) No stock check before confirming the order — the order is 'confirmed' immediately before inventory validation; (2) Race condition in inventory decrement — concurrent handleOrderCreated calls read the same stock value, each decrements from the same base, creating a lost-update problem (no DB-level locking or atomic operation); (3) The architecture is fundamentally flawed for this use case — stock reservation should happen synchronously, not via eventual consistency",
376
+ "3": "Identifies the race condition in inventory but misses the premature confirmation issue",
377
+ "1": "Identifies one issue but misses the systemic architecture problem",
378
+ "0": "Incorrect root cause"
379
+ }
380
+ },
381
+ {
382
+ "criterion": "Fix Quality",
383
+ "weight": 0.3,
384
+ "scoring": {
385
+ "5": "Proposes a comprehensive fix: (1) Use optimistic locking or atomic DB operation (UPDATE products SET stock = stock - ? WHERE id = ? AND stock >= ?) for the inventory decrement; (2) Change order status to 'pending' until inventory is confirmed; (3) Consider synchronous stock reservation (e.g., Saga pattern or synchronous API call) for critical-path operations; discusses trade-offs between approaches",
386
+ "3": "Suggests atomic DB operation but doesn't address the order confirmation flow",
387
+ "1": "Suggests adding a simple lock without considering the distributed nature",
388
+ "0": "Incorrect fix"
389
+ }
390
+ },
391
+ {
392
+ "criterion": "Debugging Process",
393
+ "weight": 0.2,
394
+ "scoring": {
395
+ "5": "Analyzes the flash-sale scenario step by step: traces the timeline of concurrent requests, identifies where the data race occurs, explains why eventual consistency is insufficient for inventory reservation, references the stock=5/20-orders scenario as evidence of lost updates",
396
+ "3": "Some timeline analysis but incomplete",
397
+ "1": "Minimal analysis",
398
+ "0": "No process shown"
399
+ }
400
+ },
401
+ {
402
+ "criterion": "Regression Test",
403
+ "weight": 0.15,
404
+ "scoring": {
405
+ "5": "Suggests a concurrent load test: create a product with stock=5, fire 20 simultaneous order requests, verify exactly 5 are confirmed and 15 are rejected/pending; verify final stock is 0 (not negative)",
406
+ "3": "Suggests a basic concurrency test",
407
+ "1": "Mentions testing generally",
408
+ "0": "No test suggestion"
409
+ }
410
+ }
411
+ ],
412
+ "expectedScoreWithout": 15,
413
+ "expectedScoreWith": 60
414
+ },
415
+ {
416
+ "id": "bench-hard-03",
417
+ "difficulty": "hard",
418
+ "description": "Debug a complex TypeScript type error in a generic data pipeline",
419
+ "input": "I'm building a type-safe data transformation pipeline in TypeScript and getting a complex type error I can't understand. The code:\n\n```typescript\ntype TransformFn<TIn, TOut> = (input: TIn) => TOut;\n\ninterface PipelineStep<TIn, TOut> {\n name: string;\n transform: TransformFn<TIn, TOut>;\n}\n\nclass Pipeline<TInput> {\n private steps: PipelineStep<any, any>[] = [];\n\n pipe<TOut>(step: PipelineStep<TInput, TOut>): Pipeline<TOut> {\n this.steps.push(step);\n return this as unknown as Pipeline<TOut>;\n }\n\n execute(input: TInput): any {\n return this.steps.reduce((acc, step) => step.transform(acc), input);\n }\n}\n\n// Usage:\ninterface RawData {\n name: string;\n age: string;\n scores: string;\n}\n\ninterface ParsedData {\n name: string;\n age: number;\n scores: number[];\n}\n\ninterface EnrichedData extends ParsedData {\n ageGroup: 'junior' | 'senior';\n averageScore: number;\n}\n\nconst pipeline = new Pipeline<RawData>()\n .pipe({\n name: 'parse',\n transform: (raw: RawData): ParsedData => ({\n name: raw.name,\n age: parseInt(raw.age),\n scores: raw.scores.split(',').map(Number)\n })\n })\n .pipe({\n name: 'enrich',\n transform: (parsed: ParsedData): EnrichedData => ({\n ...parsed,\n ageGroup: parsed.age >= 18 ? 'senior' : 'junior',\n averageScore: parsed.scores.reduce((a, b) => a + b, 0) / parsed.scores.length\n })\n });\n\nconst result = pipeline.execute({ name: 'Alice', age: '25', scores: '90,85,92' });\n// result type is 'any' — I want it to be EnrichedData\n// Also, the second .pipe() gives error:\n// Argument of type 'PipelineStep<ParsedData, EnrichedData>' is not assignable\n// to parameter of type 'PipelineStep<RawData, EnrichedData>'\n```\n\nHow do I fix this so the pipeline is truly type-safe with proper inference through the chain?",
420
+ "rubric": [
421
+ {
422
+ "criterion": "Root Cause Identification",
423
+ "weight": 0.35,
424
+ "scoring": {
425
+ "5": "Identifies that the core issue is the pipe() method's type parameter: after the first pipe() call, `this` is cast to Pipeline<TOut> (Pipeline<ParsedData>), BUT the cast `as unknown as Pipeline<TOut>` doesn't actually change the object's generic type at the type level for subsequent method calls — TypeScript sees the original TInput=RawData for the second pipe(). The fundamental problem is that TypeScript cannot track mutating generic types on `this` through method chaining without a builder pattern or function composition approach",
426
+ "3": "Identifies the type erasure issue with the cast but doesn't fully explain why TypeScript can't track the chain",
427
+ "1": "Mentions generic type issue vaguely",
428
+ "0": "Incorrect root cause"
429
+ }
430
+ },
431
+ {
432
+ "criterion": "Fix Quality",
433
+ "weight": 0.35,
434
+ "scoring": {
435
+ "5": "Proposes a proper solution: (1) Use a function-based pipe composition (like fp-ts pipe, or a standalone pipe function that returns a new Pipeline with the correct output type), OR (2) Redesign using a builder pattern where each pipe() returns a new Pipeline<TOut> instance (not casting this), OR (3) Use overloaded type signatures for chaining. Shows working code. Fixes the execute() return type to be the final output type instead of any",
436
+ "3": "Proposes a working fix for the type error but the pipeline isn't fully type-safe end-to-end",
437
+ "1": "Suggests using 'any' or type assertions to silence the error",
438
+ "0": "Incorrect fix"
439
+ }
440
+ },
441
+ {
442
+ "criterion": "Debugging Process",
443
+ "weight": 0.15,
444
+ "scoring": {
445
+ "5": "Traces the type inference through each step of the chain, shows what TypeScript infers at each point, explains the disconnect between the runtime cast and the type system's view",
446
+ "3": "Some type analysis but incomplete",
447
+ "1": "Minimal analysis",
448
+ "0": "No process shown"
449
+ }
450
+ },
451
+ {
452
+ "criterion": "Regression Test",
453
+ "weight": 0.15,
454
+ "scoring": {
455
+ "5": "Suggests compile-time type tests: verify that result is inferred as EnrichedData, verify that piping incompatible types produces a compile error, verify execute() input type matches the initial pipeline type",
456
+ "3": "Suggests a basic type assertion test",
457
+ "1": "Mentions testing generally",
458
+ "0": "No test suggestion"
459
+ }
460
+ }
461
+ ],
462
+ "expectedScoreWithout": 20,
463
+ "expectedScoreWith": 60
464
+ }
465
+ ]
466
+ }
@@ -0,0 +1,54 @@
1
+ {
2
+ "version": "0.0.1",
3
+ "timeout": 60,
4
+ "tasks": [
5
+ {
6
+ "id": "smoke-01",
7
+ "description": "Debug a React component crash with a TypeError by analyzing the stack trace, identifying root cause, and suggesting a fix with regression test",
8
+ "input": "My React app crashes when I load the user profile page. Here's the error:\n\nTypeError: Cannot read properties of undefined (reading 'map')\n at UserProfile (src/components/UserProfile.jsx:23:34)\n at renderWithHooks (node_modules/react-dom/cjs/react-dom.development.js:16305:18)\n at mountIndeterminateComponent (node_modules/react-dom/cjs/react-dom.development.js:20069:13)\n\nHere's the relevant code:\n\n```jsx\nfunction UserProfile({ userId }) {\n const [user, setUser] = useState(null);\n\n useEffect(() => {\n fetch(`/api/users/${userId}`)\n .then(res => res.json())\n .then(data => setUser(data));\n }, [userId]);\n\n return (\n <div>\n <h1>{user.name}</h1>\n <ul>\n {user.posts.map(post => (\n <li key={post.id}>{post.title}</li>\n ))}\n </ul>\n </div>\n );\n}\n```\n\nIt works fine after I navigate to the page from elsewhere, but crashes on direct page load or refresh.",
9
+ "rubric": [
10
+ {
11
+ "criterion": "Root Cause Identification",
12
+ "weight": 0.35,
13
+ "scoring": {
14
+ "5": "Correctly identifies that user is null on initial render because useState initializes to null and the fetch hasn't completed yet; explains the timing issue between render and async data loading",
15
+ "3": "Identifies the null reference issue but doesn't fully explain the timing/lifecycle connection",
16
+ "1": "Mentions null but doesn't pinpoint why it's null on initial render",
17
+ "0": "Incorrect root cause identification"
18
+ }
19
+ },
20
+ {
21
+ "criterion": "Fix Quality",
22
+ "weight": 0.3,
23
+ "scoring": {
24
+ "5": "Suggests a proper fix: add a loading state guard (if (!user) return loading), or use optional chaining (user?.posts?.map), or initialize state with {name: '', posts: []}; explains trade-offs between approaches",
25
+ "3": "Suggests a working fix but doesn't explain trade-offs or only addresses one of the two null access points (user.name and user.posts.map)",
26
+ "1": "Suggests wrapping in try-catch or a fix that only masks the symptom",
27
+ "0": "No fix suggested or fix is incorrect"
28
+ }
29
+ },
30
+ {
31
+ "criterion": "Debugging Process",
32
+ "weight": 0.2,
33
+ "scoring": {
34
+ "5": "Demonstrates systematic analysis: reads stack trace, identifies the failure line, analyzes component lifecycle, formulates hypothesis, explains why it works on navigation (cached data) but not on direct load",
35
+ "3": "Shows some systematic analysis but skips steps or doesn't explain the navigation vs. direct load difference",
36
+ "1": "Jumps to fix without analysis",
37
+ "0": "No debugging process visible"
38
+ }
39
+ },
40
+ {
41
+ "criterion": "Regression Test",
42
+ "weight": 0.15,
43
+ "scoring": {
44
+ "5": "Suggests a test that renders UserProfile before fetch completes and verifies it shows loading state without crashing; includes test code or clear pseudocode",
45
+ "3": "Mentions testing but doesn't provide a specific test case",
46
+ "1": "No test suggestion",
47
+ "0": "Suggests inappropriate test"
48
+ }
49
+ }
50
+ ],
51
+ "passThreshold": 60
52
+ }
53
+ ]
54
+ }