PyPI - reason-critic - Versions diffs - 0.1.0__tar.gz - Mend

reason-critic 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

reason_critic-0.1.0/LICENSE +21 -0
reason_critic-0.1.0/PKG-INFO +409 -0
reason_critic-0.1.0/README.md +375 -0
reason_critic-0.1.0/pyproject.toml +61 -0
reason_critic-0.1.0/setup.cfg +4 -0
reason_critic-0.1.0/src/reason_critic/__init__.py +34 -0
reason_critic-0.1.0/src/reason_critic/benchmarks/__init__.py +4 -0
reason_critic-0.1.0/src/reason_critic/cli.py +296 -0
reason_critic-0.1.0/src/reason_critic/critic.py +521 -0
reason_critic-0.1.0/src/reason_critic/data_prep.py +504 -0
reason_critic-0.1.0/src/reason_critic/pipeline.py +326 -0
reason_critic-0.1.0/src/reason_critic/server.py +253 -0
reason_critic-0.1.0/src/reason_critic/trainer.py +448 -0
reason_critic-0.1.0/src/reason_critic.egg-info/PKG-INFO +409 -0
reason_critic-0.1.0/src/reason_critic.egg-info/SOURCES.txt +18 -0
reason_critic-0.1.0/src/reason_critic.egg-info/dependency_links.txt +1 -0
reason_critic-0.1.0/src/reason_critic.egg-info/entry_points.txt +2 -0
reason_critic-0.1.0/src/reason_critic.egg-info/requires.txt +28 -0
reason_critic-0.1.0/src/reason_critic.egg-info/top_level.txt +1 -0
reason_critic-0.1.0/tests/test_critic.py +641 -0

reason_critic-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2025 FableForge Contributors
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

reason_critic-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,409 @@
+Metadata-Version: 2.4
+Name: reason-critic
+Version: 0.1.0
+Summary: A self-verification model that critiques agent output — it doesn't generate, it flags errors.
+Author: FableForge
+License: MIT
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: rich>=13.0
+Requires-Dist: click>=8.1
+Requires-Dist: pydantic>=2.0
+Requires-Dist: httpx>=0.25
+Requires-Dist: fastapi>=0.104.0
+Requires-Dist: uvicorn>=0.24.0
+Provides-Extra: train
+Requires-Dist: torch>=2.1.0; extra == "train"
+Requires-Dist: transformers>=4.36.0; extra == "train"
+Requires-Dist: peft>=0.7.0; extra == "train"
+Requires-Dist: datasets>=2.14.0; extra == "train"
+Requires-Dist: accelerate>=0.25.0; extra == "train"
+Requires-Dist: unsloth>=2024.1; extra == "train"
+Provides-Extra: gpu
+Requires-Dist: bitsandbytes>=0.43.0; extra == "gpu"
+Provides-Extra: dpo
+Requires-Dist: trl>=0.7.0; extra == "dpo"
+Provides-Extra: all
+Requires-Dist: reason-critic[dpo,gpu,train]; extra == "all"
+Provides-Extra: dev
+Requires-Dist: pytest>=7.4.0; extra == "dev"
+Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
+Requires-Dist: ruff>=0.1.0; extra == "dev"
+Dynamic: license-file
+# ReasonCritic
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE) [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/) [![Tests](https://img.shields.io/badge/tests-0-yellow.svg)](tests/)
+> A self-verification model that critiques agent output. It doesn't generate — it flags errors.
+## Overview
+ReasonCritic is a verification model trained to detect bugs, security issues, logic errors, and style problems in code generated by AI agents. Unlike generative models, it focuses exclusively on **critique**: given code, it produces a structured verdict (PASS/FAIL), confidence score, issue list, and actionable suggestions.
+### Data Sources
+- **v-Fable verification phase**: 62.2% of traces contain verification steps — extracted as (code, pass/fail) pairs
+- **Glint error/recovery pairs**: 3,725 examples of agent mistakes and their corrections
+### Architecture
+- **Base model**: Qwen3-7B
+- **Training**: Three-stage pipeline (contrastive → LoRA → DPO)
+- **Output**: Structured verification result with verdict, confidence, issues, and suggestions
+## Installation
+```bash
+pip install -e .
+# With DPO training support:
+pip install -e ".[dpo]"
+# With development tools:
+pip install -e ".[dev]"
+```
+## Quick Start
+### CLI
+```bash
+# Verify a code snippet
+critic verify --code "def add(a, b): return a + b"
+# Verify a file
+critic verify --file app.py
+# Verify an agent trace
+critic verify --trace trace.jsonl
+# Train the critic model
+critic train --data pairs.jsonl --model Qwen/Qwen3-7B
+# Start the API server
+critic serve --port 8000
+```
+### Python API
+```python
+from reason_critic import ReasonCritic, VerificationResult
+# Initialize critic
+critic = ReasonCritic(backend="local", model_name="reason-critic-7b")
+# Verify code
+result = critic.verify(
+    code="def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)",
+    language="python",
+)
+print(f"Verdict: {result.pass_fail}")      # PASS or FAIL
+print(f"Confidence: {result.confidence}")  # 0.0 to 1.0
+print(f"Issues: {result.issues}")          # List of issues
+print(f"Suggestions: {result.suggestions}") # List of suggestions
+```
+### Verify an Agent Step
+```python
+step = {
+    "index": 0,
+    "type": "code_generation",
+    "code": "for i in range(11):\n    print(data[i])",
+    "name": "process_data",
+}
+step_result = critic.verify_step(step, context="Processing user data")
+print(step_result.result.pass_fail)  # FAIL (off-by-one)
+```
+### Verify a Full Agent Run
+```python
+run = {
+    "id": "run-abc123",
+    "steps": [
+        {"index": 0, "type": "generation", "code": "x = 1", "name": "init"},
+        {"index": 1, "type": "generation", "code": "y = x + 1", "name": "compute"},
+    ]
+}
+run_result = critic.verify_run(run)
+print(f"Overall: {run_result.overall_verdict}")  # PASS or FAIL
+print(f"Steps passed: {run_result.num_passed}/{len(run_result.step_verifications)}")
+```
+### Generate-then-Verify Pipeline
+```python
+from reason_critic.pipeline import GenerateVerifyPipeline, GeneratorWrapper
+from reason_critic import ReasonCritic
+pipeline = GenerateVerifyPipeline(
+    generator=GeneratorWrapper(model_name="Qwen/Qwen3-7B"),
+    critic=ReasonCritic(backend="local", model_name="reason-critic-7b"),
+    max_attempts=3,
+)
+result = pipeline.generate_and_verify(
+    task="Write a function that checks if a string is a palindrome",
+    language="python",
+)
+print(f"Passed: {result.passed}")
+print(f"Attempts: {result.total_attempts}")
+print(f"Final code:\n{result.final_code}")
+```
+If verification fails, the pipeline feeds issues back to the generator for re-generation, up to `max_attempts` cycles.
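Reviewer note (not part of the package): the retry behaviour described above can be sketched in a few lines. This is a minimal illustration, assuming the generator accepts the previous attempt's issues as feedback; `Verdict`, `generate_and_verify`, `toy_generate`, and `toy_verify` are hypothetical names and do not reflect the actual contents of `pipeline.py`.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    """Hypothetical stand-in for the critic's result (not the package's class)."""
    pass_fail: str
    issues: list[str] = field(default_factory=list)


def generate_and_verify(
    task: str,
    generate: Callable[[str, list[str]], str],
    verify: Callable[[str], Verdict],
    max_attempts: int = 3,
) -> tuple[bool, int, str]:
    """Regenerate until the critic passes the code or attempts run out."""
    feedback: list[str] = []
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(task, feedback)  # feedback is empty on the first try
        verdict = verify(code)
        if verdict.pass_fail == "PASS":
            return True, attempt, code
        feedback = verdict.issues  # feed the issues into the next attempt
    return False, max_attempts, code


def toy_generate(task: str, feedback: list[str]) -> str:
    # Toy generator: buggy on the first try, fixed once it receives feedback.
    return "s == s[::-1]" if feedback else "s == s[::1]"


def toy_verify(code: str) -> Verdict:
    # Toy critic: flags a slice that does not reverse the string.
    if "[::-1]" in code:
        return Verdict("PASS")
    return Verdict("FAIL", ["slice does not reverse the string"])


passed, attempts, final_code = generate_and_verify(
    "palindrome check", toy_generate, toy_verify
)
print(passed, attempts, final_code)
```

Passing the generator and critic as plain callables keeps the loop testable without loading a model.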
+## API Server
+```bash
+# Start the server
+critic serve --port 8000
+```
+### Endpoints
+#### `POST /verify` — Verify code
+```json
+{
+    "code": "def add(a, b): return a - b",
+    "context": "Addition function",
+    "language": "python"
+}
+```
+Response:
+```json
+{
+    "pass_fail": "FAIL",
+    "confidence": 0.92,
+    "issues": ["Subtraction instead of addition"],
+    "suggestions": ["Use + instead of -"],
+    "explanation": "Function uses subtraction where addition is expected",
+    "language": "python"
+}
+```
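Reviewer note (not part of the package): a client could validate the response body shown above before acting on it. The sketch below assumes only the six fields that appear in the example response; `ParsedVerdict` and `parse_verdict` are hypothetical helper names, not part of `reason_critic`.

```python
import json
from dataclasses import dataclass, field, fields


@dataclass
class ParsedVerdict:
    """Hypothetical client-side mirror of the response fields shown above."""
    pass_fail: str
    confidence: float
    issues: list[str] = field(default_factory=list)
    suggestions: list[str] = field(default_factory=list)
    explanation: str = ""
    language: str = ""


def parse_verdict(body: str) -> ParsedVerdict:
    """Parse a /verify response body and check the documented value ranges."""
    data = json.loads(body)
    known = {f.name for f in fields(ParsedVerdict)}
    # Ignore any extra keys the server may send.
    verdict = ParsedVerdict(**{k: v for k, v in data.items() if k in known})
    if verdict.pass_fail not in {"PASS", "FAIL"}:
        raise ValueError(f"unexpected verdict: {verdict.pass_fail!r}")
    if not 0.0 <= verdict.confidence <= 1.0:
        raise ValueError(f"confidence out of range: {verdict.confidence}")
    return verdict


sample_body = """
{
    "pass_fail": "FAIL",
    "confidence": 0.92,
    "issues": ["Subtraction instead of addition"],
    "suggestions": ["Use + instead of -"],
    "explanation": "Function uses subtraction where addition is expected",
    "language": "python"
}
"""
verdict = parse_verdict(sample_body)
print(verdict.pass_fail, verdict.confidence, verdict.issues)
```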
+#### `POST /verify/step` — Verify a single step
+```json
+{
+    "step": {
+        "index": 0,
+        "type": "code_generation",
+        "code": "for i in range(11): print(data[i])",
+        "name": "loop_data"
+    },
+    "context": "Processing array"
+}
+```
+#### `POST /verify/run` — Verify a full agent run
+```json
+{
+    "run": {
+        "id": "run-123",
+        "steps": [
+            {"index": 0, "type": "generation", "code": "x = 1"},
+            {"index": 1, "type": "generation", "code": "y = x / 0"}
+        ]
+    },
+    "context": "Data processing pipeline"
+}
+```
+#### `POST /pipeline` — Generate-then-verify
+```json
+{
+    "task": "Write a sorting function",
+    "max_attempts": 3,
+    "language": "python"
+}
+```
+#### `GET /health` — Health check
+```json
+{
+    "status": "healthy",
+    "model": "reason-critic-7b",
+    "backend": "local"
+}
+```
+## Training Pipeline
+### Three-Stage Training
+ReasonCritic uses a three-stage training pipeline:
+1. **Stage 1: Contrastive Learning** — Train on correct/incorrect code pairs to learn the difference
+2. **Stage 2: LoRA Fine-Tuning** — Efficient fine-tuning with Low-Rank Adaptation
+3. **Stage 3: DPO Alignment** — Direct Preference Optimization for better verification preferences
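Reviewer note (not part of the package): for readers unfamiliar with Stage 3, the standard DPO objective for one preference pair can be computed as below. This is the commonly published form of the loss, shown under the assumption that the trainer follows it; the README does not confirm the exact formulation or the `beta` value used in `trainer.py`, and `dpo_loss` is a hypothetical name.

```python
import math


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities.

    The margin measures how much more the policy prefers the chosen
    response over the rejected one, relative to the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted toward the chosen response: the loss is lower.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(neutral, improved)
```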
+### Data Preparation
+```python
+from reason_critic.data_prep import (
+    extract_verification_pairs,
+    generate_incorrect_versions,
+    create_contrastive_pairs,
+    load_glint_error_recovery,
+)
+# Extract from agent traces
+examples = extract_verification_pairs(traces)
+# Generate buggy versions for contrastive learning
+buggy = generate_incorrect_versions(correct_code, num_versions=3)
+# Create pairs
+pair = create_contrastive_pairs(correct_code, incorrect_code)
+# Load Glint error/recovery data
+glint_examples = load_glint_error_recovery("glint_data.jsonl")
+```
+### Bug Templates
+`generate_incorrect_versions` applies systematic bug-introduction strategies:
+| Bug Type | Description |
+|----------|-------------|
+| `off_by_one` | Off-by-one errors in loop bounds |
+| `wrong_operator` | Swapped comparison operators |
+| `missing_none_check` | Missing None check before attribute access |
+| `forgotten_await` | Missing await on async call |
+| `mutable_default` | Mutable default arguments |
+| `shadowed_variable` | Variable shadowing in inner scope |
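Reviewer note (not part of the package): two of the bug types in the table can be illustrated with simple text rewrites. This is a sketch under the assumption that bugs are injected by pattern substitution; the README does not say how `generate_incorrect_versions` works internally, and `inject_off_by_one` and `inject_wrong_operator` are hypothetical helpers.

```python
import re


def inject_off_by_one(code: str) -> str:
    """Widen the first `range(len(x))` bound by one (hypothetical helper)."""
    return re.sub(r"range\(len\((\w+)\)\)", r"range(len(\1) + 1)", code, count=1)


def inject_wrong_operator(code: str) -> str:
    """Swap the first `<=` for `<` (hypothetical helper)."""
    return code.replace("<=", "<", 1)


correct = (
    "for i in range(len(items)):\n"
    "    if i <= limit:\n"
    "        print(items[i])"
)
buggy_bounds = inject_off_by_one(correct)
buggy_operator = inject_wrong_operator(correct)

# Each (correct, incorrect) pair could then serve as one contrastive example.
print(buggy_bounds.splitlines()[0])
print(buggy_operator.splitlines()[1])
```

Text substitution is fragile on real code; an AST-based rewrite would be more robust but is longer than a sketch allows.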
+### Training
+```python
+from reason_critic.trainer import TrainingConfig, run_three_stage_pipeline
+config = TrainingConfig(
+    model_name="Qwen/Qwen3-7B",
+    output_dir="./reason-critic-output",
+    contrastive_epochs=3,
+    lora_epochs=2,
+    dpo_epochs=1,
+)
+results = run_three_stage_pipeline(examples, pairs, output_dir="./output", config=config)
+```
+Or via CLI:
+```bash
+critic train --data pairs.jsonl --model Qwen/Qwen3-7B --stage all
+critic train --data pairs.jsonl --stage contrastive
+critic train --data pairs.jsonl --stage lora
+critic train --data pairs.jsonl --stage dpo
+```
+## Benchmarks
+The project includes 130 verification benchmark tasks across 4 categories:
+| Category | Count | Description |
+|----------|-------|-------------|
+| Code Correctness | 50 | Off-by-one, wrong operators, missing checks, mutations, async bugs |
+| Security Issues | 30 | SQL injection, XSS, CSRF, command injection, crypto weaknesses |
+| Logic Errors | 30 | Condition order, inverted logic, De Morgan's law, scope issues |
+| Style Issues | 20 | Missing docs, magic numbers, god objects, naming, logging |
+```python
+from reason_critic.benchmarks import BENCHMARK_CATEGORIES
+import json
+from pathlib import Path
+for category in BENCHMARK_CATEGORIES:
+    path = Path(__file__).parent / "benchmarks" / category / "tasks.json"
+    tasks = json.loads(path.read_text())
+    print(f"{category}: {len(tasks)} tasks")
+```
+## Architecture
+```
+ReasonCritic
+├── critic.py           # Core verification model + backends (local, API, hybrid)
+├── data_prep.py        # Training data preparation from traces
+├── trainer.py           # Three-stage training pipeline
+├── pipeline.py          # Generate-then-verify pipeline
+├── server.py            # FastAPI server
+├── cli.py               # CLI interface
+└── benchmarks/          # Verification benchmark tasks
+    ├── code_correctness/  # 50 tasks
+    ├── security_issues/    # 30 tasks
+    ├── logic_errors/        # 30 tasks
+    └── style_issues/         # 20 tasks
+```
+### Backends
+- **Local**: Load model via transformers/Unsloth for local inference
+- **API**: Call a remote verification service
+- **Hybrid**: Try local first, fall back to API for low-confidence results
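Reviewer note (not part of the package): the hybrid fallback described above amounts to a confidence threshold. The sketch below assumes each backend returns a `(verdict, confidence)` pair and uses an illustrative threshold of 0.7; the real threshold and backend interface in `critic.py` are not shown in the README, and all names here are hypothetical.

```python
from typing import Callable

# (verdict, confidence), e.g. ("PASS", 0.95)
Result = tuple[str, float]


def hybrid_verify(
    code: str,
    local: Callable[[str], Result],
    remote: Callable[[str], Result],
    min_confidence: float = 0.7,  # assumed threshold, for illustration only
) -> tuple[str, float, str]:
    """Use the local result unless its confidence is below the threshold."""
    verdict, confidence = local(code)
    if confidence >= min_confidence:
        return verdict, confidence, "local"
    verdict, confidence = remote(code)
    return verdict, confidence, "api"


def toy_local(code: str) -> Result:
    # Toy local backend: confident only on very short snippets.
    return ("PASS", 0.95) if len(code) < 20 else ("FAIL", 0.4)


def toy_remote(code: str) -> Result:
    # Toy remote backend: always returns a confident FAIL.
    return ("FAIL", 0.9)


short_result = hybrid_verify("x = 1", toy_local, toy_remote)
long_result = hybrid_verify("y = some_function(x) / 0", toy_local, toy_remote)
print(short_result, long_result)
```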
+### VerificationResult Schema
+```python
+@dataclass
+class VerificationResult:
+    pass_fail: str         # "PASS" or "FAIL"
+    confidence: float      # 0.0 to 1.0
+    issues: list[str]      # List of detected issues
+    suggestions: list[str] # List of suggested fixes
+    explanation: str       # Brief explanation
+    language: str          # Programming language
+    raw_output: str        # Raw model output
+    model_name: str        # Model that produced this result
+```
+## Running Tests
+```bash
+pip install -e ".[dev]"
+pytest tests/ -v
+```
+## License
+MIT
+## Ecosystem
+Part of the [FableForge](../) ecosystem — 21 open-source projects built from 210K real agent traces:
+| Project | Description |
+| --- | --- |
+| **[Anvil](../anvil)** | Self-verified coding agent |
+| **[VerifyLoop](../verifyloop)** | Plan→Execute→Verify→Recover framework |
+| **[ErrorRecovery](../error-recovery)** | Self-healing middleware (3,725 error patterns) |
+| **[FableForge-14B](../fableforge-14b)** | The fine-tuned 14B model (4-stage training) |
+| **[ShellWhisperer](../shell-whisperer)** | 1.5B edge agent (phone/RPi, 50ms) |
+| **[ReasonCritic](../reason-critic)** | Verification model (130 benchmark tasks) |
+| **[TraceCompiler](../trace-compiler)** | Compile traces → LoRA skills |
+| **[AgentRuntime](../agent-runtime)** | Persistent agent daemon (systemd for AI) |
+| **[AgentSwarm](../agent-swarm)** | Multi-agent from real trace transitions |
+| **[AgentTelemetry](../agent-telemetry)** | Datadog for agents (token tracking, costs) |
+| **[BenchAgent](../bench-agent)** | HumanEval for tool-use (107 tasks) |
+| **[AgentDev](../agent-dev)** | VSCode extension with verification |
+| **[TraceViz](../trace-viz)** | Trace replay visualizer (Next.js) |
+| **[AgentSkills](../agent-skills)** | npm for agent behaviors |
+| **[AgentCurriculum](../agent-curriculum)** | 5-stage progressive training |
+| **[AgentFuzzer](../agent-fuzzer)** | Adversarial testing for agents |
+| **[AgentConstitution](../agent-constitution)** | Safety guardrails from traces |
+| **[CostOptimizer](../cost-optimizer)** | Token cost reduction (50-80%) |
+| **[AgentProfiler](../agent-profiler)** | Behavioral fingerprinting |
+| **[TrajectoryDistiller](../trajectory-distiller)** | Trace→training data pipeline |
+| **[Fable5-Dataset](../fable5-dataset)** | HuggingFace dataset release |