reason-critic 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 FableForge Contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,409 @@
1
+ Metadata-Version: 2.4
2
+ Name: reason-critic
3
+ Version: 0.1.0
4
+ Summary: A self-verification model that critiques agent output — it doesn't generate, it flags errors.
5
+ Author: FableForge
6
+ License: MIT
7
+ Requires-Python: >=3.10
8
+ Description-Content-Type: text/markdown
9
+ License-File: LICENSE
10
+ Requires-Dist: rich>=13.0
11
+ Requires-Dist: click>=8.1
12
+ Requires-Dist: pydantic>=2.0
13
+ Requires-Dist: httpx>=0.25
14
+ Requires-Dist: fastapi>=0.104.0
15
+ Requires-Dist: uvicorn>=0.24.0
16
+ Provides-Extra: train
17
+ Requires-Dist: torch>=2.1.0; extra == "train"
18
+ Requires-Dist: transformers>=4.36.0; extra == "train"
19
+ Requires-Dist: peft>=0.7.0; extra == "train"
20
+ Requires-Dist: datasets>=2.14.0; extra == "train"
21
+ Requires-Dist: accelerate>=0.25.0; extra == "train"
22
+ Requires-Dist: unsloth>=2024.1; extra == "train"
23
+ Provides-Extra: gpu
24
+ Requires-Dist: bitsandbytes>=0.43.0; extra == "gpu"
25
+ Provides-Extra: dpo
26
+ Requires-Dist: trl>=0.7.0; extra == "dpo"
27
+ Provides-Extra: all
28
+ Requires-Dist: reason-critic[dpo,gpu,train]; extra == "all"
29
+ Provides-Extra: dev
30
+ Requires-Dist: pytest>=7.4.0; extra == "dev"
31
+ Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
32
+ Requires-Dist: ruff>=0.1.0; extra == "dev"
33
+ Dynamic: license-file
34
+
35
+ # ReasonCritic
36
+
37
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE) [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/) [![Tests](https://img.shields.io/badge/tests-0-yellow.svg)](tests/)
38
+
39
+
40
+ > A self-verification model that critiques agent output. It doesn't generate — it flags errors.
41
+
42
+ ## Overview
43
+
44
+ ReasonCritic is a verification model trained to detect bugs, security issues, logic errors, and style problems in code generated by AI agents. Unlike generative models, it focuses exclusively on **critique**: given code, it produces a structured verdict (PASS/FAIL), a confidence score, a list of issues, and actionable suggestions.
45
+
46
+ ### Data Sources
47
+
48
+ - **v-Fable verification phase**: 62.2% of traces contain verification steps — extracted as (code, pass/fail) pairs
49
+ - **Glint error/recovery pairs**: 3,725 examples of agent mistakes and their corrections
50
+
51
+ ### Architecture
52
+
53
+ - **Base model**: Qwen3-7B
54
+ - **Training**: Three-stage pipeline (contrastive → LoRA → DPO)
55
+ - **Output**: Structured verification result with verdict, confidence, issues, and suggestions
56
+
57
+ ## Installation
58
+
59
+ ```bash
60
+ pip install -e .
61
+
62
+ # With DPO training support:
63
+ pip install -e ".[dpo]"
64
+
65
+ # With development tools:
66
+ pip install -e ".[dev]"
67
+ ```
68
+
69
+ ## Quick Start
70
+
71
+ ### CLI
72
+
73
+ ```bash
74
+ # Verify a code snippet
75
+ critic verify --code "def add(a, b): return a + b"
76
+
77
+ # Verify a file
78
+ critic verify --file app.py
79
+
80
+ # Verify an agent trace
81
+ critic verify --trace trace.jsonl
82
+
83
+ # Train the critic model
84
+ critic train --data pairs.jsonl --model Qwen/Qwen3-7B
85
+
86
+ # Start the API server
87
+ critic serve --port 8000
88
+ ```
89
+
90
+ ### Python API
91
+
92
+ ```python
93
+ from reason_critic import ReasonCritic, VerificationResult
94
+
95
+ # Initialize critic
96
+ critic = ReasonCritic(backend="local", model_name="reason-critic-7b")
97
+
98
+ # Verify code
99
+ result = critic.verify(
100
+     code="def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)",
101
+     language="python",
102
+ )
103
+
104
+ print(f"Verdict: {result.pass_fail}") # PASS or FAIL
105
+ print(f"Confidence: {result.confidence}") # 0.0 to 1.0
106
+ print(f"Issues: {result.issues}") # List of issues
107
+ print(f"Suggestions: {result.suggestions}") # List of suggestions
108
+ ```
109
+
110
+ ### Verify an Agent Step
111
+
112
+ ```python
113
+ step = {
114
+ "index": 0,
115
+ "type": "code_generation",
116
+ "code": "for i in range(11):\n print(data[i])",
117
+ "name": "process_data",
118
+ }
119
+ step_result = critic.verify_step(step, context="Processing user data")
120
+ print(step_result.result.pass_fail)  # e.g. FAIL if the hard-coded bound overruns `data` (off-by-one)
121
+ ```
122
+
123
+ ### Verify a Full Agent Run
124
+
125
+ ```python
126
+ run = {
127
+ "id": "run-abc123",
128
+ "steps": [
129
+ {"index": 0, "type": "generation", "code": "x = 1", "name": "init"},
130
+ {"index": 1, "type": "generation", "code": "y = x + 1", "name": "compute"},
131
+ ]
132
+ }
133
+ run_result = critic.verify_run(run)
134
+ print(f"Overall: {run_result.overall_verdict}") # PASS or FAIL
135
+ print(f"Steps passed: {run_result.num_passed}/{len(run_result.step_verifications)}")
136
+ ```
137
+
138
+ ### Generate-then-Verify Pipeline
139
+
140
+ ```python
141
+ from reason_critic.pipeline import GenerateVerifyPipeline, GeneratorWrapper
142
+ from reason_critic import ReasonCritic
143
+
144
+ pipeline = GenerateVerifyPipeline(
145
+ generator=GeneratorWrapper(model_name="Qwen/Qwen3-7B"),
146
+ critic=ReasonCritic(backend="local", model_name="reason-critic-7b"),
147
+ max_attempts=3,
148
+ )
149
+
150
+ result = pipeline.generate_and_verify(
151
+ task="Write a function that checks if a string is a palindrome",
152
+ language="python",
153
+ )
154
+
155
+ print(f"Passed: {result.passed}")
156
+ print(f"Attempts: {result.total_attempts}")
157
+ print(f"Final code:\n{result.final_code}")
158
+ ```
159
+
160
+ If verification fails, the pipeline feeds issues back to the generator for re-generation, up to `max_attempts` cycles.
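
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: `Verdict`, `PipelineOutcome`, `toy_generate`, and `toy_verify` are hypothetical stand-ins, and only the field names `passed`, `total_attempts`, and `final_code` are taken from the example above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    """Hypothetical critic output: a pass flag plus the issues found."""
    passed: bool
    issues: list[str] = field(default_factory=list)


@dataclass
class PipelineOutcome:
    """Mirrors the fields printed in the README example."""
    passed: bool
    total_attempts: int
    final_code: str


def generate_and_verify(
    task: str,
    generate: Callable[[str, list[str]], str],
    verify: Callable[[str], Verdict],
    max_attempts: int = 3,
) -> PipelineOutcome:
    """Generate code, verify it, and retry with the critic's issues as feedback."""
    feedback: list[str] = []
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(task, feedback)
        verdict = verify(code)
        if verdict.passed:
            return PipelineOutcome(True, attempt, code)
        # Feed the detected issues into the next generation attempt.
        feedback = verdict.issues
    return PipelineOutcome(False, max_attempts, code)


def toy_generate(task: str, feedback: list[str]) -> str:
    """Stand-in generator: emits a buggy version first, a fixed one after feedback."""
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def toy_verify(code: str) -> Verdict:
    """Stand-in critic: flags subtraction in an addition function."""
    if "a - b" in code:
        return Verdict(False, ["Subtraction instead of addition"])
    return Verdict(True)


outcome = generate_and_verify("Write an add function", toy_generate, toy_verify)
print(outcome.passed, outcome.total_attempts)
```

With these stand-ins the first attempt fails and the second passes, so the loop stops early instead of using all `max_attempts` cycles.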
161
+
162
+ ## API Server
163
+
164
+ ```bash
165
+ # Start the server
166
+ critic serve --port 8000
167
+ ```
168
+
169
+ ### Endpoints
170
+
171
+ #### `POST /verify` — Verify code
172
+
173
+ ```json
174
+ {
175
+ "code": "def add(a, b): return a - b",
176
+ "context": "Addition function",
177
+ "language": "python"
178
+ }
179
+ ```
180
+
181
+ Response:
182
+ ```json
183
+ {
184
+ "pass_fail": "FAIL",
185
+ "confidence": 0.92,
186
+ "issues": ["Subtraction instead of addition"],
187
+ "suggestions": ["Use + instead of -"],
188
+ "explanation": "Function uses subtraction where addition is expected",
189
+ "language": "python"
190
+ }
191
+ ```
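
A client call to this endpoint might look like the sketch below, which uses only the standard library. The base URL `http://localhost:8000` is an assumption based on `critic serve --port 8000`, and `build_verify_request` and `post_verify` are hypothetical helper names, not part of the package.

```python
import json
import urllib.request


def build_verify_request(
    code: str,
    context: str = "",
    language: str = "python",
    base_url: str = "http://localhost:8000",  # assumed host and port
) -> urllib.request.Request:
    """Build a POST /verify request carrying the JSON body shown above."""
    payload = {"code": code, "context": context, "language": language}
    return urllib.request.Request(
        url=f"{base_url}/verify",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_verify(code: str, **kwargs) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_verify_request(code, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_verify_request(
    "def add(a, b): return a - b", context="Addition function"
)
print(request.get_method(), request.full_url)
```

Only `build_verify_request` runs here; `post_verify` would additionally need the server to be reachable.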
192
+
193
+ #### `POST /verify/step` — Verify a single step
194
+
195
+ ```json
196
+ {
197
+ "step": {
198
+ "index": 0,
199
+ "type": "code_generation",
200
+ "code": "for i in range(11): print(data[i])",
201
+ "name": "loop_data"
202
+ },
203
+ "context": "Processing array"
204
+ }
205
+ ```
206
+
207
+ #### `POST /verify/run` — Verify a full agent run
208
+
209
+ ```json
210
+ {
211
+ "run": {
212
+ "id": "run-123",
213
+ "steps": [
214
+ {"index": 0, "type": "generation", "code": "x = 1"},
215
+ {"index": 1, "type": "generation", "code": "y = x / 0"}
216
+ ]
217
+ },
218
+ "context": "Data processing pipeline"
219
+ }
220
+ ```
221
+
222
+ #### `POST /pipeline` — Generate-then-verify
223
+
224
+ ```json
225
+ {
226
+ "task": "Write a sorting function",
227
+ "max_attempts": 3,
228
+ "language": "python"
229
+ }
230
+ ```
231
+
232
+ #### `GET /health` — Health check
233
+
234
+ ```json
235
+ {
236
+ "status": "healthy",
237
+ "model": "reason-critic-7b",
238
+ "backend": "local"
239
+ }
240
+ ```
241
+
242
+ ## Training Pipeline
243
+
244
+ ### Three-Stage Training
245
+
246
+ ReasonCritic uses a three-stage training pipeline:
247
+
248
+ 1. **Stage 1: Contrastive Learning**: train on correct/incorrect code pairs so the model learns to tell working code from buggy code
249
+ 2. **Stage 2: LoRA Fine-Tuning**: parameter-efficient fine-tuning with Low-Rank Adaptation
250
+ 3. **Stage 3: DPO Alignment**: Direct Preference Optimization on preferred versus rejected critiques
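
For orientation, the standard DPO objective for a single preference pair can be written in a few lines. This is a generic sketch of the published loss, not code from this package; the default `beta=0.1` is an assumed value, and whether the trainer computes the loss this way (for example through `trl`) is not stated in the source.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; controls deviation from the reference
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


# When the policy matches the reference, the loss is log(2).
neutral = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# When the policy favours the chosen critique more than the reference does, it drops.
improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)
print(round(neutral, 4), improved < neutral)
```

In the critic setting, "chosen" would be the preferred critique of a code sample and "rejected" a worse one; that pairing is an interpretation of the stage description rather than something the source spells out.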
251
+
252
+ ### Data Preparation
253
+
254
+ ```python
255
+ from reason_critic.data_prep import (
256
+ extract_verification_pairs,
257
+ generate_incorrect_versions,
258
+ create_contrastive_pairs,
259
+ load_glint_error_recovery,
260
+ )
261
+
262
+ # Extract from agent traces
263
+ examples = extract_verification_pairs(traces)
264
+
265
+ # Generate buggy versions for contrastive learning
266
+ buggy = generate_incorrect_versions(correct_code, num_versions=3)
267
+
268
+ # Create pairs
269
+ pair = create_contrastive_pairs(correct_code, incorrect_code)
270
+
271
+ # Load Glint error/recovery data
272
+ glint_examples = load_glint_error_recovery("glint_data.jsonl")
273
+ ```
274
+
275
+ ### Bug Templates
276
+
277
+ `generate_incorrect_versions` applies systematic bug-introduction strategies:
278
+
279
+ | Bug Type | Description |
280
+ |----------|-------------|
281
+ | `off_by_one` | Off-by-one errors in loop bounds |
282
+ | `wrong_operator` | Swapped comparison operators |
283
+ | `missing_none_check` | Missing None check before attribute access |
284
+ | `forgotten_await` | Missing await on async call |
285
+ | `mutable_default` | Mutable default arguments |
286
+ | `shadowed_variable` | Variable shadowing in inner scope |
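
Two of these strategies could be approximated with simple text rewrites, as in the sketch below. The function names `inject_off_by_one` and `inject_wrong_operator` are hypothetical, and the regex-based approach is an assumption made for illustration; the package's `generate_incorrect_versions` may work differently.

```python
import re


def inject_off_by_one(code: str) -> str:
    """Rewrite the first simple `range(n)` call as `range(n + 1)`."""
    # Only matches single-argument calls without nested parentheses.
    return re.sub(r"range\(([^(),]+)\)", r"range(\1 + 1)", code, count=1)


def inject_wrong_operator(code: str) -> str:
    """Replace the first `<=` comparison with `<`."""
    return code.replace("<=", "<", 1)


loop = "for i in range(n):\n    total += values[i]"
guard = "if n <= 1:\n    return 1"

buggy_loop = inject_off_by_one(loop)
buggy_guard = inject_wrong_operator(guard)
print(buggy_loop)
print(buggy_guard)
```

Each mutated snippet paired with its original gives one correct/incorrect example of the kind used for contrastive training.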
287
+
288
+ ### Training
289
+
290
+ ```python
291
+ from reason_critic.trainer import TrainingConfig, run_three_stage_pipeline
292
+
293
+ config = TrainingConfig(
294
+ model_name="Qwen/Qwen3-7B",
295
+ output_dir="./reason-critic-output",
296
+ contrastive_epochs=3,
297
+ lora_epochs=2,
298
+ dpo_epochs=1,
299
+ )
300
+
301
+ results = run_three_stage_pipeline(examples, pairs, output_dir="./reason-critic-output", config=config)
302
+ ```
303
+
304
+ Or via CLI:
305
+ ```bash
306
+ critic train --data pairs.jsonl --model Qwen/Qwen3-7B --stage all
307
+ critic train --data pairs.jsonl --stage contrastive
308
+ critic train --data pairs.jsonl --stage lora
309
+ critic train --data pairs.jsonl --stage dpo
310
+ ```
311
+
312
+ ## Benchmarks
313
+
314
+ The project includes 130 verification benchmark tasks across 4 categories:
315
+
316
+ | Category | Count | Description |
317
+ |----------|-------|-------------|
318
+ | Code Correctness | 50 | Off-by-one, wrong operators, missing checks, mutations, async bugs |
319
+ | Security Issues | 30 | SQL injection, XSS, CSRF, command injection, crypto weaknesses |
320
+ | Logic Errors | 30 | Condition order, inverted logic, De Morgan's law, scope issues |
321
+ | Style Issues | 20 | Missing docs, magic numbers, god objects, naming, logging |
322
+
323
+ ```python
324
+ from reason_critic.benchmarks import BENCHMARK_CATEGORIES
325
+ import json
326
+ from pathlib import Path
327
+
328
+ for category in BENCHMARK_CATEGORIES:
329
+     path = Path(__file__).parent / "benchmarks" / category / "tasks.json"
330
+     tasks = json.loads(path.read_text())
331
+     print(f"{category}: {len(tasks)} tasks")
332
+ ```
333
+
334
+ ## Architecture
335
+
336
+ ```
337
+ ReasonCritic
338
+ ├── critic.py # Core verification model + backends (local, API, hybrid)
339
+ ├── data_prep.py # Training data preparation from traces
340
+ ├── trainer.py # Three-stage training pipeline
341
+ ├── pipeline.py # Generate-then-verify pipeline
342
+ ├── server.py # FastAPI server
343
+ ├── cli.py # CLI interface
344
+ └── benchmarks/ # Verification benchmark tasks
345
+     ├── code_correctness/  # 50 tasks
346
+     ├── security_issues/   # 30 tasks
347
+     ├── logic_errors/      # 30 tasks
348
+     └── style_issues/      # 20 tasks
349
+ ```
350
+
351
+ ### Backends
352
+
353
+ - **Local**: Load model via transformers/Unsloth for local inference
354
+ - **API**: Call a remote verification service
355
+ - **Hybrid**: Try local first, fall back to API for low-confidence results
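
The hybrid fallback can be pictured as a confidence threshold on the local result. The sketch below is illustrative only: `Result`, `hybrid_verify`, and the `min_confidence=0.7` cutoff are assumptions, since the source does not say what counts as a low-confidence result.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    """Hypothetical subset of the verification result."""
    pass_fail: str
    confidence: float
    source: str


def hybrid_verify(
    code: str,
    local_verify: Callable[[str], Result],
    api_verify: Callable[[str], Result],
    min_confidence: float = 0.7,  # assumed threshold
) -> Result:
    """Use the local result unless its confidence is below the threshold."""
    local_result = local_verify(code)
    if local_result.confidence >= min_confidence:
        return local_result
    # Low confidence locally: defer to the remote verification service.
    return api_verify(code)


def unsure_local(code: str) -> Result:
    """Stand-in local backend that is never confident."""
    return Result("PASS", 0.55, "local")


def confident_local(code: str) -> Result:
    """Stand-in local backend that is always confident."""
    return Result("PASS", 0.95, "local")


def remote(code: str) -> Result:
    """Stand-in remote backend."""
    return Result("FAIL", 0.90, "api")


print(hybrid_verify("x = 1", unsure_local, remote).source)
print(hybrid_verify("x = 1", confident_local, remote).source)
```

A design like this keeps most requests local and pays the remote-call cost only for uncertain verdicts.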
356
+
357
+ ### VerificationResult Schema
358
+
359
+ ```python
360
+ @dataclass
361
+ class VerificationResult:
362
+     pass_fail: str          # "PASS" or "FAIL"
363
+     confidence: float       # 0.0 to 1.0
364
+     issues: list[str]       # List of detected issues
365
+     suggestions: list[str]  # List of suggested fixes
366
+     explanation: str        # Brief explanation
367
+     language: str           # Programming language
368
+     raw_output: str         # Raw model output
369
+     model_name: str         # Model that produced this result
370
+ ```
371
+
372
+ ## Running Tests
373
+
374
+ ```bash
375
+ pip install -e ".[dev]"
376
+ pytest tests/ -v
377
+ ```
378
+
379
+ ## License
380
+
381
+ MIT
382
+
383
+ ## Ecosystem
384
+
385
+ Part of the [FableForge](../) ecosystem — 21 open-source projects built from 210K real agent traces:
386
+
387
+ | Project | Description |
388
+ | --- | --- |
389
+ | **[Anvil](../anvil)** | Self-verified coding agent |
390
+ | **[VerifyLoop](../verifyloop)** | Plan→Execute→Verify→Recover framework |
391
+ | **[ErrorRecovery](../error-recovery)** | Self-healing middleware (3,725 error patterns) |
392
+ | **[FableForge-14B](../fableforge-14b)** | The fine-tuned 14B model (4-stage training) |
393
+ | **[ShellWhisperer](../shell-whisperer)** | 1.5B edge agent (phone/RPi, 50ms) |
394
+ | **[ReasonCritic](../reason-critic)** | Verification model (130 benchmark tasks) |
395
+ | **[TraceCompiler](../trace-compiler)** | Compile traces → LoRA skills |
396
+ | **[AgentRuntime](../agent-runtime)** | Persistent agent daemon (systemd for AI) |
397
+ | **[AgentSwarm](../agent-swarm)** | Multi-agent from real trace transitions |
398
+ | **[AgentTelemetry](../agent-telemetry)** | Datadog for agents (token tracking, costs) |
399
+ | **[BenchAgent](../bench-agent)** | HumanEval for tool-use (107 tasks) |
400
+ | **[AgentDev](../agent-dev)** | VSCode extension with verification |
401
+ | **[TraceViz](../trace-viz)** | Trace replay visualizer (Next.js) |
402
+ | **[AgentSkills](../agent-skills)** | npm for agent behaviors |
403
+ | **[AgentCurriculum](../agent-curriculum)** | 5-stage progressive training |
404
+ | **[AgentFuzzer](../agent-fuzzer)** | Adversarial testing for agents |
405
+ | **[AgentConstitution](../agent-constitution)** | Safety guardrails from traces |
406
+ | **[CostOptimizer](../cost-optimizer)** | Token cost reduction (50-80%) |
407
+ | **[AgentProfiler](../agent-profiler)** | Behavioral fingerprinting |
408
+ | **[TrajectoryDistiller](../trajectory-distiller)** | Trace→training data pipeline |
409
+ | **[Fable5-Dataset](../fable5-dataset)** | HuggingFace dataset release |