PyPI - agentdebugx - Versions diffs - 0.2.2__tar.gz → 0.2.4__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (67)

{agentdebugx-0.2.2 → agentdebugx-0.2.4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: agentdebugx
-Version: 0.2.2
+Version: 0.2.4
 Summary: Portable error analysis, tracing, and recovery framework for agentic AI systems. Import as `agentdebug`.
 License: MIT
 License-File: LICENSE

{agentdebugx-0.2.2 → agentdebugx-0.2.4}/docs/23_status_v0_2.md RENAMED

@@ -1,4 +1,4 @@
-# 23 — Capability + Test Coverage Status (v0.2.2)
+# 23 — Capability + Test Coverage Status (v0.2.3)
 A live audit of what's implemented, what's tested, and what's specced but
 not yet built. Pair this with [docs/15_roadmap.md](./15_roadmap.md), which is
@@ -20,7 +20,9 @@ the forward-looking plan; this doc is the rear-view mirror.
 | Attribution | `agentdebug.attribution.HeuristicAttributor` | ✅ stable | first-finding + tiebreak |
 | Attribution | `agentdebug.attribution.AllAtOnceAttributor` | ✅ stable | mocked LLM + fallback |
 | Attribution | `agentdebug.attribution.StepByStepAttributor` | ✅ **new 0.2.2** | scripted-LLM + fallback |
+| Attribution | `agentdebug.attribution.BinarySearchAttributor` | ✅ **new 0.2.3** | oracle-LLM logarithmic convergence + fallback + render elision |
 | Recovery | `agentdebug.recovery.ReflexionSuggestion` | ✅ stable | per-finding + empty |
+| Recovery | `agentdebug.recovery.CriticRecoverer` + `VerifierSpec` registry | ✅ **new 0.2.3** | 5 family-matched verifier templates; dedup + custom-override |
 | DeepDebug | `agentdebug.deep.DeepDebugAnalyzer` | ✅ stable | full loop + silent LLM |
 | Cascade view | `agentdebug.traceback.format_traceback` | ✅ stable | cascade + step-order + ANSI + empty |
 | Detectors | `agentdebug.detectors.RepeatedToolCall / RepeatedState / StepCountLimit` | ✅ **new 0.2.2** | threshold + window + budget |
@@ -45,13 +47,11 @@ across 32 source files.
 | [06_detectors.md](./06_detectors.md) | `trajectory_perplexity` (TrajAD) | needs token-level LM perplexity API or embedding model + baseline calibration | v0.3 |
 | [06_detectors.md](./06_detectors.md) | `topic_drift` (embedding cosine) | needs embedding client; consider reusing `OpenAICompatClient` `/embeddings` | v0.3 |
 | [06_detectors.md](./06_detectors.md) | LTL spec monitors | requires user-supplied spec or LLM-synthesized monitors; gated on RV research | v1.2 |
-| [07_attribution.md](./07_attribution.md) | `BinarySearchAttributor` (ddmin) | requires replayable environment; few frameworks expose it | v0.3 |
-| [07_attribution.md](./07_attribution.md) | `CounterfactualAttributor` | requires re-rolling agent actions; same replay constraint | v0.3 |
+| [07_attribution.md](./07_attribution.md) | `CounterfactualAttributor` | requires re-rolling agent actions; framework-replay dependent | v0.3 |
 | [07_attribution.md](./07_attribution.md) | `SBFLAttributor` (Tarantula/Ochiai) | needs corpus of passing + failing traces of same task; gated on Hub adoption | v0.4 |
 | [07_attribution.md](./07_attribution.md) | `DeltaDebugAttributor` (Zeller) | same replay constraint | v0.3 |
-| [07_attribution.md](./07_attribution.md) | `EnsembleAttributor` | trivial once 2+ heavy backends ship; awaits BinarySearch/Counterfactual | v0.3 |
+| [07_attribution.md](./07_attribution.md) | `EnsembleAttributor` | trivial once Counterfactual lands; awaits Counterfactual | v0.3 |
 | [08_recovery.md](./08_recovery.md) | `SelfRefineLoop` | small but needs a generator-critic-refiner orchestration | v0.3 |
-| [08_recovery.md](./08_recovery.md) | `CriticRecoverer` | needs a verifier registry (search, code-exec, type-check) | v0.3 |
 | [08_recovery.md](./08_recovery.md) | `AutoManualRules` | needs persistent project manual + injection into next-run prompts | v0.3 |
 | [08_recovery.md](./08_recovery.md) | `LangGraphRewind` | depends on LangGraph checkpointer; ships when we have a real LangGraph user | v0.3 |
 | [08_recovery.md](./08_recovery.md) | `SagaRollback` | needs compensation registry on tool definitions; new schema | v0.3 |
@@ -82,6 +82,40 @@ The audit found one real bug and a handful of test gaps:
 5. **`recovery.ReflexionSuggestion`** had only an indirect test from DeepDebug
    examples; now has direct happy + empty tests.
+## 3.6 Real-usage E2E (live Gemini)
+Beyond unit tests, `scripts/e2e_real_usage.py` builds three realistic failing
+trajectories using **only the public API** (`AgentDebug`, `traced_tool`,
+`SQLiteTraceStore`) and runs the full pipeline against the live LLM.
+Stage results (see [docs/benchmarks/e2e_v0_2_3.md](./benchmarks/e2e_v0_2_3.md)):
+| Scenario | Stages OK |
+|---|---|
+| `action_format_then_hallucination` (planner → bad tool call → hallucinated answer) | 12 / 12 |
+| `multiagent_handoff_loss` (researcher → handoff drops constraint → wrong summary) | 12 / 12 |
+| `planning_loop` (browser clicks #submit 4× with no progress) | 12 / 12 |
+| UI smoke (`/healthz`, `/api/v1/traces`, `/api/v1/traces/<id>`, `/api/v1/taxonomy`, `/`) | 5 / 5 |
+| Fresh-venv `pip install agentdebugx==0.2.3` + import + CLI listing | ✅ |
+**Honest issues the E2E surfaced** (none of these would have been caught by
+the mocked unit tests):
+1. **LLM judge can return truncated JSON on long traces** — gemini-3-flash
+   spent its `max_tokens` budget on reasoning tokens before completing the
+   findings array; the pipeline gracefully returned 0 findings rather than
+   crashing. Mitigation: per-call `max_tokens=6144`+; document the
+   thinking-token trap (done in [docs/20_deep_debug.md §7](./20_deep_debug.md)).
+2. **`BinarySearchAttributor` falls back to `HeuristicAttributor` when its
+   probe JSON is truncated** — observed in 2 of 3 scenarios. The fallback
+   chain works correctly, but the user loses the O(log N) advantage.
+   Followup: tighter bisection prompts; track in `result.raw['probe_count']`.
+3. **`HeuristicAnalyzer` returns `root_cause_step_index=None` when all
+   findings have `step_index=None`** — the event recorded via `traced_tool`
+   doesn't carry a step index. Real bug; `traced_tool` should auto-assign.
+These are tracked as v0.2.4 fixes.
 ## 4. Coverage matrix (post-0.2.2)
 Run `PYTHONPATH=src pytest --cov=agentdebug --cov-report=term`. The two largest
@@ -97,10 +131,17 @@ remaining gaps are deliberate:
 Before v0.3 ships, this doc should record green checkmarks for:
-- [ ] One replayable counterfactual attributor (`BinarySearchAttributor` is
-      the cheapest entry).
-- [ ] One tool-grounded recovery strategy (`CriticRecoverer`) wired against
-      a `Verifier` Protocol.
+- [x] **Logarithmic-cost attributor** (`BinarySearchAttributor`) shipped in
+      0.2.3 — Who&When method 3, O(log N) LLM calls, bisects the trajectory
+      via prefix evaluation. **Note:** this is not yet a "replayable
+      counterfactual" attributor; it predicts whether the failure has
+      already occurred from the prefix without re-rolling the agent. True
+      counterfactual replay is still v0.3.
+- [x] **Tool-grounded recovery strategy** (`CriticRecoverer` + `VerifierSpec`
+      registry) shipped in 0.2.3 — pattern-matches failure modes against 5
+      default verifier templates (JSON-schema guard, final-state check,
+      tool-result type-check, handoff contract, loop-detector guard) and
+      emits per-finding `FixProposal` with rationale + suggested code.
 - [ ] One additional framework adapter that goes through the full conformance
       suite (CrewAI is the most-requested).
 - [ ] HuggingFace Hub round-trip live test (gated on `HF_TOKEN`).
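Reviewer note: the checklist entry for `BinarySearchAttributor` describes bisecting the trajectory by prefix evaluation. The sketch below shows only that general idea and is not the package's implementation. The function name `bisect_decisive_step` and the `failure_in_prefix` callback are hypothetical; in the real attributor the oracle would be an LLM call. It assumes the full trajectory is already known to have failed and that the oracle is monotone (once a prefix shows the failure, every longer prefix does too).

```python
from typing import Callable, Sequence, Tuple


def bisect_decisive_step(
    events: Sequence[str],
    failure_in_prefix: Callable[[Sequence[str]], bool],
) -> Tuple[int, int]:
    """Return (index of the decisive event, number of oracle probes).

    The decisive event is the last element of the shortest prefix for
    which the (hypothetical) oracle reports that the failure has occurred.
    """
    if not events:
        raise ValueError("cannot bisect an empty trajectory")
    lo, hi = 0, len(events) - 1  # the full trajectory is assumed to fail
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if failure_in_prefix(events[: mid + 1]):
            hi = mid          # failure already visible: look earlier
        else:
            lo = mid + 1      # not yet visible: look later
    return lo, probes


# Toy oracle standing in for the LLM: the failure is visible once the
# premature final answer is part of the prefix.
events = ["plan", "search", "observe", "retry", "final_answer", "done"]
step, probes = bisect_decisive_step(events, lambda p: "final_answer" in p)
print(step, probes)
```

With six events the loop needs at most three probes, which is the logarithmic cost the checklist entry refers to. A non-monotone oracle can make the search converge on the wrong step, which is one reason a fallback attributor is useful.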

agentdebugx-0.2.4/docs/benchmarks/e2e_v0_2_3.md ADDED

@@ -0,0 +1,373 @@
+# AgentDebugX v0.2.3 End-to-End Real-Usage Smoke
+Scenarios: **3**. LLM model: `gemini-3-flash`. Generated by `scripts/e2e_real_usage.py`.
+## Per-scenario pipeline status
+| Scenario | trace_id | OK / Total stages | Failed stages |
+|---|---|---|---|
+| `action_format_then_hallucination` | `trace_f81860…` | 12 / 12 | — |
+| `multiagent_handoff_loss` | `trace_45009e…` | 12 / 12 | — |
+| `planning_loop` | `trace_3d1c98…` | 12 / 12 | — |
+**UI smoke:** ✅ all endpoints responded
+```
+GET /healthz -> 200 {"status":"ok"}
+GET /api/v1/traces -> 3 trace(s)
+GET /api/v1/traces/<id> -> 200 events=11 findings=4
+GET /api/v1/taxonomy -> modes=19
+GET / -> 200 content_length=33666 has_brand=True
+```
+## `action_format_then_hallucination`
+`trace_id=trace_f81860758c6d439aaf1ecd7457de6654`
+### ✅ `heuristic_analyzer` (0.00s) — 1 finding(s); root=None
+### ✅ `cross_event_detectors` (0.00s) — 0 finding(s) from default_detectors()
+### ✅ `traceback_offline` (0.00s) — rendered
+```
+AgentTraceback (root cause first, manifested failure last):
+    trace_id=trace_f81860758c6d439aaf1ecd7457de6654  framework=e2e-react  goal='Find the latest AgentDebug paper, summarize the method, then email alice@example.com'
+      File "root cause", in trajectory
+        Step ?  agent=search_web  mode=system.tool_execution_error  confidence=0.86
+          event_id=evt_582bbb55430a4be583ad6c374f7c1564
+          error>  JSON schema validation failed: missing parameter query
+          evidence:
+            - JSON schema validation failed: missing parameter query
+          suggested: Capture tool stderr/status/latency and classify retryable versus non-retryable failures.
+AgentFailure[system.tool_execution_error]: Likely root cause: Tool execution error in search_web at step None.
+```
+### ✅ `reflexion_suggestion` (0.00s) — 1 proposal(s)
+```
+Reflexion retry hint for system.tool_execution_error at step None
+```
+### ✅ `critic_recoverer` (0.00s) — 1 verifier proposal(s)
+```
+Add tool_result_typecheck before system.tool_execution_error (step None, agent search_web)
+```
+### ✅ `llm_judge` (27.97s) — 0 finding(s); root=None
+### ✅ `attribute_heuristic` (0.00s) — method=heuristic (no hypotheses)
+### ✅ `attribute_all_at_once` (6.27s) — method=all_at_once agent=search_web step=None conf=0.90
+```
+The agent failed to provide the required 'query' parameter in the search tool call, which resulted in a validation error and prevented the agent from finding the paper.
+```
+### ✅ `attribute_step_by_step` (17.05s) — method=step_by_step agent=planner step=4 conf=1.00
+```
+The agent prematurely terminates the task with a generic statement, failing to summarize the method or email the recipient as required by the goal.
+```
+### ✅ `attribute_binary_search` (9.48s) — method=binary_search agent=planner step=4 conf=0.90
+```
+Binary search located the decisive step within 3 probes over 6 events.
+```
+### ✅ `deep_debug` (25.52s) — 3 finding(s); rounds=6
+```
+rounds: plan:3794ms / hypothesize:7161ms / verify:h1:2551ms / verify:h2:2381ms / verify:h3:2097ms / refine:7534ms
+summary: The agent failed to provide the required 'query' parameter for the search tool, and the planner subsequently misjudged the task as complete despite failing to find the paper, summarize it, or send the email.
+AgentTraceback (root cause first, manifested failure last):
+    trace_id=trace_f81860758c6d439aaf1ecd7457de6654  framework=e2e-react  goal='Find the latest AgentDebug paper, summarize the method, then email alice@example.com'
+      File "root cause", in trajectory
+        Step ?  agent=search_web  mode=action.parameter_error  confidence=1.00
+          event_id=evt_582bbb55430a4be583ad6c374f7c1564
+          error>  JSON schema validation failed: missing parameter query
+          evidence:
+            - JSON schema validation failed: missing parameter query
+            - args': '()', 'kwargs': '{}'
+          suggested: Validate parameters against tool schemas and ask for missing user/context fields.
+    ↓ cascaded to
+      File "cascade depth 1", in trajectory
+        Step 4  agent=planner  mode=reflection.progress_misjudge  confidence=1.00
+          module=reflection
+          event_id=evt_047e5ad596874186ac8d4413b8ba8185
+          output> Final answer: AgentDebug is a popular paper. Done.
+          evidence:
+            - Final answer: AgentDebug is a popular paper. Done.
+            - error=JSON schema validation failed: missing parameter query
+          suggested: Add an external task verifier before termination.
+    ↓ cascaded to
+      File "cascade depth 2", in trajectory
+        Step ?  agent=system  mode=verification.missing_task_validation  confidence=1.00
+          event_id=evt_0e34f3f892664015b10bba11ed2ac3dd
+          evidence:
+            - meta={'success': True}
+            - Final answer: AgentDebug is a popular paper. Done.
+          suggested: Add final-state validation that is independent of the acting agent.
+AgentFailure[verification.missing_task_validation]: The agent failed to provide the required 'query' parameter for the search tool, and the planner subsequently misjudged the task as complete despite failing to find the paper, summarize it, or send the email.
+```
+### ✅ `hub_round_trip` (0.01s) — pushed=/home/kunlunz2/AgentDebugX/.agentdebug/e2e_hub/bundle_b8fa2127c001463d81c86c8c03e4002a ; bundle_id=bundle_b8fa2127c001463d81c86c8c03e4002a ; listed=1 ; round-trip ok
+## `multiagent_handoff_loss`
+`trace_id=trace_45009e26b64341e69af395c4d4cabc07`
+### ✅ `heuristic_analyzer` (0.00s) — 1 finding(s); root=2
+### ✅ `cross_event_detectors` (0.00s) — 0 finding(s) from default_detectors()
+### ✅ `traceback_offline` (0.00s) — rendered
+```
+AgentTraceback (root cause first, manifested failure last):
+    trace_id=trace_45009e26b64341e69af395c4d4cabc07  framework=e2e-multiagent  goal='Find the best paper on agent debugging, prefer the most recent.'
+      File "root cause", in trajectory
+        Step 2  agent=researcher  mode=multiagent.handoff_loss  confidence=0.70
+          module=multiagent
+          event_id=evt_09730d2f195349639b671b8278d0202c
+          output> Please summarize the agent debugging paper.
+          evidence:
+            - handoff/context signal in event payload
+          suggested: Make handoff payloads typed and include goal, constraints, evidence, confidence, and open questions.
+AgentFailure[multiagent.handoff_loss]: Likely root cause: Handoff context loss in researcher at step 2.
+```
+### ✅ `reflexion_suggestion` (0.00s) — 1 proposal(s)
+```
+Reflexion retry hint for multiagent.handoff_loss at step 2
+```
+### ✅ `critic_recoverer` (0.00s) — 1 verifier proposal(s)
+```
+Add handoff_context_contract before multiagent.handoff_loss (step 2, agent researcher)
+```
+### ✅ `llm_judge` (8.01s) — 2 finding(s); root=2
+```
+- multiagent.handoff_loss (conf=1.00) step=2 agent=researcher
+- verification.missing_task_validation (conf=0.90) step=None agent=system
+```
+### ✅ `attribute_heuristic` (0.00s) — method=heuristic agent=researcher step=2 conf=1.00
+```
+Earliest finding with non-trivial confidence: Handoff context loss
+```
+### ✅ `attribute_all_at_once` (3.30s) — method=all_at_once agent=researcher step=2 conf=1.00
+```
+The researcher correctly identified Paper A as the most recent in step 1 but failed to communicate this specific choice or the recency constraint during the handoff in step 2, leading the summarizer to pick the wrong paper.
+```
+### ✅ `attribute_step_by_step` (20.70s) — method=step_by_step agent=researcher step=2 conf=1.00
+```
+The researcher agent identified the correct paper in the previous step but failed to deliver the result, instead initiating an unnecessary handoff for summarization.
+```
+### ✅ `attribute_binary_search` (7.03s) — method=heuristic agent=researcher step=2 conf=1.00
+```
+Earliest finding with non-trivial confidence: Handoff context loss
+```
+### ✅ `deep_debug` (30.30s) — 3 finding(s); rounds=6
+```
+rounds: plan:3749ms / hypothesize:8433ms / verify:h1:3764ms / verify:h2:2344ms / verify:h3:4481ms / refine:7526ms
+summary: The researcher agent hallucinated paper candidates without performing a search and subsequently failed to communicate the user's recency constraints and the selected paper to the summarizer, leading to an incorrect final output.
+AgentTraceback (root cause first, manifested failure last):
+    trace_id=trace_45009e26b64341e69af395c4d4cabc07  framework=e2e-multiagent  goal='Find the best paper on agent debugging, prefer the most recent.'
+      File "root cause", in trajectory
+        Step 1  agent=researcher  mode=memory.hallucination  confidence=0.95
+          module=planning
+          event_id=evt_1f436c0e08534b579faba36acb2e6703
+          output> Found two candidate papers: A (May 2025) and B (Mar 2024). A is preferred because it is more recent (per user constraint).
+          evidence:
+            - Found two candidate papers: A (May 2025) and B (Mar 2024)
+            - without any apparent search or data gathering steps
+          suggested: Require memory reads to cite the source event or artifact before use.
+    ↓ cascaded to
+      File "cascade depth 1", in trajectory
+        Step 2  agent=researcher  mode=multiagent.handoff_loss  confidence=1.00
+          module=multiagent
+          event_id=evt_09730d2f195349639b671b8278d0202c
+          output> Please summarize the agent debugging paper.
+          evidence:
+            - Please summarize the agent debugging paper.
+            - omitted_context: 'preference for A; recency constraint'
+          suggested: Make handoff payloads typed and include goal, constraints, evidence, confidence, and open questions.
+    ↓ cascaded to
+      File "cascade depth 2", in trajectory
+        Step 4  agent=summarizer  mode=reflection.progress_misjudge  confidence=0.90
+          module=reflection
+          event_id=evt_e7e8080c4dde47ef88c96dc0db743023
+          output> I summarized paper B (the only one I knew about).
+          evidence:
+            - I summarized paper B (the only one I knew about).
+            - prefer the most recent
+          suggested: Add an external task verifier before termination.
+AgentFailure[reflection.progress_misjudge]: The researcher agent hallucinated paper candidates without performing a search and subsequently failed to communicate the user's recency constraints and the selected paper to the summarizer, leading to an incorrect final output.
+```
+### ✅ `hub_round_trip` (0.01s) — pushed=/home/kunlunz2/AgentDebugX/.agentdebug/e2e_hub/bundle_6c474a1056074785a0275407f24fedc1 ; bundle_id=bundle_6c474a1056074785a0275407f24fedc1 ; listed=3 ; round-trip ok
+## `planning_loop`
+`trace_id=trace_3d1c98d2424a4c05ae104b942fe0a302`
+### ✅ `heuristic_analyzer` (0.00s) — 4 finding(s); root=None
+### ✅ `cross_event_detectors` (0.00s) — 3 finding(s) from default_detectors()
+```
+- planning.inefficient_plan (source=repeated_tool_call)
+- planning.inefficient_plan (source=repeated_state)
+- planning.inefficient_plan (source=repeated_state)
+```
+### ✅ `traceback_offline` (0.00s) — rendered
+```
+AgentTraceback (root cause first, manifested failure last):
+    trace_id=trace_3d1c98d2424a4c05ae104b942fe0a302  framework=e2e-browser  goal='Submit the checkout form on shop.example.com'
+      File "root cause", in trajectory
+        Step ?  agent=browser  mode=planning.inefficient_plan  confidence=0.67
+          event_id=evt_3f474425ad3841b886240a70ec694fa5
+          output> no progress; same checkout screen
+          evidence:
+            - loop/progress signal in event payload
+          suggested: Add loop detection over tool calls and state deltas.
+    ↓ cascaded to
+      File "cascade depth 1", in trajectory
+        Step ?  agent=browser  mode=planning.inefficient_plan  confidence=0.67
+          event_id=evt_bce9260a3a024f89ac18430ac2f660ef
+          output> no progress; same checkout screen
+          evidence:
+            - loop/progress signal in event payload
+          suggested: Add loop detection over tool calls and state deltas.
+    ↓ cascaded to
+      File "cascade depth 2", in trajectory
+        Step ?  agent=browser  mode=planning.inefficient_plan  confidence=0.67
+          event_id=evt_45dc8858899546ff8159af3ce4f8d6dd
+          output> no progress; same checkout screen
+          evidence:
+            - loop/progress signal in event payload
+          suggested: Add loop detection over tool calls and state deltas.
+    ↓ cascaded to
+      File "cascade depth 3", in trajectory
+        Step ?  agent=browser  mode=planning.inefficient_plan  confidence=0.67
+          event_id=evt_90f10dde040341dcb9434192fba3255d
+          output> no progress; same checkout screen
+          evidence:
+            - loop/progress signal in event payload
+          suggested: Add loop detection over tool calls and state deltas.
+AgentFailure[planning.inefficient_plan]: Likely root cause: Inefficient plan in browser at step None.
+```
+### ✅ `reflexion_suggestion` (0.00s) — 4 proposal(s)
+```
+Reflexion retry hint for planning.inefficient_plan at step None
+Reflexion retry hint for planning.inefficient_plan at step None
+Reflexion retry hint for planning.inefficient_plan at step None
+Reflexion retry hint for planning.inefficient_plan at step None
+```
+### ✅ `critic_recoverer` (0.00s) — 4 verifier proposal(s)
+```
+Add loop_detector_guard before planning.inefficient_plan (step None, agent browser)
+Add loop_detector_guard before planning.inefficient_plan (step None, agent browser)
+Add loop_detector_guard before planning.inefficient_plan (step None, agent browser)
+Add loop_detector_guard before planning.inefficient_plan (step None, agent browser)
+```
+### ✅ `llm_judge` (19.00s) — 2 finding(s); root=1
+```
+- planning.inefficient_plan (conf=0.90) step=1 agent=planner
+- reflection.progress_misjudge (conf=1.00) step=None agent=system
+```
+### ✅ `attribute_heuristic` (0.00s) — method=heuristic agent=planner step=1 conf=0.90
+```
+Earliest finding with non-trivial confidence: Inefficient plan
+```
+### ✅ `attribute_all_at_once` (5.44s) — method=all_at_once agent=planner step=1 conf=0.90
+```
+The planner's initial strategy was fundamentally flawed, instructing the agent to repeatedly click a button without verifying form completion or handling errors, which led to the failure.
+```
+### ✅ `attribute_step_by_step` (50.07s) — method=step_by_step agent=planner step=1 conf=0.90
+```
+The planner proposed a brute-force clicking strategy without accounting for the necessary form-filling steps required for a checkout process.
+```
+### ✅ `attribute_binary_search` (6.06s) — method=heuristic agent=planner step=1 conf=0.90
+```
+Earliest finding with non-trivial confidence: Inefficient plan
+```
+### ✅ `deep_debug` (38.24s) — 3 finding(s); rounds=6
+```
+rounds: plan:4530ms / hypothesize:12830ms / verify:h1:4850ms / verify:h2:4577ms / verify:h3:4967ms / refine:6484ms
+summary: The checkout submission failed because the planner ignored mandatory form fields and instead devised a strategy to repeatedly click the submit button, which the browser executed without success.
+AgentTraceback (root cause first, manifested failure last):
+    trace_id=trace_3d1c98d2424a4c05ae104b942fe0a302  framework=e2e-browser  goal='Submit the checkout form on shop.example.com'
+      File "root cause", in trajectory
+        Step 1  agent=planner  mode=planning.constraint_ignorance  confidence=1.00
+          module=planning
+          event_id=evt_b1d43814c6b642079ba7015d13b140c5
+          output> Strategy: click #submit until success
+          evidence:
+            - Strategy: click #submit until success
+          suggested: Compile task and tool constraints into pre-action checks.
+    ↓ cascaded to
+      File "cascade depth 1", in trajectory
+        Step ?  agent=browser  mode=planning.inefficient_plan  confidence=0.95
+          event_id=evt_3ceba7d898d04e5a98f463ea8f9c8e72
+          input>  {'tool': 'click', 'args': '()', 'kwargs': "{'selector': '#submit'}"}
+          evidence:
+            - click
+            - {'selector': '#submit'}
+            - no progress; same checkout screen
+          suggested: Add loop detection over tool calls and state deltas.
+    ↓ cascaded to
+      File "cascade depth 2", in trajectory
+        Step ?  agent=browser  mode=reflection.progress_misjudge  confidence=0.90
+          event_id=evt_90f10dde040341dcb9434192fba3255d
+          output> no progress; same checkout screen
+          evidence:
+            - no progress; same checkout screen
+            - meta={'success': True}
+          suggested: Add an external task verifier before termination.
+AgentFailure[reflection.progress_misjudge]: The checkout submission failed because the planner ignored mandatory form fields and instead devised a strategy to repeatedly click the submit button, which the browser executed without success.
+```
+### ✅ `hub_round_trip` (0.01s) — pushed=/home/kunlunz2/AgentDebugX/.agentdebug/e2e_hub/bundle_15650f55244b4ec98adf6ef0042496ec ; bundle_id=bundle_15650f55244b4ec98adf6ef0042496ec ; listed=4 ; round-trip ok
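Reviewer note: the `critic_recoverer` stages in this log pair each failure mode with a verifier template, and the `planning_loop` run prints the same `loop_detector_guard` line four times. The sketch below illustrates one way a family-keyed lookup with de-duplication could work. It is not the package's `CriticRecoverer`: the `Finding` dataclass, the `VERIFIER_BY_FAMILY` table, and the `propose` function are hypothetical, and only the three mode-to-verifier pairings visible in the log are used.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical lookup keyed on the part of the mode id before the first dot.
# Only pairings that appear in the log above are listed.
VERIFIER_BY_FAMILY: Dict[str, str] = {
    "system": "tool_result_typecheck",
    "multiagent": "handoff_context_contract",
    "planning": "loop_detector_guard",
}


@dataclass(frozen=True)
class Finding:
    """Minimal stand-in for a diagnostic finding (illustrative only)."""
    mode: str
    agent: str
    step: Optional[int] = None


def propose(findings: List[Finding]) -> List[str]:
    """Emit one proposal line per distinct (verifier, mode, step, agent)."""
    seen: Set[Tuple[str, str, Optional[int], str]] = set()
    proposals: List[str] = []
    for f in findings:
        verifier = VERIFIER_BY_FAMILY.get(f.mode.split(".", 1)[0])
        if verifier is None:
            continue  # no template registered for this family
        key = (verifier, f.mode, f.step, f.agent)
        if key in seen:
            continue  # identical proposal already emitted
        seen.add(key)
        proposals.append(
            f"Add {verifier} before {f.mode} (step {f.step}, agent {f.agent})"
        )
    return proposals


loop_findings = [Finding("planning.inefficient_plan", "browser")] * 4
print(propose(loop_findings))
```

Keying the de-duplication on step and agent as well as on the verifier keeps distinct proposals for different steps while collapsing exact repeats such as the four lines above.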

{agentdebugx-0.2.2 → agentdebugx-0.2.4}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "agentdebugx"
-version = "0.2.2"
+version = "0.2.4"
 description = "Portable error analysis, tracing, and recovery framework for agentic AI systems. Import as `agentdebug`."
 authors = ["ULab @ UIUC <ulab@illinois.edu>"]
 license = "MIT"

{agentdebugx-0.2.2 → agentdebugx-0.2.4}/src/agentdebug/__init__.py RENAMED

@@ -13,6 +13,7 @@ from agentdebug.attribution import (
     AllAtOnceAttributor,
     AttributionResult,
     Attributor,
+    BinarySearchAttributor,
     Blame,
     HeuristicAttributor,
     StepByStepAttributor,
@@ -38,7 +39,14 @@ from agentdebug.models import (
     Modality,
 )
 from agentdebug.recorder import AgentDebug, TraceSession
-from agentdebug.recovery import FixProposal, Recoverer, ReflexionSuggestion
+from agentdebug.recovery import (
+    DEFAULT_VERIFIERS,
+    CriticRecoverer,
+    FixProposal,
+    Recoverer,
+    ReflexionSuggestion,
+    VerifierSpec,
+)
 from agentdebug.traceback import CascadeFrame, build_cascade, format_traceback
 from agentdebug.storage import JsonlTraceStore, SQLiteTraceStore
 from agentdebug.taxonomy import SEED_FAILURE_MODES, get_failure_mode
@@ -53,13 +61,17 @@ __all__ = [
     'Attributor',
     'Blame',
     'BusEvent',
+    'BinarySearchAttributor',
     'CascadeFrame',
+    'CriticRecoverer',
+    'DEFAULT_VERIFIERS',
     'Detector',
     'DetectorConfig',
     'RepeatedStateDetector',
     'RepeatedToolCallDetector',
     'StepByStepAttributor',
     'StepCountLimitDetector',
+    'VerifierSpec',
     'build_cascade',
     'default_detectors',
     'format_traceback',
@@ -84,4 +96,4 @@ __all__ = [
     'get_failure_mode',
 ]
-__version__ = '0.2.2'
+__version__ = '0.2.4'

{agentdebugx-0.2.2 → agentdebugx-0.2.4}/src/agentdebug/analyzers.py RENAMED

@@ -25,7 +25,11 @@ class HeuristicAnalyzer:
     def analyze(self, trajectory: AgentTrajectory) -> DiagnosticReport:
         findings = [finding for event in trajectory.events for finding in self._event_findings(event)]
-        root = self._select_root_cause(findings)
+        # Build event-order map so the root selector can fall back to it when
+        # findings lack step_index (e.g., events recorded via traced_tool on
+        # pre-0.2.4 captures that pre-date the auto-counter).
+        event_order = {evt.event_id: i for i, evt in enumerate(trajectory.events)}
+        root = self._select_root_cause(findings, event_order=event_order)
         suggestions = self._dedupe(
             finding.suggestion for finding in findings if finding.suggestion is not None
         )
@@ -40,11 +44,19 @@ class HeuristicAnalyzer:
         if root is not None:
             report.root_cause_event_id = root.event_id
             report.root_cause_agent = root.agent_name
-            report.root_cause_step_index = root.step_index
+            # If the finding lacked an explicit step_index, fall back to the
+            # event's position so root_cause_step_index is non-null in the UI
+            # and the AgentTraceback header reads sensibly.
+            inferred_step = root.step_index
+            if inferred_step is None and root.event_id is not None:
+                pos = event_order.get(root.event_id)
+                if pos is not None:
+                    inferred_step = pos
+            report.root_cause_step_index = inferred_step
             report.summary = (
                 f'Likely root cause: {root.failure_mode.name}'
                 f' in {root.agent_name or "unknown agent"}'
-                f' at step {root.step_index}.'
+                f' at step {inferred_step if inferred_step is not None else "?"}.'
             )
         return report
@@ -106,17 +118,35 @@ class HeuristicAnalyzer:
     ) -> Tuple[FailureMode, float, List[str]]:
         return SEED_FAILURE_MODES[mode_id], confidence, evidence
-    def _select_root_cause(self, findings: List[FailureFinding]) -> Optional[FailureFinding]:
+    def _select_root_cause(
+        self,
+        findings: List[FailureFinding],
+        *,
+        event_order: Optional[dict[str, int]] = None,
+    ) -> Optional[FailureFinding]:
         if not findings:
             return None
-        return sorted(
-            findings,
-            key=lambda finding: (
-                finding.step_index is None,
-                finding.step_index if finding.step_index is not None else 10**9,
-                -finding.confidence,
-            ),
-        )[0]
+        # Primary key: step_index (None pushed to end).
+        # Fallback: when step_index is None, use the event's position in the
+        # trajectory so we still pick the *earliest* finding rather than just
+        # the highest-confidence one.
+        event_order = event_order or {}
+        def _key(f: FailureFinding) -> tuple[int, int, int, float]:
+            step = f.step_index if f.step_index is not None else 10**9
+            order = (
+                event_order.get(f.event_id, 10**9)
+                if f.event_id is not None and f.step_index is None
+                else 10**9
+            )
+            return (
+                0 if f.step_index is not None else 1,
+                step,
+                order,
+                -f.confidence,
+            )
+        return sorted(findings, key=_key)[0]
     def _event_text(self, event: AgentEvent) -> str:
         parts = [
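
Reviewer note: the ordering introduced in `_select_root_cause` can be checked in isolation. The sketch below mirrors the four-part sort key from the hunk above using a hypothetical `Finding` dataclass in place of `FailureFinding`; the function name `select_root` is also illustrative. Findings with an explicit `step_index` sort first, then findings without one are ordered by their event's position in the trajectory, and confidence only breaks remaining ties.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

_LAST = 10**9  # sentinel that sorts after any real step or position


@dataclass
class Finding:
    """Minimal stand-in for FailureFinding (illustrative only)."""
    event_id: Optional[str]
    step_index: Optional[int]
    confidence: float


def select_root(
    findings: List[Finding], event_order: Dict[str, int]
) -> Optional[Finding]:
    """Pick the earliest finding, using event position when step_index is None."""
    if not findings:
        return None

    def key(f: Finding) -> Tuple[int, int, int, float]:
        has_step = f.step_index is not None
        step = f.step_index if has_step else _LAST
        # Event position is consulted only when there is no explicit step.
        if has_step or f.event_id is None:
            order = _LAST
        else:
            order = event_order.get(f.event_id, _LAST)
        return (0 if has_step else 1, step, order, -f.confidence)

    return min(findings, key=key)


order = {"evt_a": 0, "evt_b": 1, "evt_c": 2}
late_but_confident = Finding("evt_c", None, 0.95)
early_low_conf = Finding("evt_a", None, 0.50)
root = select_root([late_but_confident, early_low_conf], order)
print(root.event_id)
```

Under the old key, two findings without a step index were separated only by confidence, so the later but more confident finding would have been chosen here.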
