agentdebugx 0.2.5__tar.gz → 0.2.7__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (69)
  1. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/PKG-INFO +1 -1
  2. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/23_status_v0_2.md +35 -4
  3. agentdebugx-0.2.7/docs/benchmarks/who_when_v0_2_6_leaderboard.md +74 -0
  4. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/pyproject.toml +1 -1
  5. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/__init__.py +3 -1
  6. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/attribution.py +186 -1
  7. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/judges.py +23 -17
  8. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/ui/server.py +52 -5
  9. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/LICENSE +0 -0
  10. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/README.md +0 -0
  11. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/00_overview.md +0 -0
  12. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/01_literature_survey.md +0 -0
  13. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/02_architecture.md +0 -0
  14. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/03_taxonomy.md +0 -0
  15. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/04_trace_schema.md +0 -0
  16. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/05_adapters.md +0 -0
  17. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/06_detectors.md +0 -0
  18. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/07_attribution.md +0 -0
  19. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/08_recovery.md +0 -0
  20. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/09_error_database.md +0 -0
  21. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/10_taxonomy_induction.md +0 -0
  22. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/11_multimodal.md +0 -0
  23. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/12_ui_dashboard.md +0 -0
  24. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/13_class_design.md +0 -0
  25. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/14_api_reference.md +0 -0
  26. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/15_roadmap.md +0 -0
  27. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/16_governance.md +0 -0
  28. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/17_claude_code_design_patterns.md +0 -0
  29. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/18_comparison_codex_vs_design.md +0 -0
  30. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/19_error_hub.md +0 -0
  31. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/20_deep_debug.md +0 -0
  32. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/21_integrations.md +0 -0
  33. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/22_industry_track_paper_eval_plan.md +0 -0
  34. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/ERROR_TAXONOMY.md +0 -0
  35. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/OPEN_SOURCE_DEVELOPMENT_PLAN.md +0 -0
  36. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/README.md +0 -0
  37. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/RESEARCH_SURVEY.md +0 -0
  38. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/benchmarks/e2e_v0_2_3.md +0 -0
  39. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/benchmarks/e2e_v0_2_4.md +0 -0
  40. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/benchmarks/v0_1_smoke.json +0 -0
  41. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/docs/benchmarks/v0_1_smoke.md +0 -0
  42. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/adapters/__init__.py +0 -0
  43. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/adapters/base.py +0 -0
  44. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/adapters/crewai.py +0 -0
  45. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/adapters/langgraph.py +0 -0
  46. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/adapters/otel.py +0 -0
  47. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/adapters/raw.py +0 -0
  48. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/analyzers.py +0 -0
  49. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/cli.py +0 -0
  50. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/deep.py +0 -0
  51. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/detectors.py +0 -0
  52. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/events.py +0 -0
  53. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/hub/__init__.py +0 -0
  54. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/hub/backend_base.py +0 -0
  55. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/hub/backends.py +0 -0
  56. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/hub/bundle.py +0 -0
  57. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/hub/scrub.py +0 -0
  58. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/instrumentation.py +0 -0
  59. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/integrations/__init__.py +0 -0
  60. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/integrations/claude_skill.py +0 -0
  61. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/integrations/openhands.py +0 -0
  62. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/llm.py +0 -0
  63. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/models.py +0 -0
  64. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/recorder.py +0 -0
  65. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/recovery.py +0 -0
  66. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/storage.py +0 -0
  67. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/taxonomy.py +0 -0
  68. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/traceback.py +0 -0
  69. {agentdebugx-0.2.5 → agentdebugx-0.2.7}/src/agentdebug/ui/__init__.py +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: agentdebugx
- Version: 0.2.5
+ Version: 0.2.7
  Summary: Portable error analysis, tracing, and recovery framework for agentic AI systems. Import as `agentdebug`.
  License: MIT
  License-File: LICENSE
@@ -21,6 +21,7 @@ the forward-looking plan; this doc is the rear-view mirror.
  | Attribution | `agentdebug.attribution.AllAtOnceAttributor` | ✅ stable | mocked LLM + fallback |
  | Attribution | `agentdebug.attribution.StepByStepAttributor` | ✅ **new 0.2.2** | scripted-LLM + fallback |
  | Attribution | `agentdebug.attribution.BinarySearchAttributor` | ✅ **new 0.2.3** | oracle-LLM logarithmic convergence + fallback + render elision |
+ | Attribution | `agentdebug.attribution.CounterfactualAttributor` | ✅ **new 0.2.7** | scripted-rescue-prob ranking + candidate selection priority (findings → errors → tail) + dual fallback (no candidates / silent LLM) |
  | Recovery | `agentdebug.recovery.ReflexionSuggestion` | ✅ stable | per-finding + empty |
  | Recovery | `agentdebug.recovery.CriticRecoverer` + `VerifierSpec` registry | ✅ **new 0.2.3** | 5 family-matched verifier templates; dedup + custom-override |
  | DeepDebug | `agentdebug.deep.DeepDebugAnalyzer` | ✅ stable | full loop + silent LLM |
@@ -47,7 +48,7 @@ across 32 source files.
  | [06_detectors.md](./06_detectors.md) | `trajectory_perplexity` (TrajAD) | needs token-level LM perplexity API or embedding model + baseline calibration | v0.3 |
  | [06_detectors.md](./06_detectors.md) | `topic_drift` (embedding cosine) | needs embedding client; consider reusing `OpenAICompatClient` `/embeddings` | v0.3 |
  | [06_detectors.md](./06_detectors.md) | LTL spec monitors | requires user-supplied spec or LLM-synthesized monitors; gated on RV research | v1.2 |
- | [07_attribution.md](./07_attribution.md) | `CounterfactualAttributor` | requires re-rolling agent actions; framework-replay dependent | v0.3 |
+ | [07_attribution.md](./07_attribution.md) | `CounterfactualAttributor` — *real* replay variant | true re-rollout requires a framework-specific replay surface; the v0.2.7 LLM-simulated variant ships now, and the real-replay variant is gated on adapter support (LangGraph checkpointer / OpenHands rewind) | v0.4 |
  | [07_attribution.md](./07_attribution.md) | `SBFLAttributor` (Tarantula/Ochiai) | needs corpus of passing + failing traces of same task; gated on Hub adoption | v0.4 |
  | [07_attribution.md](./07_attribution.md) | `DeltaDebugAttributor` (Zeller) | same replay constraint | v0.3 |
  | [07_attribution.md](./07_attribution.md) | `EnsembleAttributor` | trivial once Counterfactual lands; awaits Counterfactual | v0.3 |
@@ -82,6 +83,26 @@ The audit found one real bug and a handful of test gaps:
  5. **`recovery.ReflexionSuggestion`** had only an indirect test from DeepDebug
     examples; now has direct happy + empty tests.
  
+ ## 3.7 Judge hardening (0.2.6)
+ 
+ A v0.2.5 Who&When 5-trace live run had `llm_judge_root.agent_match=0.00`
+ because the judge truncated mid-array on long multi-agent debate
+ transcripts. Three changes in 0.2.6 lifted that to **0.40** on the same
+ sample (same model, same traces):
+ 
+ 1. `LLMJudgeAnalyzer.max_tokens` default **4096 → 8192** — leaves room for
+    thinking-model reasoning tokens before the JSON object starts.
+ 2. `LLMJudgeAnalyzer.max_findings_per_chunk` parameter (default 6) — the
+    system prompt now asks the model to cap its findings array, forcing it
+    to close the JSON even when many candidates are visible.
+ 3. The system prompt now has explicit "CRITICAL OUTPUT RULES" — output ONLY
+    JSON, no markdown fences, no newlines in string values, complete the
+    array.
+ 
+ Numbers: see [docs/benchmarks/who_when_v0_2_6_leaderboard.md](./benchmarks/who_when_v0_2_6_leaderboard.md).
+ The same trick worked for `BinarySearchAttributor` (shipped in 0.2.4); apply it to
+ the remaining LLM-using analyzers as more thinking models hit this failure mode.
+ 
  ## 3.6 Real-usage E2E (live Gemini)
  
  Beyond unit tests, `scripts/e2e_real_usage.py` builds three realistic failing
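
For reference, a minimal sketch of wiring up the hardened judge described in 3.7, using only the constructor parameters visible in this diff (the `llm` argument is a placeholder for whatever satisfies the package's LLM client protocol, and the import path assumes the file layout shown above):

```python
from agentdebug.judges import LLMJudgeAnalyzer

def make_hardened_judge(llm):
    # `llm` is assumed to satisfy the package's LLM client protocol
    # (see src/agentdebug/llm.py); it is not constructed here.
    return LLMJudgeAnalyzer(
        llm,
        max_tokens=8192,           # 0.2.6 default: headroom for reasoning tokens
        max_findings_per_chunk=6,  # cap injected into the system prompt
    )
```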
@@ -148,6 +169,16 @@ Before v0.3 ships, this doc should record green checkmarks for:
      into `AgentEvent`s. Conformance test mocks the bus and verifies
      every documented event mapping plus the version-skew degradation
      path. `examples/crewai_demo.py` shows a working two-agent crew.
- - [ ] HuggingFace Hub round-trip live test (gated on `HF_TOKEN`).
- - [ ] Bench harness extended with one published-benchmark loader (Who&When
-   is the obvious first target we already cite its method).
+ - [x] **HuggingFace Hub round-trip live test** shipped in 0.2.6 as
+   `tests/test_hub_huggingface_live.py`. Gated on `HF_TOKEN` +
+   `AGENTDEBUG_HF_LIVE=1` so it never runs in default CI. Creates the
+   dataset repo if missing, pushes a bundle, lists, pulls back, and verifies
+   that the trajectory round-trips bit-for-bit. Live-validated against
+   `KunlunZhu/agentdebugx-live-test`.
+ - [x] **Bench harness with Who&When loader** — `experiments/prepare_who_when.py`
+   ingests 184 Algorithm-Generated + Hand-Crafted traces (4092 events) and
+   stores labels separately. `experiments/run_who_when_eval.py` runs all
+   4 attributors + DeepDebug against gold labels and reports agent_match,
+   exact_step, and near_step. Live-Gemini 5-trace validation is captured at
+   [docs/benchmarks/who_when_v0_2_6_leaderboard.md](./benchmarks/who_when_v0_2_6_leaderboard.md).
+   The headline 184-trace run is deferred (~6h / ~$5-10 on a frontier model).
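
The live test body itself is not part of this diff; a minimal pytest sketch of the double environment-variable gate it describes (test name and body are illustrative, not the shipped file):

```python
import os

import pytest

_LIVE = os.getenv("AGENTDEBUG_HF_LIVE") == "1" and bool(os.getenv("HF_TOKEN"))

@pytest.mark.skipif(not _LIVE, reason="set HF_TOKEN and AGENTDEBUG_HF_LIVE=1 to run live")
def test_hub_round_trip():
    # Create the dataset repo if missing, push a bundle, list, pull it
    # back, and compare bytes -- the shipped test does this against
    # KunlunZhu/agentdebugx-live-test.
    ...
```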
@@ -0,0 +1,74 @@
+ # Who&When — 5-trace Live Leaderboard (v0.2.6, gemini-3-flash)
+ 
+ A tiny validation sample drawn from `data/who_when/processed/labels.jsonl`
+ (the first 5 algorithm-generated traces). **Not a publishable benchmark** — the
+ full benchmark requires the 184-trace dataset + a frontier model and is
+ deferred for cost reasons. This run exists to verify that the analysis stack
+ produces sensibly shaped numbers and to surface regressions early.
+ 
+ ## Aggregate (per attribution method)
+ 
+ | Method | agent_match | exact_step | near_step | both_near | DeepDebug rounds |
+ |---|---:|---:|---:|---:|---:|
+ | `heuristic` (rule baseline) | 0.20 | 0.00 | 0.20 | 0.20 | n/a |
+ | `llm_judge_root` (judge's root_cause field) | **0.40** | 0.00 | **0.20** | **0.20** | n/a |
+ | `all_at_once` (Who&When method 1) | 0.20 | 0.00 | 0.00 | 0.00 | n/a |
+ | `step_by_step` (Who&When method 2) | **0.40** | 0.00 | **0.20** | **0.20** | n/a |
+ | `deep_debug_root` (DeepDebug refined root) | 0.20 | 0.00 | 0.20 | 0.00 | 6 / trace |
+ 
+ ## What changed in 0.2.6 vs 0.2.5
+ 
+ Same 5 traces, same model:
+ 
+ | Method | 0.2.5 agent_match | 0.2.6 agent_match | Δ |
+ |---|---:|---:|---:|
+ | `heuristic` | 0.20 | 0.20 | — |
+ | `llm_judge_root` | 0.00 | **0.40** | +0.40 |
+ | `all_at_once` | 0.00 | 0.20 | +0.20 |
+ | `step_by_step` | 0.00 | **0.40** | +0.40 |
+ 
+ The driver was the v0.2.6 judge prompt hardening: the `max_tokens` default
+ raised from 4096 to 8192, an explicit `max_findings_per_chunk=6` cap surfaced
+ through the system prompt, and a "CRITICAL OUTPUT RULES" header (output ONLY
+ JSON, no markdown, no newlines in strings, complete the array). Before the
+ hardening, the judge truncated mid-array on Who&When debate transcripts
+ and returned no findings; after it, the structured root_cause is populated.
+ 
+ ## Honest caveats
+ 
+ * n=5; per-method standard error is ±0.22 — these absolute numbers should
+   not be over-interpreted. The 0.40-vs-0.00 jump for two methods is the
+   signal worth reporting; everything else is noise.
+ * `deep_debug_root` underperformed `step_by_step` on this sample. The
+   refine round on 7-event traces tends to converge on the *visible*
+   failure rather than the *causal* root (a known Who&When difficulty —
+   manifestation vs. root cause).
+ * No method beats `near_step=0.20` on this sample. Step localization
+   remains hard, matching the published Who&When ceiling (~14% step accuracy
+   on 127 traces with frontier models).
+ 
+ ## Reproducing
+ 
+ ```bash
+ # Prepare data (once)
+ PYTHONPATH=src python experiments/prepare_who_when.py
+ 
+ # Set live LLM creds (any OpenAI-compatible endpoint works)
+ export AGENTDEBUG_LLM_BASE_URL=...
+ export AGENTDEBUG_LLM_API_KEY=...
+ export AGENTDEBUG_LLM_MODEL=gemini-3-flash
+ 
+ # Without DeepDebug (~1 min)
+ PYTHONPATH=src python experiments/run_who_when_eval.py \
+   --limit 5 --live-openai \
+   --out-dir experiments/runs/who_when_eval_subset
+ 
+ # With DeepDebug (~5 min)
+ PYTHONPATH=src python experiments/run_who_when_eval.py \
+   --limit 5 --live-openai --deep \
+   --out-dir experiments/runs/who_when_eval_subset_deep
+ ```
+ 
+ The headline benchmark (184 traces × 5 methods × DeepDebug) would take
+ ~6 hours and ~$5-10 in API cost on a frontier model. Run it once before
+ paper submission; do not run on every iteration.
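
For readers reproducing the tables above, a sketch of how the per-method aggregates could be computed; the metric definitions here (agent equality, exact index match, index within ±1) are assumptions about what `run_who_when_eval.py` reports, not a transcript of it:

```python
def aggregate(preds, golds, near=1):
    """Score parallel lists of {'agent': str, 'step': int} predictions/labels."""
    n = len(golds)
    pairs = list(zip(preds, golds))
    agent_match = sum(p["agent"] == g["agent"] for p, g in pairs) / n
    exact_step = sum(p["step"] == g["step"] for p, g in pairs) / n
    near_step = sum(abs(p["step"] - g["step"]) <= near for p, g in pairs) / n
    # both_near: right agent AND step within the near window.
    both_near = sum(
        p["agent"] == g["agent"] and abs(p["step"] - g["step"]) <= near
        for p, g in pairs
    ) / n
    return {"agent_match": agent_match, "exact_step": exact_step,
            "near_step": near_step, "both_near": both_near}
```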
@@ -1,6 +1,6 @@
  [tool.poetry]
  name = "agentdebugx"
- version = "0.2.5"
+ version = "0.2.7"
  description = "Portable error analysis, tracing, and recovery framework for agentic AI systems. Import as `agentdebug`."
  authors = ["ULab @ UIUC <ulab@illinois.edu>"]
  license = "MIT"
@@ -15,6 +15,7 @@ from agentdebug.attribution import (
      Attributor,
      BinarySearchAttributor,
      Blame,
+     CounterfactualAttributor,
      HeuristicAttributor,
      StepByStepAttributor,
  )
@@ -63,6 +64,7 @@ __all__ = [
      'BusEvent',
      'BinarySearchAttributor',
      'CascadeFrame',
+     'CounterfactualAttributor',
      'CriticRecoverer',
      'DEFAULT_VERIFIERS',
      'Detector',
@@ -96,4 +98,4 @@ __all__ = [
      'get_failure_mode',
  ]
  
- __version__ = '0.2.5'
+ __version__ = '0.2.7'
@@ -566,8 +566,193 @@ def _EVENT_ELLIPSIS(count: int) -> _EllipsisEvent:
      return _EllipsisEvent(count=count)
  
  
+ _COUNTERFACTUAL_SYSTEM_PROMPT = """You are AgentDebugX-Attributor running an
+ LLM-simulated counterfactual replay (AgenTracer-style, arXiv:2509.03312).
+ 
+ You will be given the goal, the full trajectory, and ONE CANDIDATE STEP. Your
+ job is to estimate whether the agent would have succeeded if THAT step had
+ been done correctly — leaving everything else the same. This isolates the
+ step's causal contribution to the failure.
+ 
+ CRITICAL OUTPUT RULES (these maximize the chance your reply parses):
+ 1. Output ONLY a JSON object. No prose before/after. No markdown fences.
+ 2. Keep "rationale" to ONE short sentence (<= 200 chars).
+ 3. Do NOT include newlines inside string values.
+ 4. Emit the JSON object COMPLETE.
+ 
+ Schema:
+ {
+   "rescue_probability": <0..1>,
+   "confidence": <0..1>,
+   "rationale": "<short>",
+   "would_block_downstream_failures": true | false
+ }
+ 
+ Higher rescue_probability = correcting this step would more likely have
+ rescued the run; this step is therefore more responsible for the failure.
+ """
+ 
+ 
+ class CounterfactualAttributor:
+     """LLM-simulated counterfactual replay.
+ 
+     For each of K candidate steps (top-K from prior findings, or
+     error-bearing events, or the tail of the trajectory) ask the LLM:
+     "if this step had been correct, would the rest of the trajectory still
+     fail?" Steps with the highest rescue probability become the top blame
+     hypotheses. Costs O(K) LLM calls — comparable to AllAtOnce, with a
+     stronger causal claim per probe.
+ 
+     This is a *simulated* counterfactual, not a real re-rollout — strictly
+     weaker than AgenTracer's actual replay, but framework-independent and
+     runnable today against any LLM. When the underlying framework gains a
+     real replay surface (LangGraph checkpointer, OpenHands rewind), wire
+     that in as an alternative ``replay_fn`` and the algorithm carries over.
+     """
+ 
+     id = 'counterfactual'
+ 
+     def __init__(
+         self,
+         llm: LLMClient,
+         *,
+         max_candidates: int = 5,
+         max_tokens: int = 2048,
+         fallback: Optional[Attributor] = None,
+     ) -> None:
+         self.llm = llm
+         self.max_candidates = max_candidates
+         self.max_tokens = max_tokens
+         self.fallback: Attributor = fallback or HeuristicAttributor()
+ 
+     def attribute(
+         self,
+         trajectory: AgentTrajectory,
+         findings: List[FailureFinding],
+     ) -> AttributionResult:
+         candidates = self._pick_candidates(trajectory, findings)
+         if not candidates:
+             return self.fallback.attribute(trajectory, findings)
+         ranked: List[tuple[AgentEvent, Dict[str, Any]]] = []
+         for evt in candidates:
+             verdict = self._ask_counterfactual(trajectory, evt)
+             if verdict is None:
+                 continue
+             ranked.append((evt, verdict))
+         if not ranked:
+             return self.fallback.attribute(trajectory, findings)
+         # Sort by rescue_probability desc, tie-break by confidence.
+         ranked.sort(
+             key=lambda r: (
+                 -self._coerce_float(r[1].get('rescue_probability'), 0.0),
+                 -self._coerce_float(r[1].get('confidence'), 0.0),
+             )
+         )
+         hypotheses: List[Blame] = []
+         for evt, verdict in ranked:
+             hypotheses.append(Blame(
+                 span_id=evt.event_id,
+                 step_index=evt.step_index,
+                 agent_name=evt.agent_name,
+                 confidence=self._coerce_float(verdict.get('rescue_probability'), 0.0),
+                 rationale=(
+                     str(verdict.get('rationale') or 'no rationale')
+                     + f' [rescue_probability={verdict.get("rescue_probability")}]'
+                 ),
+                 evidence=[
+                     f'event_id={evt.event_id}',
+                     f'step={evt.step_index}',
+                 ],
+                 sources=[self.id],
+             ))
+         return AttributionResult(
+             method=self.id,
+             hypotheses=hypotheses,
+             raw={'candidates_probed': len(ranked)},
+         )
+ 
+     def _pick_candidates(
+         self,
+         trajectory: AgentTrajectory,
+         findings: List[FailureFinding],
+     ) -> List[AgentEvent]:
+         events_by_id = {e.event_id: e for e in trajectory.events}
+         candidates: List[AgentEvent] = []
+         seen: set[str] = set()
+         # 1. Prior findings (the judge already nominated suspects).
+         for f in findings:
+             evt = events_by_id.get(f.event_id) if f.event_id else None
+             if evt is not None and evt.event_id not in seen:
+                 candidates.append(evt)
+                 seen.add(evt.event_id)
+                 if len(candidates) >= self.max_candidates:
+                     return candidates
+         # 2. Events that recorded an error directly.
+         for evt in trajectory.events:
+             if evt.error and evt.event_id not in seen:
+                 candidates.append(evt)
+                 seen.add(evt.event_id)
+                 if len(candidates) >= self.max_candidates:
+                     return candidates
+         # 3. Fallback: tail of the trajectory (failure most often manifests there).
+         for evt in reversed(trajectory.events):
+             if evt.event_id not in seen:
+                 candidates.append(evt)
+                 seen.add(evt.event_id)
+                 if len(candidates) >= self.max_candidates:
+                     return candidates
+         return candidates
+ 
+     def _ask_counterfactual(
+         self, trajectory: AgentTrajectory, candidate: AgentEvent,
+     ) -> Optional[Dict[str, Any]]:
+         events_doc = '\n'.join(
+             f'event_id={e.event_id} step={e.step_index} agent={e.agent_name} '
+             f'type={getattr(e.event_type, "value", e.event_type)} '
+             f'output={str(e.output)[:200]} error={str(e.error)[:200]}'
+             for e in trajectory.events
+         )
+         user = (
+             f'GOAL: {trajectory.goal!r}\n'
+             f'FRAMEWORK: {trajectory.framework!r}\n\n'
+             f'FULL TRAJECTORY:\n{events_doc}\n\n'
+             f'CANDIDATE STEP TO COUNTERFACTUALLY CORRECT:\n'
+             f'  event_id={candidate.event_id}\n'
+             f'  step={candidate.step_index} agent={candidate.agent_name}\n'
+             f'  module={candidate.module}\n'
+             f'  input={str(candidate.input)[:300]}\n'
+             f'  output={str(candidate.output)[:300]}\n'
+             f'  error={str(candidate.error)[:300]}\n\n'
+             f'Question: if this step had been DONE CORRECTLY, what is the '
+             f'probability the run would have succeeded?'
+         )
+         try:
+             result = self.llm.complete(
+                 messages=[
+                     {'role': 'system', 'content': _COUNTERFACTUAL_SYSTEM_PROMPT},
+                     {'role': 'user', 'content': user},
+                 ],
+                 max_tokens=self.max_tokens,
+             )
+         except Exception as exc:  # pragma: no cover
+             LOG.warning('counterfactual probe failed at event=%s: %s',
+                         candidate.event_id, exc)
+             return None
+         parsed = extract_json_block(result.text)
+         if parsed is None:
+             return None
+         return cast(Dict[str, Any], parsed)
+ 
+     @staticmethod
+     def _coerce_float(value: Any, default: float) -> float:
+         try:
+             return float(value)
+         except (TypeError, ValueError):
+             return default
+ 
+ 
  __all__ = [
      'Attributor', 'Blame', 'AttributionResult',
      'HeuristicAttributor', 'AllAtOnceAttributor', 'StepByStepAttributor',
-     'BinarySearchAttributor',
+     'BinarySearchAttributor', 'CounterfactualAttributor',
  ]
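
A usage sketch for the new attributor, using only names visible in the hunk above; the `llm`, `trajectory`, and `findings` arguments are assumed to come from the package's LLM client, recorder, and a prior judge pass respectively:

```python
from agentdebug.attribution import CounterfactualAttributor

def rank_blame(llm, trajectory, findings):
    # One probe per candidate step, selected findings -> errors -> tail;
    # falls back to HeuristicAttributor when no candidates exist or the
    # LLM returns nothing parseable.
    attributor = CounterfactualAttributor(llm, max_candidates=5)
    result = attributor.attribute(trajectory, findings)
    for blame in result.hypotheses[:3]:
        print(blame.step_index, blame.agent_name, blame.confidence, blame.rationale)
    return result
```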
@@ -39,21 +39,18 @@ the allowed failure mode codes. Be conservative — only flag steps where the
  evidence in the event payload supports the label. If the trajectory contains no
  failure, return an empty findings list.
  
- Respond ONLY with a JSON object matching this schema (no prose, no markdown):
+ CRITICAL OUTPUT RULES (these maximize the chance your reply parses):
+ 1. Output ONLY a JSON object. No prose before/after. No markdown fences.
+ 2. Cap the findings array at {max_findings} entries — pick the most important.
+ 3. Keep each "evidence" entry under 120 characters; keep each "rationale" /
+    "summary" under 200 characters.
+ 4. Do NOT include newlines inside string values.
+ 5. Emit the JSON object COMPLETE — never stop mid-key or mid-array.
  
- {
-   "findings": [
-     {
-       "event_id": "<event_id from the input>",
-       "step_index": <int or null>,
-       "agent_name": "<agent_name from the input>",
-       "failure_mode_id": "<one of the allowed codes>",
-       "confidence": <float between 0 and 1>,
-       "evidence": ["<short quote or summary of the supporting payload>"]
-     }
-   ],
-   "summary": "<one-sentence diagnosis or 'No failure detected.'>"
- }
+ Schema (compact — fields in this order):
+ {{"findings":[{{"event_id":"...", "step_index":N|null, "agent_name":"...",
+ "failure_mode_id":"...", "confidence":0..1, "evidence":["..."]}}, ...],
+ "summary":"<short>"}}
  """
  
  
@@ -66,15 +63,21 @@ class LLMJudgeAnalyzer:
          *,
          max_events_per_call: int = 80,
          max_evidence_chars: int = 300,
-         max_tokens: int = 4096,
+         max_tokens: int = 8192,
+         max_findings_per_chunk: int = 6,
      ) -> None:
          self.llm = llm
          self.max_events_per_call = max_events_per_call
          self.max_evidence_chars = max_evidence_chars
          # NOTE: thinking models (Gemini 2.x/3.x, o-series) spend a substantial
          # fraction of `max_tokens` on reasoning tokens before any text is
-         # emitted. 4096 is the safe default; bump higher for long traces.
+         # emitted. 8192 is the safe default after the v0.2.6 Who&When debate-
+         # trace observation that 4096 truncated mid-array on long traces.
          self.max_tokens = max_tokens
+         # The system prompt asks the model to cap its findings array so the
+         # JSON closes even when many candidate failures exist; the prompt's
+         # {max_findings} placeholder is filled from this value.
+         self.max_findings_per_chunk = max_findings_per_chunk
  
      def analyze(self, trajectory: AgentTrajectory) -> DiagnosticReport:
          events = trajectory.events
@@ -121,8 +124,11 @@
          self, trajectory: AgentTrajectory, chunk: List[AgentEvent]
      ) -> tuple[List[FailureFinding], str]:
          user = self._render_user_prompt(trajectory, chunk)
+         # Inject the max_findings cap into the system prompt at format time
+         # so it can be tuned per call without forking the prompt.
+         system = _SYSTEM_PROMPT.format(max_findings=self.max_findings_per_chunk)
          messages = [
-             {'role': 'system', 'content': _SYSTEM_PROMPT},
+             {'role': 'system', 'content': system},
              {'role': 'user', 'content': user},
          ]
          result = self.llm.complete(messages=messages, max_tokens=self.max_tokens)
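
One detail worth noting about the prompt rewrite above: because the schema now lives in a `str.format` template, literal JSON braces must be doubled while `{max_findings}` stays single. A standalone illustration:

```python
# Doubled braces survive .format(); single-brace fields get substituted.
template = 'Cap the findings array at {max_findings}: {{"findings": []}}'
print(template.format(max_findings=6))
# -> Cap the findings array at 6: {"findings": []}
```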
@@ -211,8 +211,10 @@ _INDEX_HTML = """<!doctype html>
  .button {
    border:1px solid #373b3a; border-radius:8px; background:#1a1c1c; color:var(--fg);
    height:32px; padding:0 11px; font-size:12px; display:inline-flex;
-   align-items:center; gap:7px;
+   align-items:center; justify-content:center; gap:7px; cursor:pointer;
+   font-family:inherit; white-space:nowrap;
  }
+ .button:hover { border-color:#4b5250; background:#202323; }
  .button.primary { border-color:#356568; color:#d8fdff; background:#173033; }
  .content { padding:22px; max-width:1440px; margin:0 auto; }
  .hero {
@@ -330,6 +332,18 @@ _INDEX_HTML = """<!doctype html>
    .topbar { position:static; }
    .trace-legend, .trace-pair { grid-template-columns:1fr; }
  }
+ @media (max-width: 640px) {
+   .topbar { display:grid; grid-template-columns:1fr; align-items:start; padding:14px 16px; }
+   .top-actions { width:100%; display:grid; grid-template-columns:repeat(3,minmax(0,1fr)); }
+   .button { width:100%; min-width:0; padding:0 8px; overflow:hidden; text-overflow:ellipsis; }
+   .content { padding:22px 16px; }
+   h1 { font-size:27px; line-height:1.1; }
+   .stats { grid-template-columns:repeat(2,minmax(0,1fr)); }
+   .root-grid { grid-template-columns:1fr; }
+   .event { grid-template-columns:46px minmax(0,1fr); padding:10px; }
+   .step-index { width:38px; height:38px; }
+   .event-grid { grid-template-columns:1fr; }
+ }
  </style>
  </head>
  <body>
@@ -358,9 +372,9 @@ _INDEX_HTML = """<!doctype html>
      <div class="brand-sub" id="trace-count">Loading traces</div>
    </div>
    <div class="top-actions">
-     <span class="button">Analyze</span>
-     <span class="button">Export Bundle</span>
-     <span class="button primary">Open Error Hub</span>
+     <button class="button" id="analyze-btn" type="button">Analyze</button>
+     <button class="button" id="export-btn" type="button">Bundle</button>
+     <button class="button primary" id="hub-btn" type="button">Hub</button>
    </div>
  </div>
  <div class="content" id="detail">
@@ -370,6 +384,8 @@ _INDEX_HTML = """<!doctype html>
  </div>
  <script>
  const BOOTSTRAP = __BOOTSTRAP_JSON__;
+ let CURRENT_TRACE_ID = null;
+ let CURRENT_TRACE_DATA = null;
  async function api(path) {
    const r = await fetch(path);
    if (!r.ok) throw new Error('HTTP ' + r.status);
@@ -430,9 +446,11 @@ function renderTraceList(traceIds, selectedId) {
  async function selectTrace(tid, li) {
    document.querySelectorAll('.run').forEach(el => el.classList.remove('active'));
    li.classList.add('active');
+   CURRENT_TRACE_ID = tid;
    document.getElementById('detail').innerHTML = '<div class="empty">Loading trace...</div>';
    try {
      const data = await api('/api/v1/traces/' + encodeURIComponent(tid));
+     CURRENT_TRACE_DATA = data;
      renderTrace(data.trajectory, data.report);
    } catch (e) {
      document.getElementById('detail').innerHTML = '<div class="empty">' + escapeHtml(e) + '</div>';
@@ -485,7 +503,7 @@ function renderTrace(traj, report) {
    for (const f of findings) html += renderFinding(f);
    html += '</div></div></div>';
  
-   html += '<div class="panel"><div class="panel-head"><div class="panel-title">Use Case Flow</div><span class="chip cyan">Error Hub</span></div><div class="panel-body"><div class="flow">';
+   html += '<div class="panel" id="error-hub-flow"><div class="panel-head"><div class="panel-title">Use Case Flow</div><span class="chip cyan">Error Hub</span></div><div class="panel-body"><div class="flow">';
    html += flow(1, 'Capture trajectory from the running agent with the lightweight recorder or adapter.');
    html += flow(2, 'Diagnose the trace, localize the likely root cause, and generate recovery suggestions.');
    html += flow(3, 'Scrub secrets and PII, package a reproducible error bundle, and publish to Git or Hugging Face.');
@@ -566,6 +584,32 @@ function renderEvent(ev, isRoot, finding) {
    html += '</div></div></div>';
    return html;
  }
+ function downloadJson(filename, value) {
+   const blob = new Blob([JSON.stringify(value, null, 2)], {type: 'application/json'});
+   const url = URL.createObjectURL(blob);
+   const a = document.createElement('a');
+   a.href = url;
+   a.download = filename;
+   document.body.appendChild(a);
+   a.click();
+   a.remove();
+   URL.revokeObjectURL(url);
+ }
+ function bindTopActions() {
+   document.getElementById('analyze-btn').onclick = () => {
+     const active = document.querySelector('.run.active');
+     if (CURRENT_TRACE_ID && active) selectTrace(CURRENT_TRACE_ID, active);
+   };
+   document.getElementById('export-btn').onclick = () => {
+     if (!CURRENT_TRACE_DATA) return;
+     const name = (CURRENT_TRACE_ID || 'trace') + '.agentdebugx.report.json';
+     downloadJson(name, CURRENT_TRACE_DATA);
+   };
+   document.getElementById('hub-btn').onclick = () => {
+     const flow = document.getElementById('error-hub-flow');
+     if (flow) flow.scrollIntoView({behavior: 'smooth', block: 'start'});
+   };
+ }
  function field(label, value, isError) {
    return '<div class="field ' + (isError ? 'error' : '') + '"><div class="field-label">' + escapeHtml(label) + '</div><div class="field-value">' + escapeHtml(value || '-') + '</div></div>';
  }
@@ -585,11 +629,14 @@ if (BOOTSTRAP && BOOTSTRAP.traces) {
    const selected = BOOTSTRAP.selected ? BOOTSTRAP.selected.trajectory.trace_id : null;
    renderTraceList(BOOTSTRAP.traces, selected);
    if (BOOTSTRAP.selected) {
+     CURRENT_TRACE_ID = selected;
+     CURRENT_TRACE_DATA = BOOTSTRAP.selected;
      renderTrace(BOOTSTRAP.selected.trajectory, BOOTSTRAP.selected.report);
    } else {
      document.getElementById('detail').innerHTML = '<div class="empty">No traces in store.</div>';
    }
  }
+ bindTopActions();
  loadTraceList(!(BOOTSTRAP && BOOTSTRAP.selected));
  </script>
  </body>