PyPI - agent-failure-debugger - Versions diffs - 0.2.0__tar.gz → 0.2.1__tar.gz - Mend

agent-failure-debugger 0.2.0__tar.gz → 0.2.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (36)

{agent_failure_debugger-0.2.0 → agent_failure_debugger-0.2.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: agent-failure-debugger
-Version: 0.2.0
+Version: 0.2.1
 Summary: Diagnose why your LLM agent failed. Deterministic causal analysis with fix generation.
 License: MIT
 Project-URL: Homepage, https://github.com/kiyoshisasano/agent-failure-debugger
@@ -41,15 +41,23 @@ print(result["explanation"]["context_summary"])
 ## Use the Debugger
-Use this when:
-- An agent gives confident answers without data
-- Tools return empty results or errors
-- Behavior changes between runs and you need to understand why
+Call `diagnose()` after every agent run. It returns execution quality (healthy, degraded, or failed), root cause analysis when failures are detected, and fix proposals.
-Choose your entry point:
+```python
+result = diagnose(raw_log, adapter="langchain")
+status = result["summary"]["execution_quality"]["status"]
+# In CI/CD or automated pipelines:
+assert status != "failed", f"Agent execution failed: {result['summary']['root_cause']}"
+```
+When the agent runs normally, you get `healthy` with confidence scores and grounding state. When something goes wrong, you get the root cause, causal path, and a fix proposal — without changing how you call the tool.
-- **During development** — use Atlas [`watch()`](https://github.com/kiyoshisasano/llm-failure-atlas) to observe live executions and diagnose behavior as it happens
-- **After failures** — use `diagnose()` to analyze a raw log or exported trace after the fact
+**Entry points:**
+- **Every run** — call `diagnose()` on the raw log or trace after each execution
+- **Live observation** — use Atlas [`watch()`](https://github.com/kiyoshisasano/llm-failure-atlas) to capture telemetry and diagnose during execution
+- **Multi-run comparison** — use `compare_runs()` and `diff_runs()` to track stability across runs
 Atlas detects failures; the debugger explains why they happened and proposes fixes. You can use Atlas alone for detection, but diagnosis requires the debugger.
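
The "call it after every run" pattern above can be turned into a small CI gate. The sketch below is illustrative only: `gate_exit_code` and its `fail_on_degraded` flag are hypothetical names, and the dictionaries are hand-built stand-ins that mirror only the `summary` keys quoted in this diff (`execution_quality.status`, `root_cause`, `failure_count`), not real `diagnose()` output.

```python
def gate_exit_code(result, fail_on_degraded=False):
    """Map a diagnose()-style result dict to a process exit code.

    Hypothetical helper: it reads only the keys shown in the README excerpt.
    """
    status = result["summary"]["execution_quality"]["status"]
    if status == "failed":
        return 1
    if status == "degraded" and fail_on_degraded:
        return 1
    return 0


# Hand-built stand-ins for real results (assumed shape, not library output).
healthy = {"summary": {"execution_quality": {"status": "healthy"},
                       "failure_count": 0}}
degraded = {"summary": {"execution_quality": {"status": "degraded"},
                        "root_cause": "incorrect_output"}}
failed = {"summary": {"execution_quality": {"status": "failed"},
                      "root_cause": "incorrect_output"}}

for name, res in [("healthy", healthy), ("degraded", degraded), ("failed", failed)]:
    print(name, gate_exit_code(res), gate_exit_code(res, fail_on_degraded=True))
```

Treating `degraded` as a pass by default keeps the gate aligned with the `assert status != "failed"` line in the README; a stricter pipeline could set the flag.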
@@ -148,7 +156,29 @@ For a copy-paste example without an API key, see [Reproducible Examples](#reprod
 pip install agent-failure-debugger
 ```
-### From Python (copy-paste-run)
+### Healthy run
+```python
+from agent_failure_debugger import diagnose
+raw_log = {
+    "inputs": {"query": "What was Q3 revenue?"},
+    "outputs": {"response": "Q3 revenue was $4.2M based on the latest earnings report."},
+    "steps": [
+        {"type": "tool", "name": "search_earnings", "inputs": {"quarter": "Q3"},
+         "outputs": {"revenue": "$4.2M", "source": "10-Q filing"}, "error": None},
+        {"type": "llm", "outputs": {"text": "Q3 revenue was $4.2M based on the latest earnings report."}}
+    ]
+}
+result = diagnose(raw_log, adapter="langchain")
+print(result["summary"]["execution_quality"]["status"])  # healthy
+print(result["summary"]["failure_count"])                 # 0
+```
+The tool returns a result on every run. When the agent is healthy, you get confirmation — not silence.
+### Degraded run
 ```python
 from agent_failure_debugger import diagnose
@@ -170,9 +200,12 @@ raw_log = {
 }
 result = diagnose(raw_log, adapter="langchain")
-print(result["summary"]["root_cause"])
+print(result["summary"]["root_cause"])                    # incorrect_output
+print(result["summary"]["execution_quality"]["status"])   # degraded
 ```
+Same function, same interface. The difference is in the input, not in how you call the tool.
 ### From matcher output (advanced)
 If you already have matcher output (e.g., from a custom integration):
@@ -184,7 +217,7 @@ result = run_pipeline(matcher_output, use_learning=True)
 print(result["summary"])
 ```
-See [Quick Start Guide](docs/quickstart.md) for more usage patterns including `watch()` and direct telemetry.
+See [Quick Start Guide](docs/quickstart.md) for more usage patterns including `watch()`, multi-run analysis, and direct telemetry.
 ## Common Mistakes
@@ -210,7 +243,7 @@ See [Limitations & FAQ](docs/limitations_faq.md) for details.
 ### Execution quality
-Every `diagnose()` and `run_pipeline()` result now includes execution quality assessment in the summary:
+Every `diagnose()` and `run_pipeline()` result includes execution quality assessment — this is what makes the tool useful on every run, not just when failures occur.
 ```python
 eq = result["summary"]["execution_quality"]
@@ -221,9 +254,11 @@ print(eq["summary"])             # one-line human-readable assessment
 ```
 - **healthy** — no significant issues detected
-- **degraded** — output may have been produced but quality indicators are weak (low alignment, weak grounding, unmodeled failures)
+- **degraded** — output may have been produced but quality indicators are weak (low alignment, weak grounding, redundant tool results, unmodeled failures)
 - **failed** — execution did not produce usable output (silent exit or error)
+Degradation indicators include: low alignment score (< 0.5), tools called but no usable data returned, high expansion ratio without uncertainty disclosure (> 3.0), low tool result diversity (< 0.5 across 2+ calls — tools returned identical results), low observation coverage, and unmodeled or conflicting failure signals.
 Execution quality uses existing telemetry and diagnosis results. No new matcher patterns are added.
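
To make the numeric thresholds listed above concrete, here is a minimal sketch that applies them to a flat dictionary of signals. Every field name (`alignment_score`, `tool_calls`, `usable_tool_data`, `expansion_ratio`, `uncertainty_disclosed`, `tool_result_diversity`) is an assumption made for illustration; the package's actual telemetry keys and gate logic are not shown in this diff. Indicators without a stated threshold (observation coverage, unmodeled or conflicting signals) are left out.

```python
def degradation_reasons(signals):
    """Return the README's thresholded indicators that `signals` trips.

    `signals` is a hypothetical flat dict; missing keys count as "no issue".
    """
    reasons = []
    tool_calls = signals.get("tool_calls", 0)
    if signals.get("alignment_score", 1.0) < 0.5:
        reasons.append("low alignment score")
    if tool_calls > 0 and not signals.get("usable_tool_data", True):
        reasons.append("tools called but no usable data returned")
    if (signals.get("expansion_ratio", 0.0) > 3.0
            and not signals.get("uncertainty_disclosed", False)):
        reasons.append("high expansion ratio without uncertainty disclosure")
    if tool_calls >= 2 and signals.get("tool_result_diversity", 1.0) < 0.5:
        reasons.append("low tool result diversity")
    return reasons


example = {
    "alignment_score": 0.3,
    "tool_calls": 3,
    "usable_tool_data": False,
    "expansion_ratio": 1.2,
    "tool_result_diversity": 1 / 3,
}
print(degradation_reasons(example))
```
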
 ### Multi-run analysis
@@ -245,6 +280,8 @@ print(diff["causal_path_diff"])                  # where paths diverge
 `compare_runs()` measures stability — whether the same task produces consistent diagnoses across runs. `diff_runs()` identifies divergence — what structural differences separate successful runs from failed ones.
+For runnable examples with expected output, see [examples/multi_run_stability](examples/multi_run_stability/) (compare_runs → diff_runs workflow) and [examples/termination_divergence](examples/termination_divergence/) (same root cause, different exit modes).
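
One simple way to think about "stability" is the share of runs that agree on the most common root cause. The sketch below illustrates that idea only; it is not the `compare_runs()` implementation, `root_cause_agreement` is a hypothetical helper, and `other_cause` is a made-up label used purely as a placeholder.

```python
from collections import Counter


def root_cause_agreement(results):
    """Return (most common root cause, fraction of runs that share it)."""
    if not results:
        raise ValueError("need at least one run")
    causes = [r["summary"].get("root_cause") for r in results]
    cause, count = Counter(causes).most_common(1)[0]
    return cause, count / len(causes)


# Hand-built stand-ins for per-run results (assumed shape).
runs = [
    {"summary": {"root_cause": "incorrect_output"}},
    {"summary": {"root_cause": "incorrect_output"}},
    {"summary": {"root_cause": "other_cause"}},  # placeholder label
    {"summary": {"root_cause": "incorrect_output"}},
]
print(root_cause_agreement(runs))
```
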
 ### Enhanced explanation
 ```python
@@ -416,6 +453,13 @@ matcher_output.json
 | `reliability.py` | Cross-run stability and differential analysis |
 | `execution_quality.py` | Single-run execution behavior assessment |
+### Examples
+| Directory | Demonstrates |
+|---|---|
+| `examples/termination_divergence/` | `diff_runs()`: same root cause, different termination modes |
+| `examples/multi_run_stability/` | `compare_runs()` → `diff_runs()`: two-step stability and divergence workflow |
 ---
 ## Graph Source
@@ -463,7 +507,56 @@ All scoring weights and gate thresholds are in `config.py`.
 ## Reproducible Examples
-**Try without an API key** (copy-paste-run):
+**Healthy run** (copy-paste-run, no API key needed):
+```bash
+pip install agent-failure-debugger
+```
+```python
+from agent_failure_debugger import diagnose
+raw_log = {
+    "inputs": {"query": "What was Q3 revenue?"},
+    "outputs": {"response": "Q3 revenue was $4.2M based on the latest earnings report."},
+    "steps": [
+        {"type": "tool", "name": "search_earnings", "inputs": {"quarter": "Q3"},
+         "outputs": {"revenue": "$4.2M", "source": "10-Q filing"}, "error": None},
+        {"type": "llm", "outputs": {"text": "Q3 revenue was $4.2M based on the latest earnings report."}}
+    ]
+}
+result = diagnose(raw_log, adapter="langchain")
+print(result["summary"]["execution_quality"]["status"])   # healthy
+print(result["summary"]["failure_count"])                  # 0
+```
+**Degraded run** (copy-paste-run):
+```python
+raw_log = {
+    "inputs": {"query": "Change my flight to tomorrow morning"},
+    "outputs": {"response": "I've found several hotels near the airport for you."},
+    "steps": [
+        {"type": "llm", "outputs": {"text": "Let me check available flights."}},
+        {"type": "tool", "name": "search_flights", "inputs": {"date": "2025-03-20"},
+         "outputs": {"flights": []}, "error": None},
+        {"type": "tool", "name": "search_flights", "inputs": {"date": "2025-03-20"},
+         "outputs": {"flights": []}, "error": None},
+        {"type": "tool", "name": "search_flights", "inputs": {"date": "2025-03-20"},
+         "outputs": {"flights": []}, "error": None},
+        {"type": "llm", "outputs": {"text": "I've found several hotels near the airport."}}
+    ],
+    "feedback": {"user_correction": "I asked about flights, not hotels."}
+}
+result = diagnose(raw_log, adapter="langchain")
+print(result["summary"]["root_cause"])
+print(result["summary"]["execution_quality"]["status"])
+# → root cause + execution quality (degraded)
+```
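
The three identical `search_flights` calls in the degraded log are the kind of input the "low tool result diversity" indicator describes. As a rough sketch, diversity can be taken as the number of distinct tool outputs divided by the number of tool calls. This formula is an assumption made for illustration; the diff does not show how the package computes the value.

```python
import json


def tool_result_diversity(raw_log):
    """Distinct tool outputs / tool calls, or None with fewer than 2 calls."""
    outputs = [
        json.dumps(step.get("outputs"), sort_keys=True)  # canonical form for comparison
        for step in raw_log["steps"]
        if step.get("type") == "tool"
    ]
    if len(outputs) < 2:
        return None
    return len(set(outputs)) / len(outputs)


# Steps copied from the degraded example above (other keys omitted).
degraded_log = {
    "steps": [
        {"type": "llm", "outputs": {"text": "Let me check available flights."}},
        {"type": "tool", "name": "search_flights", "inputs": {"date": "2025-03-20"},
         "outputs": {"flights": []}, "error": None},
        {"type": "tool", "name": "search_flights", "inputs": {"date": "2025-03-20"},
         "outputs": {"flights": []}, "error": None},
        {"type": "tool", "name": "search_flights", "inputs": {"date": "2025-03-20"},
         "outputs": {"flights": []}, "error": None},
        {"type": "llm", "outputs": {"text": "I've found several hotels near the airport."}},
    ]
}
print(tool_result_diversity(degraded_log))
```

Under this definition the log scores one distinct result across three calls, which falls below the 0.5 threshold quoted in the execution quality section.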
+**With a live agent** (requires `langchain-core` and `langgraph`):
 ```bash
 pip install agent-failure-debugger[langchain] langgraph
@@ -493,14 +586,23 @@ graph = watch(workflow.compile(), auto_diagnose=True)
 graph.invoke({"messages": [HumanMessage(content="What was Q3 revenue?")]})
 ```
+Note: `watch()` with `FakeListLLM` demonstrates the callback integration but may not trigger failure patterns — the fake LLM produces no tool calls or user corrections. For failure detection examples, use `diagnose()` with the raw log above.
 **Regression test examples:**
-10 examples in [llm-failure-atlas](https://github.com/kiyoshisasano/llm-failure-atlas) under `examples/`. Each contains `log.json`, `matcher_output.json`, and `expected_debugger_output.json`.
+12 examples in [llm-failure-atlas](https://github.com/kiyoshisasano/llm-failure-atlas) under `examples/` (10 agent + 2 non-LLM). Each contains `log.json`, `matcher_output.json`, and `expected_debugger_output.json`.
 ```bash
 python -m agent_failure_debugger.main matcher_output.json
 ```
+**Multi-run analysis examples:**
+2 examples in this repository under `examples/`. Each contains input fixtures, a runnable script, and `expected_output.json`:
+- [termination_divergence](examples/termination_divergence/) — `diff_runs()` comparing silent exit vs error exit
+- [multi_run_stability](examples/multi_run_stability/) — `compare_runs()` → `diff_runs()` two-step workflow
 ---
 ## Internals
