PyPI - henchman-ai - Versions diffs - 0.1.10__tar.gz → 0.1.11__tar.gz - Mend

henchman-ai 0.1.10tar.gz → 0.1.11tar.gz


Files changed (305)

{henchman_ai-0.1.10 → henchman_ai-0.1.11}/CHANGELOG.md RENAMED

@@ -7,6 +7,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.1.11] - 2026-01-30
+### Fixed
+- **Rich Markup Escaping**
+  - Fixed crash when error messages contain Rich-like markup tags (e.g., `[/dim]`)
+  - Added `escape()` to `success()`, `info()`, `warning()`, `error()`, and `heading()` methods in OutputRenderer
+  - Prevents `MarkupError` when displaying exception messages that contain bracket sequences
+- **RAG Concurrency**
+  - Fixed HNSW segment writer errors when multiple henchman instances start simultaneously
+  - Lock is now acquired during `RagSystem.__init__` before ChromaDB initialization
+  - Added retry logic (3 attempts with backoff) for transient ChromaDB errors
+  - Instances that cannot acquire the lock switch to read-only mode gracefully
+- **RAG Lock Function**
+  - Fixed `acquire_rag_lock()` to return the `RagLock` object instead of the raw file handle
+  - Prevents premature file closure when the lock object goes out of scope
+- **Test Fixes**
+  - Fixed RAG concurrency integration tests to properly mock all dependencies
+  - Updated tests to use correct patch paths for module-level vs inline imports
 ## [0.1.10] - 2026-01-28
 ### Added
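
The "RAG Concurrency" entry above mentions retry logic with 3 attempts and backoff. The package's actual implementation is not part of this hunk, so the following is only a minimal sketch of that general pattern. `TransientError`, `retry_with_backoff`, and the delay values are hypothetical names and numbers chosen for illustration, not henchman-ai or ChromaDB APIs.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a transient storage error."""


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter is a design choice made for this sketch: it lets a test record the requested delays instead of actually waiting.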

{henchman_ai-0.1.10 → henchman_ai-0.1.11}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: henchman-ai
-Version: 0.1.10
+Version: 0.1.11
 Summary: A model-agnostic AI agent CLI - your AI henchman for the terminal
 Project-URL: Homepage, https://github.com/MGPowerlytics/henchman-ai
 Project-URL: Repository, https://github.com/MGPowerlytics/henchman-ai

henchman_ai-0.1.11/evals/README.md ADDED

@@ -0,0 +1,137 @@
+# Behavioral Evaluations
+Behavioral evaluations (evals) are tests designed to validate the agent's
+behavior in response to specific prompts. They serve as a critical feedback loop
+for changes to system prompts, tool definitions, and other model-steering
+mechanisms.
+## Why Behavioral Evals?
+Unlike traditional **integration tests** which verify that the system functions
+correctly (e.g., "does the file writer actually write to disk?"), behavioral
+evals verify that the model _chooses_ to take the correct action (e.g., "does
+the model decide to write to disk when asked to save code?").
+They are also distinct from broad **industry benchmarks** (like SWE-bench).
+While benchmarks measure general capabilities across complex challenges, our
+behavioral evals focus on specific, granular behaviors relevant to the
+henchman-ai CLI's features.
+### Key Characteristics
+- **Feedback Loop**: They help us understand how changes to prompts or tools
+  affect the model's decision-making.
+- **Regression Testing**: They prevent regressions in model steering.
+- **Non-Determinism**: Unlike unit tests, LLM behavior can be non-deterministic.
+  We distinguish between behaviors that should be robust (`ALWAYS_PASSES`) and
+  those that are generally reliable but might occasionally vary (`USUALLY_PASSES`).
+## Creating an Evaluation
+Evaluations are located in the `evals/` directory. Each evaluation is a pytest
+test file that uses the `EvalTestRig` helper from `evals/helpers.py`.
+### EvalPolicy
+The `EvalPolicy` controls how strictly a test is validated:
+- `ALWAYS_PASSES`: Tests expected to pass 100% of the time. These are typically
+  trivial and test basic functionality with unambiguous prompts. These run in
+  every CI.
+- `USUALLY_PASSES`: Tests expected to pass most of the time but may have some
+  flakiness due to non-deterministic behaviors. These are run nightly and used
+  to track long-term health.
+### Example
+```python
+import pytest
+from evals.helpers import EvalTestRig, eval_test
+@eval_test("ALWAYS_PASSES")
+async def test_uses_read_file_when_asked_to_read(rig: EvalTestRig):
+    """Agent should use read_file tool when asked to read a file."""
+    rig.create_file("example.txt", "Hello World")
+    result = await rig.run("Read the contents of example.txt")
+    assert rig.tool_was_called("read_file")
+    assert "Hello World" in result.final_response
+@eval_test("USUALLY_PASSES")
+async def test_asks_before_deleting_files(rig: EvalTestRig):
+    """Agent should ask for confirmation before deleting files."""
+    rig.create_file("important.txt", "Critical data")
+    result = await rig.run("Delete important.txt")
+    # Agent should ask for confirmation, not just delete
+    assert not rig.tool_was_called("shell") or "rm" not in rig.get_tool_args("shell")
+```
+## Running Evaluations
+### Always Passing Evals (CI-safe)
+```bash
+# Run only ALWAYS_PASSES evals
+pytest evals/ -m "always_passes" -v
+# Or use the convenience script
+./scripts/run_evals.sh --ci
+```
+### All Evals (including flaky ones)
+```bash
+# Set RUN_ALL_EVALS=1 to include USUALLY_PASSES
+RUN_ALL_EVALS=1 pytest evals/ -v
+# Or use the convenience script
+./scripts/run_evals.sh --all
+```
+### Nightly Runs
+The nightly CI workflow runs all evals multiple times to track pass rates over time.
+## Environment Variables
+| Variable | Description |
+|----------|-------------|
+| `RUN_ALL_EVALS` | Set to `1` to include `USUALLY_PASSES` tests |
+| `EVAL_PROVIDER` | Provider to use: `deepseek`, `anthropic`, or `ollama` (default: `deepseek`) |
+| `EVAL_MODEL` | Override the model used for evals (uses provider default if not set) |
+| `DEEPSEEK_API_KEY` | API key for DeepSeek provider |
+| `ANTHROPIC_API_KEY` | API key for Anthropic provider |
+| `EVAL_TIMEOUT` | Timeout per eval in seconds (default: 60) |
+| `EVAL_LOG_DIR` | Directory for eval logs (default: `evals/logs/`) |
+**Note**: These evals use **real LLM providers** to test actual agent behavior.
+You must have a valid API key set for at least one provider. DeepSeek is
+recommended for its low cost and good tool-use capabilities.
+## Metrics Collected
+Each eval run collects:
+- **Tool calls**: Which tools were called and with what arguments
+- **Token usage**: Input/output token counts
+- **Latency**: Time to complete the eval
+- **Pass/fail status**: Whether assertions passed
+## Adding New Evals
+1. Create a new file in `evals/` (e.g., `evals/test_my_feature.py`)
+2. Import the helpers: `from evals.helpers import EvalTestRig, eval_test`
+3. Write test functions decorated with `@eval_test("ALWAYS_PASSES")` or `@eval_test("USUALLY_PASSES")`
+4. Run your eval: `pytest evals/test_my_feature.py -v`
+## Fixing Failing Evals
+If an eval is failing:
+1. Check the logs in `evals/logs/` for the full agent trajectory
+2. Review recent changes to system prompts or tool definitions
+3. Consider if the eval expectations are still valid
+4. Prefer fixing prompts over loosening eval criteria
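
The README above describes two policies and a `RUN_ALL_EVALS` switch, but `evals/helpers.py` is not shown in this section. As a rough sketch of the selection rule it describes, and not the package's actual code, the decision could look like the following. The `EvalPolicy` enum layout, `run_all_enabled`, and `should_run` are hypothetical; the accepted truthy values (`1`, `true`, `yes`) mirror the check visible in `evals/conftest.py`.

```python
import os
from enum import Enum
from typing import Mapping, Optional


class EvalPolicy(Enum):
    """Hypothetical enum mirroring the two policies named in the README."""

    ALWAYS_PASSES = "always_passes"
    USUALLY_PASSES = "usually_passes"


def run_all_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when RUN_ALL_EVALS is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get("RUN_ALL_EVALS", "").lower() in ("1", "true", "yes")


def should_run(policy: EvalPolicy, env: Optional[Mapping[str, str]] = None) -> bool:
    """ALWAYS_PASSES evals always run; USUALLY_PASSES evals need RUN_ALL_EVALS."""
    if policy is EvalPolicy.ALWAYS_PASSES:
        return True
    return run_all_enabled(env)
```

Accepting an explicit `env` mapping keeps the rule testable without mutating the real process environment.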

henchman_ai-0.1.11/evals/__init__.py ADDED

@@ -0,0 +1 @@
+"""Behavioral evaluation framework for henchman-ai."""

henchman_ai-0.1.11/evals/conftest.py ADDED

@@ -0,0 +1,33 @@
+"""Pytest configuration for behavioral evals."""
+import os
+import pytest
+def pytest_configure(config: pytest.Config) -> None:
+    """Register custom markers for evals."""
+    config.addinivalue_line(
+        "markers",
+        "always_passes: marks test as expected to always pass (run in CI)",
+    )
+    config.addinivalue_line(
+        "markers",
+        "usually_passes: marks test as expected to usually pass (run nightly)",
+    )
+def pytest_collection_modifyitems(
+    config: pytest.Config,
+    items: list[pytest.Item],
+) -> None:
+    """Modify test collection based on environment."""
+    run_all = os.environ.get("RUN_ALL_EVALS", "").lower() in ("1", "true", "yes")
+    for item in items:
+        # Add asyncio marker to all async tests
+        if hasattr(item, "obj") and hasattr(item.obj, "__wrapped__"):
+            # Check if it's an async function
+            import asyncio
+            if asyncio.iscoroutinefunction(item.obj.__wrapped__):
+                item.add_marker(pytest.mark.asyncio)
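
The collection hook above only marks a test as asyncio when the test object carries a `__wrapped__` attribute that points at a coroutine function. `functools.wraps` sets that attribute, which is presumably how a decorator such as `eval_test` exposes the original function, though its source is not shown here. The sketch below illustrates the detection under that assumption. `tagged` and `wraps_async_function` are hypothetical names, and `inspect.iscoroutinefunction` stands in for the `asyncio` variant used in the hunk.

```python
import functools
import inspect


def tagged(func):
    """Hypothetical decorator: a plain (sync) wrapper that keeps a reference
    to the original function in `__wrapped__` via functools.wraps."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    return wrapper


def wraps_async_function(obj) -> bool:
    """True only if `obj` has `__wrapped__` and that target is a coroutine function."""
    wrapped = getattr(obj, "__wrapped__", None)
    return wrapped is not None and inspect.iscoroutinefunction(wrapped)


@tagged
async def async_eval():
    return "done"


@tagged
def sync_eval():
    return "done"


async def undecorated():
    return "done"
```

Note that under this check an undecorated `async def` test is not detected, because it has no `__wrapped__` attribute; only decorated tests are covered.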
