PyPI - nighthawk-python - Versions diffs - 0.6.1__tar.gz → 0.8.0__tar.gz - Mend

nighthawk-python 0.6.1tar.gz → 0.8.0tar.gz


Files changed (167)

nighthawk_python-0.8.0/.claude/rules/src.md ADDED

@@ -0,0 +1,17 @@
+---
+paths:
+  - "src/**/*.py"
+---
+# Coding standards
+- Prefer concrete code. Add a new abstraction only when the same change uses it from non-test code.
+- Default to module-private names. Export via `__all__` only for stable non-test consumers.
+- If a change expands or changes public API, update or confirm `tests/public/`.
+- Prefer async implementations in `runtime/` and `backends/`; keep sync bridges only at compatibility boundaries.
+- Reuse the existing `NighthawkError` hierarchy before adding a new exception class.
+- Prefer Pydantic (`BaseModel`, `TypeAdapter`) and Pydantic AI primitives over custom validation, parsing, schema, or agent/tool plumbing.
+- Use `opentelemetry.trace` spans at run/scope/step/tool boundaries and `logging.getLogger("nighthawk")` for diagnostics. Do not import `logfire` in `src/`.
+- Use PEP 695 `type` statements for new type aliases.
+- Ask before adding a new `src/` subpackage for a single module.
+- Follow `CONTRIBUTING.md` § Docstring Guide for docstring scope and format.

nighthawk_python-0.8.0/.claude/rules/tests.md ADDED

@@ -0,0 +1,13 @@
+---
+paths:
+  - "tests/**"
+---
+# Testing (pytest)
+- Prefer deterministic pytest coverage by default. Use helpers from `nighthawk.testing` before reaching for live LLM calls.
+- Use `tests/execution/stub_executor.py` only for envelope and runtime parser checks; prefer `nighthawk.testing` for normal Natural-function tests.
+- Keep live-LLM tests in `tests/integration/` and behind the documented environment gates.
+- For Python behavior changes, add or update pytest coverage in the same change and run `uv run pytest -q`.
+- If a change affects public API or README examples, confirm `tests/public/`. If it affects docs examples or anchors, confirm `tests/docs/`.
+- If a change affects prompt rendering, system prompt text, suffix generation, or tool exposure behavior, follow `.claude/rules/promptfoo.md`.

{nighthawk_python-0.6.1 → nighthawk_python-0.8.0}/.github/workflows/publish.yml RENAMED

@@ -3,7 +3,7 @@ name: Publish to PyPI
 on:
   push:
     tags:
-      - "v[0-9]+.[0-9]+.[0-9]+"
+      - "v[0-9]+.[0-9]+.[0-9]+*"
 permissions:
   contents: read

{nighthawk_python-0.6.1 → nighthawk_python-0.8.0}/CHANGELOG.md RENAMED

@@ -7,6 +7,31 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.8.0]
+### Added
+- Unit tests covering prompt token-budget injection: system prompt resolves `$tool_result_max_tokens`, and custom user prompt templates can resolve the same placeholder.
+### Changed
+- Default step system prompt now states that tool result `value` is a preview and includes the injected max-token limit.
+- User prompt template rendering now uses `Template.safe_substitute`, aligned with system prompt injection behavior and compatible with optional `$tool_result_max_tokens` placeholders.
+## [0.7.0]
+### Added
+- `nighthawk.UsageMeter`: run-scoped, thread-safe LLM token usage accumulator. Created automatically by `nh.run()` and readable via `nh.get_current_usage_meter()`.
+- `nighthawk.resilience.budget` transformer: composable token and cost budget enforcement with pre-call and post-call checks. Parameters: `tokens`, `tokens_per_call`, `cost`, `cost_per_call`, `cost_function`, `estimate_usage`.
+  - `BudgetExceededError`, `BudgetLimitKind`, `CostFunction` supporting types.
+  - OpenTelemetry span event `nighthawk.resilience.budget.exceeded` and `nighthawk.resilience` logger warning on budget violation.
+- Resilience OpenTelemetry events for retry/timeout/circuit paths: `nighthawk.resilience.retry.attempt`, `nighthawk.resilience.retry.exhausted`, `nighthawk.resilience.timeout.triggered`, `nighthawk.resilience.circuit.opened`.
+### Changed
+- Project status promoted from Alpha to Beta.
+- Updated one-line description.
+- Removed "experimental" language from README and documentation.
+- Updated PyPI keywords for improved discoverability.
+- Generalized `StepContext` implicit references to value-based mappings (`implicit_reference_name_to_value`), and added additive scope injection via `nh.scope(implicit_references={...})` across nested scopes.
 ## [0.6.1]
 ### Added
@@ -102,7 +127,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Step executor abstraction and provider integration foundation.
 - Core documentation and project scaffolding.
-[Unreleased]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.6.1...HEAD
+[Unreleased]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.8.0...HEAD
+[0.8.0]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.7.0...v0.8.0
+[0.7.0]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.6.1...v0.7.0
 [0.6.1]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.6.0...v0.6.1
 [0.6.0]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.5.0...v0.6.0
 [0.5.0]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.4.0...v0.5.0
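Note on the 0.8.0 entry above: user prompt template rendering now goes through `Template.safe_substitute`. The sketch below only illustrates what that standard-library call does. The template strings and the value 2000 are invented for illustration and are not Nighthawk's actual prompts; only the placeholder name `$tool_result_max_tokens` comes from the changelog.

```python
from string import Template

# Hypothetical prompt templates; only the placeholder name comes from the changelog.
with_placeholder = Template("Tool result previews are capped at $tool_result_max_tokens tokens.")
without_match = Template("Summarize $document_title for the user.")

values = {"tool_result_max_tokens": 2000}

# A placeholder present in the mapping is replaced.
rendered = with_placeholder.safe_substitute(values)

# A placeholder missing from the mapping is left in place instead of raising.
untouched = without_match.safe_substitute(values)

# The strict variant raises KeyError for the same input.
try:
    without_match.substitute(values)
    strict_failed = False
except KeyError:
    strict_failed = True
```

This is why a custom template can treat the placeholder as optional: templates that use it get the injected limit, and templates with other `$name` tokens pass through unchanged.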

{nighthawk_python-0.6.1 → nighthawk_python-0.8.0}/CONTRIBUTING.md RENAMED

@@ -27,15 +27,13 @@ uv run python
 # Format code
 uv run ruff format .
-# Lint (check / auto-fix)
-uv run ruff check .
+# Lint (auto-fix)
 uv run ruff check --fix .
 # Type check
 uv run pyright
 # Run tests
-uv run pytest          # full suite
 uv run pytest -q       # quiet output
 ```

{nighthawk_python-0.6.1 → nighthawk_python-0.8.0}/PKG-INFO RENAMED

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: nighthawk-python
-Version: 0.6.1
-Summary: An experimental Python library that embeds Natural blocks inside Python functions and executes them using an LLM.
+Version: 0.8.0
+Summary: A Python library where Python controls flow and LLMs or coding agents reason within constrained Natural blocks.
 Project-URL: Repository, https://github.com/kurusugawa-computer/nighthawk-python
 Project-URL: Documentation, https://kurusugawa-computer.github.io/nighthawk-python/
 Project-URL: Changelog, https://github.com/kurusugawa-computer/nighthawk-python/blob/main/CHANGELOG.md
@@ -9,8 +9,8 @@ Project-URL: Bug Tracker, https://github.com/kurusugawa-computer/nighthawk-pytho
 Author-email: "Kurusugawa Computer Inc." <oss@kurusugawa.jp>
 License-Expression: MIT
 License-File: LICENSE
-Keywords: embedded-dsl,interoperability,llm,natural-language,pydantic-ai
-Classifier: Development Status :: 3 - Alpha
+Keywords: agent,ai,anthropic,dsl,llm,natural-language,openai,prompt-engineering,pydantic-ai,structured-output
+Classifier: Development Status :: 4 - Beta
 Classifier: Intended Audience :: Developers
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Programming Language :: Python :: 3
@@ -44,12 +44,12 @@ Description-Content-Type: text/markdown
 <img src="https://github.com/kurusugawa-computer/nighthawk-python/raw/main/docs/assets/nighthawk_logo-128x128.png" alt="nighthawk-logo" width="128px" margin="10px"></img>
 </div>
-Nighthawk is an experimental Python library exploring a clear separation:
+Nighthawk is a Python library where Python controls flow and LLMs or coding agents reason within constrained Natural blocks.
-- Use **hard control** (Python code) for strict procedure, verification, and deterministic flow.
-- Use **soft reasoning** (an LLM or coding agent) for semantic interpretation inside small embedded "Natural blocks".
+- **Hard control** (Python code): strict procedure, verification, and deterministic flow.
+- **Soft reasoning** (an LLM or coding agent): semantic interpretation inside small embedded "Natural blocks".
-Python controls all flow; the LLM or coding agent is constrained to small Natural blocks with explicit input/output boundaries. The same mechanism handles lightweight LLM judgments ("classify this sentiment") and autonomous agent executions ("refactor this module and write tests"). See **[Philosophy](https://kurusugawa-computer.github.io/nighthawk-python/philosophy/)** for the full design rationale.
+The same mechanism handles lightweight LLM judgments ("classify this sentiment") and autonomous agent executions ("refactor this module and write tests"). See **[Philosophy](https://kurusugawa-computer.github.io/nighthawk-python/philosophy/)** for the full design rationale.
 This repository is a compact reimplementation of the core ideas of [Nightjar](https://github.com/psg-mit/nightjarpy).

{nighthawk_python-0.6.1 → nighthawk_python-0.8.0}/README.md RENAMED

@@ -9,12 +9,12 @@
 <img src="https://github.com/kurusugawa-computer/nighthawk-python/raw/main/docs/assets/nighthawk_logo-128x128.png" alt="nighthawk-logo" width="128px" margin="10px"></img>
 </div>
-Nighthawk is an experimental Python library exploring a clear separation:
+Nighthawk is a Python library where Python controls flow and LLMs or coding agents reason within constrained Natural blocks.
-- Use **hard control** (Python code) for strict procedure, verification, and deterministic flow.
-- Use **soft reasoning** (an LLM or coding agent) for semantic interpretation inside small embedded "Natural blocks".
+- **Hard control** (Python code): strict procedure, verification, and deterministic flow.
+- **Soft reasoning** (an LLM or coding agent): semantic interpretation inside small embedded "Natural blocks".
-Python controls all flow; the LLM or coding agent is constrained to small Natural blocks with explicit input/output boundaries. The same mechanism handles lightweight LLM judgments ("classify this sentiment") and autonomous agent executions ("refactor this module and write tests"). See **[Philosophy](https://kurusugawa-computer.github.io/nighthawk-python/philosophy/)** for the full design rationale.
+The same mechanism handles lightweight LLM judgments ("classify this sentiment") and autonomous agent executions ("refactor this module and write tests"). See **[Philosophy](https://kurusugawa-computer.github.io/nighthawk-python/philosophy/)** for the full design rationale.
 This repository is a compact reimplementation of the core ideas of [Nightjar](https://github.com/psg-mit/nightjarpy).

{nighthawk_python-0.6.1 → nighthawk_python-0.8.0}/docs/api.md RENAMED

@@ -21,8 +21,10 @@
         - to_jsonable_value
         - ExecutionContext
         - get_current_step_context
+        - get_current_usage_meter
         - get_execution_context
         - get_step_executor
+        - UsageMeter
 ## Errors
@@ -106,6 +108,10 @@
 ::: nighthawk.resilience
     options:
       members:
+        - budget
+        - BudgetExceededError
+        - BudgetLimitKind
+        - CostFunction
         - retrying
         - timeout
         - fallback

{nighthawk_python-0.6.1 → nighthawk_python-0.8.0}/docs/for-coding-agents.md RENAMED

@@ -109,6 +109,9 @@ deep_executor = nh.AgentStepExecutor.from_configuration(
 )
+def search_repository(query: str) -> list[str]: ...
 @nh.natural_function
 def classify_ticket(text: str) -> str:
     label: str = ""
@@ -136,10 +139,16 @@ def write_analysis_report(ticket_text: str, product_context: str) -> str:
 with nh.run(fast_executor):
     label = classify_ticket(ticket_text)
-    with nh.scope(step_executor=deep_executor):
+    with nh.scope(
+        step_executor=deep_executor,
+        implicit_references={"search_repository": search_repository},
+    ):
         report = write_analysis_report(ticket_text, product_summary)
 ```
+`implicit_references` can inject global helper functions as block capabilities.
+Nested scopes still merge additively (set union by key).
 ## 4. The standard contract shape
 Prefer the post-block logic pattern. Let the block write a typed value, then validate or transform it in Python.
@@ -264,6 +273,8 @@ Do not inject untrusted raw text into Natural source. If input is user-controlle
 Rules:
 - The model sees callable signatures from both LOCALS and GLOBALS.
+- For object read bindings, the model also sees a capability view: object header, public methods (with signatures), and public fields (with typed previews).
+- Object capability views expose public members only. Private/dunder members are omitted, and properties are not evaluated.
 - Put per-invocation data in function parameters. Put stable, reusable capabilities at module level.
 - Do not annotate callable parameters as `object` or `Any` -- this erases the signature the model needs:
@@ -311,7 +322,7 @@ Async rule:
 Resilience rule:
-- Keep retry, fallback, timeout, and circuit-breaker policy in Python, not inside Natural text.
+- Keep retry, fallback, timeout, budget, and circuit-breaker policy in Python, not inside Natural text.
 - Import from `nighthawk.resilience` (not re-exported from `nighthawk`):
 ```py
@@ -323,11 +334,11 @@ with nh.run(executor):
     label = resilient_classify(ticket_text)
 ```
-See [Patterns: Resilience](https://kurusugawa-computer.github.io/nighthawk-python/patterns/#resilience-patterns) for `fallback`, `vote`, `timeout`, and `circuit_breaker`.
+See [Patterns: Resilience](https://kurusugawa-computer.github.io/nighthawk-python/patterns/#resilience-patterns) for `fallback`, `vote`, `timeout`, `budget`, and `circuit_breaker`.
 ## 9. Context budget discipline
-Prompt context is finite. When you see `<snipped>`, fix in this order:
+Prompt context is finite. When you see `<snipped>`, the marked data is truncated from the prompt but remains in Python memory -- the model can still reach it through binding functions. Fix context pressure in this order:
 1. Remove irrelevant locals and globals from the function scope.
 2. Split the block into smaller, focused blocks.

{nighthawk_python-0.6.1 → nighthawk_python-0.8.0}/docs/index.md RENAMED

@@ -4,12 +4,12 @@
 <img src="assets/nighthawk_logo-128x128.png" alt="logo" width="128px">
 </div>
-Nighthawk is an experimental Python library exploring a clear separation:
+Nighthawk is a Python library where Python controls flow and LLMs or coding agents reason within constrained Natural blocks.
-- Use **hard control** (Python code) for strict procedure, verification, and deterministic flow.
-- Use **soft reasoning** (an LLM or coding agent) for semantic interpretation inside small embedded "Natural blocks".
+- **Hard control** (Python code): strict procedure, verification, and deterministic flow.
+- **Soft reasoning** (an LLM or coding agent): semantic interpretation inside small embedded "Natural blocks".
-Python controls all flow; the LLM or coding agent is constrained to small Natural blocks with explicit input/output boundaries. The same mechanism handles lightweight LLM judgments ("classify this sentiment") and autonomous agent executions ("refactor this module and write tests").
+The same mechanism handles lightweight LLM judgments ("classify this sentiment") and autonomous agent executions ("refactor this module and write tests").
 ```py
 import nighthawk as nh

{nighthawk_python-0.6.1 → nighthawk_python-0.8.0}/docs/patterns.md RENAMED

@@ -323,7 +323,7 @@ def compute_with_context(context_text: str) -> int:
 Natural blocks are non-deterministic by nature. Production deployments need strategies to handle transient failures, unstable outputs, and provider outages. The `nighthawk.resilience` module provides composable **function transformers** -- each takes a callable and returns a new callable with the same signature.
 ```py
-from nighthawk.resilience import retrying, fallback, vote, timeout, circuit_breaker
+from nighthawk.resilience import retrying, fallback, vote, timeout, budget, circuit_breaker
 ```
 Import directly from `nighthawk.resilience`. Resilience primitives are not re-exported from the top-level `nighthawk` namespace.
@@ -349,7 +349,12 @@ for attempt in retrying(attempts=3):
         result = classify(text)
 ```
-Customize which exceptions trigger retries and the backoff strategy:
+`retrying` separates retry control into four roles:
+- `on`: type-level retry eligibility.
+- `retry_if`: content-level retry eligibility after `on` matches.
+- `wait`: retry interval strategy.
+- `on_retry`: side-effect hook when a retry is decided.
 ```py
 from tenacity import wait_fixed
@@ -357,10 +362,14 @@ from tenacity import wait_fixed
 resilient = retrying(
     attempts=5,
     on=(ExecutionError, TimeoutError),
+    retry_if=lambda exception: "transient" in str(exception).lower(),
     wait=wait_fixed(2),
+    on_retry=lambda retry_state: logger.info("retrying", extra={"attempt": retry_state.attempt_number}),
 )(classify)
 ```
+Use only what you need. For most cases, `retrying(attempts=3)(fn)` is enough.
 ### Fallback
 Try multiple functions in order. The first success wins.
@@ -405,6 +414,42 @@ async with timeout(seconds=30):
     result = await slow_operation()
 ```
+### Budget
+Enforce token or monetary cost limits on wrapped functions. Requires an active `nh.run()` context (the run-scoped `UsageMeter` tracks cumulative usage automatically).
+```py
+from nighthawk.resilience import budget
+safe_classify = budget(tokens=50_000, tokens_per_call=5_000)(classify)
+result = safe_classify(text)
+```
+`tokens` caps cumulative usage across all calls; `tokens_per_call` caps a single call. Both are checked before and after each invocation. When a limit is breached, `BudgetExceededError` is raised -- combine with `fallback` to degrade gracefully:
+```py
+from nighthawk.resilience import budget, fallback, BudgetExceededError
+composed = fallback(
+    budget(tokens=50_000)(classify_gpt4),
+    classify_mini,
+    on=(BudgetExceededError,),
+)
+```
+For monetary budgets, supply a `cost_function` that converts `RunUsage` to a float:
+```py
+from pydantic_ai.usage import RunUsage
+def dollar_cost(usage: RunUsage) -> float:
+    return usage.input_tokens * 3e-6 + usage.output_tokens * 15e-6
+budgeted = budget(cost=1.00, cost_function=dollar_cost)(classify)
+```
+Outside a `nh.run()` context, the transformer is a no-op.
 ### Circuit breaker
 Prevent repeated calls to a failing service. After `fail_threshold` consecutive failures, the circuit opens and rejects calls immediately with `CircuitOpenError`. After `reset_timeout` seconds, one probe call is allowed.
@@ -440,10 +485,11 @@ Recommended composition order (innermost to outermost):
 | Order | Transformer | Why |
 |---|---|---|
 | 1 | `timeout` | Bound each individual call |
-| 2 | `vote` | Aggregate multiple bounded calls |
-| 3 | `retrying` | Retry the aggregated operation |
-| 4 | `circuit_breaker` | Protect against persistent failure |
-| 5 | `fallback` | Switch to alternative on exhaustion |
+| 2 | `budget` | Cap token or monetary cost |
+| 3 | `vote` | Aggregate multiple bounded calls |
+| 4 | `retrying` | Retry the aggregated operation |
+| 5 | `circuit_breaker` | Protect against persistent failure |
+| 6 | `fallback` | Switch to alternative on exhaustion |
 ### Caching LLM results
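Note on the Budget section added in this file: it describes a function transformer that checks a run-scoped usage meter before and after each call. The sketch below shows that general shape in plain Python. All names (`ToyMeter`, `toy_budget`, `ToyBudgetExceeded`) are hypothetical, the exact comparison rules are assumptions, and none of this is the `nighthawk.resilience.budget` implementation.

```python
import functools
import threading


class ToyBudgetExceeded(RuntimeError):
    """Hypothetical error; stands in for BudgetExceededError."""


class ToyMeter:
    """Hypothetical thread-safe token counter; stands in for the run-scoped UsageMeter."""

    def __init__(self):
        self._lock = threading.Lock()
        self._total = 0

    def add(self, tokens):
        with self._lock:
            self._total += tokens

    @property
    def total(self):
        with self._lock:
            return self._total


def toy_budget(meter, tokens):
    """Return a transformer that enforces a cumulative token cap around each call."""

    def transform(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Pre-call check (assumed rule): refuse to start once the cap is reached.
            if meter.total >= tokens:
                raise ToyBudgetExceeded(f"budget of {tokens} tokens already spent")
            result = fn(*args, **kwargs)
            # Post-call check: the call itself may have pushed usage over the cap.
            if meter.total > tokens:
                raise ToyBudgetExceeded(f"call exceeded budget of {tokens} tokens")
            return result

        return wrapper

    return transform


meter = ToyMeter()


def classify(text):
    meter.add(40)  # pretend every call costs 40 tokens
    return "positive"


safe_classify = toy_budget(meter, tokens=100)(classify)
first = safe_classify("a")   # cumulative usage: 40
second = safe_classify("b")  # cumulative usage: 80
try:
    safe_classify("c")       # cumulative usage: 120, caught by the post-call check
    exceeded = False
except ToyBudgetExceeded:
    exceeded = True
```

Because the wrapper keeps the wrapped function's signature, it nests with other transformers in the same way the composition table above orders `timeout`, `budget`, `vote`, `retrying`, `circuit_breaker`, and `fallback`.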

{nighthawk_python-0.6.1 → nighthawk_python-0.8.0}/docs/philosophy.md RENAMED

@@ -1,10 +1,10 @@
 # Philosophy
-Python controls orchestration; the LLM operates inside typed blocks with explicit state transfer.
+Python owns the control flow. The LLM works inside typed blocks, receiving inputs and returning outputs through explicit bindings.
 ## Execution model
-Nighthawk embeds Natural blocks inside ordinary Python functions. Each block is a typed boundary: read bindings (`<name>`) inject input state from Python variables, write bindings (`<:name>`) commit output state back with type validation, and binding functions give the LLM composable access to Python callables during block execution. Python controls the sequencing -- loops, conditionals, error handling, retries -- and the LLM operates inside each block with no implicit message history carried across blocks.
+Nighthawk embeds Natural blocks inside ordinary Python functions. Each block has a typed boundary. Read bindings (`<name>`) pass Python values in. Write bindings (`<:name>`) pass results back out, validated against their type annotations. Binding functions let the LLM call Python functions during execution. Python controls the sequencing -- loops, conditionals, error handling, retries -- and the LLM operates inside each block with no implicit message history carried across blocks.
 ```py
 def python_average(numbers):
@@ -23,7 +23,7 @@ calculate_average([1, "2", "three", "cuatro", "五"])  # 3.0
 Binding functions like `<python_average>` appear in the prompt as a compact signature line. The LLM's pre-trained Python knowledge lets it reason about types, return values, and composition from the signature alone, without JSON Schema or protocol overhead. See [Tool exposure efficiency](#tool-exposure-efficiency) for the quantitative comparison with MCP and CLI tool exposure.
-With provider-backed executors, each Natural block is a single LLM call where typed bindings do the heavy lifting. A sentiment classifier whose write binding is typed as `Literal["positive", "negative", "neutral"]` will reject any output that falls outside the declared set -- the type annotation is not a hint but a runtime-enforced contract via Pydantic validation. The same mechanism applies to numeric extraction (`int`, `float`), structured parsing (Pydantic models), and any task where the judgment space is bounded. Because the host program owns the loop, a misclassified result can be retried, logged, or routed to a fallback -- all in ordinary Python.
+With provider-backed executors, each Natural block is a single LLM call. A sentiment classifier whose write binding is typed as `Literal["positive", "negative", "neutral"]` rejects any output outside the declared set -- Pydantic validates the type annotation at runtime, not as a hint. The same mechanism applies to numeric extraction (`int`, `float`), structured parsing (Pydantic models), and any task where the judgment space is bounded. Because the host program owns the loop, a misclassified result can be retried, logged, or routed to a fallback -- all in ordinary Python.
 With [coding agent backends](coding-agent-backends.md), the same boundary contract applies, but each Natural block becomes an autonomous agent execution. The agent can read files, run commands, and invoke skills -- while typed bindings enforce what crosses the boundary back to Python. The same `scope()` and `run()` context managers that structure human-written workflows are equally legible to a coding agent constructing workflows programmatically. When a coding agent operates inside a Natural block, binding functions appear as Python signatures in the prompt:
@@ -32,13 +32,17 @@ fetch_items: (category: str, limit: int = 10) -> list[Item]
 merge_results: (primary: list[Item], secondary: list[Item]) -> list[Item]
 ```
-The underlying LLM's pre-trained Python knowledge lets it infer that `Item` has attributes, that the return value supports iteration and indexing, and that `merge_results` accepts the output of `fetch_items` directly -- all from the type annotations alone. An equivalent CLI tool description (`fetch-items --category X --limit 10`) conveys invocation syntax but not output structure; the model must infer or discover the output format separately.
+The underlying LLM's pre-trained Python knowledge lets it infer that `Item` has attributes, that the return value supports iteration and indexing, and that `merge_results` accepts the output of `fetch_items` directly -- all from the type annotations alone. A CLI tool description (`fetch-items --category X --limit 10`) is optimized for invocation syntax; output structure is left to the model's training data.
-Coding agent backends make this especially practical because the agent can immediately apply that inferred structure while reading workflow code, invoking tools, editing implementations, running `pytest`, and iterating within the same Python codebase. No framework-specific tooling, no graph serialization format, no separate configuration language.
+Coding agent backends make this especially practical because the agent can immediately apply that inferred structure while reading workflow code, invoking tools, editing implementations, running `pytest`, and iterating within the same Python codebase. The agent works directly in Python with standard tooling -- debugger, test runner, type checker -- rather than through a separate orchestration layer.
+When the prompt exceeds token limits, the runtime omits remaining entries from the rendered context and appends a `<snipped>` marker. The underlying data stays in Python memory -- binding functions can still query it at runtime. Truncation optimizes prompt coherence without causing data loss.
+Because each Natural block is a fresh prompt with no implicit history, the entire prompt surface -- block text (including f-string interpolation), bindings, and scope configuration -- is determined by the host program at each invocation. Changing any of these between invocations has no side effects on other blocks.
 ## The harness matters more than the model
-The strongest direct evidence comes from agentic coding tasks; extending the principle to provider-backed judgments is a design inference, not a measured claim.
+The strongest direct evidence comes from agentic coding tasks. The subsections below separate what has been measured from where Nighthawk extends the principle.
 ### Observed evidence
@@ -50,27 +54,29 @@ The direct evidence concerns LLM-driven code editing and file management tasks,
 ### Design inference for Nighthawk
-Extending the principle to provider-backed lightweight judgments (sentiment classification, numeric interpretation) is a design inference, not an empirical claim: typed bindings structurally constrain hallucination, and resilience transformers absorb transient failures, but these benefits have not been independently measured in the same controlled fashion.
+We think the same principle applies to provider-backed judgments like sentiment classification and numeric interpretation, but we have not measured it directly. Typed bindings limit what the LLM can return, and resilience transformers handle transient failures -- both should help, but neither has been tested in the same controlled way as the coding-task evidence above.
-Regardless of scope, the practical question is how harness improvements are expressed. Configuration-file-based guardrail systems -- rule files, lifecycle hooks, permission modes, tool filtering -- are effective for restricting behavior but cannot express dynamic orchestration: conditional retry strategies, type-level input/output contracts, scope-dependent tool visibility, or prompts that adapt to runtime state. The constraint vocabulary is limited to what the configuration format allows.
+Regardless of scope, the practical question is how harness improvements are expressed. Configuration-file guardrails -- rule files, lifecycle hooks, permission modes, tool filtering -- are effective at restricting behavior. They are optimized for static constraints. Dynamic orchestration (conditional retries, typed input/output contracts, scope-dependent tool visibility, prompts that adapt at runtime) requires a programming language, which is where Nighthawk's Python-first approach fits.
-The primitives described in the [Execution model](#execution-model) and the following sections -- typed bindings, resilience transformers, scoped execution contexts -- are Nighthawk's implementation of the principle, through Python programming rather than configuration.
+The primitives described in the [Execution model](#execution-model) and the following sections -- typed bindings, resilience transformers, scoped execution contexts -- are how Nighthawk implements the principle in Python.
 ## Design consequences
-The execution model introduced typed bindings as the boundary mechanism between Python and LLM reasoning. The following subsections explore what design consequences follow from that choice -- from resilience and scoping to tool exposure, multi-agent coordination, and the tradeoffs the design accepts.
+The sections below explore what follows from the typed-binding execution model: resilience, scoping, tool exposure, multi-agent coordination, and the tradeoffs the design accepts.
 ### Resilience as composable functions
-Production LLM applications need strategies for handling transient failures, unstable outputs, and provider outages. Workflow engines build retry, checkpointing, and human-in-the-loop into the graph runtime -- resilience is inseparable from the orchestration layer. Nighthawk takes a different approach: resilience primitives (`nighthawk.resilience`) are ordinary Python function transformers that wrap any callable. Each transformer takes a function and returns a new function with the same signature. Retry, fallback, voting, timeout, and circuit breaker logic composes by nesting -- no graph DSL, no framework-managed state, and no implicit retry policy. The host controls exactly which calls are retried, how many times, and what happens on exhaustion -- using the same Python debugger, pytest, and code review workflows as the rest of the application. This applies equally to lightweight provider-backed judgments and autonomous agent executions. See [Patterns](patterns.md#resilience-patterns) for usage patterns and composition examples.
+Production LLM applications need strategies for transient failures, unstable outputs, and provider outages. Workflow engines build retry, checkpointing, and human-in-the-loop into the graph runtime. Nighthawk takes a different approach. Resilience primitives (`nighthawk.resilience`) are ordinary Python function transformers that wrap any callable. Each transformer takes a function and returns a new function with the same signature. Retry, fallback, voting, timeout, and circuit breaker logic composes by nesting -- no graph DSL, no framework-managed state, and no implicit retry policy. The host controls exactly which calls are retried, how many times, and what happens on exhaustion -- using the same Python debugger, pytest, and code review workflows as the rest of the application. This applies equally to lightweight provider-backed judgments and autonomous agent executions. See [Patterns](patterns.md#resilience-patterns) for usage patterns and composition examples.
 ### Scoped execution contexts
-`run()` establishes the execution boundary: it links a step executor to the current context as an explicit Python `with` statement rather than as a global configuration or implicit thread-local. `scope()` narrows configuration within an existing run -- model override, prompt suffix, or executor replacement -- each taking effect only within the nested `with` block. Nesting is natural Python lexical scoping: the host program's control flow, not a framework runtime, determines which configuration is active at any point. This connects directly to the philosophy that runtime behavior should live in Python structures rather than in prose-only instructions or static configuration. See [Runtime configuration](runtime-configuration.md) for details and examples.
+`run()` establishes the execution boundary: it links a step executor to the current context as an explicit Python `with` statement rather than as a global configuration or implicit thread-local. `scope()` narrows configuration within an existing run -- model override, prompt suffix, or executor replacement -- each taking effect only within the nested `with` block. Nesting is natural Python lexical scoping: the host program's control flow, not a framework runtime, determines which configuration is active at any point. Runtime behavior lives in Python structures rather than in prose-only instructions or static configuration. See [Runtime configuration](runtime-configuration.md) for details and examples.
 ### Tool exposure efficiency
-Because binding functions are Python signatures rather than JSON Schema objects or CLI descriptions, the per-tool context cost is on the order of a single signature line. MCP tool definitions carry per-request JSON Schema overhead that grows with the number of exposed tools. CLI tools reduce definition overhead but carry hidden costs -- Mario Zechner's [2025 benchmark](https://mariozechner.at/posts/2025-08-15-mcp-vs-cli/) found that CLI invocations in Claude Code trigger per-command security classification that consumed an order of magnitude more tokens than equivalent MCP calls. In both approaches, substantial context budget is spent on tool plumbing before the model sees the actual task.
+Binding functions carry higher information density per token than JSON Schema or CLI descriptions (see [Execution model](#execution-model) for how they appear in the prompt). This section compares the per-tool context cost across approaches.
+MCP tool definitions carry per-request JSON Schema overhead that grows with the number of exposed tools. CLI tools reduce definition overhead but carry hidden costs -- Mario Zechner's [2025 benchmark](https://mariozechner.at/posts/2025-08-15-mcp-vs-cli/) found that CLI invocations in Claude Code trigger per-command security classification that consumed an order of magnitude more tokens than equivalent MCP calls. In both approaches, substantial context budget is spent on tool plumbing before the model sees the actual task.
 **MCP** defines tools as JSON Schema objects served over a protocol layer. Each tool definition consumes tokens in every request.
@@ -82,7 +88,7 @@ Because binding functions are Python signatures rather than JSON Schema objects
 find_top_items: (category: str) -> list[dict]  # Return the highest-scored recent items in a category.
 ```
-This is on the order of a single signature line -- comparable in token cost to the most compact CLI description, but carrying higher information density. The type annotations let the LLM reason structurally: a `list[dict]` return supports iteration and key access, an `Item` return type has discoverable attributes, and typed parameters make it clear what another binding function will accept. A CLI description of similar compactness conveys invocation syntax but leaves output structure to inference from training data. There is no protocol layer, no serialization boundary, and no per-tool JSON Schema overhead. The same type annotations serve as targets for optional static analysis (pyright) and as hooks for Nighthawk's runtime validation (via Pydantic). Testing, debugging, and composition use standard Python tooling.
+The type annotations let the LLM reason structurally: a `list[dict]` return supports iteration and key access, an `Item` return type has discoverable attributes, and typed parameters make it clear what another binding function will accept. There is no protocol layer, no serialization boundary, and no per-tool JSON Schema overhead. The same type annotations serve as targets for optional static analysis (pyright) and as hooks for Nighthawk's runtime validation (via Pydantic). Testing, debugging, and composition use standard Python tooling.
 | Approach | Per-tool context cost | Information density | Type safety | Composability | Testing | Interoperability |
 |---|---|---|---|---|---|---|
@@ -92,7 +98,7 @@ This is on the order of a single signature line -- comparable in token cost to t
 ### Multi-agent coordination without a framework
-Multi-agent systems face three structural challenges: how agents communicate state, how agents are isolated from each other, and how results from multiple agents are merged. Existing workflow engines address these through framework-specific mechanisms -- graph state for communication, managed runtimes for isolation, message aggregation for merging -- but each solution locks users into the framework's abstractions, and no single framework provides all three comprehensively.
+Multi-agent systems face three structural challenges: how agents communicate state, how agents are isolated from each other, and how results from multiple agents are merged. Existing workflow engines address these through framework-specific mechanisms -- graph state for communication, managed runtimes for isolation, message aggregation for merging -- but each ties communication, isolation, and merging to the framework's own abstractions.
 Nighthawk is not a multi-agent framework. It is a building block that composes with Python's existing ecosystem for each challenge.
@@ -100,7 +106,7 @@ Nighthawk is not a multi-agent framework. It is a building block that composes w
 **Isolation.** Nighthawk provides logical isolation at binding boundaries: read bindings prevent name rebinding, write bindings are type-validated, and each Natural block executes with an independent step context carrying no implicit message history. Read bindings do not prevent in-place mutation of mutable objects -- this is intentional and underlies the [carry pattern](patterns.md#the-carry-pattern). OS-level isolation -- sandboxing, filesystem scoping, permission control -- is delegated to the execution backend. Coding agent backends provide their own sandbox modes and working directory scoping, which Nighthawk configures but does not reimplement.
-**Result merging.** The resilience module provides composable patterns for common cases: `vote` for majority consensus across repeated invocations, `fallback` for sequential first-success chaining. Domain-specific merging -- reconciling edits from multiple agents, aggregating heterogeneous outputs, resolving conflicts -- belongs in user code, because merge semantics are inherently domain-dependent. Nighthawk's role is to ensure that each agent's output crosses the boundary as a typed, validated Python object that merge logic can operate on directly.
+**Result merging.** The resilience module provides composable patterns for common cases: `vote` for majority consensus across repeated invocations, `fallback` for sequential first-success chaining. Domain-specific merging -- reconciling edits from multiple agents, aggregating heterogeneous outputs, resolving conflicts -- belongs in user code, because merge semantics are inherently domain-dependent. Nighthawk ensures that each agent's output crosses the boundary as a typed, validated Python object that merge logic can operate on directly.
 ### Tradeoffs
@@ -108,7 +114,7 @@ The boundary-centric design has costs:
 - **Python lock-in.** Binding functions, type annotations, and resilience transformers are Python constructs. Nighthawk does not offer a language-neutral protocol; interoperability with non-Python systems requires explicit bridging (e.g., REST endpoints wrapping Natural functions).
 - **Per-invocation cost.** Every Natural block invocation calls the LLM. There is no compilation step that amortizes cost across inputs. For high-throughput, low-judgment tasks where a deterministic Python function would suffice, a Natural block is the wrong tool. See [Why evaluate every time](#why-evaluate-every-time) for the design rationale.
-- **Integration tests are essential.** Mock tests verify Python logic around Natural blocks, but verifying that the LLM produces correct judgments requires integration tests against a real provider. The [two-layer testing strategy](verification.md) is not optional -- it is a structural consequence of delegating judgment to an LLM.
+- **Integration tests are essential.** Mock tests verify Python logic around Natural blocks, but verifying that the LLM produces correct judgments requires integration tests against a real provider. The [two-layer testing strategy](verification.md) is not optional -- because the LLM produces the judgment, only a real LLM call can verify it.
 - **Manual orchestration burden.** Nighthawk leaves branching, retries, merge logic, and recovery policy in user code rather than a graph runtime. This is a direct cost of the "Python controls all flow" principle.
 - **Python API design discipline.** Binding functions are only as effective as their signatures, type annotations, and naming. Poor API design degrades the LLM's ability to reason about composition.
@@ -118,7 +124,7 @@ A natural question: why not use an LLM once to translate a Natural block into eq
 The answer is that Natural blocks exist precisely for tasks that cannot be reduced to deterministic code. "Classify the sentiment of this review" or "interpret this ambiguous user input" require judgment that depends on the specific input, world knowledge, and context. If a task could be written as deterministic Python, it should be -- this is the core design principle (see [Natural blocks](natural-blocks.md#responsibility-split)).
-One-time compilation has additional structural limitations:
+One-time compilation has additional limitations:
 - The generated code would freeze the LLM's world knowledge at compilation time.
 - The input space is unbounded: "three apples, a dozen eggs, and cinco naranjas" requires open-ended interpretation that no finite code generation can fully anticipate.
@@ -153,7 +159,7 @@ Target: `[1, "2", "three", "cuatro", "五"]`
 Store the computed average in `result`.
 ````
-The instruction references embedded code, but there is no explicit boundary for how `result` crosses back to the host program. The narrative assumes the value will be available to subsequent steps, but the mechanism for state transfer is implicit -- the reader must infer it from convention rather than from a declared contract.
+The instruction references embedded code, but there is no explicit boundary for how `result` crosses back to the host program. The narrative assumes the value will be available to subsequent steps, but getting `result` back to the host program is implicit -- it depends on convention, not a declared contract.
 ### Nighthawk
