PyPI - nighthawk-python - Versions diffs - 0.4.0__tar.gz → 0.5.0__tar.gz - Mend

nighthawk-python 0.4.0tar.gz → 0.5.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (141)

nighthawk_python-0.5.0/.claude/rules/docs.md ADDED

@@ -0,0 +1,106 @@
+---
+paths:
+  - "docs/**/*.md"
+---
+# Documentation rules
+## File roles and boundaries
+Content belongs in exactly one file; cross-reference rather than duplicate. Exception: `for-coding-agents.md` condenses (distills) content from other files into actionable rules.
+| File | Audience | Role | Scope boundary |
+|---|---|---|---|
+| `index.md` | First-time visitors | Project overview, motivation, workflow styles | What Nighthawk is and why. Brief positioning summary with link to `philosophy.md`. No API details, no how-to. |
+| `philosophy.md` | Users evaluating Nighthawk | Deep positioning and design rationale | Workflow styles, comparison with workflow engines, tool exposure tradeoffs (MCP/CLI/binding functions), runtime evaluation rationale. Technical arguments with benchmarks. |
+| `quickstart.md` | New users | Shortest path to running a Natural block | Setup, first example, credentials, troubleshooting. No deep explanations. |
+| `tutorial.md` | Users learning the system | Build understanding from first principles | Bindings, functions and discoverability, control flow, composition, configuration, async. Assumes quickstart is done. Guidelines and testing are in `practices.md`. |
+| `practices.md` | Users applying patterns | Practical patterns and guidelines | Writing guidelines, binding function design, testing and debugging, observability. Assumes tutorial is done. |
+| `design.md` | Implementors and advanced users | Canonical specification (target behavior) | Full technical detail: syntax rules, state layers, prompt rendering, tool contracts, outcome schema, frontmatter. |
+| `providers.md` | Users choosing and configuring models | Provider selection, Pydantic AI setup, custom backends | Provider categories, capability matrix, model identifiers, Pydantic AI model settings, step executor protocols. No coding-agent-backend-specific content. |
+| `coding-agent-backends.md` | Users of Claude Code or Codex backends | Coding agent backend configuration and features | Backend-specific settings, skills, MCP tool exposure, working directory, project-scoped files. |
+| `for-coding-agents.md` | Coding agents (LLMs) working on Nighthawk projects | Condensed development knowledge base | Nighthawk mental model, Natural block writing, binding function design, control flow, composition, testing, common mistakes. Not a human tutorial; an LLM reference. |
+| `api.md` | Developers using the library | Auto-generated API reference (mkdocstrings) | Public API surface only. Content comes from source docstrings; do not hand-edit. |
+| `roadmap.md` | Contributors and planners | Future directions | Ideas and desired directions only. Must not restate what is already implemented. |
+## Content routing (non-obvious cases)
+Most content maps to exactly one file via the scope boundaries above. These cases involve splits or non-intuitive placement:
+- **Credential setup** -> `quickstart.md` (`OPENAI_API_KEY` only), `providers.md` (Pydantic AI providers), `coding-agent-backends.md` (coding agent backends)
+- **Error types** -> `design.md` (specification), `tutorial.md` (practical usage with examples)
+- **Testing patterns** -> `practices.md` (patterns with examples), `for-coding-agents.md` (condensed rules)
+- **Writing guidelines** -> `practices.md` (patterns with examples), `for-coding-agents.md` (condensed rules)
+- **Observability** -> `practices.md` (practical setup), `design.md` (specification)
+- **Conceptual impact of coding agent backends** (how they expand what Natural blocks can do) -> `philosophy.md` (positioning arguments), `index.md` (brief summary), `practices.md` (guidelines). Configuration details stay in `coding-agent-backends.md`.
+## Writing guidelines
+### General
+- Cross-reference with relative links (e.g., `[Section 5](tutorial.md#5-cross-block-composition)`). Exception: `for-coding-agents.md` uses absolute URLs based on `site_url` from `mkdocs.yml`.
+- File boundary delineation: `index.md` owns "why" (motivation, positioning); `tutorial.md` owns "how" (usage with examples); `practices.md` owns "how to do it well" (guidelines, patterns, testing); `design.md` owns "exact rules and edge cases" (specification).
+- Maintain consistent terminology across files (e.g., "one task per block" everywhere, not "one judgment" in one file and "one task" in another).
+- Keep code examples self-contained: understandable without surrounding prose.
+- Built-in tool names (`nh_eval`, `nh_exec`, `nh_assign`) are implementation details. Only `design.md` may expose them; all other files describe behavior instead.
+- `@nh.tool` is discouraged. `design.md` documents it as specification. `tutorial.md` may mention it with a "prefer binding functions" note. All other files (including `for-coding-agents.md`) must not reference it.
+- The PyPI package name is `nighthawk-python`. Always use `nighthawk-python` in `pip install` commands and extras.
+- Terminology: "task" refers to the structural unit a Natural block performs (contract: inputs, outputs, outcome). "judgment" refers to the cognitive act the LLM performs (classification, interpretation, generation). Use "one task per block", not "one judgment per block".
+### Per-file rules
+**index.md**
+- Documentation links list must stay in sync with `nav` entries in `mkdocs.yml`.
+- Keep positioning sections as brief summary paragraphs linking to `philosophy.md`. Do not add detailed comparisons, benchmarks, or technical arguments here.
+**philosophy.md**
+- Owns all detailed positioning arguments: workflow styles, workflow engine comparison, tool exposure tradeoffs, runtime evaluation rationale.
+- External references (benchmarks, blog posts) are acceptable. Prefer stable URLs; include enough inline context that the argument survives link rot.
+- May reference `tutorial.md` and `coding-agent-backends.md` for cross-cutting concepts but must not duplicate how-to content.
+**quickstart.md**
+- Optimize for copy-paste. Keep troubleshooting to common first-run errors only.
+- Includes both a Pydantic AI provider example and a coding agent backend (claude-code-cli) example.
+**tutorial.md**
+- One concept per section. Combine related ideas only when they share an example.
+- `<!-- prompt-example:name -->` markers are test anchors verified by `tests/docs/test_prompt_examples.py`. Never modify content between markers without updating the test.
+- Backend-agnostic in configuration: no backend-specific file layouts, credentials, or initialization variants; point to `providers.md` or `coding-agent-backends.md` instead. Conceptual references to what coding agent backends enable are acceptable as brief pointers.
+**practices.md**
+- Assumes the reader has completed the tutorial. No repetition of fundamentals.
+- No `<!-- prompt-example:name -->` markers; all prompt examples remain in `tutorial.md`.
+- Focus on practical application patterns, setup instructions, and decision frameworks.
+- Cross-reference `tutorial.md` for concepts, `design.md` for specifications.
+**design.md**
+- The specification. If implementation diverges, prefer changing the implementation (Section 0.1).
+- Precise, unambiguous language. No hedging for specified behavior.
+- Keep section hierarchy stable; other docs link to anchors.
+**providers.md**
+- Capability matrix must clearly show which features require a coding agent backend.
+- Concise, runnable setup snippets over narrative; link to `tutorial.md` for concepts.
+- Custom backends: show `AgentStepExecutor.from_agent` snippet; link to `design.md` for protocol details.
+- No credential details; delegate to Pydantic AI documentation.
+**coding-agent-backends.md**
+- Document shared capabilities once in a shared section; per-backend sections focus on differences.
+- For CLI integrations, separate what Nighthawk configures from what is delegated to backend CLI rules.
+- Include a settings field table per backend (type, default, description).
+- Reference `providers.md` for the overall provider landscape and capability matrix; do not duplicate the matrix.
+**for-coding-agents.md**
+- The reader is a coding agent (LLM). Write for immediate applicability, not progressive learning. Include runnable code templates.
+- Distill principles from tutorial.md, practices.md, design.md, and guidelines into actionable rules. Do not duplicate prose.
+- Information flows from human-oriented docs to this file, never the reverse. All facts, patterns, and rules in this file must have a source in tutorial.md, practices.md, or design.md. Do not introduce new information here.
+- Self-contained: readable without sibling files. All doc references use absolute URLs from `site_url` in `mkdocs.yml` (currently `https://kurusugawa-computer.github.io/nighthawk-python/`). Update URLs if `site_url` changes.
+- Keep the "common mistakes" table current.
+- Filter for coding-agent relevance: omit infrastructure concerns (scoped overrides, exception hierarchy beyond `ExecutionError`, observability) that don't affect writing Natural blocks or binding functions.
+**api.md**
+- Content from source docstrings; edit source code, not api.md. Hand-editing limited to `:::` directive structure.
+- Use `members` filters in `:::` directives to avoid duplicate rendering when the same module appears in multiple sections.
+**roadmap.md**
+- Future-facing only. Remove items once implemented. No implementation details.
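The `for-coding-agents.md` rule above asks for absolute URLs derived from `site_url`, while every other file uses relative links. A minimal sketch of that conversion follows. It assumes the site uses directory-style URLs (`tutorial.md` served at `.../tutorial/`), which matches the README links in this release. The helper name `absolute_doc_url` is hypothetical and not part of the package.

```python
from urllib.parse import urljoin

# site_url as quoted in the rule above (taken from mkdocs.yml).
SITE_URL = "https://kurusugawa-computer.github.io/nighthawk-python/"


def absolute_doc_url(relative_link: str, site_url: str = SITE_URL) -> str:
    """Convert a relative docs link such as 'tutorial.md#anchor' to an absolute URL.

    Hypothetical helper: assumes directory-style URLs, so 'tutorial.md'
    maps to '<site_url>tutorial/' and 'index.md' maps to the site root.
    """
    page, _, anchor = relative_link.partition("#")
    if not page.endswith(".md"):
        raise ValueError(f"expected a .md link, got {relative_link!r}")
    slug = page[: -len(".md")]
    path = "" if slug == "index" else f"{slug}/"
    url = urljoin(site_url, path)
    return f"{url}#{anchor}" if anchor else url


print(absolute_doc_url("tutorial.md#5-cross-block-composition"))
```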

nighthawk_python-0.5.0/.claude/rules/promptfoo.md ADDED

@@ -0,0 +1,137 @@
+---
+paths:
+  - "evals/promptfoo/**"
+---
+# Prompt evaluation (promptfoo)
+## Directory roles
+| Directory | Purpose | Determinism | Cost |
+|---|---|---|---|
+| `evals/promptfoo/` | Prompt experimentation: system prompt variants, tool descriptions, suffix fragments, backend comparison | Non-deterministic; use `--repeat N` to measure stability. | API calls x N x providers. |
+| `evals/promptfoo/outputs/` | Transient raw output (gitignored). Working files for in-progress analysis. | N/A | N/A |
+| `evals/promptfoo/evidence/` | Committed eval evidence. Decision rationale for adopted/rejected variants. | N/A | N/A |
+## Prompt changes workflow
+1. Edit eval-layer prompts or provider config in `evals/promptfoo/`.
+2. Run eval: `eval $PFOO eval --filter-providers "<provider>" --no-cache`.
+3. Compare against previous eval in the promptfoo DB or JSON output.
+4. Once validated, port the change to the corresponding production code (see mapping below).
+5. Run `uv run pytest -q` to confirm no regressions.
+**`--filter-providers` caveat**: The flag takes a regex pattern, not an exact label. A pattern like `"gpt-5.4-mini"` matches every provider whose label contains that substring (e.g. both `openai-responses` and `codex` labels). Use an anchored or label-specific pattern (e.g. `"^gpt-5.4-mini"`, `"codex:"`) to target a single provider.
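The caveat above can be reproduced with Python's `re` module. This is only a sketch of the matching behavior the rule describes: the provider labels are hypothetical, and using `re.search` (match anywhere in the label) is an assumption about how the flag applies the pattern, not promptfoo's actual code.

```python
import re

# Hypothetical provider labels; the real ones live in the promptfoo configs.
LABELS = ["gpt-5.4-mini", "codex:gpt-5.4-mini", "claude-code-cli"]


def select_providers(pattern: str, labels: list[str]) -> list[str]:
    """Return the labels in which the regex pattern matches anywhere."""
    regex = re.compile(pattern)
    return [label for label in labels if regex.search(label)]


# An unanchored pattern matches every label containing the substring.
print(select_providers("gpt-5.4-mini", LABELS))

# Anchoring, or using a label-specific prefix, narrows the selection to one.
print(select_providers("^gpt-5.4-mini", LABELS))
print(select_providers("codex:", LABELS))
```

The unescaped `.` in these patterns also matches any character. If a literal dot matters, `re.escape("gpt-5.4-mini")` produces a pattern that matches only the exact text.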
+## Prompt variant cleanup (deletion timing and criteria)
+- Delete rejected prompt variants immediately when any of the following is true:
+  - Deterministic regression (`N/N` failures across repeats, e.g. `--repeat 3`)
+  - Clear degradation versus baseline on primary metrics
+  - Temporary/scratch variants created for quick checks
+- Keep a variant only if it has clear re-test value (e.g. flaky `1/N` behavior or meaningful metric trade-offs).
+- Retention for kept variants is limited to `min(7 days, 2 experiment cycles)`.
+  - If not re-evaluated within this window, delete it.
+- Run cleanup at three checkpoints:
+  1. Right after each experiment run
+  2. Before opening a PR
+  3. Before merge (final sweep)
+- Before deleting, record a short rejection reason in the corresponding `evals/promptfoo/evidence/` file; do not keep dead prompt files just for history.
+## Backend prerequisites
+| Backend | Requirement |
+|---|---|
+| `openai-responses` | `OPENAI_API_KEY` environment variable. |
+| `codex` | Pre-authenticated `codex` CLI (`codex login`) or `CODEX_API_KEY` environment variable. |
+| `claude-code-cli` | Pre-authenticated `claude` CLI (`claude login`) or `ANTHROPIC_API_KEY` environment variable. |
+| `claude-code-sdk` | `ANTHROPIC_API_KEY` environment variable. |
+Evals that include a backend without its prerequisite will run but produce errors for that backend's test cases. Use `--filter-providers` to exclude unavailable backends.
+## Eval-to-production mapping
+Eval prompts are experimental copies of production code. Keep them in sync; divergence means eval results do not predict production behavior.
+| Eval file | Production counterpart |
+|---|---|
+| `evals/promptfoo/prompts/eval_default.txt` | `configuration.py:DEFAULT_STEP_SYSTEM_PROMPT_TEMPLATE` |
+| `evals/promptfoo/prompts/eval_coding_agent.txt` | No single counterpart; coding agent backends receive this prompt via `system_prompt_file` config. |
+| Suffix variants in `evals/promptfoo/provider.py` (`_build_suffix_*`) | `step_contract.py:build_step_system_prompt_suffix_fragment` |
+| Tool presets in `evals/promptfoo/provider.py` (`_build_tool_preset`) | `tools/registry.py` + `tools/assignment.py` |
+## Eval evidence
+### When to save evidence
+| Save | Do not save |
+|---|---|
+| Eval that decided adoption or rejection of a prompt/suffix/tool variant | In-progress trial runs during development |
+| Regression baseline update | Single-test filter runs |
+| Eval backing a change merged via PR | Transient output already in `outputs/` |
+### File path convention
+```
+evals/promptfoo/evidence/{YYYY-MM-DD}-{experiment-slug}.md
+```
+Examples: `2026-03-20-suffix-ab.md`, `2026-03-25-regression-v2.md`.
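The path convention above is easy to generate programmatically. The following is a small sketch, not part of the package. The helper name `evidence_path` and the lowercase kebab-case slug check are assumptions; the source only shows the two example file names.

```python
import re
from datetime import date
from pathlib import PurePosixPath

EVIDENCE_DIR = PurePosixPath("evals/promptfoo/evidence")

# Assumed slug shape (lowercase words joined by hyphens), inferred from the examples.
_SLUG = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def evidence_path(run_date: date, slug: str) -> PurePosixPath:
    """Build evals/promptfoo/evidence/{YYYY-MM-DD}-{experiment-slug}.md."""
    if not _SLUG.fullmatch(slug):
        raise ValueError(f"slug should be lowercase kebab-case, got {slug!r}")
    return EVIDENCE_DIR / f"{run_date.isoformat()}-{slug}.md"


print(evidence_path(date(2026, 3, 20), "suffix-ab"))
```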
+### File format
+```markdown
+---
+eval_id: <promptfoo eval ID>
+config: <YAML config file used>
+date: YYYY-MM-DD
+decision: <one-line summary of what was adopted/rejected>
+---
+## Providers tested
+- <provider label> (<variants if A/B>)
+## Results summary
+| Variant | Pass | Fail | Error | Score | Latency |
+|---------|------|------|-------|-------|---------|
+| ...     |      |      |       |       |         |
+## Decision rationale
+<Why the chosen variant was adopted.>
+## Rejected variants
+- <variant>: <short rejection reason>
+```
+### Rules
+- Only Markdown files (`*.md`) are committed under `evidence/`. Raw JSON stays in `outputs/` (gitignored).
+- One file per experiment decision. If a follow-up eval revises a prior decision, create a new file; do not overwrite.
+- Reference the `eval_id` so the full raw result can be retrieved from the promptfoo local DB (`promptfoo view`) or `-o` export if needed.
+## Eval interpretation
+- **Exit code 100**: promptfoo returns exit code 100 when any test case fails. This is not a system error; it signals "some assertions did not pass". CI scripts and background runners should treat exit 100 as "check results" rather than "eval crashed".
+- **OpenAI 500 errors**: Transient; ignore unless persistent across runs.
+- **Codex CLI errors**: Codex backend is unstable; isolate with `--filter-providers`.
+- **Codex binding-not-returned**: Codex may return `None` for write bindings that `openai-responses` handles correctly. This is a distinct failure mode from flaky LLM non-determinism; it typically indicates the coding-agent backend did not invoke the assignment tool.
+- **Mutation vs filter ambiguity**: Natural language instructions like "remove negative numbers" can be interpreted as either in-place mutation or filter-to-new-list. LLMs inconsistently choose between these, producing flaky failures where the binding value is the original unmodified collection. This pattern recurs across providers and suffix variants.
+- **1/N failures (flaky)**: Inherent LLM non-determinism. Use `--repeat 3` to distinguish from deterministic regressions.
+- **Deterministic failures** (N/N across repeats): Require prompt or code fix before merging.
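The interpretation rules above can be turned into two small triage helpers, for example in a CI wrapper. This is a sketch under assumptions: the function names are hypothetical, treating every exit code other than 0 and 100 as a crash is a simplification, and calling a single failing run "inconclusive" is an interpretation of the advice to use `--repeat 3`.

```python
def interpret_exit_code(code: int) -> str:
    """Map a promptfoo exit code to a coarse status."""
    if code == 0:
        return "passed"
    if code == 100:
        # Per the rule above: some assertions failed, not a system error.
        return "check results"
    return "eval crashed"  # assumption: any other nonzero code is a real error


def classify_failures(failures: int, repeats: int) -> str:
    """Classify one test case by how many of its repeated runs failed."""
    if repeats <= 0 or not 0 <= failures <= repeats:
        raise ValueError("need repeats >= 1 and 0 <= failures <= repeats")
    if failures == 0:
        return "pass"
    if repeats == 1:
        return "inconclusive"  # one run cannot separate flaky from deterministic
    if failures == repeats:
        return "deterministic"  # N/N: fix prompt or code before merging
    return "flaky"  # some but not all repeats failed


print(interpret_exit_code(100), classify_failures(1, 3), classify_failures(3, 3))
```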
+## Baseline workflow
+Run a full baseline when: (a) setting up the eval environment for the first time, (b) after a major production code change, or (c) when prior baselines are stale (> 2 weeks or across model version changes).
+1. Verify prerequisites for each backend (see Backend prerequisites above).
+2. Run all applicable configs in parallel with `--no-cache` and `-o outputs/baseline-{slug}.json`:
+   - `promptfooconfig.yaml` — regression across all available backends.
+   - `promptfooconfig-prompt-ab.yaml` — prompt/tool variant comparison.
+   - `promptfooconfig-suffix-ab.yaml` — suffix fragment comparison.
+   - `promptfooconfig-agents.yaml` — coding-agent backends (requires `codex` and `claude` CLIs).
+3. For backends without prerequisites, either skip the config or use `--filter-providers` to exclude unavailable providers.
+4. Create one evidence file per config under `evals/promptfoo/evidence/` with the `baseline-` slug prefix.
+## Commands
+See `CONTRIBUTING.md` "Prompt evaluation with promptfoo" for eval commands, config files, directory layout, and flags.
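The mutation-versus-filter ambiguity listed under "Eval interpretation" is easiest to see in plain Python. The sketch below uses no Nighthawk APIs. It shows the two readings of "remove negative numbers", and why a check on the original binding still sees the unmodified collection when the filter reading is chosen.

```python
def remove_negatives_in_place(numbers: list[int]) -> None:
    """Mutation reading: the caller's list object is changed."""
    numbers[:] = [n for n in numbers if n >= 0]


def remove_negatives_filtered(numbers: list[int]) -> list[int]:
    """Filter reading: a new list is returned and the input is untouched."""
    return [n for n in numbers if n >= 0]


values = [3, -1, 4, -5]
filtered = remove_negatives_filtered(values)
print(values)    # the original binding is unchanged under the filter reading
print(filtered)

mutated = [3, -1, 4, -5]
remove_negatives_in_place(mutated)
print(mutated)   # the original binding itself is updated under the mutation reading
```

An assertion that inspects `values` passes only under the mutation reading, which is why an instruction that leaves the choice open tends to produce flaky results.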

nighthawk_python-0.5.0/.claude/rules/tests.md ADDED

@@ -0,0 +1,27 @@
+---
+paths:
+  - "tests/**"
+---
+# Testing (pytest)
+## Directory roles
+| Directory | Purpose | Determinism | Cost |
+|---|---|---|---|
+| `tests/` | Pytest suite: unit tests (ScriptedExecutor) and integration tests (real LLM) | Unit: deterministic. Integration: non-deterministic but single-run. | Unit: free. Integration: API calls. |
+| `src/nighthawk/testing.py` | Test utility API for deterministic Natural-function tests (`ScriptedExecutor`, `CallbackExecutor`, and response factories). | Deterministic. | Free. |
+## Workflow
+### Scope boundary (pytest vs promptfoo)
+- Do not force prompt behavior validation into pytest-only checks.
+- When prompt rendering, system prompt text, suffix generation, or tool-exposure behavior changes, follow `.claude/rules/promptfoo.md`.
+### Python code changes (tools, executor, contracts)
+1. Write or update unit tests in `tests/` first.
+2. Prefer helpers from `nighthawk.testing` (for example `ScriptedExecutor`, `CallbackExecutor`, `pass_response`, `return_response`) when avoiding live LLM calls.
+3. Run `uv run pytest -q`.
+4. If the change affects prompt rendering or tool behavior, follow `.claude/rules/promptfoo.md` and run the relevant eval subset.

{nighthawk_python-0.4.0 → nighthawk_python-0.5.0}/.devcontainer/litellm-config.yaml RENAMED

@@ -30,7 +30,7 @@ model_list:
       additional_drop_params: ["context_management", "output_config"]
   - model_name: claude-haiku-*
     litellm_params:
-      model: openai/gpt-5-mini
+      model: openai/gpt-5.4-mini
       api_key: os.environ/OPENAI_API_KEY
       additional_drop_params: ["context_management", "output_config"]
 litellm_settings:

{nighthawk_python-0.4.0 → nighthawk_python-0.5.0}/.gitignore RENAMED

@@ -3,6 +3,12 @@
 *.local.json
 .vscode/
+# promptfoo evaluation outputs
+evals/promptfoo/outputs/
+evals/promptfoo/*.json
+evals/promptfoo/*.html
+evals/promptfoo/*.csv
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[codz]

{nighthawk_python-0.4.0 → nighthawk_python-0.5.0}/AGENTS.md RENAMED

@@ -51,13 +51,13 @@ Python 3.13+, `uv` for dependencies, `pytest` for tests. Prefer LSP-based toolin
 | `uv run python` | Investigate interactively |
 | `uv sync --all-extras --all-groups` | Install/sync dependencies |
 | `uv run ruff format .` | Format |
-| `uv run ruff check .` | Lint |
 | `uv run ruff check --fix .` | Auto-fix lint |
 | `uv run pyright` | Type check |
-| `uv run pytest` | Full test suite |
 | `uv run pytest -q` | Tests (quiet) |
-| `set -a; source .env; set +a; uv run pytest -q` | Integration tests |
+| `NIGHTHAWK_RUN_INTEGRATION_TESTS=1 uv run pytest -q` | Integration tests |
 `uv` hardlinking warnings do not indicate failure. Suppress: `export UV_LINK_MODE=copy`.
 Environment: `OPENAI_API_KEY` (OpenAI), `CODEX_API_KEY` (Codex).
+Promptfoo evaluation details (commands, configs, directory layout, flags): see `CONTRIBUTING.md` "Prompt evaluation with promptfoo".

{nighthawk_python-0.4.0 → nighthawk_python-0.5.0}/CHANGELOG.md RENAMED

@@ -7,6 +7,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.5.0]
+### Added
+- `evals/promptfoo/` evaluation harness for system prompt optimization using [promptfoo](https://www.promptfoo.dev/): custom Python provider, reusable assertions, and prompt variant A/B comparison support. See `CONTRIBUTING.md` for usage.
+- `docs/philosophy.md`: design philosophy and motivation behind Nighthawk.
+- `docs/practices.md`: practical patterns and binding function design guidance (extracted from tutorial).
+### Changed
+- Replaced `return_reference_path` with `return_expression` in step execution contract: return values are now specified as Python expressions evaluated against step locals/globals, consistent with `nh_eval`/`nh_assign` expression evaluation. This unblocks coding-agent backends (e.g. Claude Code CLI) that compute results via native tools without bridging values through `nh_assign`.
+- `nh_assign` now infers binding types from initial values when no explicit annotation is provided, enabling type-mismatch retry for unannotated write bindings.
+- Default `json_renderer_style` changed from `"strict"` to `"default"`, making truncation visible via `…` omission markers in prompt context and tool results.
+- Merged `nh_exec` into `nh_eval`: `nh_eval` now handles expression evaluation, function calls, and in-place mutation. `nh_exec` is removed.
+- Condensed system prompt: simplified tool selection guidance (single `nh_eval` tool), added execution order section, clarified tool result format.
+- Condensed step execution contract (outcome prompt suffix) for reduced token usage.
+- Improved `nh_assign` and `nh_eval` tool descriptions for LLM clarity.
+- Restructured documentation: rewrote `index.md`, `tutorial.md`, `for-coding-agents.md`; cross-referenced specification and practice guides.
+- Integration tests: replaced single `NIGHTHAWK_RUN_INTEGRATION_TESTS` gate with per-backend environment variables (`NIGHTHAWK_CODEX_INTEGRATION_TESTS`, `NIGHTHAWK_CLAUDE_SDK_INTEGRATION_TESTS`, `NIGHTHAWK_CLAUDE_CLI_INTEGRATION_TESTS`).
+### Removed
+- `nh_exec` tool (functionality absorbed by `nh_eval`).
+- Three redundant OpenAI integration tests from `test_llm_integration.py` (covered by promptfoo evaluation harness).
+- `pytest_sessionstart` credential-check hook (replaced by per-backend skip helpers).
 ## [0.4.0] - 2026-03-20
 ### Added
@@ -57,7 +80,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Step executor abstraction and provider integration foundation.
 - Core documentation and project scaffolding.
-[Unreleased]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.4.0...HEAD
+[Unreleased]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.5.0...HEAD
+[0.5.0]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.4.0...v0.5.0
 [0.4.0]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.3.1...v0.4.0
 [0.3.1]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.3.0...v0.3.1
 [0.3.0]: https://github.com/kurusugawa-computer/nighthawk-python/compare/v0.2.0...v0.3.0
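The 0.5.0 entry replaces the single integration-test gate with per-backend environment variables. A minimal sketch of such a gate check follows. The variable names come from the changelog; the backend keys, the helper name, and the rule that any value other than empty or `0` enables the tests are assumptions.

```python
import os
from collections.abc import Mapping

# Environment variable names as listed in the 0.5.0 changelog entry.
# The dictionary keys are hypothetical short names for the backends.
GATES = {
    "codex": "NIGHTHAWK_CODEX_INTEGRATION_TESTS",
    "claude-sdk": "NIGHTHAWK_CLAUDE_SDK_INTEGRATION_TESTS",
    "claude-cli": "NIGHTHAWK_CLAUDE_CLI_INTEGRATION_TESTS",
}


def integration_enabled(backend: str, environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when the gate variable for this backend is set to a truthy value."""
    value = environ.get(GATES[backend], "")
    return value.strip() not in ("", "0")  # assumed truthiness rule


print(integration_enabled("codex", {"NIGHTHAWK_CODEX_INTEGRATION_TESTS": "1"}))
```

In a pytest suite a check like this would typically feed a `skipif` marker, so that backends without credentials are skipped rather than reported as failures.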

{nighthawk_python-0.4.0 → nighthawk_python-0.5.0}/CONTRIBUTING.md RENAMED

@@ -70,6 +70,48 @@ OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 uv run pytest -q tests/integra
 Traces appear in the otel-tui terminal UI in real time.
+### Prompt evaluation with promptfoo
+System prompt, tool descriptions, and backend behavior are evaluated with [promptfoo](https://www.promptfoo.dev/). Requires Node.js (for `npx`). API keys must be loaded from `.env` first.
+```bash
+set -a; source .env; set +a
+PFOO="cd evals/promptfoo && PROMPTFOO_PYTHON=\"$(uv python find)\" npx promptfoo@latest"
+```
+| Command | Purpose |
+|---|---|
+| `eval $PFOO eval` | Full regression (all backends, all tests) |
+| `eval $PFOO eval -c promptfooconfig-prompt-ab.yaml` | Prompt A/B test (gpt-5.4-mini only) |
+| `eval $PFOO eval -c promptfooconfig-agents.yaml` | Coding agent backends (reduced test set) |
+| `eval $PFOO eval -c promptfooconfig.yaml --filter-pattern "P-BIND-001"` | Single test |
+| `eval $PFOO eval -c promptfooconfig.yaml --filter-providers "claude-code-cli"` | Single backend |
+| `eval $PFOO view` | Open results in browser |
+#### Config files (`evals/promptfoo/`)
+- `promptfooconfig.yaml` — Regression: winner prompt combo across openai-responses, claude-code-cli, and codex backends. All test cases.
+- `promptfooconfig-prompt-ab.yaml` — A/B testing: 4 prompt/tool variants on gpt-5.4-mini. All test cases.
+- `promptfooconfig-agents.yaml` — Coding agent only: claude-code-cli and codex with reduced test set.
+#### Directory layout
+- `provider.py` — Custom provider wrapping `AgentStepExecutor`. Handles tool preset installation, backend-specific model settings, and callable fixture resolution.
+- `prompts/` — System prompt variants (`eval_default.txt`, `eval_sequenced.txt`, `eval_mutation_aware.txt`, `eval_coding_agent.txt`, etc.).
+- `test_cases/` — YAML test suites: `binding_operations`, `tool_selection`, `outcome_kinds`, `edge_cases`, `loop_outcomes`, `multi_step`, `null_handling`, `tool_selection_core`.
+- `assertions/` — Custom Python assertions: `binding_value.py`, `outcome_kind.py`, `raise_message.py`.
+#### Adding tests
+Each test case YAML entry needs: `description` (with `P-SLUG-NNN` Id), `vars` (natural_program, input_bindings, output_binding_names), and `assert` list. Callable bindings use `"__callable:<key>"` resolved from `_CALLABLE_FIXTURE_REGISTRY` in `provider.py`.
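The required shape of a test case entry can be checked before running an eval. The validator below is a sketch, not part of the harness: the function name is hypothetical, the ID regex is inferred from the single example `P-BIND-001`, and the sample assertion is a placeholder.

```python
import re

REQUIRED_VARS = ("natural_program", "input_bindings", "output_binding_names")

# Assumed ID shape, inferred from the example "P-BIND-001".
TEST_ID = re.compile(r"P-[A-Z]+-\d{3}")


def validate_test_case(entry: dict) -> list[str]:
    """Return a list of problems found in one parsed test-case entry."""
    problems = []
    description = str(entry.get("description") or "")
    if not TEST_ID.search(description):
        problems.append("description lacks a P-SLUG-NNN ID")
    variables = entry.get("vars")
    if not isinstance(variables, dict):
        problems.append("vars must be a mapping")
    else:
        for name in REQUIRED_VARS:
            if name not in variables:
                problems.append(f"vars.{name} is missing")
    assertions = entry.get("assert")
    if not isinstance(assertions, list) or not assertions:
        problems.append("assert must be a non-empty list")
    return problems


example = {
    "description": "P-BIND-001: sum quantities into total",
    "vars": {
        "natural_program": "Set <:total> to the sum of <items>.",
        "input_bindings": {"items": [1, 2, 3]},
        "output_binding_names": ["total"],
    },
    "assert": [{"type": "python", "value": "file://assertions/binding_value.py"}],
}
print(validate_test_case(example))
```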
+#### Useful flags
+- `--no-cache` — Skip cache (required after changing provider.py or prompts).
+- `--filter-pattern "<regex>"` — Run only matching test descriptions.
+- `--filter-providers "<regex>"` — Run only matching provider labels.
+- `-o <path>.json` — Write structured results to file for analysis.
 ### Environment variables
 - `OPENAI_API_KEY`: Required for OpenAI integration tests (also requires `pydantic-ai-slim[openai]`).
@@ -160,7 +202,7 @@ def run(
     Example:
         ```python
         executor = AgentStepExecutor.from_configuration(
-            configuration=StepExecutorConfiguration(model="openai-responses:gpt-5-mini"),
+            configuration=StepExecutorConfiguration(model="openai-responses:gpt-5.4-mini"),
         )
         with nighthawk.run(executor):
             result = my_natural_function()

{nighthawk_python-0.4.0 → nighthawk_python-0.5.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: nighthawk-python
-Version: 0.4.0
+Version: 0.5.0
 Summary: An experimental Python library that embeds Natural blocks inside Python functions and executes them using an LLM.
 Project-URL: Repository, https://github.com/kurusugawa-computer/nighthawk-python
 Project-URL: Documentation, https://kurusugawa-computer.github.io/nighthawk-python/
@@ -33,7 +33,7 @@ Requires-Dist: mcp>=1.26; extra == 'codex'
 Description-Content-Type: text/markdown
 [![PyPI](https://img.shields.io/pypi/v/nighthawk-python)](https://pypi.org/project/nighthawk-python)
-![PyPI - Downloads](https://img.shields.io/pypi/dm/nighthawk-python)
+[![PyPI Stats](https://img.shields.io/pypi/dm/nighthawk-python)](https://pypistats.org/packages/nighthawk-python)
 [![license](https://img.shields.io/github/license/kurusugawa-computer/nighthawk-python.svg)](https://github.com/kurusugawa-computer/nighthawk-python/blob/main/LICENSE)
 [![issue resolution](https://img.shields.io/github/issues-closed-raw/kurusugawa-computer/nighthawk-python)](https://github.com/kurusugawa-computer/nighthawk-python/issues)
@@ -43,25 +43,32 @@ Description-Content-Type: text/markdown
 <img src="https://github.com/kurusugawa-computer/nighthawk-python/raw/main/docs/assets/nighthawk_logo-128x128.png" alt="nighthawk-logo" width="128px" margin="10px"></img>
 </div>
-Nighthawk is an experimental Python library exploring a clear separation between **hard control** (Python code) for strict procedure and deterministic flow, and **soft reasoning** (an LLM) for semantic interpretation inside small embedded "Natural blocks". It is a compact reimplementation of the core ideas of [Nightjar](https://github.com/psg-mit/nightjarpy).
+Nighthawk is an experimental Python library exploring a clear separation:
-## Quickstart
+- Use **hard control** (Python code) for strict procedure, verification, and deterministic flow.
+- Use **soft reasoning** (an LLM or coding agent) for semantic interpretation inside small embedded "Natural blocks".
-Prerequisites: Python 3.13+
+Python controls all flow; the LLM or coding agent is constrained to small Natural blocks with explicit input/output boundaries. The same mechanism handles lightweight LLM judgments ("classify this sentiment") and autonomous agent executions ("refactor this module and write tests"). See **[Philosophy](https://kurusugawa-computer.github.io/nighthawk-python/philosophy/)** for the full design rationale.
+This repository is a compact reimplementation of the core ideas of [Nightjar](https://github.com/psg-mit/nightjarpy).
-Install Nighthawk and a provider:
+## Installation
+Prerequisites: Python 3.13+
 ```bash
 pip install nighthawk-python pydantic-ai-slim[openai]
 ```
-Save as `quickstart.py`:
+For other providers, see [Providers](https://kurusugawa-computer.github.io/nighthawk-python/providers/).
+## Example
 ```py
 import nighthawk as nh
 step_executor = nh.AgentStepExecutor.from_configuration(
-    configuration=nh.StepExecutorConfiguration(model="openai-responses:gpt-5-mini")
+    configuration=nh.StepExecutorConfiguration(model="openai-responses:gpt-5.4-mini")
 )
 with nh.run(step_executor):
@@ -75,17 +82,22 @@ with nh.run(step_executor):
         return total
     print(calculate_total("three apples, a dozen eggs, and 5 oranges"))
+    # => 20
 ```
-Run with your API key:
+See the **[Quickstart](https://kurusugawa-computer.github.io/nighthawk-python/quickstart/)** for setup details, credentials, and troubleshooting.
-```bash
-export OPENAI_API_KEY=sk-xxxxxxxxx
-python quickstart.py
-# => 20
-```
+## Documentation
-For backends, credentials, model identifiers, and detailed guidance, see the [documentation site](https://kurusugawa-computer.github.io/nighthawk-python/).
+- **[Quickstart](https://kurusugawa-computer.github.io/nighthawk-python/quickstart/)** — Setup and first example.
+- **[Tutorial](https://kurusugawa-computer.github.io/nighthawk-python/tutorial/)** — Learn from first principles.
+- **[Practices](https://kurusugawa-computer.github.io/nighthawk-python/practices/)** — Guidelines, patterns, and testing.
+- **[Providers](https://kurusugawa-computer.github.io/nighthawk-python/providers/)** — LLM providers and configuration.
+- **[Coding agent backends](https://kurusugawa-computer.github.io/nighthawk-python/coding-agent-backends/)** — Claude Code and Codex integration.
+- **[Philosophy](https://kurusugawa-computer.github.io/nighthawk-python/philosophy/)** — Design rationale and positioning.
+- **[Design](https://kurusugawa-computer.github.io/nighthawk-python/design/)** — Canonical specification.
+- **[API Reference](https://kurusugawa-computer.github.io/nighthawk-python/api/)** — Auto-generated API documentation.
+- **[Roadmap](https://kurusugawa-computer.github.io/nighthawk-python/roadmap/)** — Future directions.
 ## Development & Contributing

nighthawk_python-0.5.0/README.md ADDED

@@ -0,0 +1,74 @@
+[![PyPI](https://img.shields.io/pypi/v/nighthawk-python)](https://pypi.org/project/nighthawk-python)
+[![PyPI Stats](https://img.shields.io/pypi/dm/nighthawk-python)](https://pypistats.org/packages/nighthawk-python)
+[![license](https://img.shields.io/github/license/kurusugawa-computer/nighthawk-python.svg)](https://github.com/kurusugawa-computer/nighthawk-python/blob/main/LICENSE)
+[![issue resolution](https://img.shields.io/github/issues-closed-raw/kurusugawa-computer/nighthawk-python)](https://github.com/kurusugawa-computer/nighthawk-python/issues)
+# Nighthawk
+<div align="center">
+<img src="https://github.com/kurusugawa-computer/nighthawk-python/raw/main/docs/assets/nighthawk_logo-128x128.png" alt="nighthawk-logo" width="128px" margin="10px"></img>
+</div>
+Nighthawk is an experimental Python library exploring a clear separation:
+- Use **hard control** (Python code) for strict procedure, verification, and deterministic flow.
+- Use **soft reasoning** (an LLM or coding agent) for semantic interpretation inside small embedded "Natural blocks".
+Python controls all flow; the LLM or coding agent is constrained to small Natural blocks with explicit input/output boundaries. The same mechanism handles lightweight LLM judgments ("classify this sentiment") and autonomous agent executions ("refactor this module and write tests"). See **[Philosophy](https://kurusugawa-computer.github.io/nighthawk-python/philosophy/)** for the full design rationale.
+This repository is a compact reimplementation of the core ideas of [Nightjar](https://github.com/psg-mit/nightjarpy).
+## Installation
+Prerequisites: Python 3.13+
+```bash
+pip install nighthawk-python pydantic-ai-slim[openai]
+```
+For other providers, see [Providers](https://kurusugawa-computer.github.io/nighthawk-python/providers/).
+## Example
+```py
+import nighthawk as nh
+step_executor = nh.AgentStepExecutor.from_configuration(
+    configuration=nh.StepExecutorConfiguration(model="openai-responses:gpt-5.4-mini")
+)
+with nh.run(step_executor):
+    @nh.natural_function
+    def calculate_total(items: str) -> int:
+        total = 0
+        """natural
+        Read <items> and set <:total> to the sum of all quantities mentioned.
+        """
+        return total
+    print(calculate_total("three apples, a dozen eggs, and 5 oranges"))
+    # => 20
+```
+See the **[Quickstart](https://kurusugawa-computer.github.io/nighthawk-python/quickstart/)** for setup details, credentials, and troubleshooting.
+## Documentation
+- **[Quickstart](https://kurusugawa-computer.github.io/nighthawk-python/quickstart/)** — Setup and first example.
+- **[Tutorial](https://kurusugawa-computer.github.io/nighthawk-python/tutorial/)** — Learn from first principles.
+- **[Practices](https://kurusugawa-computer.github.io/nighthawk-python/practices/)** — Guidelines, patterns, and testing.
+- **[Providers](https://kurusugawa-computer.github.io/nighthawk-python/providers/)** — LLM providers and configuration.
+- **[Coding agent backends](https://kurusugawa-computer.github.io/nighthawk-python/coding-agent-backends/)** — Claude Code and Codex integration.
+- **[Philosophy](https://kurusugawa-computer.github.io/nighthawk-python/philosophy/)** — Design rationale and positioning.
+- **[Design](https://kurusugawa-computer.github.io/nighthawk-python/design/)** — Canonical specification.
+- **[API Reference](https://kurusugawa-computer.github.io/nighthawk-python/api/)** — Auto-generated API documentation.
+- **[Roadmap](https://kurusugawa-computer.github.io/nighthawk-python/roadmap/)** — Future directions.
+## Development & Contributing
+See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, development commands, and contribution guidelines.
+## References
+- Nightjar (upstream concept): https://github.com/psg-mit/nightjarpy
