PyPI - proofagent-harness - Versions diffs - 0.1.0__tar.gz - Mend

proofagent-harness 0.1.0__tar.gz


Files changed (136)

proofagent_harness-0.1.0/.github/FUNDING.yml ADDED

@@ -0,0 +1,7 @@
+# Optional: enables the "Sponsor" button on the GitHub repo page.
+# Uncomment and fill in the platform(s) you accept funding through.
+# github: [ProofAgent-ai]
+# open_collective: proofagent
+# patreon: proofagent
+# custom: ["https://proofagent.ai/sponsor"]

proofagent_harness-0.1.0/.github/ISSUE_TEMPLATE/bug_report.md ADDED

@@ -0,0 +1,42 @@
+---
+name: Bug report
+about: A run failed, scores look wrong, or the harness behaved unexpectedly
+title: "[bug] "
+labels: bug
+---
+## What happened
+<!-- One-paragraph description. Include the failure mode (crash / wrong score /
+hang / etc.) and what you expected to see instead. -->
+## Reproduce
+```bash
+# The exact command you ran:
+python examples/01_quickstart.py --turns N --consensus delphi --llm <model>
+```
+```python
+# Or: the minimal Python snippet that reproduces:
+from proofagent_harness import Harness, AgentContext
+report = Harness(...).evaluate(my_agent, ...)
+```
+## Output / scorecard / traceback
+```
+<paste the full output, including any traceback or the scorecard JSON>
+```
+## Environment
+- OS: <macOS / Linux / Windows + version>
+- Python: <`python --version`>
+- proofagent-harness: <`pip show proofagent-harness | grep Version`>
+- LLM provider + model: <e.g. `--llm claude-sonnet-4-6` via Anthropic SDK X.Y.Z>
+## Anything else
+<!-- Workarounds you tried, related runs that DID work, screenshots of
+weird scorecards, etc. -->
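
Note on the Environment section of this template: it asks reporters for the OS, the Python version, and the installed `proofagent-harness` version. A minimal sketch of collecting those values in one step is shown below, using only the standard library. The helper name `collect_environment` is hypothetical and is not part of the package; the distribution name is taken from the `pip show` command in the template.

```python
import platform
from importlib import metadata


def collect_environment(package: str = "proofagent-harness") -> dict[str, str]:
    """Gather the values the bug report template asks for.

    `collect_environment` is an illustrative helper, not a harness API.
    """
    try:
        # Reads the version from installed distribution metadata,
        # the same information `pip show <package>` reports.
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "not installed"
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        package: version,
    }


if __name__ == "__main__":
    # Print in the same bullet style the template uses.
    for key, value in collect_environment().items():
        print(f"- {key}: {value}")
```

The LLM provider and model still have to be filled in by hand, since they depend on how the run was invoked.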

proofagent_harness-0.1.0/.github/ISSUE_TEMPLATE/calibration_concern.md ADDED

@@ -0,0 +1,43 @@
+---
+name: Calibration concern
+about: A run produced a score that doesn't match your expectation of the agent
+title: "[calibration] "
+labels: calibration
+---
+## The discrepancy
+<!-- "I ran X and got score Y, but I expected Z" — be concrete. -->
+- **Agent under test:** <role + model>
+- **Score reported:** <e.g. 9.6 GOLD>
+- **Expected score / certification:** <e.g. should be ~7 because...>
+## Did the calibration check pass?
+```bash
+python benchmarks/calibration_check.py --turns 15 --consensus delphi
+```
+- Hardened-agent score: ___
+- Weak-agent score: ___
+- Discrimination gap: ___ (≥3 = well-calibrated, 1.5-3 = some signal, <1.5 = not)
+## Did the harness fire any warnings?
+<!-- Plateau warning? Juror dissent? Limited-context cap? Defect counts?
+The Report.warnings list often points at the cause directly. Paste relevant
+ones here. -->
+## What you've tried to rule out
+- [ ] Cross-family judge (`--llm` from a different vendor than the agent)
+- [ ] Different turn count (longer = more discrimination signal)
+- [ ] `--consensus debate` (re-vote on dissent)
+- [ ] `--repeats N` for variance reduction (if implemented)
+- [ ] Re-checked the agent's `system_prompt` is being passed (no caps in warnings)
+## Run report
+<!-- Paste the full Report.warnings list, the per_metric scores, AND ideally
+the JSON file (or attach it) so we can re-analyze the per-turn audits. -->
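
Note on the discrimination gap in this template: the thresholds are stated as ≥3 (well-calibrated), 1.5-3 (some signal), and <1.5 (not calibrated). The sketch below applies those bands to a pair of scores. It assumes the gap is the hardened-agent score minus the weak-agent score, which the template implies but does not define; `classify_gap` is a hypothetical helper and not something `benchmarks/calibration_check.py` is known to expose.

```python
def classify_gap(hardened_score: float, weak_score: float) -> tuple[float, str]:
    """Return the discrimination gap and the band it falls into.

    Assumption: gap = hardened-agent score minus weak-agent score.
    The band boundaries come from the issue template text.
    """
    gap = hardened_score - weak_score
    if gap >= 3.0:
        label = "well-calibrated"
    elif gap >= 1.5:
        label = "some signal"
    else:
        label = "not calibrated"
    return gap, label


if __name__ == "__main__":
    # Example values only; these are not results from a real run.
    gap, label = classify_gap(9.0, 5.0)
    print(f"Discrimination gap: {gap:.1f} ({label})")
```

How the template intends the boundary value of exactly 3 in the "1.5-3" band to be treated is ambiguous; the sketch places it in "well-calibrated", following the "≥3" wording.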

proofagent_harness-0.1.0/.github/ISSUE_TEMPLATE/feature_request.md ADDED

@@ -0,0 +1,39 @@
+---
+name: Feature request
+about: Suggest a new trap, scoring rubric, conductor technique, or harness capability
+title: "[feature] "
+labels: enhancement
+---
+## What you want
+<!-- One-paragraph description of the capability. -->
+## Why it matters / what it catches
+<!-- What real-world failure mode does this address? Cite a documented incident,
+production bug, paper, or threat model if applicable. The harness is biased
+toward features that have empirical grounding. -->
+## Proposed implementation
+<!-- Where it lives in the pipeline:
+   - New trap → which family? src/proofagent_harness/data/traps/<family>/<name>.md
+   - New scoring criterion → which metric's rubric?
+                              src/proofagent_harness/data/skills/scoring/<metric>.md
+   - New conductor technique → conducting.md
+   - New defect detector → conductor.py:_detect_defects
+   - New schema field → schemas.py
+-->
+## Acceptance criteria
+<!-- How will we know it works? E.g.:
+- "On the weak agent, this trap should produce a SOFT_FAIL on at least
+   2 of the 3 jurors"
+- "On the calibration benchmark, the discrimination gap improves by X points"
+-->
+## Related issues / runs
+<!-- Link to runs (results/*.json), prior issues, papers, etc. -->

proofagent_harness-0.1.0/.github/PULL_REQUEST_TEMPLATE.md ADDED

@@ -0,0 +1,38 @@
+## What this changes
+<!-- One-paragraph summary. Cite the issue / discussion this addresses if any. -->
+## Pipeline stage(s) touched
+- [ ] Trap (new family or modification to existing trap)
+- [ ] Skill (planner / conductor / juror persona / scoring rubric)
+- [ ] Schema (`schemas.py`)
+- [ ] Conductor (defect detection, agent invocation)
+- [ ] Juror (audit protocol, lens, sharpener)
+- [ ] Reporter (warnings, summary, output)
+- [ ] CLI / examples
+- [ ] Tests / benchmarks
+- [ ] Docs / README
+## Tests
+- [ ] All existing tests still pass (`pytest tests/`)
+- [ ] New tests added for the change (state which file)
+- [ ] Calibration benchmark still produces gap ≥ 3.0 (or explained why not)
+## Discrimination impact
+<!-- If the change affects scoring, run the calibration check before AND after:
+     python benchmarks/calibration_check.py --turns 15
+     Report both numbers. -->
+| | Before | After |
+|---|---:|---:|
+| Hardened-agent score | | |
+| Weak-agent score | | |
+| Discrimination gap | | |
+## Anything reviewers should know
+<!-- Trade-offs you considered, design alternatives you rejected, edge cases
+you're unsure about, follow-up work you've punted on. -->

proofagent_harness-0.1.0/.github/workflows/ci.yml ADDED

@@ -0,0 +1,95 @@
+name: CI
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+permissions:
+  contents: read
+jobs:
+  test:
+    name: Test (Python ${{ matrix.python-version }})
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: "pip"
+      - name: Install package + dev dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+      - name: Lint (ruff)
+        run: ruff check src tests
+      - name: Type check (pyright)
+        run: pyright src
+        continue-on-error: true  # Pyright noisy on non-stdlib types; warn-only
+      - name: Run tests
+        run: pytest tests/ -v --tb=short
+      - name: Verify CLI works
+        run: |
+          proof --help
+          proof traps list | head -20
+  examples-parse:
+    name: Examples & benchmarks parse
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python 3.11
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: "pip"
+      - name: Install package
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+          pip install openai anthropic  # for examples
+      - name: Verify all examples parse
+        run: |
+          for f in examples/*.py; do
+            echo "Parsing: $f"
+            python -c "import ast; ast.parse(open('$f').read())"
+          done
+      - name: Verify benchmarks parse
+        run: |
+          for f in benchmarks/*.py; do
+            echo "Parsing: $f"
+            python -c "import ast; ast.parse(open('$f').read())"
+          done
+      - name: Verify all bundled markdown skills/traps/personas have valid frontmatter
+        run: |
+          python -c "
+          from proofagent_harness.loaders import load_skills, load_personas, load_trap_index
+          skills = load_skills()
+          personas = load_personas()
+          idx = load_trap_index()
+          print(f'Skills: {len(skills)}')
+          print(f'Personas: {len(personas)}')
+          stats = idx.stats()
+          print(f'Traps: {stats[\"total\"]} (universal={stats[\"universal\"]}, families={stats[\"families\"]})')
+          assert len(skills) >= 5, 'expected at least 5 bundled skills'
+          assert len(personas) >= 3, 'expected at least 3 bundled personas'
+          assert stats['total'] >= 30, 'expected at least 30 bundled traps'
+          "
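
Note on the `examples-parse` job: the two shell loops check that every file in `examples/` and `benchmarks/` parses, without executing it, by running `ast.parse` on each file in a fresh interpreter. A rough standalone equivalent is sketched below for running the same check locally. The function names `find_syntax_errors` and `main` are hypothetical and are not part of the package or the workflow. Unlike the shell loops, which abort at the first file that fails to parse, this sketch collects every failure before reporting.

```python
import ast
from pathlib import Path


def find_syntax_errors(directory: str) -> list[str]:
    """Parse every top-level *.py file in `directory` without running it.

    Returns one message per file that fails to parse.
    """
    failures = []
    for path in sorted(Path(directory).glob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            failures.append(f"{path}: line {exc.lineno}: {exc.msg}")
    return failures


def main(directories: tuple[str, ...] = ("examples", "benchmarks")) -> int:
    """Check each directory and return a process exit code (0 = all parsed)."""
    problems = [msg for d in directories for msg in find_syntax_errors(d)]
    for msg in problems:
        print(msg)
    return 1 if problems else 0
```

To use it as a script from the repository root, end the file with `raise SystemExit(main())`.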

proofagent_harness-0.1.0/.gitignore ADDED

@@ -0,0 +1,87 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# Virtualenv
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# Testing
+.pytest_cache/
+.coverage
+.coverage.*
+htmlcov/
+.tox/
+.hypothesis/
+coverage.xml
+*.cover
+# Type checking
+.mypy_cache/
+.pyright/
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+.DS_Store
+# Jupyter
+.ipynb_checkpoints/
+*.ipynb_checkpoints
+# Build
+docs/_build/
+site/
+# Secrets
+*.key
+*.pem
+.env.local
+.env.*.local
+# Eval outputs (from running examples and tests)
+results/
+recordings/
+*.report.json
+*.report.md
+results_*.json
+results_*.md
+results_*.html
+report.json
+report.md
+proofagent_report.*
+compliance_audit.*
+# Examples scratch dirs
+examples/my_agent_dir/
+examples/results/
+# Notebook execution artifacts (from scripts/test_notebooks.py --keep-output)
+notebooks/_executed_*.ipynb