claude-dev-env 1.38.1 → 1.39.0
- package/CLAUDE.md +10 -36
- package/_shared/pr-loop/audit-reply-template.md +147 -0
- package/_shared/pr-loop/fix-protocol.md +25 -4
- package/_shared/pr-loop/gh-payloads.md +37 -50
- package/_shared/pr-loop/scripts/code_rules_gate.py +0 -60
- package/_shared/pr-loop/scripts/config/post_audit_thread_constants.py +189 -0
- package/_shared/pr-loop/scripts/post_audit_thread.py +947 -0
- package/_shared/pr-loop/scripts/tests/test_code_rules_gate.py +0 -19
- package/_shared/pr-loop/scripts/tests/test_post_audit_thread.py +923 -0
- package/_shared/pr-loop/scripts/tests/test_post_audit_thread_constants.py +127 -0
- package/_shared/pr-loop/state-schema.md +1 -1
- package/agents/clean-coder.md +2 -2
- package/bin/install.mjs +6 -7
- package/bin/install.test.mjs +8 -0
- package/commands/doc-gist.md +16 -0
- package/commands/plan.md +0 -2
- package/commands/review-plan.md +1 -1
- package/docs/CODE_RULES.md +122 -2
- package/hooks/blocking/bot_mention_comment_blocker.py +75 -0
- package/hooks/blocking/code_rules_enforcer.py +1143 -129
- package/hooks/blocking/convergence_gate_blocker.py +130 -0
- package/hooks/blocking/destructive_command_blocker.py +74 -0
- package/hooks/blocking/gh_body_arg_blocker.py +30 -0
- package/hooks/blocking/md_to_html_blocker.py +119 -0
- package/hooks/blocking/test_bot_mention_comment_blocker.py +131 -0
- package/hooks/blocking/test_code_rules_enforcer.py +21 -0
- package/hooks/blocking/test_code_rules_enforcer_any_exempt_files.py +70 -0
- package/hooks/blocking/test_code_rules_enforcer_any_imports_and_cast.py +92 -0
- package/hooks/blocking/test_code_rules_enforcer_banned_import_alias.py +143 -0
- package/hooks/blocking/test_code_rules_enforcer_banned_prefixes.py +152 -0
- package/hooks/blocking/test_code_rules_enforcer_bare_except.py +120 -0
- package/hooks/blocking/test_code_rules_enforcer_boundary_types.py +175 -0
- package/hooks/blocking/test_code_rules_enforcer_cap_meta.py +0 -1
- package/hooks/blocking/test_code_rules_enforcer_collection_prefix.py +50 -0
- package/hooks/blocking/test_code_rules_enforcer_docstring_format.py +255 -0
- package/hooks/blocking/test_code_rules_enforcer_inline_tuple_string_magic.py +130 -0
- package/hooks/blocking/test_code_rules_enforcer_stub_implementations.py +141 -0
- package/hooks/blocking/test_code_rules_enforcer_test_branching.py +143 -0
- package/hooks/blocking/test_code_rules_enforcer_thin_wrapper_files.py +169 -0
- package/hooks/blocking/test_code_rules_enforcer_todo_markers.py +99 -0
- package/hooks/blocking/test_code_rules_enforcer_typed_dict_pairs.py +141 -0
- package/hooks/blocking/test_convergence_gate_blocker.py +63 -0
- package/hooks/blocking/test_destructive_command_blocker.py +146 -0
- package/hooks/blocking/test_destructive_command_blocker_no_verify.py +102 -0
- package/hooks/blocking/test_gh_body_arg_blocker.py +45 -0
- package/hooks/blocking/test_md_to_html_blocker.py +317 -0
- package/hooks/config/any_type_config.py +7 -0
- package/hooks/config/banned_identifiers_constants.py +11 -0
- package/hooks/config/blocking_check_limits.py +38 -0
- package/hooks/config/bot_mention_comment_blocker_constants.py +20 -0
- package/hooks/config/code_rules_enforcer_constants.py +53 -0
- package/hooks/config/convergence_branch_constants.py +9 -0
- package/hooks/config/doc_gist_auto_publish_constants.py +18 -0
- package/hooks/config/html_companion_constants.py +20 -0
- package/hooks/config/inline_tuple_string_magic_constants.py +22 -0
- package/hooks/config/test_banned_identifiers_constants.py +17 -0
- package/hooks/hooks.json +28 -20
- package/hooks/pyproject.toml +69 -0
- package/hooks/validators/mypy_integration.py +47 -1
- package/hooks/validators/run_all_validators.py +3 -3
- package/hooks/validators/test_mypy_integration.py +50 -1
- package/hooks/workflow/doc_gist_auto_publish.py +144 -0
- package/hooks/workflow/md_to_html_companion.py +365 -0
- package/hooks/workflow/test_doc_gist_auto_publish.py +117 -0
- package/hooks/workflow/test_md_to_html_companion.py +452 -0
- package/package.json +1 -1
- package/rules/gh-body-file.md +2 -0
- package/scripts/Install-SweepEmptyDirs.ps1 +111 -0
- package/scripts/check.ps1 +106 -0
- package/scripts/config/timing.py +11 -0
- package/scripts/sweep_empty_dirs.py +138 -0
- package/scripts/sync_to_cursor/rules.py +1 -1
- package/scripts/test_sweep_empty_dirs.py +183 -0
- package/skills/_shared/pr-loop/prompts/pr-consistency-audit.xml +323 -0
- package/skills/_shared/pr-loop/scripts/_cli_utils.py +22 -0
- package/skills/_shared/pr-loop/scripts/_path_resolver.py +165 -0
- package/skills/_shared/pr-loop/scripts/_xml_utils.py +20 -0
- package/skills/_shared/pr-loop/scripts/build_audit_prompt.py +182 -0
- package/skills/_shared/pr-loop/scripts/build_fix_prompt.py +185 -0
- package/skills/_shared/pr-loop/scripts/config/__init__.py +0 -0
- package/skills/_shared/pr-loop/scripts/config/path_resolver_constants.py +78 -0
- package/skills/_shared/pr-loop/scripts/init_loop_state.py +135 -0
- package/skills/_shared/pr-loop/scripts/teardown_worktrees.py +175 -0
- package/skills/_shared/pr-loop/scripts/write_audit_outcomes.py +182 -0
- package/skills/_shared/pr-loop/scripts/write_fix_outcomes.py +206 -0
- package/skills/bugteam/CONSTRAINTS.md +21 -22
- package/skills/bugteam/EXAMPLES.md +3 -3
- package/skills/bugteam/PROMPTS.md +227 -67
- package/skills/bugteam/SKILL.md +114 -455
- package/skills/bugteam/reference/README.md +1 -1
- package/skills/bugteam/reference/audit-and-teammates.md +112 -39
- package/skills/bugteam/reference/audit-contract.md +4 -22
- package/skills/bugteam/reference/copilot-gap-analysis.md +8 -5
- package/skills/bugteam/reference/design-rationale.md +2 -2
- package/skills/bugteam/reference/github-pr-reviews.md +50 -57
- package/skills/bugteam/reference/obstacles/audit-assign-ids.md +13 -0
- package/skills/bugteam/reference/obstacles/audit-capture-excerpts.md +13 -0
- package/skills/bugteam/reference/obstacles/audit-walk-categories.md +13 -0
- package/skills/bugteam/reference/obstacles/audit-write-xml.md +13 -0
- package/skills/bugteam/reference/obstacles/fix-append-summary.md +13 -0
- package/skills/bugteam/reference/obstacles/fix-apply-fixes.md +13 -0
- package/skills/bugteam/reference/obstacles/fix-git-add-commit.md +13 -0
- package/skills/bugteam/reference/obstacles/fix-git-push.md +13 -0
- package/skills/bugteam/reference/obstacles/fix-post-reply.md +13 -0
- package/skills/bugteam/reference/obstacles/fix-publish-summary.md +13 -0
- package/skills/bugteam/reference/obstacles/fix-py-compile.md +13 -0
- package/skills/bugteam/reference/obstacles/fix-read-files.md +13 -0
- package/skills/bugteam/reference/obstacles/fix-resolve-thread.md +13 -0
- package/skills/bugteam/reference/obstacles/fix-test-suite.md +13 -0
- package/skills/bugteam/reference/obstacles/fix-violation-count.md +13 -0
- package/skills/bugteam/reference/obstacles/fix-write-xml.md +13 -0
- package/skills/bugteam/reference/team-setup.md +106 -9
- package/skills/bugteam/reference/teardown-publish-permissions.md +39 -8
- package/skills/bugteam/scripts/README.md +60 -0
- package/skills/bugteam/scripts/_claude_permissions_common.py +358 -0
- package/skills/bugteam/scripts/bugteam_code_rules_gate.py +976 -0
- package/skills/bugteam/scripts/bugteam_fix_hookspath.py +375 -0
- package/skills/bugteam/scripts/bugteam_preflight.py +294 -0
- package/skills/bugteam/scripts/config/bugteam_code_rules_gate_constants.py +25 -0
- package/skills/bugteam/scripts/config/bugteam_fix_hookspath_constants.py +26 -0
- package/skills/bugteam/scripts/config/bugteam_preflight_constants.py +35 -0
- package/skills/bugteam/scripts/config/claude_permissions_common_constants.py +20 -0
- package/skills/bugteam/scripts/config/probe_code_rules_enforcer_check_constants.py +12 -0
- package/skills/bugteam/scripts/config/windows_safe_rmtree_constants.py +7 -0
- package/skills/bugteam/scripts/grant_project_claude_permissions.py +175 -0
- package/skills/bugteam/scripts/probe_code_rules_enforcer_check.py +107 -0
- package/skills/bugteam/scripts/revoke_project_claude_permissions.py +220 -0
- package/skills/bugteam/scripts/test__claude_permissions_common.py +112 -0
- package/skills/bugteam/scripts/test_bugteam_code_rules_gate.py +400 -0
- package/skills/bugteam/scripts/test_bugteam_fix_hookspath.py +384 -0
- package/skills/bugteam/scripts/test_bugteam_preflight.py +268 -0
- package/skills/bugteam/scripts/test_claude_permissions_common.py +195 -0
- package/skills/bugteam/scripts/test_grant_project_claude_permissions.py +55 -0
- package/skills/bugteam/scripts/test_probe_code_rules_enforcer_check.py +76 -0
- package/skills/bugteam/scripts/test_revoke_project_claude_permissions.py +55 -0
- package/skills/bugteam/scripts/test_windows_safe_rmtree.py +108 -0
- package/skills/bugteam/scripts/windows_safe_rmtree.py +100 -0
- package/skills/bugteam/test_skill_additions.py +1 -11
- package/skills/code/SKILL.md +176 -0
- package/skills/doc-gist/SKILL.md +99 -0
- package/skills/doc-gist/references/examples/01-exploration-code-approaches.html +453 -0
- package/skills/doc-gist/references/examples/02-exploration-visual-designs.html +515 -0
- package/skills/doc-gist/references/examples/03-code-review-pr.html +638 -0
- package/skills/doc-gist/references/examples/04-code-understanding.html +491 -0
- package/skills/doc-gist/references/examples/05-design-system.html +629 -0
- package/skills/doc-gist/references/examples/06-component-variants.html +605 -0
- package/skills/doc-gist/references/examples/07-prototype-animation.html +455 -0
- package/skills/doc-gist/references/examples/08-prototype-interaction.html +396 -0
- package/skills/doc-gist/references/examples/09-slide-deck.html +592 -0
- package/skills/doc-gist/references/examples/10-svg-illustrations.html +492 -0
- package/skills/doc-gist/references/examples/11-status-report.html +528 -0
- package/skills/doc-gist/references/examples/12-incident-report.html +596 -0
- package/skills/doc-gist/references/examples/13-flowchart-diagram.html +395 -0
- package/skills/doc-gist/references/examples/14-research-feature-explainer.html +381 -0
- package/skills/doc-gist/references/examples/15-research-concept-explainer.html +368 -0
- package/skills/doc-gist/references/examples/16-implementation-plan.html +702 -0
- package/skills/doc-gist/references/examples/17-pr-writeup.html +595 -0
- package/skills/doc-gist/references/examples/18-editor-triage-board.html +573 -0
- package/skills/doc-gist/references/examples/19-editor-feature-flags.html +663 -0
- package/skills/doc-gist/references/examples/20-editor-prompt-tuner.html +722 -0
- package/skills/doc-gist/references/examples/README.md +5 -0
- package/skills/doc-gist/scripts/config/__init__.py +0 -0
- package/skills/doc-gist/scripts/config/gist_upload_constants.py +16 -0
- package/skills/doc-gist/scripts/gist_upload.py +177 -0
- package/skills/doc-gist/scripts/test_gist_upload.py +51 -0
- package/skills/findbugs/SKILL.md +68 -2
- package/skills/monitor-open-prs/SKILL.md +13 -32
- package/skills/monitor-open-prs/test_skill_contract.py +0 -11
- package/skills/pr-consistency-audit/SKILL.md +112 -0
- package/skills/pr-consistency-audit/reference/detection-rules.md +96 -0
- package/skills/pr-consistency-audit/reference/illustrations.md +78 -0
- package/skills/pr-converge/SKILL.md +227 -23
- package/skills/pr-converge/config/__init__.py +0 -0
- package/skills/pr-converge/config/constants.py +62 -0
- package/skills/pr-converge/reference/convergence-gates.md +138 -44
- package/skills/pr-converge/reference/examples.md +43 -11
- package/skills/pr-converge/reference/fix-protocol.md +6 -5
- package/skills/pr-converge/reference/ground-rules.md +5 -3
- package/skills/pr-converge/reference/multi-pr-orchestration.md +44 -19
- package/skills/pr-converge/reference/obstacles/fix-post-replies.md +13 -0
- package/skills/pr-converge/reference/obstacles/fix-publish-summary.md +13 -0
- package/skills/pr-converge/reference/obstacles/fix-push.md +13 -0
- package/skills/pr-converge/reference/obstacles/fix-read-filelines.md +13 -0
- package/skills/pr-converge/reference/obstacles/fix-reset-state.md +13 -0
- package/skills/pr-converge/reference/obstacles/fix-resolve-threads.md +13 -0
- package/skills/pr-converge/reference/obstacles/fix-spawn-clean-coder.md +13 -0
- package/skills/pr-converge/reference/obstacles/fix-stage-commit.md +13 -0
- package/skills/pr-converge/reference/obstacles/fix-trigger-bugbot.md +13 -0
- package/skills/pr-converge/reference/obstacles/fix-write-test.md +13 -0
- package/skills/pr-converge/reference/per-tick.md +90 -31
- package/skills/pr-converge/reference/state-schema.md +22 -1
- package/skills/pr-converge/reference/stop-conditions.md +9 -7
- package/skills/pr-converge/scripts/README.md +34 -46
- package/skills/pr-converge/scripts/check_bugbot_ci.py +174 -0
- package/skills/pr-converge/scripts/check_convergence.py +497 -0
- package/skills/pr-converge/scripts/check_pending_reviews.py +154 -0
- package/skills/pr-converge/scripts/config/pr_converge_constants.py +118 -0
- package/skills/pr-converge/scripts/fetch_copilot_reviews.py +134 -0
- package/skills/pr-converge/scripts/post_fix_reply.py +168 -0
- package/skills/pr-converge/workflows/schedule-wakeup-loop.md +5 -12
- package/skills/qbug/SKILL.md +132 -27
- package/skills/session-log/SKILL.md +216 -114
- package/skills/session-tidy/SKILL.md +1 -1
- package/skills/skill-builder/SKILL.md +138 -56
- package/skills/skill-builder/references/delegation-map.md +72 -113
- package/skills/skill-builder/references/progressive-disclosure.md +122 -0
- package/skills/skill-builder/references/self-audit-checklist.md +92 -0
- package/skills/skill-builder/references/skill-types.md +228 -0
- package/skills/skill-builder/references/thariq-x-post-skills.json +33 -0
- package/skills/skill-builder/templates/gap-analysis.md +15 -8
- package/skills/skill-builder/workflows/improve-skill.md +86 -57
- package/skills/skill-builder/workflows/new-skill.md +80 -168
- package/skills/skill-builder/workflows/polish-skill.md +78 -54
- package/skills/structure-prompt/SKILL.md +50 -0
- package/skills/structure-prompt/reference/adversarial-tuning.md +62 -0
- package/skills/structure-prompt/reference/block-classification.md +27 -0
- package/skills/structure-prompt/reference/canonical-case.md +48 -0
- package/skills/structure-prompt/reference/citation-depth.md +70 -0
- package/skills/structure-prompt/reference/cleanup.md +33 -0
- package/skills/structure-prompt/reference/constraints.md +33 -0
- package/skills/structure-prompt/reference/directives.md +37 -0
- package/skills/structure-prompt/reference/examples.md +72 -0
- package/skills/structure-prompt/reference/instantiation.md +51 -0
- package/skills/structure-prompt/reference/output-contract.md +72 -0
- package/skills/structure-prompt/reference/per-category.md +23 -0
- package/skills/structure-prompt/reference/persona.md +38 -0
- package/skills/structure-prompt/reference/research.md +33 -0
- package/skills/structure-prompt/reference/structure.md +28 -0
- package/agents/code-standards-agent.md +0 -93
- package/agents/groq-coder.md +0 -113
- package/agents/plan-executor.md +0 -226
- package/agents/project-docs-analyzer.md +0 -53
- package/agents/project-structure-organizer-agent.md +0 -72
- package/agents/skill-to-agent-converter.md +0 -370
- package/agents/skill-writer-agent.md +0 -470
- package/agents/user-docs-writer.md +0 -67
- package/agents/workflow-visual-documenter.md +0 -82
- package/commands/readability-review.md +0 -20
- package/hooks/mypy.ini +0 -2
- package/hooks/notification/attention_needed_notify.py +0 -71
- package/hooks/notification/claude_notification_handler.py +0 -67
- package/hooks/notification/notification_utils.py +0 -267
- package/hooks/notification/subagent_complete_notify.py +0 -381
- package/hooks/notification/test_attention_needed_notify.py +0 -47
- package/hooks/notification/test_claude_notification_handler.py +0 -54
- package/hooks/notification/test_notification_utils.py +0 -91
- package/hooks/notification/test_subagent_complete_notify.py +0 -79
- package/scripts/config/groq_bugteam_config.py +0 -230
- package/scripts/config/test_groq_bugteam_config.py +0 -83
- package/scripts/config/test_spec_implementer_prompt.py +0 -32
- package/scripts/groq_bugteam.README.md +0 -131
- package/scripts/groq_bugteam.py +0 -647
- package/scripts/groq_bugteam_dotenv.py +0 -40
- package/scripts/groq_bugteam_spec.py +0 -226
- package/scripts/test_groq_bugteam.py +0 -529
- package/scripts/test_groq_bugteam_apply_fix_from_spec.py +0 -426
- package/scripts/test_groq_bugteam_dotenv.py +0 -66
- package/scripts/test_groq_bugteam_spec.py +0 -338
- package/skills/bugteam/SKILL_EVALS.md +0 -309
- package/skills/dream/SKILL.md +0 -118
- package/skills/ingest/SKILL.md +0 -40
- package/skills/npm-creator/SKILL.md +0 -187
- package/skills/readability-review/SKILL.md +0 -127
- package/skills/resume-review/SKILL.md +0 -261
- package/skills/rule-audit/SKILL.md +0 -307
- package/skills/rule-creator/SKILL.md +0 -150
- package/skills/searching-obsidian-vault/SKILL.md +0 -131
- package/skills/skill-writer/REFERENCE.md +0 -284
- package/skills/skill-writer/SKILL.md +0 -222
- package/skills/tdd-team/SKILL.md +0 -128
@@ -1,338 +0,0 @@
-"""Coherence tests for groq_bugteam_spec module import surface.
-
-The behavioral contract for apply_fix_from_spec lives in
-test_groq_bugteam_apply_fix_from_spec.py; those tests pass whether the
-function is defined in groq_bugteam.py directly or re-exported from the
-spec module. This file exists solely so the spec module has a
-same-named test companion for filename-based test pairing.
-"""
-
-from __future__ import annotations
-
-import importlib.util
-import io
-import json
-import pathlib
-import sys
-import types
-
-import pytest
-
-
-def _load_spec_module():
-    scripts_directory = pathlib.Path(__file__).parent
-    sys.path.insert(0, str(scripts_directory))
-    sys.modules.pop("groq_bugteam_spec", None)
-    module_path = scripts_directory / "groq_bugteam_spec.py"
-    module_spec = importlib.util.spec_from_file_location(
-        "groq_bugteam_spec", module_path
-    )
-    loaded_module = importlib.util.module_from_spec(module_spec)
-    sys.modules["groq_bugteam_spec"] = loaded_module
-    module_spec.loader.exec_module(loaded_module)
-    return loaded_module
-
-
-groq_bugteam_spec = _load_spec_module()
-
-
-def test_is_spec_mode_invocation_detects_flag_value_pair():
-    assert groq_bugteam_spec.is_spec_mode_invocation(["--mode", "spec"]) is True
-    assert groq_bugteam_spec.is_spec_mode_invocation(["--mode", "pipeline"]) is False
-    assert groq_bugteam_spec.is_spec_mode_invocation([]) is False
-
-
-def _attach_required_groq_attributes(target_module: types.ModuleType) -> None:
-    target_module.call_groq_with_fallback = lambda *args, **kwargs: None
-    target_module.parse_json_object = lambda text: {}
-    target_module.preserve_trailing_newline = lambda original, updated: updated
-
-
-def test_resolver_prefers_registered_groq_bugteam_over_main(monkeypatch):
-    fake_groq_bugteam = types.ModuleType("groq_bugteam")
-    _attach_required_groq_attributes(fake_groq_bugteam)
-    monkeypatch.setitem(sys.modules, "groq_bugteam", fake_groq_bugteam)
-
-    resolved_module = groq_bugteam_spec.resolve_groq_bugteam_module()
-
-    assert resolved_module is fake_groq_bugteam
-
-
-def test_resolver_falls_back_to_main_when_groq_bugteam_absent(monkeypatch):
-    monkeypatch.delitem(sys.modules, "groq_bugteam", raising=False)
-    fake_main = types.ModuleType("__main__")
-    _attach_required_groq_attributes(fake_main)
-    monkeypatch.setitem(sys.modules, "__main__", fake_main)
-
-    resolved_module = groq_bugteam_spec.resolve_groq_bugteam_module()
-
-    assert resolved_module is fake_main
-
-
-def test_resolver_falls_back_to_main_when_registered_module_is_stub(monkeypatch):
-    stub_groq_bugteam = types.ModuleType("groq_bugteam")
-    stub_groq_bugteam.call_groq_with_fallback = lambda *args, **kwargs: None
-    monkeypatch.setitem(sys.modules, "groq_bugteam", stub_groq_bugteam)
-    complete_main = types.ModuleType("__main__")
-    _attach_required_groq_attributes(complete_main)
-    monkeypatch.setitem(sys.modules, "__main__", complete_main)
-
-    resolved_module = groq_bugteam_spec.resolve_groq_bugteam_module()
-
-    assert resolved_module is complete_main
-
-
-def test_resolver_raises_when_registered_module_missing_required_attributes(
-    monkeypatch,
-):
-    stub_groq_bugteam = types.ModuleType("groq_bugteam")
-    stub_groq_bugteam.call_groq_with_fallback = lambda *args, **kwargs: None
-    monkeypatch.setitem(sys.modules, "groq_bugteam", stub_groq_bugteam)
-    monkeypatch.delitem(sys.modules, "__main__", raising=False)
-
-    try:
-        groq_bugteam_spec.resolve_groq_bugteam_module()
-    except RuntimeError as resolver_error:
-        resolver_error_text = str(resolver_error)
-        assert "parse_json_object" in resolver_error_text
-        assert "preserve_trailing_newline" in resolver_error_text
-    else:
-        raise AssertionError("resolver should have raised RuntimeError")
-
-
-def test_resolver_raises_when_neither_module_available(monkeypatch):
-    monkeypatch.delitem(sys.modules, "groq_bugteam", raising=False)
-    placeholder_main = types.ModuleType("__main__")
-    monkeypatch.setitem(sys.modules, "__main__", placeholder_main)
-
-    try:
-        groq_bugteam_spec.resolve_groq_bugteam_module()
-    except RuntimeError as resolver_error:
-        resolver_error_text = str(resolver_error)
-        assert "groq_bugteam" in resolver_error_text
-    else:
-        raise AssertionError("resolver should have raised RuntimeError")
-
-
-FAKE_API_KEY = "gsk_test_placeholder_value"
-
-
-def _install_fake_groq_bugteam_module(monkeypatch, response_object):
-    """Register a minimal fake groq_bugteam module for resolver lookup."""
-
-    fake_module = types.ModuleType("groq_bugteam")
-
-    def fake_call(api_key, messages, temperature, max_completion_tokens):
-        return types.SimpleNamespace(
-            content=json.dumps(response_object),
-            model="fake-model",
-        )
-
-    def fake_parse_json_object(text):
-        return json.loads(text)
-
-    def fake_preserve_trailing_newline(original, updated):
-        if original.endswith("\n") and not updated.endswith("\n"):
-            return updated + "\n"
-        if not original.endswith("\n") and updated.endswith("\n"):
-            return updated[:-1]
-        return updated
-
-    fake_module.call_groq_with_fallback = fake_call
-    fake_module.parse_json_object = fake_parse_json_object
-    fake_module.preserve_trailing_newline = fake_preserve_trailing_newline
-    monkeypatch.setitem(sys.modules, "groq_bugteam", fake_module)
-    monkeypatch.setenv("GROQ_API_KEY", FAKE_API_KEY)
-
-
-def test_skipped_entry_missing_finding_index_does_not_crash(monkeypatch):
-    original_file = "alpha\nbeta\n"
-    spec_list = [
-        {
-            "finding_index": 4,
-            "severity": "P1",
-            "category": "J",
-            "file": "sample.py",
-            "target_line_start": 1,
-            "target_line_end": 1,
-            "intended_change": "rename alpha",
-            "replacement_code": "alpha_fixed",
-            "acceptance_criteria": ["alpha_fixed appears on line 1"],
-        }
-    ]
-    patched_file = "alpha_fixed\nbeta\n"
-    fake_response = {
-        "updated_content": patched_file,
-        "applied_finding_indexes": [4],
-        "skipped": [{"reason": "malformed entry without finding_index"}],
-        "acceptance_checks": [
-            {
-                "finding_index": 4,
-                "criterion": "alpha_fixed appears on line 1",
-                "met": True,
-            }
-        ],
-    }
-    _install_fake_groq_bugteam_module(monkeypatch, fake_response)
-
-    outcome = groq_bugteam_spec.apply_fix_from_spec(spec_list, original_file)
-
-    assert outcome["updated_content"] == patched_file
-    assert outcome["applied_finding_indexes"] == [4]
-
-
-def test_null_updated_content_falls_back_to_current_content(monkeypatch):
-    original_file = "alpha\nbeta\n"
-    spec_list = [
-        {
-            "finding_index": 0,
-            "severity": "P2",
-            "category": "E",
-            "file": "sample.py",
-            "target_line_start": 1,
-            "target_line_end": 1,
-            "intended_change": "no-op fallback",
-            "replacement_code": "alpha",
-            "acceptance_criteria": ["alpha remains on line 1"],
-        }
-    ]
-    fake_response = {
-        "updated_content": None,
-        "applied_finding_indexes": [],
-        "skipped": [
-            {
-                "finding_index": 0,
-                "reason": "Groq returned null updated_content",
-            }
-        ],
-        "acceptance_checks": [],
-    }
-    _install_fake_groq_bugteam_module(monkeypatch, fake_response)
-
-    outcome = groq_bugteam_spec.apply_fix_from_spec(spec_list, original_file)
-
-    assert outcome["updated_content"] == original_file
-
-
-def test_null_collection_fields_coerce_to_empty_lists(monkeypatch):
-    original_file = "alpha\n"
-    spec_list = [
-        {
-            "finding_index": 1,
-            "severity": "P2",
-            "category": "E",
-            "file": "sample.py",
-            "target_line_start": 1,
-            "target_line_end": 1,
-            "intended_change": "no-op",
-            "replacement_code": "alpha",
-            "acceptance_criteria": ["alpha remains"],
-        }
-    ]
-    fake_response = {
-        "updated_content": original_file,
-        "applied_finding_indexes": None,
-        "skipped": None,
-        "acceptance_checks": None,
-    }
-    _install_fake_groq_bugteam_module(monkeypatch, fake_response)
-
-    outcome = groq_bugteam_spec.apply_fix_from_spec(spec_list, original_file)
-
-    assert outcome["applied_finding_indexes"] == []
-    assert outcome["skipped"] == []
-    assert outcome["acceptance_checks"] == []
-
-
-def test_dict_collection_fields_coerce_to_empty_lists(monkeypatch):
-    original_file = "alpha\n"
-    spec_list = [
-        {
-            "finding_index": 2,
-            "severity": "P2",
-            "category": "E",
-            "file": "sample.py",
-            "target_line_start": 1,
-            "target_line_end": 1,
-            "intended_change": "no-op",
-            "replacement_code": "alpha",
-            "acceptance_criteria": ["alpha remains"],
-        }
-    ]
-    fake_response = {
-        "updated_content": original_file,
-        "applied_finding_indexes": {"not": "a list"},
-        "skipped": {"0": "not a list either"},
-        "acceptance_checks": {"also": "a dict"},
-    }
-    _install_fake_groq_bugteam_module(monkeypatch, fake_response)
-
-    outcome = groq_bugteam_spec.apply_fix_from_spec(spec_list, original_file)
-
-    assert outcome["applied_finding_indexes"] == []
-    assert outcome["skipped"] == []
-    assert outcome["acceptance_checks"] == []
-
-
-def test_non_string_updated_content_falls_back_to_current_content(monkeypatch):
-    original_file = "alpha\nbeta\n"
-    spec_list = [
-        {
-            "finding_index": 0,
-            "severity": "P2",
-            "category": "E",
-            "file": "sample.py",
-            "target_line_start": 1,
-            "target_line_end": 1,
-            "intended_change": "no-op fallback",
-            "replacement_code": "alpha",
-            "acceptance_criteria": ["alpha remains on line 1"],
-        }
-    ]
-    fake_response = {
-        "updated_content": {"unexpected": "dict instead of str"},
-        "applied_finding_indexes": [],
-        "skipped": [],
-        "acceptance_checks": [],
-    }
-    _install_fake_groq_bugteam_module(monkeypatch, fake_response)
-
-    outcome = groq_bugteam_spec.apply_fix_from_spec(spec_list, original_file)
-
-    assert outcome["updated_content"] == original_file
-
-
-def test_run_spec_mode_main_emits_error_json_on_missing_api_key(
-    monkeypatch, capsys
-):
-    monkeypatch.delenv("GROQ_API_KEY", raising=False)
-    monkeypatch.setattr(
-        "groq_bugteam_dotenv.load_claude_dev_env_dotenv_file",
-        lambda: None,
-    )
-    spec_payload = {
-        "spec": [
-            {
-                "finding_index": 0,
-                "severity": "P1",
-                "category": "J",
-                "file": "sample.py",
-                "target_line_start": 1,
-                "target_line_end": 1,
-                "intended_change": "noop",
-                "replacement_code": "noop",
-                "acceptance_criteria": ["noop"],
-            }
-        ],
-        "current_content": "noop\n",
-    }
-    monkeypatch.setattr("sys.stdin", io.StringIO(json.dumps(spec_payload)))
-
-    with pytest.raises(SystemExit) as exit_info:
-        groq_bugteam_spec.run_spec_mode_main()
-
-    captured = capsys.readouterr()
-    emitted_outcome = json.loads(captured.out)
-    assert "error" in emitted_outcome
-    assert "GROQ_API_KEY" in emitted_outcome["error"]
-    assert exit_info.value.code != 0
@@ -1,309 +0,0 @@
-# Bugteam — Evaluation Suite
-
-Evaluation-driven iteration set for the `bugteam` skill, following [Anthropic — Agent Skills best practices: evaluation and iteration](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices#evaluation-and-iteration).
-
-## Methodology
-
-Evals are split into two layers. Both layers run against the same trace but carry different failure semantics.
-
-**Layer A — Ironclad invariants.** Order-and-presence rules that MUST hold on every run regardless of fixture, regardless of model choice, regardless of the exact number of loops taken. Citations use **section headings and companion files** (`SKILL.md`, `CONSTRAINTS.md`, `reference/*.md`) — not fragile line numbers — so layout edits to `SKILL.md` do not invalidate the contract. If an assertion fails, either the run diverged from the skill or the cited text is ambiguous and needs patching.
-
-**Layer B — Fixture-dependent expectations.** The concrete tool trace predicted for a specific fixture (fixed PR state, canned audit XML, canned fix XML). Layer B is prediction — reality may diverge in small ways (extra `Bash("git rev-parse HEAD")` checkpoints the lead inserts for sanity; retry loops on transient failures; consolidated cleanup calls) without indicating a skill defect. Layer B failures trigger reconciliation, not auto-failure.
-
-**Process note.** This document was drafted before running a real trace. Layer B predictions are labeled *predicted*, not *observed*. On the first real run, every Layer B prediction is reconciled against the observed trace and the diffs written back here — that reconciliation is Cycle 0 of the iteration protocol below.
-
-## Ironclad invariants (Layer A, apply to every eval)
-
-Each invariant cites the normative section or companion file it derives from. All spawns use `Agent(..., run_in_background=true)`. Invariants apply uniformly across all eval fixtures.
-
-| # | Invariant | Citation |
-|---|---|---|
-| I-1 | `Bash` invoking `scripts/grant_project_claude_permissions.py` precedes the first audit `Agent` spawn. | `SKILL.md` § Step 0 |
-| I-2 | `Bash` invoking `scripts/revoke_project_claude_permissions.py` runs exactly once per invocation on every exit path, after teardown. | `SKILL.md` § Step 5 |
-| I-3 | Orchestration uses `Agent(..., run_in_background=true)` only — no `TeamCreate`, `TeamDelete`, `SendMessage`, or `Task` tool calls. | `SKILL.md` § Step 2; § Step 4 |
-| I-4 | `Agent` calls are fresh per loop (`run_in_background=true`; new `name` each loop). | `CONSTRAINTS.md` — **Fresh subagent per loop** |
-| I-5 | Audit sibling spawns pass `model="haiku"`; validator and fix spawns pass `model="opus"`. | `SKILL.md` § AUDIT action (parallel auditors); § FIX action; `CONSTRAINTS.md` — **Opus 4.7 at xhigh effort for validator and fix subagents** |
|
|
22
|
-
| I-2 | `Bash` invoking `scripts/revoke_project_claude_permissions.py` runs exactly once per invocation on every exit path, after teardown. | `SKILL.md` § Step 5 |
|
|
23
|
-
| I-3 | Orchestration uses `Agent(..., run_in_background=true)` only — no `TeamCreate`, `TeamDelete`, `SendMessage`, or `Task` tool calls. | `SKILL.md` § Step 2; § Step 4 |
|
|
24
|
-
| I-4 | `Agent` calls are fresh per loop (`run_in_background=true`; new `name` each loop). | `CONSTRAINTS.md` — **Fresh subagent per loop** |
|
|
25
|
-
| I-5 | Audit sibling spawns pass `model="haiku"`; validator and fix spawns pass `model="opus"`. | `SKILL.md` § AUDIT action (parallel auditors); § FIX action; `CONSTRAINTS.md` — **Opus 4.7 at xhigh effort for validator and fix subagents** |
|
|
26
|
-
| I-6 | Loop count ≤ 10 audits. 11th audit never fires. | `SKILL.md` YAML `description` (10-loop cap); § Step 3 (**Pre-audit** / **FIX** increment rules) |
|
|
27
|
-
| I-7 | From loop 4 onward without convergence, eleven parallel `Agent(..., run_in_background=true)` calls in one message for audit. | `SKILL.md` § AUDIT action (**Parallel auditors**) |
|
|
28
|
-
| I-8 | Lead reads `.bugteam-pr<N>-loop<L>.outcomes.xml` with the `Read` tool after each audit, before the next action. | `SKILL.md` § AUDIT action |
|
|
29
|
-
| I-9 | Teardown sequence: `git worktree remove` each PR → `rmtree` `<run_temp_dir>` → Step 4.5 → revoke. | `SKILL.md` § Step 4; § Step 4.5; § Step 5 |
|
|
30
|
-
| I-10 | The bugfind subagent posts ONE per-loop review; the bugfix subagent posts fix replies. The lead's only PR-write action is the Step 4.5 description rewrite. | `CONSTRAINTS.md` — **Audit/fix comment posting** |
|
|
31
|
-
|
|
32
|
-
Any eval failing one or more Layer A invariants fails the run.
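
Some of these invariants reduce to simple predicates over a recorded call list. The sketch below is illustrative only: the `ToolCall` record, its `tool` and `args` fields, and the function names are assumptions, since the harness does not exist yet. It covers I-1, the count half of I-2 (not the "after teardown" ordering), and I-6, and it assumes a single PR per run.

```python
import re
from dataclasses import dataclass, field

GRANT_SCRIPT = "grant_project_claude_permissions.py"
REVOKE_SCRIPT = "revoke_project_claude_permissions.py"
# Assumed naming scheme, taken from the predicted traces: bugfind-pr<N>-loop<L>[-a..k]
AUDIT_NAME = re.compile(r"^bugfind-pr\d+-loop(\d+)")


@dataclass
class ToolCall:
    """Hypothetical record of one intercepted tool call."""
    tool: str
    args: dict = field(default_factory=dict)


def _is_bash_running(call: ToolCall, script_name: str) -> bool:
    return call.tool == "Bash" and script_name in call.args.get("command", "")


def check_grant_precedes_first_agent(trace: list[ToolCall]) -> bool:
    """I-1 (simplified): a grant call appears before any Agent spawn."""
    for call in trace:
        if _is_bash_running(call, GRANT_SCRIPT):
            return True
        if call.tool == "Agent":
            return False
    return True  # no Agent spawn at all, so the invariant holds vacuously


def check_revoke_runs_exactly_once(trace: list[ToolCall]) -> bool:
    """I-2 (count only): the revoke script is invoked exactly once."""
    return sum(_is_bash_running(call, REVOKE_SCRIPT) for call in trace) == 1


def check_audit_loop_cap(trace: list[ToolCall], cap: int = 10) -> bool:
    """I-6: audit spawns span at most `cap` distinct loop numbers."""
    loops = set()
    for call in trace:
        if call.tool != "Agent":
            continue
        match = AUDIT_NAME.match(call.args.get("name", ""))
        if match:
            loops.add(int(match.group(1)))
    return len(loops) <= cap
```

Keeping each invariant as its own function lets an eval opt in to the subset it lists under **Layer A invariants**.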
|
|
33
|
-
|
|
34
|
-
## Observation strategy
|
|
35
|
-
|
|
36
|
-
Evals run in a harness that intercepts the tool layer:
|
|
37
|
-
|
|
38
|
-
- A **mock tool layer** records each tool call with its arguments and returns synthetic responses matching the real tool's response shape. Nothing hits GitHub; no real teammates spawn.
|
|
39
|
-
- A **fixture repo** supplies deterministic git state and a mocked `gh` CLI that returns canned JSON for `pr view`, `pr diff`, and `api` calls.
|
|
40
|
-
- **Assertions** run against the recorded call list, not against real PR state.
|
|
41
|
-
|
|
42
|
-
The harness does not yet exist; this document defines its contract.
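
As a rough illustration of that contract, a recording mock might look like the following. Everything here is hypothetical: the class name, the `invoke` method, and the shape of the canned `Bash` response are assumptions, not an existing API.

```python
from typing import Any, Callable


class RecordingToolLayer:
    """Hypothetical mock: records every call and returns a canned response."""

    def __init__(self, canned: dict[str, Callable[[dict], Any]]):
        # Maps a tool name to a function that builds its synthetic response.
        self._canned = canned
        self.calls: list[tuple[str, dict]] = []

    def invoke(self, tool: str, **args: Any) -> Any:
        # Record first, so even unsupported calls show up in the trace.
        self.calls.append((tool, args))
        responder = self._canned.get(tool)
        if responder is None:
            raise KeyError(f"no canned response registered for tool {tool!r}")
        return responder(args)


# Usage: assertions then run against `layer.calls`, never against real PR state.
layer = RecordingToolLayer({"Bash": lambda args: {"stdout": "", "exit_code": 0}})
layer.invoke("Bash", command="git status --porcelain")
```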
|
|
43
|
-
|
|
44
|
-
---
|
|
45
|
-
|
|
46
|
-
## Eval 1 — Smoke: background subagent spawns fire correctly
|
|
47
|
-
|
|
48
|
-
**Scenario.** PR exists; PR is a clean target with no unusual pre-conditions.
|
|
49
|
-
|
|
50
|
-
**Trigger.** `/bugteam`
|
|
51
|
-
|
|
52
|
-
**Layer A invariants.** I-1, I-2, I-3, I-4, I-5, I-8, I-9, I-10.
|
|
53
|
-
|
|
54
|
-
**Layer B predicted trace (smoke).**
|
|
55
|
-
1. `Bash("python .../grant_project_claude_permissions.py")` runs (Step 0).
|
|
56
|
-
2. `Agent(subagent_type="code-quality-agent", name="bugfind-pr...-loop1", run_in_background=true, model="opus", ...)` spawned for AUDIT.
|
|
57
|
-
3. Lead awaits background-completion notification, then `Read(".bugteam-pr42-loop1.outcomes.xml")`.
|
|
58
|
-
4. `Agent(subagent_type="clean-coder", name="bugfix-pr...-loop1", run_in_background=true, model="opus", ...)` spawned for FIX (if findings).
|
|
59
|
-
5. `Bash("python .../revoke_project_claude_permissions.py")` on exit.
|
|
60
|
-
|
|
61
|
-
**Pass criteria.**
|
|
62
|
-
- Non-zero `Agent(subagent_type="code-quality-agent", run_in_background=true)` and `Agent(subagent_type="clean-coder", run_in_background=true)` calls.
|
|
63
|
-
|
|
64
|
-
---
|
|
65
|
-
|
|
66
|
-
## Eval 2 — Refusal: missing PR, no upstream diff
|
|
67
|
-
|
|
68
|
-
**Scenario.** Current branch is `main` with no PR and no upstream difference.
|
|
69
|
-
|
|
70
|
-
**Layer B predicted trace.**
|
|
71
|
-
1. `pull_request_read(method="get", pullNumber=N, owner=O, repo=R)` → fails / no matching PR.
|
|
72
|
-
2. `Bash("git merge-base HEAD origin/main")` → empty.
|
|
73
|
-
3. No grant script.
|
|
74
|
-
|
|
75
|
-
**Pass criteria.** Assistant message matches `No PR or upstream diff. /bugteam needs a target.`. Zero downstream tool calls.
|
|
76
|
-
|
|
77
|
-
---
|
|
78
|
-
|
|
79
|
-
## Eval 3 — Refusal: uncommitted changes in working tree
|
|
80
|
-
|
|
81
|
-
**Scenario.** Clean PR exists but `git status --porcelain` shows unstaged changes.
|
|
82
|
-
|
|
83
|
-
**Pass criteria.** Assistant message matches `Uncommitted changes detected. Stash, commit, or revert before /bugteam.`. Zero downstream tool calls.
|
|
84
|
-
|
|
85
|
-
---
|
|
86
|
-
|
|
87
|
-
## Eval 4 — Refusal: required subagent missing
|
|
88
|
-
|
|
89
|
-
**Scenario.** `code-quality-agent` is present in the available-agents list; `clean-coder` is not.
|
|
90
|
-
|
|
91
|
-
**Pass criteria.** Assistant message contains `Required subagent type clean-coder not installed.`. Zero grant script call, zero `Agent` spawns.
|
|
92
|
-
|
|
93
|
-
---
|
|
94
|
-
|
|
95
|
-
## Eval 5 — Happy path: converges in 2 loops
|
|
96
|
-
|
|
97
|
-
**Scenario.** PR #42 contains three P1 bugs all addressable by the mock fix subagent. Loop 1 audit returns 3 findings; loop 1 fix commits cleanly; loop 2 audit returns zero findings.
|
|
98
|
-
|
|
99
|
-
**Layer A invariants.** I-1, I-2, I-3, I-4, I-5, I-6, I-8, I-9, I-10.
|
|
100
|
-
|
|
101
|
-
**Layer B predicted trace.**
|
|
102
|
-
|
|
103
|
-
| # | Tool call | Source |
|
|
104
|
-
|---|---|---|
|
|
105
|
-
| 1 | `Bash("python .../scripts/grant_project_claude_permissions.py")` | `SKILL.md` § Step 0 |
|
|
106
|
-
| 2 | `pull_request_read(method="get", pullNumber=42, owner=..., repo=...)` | `SKILL.md` § Step 1 |
|
|
107
|
-
| 3 | `Bash("git -C \"<run_temp_dir>/pr-42/worktree\" rev-parse HEAD")` → captures `starting_sha` | `SKILL.md` § Step 2 — **Loop state** block |
|
|
108
|
-
| 4 | `Bash("mkdir -p <run_temp_dir>/pr-42")` | `SKILL.md` § AUDIT action |
|
|
109
|
-
| 5 | `pull_request_read(method="get_diff", pullNumber=42, owner=..., repo=...)` → write to `<run_temp_dir>/pr-42/loop-1.patch` | `SKILL.md` § AUDIT action |
|
|
110
|
-
| 6 | `Agent(subagent_type="code-quality-agent", name="bugfind-pr42-loop1", run_in_background=true, model="opus", description=..., prompt=<audit XML loop 1>)` | `SKILL.md` § AUDIT action |
|
|
111
|
-
| 7 | Lead awaits background-completion notification | `SKILL.md` § AUDIT action |
|
|
112
|
-
| 8 | `Read(".bugteam-pr42-loop1.outcomes.xml")` | `SKILL.md` § AUDIT action |
|
|
113
|
-
| 9 | `Agent(subagent_type="clean-coder", name="bugfix-pr42-loop1", run_in_background=true, model="opus", description=..., prompt=<fix XML loop 1>)` | `SKILL.md` § FIX action |
|
|
114
|
-
| 10 | Lead awaits background-completion notification | `SKILL.md` § FIX action |
|
|
115
|
-
| 11 | `Read(".bugteam-pr42-loop1.outcomes.xml")` — bugfix outcome XML | `SKILL.md` § FIX action |
|
|
116
|
-
| 12 | `Bash("git -C \"<run_temp_dir>/pr-42/worktree\" rev-parse HEAD")` → verify HEAD advanced | `SKILL.md` § FIX action (**Verify**) |
|
|
117
|
-
| 13 | `Bash("git -C \"<run_temp_dir>/pr-42/worktree\" fetch origin <branch>")` → fetch remote state | `SKILL.md` § FIX action (**Verify**) |
|
|
118
|
-
| 14 | `Bash("git -C \"<run_temp_dir>/pr-42/worktree\" rev-parse origin/<branch>")` → confirm matches HEAD | `SKILL.md` § FIX action (**Verify**) |
|
|
119
|
-
| 15 | `pull_request_read(method="get_diff", pullNumber=42, owner=..., repo=...)` → write to `<run_temp_dir>/pr-42/loop-2.patch` | `SKILL.md` § AUDIT action |
|
|
120
|
-
| 16 | `Agent(subagent_type="code-quality-agent", name="bugfind-pr42-loop2", run_in_background=true, ...)` (loop 2) | `SKILL.md` § AUDIT action |
|
|
121
|
-
| 17 | Lead awaits background-completion notification | `SKILL.md` § AUDIT action |
|
|
122
|
-
| 18 | `Read(".bugteam-pr42-loop2.outcomes.xml")` — zero findings | `SKILL.md` § AUDIT action |
|
|
123
|
-
| 19 | `Bash("git worktree remove \"<run_temp_dir>/pr-42/worktree\"")` | `SKILL.md` § Step 4 step 1 |
|
|
124
|
-
| 20 | `Bash("python -c \"...shutil.rmtree(r'<run_temp_dir>', ...)\"")` | `SKILL.md` § Step 4 step 2 (Windows-safe teardown) |
|
|
125
|
-
| 21 | `pull_request_read(method="get_diff", pullNumber=42, owner=..., repo=...)` → write to `.bugteam-final.diff` | `SKILL.md` § Step 4.5 step 1 |
|
|
126
|
-
| 22 | `pull_request_read(method="get", pullNumber=42, owner=..., repo=...)` → extract `.body`, write to `.bugteam-original-body.md` | `SKILL.md` § Step 4.5 step 2 |
|
|
127
|
-
| 23 | `Agent(subagent_type="pr-description-writer", description=..., prompt=<brief>)` | `SKILL.md` § Step 4.5 |
|
|
128
|
-
| 24 | `Write(".bugteam-final-body.md", <returned body>)` | `SKILL.md` § Step 4.5 step 4 |
|
|
129
|
-
| 25 | `update_pull_request(pullNumber=42, owner=..., repo=..., body=...)` | `SKILL.md` § Step 4.5 step 4 |
|
|
130
|
-
| 26 | `Bash("rm .bugteam-final.diff .bugteam-original-body.md .bugteam-final-body.md")` | `SKILL.md` § Step 4.5 step 5 |
|
|
131
|
-
| 27 | `Bash("python .../scripts/revoke_project_claude_permissions.py")` | `SKILL.md` § Step 5 |
|
|
132
|
-
|
|
133
|
-
**Pass criteria.**
|
|
134
|
-
- All Layer A invariants hold.
|
|
135
|
-
- Exactly 2 `Agent(name="bugfind-pr42-loop...")` calls, exactly 1 `Agent(name="bugfix-pr42-loop...")` call.
|
|
136
|
-
- Final report contains `/bugteam exit: converged` and `Loops: 2`.
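
The two countable criteria above could be checked with something like this sketch. The function name and its inputs (a flat list of recorded `Agent` names plus the final report text) are assumptions for illustration, and Layer A checks are deliberately left out.

```python
def eval5_counts_pass(agent_names: list[str], final_report: str) -> bool:
    """Check the spawn counts and report markers predicted for Eval 5."""
    audits = [name for name in agent_names if name.startswith("bugfind-pr42-loop")]
    fixes = [name for name in agent_names if name.startswith("bugfix-pr42-loop")]
    return (
        len(audits) == 2
        and len(fixes) == 1
        and "/bugteam exit: converged" in final_report
        and "Loops: 2" in final_report
    )
```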
|
|
137
|
-
|
|
138
|
-
**Process check after first real run.** Compare the observed trace against steps 1–27. Common expected divergences that should not fail the eval:
|
|
139
|
-
- Extra `Bash("git rev-parse HEAD")` calls the lead inserts for bookkeeping.
|
|
140
|
-
- Consolidated or split `Bash` calls (for example, the step 26 cleanup may split into two or three calls).
|
|
141
|
-
- Extra `Read` calls when the lead re-reads an outcome XML to quote specific findings.
|
|
142
|
-
- Reordered but still-Layer-A-compliant cleanup sequencing.
|
|
143
|
-
|
|
144
|
-
Patch this table to match observation and annotate each correction.
|
|
145
|
-
|
|
146
|
-
---
|
|
147
|
-
|
|
148
|
-
## Eval 6 — Stuck path: fix subagent produces no commit
|
|
149
|
-
|
|
150
|
-
**Scenario.** Loop 1 audit finds 2 P1 bugs; the mock fix subagent reports both as `could_not_address` (no commit created).
|
|
151
|
-
|
|
152
|
-
**Layer A invariants.** I-1, I-2, I-3, I-4, I-5, I-6, I-8, I-9, I-10. I-6 trivially holds.
|
|
153
|
-
|
|
154
|
-
**Layer B predicted trace.** Identical to Eval 5 steps 1–12 with this divergence:
|
|
155
|
-
- Step 11 bugfix outcome XML marks every finding `status="could_not_address"`.
|
|
156
|
-
- Step 12 `Bash("git rev-parse HEAD")` returns the pre-fix SHA unchanged.
|
|
157
|
-
- Skill sets exit reason = `stuck`, skips loop 2, and falls through to `rmtree`.
|
|
158
|
-
|
|
159
|
-
**Pass criteria.**
|
|
160
|
-
- Loop count stops at 1.
|
|
161
|
-
- Final report contains `/bugteam exit: stuck` and names the two unresolved findings.
|
|
162
|
-
- Steps 19–27 fire despite the stuck exit; I-2 and I-9 enforce this.
|
|
163
|
-
|
|
164
|
-
---
|
|
165
|
-
|
|
166
|
-
## Eval 7 — Cap reached: 10 loops, no convergence
|
|
167
|
-
|
|
168
|
-
**Scenario.** Mock audit returns one P2 finding every loop. Mock fix subagent always commits but never clears the finding.
|
|
169
|
-
|
|
170
|
-
**Layer A invariants.** All of I-1 through I-10.
|
|
171
|
-
|
|
172
|
-
**Layer B predicted behavior.**
|
|
173
|
-
- Loops 1–3: single `Agent(name="bugfind-pr<N>-loop<L>", run_in_background=true)` per loop.
|
|
174
|
-
- Loops 4–10: eleven parallel `Agent(name="bugfind-pr<N>-loop<L>-[a..k]", run_in_background=true)` in a single assistant message per loop (10 haiku + 1 opus validator); lead awaits the validator notification.
|
|
175
|
-
- Each loop produces one `Agent(name="bugfix-pr<N>-loop<L>", run_in_background=true)`.
|
|
176
|
-
- Exactly 10 audit phases, exactly 10 fix phases.
|
|
177
|
-
- Steps 19–27 from Eval 5 fire at teardown.
|
|
178
|
-
|
|
179
|
-
**Pass criteria.**
|
|
180
|
-
- I-6 holds: exactly 10 audit phases.
|
|
181
|
-
- I-7 holds: loops 4–10 each emit eleven audit `Agent` calls in a single assistant message.
|
|
182
|
-
- Final report contains `/bugteam exit: cap reached` and the remaining bug count.
|
|
183
|
-
|
|
184
|
-
**Process check.** The distinct `Agent(name=...)` audit-call count is a prediction. On the first real run, record the exact count and rewrite the formula here.
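
For reference, the predicted count follows directly from the Layer B bullets above: one audit spawn per loop for loops 1–3 and eleven per loop (ten haiku auditors plus the opus validator) for loops 4–10. The helper below is a hypothetical sketch of that arithmetic, not a measured result.

```python
def predicted_audit_agent_calls(
    total_loops: int = 10,
    first_parallel_loop: int = 4,
    parallel_width: int = 11,
) -> int:
    """Predicted number of audit Agent spawns for a run of `total_loops` loops."""
    single_loops = min(total_loops, first_parallel_loop - 1)
    parallel_loops = max(0, total_loops - single_loops)
    return single_loops + parallel_loops * parallel_width


# Cap-reached case: 3 * 1 + 7 * 11 = 80 predicted audit spawns.
print(predicted_audit_agent_calls())
```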
|
|
185
|
-
|
|
186
|
-
---
|
|
187
|
-
|
|
188
|
-
## Eval 8 — Clean on first audit
|
|
189
|
-
|
|
190
|
-
**Scenario.** Loop 1 audit returns zero findings.
|
|
191
|
-
|
|
192
|
-
**Layer A invariants.** I-1, I-2, I-3, I-4, I-5, I-6, I-8, I-9, I-10.
|
|
193
|
-
|
|
194
|
-
**Layer B predicted trace.** Eval 5 steps 1–8 and 19–27 only. There is no FIX phase because zero findings means the skill exits the loop at `last_action == "audited"` and `last_findings.total == 0`.
|
|
195
|
-
|
|
196
|
-
**Pass criteria.**
|
|
197
|
-
- Exactly 1 `Agent(subagent_type="code-quality-agent", run_in_background=true)` call, 0 fix agent spawns.
|
|
198
|
-
- Bugfind's outcome XML records zero findings; the per-loop review POST carries body `## /bugteam loop 1 audit: 0P0 / 0P1 / 0P2 → clean`.
|
|
199
|
-
- Step 4.5 and Step 5 still fire.
|
|
200
|
-
|
|
201
|
-
---
|
|
202
|
-
|
|
203
|
-
## Eval 9 — Anchor fallback: finding outside diff
|
|
204
|
-
|
|
205
|
-
**Scenario.** Loop 1 audit returns 3 findings; 1 anchors to a line outside the captured diff.
|
|
206
|
-
|
|
207
|
-
**Layer A invariants.** Same as Eval 5.
|
|
208
|
-
|
|
209
|
-
**Layer B predicted subagent-side behavior** (observed via the recorded `gh api ... /reviews` POST payload in the bugfind subagent fixture).
|
|
210
|
-
- `comments[]` length in the POST body = 2 (anchored findings only).
|
|
211
|
-
- Review body contains a `### Findings without a diff anchor` section listing the third finding.
|
|
212
|
-
- Bugfix outcome XML marks all 3 findings with a `reply_comment_url`; the unanchored finding's `used_fallback="true"` and `finding_comment_url` equals the parent review URL.
|
|
213
|
-
|
|
214
|
-
**Pass criteria.** Confirmed in the fixture's canned teammate outcome XML; Layer A invariants hold on the lead side.
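
The anchored/unanchored split predicted above can be sketched as follows. The finding fields `in_diff` and `summary` and the function name are hypothetical, and the payload is only a partial shape (`body` and `comments`); the full request schema still needs to be confirmed against the GitHub REST reference.

```python
UNANCHORED_HEADING = "### Findings without a diff anchor"


def build_review_payload(header: str, findings: list[dict]) -> dict:
    """Put anchored findings in comments[] and list the rest in the review body."""
    anchored = [finding for finding in findings if finding["in_diff"]]
    unanchored = [finding for finding in findings if not finding["in_diff"]]

    body_lines = [header]
    if unanchored:
        body_lines += ["", UNANCHORED_HEADING]
        body_lines += [
            f"- {finding['path']}:{finding['line']} {finding['summary']}"
            for finding in unanchored
        ]

    return {
        "body": "\n".join(body_lines),
        "comments": [
            {"path": finding["path"], "line": finding["line"], "body": finding["summary"]}
            for finding in anchored
        ],
    }
```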
|
|
215
|
-
|
|
216
|
-
---
|
|
217
|
-
|
|
218
|
-
## Eval 10 — Review POST failure fallback
|
|
219
|
-
|
|
220
|
-
**Scenario.** The first `POST /pulls/42/reviews` call from the bugfind teammate returns HTTP 422.
|
|
221
|
-
|
|
222
|
-
**Layer B predicted teammate-side behavior.**
|
|
223
|
-
- Bugfind teammate retries via the issue-comments endpoint `POST /repos/.../issues/42/comments` with a single body carrying the review header and every finding inline.
|
|
224
|
-
- Every finding's outcome XML carries `used_fallback="true"` and the issue-comment URL as `finding_comment_url`.
|
|
225
|
-
- Cycle continues to the FIX action without aborting.
|
|
226
|
-
|
|
227
|
-
**Open item for the real run.** The issue-comments fallback uses `add_issue_comment(owner=..., repo=..., issueNumber=42, body=...)` (`SKILL.md` § Step 2.5 **Review POST fails**; full narrative in `reference/github-pr-reviews.md` § **Review POST failure fallback**). Before running Eval 10 for real, confirm the teammate obeys this shape — the fixture must assert the `add_issue_comment` tool call.
|
|
228
|
-
|
|
229
|
-
---
|
|
230
|
-
|
|
231
|
-
## Eval 11 — Hook-blocked fix commit
|
|
232
|
-
|
|
233
|
-
**Scenario.** Bugfix stages edits but `git commit` fails because a `pre-commit` hook returns non-zero.
|
|
234
|
-
|
|
235
|
-
**Layer B predicted behavior.**
|
|
236
|
-
- Bugfix teammate outcome XML marks every finding `status="hook_blocked"` with populated `<hook_output>`.
|
|
237
|
-
- Bugfix teammate posts `Hook blocked the fix commit: <one-line summary>` to each finding comment.
|
|
238
|
-
- Lead's `Bash("git rev-parse HEAD")` after fix detects no SHA change → exit reason `stuck`.
|
|
239
|
-
- Steps 19–27 from Eval 5 fire at teardown.
|
|
240
|
-
|
|
241
|
-
**Pass criteria.** Layer A I-2 and I-9 hold. Final report contains `/bugteam exit: stuck` and surfaces the hook_output summary.
|
|
242
|
-
|
|
243
|
-
---
|
|
244
|
-
|
|
245
|
-
## Eval 12 — `pr-description-writer` unavailable, `general-purpose` available
|
|
246
|
-
|
|
247
|
-
**Scenario.** The available-agents list does not include `pr-description-writer` but does include `general-purpose`.
|
|
248
|
-
|
|
249
|
-
**Layer B predicted trace.** Eval 5 steps 1–22 identical; step 23 becomes:
|
|
250
|
-
|
|
251
|
-
```
|
|
252
|
-
Agent(subagent_type="general-purpose", description="Rewrite PR 42 body from cumulative diff", prompt=<same brief>)
|
|
253
|
-
```
|
|
254
|
-
|
|
255
|
-
Steps 24–27 follow normally.
|
|
256
|
-
|
|
257
|
-
**Pass criteria.** Exactly 1 `Agent(subagent_type="general-purpose", ...)` call for the description rewrite. `update_pull_request` fires. Final report carries no Step 4.5 skip warning.
|
|
258
|
-
|
|
259
|
-
---
|
|
260
|
-
|
|
261
|
-
## Eval 13 — Neither PR-description agent available
|
|
262
|
-
|
|
263
|
-
**Scenario.** Neither `pr-description-writer` nor `general-purpose` appears in the available-agents list.
|
|
264
|
-
|
|
265
|
-
**Layer B predicted trace.** Eval 5 steps 1–22, then skip steps 23–25. Steps 26–27 still fire.
|
|
266
|
-
|
|
267
|
-
**Pass criteria.**
|
|
268
|
-
- Zero `Agent` calls for PR description rewriting.
|
|
269
|
-
- Zero `update_pull_request` calls.
|
|
270
|
-
- Final report carries the Step 4.5 skip warning.
|
|
271
|
-
- Layer A I-2 holds: revoke still fires.
|
|
272
|
-
|
|
273
|
-
---
|
|
274
|
-
|
|
275
|
-
## Eval 14 — Permissions revoke on error path
|
|
276
|
-
|
|
277
|
-
**Scenario.** Bugfind subagent completes but writes no outcomes XML (the background-completion notification arrives with no file at the expected path).
|
|
278
|
-
|
|
279
|
-
**Layer B predicted trace.** Eval 5 steps 1–7, then:
|
|
280
|
-
- Lead awaits notification and calls `Read(".bugteam-pr42-loop1.outcomes.xml")` → file missing.
|
|
281
|
-
- Skill sets exit reason = `error: outcomes XML missing after bugfind loop 1`.
|
|
282
|
-
- All teardown steps (19–27 from Eval 5) fire.
|
|
283
|
-
|
|
284
|
-
**Pass criteria.** Final report surfaces the error and the loop number. Revoke fires despite the error.
|
|
285
|
-
|
|
286
|
-
---
|
|
287
|
-
|
|
288
|
-
## Iteration protocol
|
|
289
|
-
|
|
290
|
-
1. **Cycle 0 — Reconcile predictions with reality.** On the first real run, diff every Layer B predicted trace against the observed trace. Patch this file to match reality and annotate each correction with a reason.
|
|
291
|
-
2. **Baseline.** Run every eval with the skill unloaded. Record which cases the base model handles from memory versus which it gets wrong.
|
|
292
|
-
3. **Treatment.** Run every eval with the skill loaded. Layer A invariants must pass on every case. Layer B mismatches trigger Cycle 0 reconciliation.
|
|
293
|
-
4. **Regress on change.** Every edit to normative text in `SKILL.md`, `CONSTRAINTS.md`, `PROMPTS.md`, or `reference/*.md` sections that Layer A cites re-runs the full suite. A passing→failing transition on any Layer A invariant blocks the change. A Layer B mismatch after such an edit triggers a patch to the affected eval trace in the same commit.
|
|
294
|
-
5. **Extend on gotcha.** When the skill misfires in real use, add a new eval that reproduces the miss before patching the orchestration or companion files.
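
The Cycle 0 reconciliation step amounts to diffing two ordered call lists. A minimal sketch using the standard library's `difflib` is shown below; representing each trace entry as a one-line string, and the function name itself, are assumptions for illustration.

```python
import difflib


def reconcile_traces(predicted: list[str], observed: list[str]) -> list[str]:
    """Return a unified diff of the predicted versus observed trace lines."""
    return list(
        difflib.unified_diff(
            predicted,
            observed,
            fromfile="predicted",
            tofile="observed",
            lineterm="",
        )
    )
```

An empty result means the prediction already matches; any `+` or `-` line is a candidate correction to annotate in the affected eval.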
|
|
295
|
-
|
|
296
|
-
## Harness sketch (future work)
|
|
297
|
-
|
|
298
|
-
A minimal Python harness under `packages/claude-dev-env/skills/bugteam/evals/`:
|
|
299
|
-
|
|
300
|
-
- `harness.py` — loads a fixture, injects a mock tool layer that records calls and returns canned responses, invokes the lead with the trigger, collects the recorded trace, evaluates pass criteria.
|
|
301
|
-
- `fixtures/` — one subdirectory per eval with canned `gh` responses, canned audit XML, canned fix XML, and the expected trace JSON.
|
|
302
|
-
- `run_evals.py` — discovery + pass/fail reporting, exits non-zero on any failure for CI.
|
|
303
|
-
- `invariants.py` — the Layer A assertion bank, imported by every fixture.
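
A possible shape for the discovery and reporting half of `run_evals.py` is sketched below. The function names, the one-directory-per-eval layout check, and the `PASS`/`FAIL` output format are assumptions; actually running an eval is left out because the harness is not yet defined.

```python
from pathlib import Path


def discover_fixtures(fixtures_root: Path) -> list[Path]:
    """Return one directory per eval, in a stable (sorted) order."""
    return sorted(path for path in fixtures_root.iterdir() if path.is_dir())


def report(results: dict[str, bool]) -> int:
    """Print one line per eval and return a CI-friendly exit code."""
    for name, passed in sorted(results.items()):
        print(f"{'PASS' if passed else 'FAIL'} {name}")
    return 0 if all(results.values()) else 1
```

A caller would typically pass the return value of `report` to `sys.exit` so that any failing eval fails the CI job.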
|
|
304
|
-
|
|
305
|
-
## Open research items flagged during this pass
|
|
306
|
-
|
|
307
|
-
1. **GitHub REST review-POST payload shape.** Eval 9 and Eval 10 depend on the exact body shape of `POST /pulls/<number>/reviews`. The `jq -n --rawfile ... --argjson ... | gh api ... --input -` fence lives in `SKILL.md` § Step 2.5 (**Review POST**); expanded copy in `reference/github-pr-reviews.md` § **Per-loop review**. Before running Eval 9/10 for real, fetch the current GitHub REST reference to confirm the request schema (fields `commit_id`, `event`, `body`, `comments[]`) and the multi-line anchor `{path, start_line, start_side, line, side, body}` shape still apply. Record the confirmed version and URL here.
|
|
308
|
-
2. **Background subagent completion signal.** Real-run observation (loop 1 of eval run 2026-04-18) confirmed: background subagents self-terminate when their task is complete — the background-completion notification arrives and the lead reads the outcomes XML. No shutdown handshake required. `SKILL.md` § AUDIT / FIX actions document this flow. Layer A **I-4** encodes “fresh subagent per loop.”
|
|
309
|
-
3. **Model override redundancy.** `clean-coder` pins `model: opus` in its agent definition, while `code-quality-agent` currently uses `model: inherit`. The explicit `model="opus"` in every spawn is insurance against frontmatter drift; on the first real run, confirm the resolved model is `claude-opus-4-7` and that effort defaults to `xhigh` (Claude Code shows the active effort next to the spinner per the model-config docs). If a teammate's frontmatter ever pins a non-default `effort:` value, that frontmatter overrides the model default for that subagent (https://code.claude.com/docs/en/model-config — *"Frontmatter effort applies when that skill or subagent is active, overriding the session level but not the environment variable."*).
|