@guilz-dev/sdlc-gh 0.1.0
- package/.github/CODEOWNERS +5 -0
- package/.github/ISSUE_TEMPLATE/bug_report.yml +68 -0
- package/.github/ISSUE_TEMPLATE/config.yml +1 -0
- package/.github/ISSUE_TEMPLATE/feature_request.yml +39 -0
- package/.github/ISSUE_TEMPLATE/support.yml +56 -0
- package/.github/ISSUE_TEMPLATE/task.yml +89 -0
- package/.github/agents/implementer.agent.md +17 -0
- package/.github/agents/reviewer.agent.md +18 -0
- package/.github/agents/triager.agent.md +13 -0
- package/.github/aw/actions-lock.json +9 -0
- package/.github/copilot-instructions.md +35 -0
- package/.github/hooks/hooks.json +12 -0
- package/.github/instructions/core.instructions.md +11 -0
- package/.github/instructions/profiles/go.instructions.md +10 -0
- package/.github/instructions/profiles/php.instructions.md +11 -0
- package/.github/instructions/profiles/python.instructions.md +11 -0
- package/.github/instructions/profiles/ruby.instructions.md +11 -0
- package/.github/instructions/profiles/typescript.instructions.md +11 -0
- package/.github/labels.yml +55 -0
- package/.github/pull_request_template.md +33 -0
- package/.github/ruleset.example.json +33 -0
- package/.github/ruleset.harness-eval.example.json +29 -0
- package/.github/skills/quality-loop/SKILL.md +23 -0
- package/.github/workflows/agent-retry-orchestrator.yml +161 -0
- package/.github/workflows/copilot-setup-steps.yml +64 -0
- package/.github/workflows/eval-ci.yml +169 -0
- package/.github/workflows/eval-drift.yml +75 -0
- package/.github/workflows/gh-aw-dogfood-ci.yml +73 -0
- package/.github/workflows/harness-ci.yml +244 -0
- package/.github/workflows/harness-sync.yml +28 -0
- package/.github/workflows/l1-readiness-check.yml +45 -0
- package/.github/workflows/labels-sync.yml +24 -0
- package/.github/workflows/nightly-harness-review.lock.yml +1643 -0
- package/.github/workflows/nightly-harness-review.md +87 -0
- package/.github/workflows/nightly-harness-review.yml +63 -0
- package/.github/workflows/npm-publish.yml +49 -0
- package/.github/workflows/pr-context-comment.yml +138 -0
- package/.github/workflows/product-ci-go.yml +33 -0
- package/.github/workflows/product-ci-php.yml +39 -0
- package/.github/workflows/product-ci-python.yml +34 -0
- package/.github/workflows/product-ci-ruby.yml +35 -0
- package/.github/workflows/product-ci-ts.yml +37 -0
- package/.github/workflows/task-issue-label-sync.yml +50 -0
- package/.github/workflows/weekly-redteam.lock.yml +1571 -0
- package/.github/workflows/weekly-redteam.md +76 -0
- package/.github/zizmor.yml +11 -0
- package/AGENTS.md +54 -0
- package/LICENSE +21 -0
- package/README.md +366 -0
- package/config/stacks.json +55 -0
- package/docs/adoption.md +126 -0
- package/docs/arch.md +535 -0
- package/docs/auth-boundaries.md +16 -0
- package/docs/coding-agent-l1.md +152 -0
- package/docs/exceptions/README.md +25 -0
- package/docs/exceptions/TEMPLATE.md +8 -0
- package/docs/failure-taxonomy.md +23 -0
- package/docs/gh-aw-dogfood.md +109 -0
- package/docs/kpi-baseline.md +9 -0
- package/docs/nightly-harness-review.md +94 -0
- package/docs/operations.md +108 -0
- package/docs/publishing.md +79 -0
- package/docs/revert-playbook.md +44 -0
- package/docs/shared-config.md +30 -0
- package/docs/telemetry-artifacts.md +78 -0
- package/docs/telemetry-schema.md +60 -0
- package/evals/.score-baseline.json +6 -0
- package/evals/e2e-bench/README.md +28 -0
- package/evals/e2e-bench/manifest.json +16 -0
- package/evals/e2e-bench/tasks/e2e-001.yml +10 -0
- package/evals/e2e-bench/tasks/e2e-002.yml +11 -0
- package/evals/e2e-bench/tasks/e2e-003.yml +10 -0
- package/evals/e2e-bench/tasks/e2e-004.yml +14 -0
- package/evals/e2e-bench/tasks/e2e-005.yml +11 -0
- package/evals/e2e-bench/tasks/e2e-006.yml +10 -0
- package/evals/e2e-bench/tasks/e2e-007.yml +10 -0
- package/evals/e2e-bench/tasks/e2e-008.yml +10 -0
- package/evals/e2e-bench/tasks/e2e-009.yml +10 -0
- package/evals/trajectories/rubric.md +12 -0
- package/evals/trajectories/test_harness_conventions.py +271 -0
- package/infra/README.md +49 -0
- package/infra/langfuse/docker-compose.yml +25 -0
- package/infra/otel/collector-config.yml +24 -0
- package/infra/samples/gh-aw-dogfood-report.json +44 -0
- package/infra/samples/harness-review-routing-plan.json +19 -0
- package/infra/samples/harness-review-summary.json +61 -0
- package/infra/samples/telemetry-artifact.json +29 -0
- package/infra/samples/telemetry-payload.json +19 -0
- package/package.json +85 -0
- package/prompts/triager-classify.prompt.yml +10 -0
- package/sample/go/add.go +5 -0
- package/sample/go/add_test.go +9 -0
- package/sample/go/go.mod +3 -0
- package/sample/php/composer.json +26 -0
- package/sample/php/composer.lock +1881 -0
- package/sample/php/phpunit.xml +8 -0
- package/sample/php/src/Add.php +13 -0
- package/sample/php/tests/AddTest.php +16 -0
- package/sample/python/requirements-dev.txt +2 -0
- package/sample/python/src/__init__.py +0 -0
- package/sample/python/src/greet.py +3 -0
- package/sample/python/tests/conftest.py +4 -0
- package/sample/python/tests/test_greet.py +5 -0
- package/sample/ruby/.rubocop.yml +10 -0
- package/sample/ruby/Gemfile +6 -0
- package/sample/ruby/Gemfile.lock +58 -0
- package/sample/ruby/lib/add.rb +9 -0
- package/sample/ruby/spec/add_spec.rb +11 -0
- package/sample/ts/biome.json +6 -0
- package/sample/ts/package-lock.json +1763 -0
- package/sample/ts/package.json +15 -0
- package/sample/ts/src/add.ts +3 -0
- package/sample/ts/tests/add.test.ts +8 -0
- package/sample/ts/tsconfig.json +12 -0
- package/scripts/aggregate-harness-review.mjs +48 -0
- package/scripts/bootstrap-harness.sh +411 -0
- package/scripts/check-diff-size.mjs +46 -0
- package/scripts/check-e2e-manifest.mjs +35 -0
- package/scripts/check-eval-score-drift.mjs +31 -0
- package/scripts/check-gh-aw-dogfood-scope.mjs +51 -0
- package/scripts/check-issue-spec.mjs +215 -0
- package/scripts/check-l1-readiness.mjs +82 -0
- package/scripts/check-open-pr-limit.mjs +34 -0
- package/scripts/doctor.mjs +177 -0
- package/scripts/emit-gh-aw-dogfood-report.mjs +112 -0
- package/scripts/emit-telemetry-artifact.mjs +99 -0
- package/scripts/fetch-telemetry-artifacts.mjs +176 -0
- package/scripts/harness-drift-report.mjs +99 -0
- package/scripts/lib/bootstrap-copy.mjs +123 -0
- package/scripts/lib/ccsd-contract.mjs +212 -0
- package/scripts/lib/diff-size.mjs +103 -0
- package/scripts/lib/doctor-local.mjs +179 -0
- package/scripts/lib/e2e-manifest.mjs +76 -0
- package/scripts/lib/gh-aw-dogfood.mjs +293 -0
- package/scripts/lib/github-config.mjs +94 -0
- package/scripts/lib/harness-ci-fragments.mjs +98 -0
- package/scripts/lib/harness-review-routing.mjs +244 -0
- package/scripts/lib/harness-review.mjs +388 -0
- package/scripts/lib/issue-form-label-sync.mjs +56 -0
- package/scripts/lib/l1-readiness.mjs +258 -0
- package/scripts/lib/merge-harness-package.mjs +36 -0
- package/scripts/lib/npm-package.mjs +129 -0
- package/scripts/lib/setup-wizard.mjs +224 -0
- package/scripts/lib/stacks.mjs +138 -0
- package/scripts/lib/telemetry-artifact.mjs +253 -0
- package/scripts/lib/template-root.mjs +39 -0
- package/scripts/merge-harness-package.mjs +14 -0
- package/scripts/route-harness-review.mjs +168 -0
- package/scripts/run-e2e-bench.mjs +216 -0
- package/scripts/sdlc-gh-cli.mjs +91 -0
- package/scripts/select-eval-jobs.mjs +41 -0
- package/scripts/setup-github.mjs +242 -0
- package/scripts/setup-github.sh +4 -0
- package/scripts/setup-wizard.mjs +426 -0
- package/scripts/test-bootstrap-guidance-scenarios.mjs +94 -0
- package/scripts/test-diff-size-scenarios.mjs +88 -0
- package/scripts/test-doctor-scenarios.mjs +70 -0
- package/scripts/test-e2e-manifest-scenarios.mjs +65 -0
- package/scripts/test-gh-aw-dogfood-scenarios.mjs +74 -0
- package/scripts/test-harness-review-routing-scenarios.mjs +130 -0
- package/scripts/test-harness-review-scenarios.mjs +92 -0
- package/scripts/test-hooks-scenarios.mjs +44 -0
- package/scripts/test-issue-form-label-sync-scenarios.mjs +48 -0
- package/scripts/test-issue-spec-scenarios.mjs +258 -0
- package/scripts/test-l1-readiness-scenarios.mjs +204 -0
- package/scripts/test-merge-harness-package-scenarios.mjs +53 -0
- package/scripts/test-npm-package-scenarios.mjs +31 -0
- package/scripts/test-sdlc-gh-cli-scenarios.mjs +54 -0
- package/scripts/test-setup-github-scenarios.mjs +103 -0
- package/scripts/test-setup-wizard-scenarios.mjs +114 -0
- package/scripts/test-telemetry-artifact-scenarios.mjs +69 -0
- package/scripts/trim-harness-ci.mjs +18 -0
- package/scripts/validate-gh-aw-compile.mjs +64 -0
- package/scripts/validate-harness.mjs +199 -0
- package/scripts/validate-telemetry.mjs +21 -0
- package/scripts/verify-bootstrap-stacks.sh +192 -0
package/docs/adoption.md
ADDED
@@ -0,0 +1,126 @@
# Adoption Guide

Apply this harness template to any repository.

## Prerequisites

- GitHub repository with Actions enabled
- GitHub Copilot (Business or Enterprise) for coding agent features
- Optional: self-hosted Langfuse for telemetry

## New repository

1. Use **GitHub Template repository** → Create new repository from this template.
2. Or run the wizard (recommended):

   ```bash
   cd /path/to/new-product
   npx @guilz-dev/sdlc-gh
   ```

3. Or run bootstrap manually, then wizard:

   ```bash
   git clone <harness-template-url> /tmp/harness
   /tmp/harness/scripts/bootstrap-harness.sh \
     --repo /path/to/new-product \
     --stack ts \
     --mode new \
     --codeowners-team @your-org/harness-engineers
   cd /path/to/new-product && npx @guilz-dev/sdlc-gh --yes --stack ts --codeowners @your-org/harness-engineers
   ```

4. Run `./scripts/setup-github.sh` to sync labels and create/update the `main-protection` ruleset.
5. *(Optional, Phase 3)* After eval CI is green in your org, run `./scripts/setup-github.sh --with-eval-ruleset` to create/update the `harness-pr-eval-required` ruleset. The template ruleset applies to all PRs targeting `main`; narrow conditions in GitHub Settings if you only want harness-asset PRs blocked. GitHub Models enablement is still required before `prompt-eval` can block merges.
6. Run `./scripts/doctor.mjs --strict` and fix any remaining failures.
7. Manual fallback: import `.github/ruleset.example.json` and apply `.github/labels.yml` if `gh` cannot be used.

## GitHub setup order

Apply in this order (or run `./scripts/setup-wizard.mjs` to perform steps 1–2 and verify with doctor):

1. **Labels sync** — `task:*` and `autonomy:*` from `.github/labels.yml`
2. **Main protection** — `main-protection` ruleset with harness + product CI checks
3. **Optional eval ruleset** — `--with-eval-ruleset` adds `harness-pr-eval-required` (eval CI checks only; does not enable GitHub Models). This ruleset targets `main` and requires `select` + `trajectory-conventions` on **all** PRs to that branch — enable only when your org accepts that cost, or narrow the ruleset conditions in GitHub Settings after creation.

### Setup wizard

`./scripts/setup-wizard.mjs` orchestrates Phase 0–1 install settings:

- writes `.harness-stack` (primary stack for rulesets)
- replaces the `CODEOWNERS` placeholder on **product repos** (skipped by default with `--template`)
- runs `setup-github.sh` (labels + rulesets)
- runs `doctor --strict` (pass `--template` for multi-stack template repos)

Non-interactive flags: `--yes`, `--stack`, `--codeowners`, `--github-repo`, `--with-eval-ruleset`, `--skip-github`, `--dry-run`, `--patch-codeowners` (opt-in CODEOWNERS replacement in template mode), `--force-bootstrap` (destructive; never with `--yes`).

## Behavior / spec corrections (template updates)

When pulling harness updates, review these intentional behavior alignments with [arch.md](arch.md):

| Area | Current spec | Legacy behavior |
|------|--------------|-----------------|
| `autonomy:L0` diff-size | Proposal only — no LOC/file gate | Some versions applied L1 limits with warn |
| L1 over-limit | Warn by default; opt-in hard-fail via `DIFF_SIZE_L1_HARD_FAIL=1` | — |

## Existing repository (phased)

| Step | Assets | Risk |
|------|--------|------|
| 1 | FF only: instructions, agents, hooks, templates | Low |
| 2 | `harness-ci.yml` + stack `product-ci` | Medium |
| 3 | Eval CI + ruleset eval required | Medium |
| 4 | Coding agent L1 on `task:docs` / `task:test-fix` (CC-SD contract required) | Low tasks first |

gh-aw outer loop: use the **gh-aw dogfood track** ([gh-aw-dogfood.md](gh-aw-dogfood.md)) for bounded `gh aw compile` validation on `sdlc-gh` itself. Standard GHA aggregation remains the operational baseline — see [nightly-harness-review.md](nightly-harness-review.md). Do not enable unrestricted gh-aw across the repo until dogfood criteria stay green over multiple runs.

```bash
./scripts/bootstrap-harness.sh \
  --repo /path/to/existing \
  --codeowners-team @your-org/harness-engineers
cd /path/to/existing
npx @guilz-dev/sdlc-gh --yes --stack ts --codeowners @your-org/harness-engineers
```

Or skip manual bootstrap entirely:

```bash
cd /path/to/existing
npx @guilz-dev/sdlc-gh
```

## Stack selection

| Stack | Profile | Sample | CI workflow |
|-------|---------|--------|-------------|
| `ts` | `typescript.instructions.md` | `sample/ts/` | `product-ci-ts.yml` |
| `python` | `python.instructions.md` | `sample/python/` | `product-ci-python.yml` |
| `go` | `go.instructions.md` | `sample/go/` | `product-ci-go.yml` |
| `ruby` | `ruby.instructions.md` | `sample/ruby/` | `product-ci-ruby.yml` |
| `php` | `php.instructions.md` | `sample/php/` | `product-ci-php.yml` |

Stack metadata is centralized in [`config/stacks.json`](../config/stacks.json). Bootstrap copies **only** the selected stack's profile and `product-ci-*` workflow, and replaces the `CODEOWNERS` team placeholder at install time.
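
The stack-to-asset mapping in the table can be sketched as a small lookup. This is only an illustration: the `STACKS` map and the `assetsForStack` helper are hypothetical names, and the real schema of `config/stacks.json` is not shown in this guide, so the keys below are assumptions that simply mirror the table columns.

```javascript
// Hypothetical catalog mirroring the stack table above. The real schema of
// config/stacks.json is not shown in this guide, so these keys are assumptions.
const STACKS = new Map([
  ["ts", { profile: "typescript.instructions.md", sample: "sample/ts/", workflow: "product-ci-ts.yml" }],
  ["python", { profile: "python.instructions.md", sample: "sample/python/", workflow: "product-ci-python.yml" }],
  ["go", { profile: "go.instructions.md", sample: "sample/go/", workflow: "product-ci-go.yml" }],
  ["ruby", { profile: "ruby.instructions.md", sample: "sample/ruby/", workflow: "product-ci-ruby.yml" }],
  ["php", { profile: "php.instructions.md", sample: "sample/php/", workflow: "product-ci-php.yml" }],
]);

// Return the only profile and workflow a bootstrap run would copy for one stack.
function assetsForStack(stack) {
  const entry = STACKS.get(stack);
  if (!entry) {
    const known = [...STACKS.keys()].join(", ");
    throw new Error(`Unknown stack "${stack}" (expected one of: ${known})`);
  }
  return {
    profile: `.github/instructions/profiles/${entry.profile}`,
    workflow: `.github/workflows/${entry.workflow}`,
    sample: entry.sample,
  };
}

console.log(assetsForStack("ts"));
```

Rejecting unknown stack names up front keeps a typo in `--stack` from silently copying nothing.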

## CC-SD contract (L1 only in v1)

Phase 4 L1 delegation uses a lightweight **Issue-embedded CC-SD contract** — not a separate spec file. v1 enforces the contract only for `task:docs` and `task:test-fix` at `autonomy:L1` via the `issue-spec-check` CI job. `feature-small`, `infra`, and `security-sensitive` are out of scope until a later version.

Required Issue fields: `Goal`, `Non-goals`, `Constraints`, `Acceptance criteria`, `Rollback hints`. See [coding-agent-l1.md](coding-agent-l1.md). Enforcement uses Issue **labels** (`task:*`, `autonomy:*`), not the form dropdown alone.

`issue-spec-check` is safe to keep always required: non-L1 and unlinked PRs exit successfully (warn/skip only).

## Sync from canonical template (Phase 4)

Use `harness-sync.yml` or subtree merge to pull harness updates. Review drift report before merge.

## Rollback

See [revert-playbook.md](revert-playbook.md) for the canonical procedure. Quick steps:

1. Revert the bootstrap commit or sync PR.
2. Disable required status checks for `harness-ci` in ruleset.
3. Remove `.github/agents` if coding agent assignment causes issues.

## Multi-project rollout

Target **3+ product repos** sharing the same template version. Pin template ref in `harness-sync.yml`.
package/docs/arch.md
ADDED
@@ -0,0 +1,535 @@
# Agent Harness Architecture — GitHub Copilot Core

**Version**: 1.1 (2026-07-04)
**Repository**: `sdlc-gh` — template harness for GitHub Copilot coding agents
**Canonical ops**: [operations.md](operations.md) (thresholds, retry policy, forbidden ops)

---

## 1. Executive summary

This document describes the architecture of **sdlc-gh**, a stack-agnostic agent harness template built on the GitHub Copilot ecosystem (coding agent, CLI/IDE, Agentic Workflows, GitHub Models) and complementary OSS (Langfuse, DeepEval, promptfoo, etc.).

A **harness** is the full mechanism for keeping AI agents aligned with intent — not just prompts. It combines:

- **Feed-forward**: instructions, agents, skills, tool limits, credential boundaries
- **Feedback**: deterministic walls (CI, hooks, diff-size gates), observability, evals, human PR review

Three design conclusions:

1. **Use off-the-shelf enforcement.** Isolation, safe outputs, deterministic walls, observability, and eval runners come from GitHub platform + OSS — do not rebuild them.
2. **Invest in intent definition.** Golden datasets, rubrics, wall content, and revision-cycle operations are the differentiation layer.
3. **Converge human judgment on PR review.** No matter how autonomous agents become, add gates at review time — not mid-execution approval prompts.

### Implementation status (this repo)

| Area | Status |
|------|--------|
| Bootstrap, stack catalog, harness/product CI | **Implemented** |
| Hooks, diff-size gate, CC-SD issue-spec check | **Implemented** |
| Custom agents (triager / implementer / reviewer) | **Implemented** |
| Eval CI with change-type job selection | **Implemented** |
| Retry orchestrator, PR context comments | **Implemented** |
| E2E bench (executable acceptance checks) | **Partial** — 9 tasks; not yet break-and-fix agent runner |
| `gh models eval` in CI | **Scaffolded** — runs when prompts exist; org must enable Models |
| gh-aw outer loop (`nightly-harness-review`, `weekly-redteam`) | **Partial** — GHA nightly review + gh-aw dogfood track (#7); `.md`/`.lock.yml` stubs remain |
| Langfuse / OTel export | **Scaffolded** — `infra/` + schema; inner-loop JSON artifacts wired |

Operational details and thresholds live in companion docs — see [Documentation index](#11-related-documentation).

---

## 2. Design principles

### 2.1 Harness as a control system

Model the harness as a dual-loop control system:

```mermaid
flowchart LR
  subgraph OUTER["Outer loop (cross-task, daily–weekly)"]
    EVAL[Eval platform<br/>eval / trace analysis]
    REVISE[Harness revision<br/>instructions / skills / walls]
  end
  subgraph INNER["Inner loop (single task, seconds–minutes)"]
    FF[Feed-forward<br/>instructions / skills / tool limits]
    AGENT[Agent execution<br/>plan → act → observe]
    WALL[Deterministic walls<br/>tests / lint / hooks / diff-size]
  end
  INTENT[Intent<br/>Issue / CC-SD contract] --> FF
  FF --> AGENT
  AGENT --> WALL
  WALL -- fail → retry --> AGENT
  WALL -- pass --> OUT[PR artifact]
  AGENT -. trace .-> EVAL
  WALL -. pass/fail log .-> EVAL
  OUT -. review outcome .-> EVAL
  EVAL --> REVISE
  REVISE -- asset update --> FF
  REVISE -- wall addition --> WALL
```

The **inner loop** runs fast (seconds–minutes). The **outer loop** runs slower (daily–weekly). Without the outer loop, the harness degrades as models and task distributions shift.

### 2.2 Five principles

**Principle 1: Walls are declarative and deterministic.** Constraints that depend on agent goodwill are not constraints. Prefer CI jobs, hooks, rulesets, and (when available) gh-aw safe outputs over prompt pleading.

**Principle 2: Structurally separate secrets from agents.** gh-aw's secretless design is the ideal; coding agent uses short-lived scoped tokens; CLI/IDE and SDK use delegated or proxied credentials. Never expose long-lived keys in agent-readable files.

**Principle 3: Log at every trust boundary.** Observability enables future control. Traces use OpenTelemetry; avoid vendor lock-in on the export path.

**Principle 4: No harness change without eval.** Changes to instructions, agents, skills, hooks, or eval assets must pass the change-type eval matrix before merge.

**Principle 5: One human gate — make it strong.** Collect decision inputs on the PR (scores, cost, trace links, harness asset SHAs). Do not scatter synchronous approval prompts.

> L2 auto-merge and L3 full auto-merge are **future promotions** with strict scope limits. PR review is never abolished — only its sync timing and scope change. See [operations.md](operations.md).

---

## 3. Platform and gap analysis

### 3.1 GitHub platform maturity (mid-2026)

| Layer | Offering | Maturity | Notes |
|-------|----------|----------|-------|
| Instruction hierarchy | `copilot-instructions.md`, `AGENTS.md`, `.instructions.md` | GA | Priority: custom agent > path-specific > global |
| Custom agents | `.github/agents/*.agent.md` | GA | Tools, handoffs, org distribution via `.github-private` |
| Skills | Agent Skills (open spec) | Available | On-demand load; compatible with Copilot / Claude Code / Codex |
| Hooks | `hooks.json` (6 events) | Available | Deterministic block of destructive ops |
| Environment setup | `copilot-setup-steps.yml` | Available | Agent continues even if setup fails — monitor it |
| Isolated execution | coding agent (Actions VM, own branch) | GA | Requester cannot approve own PR |
| Automation | Agentic Workflows (gh-aw) | Public Preview | Markdown → lock.yml, AWF firewall, safe outputs, threat detection |
| Observability | gh aw logs/audit, OTel export | Available | |
| Eval | GitHub Models: `.prompt.yml` + `gh models eval` | Available | Single-prompt eval; not full agent trajectory |
| Embedding | Copilot SDK | Public Preview | Node/Python/Go/.NET/Java |

### 3.2 OSS complement map

| Purpose | Primary | Alternative | Rationale |
|---------|---------|-------------|-----------|
| Trace / observability | Langfuse (self-host) | Phoenix, OpenLLMetry | OTel-compatible; de facto OSS choice |
| Trajectory / agent eval | DeepEval | promptfoo, Ragas | pytest-style CI integration; G-Eval |
| Red team | NVIDIA garak | AI-Infra-Guard | Periodic prompt-injection testing |
| Workflow static analysis | zizmor, actionlint | — | Integrated in `harness-ci` |
| Template assets | github/awesome-copilot | — | Official recipes |

### 3.3 Gaps requiring custom work

| Gap | Description | sdlc-gh response |
|-----|-------------|------------------|
| **G1** Trajectory golden dataset | `gh models eval` is prompt-level, not E2E Issue→PR | `evals/e2e-bench/` — executable acceptance checks today; break-and-fix runner planned |
| **G2** Rubrics | "Good PR" definition is domain-specific | `evals/trajectories/rubric.md` + convention tests |
| **G3** Revision-cycle operations | Trace → classify → revise routing | Documented in [failure-taxonomy.md](failure-taxonomy.md); gh-aw stubs for automation |

#### G1 runner boundary (current vs planned)

| Concern | Current (`run-e2e-bench.mjs`) | Planned break-and-fix runner |
|---------|-------------------------------|------------------------------|
| **Task input** | Static YAML fixture in `tasks/*.yml` | Issue + CC-SD contract + repo snapshot |
| **Expected artifact** | File content / command exit code | Agent-produced PR diff |
| **Verifier contract** | `verification_*` fields in task YAML | Same fields + agent execution harness |
| **Result summary** | Per-task ok/fail; class/stack counts; executed/skipped totals | Above + pass@1, retry count, wall failure class |

Manifest validation: `scripts/check-e2e-manifest.mjs`. Details: [evals/e2e-bench/README.md](../evals/e2e-bench/README.md).

---

## 4. UX design

### 4.1 Scope

| Category | Default | Examples |
|----------|---------|----------|
| In scope | L0–L2 candidates | App code, tests, refactors, dep bumps, docs |
| Always human | Production DB, prod secrets, billing/legal/PII | Never auto-delegate |
| L3 (conditional) | Typo/link/comment/docs only | Small, non-executable changes |

**v1 L1 enforcement** (CC-SD contract + `issue-spec-check`) applies only to `task:docs` and `task:test-fix`. See [coding-agent-l1.md](coding-agent-l1.md).

### 4.2 Personas

| Persona | Responsibility | Primary touchpoints |
|---------|----------------|---------------------|
| **Developer** | Write Issue, delegate to agent, review PR | Issue, PR, IDE, CLI |
| **Reviewer** | Final gate on PRs with bundled context | PR |
| **Harness engineer** | Maintain harness assets, run outer loop | `.github/**`, `evals/**`, morning queue |
| **Org admin** | MCP allowlist, model policy, budget | Organization settings |

### 4.3 UX principles

- **Single gate**: PR review only; hooks and CI replace mid-task approvals.
- **Bundled decision inputs**: PR context comment auto-posts diff stats, labels, retry count, instruction/skill SHAs, eval baseline, trace link placeholder.
- **Morning queue**: Outer-loop artifacts batched for daily review, not real-time interrupts.
- **Graduated autonomy**: L0→L3 with promotion evidence; see task-class matrix below.
- **Visible failures**: Retry exhaustion, security blocks, and drift warnings surface as structured PR comments and Issues.

### 4.4 Task classification and autonomy matrix

Enforced by Issue labels (`task:*`, `autonomy:*`) and `scripts/check-diff-size.mjs`:

| Task class | Examples | Max autonomy (default) | Size limits (LOC / files) |
|------------|----------|------------------------|---------------------------|
| `docs` | README, design docs | L3 | 60 / 2 |
| `test-fix` | Fix or add tests | L2 | 120 / 4 |
| `refactor` | Rename, dedupe | L1 | 300 / 8 |
| `feature-small` | Small feature | L1 | 300 / 8 |
| `dependency-bump` | patch/minor deps | L1 | 300 / 8 |
| `infra` | CI, IaC, deploy | L0 | Human gate |
| `security-sensitive` | Auth, billing, secrets | L0 | Proposal only |

L2/L3 labeled PRs **hard-fail** CI when limits exceeded. L1 **warns by default** (template). Phase 4 supports opt-in L1 hard-fail via `DIFF_SIZE_L1_HARD_FAIL=1`. `autonomy:L0` is proposal-only (no size gate).
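
The gate rules above can be sketched as one decision function. This is a minimal illustration, not the logic of `scripts/check-diff-size.mjs`: `LIMITS`, `evaluateDiffSize`, and the `pass` / `warn` / `fail` / `skip` outcomes are hypothetical names. It also assumes that limits are keyed by task class as in the table, and that exceeding either the LOC limit or the file limit counts as over-limit.

```javascript
// Hypothetical limits table copied from the matrix above (LOC / files).
// Classes without numeric limits (infra, security-sensitive) are left out.
const LIMITS = new Map([
  ["docs", { loc: 60, files: 2 }],
  ["test-fix", { loc: 120, files: 4 }],
  ["refactor", { loc: 300, files: 8 }],
  ["feature-small", { loc: 300, files: 8 }],
  ["dependency-bump", { loc: 300, files: 8 }],
]);

// Decide the gate outcome for one PR.
// Assumption: over-limit means LOC or file count exceeds its limit.
function evaluateDiffSize({ taskClass, autonomy, loc, files, l1HardFail = false }) {
  if (autonomy === "L0") return "skip"; // proposal-only, no size gate
  const limit = LIMITS.get(taskClass);
  if (!limit) return "skip"; // no numeric limit defined for this class
  const over = loc > limit.loc || files > limit.files;
  if (!over) return "pass";
  if (autonomy === "L1" && !l1HardFail) return "warn"; // template default
  return "fail"; // L2/L3, or L1 with the opt-in hard-fail
}

// The opt-in switch named in the text, read from the environment.
const l1HardFail = process.env.DIFF_SIZE_L1_HARD_FAIL === "1";
console.log(evaluateDiffSize({ taskClass: "refactor", autonomy: "L1", loc: 400, files: 3, l1HardFail }));
```

Keeping the outcome as a value rather than exiting directly lets a caller choose how to report a `warn` versus a `fail`.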

### 4.5 CC-SD contract (L1 docs / test-fix)

Lightweight Issue-embedded spec — not a separate file:

| Field | Required |
|-------|----------|
| Goal | yes |
| Non-goals | yes |
| Constraints | yes |
| Acceptance criteria | yes |
| Rollback hints | yes |
| Additional context | optional |

Enforcement flow:

1. Author fills `.github/ISSUE_TEMPLATE/task.yml`
2. Triager validates contract and applies **labels** (form dropdown alone does not trigger CI)
3. `issue-spec-check` resolves the linked Issue (`closingIssuesReferences` first, then `fixes/closes #N` in the PR body), validates CC-SD when labels are `task:docs` or `task:test-fix` + `autonomy:L1`
4. Unlinked PRs warn and skip. Issue fetch failure **fails** only when PR or Issue proxy labels indicate L1 docs/test-fix; otherwise warn and skip
5. Reviewer checks Issue → PR summary → diff alignment

Canonical field names: `scripts/lib/ccsd-contract.mjs`.
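
The contract check in the enforcement flow can be sketched as follows. This is a hedged illustration, not the code in `scripts/lib/ccsd-contract.mjs` or `scripts/check-issue-spec.mjs`: all function names are hypothetical. It assumes the Issue body uses the GitHub issue-form rendering (one `### <label>` heading per field, with `_No response_` for an empty answer) and that the form labels match the field names in the table.

```javascript
// Required fields from the table above.
const REQUIRED_FIELDS = ["Goal", "Non-goals", "Constraints", "Acceptance criteria", "Rollback hints"];

// Split an issue-form body into a Map of heading -> trimmed section text.
function parseIssueSections(body) {
  const sections = new Map();
  let current = null;
  for (const line of body.split(/\r?\n/)) {
    const heading = /^###\s+(.+?)\s*$/.exec(line);
    if (heading) {
      current = heading[1];
      sections.set(current, []);
    } else if (current !== null) {
      sections.get(current).push(line);
    }
  }
  return new Map([...sections].map(([name, lines]) => [name, lines.join("\n").trim()]));
}

// List required fields that are absent or left empty.
function missingContractFields(body) {
  const sections = parseIssueSections(body);
  return REQUIRED_FIELDS.filter((field) => {
    const value = sections.get(field);
    return !value || value === "_No response_";
  });
}

// The contract applies only to L1 docs / test-fix, decided by labels.
function requiresContract(labels) {
  const set = new Set(labels);
  return set.has("autonomy:L1") && (set.has("task:docs") || set.has("task:test-fix"));
}

// Fallback link resolution: first "fixes #N" or "closes #N" in the PR body.
function linkedIssueFromBody(prBody) {
  const match = /\b(?:fix(?:es)?|close[sd]?)\s+#(\d+)\b/i.exec(prBody);
  return match ? Number(match[1]) : null;
}

const body = "### Goal\n\nFix typo\n\n### Non-goals\n\n_No response_";
console.log(missingContractFields(body));
```

Deciding applicability from labels first means a PR outside the L1 docs / test-fix scope never reaches the field check, which matches the warn-and-skip behavior described in step 4.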

---

## 5. Architecture overview

### 5.1 Layer model

```mermaid
flowchart TB
  subgraph L6["L6 Outer loop (process + gh-aw stubs)"]
    NIGHTLY[nightly-harness-review<br/>failure classify / revision PR]
    DRIFT[eval-drift / harness-sync<br/>bench rotation / template drift]
  end
  subgraph L5["L5 Eval (gh models + pytest + e2e-bench)"]
    PEVAL[prompt-eval<br/>prompts/*.prompt.yml]
    TEVAL[trajectory tests<br/>evals/trajectories/]
    E2E[e2e-bench<br/>executable acceptance checks]
  end
  subgraph L4["L4 Observability (OTel → Langfuse)"]
    TRACE[telemetry-schema fields<br/>PR context comment links]
  end
  subgraph L3["L3 Deterministic walls"]
    HARNESS[harness-ci<br/>static / hooks / diff-size / issue-spec]
    PRODUCT[product-ci-*<br/>stack tests / lint]
    HOOKS[hooks.json preToolUse]
  end
  subgraph L2["L2 Execution (GitHub platform)"]
    CODING[coding agent]
    GHAW[Agentic Workflows]
    CLI[Copilot CLI / IDE]
  end
  subgraph L1["L1 Feed-forward assets"]
    INS[instructions / AGENTS.md]
    AGT[agents: triager / implementer / reviewer]
    SKL[skills: quality-loop]
  end
  subgraph L0["L0 Governance"]
    POL[rulesets / CODEOWNERS / labels / budget]
  end
  L0 --> L1 --> L2 --> L3
  L2 -. OTel .-> L4
  L3 -. outcomes .-> L4
  L4 --> L5 --> L6
  L6 -- revise --> L1
  L6 -- add walls --> L3
```

### 5.2 Component map (as implemented in sdlc-gh)

| # | Component | Source | Implementation |
|---|-----------|--------|----------------|
| C1 | Instruction hierarchy | GitHub + custom | `.github/copilot-instructions.md`, `.github/instructions/` (core + stack profiles), `AGENTS.md` |
| C2 | Custom agents | GitHub + custom | `triager` (read), `implementer` (read/edit/search/execute), `reviewer` (read/search); handoffs triager→implementer |
| C3 | Skills | Open spec + custom | `.github/skills/quality-loop/SKILL.md` — verify against CC-SD before complete |
| C4 | Deterministic guards | GitHub + custom | `hooks/hooks.json` (force-push, rm -rf, DROP TABLE); `check-diff-size.mjs`; `check-open-pr-limit.mjs` (3 open PRs proxy for safe outputs) |
| C5 | Execution modes | GitHub | CLI/IDE, coding agent, gh-aw (stubs), SDK — see [auth-boundaries.md](auth-boundaries.md) |
| C6 | Walls (content) | Custom per repo | Stack `product-ci-*` workflows; sample apps under `sample/{stack}/` |
| C7 | Harness CI | Custom | `harness-ci.yml`: harness-static, issue-spec-check, open-pr-limit, diff-size, detect-projects → product-ci |
| C8 | Eval CI | Custom | `eval-ci.yml` + `select-eval-jobs.mjs` change-type matrix |
| C9 | Retry orchestrator | Custom | `agent-retry-orchestrator.yml` — max 3 retries, same-signature stop, security no-retry |
| C10 | PR context | Custom | `pr-context-comment.yml` — decision table on every PR |
| C11 | Bootstrap | Custom | `scripts/bootstrap-harness.sh` + `config/stacks.json` |
| C12 | Observability | OSS scaffold | `infra/langfuse/`, `infra/otel/`, [telemetry-schema.md](telemetry-schema.md) |
|
|
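The deterministic guards in row C4 can be pictured as a pattern check that runs before a tool call is allowed. The sketch below is written in Python for brevity and is only an illustration: the regular expressions, the `is_blocked` name, and the plain-string input are assumptions, not the contents of `hooks/hooks.json`.

```python
import re

# Hypothetical deny patterns for the three operation classes named in row C4.
# The real hooks.json may express these rules differently.
DENY_PATTERNS = [
    # git push with --force (including --force-with-lease) or -f
    re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)"),
    # rm with both the r and f flags in one option group (-rf, -fr, -rfv, ...)
    re.compile(r"\brm\s+-(?=[a-zA-Z]*r)(?=[a-zA-Z]*f)[a-zA-Z]+\b"),
    # SQL DROP TABLE in any letter case
    re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
]


def is_blocked(command: str) -> bool:
    """Return True when the command matches any deny pattern."""
    return any(pattern.search(command) for pattern in DENY_PATTERNS)
```

A guard like this is deliberately conservative: it only inspects the command text, so it complements rather than replaces rulesets and branch protection.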
268
|
+
|
|
269
|
+
### 5.2.1 Eval matrix by change type
|
|
270
|
+
|
|
271
|
+
Implemented in `scripts/select-eval-jobs.mjs`:
|
|
272
|
+
|
|
273
|
+
| Changed paths | Eval jobs triggered |
|
|
274
|
+
|---------------|---------------------|
|
|
275
|
+
| `prompts/*.prompt.yml` | `prompt-eval` |
|
|
276
|
+
| `.github/agents/**` | `prompt-eval`, `agent-policy` |
|
|
277
|
+
| `.github/instructions/**`, `AGENTS.md` | `trajectory-conventions` |
|
|
278
|
+
| `.github/skills/**` | `trajectory-task` |
|
|
279
|
+
| `evals/**` | `meta-eval` |
|
|
280
|
+
| Default (other harness paths) | `trajectory-conventions` |
|
|
281
|
+
|
|
282
|
+
The weekly schedule runs the full `e2e-bench` job regardless of PR paths.
|
|
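The change-type matrix above can be read as a lookup from changed paths to eval jobs. The following is a minimal Python sketch of that reading, not the logic of `scripts/select-eval-jobs.mjs` itself. It assumes that jobs from all matching rows are unioned and that any path matching no row falls back to the default row; the function name `select_eval_jobs` is hypothetical.

```python
from fnmatch import fnmatchcase

# (pattern, jobs) pairs copied from the table above.
# Note: fnmatch's "*" also matches "/", so these patterns cover nested paths.
RULES = [
    ("prompts/*.prompt.yml", ["prompt-eval"]),
    (".github/agents/*", ["prompt-eval", "agent-policy"]),
    (".github/instructions/*", ["trajectory-conventions"]),
    ("AGENTS.md", ["trajectory-conventions"]),
    (".github/skills/*", ["trajectory-task"]),
    ("evals/*", ["meta-eval"]),
]
DEFAULT_JOBS = ["trajectory-conventions"]  # "Default (other harness paths)" row


def select_eval_jobs(changed_paths):
    """Return the sorted set of eval jobs triggered by the changed paths."""
    jobs = set()
    for path in changed_paths:
        matched = [
            job
            for pattern, rule_jobs in RULES
            if fnmatchcase(path, pattern)
            for job in rule_jobs
        ]
        # Assumption: an unmatched path triggers the default job.
        jobs.update(matched or DEFAULT_JOBS)
    return sorted(jobs)
```

For example, a PR touching only `.github/agents/triager.agent.md` would select `agent-policy` and `prompt-eval` under these assumptions.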
283
|
+
|
|
284
|
+
### 5.2.2 Change size limits
|
|
285
|
+
|
|
286
|
+
Canonical values live in [operations.md](operations.md). `scripts/check-diff-size.mjs` enforces them according to the PR's `autonomy:*` label.
|
|
287
|
+
|
|
288
|
+
| Level | Max LOC | Max files | CI behavior |
|
|
289
|
+
|-------|---------|-----------|-------------|
|
|
290
|
+
| L0 | — | — | Proposal only |
|
|
291
|
+
| L1 | 300 | 8 | Warn (hard-fail opt-in via `DIFF_SIZE_L1_HARD_FAIL=1`) |
|
|
292
|
+
| L2 | 120 | 4 | Hard fail |
|
|
293
|
+
| L3 | 60 | 2 | Hard fail |
|
|
294
|
+
|
|
295
|
+
Over-limit changes should be split, not force-merged.
|
|
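As a rough illustration of the size gate, the Python sketch below encodes the table above. It is not the code of `scripts/check-diff-size.mjs`: the function name, the return values, the treatment of limits as inclusive, and the way LOC and file counts are passed in are all assumptions.

```python
# Limits copied from the table above; L0 has no numeric limits (proposal only).
LIMITS = {
    "L1": {"max_loc": 300, "max_files": 8, "hard_fail": False},
    "L2": {"max_loc": 120, "max_files": 4, "hard_fail": True},
    "L3": {"max_loc": 60, "max_files": 2, "hard_fail": True},
}


def check_diff_size(level, loc, files, l1_hard_fail=False):
    """Return 'skip', 'pass', 'warn', or 'fail' for a PR at the given level.

    l1_hard_fail mirrors the DIFF_SIZE_L1_HARD_FAIL=1 opt-in.
    """
    if level not in LIMITS:
        return "skip"  # assumption: L0 (proposal only) is not size-gated
    limit = LIMITS[level]
    # Assumption: a change is over the limit only when it exceeds a maximum.
    over = loc > limit["max_loc"] or files > limit["max_files"]
    if not over:
        return "pass"
    if limit["hard_fail"] or (level == "L1" and l1_hard_fail):
        return "fail"
    return "warn"
```

Under these assumptions, a 350-line L1 change only warns by default, while the same change fails once the hard-fail opt-in is set.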
296
|
+
|
|
297
|
+
### 5.3 Task lifecycle (data flow)
|
|
298
|
+
|
|
299
|
+
```mermaid
|
|
300
|
+
sequenceDiagram
|
|
301
|
+
participant Dev as Developer
|
|
302
|
+
participant GH as GitHub Issue/PR
|
|
303
|
+
participant TRI as triager agent
|
|
304
|
+
participant IMP as implementer agent
|
|
305
|
+
participant WALL as Walls (harness + product CI)
|
|
306
|
+
participant RET as Retry orchestrator
|
|
307
|
+
participant REV as Reviewer
|
|
308
|
+
Dev->>GH: Issue with CC-SD contract
|
|
309
|
+
GH->>TRI: Classify task:* / autonomy:*
|
|
310
|
+
TRI->>GH: Labels applied (L1 requires complete contract)
|
|
311
|
+
GH->>IMP: Delegate implementation
|
|
312
|
+
loop Inner loop
|
|
313
|
+
IMP->>IMP: Plan → edit → test
|
|
314
|
+
IMP->>GH: Draft PR commits
|
|
315
|
+
GH->>WALL: Required checks
|
|
316
|
+
alt Check failure
|
|
317
|
+
WALL->>RET: check_suite completed (failure)
|
|
318
|
+
RET->>GH: Structured comment + retry:N label
|
|
319
|
+
Note over RET: Max 3; same sig ×2 stops;<br/>security escalates immediately
|
|
320
|
+
else Check pass
|
|
321
|
+
WALL->>GH: PR context comment posted
|
|
322
|
+
GH->>REV: Single human gate
|
|
323
|
+
REV->>GH: Approve or request changes
|
|
324
|
+
end
|
|
325
|
+
end
|
|
326
|
+
```
|
|
327
|
+
|
|
328
|
+
**Failure classification** (outer-loop routing): feed-forward gap, wall gap, model limit — see [failure-taxonomy.md](failure-taxonomy.md). Retry comments include `wall_failure_type` and `failure_sig`.
|
|
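The retry rules noted in the diagram (max 3, stop on a repeated signature, escalate security failures) can be sketched as a small decision function. This Python sketch is one possible reading, not the logic of `agent-retry-orchestrator.yml`: it assumes "same sig ×2" means the same `failure_sig` on two consecutive failures, and that security failures are identified by a `wall_failure_type` value of `"security"`. The function name and return values are hypothetical.

```python
MAX_RETRIES = 3


def next_action(wall_failure_type, failure_sig, previous_sigs):
    """Decide what to do after a failed check.

    previous_sigs holds the failure signatures of earlier failed attempts
    on the same PR, oldest first (one entry per retry already issued).
    """
    if wall_failure_type == "security":
        return "escalate"  # security failures are never retried
    if previous_sigs and previous_sigs[-1] == failure_sig:
        return "stop"  # same signature twice in a row
    if len(previous_sigs) >= MAX_RETRIES:
        return "stop"  # retry budget exhausted
    return f"retry:{len(previous_sigs) + 1}"  # matches the retry:N label shape
```

Keeping the decision a pure function of the failure history makes it straightforward to cover with scenario tests.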
329
|
+
|
|
330
|
+
### 5.4 Telemetry minimum schema
|
|
331
|
+
|
|
332
|
+
Required span fields are defined in [telemetry-schema.md](telemetry-schema.md). Inner-loop workflows emit JSON artifacts as specified in [telemetry-artifacts.md](telemetry-artifacts.md). When `LANGFUSE_HOST` is configured, the PR context comment surfaces the repo and PR number for Langfuse lookup.
|
|
333
|
+
|
|
334
|
+
### 5.5 Reviewer checklist
|
|
335
|
+
|
|
336
|
+
From `reviewer.agent.md` and arch §5.5:
|
|
337
|
+
|
|
338
|
+
1. **Requirement fit** — Goal and Acceptance criteria met?
|
|
339
|
+
2. **Non-goal preservation** — Out-of-scope items untouched?
|
|
340
|
+
3. **Boundary compliance** — Constraints respected?
|
|
341
|
+
4. **Test adequacy** — Tests constrain the change?
|
|
342
|
+
5. **Accountability** — Eval scores, cost, trace links present?
|
|
343
|
+
6. **Rollback ease** — Rollback hints plausible?
|
|
344
|
+
|
|
345
|
+
Compare **Issue → PR summary → diff** in one pass.
|
|
346
|
+
|
|
347
|
+
### 5.6 Repository layout
|
|
348
|
+
|
|
349
|
+
**Template repo (`sdlc-gh`)**:
|
|
350
|
+
|
|
351
|
+
```text
|
|
352
|
+
sdlc-gh/
|
|
353
|
+
├── AGENTS.md # Project instructions (task classes, roles)
|
|
354
|
+
├── config/
|
|
355
|
+
│ └── stacks.json # Stack catalog (ts / python / go / ruby / php)
|
|
356
|
+
├── .github/
|
|
357
|
+
│ ├── copilot-instructions.md # Global agent policy
|
|
358
|
+
│ ├── instructions/
|
|
359
|
+
│ │ ├── core.instructions.md
|
|
360
|
+
│ │ └── profiles/ # Per-stack conventions
|
|
361
|
+
│ ├── agents/ # triager / implementer / reviewer
|
|
362
|
+
│ ├── skills/quality-loop/SKILL.md
|
|
363
|
+
│ ├── hooks/hooks.json
|
|
364
|
+
│ ├── labels.yml # task:* / autonomy:* definitions
|
|
365
|
+
│ ├── CODEOWNERS
|
|
366
|
+
│ ├── ruleset.example.json
|
|
367
|
+
│ ├── ISSUE_TEMPLATE/task.yml # CC-SD contract template
|
|
368
|
+
│ ├── pull_request_template.md
|
|
369
|
+
│ ├── aw/actions-lock.json
|
|
370
|
+
│ └── workflows/
|
|
371
|
+
│ ├── harness-ci.yml # Walls + stack detection + product CI
|
|
372
|
+
│ ├── product-ci-{stack}.yml # Per-stack test/lint (5 stacks)
|
|
373
|
+
│ ├── eval-ci.yml # Change-type eval matrix
|
|
374
|
+
│ ├── eval-drift.yml # Bench rotation + score drift Issues
|
|
375
|
+
│ ├── agent-retry-orchestrator.yml
|
|
376
|
+
│ ├── pr-context-comment.yml
|
|
377
|
+
│ ├── copilot-setup-steps.yml
|
|
378
|
+
│ ├── labels-sync.yml
|
|
379
|
+
│ ├── harness-sync.yml # Weekly drift report
|
|
380
|
+
│ ├── nightly-harness-review.md # gh-aw stub (Phase 3)
|
|
381
|
+
│ ├── nightly-harness-review.lock.yml
|
|
382
|
+
│ ├── weekly-redteam.md # gh-aw stub (Phase 3)
|
|
383
|
+
│ └── weekly-redteam.lock.yml
|
|
384
|
+
├── docs/ # Architecture and operations
|
|
385
|
+
├── evals/
|
|
386
|
+
│ ├── trajectories/ # pytest convention tests + rubric.md
|
|
387
|
+
│ ├── e2e-bench/ # Task fixtures + manifest.json
|
|
388
|
+
│ └── .score-baseline.json
|
|
389
|
+
├── prompts/ # .prompt.yml for gh models eval
|
|
390
|
+
├── scripts/ # CI gate implementations, bootstrap
|
|
391
|
+
├── sample/ # Minimal apps per stack (CI targets)
|
|
392
|
+
└── infra/ # Langfuse + OTel collector scaffolding
|
|
393
|
+
```
|
|
394
|
+
|
|
395
|
+
**After bootstrap** (`scripts/bootstrap-harness.sh`):
|
|
396
|
+
|
|
397
|
+
- Only the selected stack's profile and `product-ci-*` workflow are copied
|
|
398
|
+
- `harness-ci.yml` is trimmed to a single product job
|
|
399
|
+
- Sample code expands to repo root when `--mode new`
|
|
400
|
+
|
|
401
|
+
In the template repo, marker detection runs product CI for every stack present, via the `detect-projects` job.
|
|
402
|
+
|
|
403
|
+
Harness assets live in the product repo so Git history, PR review, and rollback apply directly. Org-wide shared assets can ship from a `.github-private` repository.
|
|
404
|
+
|
|
405
|
+
---
|
|
406
|
+
|
|
407
|
+
## 6. CI and automation reference
|
|
408
|
+
|
|
409
|
+
### 6.1 Harness CI jobs
|
|
410
|
+
|
|
411
|
+
| Job | Trigger | Purpose |
|
|
412
|
+
|-----|---------|---------|
|
|
413
|
+
| `harness-static` | PR, push main | `validate-harness.mjs`, actionlint, zizmor, hooks + issue-spec scenario tests |
|
|
414
|
+
| `issue-spec-check` | PR | CC-SD completeness for L1 docs/test-fix (`check-issue-spec.mjs`; uses `PR_LABELS` + linked Issue labels) |
|
|
415
|
+
| `open-pr-limit` | PR | Warn when author has >3 open PRs (`check-open-pr-limit.mjs`) |
|
|
416
|
+
| `diff-size` | PR | Autonomy size gate (`check-diff-size.mjs`) |
|
|
417
|
+
| `product-ci-*` | PR, push main | Stack tests/lint (conditional on marker files) |
|
|
418
|
+
|
|
419
|
+
### 6.2 Eval CI jobs
|
|
420
|
+
|
|
421
|
+
| Job | When | Purpose |
|
|
422
|
+
|-----|------|---------|
|
|
423
|
+
| `select` | PR / schedule | `select-eval-jobs.mjs` |
|
|
424
|
+
| `prompt-eval` | Selected | `gh models eval` on `prompts/*.prompt.yml` |
|
|
425
|
+
| `agent-policy` | Selected | Agent definition validation |
|
|
426
|
+
| `trajectory-conventions` | Selected | pytest harness convention tests |
|
|
427
|
+
| `trajectory-task` | Selected | Skill/task rubric tests |
|
|
428
|
+
| `meta-eval` | Selected | E2E manifest + bench runner + pytest |
|
|
429
|
+
| `e2e-bench` | Weekly schedule | Full bench run |
|
|
430
|
+
|
|
431
|
+
### 6.3 Local validation
|
|
432
|
+
|
|
433
|
+
```bash
|
|
434
|
+
npm run validate # Harness asset consistency
|
|
435
|
+
npm run test-hooks # Hook block/allow scenarios
|
|
436
|
+
npm run test-diff-size # Diff-size scenarios
|
|
437
|
+
npm run test-e2e-manifest # E2E manifest scenarios
|
|
438
|
+
npm run test-doctor # Doctor local check scenarios
|
|
439
|
+
npm run check-e2e # E2E manifest checks
|
|
440
|
+
npm run verify-bootstrap # Bootstrap integration (all stacks)
|
|
441
|
+
pytest evals/trajectories -q
|
|
442
|
+
```
|
|
443
|
+
|
|
444
|
+
Node 22 is recommended for full E2E verifier parity with CI.
|
|
445
|
+
|
|
446
|
+
---
|
|
447
|
+
|
|
448
|
+
## 7. Phased rollout
|
|
449
|
+
|
|
450
|
+
Do not enable everything at once. Each phase uses prior metrics as promotion evidence.
|
|
451
|
+
|
|
452
|
+
| Phase | Timeline | Enable | sdlc-gh state |
|
|
453
|
+
|-------|----------|--------|---------------|
|
|
454
|
+
| **0** | ~2 weeks | CI walls, rulesets, optional Langfuse | harness-ci + product-ci |
|
|
455
|
+
| **1** | ~1 month | Instructions, agents, hooks, templates; record baseline KPIs | FF assets + labels sync |
|
|
456
|
+
| **2** | ~2 months | Eval CI + change-type matrix; E2E bench v1 | eval-ci, 9 e2e tasks |
|
|
457
|
+
| **3** | ~3 months | Retry orchestrator, PR context, gh-aw outer loop stubs | orchestrator + stubs (compile not guaranteed) |
|
|
458
|
+
| **4** | Stable ops | L2 promotion for docs; exception ledger; revert playbook | See [revert-playbook.md](revert-playbook.md), [exceptions/](exceptions/README.md) |
|
|
459
|
+
|
|
460
|
+
Detailed adoption steps: [adoption.md](adoption.md). L1 trial guide: [coding-agent-l1.md](coding-agent-l1.md).
|
|
461
|
+
|
|
462
|
+
---
|
|
463
|
+
|
|
464
|
+
## 8. Risks and mitigations
|
|
465
|
+
|
|
466
|
+
| Risk | Mitigation |
|
|
467
|
+
|------|------------|
|
|
468
|
+
| gh-aw Public Preview instability | Limit to outer loop; commit lock.yml; track release notes |
|
|
469
|
+
| Prompt injection | Input sanitization (gh-aw), secretless design, garak red-team stub; treat Issue body as untrusted |
|
|
470
|
+
| Auth boundary drift per mode | [auth-boundaries.md](auth-boundaries.md) matrix; no long-lived secrets in repo |
|
|
471
|
+
| Approval fatigue | Single PR gate; structured context comment; no mid-task prompts |
|
|
472
|
+
| Cost runaway | `max-ai-credits`; Langfuse cost fields; open-PR limit proxy |
|
|
473
|
+
| Eval overfitting | Quarterly 20% E2E rotation (`eval-drift.yml`); 15pt production gap threshold |
|
|
474
|
+
| Principle 4 hollowed out | Eval CI path filters + ruleset eval required |
|
|
475
|
+
| Retry loop runaway | Max 3 retries, same-signature stop, security immediate escalation |
|
|
476
|
+
| Oversized diffs | Autonomy size gates; split recommendation |
|
|
477
|
+
| Harness asset drift | `harness-sync.yml` weekly drift report; bootstrap re-run |
|
|
478
|
+
|
|
479
|
+
---
|
|
480
|
+
|
|
481
|
+
## 9. Custom investment areas
|
|
482
|
+
|
|
483
|
+
Quality ultimately depends on:
|
|
484
|
+
|
|
485
|
+
1. **Wall content** — test suites, contract tests, business-rule checks in product CI
|
|
486
|
+
2. **E2E bench (G1)** — expand from acceptance checks to break-and-fix agent runner; target 20–100 tasks
|
|
487
|
+
3. **Rubrics (G2)** — `evals/trajectories/rubric.md` and G-Eval / LLM-as-judge specs
|
|
488
|
+
4. **Revision cycle (G3)** — failure taxonomy routing; morning queue (~30 min/day)
|
|
489
|
+
5. **Feed-forward content** — domain rules in instructions / agents / skills
|
|
490
|
+
6. **Exception ledger** — [docs/exceptions/](exceptions/README.md)
|
|
491
|
+
|
|
492
|
+
---
|
|
493
|
+
|
|
494
|
+
## 10. KPIs
|
|
495
|
+
|
|
496
|
+
**Inner loop**: PR rejection rate, first-pass wall rate, AI credits per task, average retry count.
|
|
497
|
+
|
|
498
|
+
**Outer loop**: E2E pass rate trend, harness revision PR adoption rate, failure-class mix (wall-gap ratio declining = walls improving).
|
|
499
|
+
|
|
500
|
+
**Lagging quality**: 7-day revert rate, agent-caused hotfix rate, post-review fix rate.
|
|
501
|
+
|
|
502
|
+
**UX**: Review time per PR, morning queue processing time, autonomy level distribution.
|
|
503
|
+
|
|
504
|
+
Tracking template: [kpi-baseline.md](kpi-baseline.md).
|
|
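To make the inner-loop KPIs concrete, here is a small Python sketch that computes three of them from per-PR records. The definitions are assumptions, since this section only names the metrics: "first-pass wall rate" is taken to mean the share of PRs that needed zero retries, and the record fields `retries` and `rejected` are hypothetical.

```python
def inner_loop_kpis(prs):
    """Compute inner-loop KPIs from a list of per-PR records.

    Each record is a dict with hypothetical fields:
      retries  - number of retries the PR needed (int)
      rejected - whether the reviewer rejected the PR (bool)
    """
    total = len(prs)
    if total == 0:
        return {
            "first_pass_wall_rate": 0.0,
            "pr_rejection_rate": 0.0,
            "avg_retry_count": 0.0,
        }
    first_pass = sum(1 for pr in prs if pr["retries"] == 0)
    rejected = sum(1 for pr in prs if pr["rejected"])
    return {
        "first_pass_wall_rate": first_pass / total,
        "pr_rejection_rate": rejected / total,
        "avg_retry_count": sum(pr["retries"] for pr in prs) / total,
    }
```

Whatever definitions a team settles on, recording them next to the baseline keeps week-over-week comparisons meaningful.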
505
|
+
|
|
506
|
+
---
|
|
507
|
+
|
|
508
|
+
## 11. Related documentation
|
|
509
|
+
|
|
510
|
+
| Document | Contents |
|
|
511
|
+
|----------|----------|
|
|
512
|
+
| [adoption.md](adoption.md) | Installation, bootstrap, rollback |
|
|
513
|
+
| [operations.md](operations.md) | **Canonical** thresholds, retry policy, forbidden ops |
|
|
514
|
+
| [coding-agent-l1.md](coding-agent-l1.md) | First L1 delegations (docs / test-fix) |
|
|
515
|
+
| [failure-taxonomy.md](failure-taxonomy.md) | Failure classification for outer loop |
|
|
516
|
+
| [telemetry-schema.md](telemetry-schema.md) | Required observability fields |
|
|
517
|
+
| [telemetry-artifacts.md](telemetry-artifacts.md) | Inner-loop JSON artifact format |
|
|
518
|
+
| [gh-aw-dogfood.md](gh-aw-dogfood.md) | Bounded gh-aw validation on sdlc-gh |
|
|
519
|
+
| [auth-boundaries.md](auth-boundaries.md) | Credential matrix per execution mode |
|
|
520
|
+
| [shared-config.md](shared-config.md) | Cross-repo harness distribution |
|
|
521
|
+
| [kpi-baseline.md](kpi-baseline.md) | Weekly KPI template |
|
|
522
|
+
| [exceptions/README.md](exceptions/README.md) | Policy exception records |
|
|
523
|
+
| [revert-playbook.md](revert-playbook.md) | Revert procedure |
|
|
524
|
+
| [infra/README.md](../infra/README.md) | Langfuse / OTel self-host |
|
|
525
|
+
|
|
526
|
+
---
|
|
527
|
+
|
|
528
|
+
## Appendix: Assumptions (July 2026)
|
|
529
|
+
|
|
530
|
+
- Agentic Workflows remain Public Preview with safe outputs, AWF firewall, and secretless defaults.
|
|
531
|
+
- Copilot SDK is Public Preview (MIT; JSON-RPC to CLI server).
|
|
532
|
+
- `gh models eval` supports `.prompt.yml` in CI (similarity, string match, LLM-as-judge).
|
|
533
|
+
- GitHub Copilot billing moved to AI credits (June 2026).
|
|
534
|
+
- Agent Skills are an open spec shared across Copilot, Claude Code, and Codex.
|
|
535
|
+
- Custom agents support tool restriction, handoffs, and org-level distribution.
|
|
@@ -0,0 +1,16 @@
|
|
|
1
|
+
# Auth Boundaries
|
|
2
|
+
|
|
3
|
+
Execution mode credential matrix (arch.md §4.2).
|
|
4
|
+
|
|
5
|
+
| Mode | Credentials | Scope | Audit |
|
|
6
|
+
|------|-------------|-------|-------|
|
|
7
|
+
| CLI / IDE | Developer local delegation | User's local permissions | Copilot audit logs |
|
|
8
|
+
| coding agent | Short-lived, repo-scoped token | Isolated VM, own branch only, cannot approve own PR | Actions + OTel |
|
|
9
|
+
| gh-aw | Secretless; proxy/gateway auth | AWF firewall, domain allowlist | gh aw audit, firewall logs |
|
|
10
|
+
| SDK | Proxy execution service | Limited operations only | Application audit log |
|
|
11
|
+
|
|
12
|
+
## Invariants
|
|
13
|
+
|
|
14
|
+
- No long-lived secrets in prompt context or agent-readable files
|
|
15
|
+
- Production credentials never in harness assets committed to git
|
|
16
|
+
- Exceptions require documented approval in `docs/exceptions/`
|