@pieerry/harness-kit 3.1.2 → 3.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (64) hide show
  1. package/.claude/.hk-version +1 -0
  2. package/.claude/agents/product-manager.md +2 -2
  3. package/.claude/agents/staff-software-engineer.md +2 -2
  4. package/.claude/commands/pipeline/continue.md +8 -8
  5. package/.claude/commands/pipeline/reset.md +4 -4
  6. package/.claude/commands/product-manager/prd.md +4 -4
  7. package/.claude/commands/product-manager/prp.md +7 -7
  8. package/.claude/commands/product-manager/run.md +5 -5
  9. package/.claude/commands/sse/dev.md +17 -12
  10. package/.claude/commands/sse/plan.md +8 -8
  11. package/.claude/commands/sse/pr-monitor.md +56 -0
  12. package/.claude/commands/sse/pr.md +21 -11
  13. package/.claude/commands/sse/run.md +7 -7
  14. package/.claude/commands/sse/test.md +15 -9
  15. package/.claude/conventions/README.md +12 -0
  16. package/.claude/hooks/activity-pre-read.sh +18 -0
  17. package/.claude/hooks/pipeline-session-start.sh +26 -2
  18. package/.claude/hooks/status-line.sh +17 -1
  19. package/.claude/plugins/product-manager/evals/prd-quality.md +3 -3
  20. package/.claude/plugins/product-manager/evals/prd-readiness.md +1 -1
  21. package/.claude/plugins/product-manager/evals/prp-context-readiness.md +5 -5
  22. package/.claude/plugins/product-manager/evals/prp-quality.md +1 -1
  23. package/.claude/plugins/product-manager/guides/pipeline.md +14 -14
  24. package/.claude/plugins/product-manager/guides/prd-guidelines.md +4 -4
  25. package/.claude/plugins/product-manager/guides/product-guidelines.md +16 -16
  26. package/.claude/plugins/product-manager/guides/prp-guidelines.md +16 -16
  27. package/.claude/plugins/product-manager/guides/writing-style.md +9 -9
  28. package/.claude/plugins/product-manager/sensors/prd-acceptance-criteria.md +6 -6
  29. package/.claude/plugins/product-manager/sensors/prd-structure.md +2 -2
  30. package/.claude/plugins/product-manager/sensors/prp-context-quality.md +6 -6
  31. package/.claude/plugins/product-manager/sensors/prp-links.md +2 -2
  32. package/.claude/plugins/product-manager/sensors/prp-structure.md +1 -1
  33. package/.claude/plugins/staff-software-engineer/evals/dev-quality.md +48 -0
  34. package/.claude/plugins/staff-software-engineer/evals/plan-quality.md +6 -6
  35. package/.claude/plugins/staff-software-engineer/evals/pr-quality.md +48 -0
  36. package/.claude/plugins/staff-software-engineer/evals/test-quality.md +48 -0
  37. package/.claude/plugins/staff-software-engineer/guides/coding-style.md +7 -7
  38. package/.claude/plugins/staff-software-engineer/guides/commit-style.md +3 -3
  39. package/.claude/plugins/staff-software-engineer/guides/conventions-override.md +13 -13
  40. package/.claude/plugins/staff-software-engineer/guides/pipeline.md +12 -12
  41. package/.claude/plugins/staff-software-engineer/hooks/post-eval-sse.sh +65 -0
  42. package/.claude/plugins/staff-software-engineer/hooks/post-write-sse.sh +60 -0
  43. package/.claude/plugins/staff-software-engineer/sensors/code-conventions.md +4 -4
  44. package/.claude/plugins/staff-software-engineer/sensors/dev-structure.md +46 -0
  45. package/.claude/plugins/staff-software-engineer/sensors/pr-structure.md +49 -0
  46. package/.claude/plugins/staff-software-engineer/sensors/test-coverage.md +6 -6
  47. package/.claude/plugins/staff-software-engineer/sensors/test-structure.md +46 -0
  48. package/.claude/plugins/staff-software-engineer/skills/backend/SKILL.md +8 -8
  49. package/.claude/plugins/staff-software-engineer/skills/devops/SKILL.md +3 -3
  50. package/.claude/plugins/staff-software-engineer/skills/mobile/SKILL.md +3 -3
  51. package/.claude/plugins/staff-software-engineer/skills/web/SKILL.md +2 -2
  52. package/.claude/scripts/activity.py +68 -0
  53. package/.claude/scripts/pr-monitor.py +113 -0
  54. package/.claude/scripts/stage-card.md +37 -34
  55. package/.claude/settings.json +74 -0
  56. package/.claude/settings.local.json +13 -1
  57. package/CLAUDE.md +6 -0
  58. package/README.md +161 -61
  59. package/VERSION +1 -1
  60. package/bin/hk.js +6 -1
  61. package/package.json +1 -1
  62. package/setup/install.sh +16 -8
  63. package/.claude/plugins/staff-software-engineer/hooks/post-eval-plan.sh +0 -43
  64. package/.claude/plugins/staff-software-engineer/hooks/post-write-plan.sh +0 -49
@@ -22,11 +22,11 @@ If any fail, abort and return sensor feedback.
22
22
 
23
23
  ## Phase 2: executor simulation
24
24
 
25
- You are a coding agent that has never seen this codebase. You are handed only this PRP. You may not ask the PM follow-up questions. Read it end-to-end and answer:
25
+ You are coding agent that has never seen this codebase. Handed only this PRP. Cannot ask PM follow-ups. Read end-to-end and answer:
26
26
 
27
- 1. Could you ship with only what is in the PRP plus linked files? (yes / partial / no)
28
- 2. List every place where you would need to stop and ask a human.
29
- 3. List every claim that looks plausible but you cannot verify from the PRP alone.
27
+ 1. Could you ship with only what is in PRP plus linked files? (yes / partial / no)
28
+ 2. List every place you would need to stop and ask a human.
29
+ 3. List every claim looking plausible but you cannot verify from PRP alone.
30
30
  4. Rate one-shot-success likelihood from 0 to 1.
31
31
 
32
32
  ## Output format
@@ -48,4 +48,4 @@ PM confirms or requests another iteration.
48
48
 
49
49
  Phase 1 fails: re-run failing sensor, regenerate failed sections.
50
50
  Phase 2 fails criteria: regenerate weakest section based on blocking_questions.
51
- Phase 3 rejects: collect PM feedback, retry the loop.
51
+ Phase 3 rejects: collect PM feedback, retry loop.
@@ -30,7 +30,7 @@ Blueprint specific enough that executor never decides naming, file location, or
30
30
  - 0: prose handwaving
31
31
 
32
32
  ### Validation executability (weight 15%)
33
- Validation gates are real commands the executor copy-pastes? Manual checks specific?
33
+ Validation gates are real commands executor copy-pastes? Manual checks specific?
34
34
 
35
35
  - 10: copy-paste ready
36
36
  - 5: commands with placeholders
@@ -1,13 +1,13 @@
1
1
  # Pipeline
2
2
 
3
- Shared rules for PRD and PRP generation. Edit retry, approval, publish, and token accounting here.
3
+ Shared rules for PRD and PRP generation. Edit retry, approval, publish, token accounting here.
4
4
 
5
5
  ## Stages per artifact
6
6
 
7
- 1. Generate the artifact using the template in skills/{prd|prp}/SKILL.md.
7
+ 1. Generate artifact using template in skills/{prd|prp}/SKILL.md.
8
8
  2. Apply sensors listed in skills/{prd|prp}/SKILL.md.
9
9
  3. Apply evals listed in skills/{prd|prp}/SKILL.md.
10
- 4. Append the approval marker. Publish hook fires.
10
+ 4. Append approval marker. Publish hook fires.
11
11
 
12
12
  ## Retry policy
13
13
 
@@ -18,7 +18,7 @@ Trigger on:
18
18
  - eval weighted total below threshold (default 8.0)
19
19
 
20
20
  Strategy:
21
- 1. Read the feedback.
21
+ 1. Read feedback.
22
22
  2. Regenerate only failed or low-scoring sections.
23
23
  3. Re-run sensors and eval.
24
24
 
@@ -42,7 +42,7 @@ Triggers hooks/post-eval-{prd|prp}.sh.
42
42
 
43
43
  Local: file stays in outputs/{prd|prp}/.
44
44
 
45
- Confluence: fires if JIRA_USERNAME and JIRA_API_TOKEN are set. Calls scripts/confluence-publish.py. Missing creds skips silently.
45
+ Confluence: fires if JIRA_USERNAME and JIRA_API_TOKEN set. Calls scripts/confluence-publish.py. Missing creds skips silently.
46
46
 
47
47
  After publish, hook appends:
48
48
  ```
@@ -51,34 +51,34 @@ After publish, hook appends:
51
51
 
52
52
  ## Token accounting
53
53
 
54
- Each phase brackets a measurable token window. Phases for this plugin:
54
+ Each phase brackets measurable token window. Phases:
55
55
  - prd-generate: from skill invocation to first save in outputs/prd/
56
- - prd-validate: from first save to approval marker (covers sensors and evals)
56
+ - prd-validate: from first save to approval marker (sensors + evals)
57
57
  - prp-generate: from skill invocation to first save in outputs/prp/
58
58
  - prp-validate: from first save to approval marker
59
59
 
60
- Markers live in outputs/.markers/{feature_id}.{phase}.{start|end}, each with `{"timestamp": ISO, "session_id": ""}`. Skill writes the .start marker; hooks write the .end markers.
60
+ Markers in outputs/.markers/{feature_id}.{phase}.{start|end}, each `{"timestamp": ISO, "session_id": ""}`. Skill writes .start; hooks write .end.
61
61
 
62
- After each artifact's eval passes, the publish hook runs scripts/token-phase.py for both of its phases. The script reads the Claude transcript JSONL, sums usage tokens (input, output, cache_read, cache_creation) within each window, and appends a phase entry to outputs/tokens/{feature_id}.json. Markers are deleted after consumption.
62
+ After eval passes, publish hook runs scripts/token-phase.py for both phases. Script reads Claude transcript JSONL, sums usage tokens (input, output, cache_read, cache_creation) within each window, appends phase entry to outputs/tokens/{feature_id}.json. Markers deleted after consumption.
63
63
 
64
- The post-eval hook also appends an inline summary comment to the artifact:
64
+ Post-eval hook also appends inline summary to artifact:
65
65
  ```
66
66
  <!-- tokens: outputs/tokens/{feature_id}.json in={N} out={N} cache_r={N} -->
67
67
  ```
68
68
 
69
- Future phases (dev, code review, launch) append new entries to the same outputs/tokens/{feature_id}.json by reusing the feature_id slug.
69
+ Future phases (dev, code review, launch) append entries to same outputs/tokens/{feature_id}.json by reusing feature_id slug.
70
70
 
71
- If the transcript is not readable, the script logs a warning and exits 0. Token accounting never blocks publish.
71
+ If transcript not readable, script logs warning and exits 0. Token accounting never blocks publish.
72
72
 
73
73
  ## Orchestrator order
74
74
 
75
75
  1. PRD: stages 1-4.
76
- 2. hooks/pre-prp-check.sh validates the PRD approved marker.
76
+ 2. hooks/pre-prp-check.sh validates PRD approved marker.
77
77
  3. PRP: stages 1-4.
78
78
  4. Return summary.
79
79
 
80
80
  ## Stop conditions
81
81
 
82
- - 3 failed attempts on the same artifact
82
+ - 3 failed attempts on same artifact
83
83
  - missing required input after one clarification
84
84
  - pre-prp-check.sh denies PRP start
@@ -2,11 +2,11 @@
2
2
 
3
3
  Rules specific to PRDs. General PM rules in product-guidelines.md.
4
4
 
5
- Business only. Zero technical detail. File paths, APIs, schemas belong in the PRP.
5
+ Business only. Zero technical detail. File paths, APIs, schemas belong in PRP.
6
6
 
7
7
  User stories: As a {persona}, I want {action}, so that {benefit}.
8
8
 
9
- Metrics need baseline, target, horizon in every row. No empty cells.
9
+ Metrics need baseline, target, horizon per row. No empty cells.
10
10
 
11
11
  Mermaid only, never ASCII.
12
12
 
@@ -18,10 +18,10 @@ Word count by stage:
18
18
  - Solution Review: 800-1200
19
19
  - Launch Readiness: 1500-2000
20
20
 
21
- Required sections at each stage:
21
+ Required sections per stage:
22
22
  - Team Kickoff: 1, 2, 8
23
23
  - Planning Review: + 5
24
24
  - Solution Review: + 4, 7
25
25
  - Launch Readiness: all
26
26
 
27
- Stage progression: do not skip stages. Each gate adds requirements, never removes them.
27
+ Stage progression: do not skip stages. Each gate adds requirements, never removes.
@@ -1,23 +1,23 @@
1
1
  # Product Guidelines
2
2
 
3
- How PMs think about product. Read before drafting a PRD.
3
+ How PMs think about product. Read before drafting PRD.
4
4
 
5
- ## We ship outcomes, not features
5
+ ## Ship outcomes, not features
6
6
 
7
- Every PRD touches a real user or customer cohort. Always answer:
8
- - Who pays for the pain today?
7
+ Every PRD touches real user or customer cohort. Always answer:
8
+ - Who pays for pain today?
9
9
  - Whose workflow changes?
10
- - What metric in the customer's business or experience moves?
10
+ - What metric in customer's business or experience moves?
11
11
 
12
- If you cannot answer all three, the PRD is not ready.
12
+ If you cannot answer all three, PRD not ready.
13
13
 
14
14
  ## Squads
15
15
 
16
- Defined per org in `context-library/business-info.md` and `context-library/squads/`. Multi-squad PRDs: name a primary squad, call out cross-squad interfaces.
16
+ Defined per org in `context-library/business-info.md` and `context-library/squads/`. Multi-squad PRDs: name primary squad, call out cross-squad interfaces.
17
17
 
18
18
  ## Customer hierarchy
19
19
 
20
- When prioritizing, make trade-offs between cohorts explicit, not invisible. A common ranking template (override per org):
20
+ When prioritizing, make trade-offs between cohorts explicit, not invisible. Common ranking template (override per org):
21
21
 
22
22
  1. Paying customers with high operational impact
23
23
  2. Reliability partners (SLAs, contracts)
@@ -26,9 +26,9 @@ When prioritizing, make trade-offs between cohorts explicit, not invisible. A co
26
26
 
27
27
  ## Metrics
28
28
 
29
- Prefer metrics already on existing dashboards. New metrics need an owner to instrument.
29
+ Prefer metrics already on existing dashboards. New metrics need owner to instrument.
30
30
 
31
- Strong examples (the shape of a strong metric — specific, owned, instrumented):
31
+ Strong examples (shape of strong metric — specific, owned, instrumented):
32
32
  - error rate of X by segment
33
33
  - first-attempt success rate
34
34
  - latency P95 of Y action
@@ -40,17 +40,17 @@ Weak:
40
40
  - "Improved efficiency" with no number
41
41
  - "Reduced friction" with no baseline
42
42
 
43
- ## Non-goals are mandatory
43
+ ## Non-goals mandatory
44
44
 
45
- Every PRD lists at least one non-goal. If you cannot think of one, you have not scoped the feature.
45
+ Every PRD lists at least one non-goal. If you cannot think of one, you have not scoped feature.
46
46
 
47
- ## Rollout is not "ship it"
47
+ ## Rollout not "ship it"
48
48
 
49
49
  Every PRD ends with phases, audience per phase, pass criteria per phase, rollback. Feature flags by default.
50
50
 
51
51
  ## Kill criteria
52
52
 
53
- Decide before launch what makes you turn the feature off. Examples:
53
+ Decide before launch what makes you turn feature off. Examples:
54
54
  - error rate above X%
55
55
  - latency P95 above Yms
56
56
  - adoption below Z% after N days
@@ -59,7 +59,7 @@ Decide before launch what makes you turn the feature off. Examples:
59
59
 
60
60
  In PRD:
61
61
  - why we are building
62
- - who feels the pain
62
+ - who feels pain
63
63
  - success metric and target
64
64
  - open questions about scope
65
65
 
@@ -68,7 +68,7 @@ In PRP:
68
68
  - API endpoints, validation commands
69
69
  - open questions about implementation
70
70
 
71
- If the question is "should we build this", PRD. If it is "how do we build this", PRP.
71
+ If question is "should we build this", PRD. If "how do we build this", PRP.
72
72
 
73
73
  ## Diagrams
74
74
 
@@ -1,32 +1,32 @@
1
1
  # PRP Guidelines
2
2
 
3
- A PRP is a PRD plus curated codebase intelligence plus an agent runbook. If the executor has to leave the document to figure out what to do, the PRP is incomplete.
3
+ PRP = PRD + curated codebase intelligence + agent runbook. If executor leaves document to figure out what to do, PRP incomplete.
4
4
 
5
5
  ## Three pillars
6
6
 
7
7
  ### Context
8
8
 
9
- Everything the executor needs, included or linked. Specific:
9
+ Everything executor needs, included or linked. Specific:
10
10
  - file paths with line numbers when localized
11
11
  - existing patterns to mimic, one concrete example each
12
- - external docs scoped to the relevant section
12
+ - external docs scoped to relevant section
13
13
  - known gotchas (concurrency, idempotency, retries, timezones)
14
14
 
15
- If the codebase has a similar feature already, point to it. Coding agents do their best work by analogy.
15
+ If codebase has similar feature, point to it. Coding agents work best by analogy.
16
16
 
17
17
  ### Implementation detail
18
18
 
19
19
  Pseudocode or numbered steps concrete enough to translate one-to-one. Include:
20
20
  - class names, method signatures, return types where decided
21
- - schema changes with the actual migration file or DDL
21
+ - schema changes with actual migration file or DDL
22
22
  - endpoint definitions: method, path, request/response shape
23
23
  - logging, metrics, alert hooks
24
24
 
25
- You do not write the code. You remove every "what should I name this" decision.
25
+ You do not write code. You remove every "what should I name this" decision.
26
26
 
27
27
  ### Validation gates
28
28
 
29
- Real commands the executor runs and passes. Vague checks fail.
29
+ Real commands executor runs and passes. Vague checks fail.
30
30
 
31
31
  Bad:
32
32
  - "Make sure tests pass"
@@ -36,29 +36,29 @@ Good:
36
36
  - `npm --prefix web run typecheck`
37
37
  - `./gradlew :billing-service:detekt`
38
38
 
39
- Manual gates count, but must be specific:
40
- - [ ] Open `https://staging.example.com/invoices/new` and verify the deadline badge renders for region `EU-WEST`.
39
+ Manual gates count, must be specific:
40
+ - [ ] Open `https://staging.example.com/invoices/new` and verify deadline badge renders for region `EU-WEST`.
41
41
 
42
- ## A good PRP
42
+ ## Good PRP
43
43
 
44
44
  1. Executor reads top-to-bottom, never tabs out except to view linked files.
45
- 2. No decisions left for the executor that change scope.
46
- 3. Every TODO has an owner.
47
- 4. Validation block is copy-pasteable.
45
+ 2. No decisions left for executor that change scope.
46
+ 3. Every TODO has owner.
47
+ 4. Validation block copy-pasteable.
48
48
 
49
49
  ## Anti-patterns
50
50
 
51
51
  - Prose mush where bullets would do
52
- - Restating the PRD instead of linking
52
+ - Restating PRD instead of linking
53
53
  - Hidden scope ("also we should refactor X")
54
54
  - Optimism about state ("there is probably a helper for this already")
55
55
  - Fabricated file paths or class names
56
56
 
57
- ## When to split a PRP
57
+ ## When to split PRP
58
58
 
59
59
  Split if:
60
60
  - scope crosses two repos with independent deploy cycles
61
61
  - one PR would exceed ~600 lines
62
62
  - independent validation gates pass or fail separately
63
63
 
64
- Otherwise keep it one. Context fragmentation costs more than a slightly longer doc.
64
+ Otherwise keep one. Context fragmentation costs more than slightly longer doc.
@@ -1,13 +1,13 @@
1
1
  # Writing Style
2
2
 
3
- How PRDs and PRPs read .
3
+ How PRDs and PRPs read.
4
4
 
5
5
  ## Voice
6
6
 
7
- - Direct. Lead with the point.
7
+ - Direct. Lead with point.
8
8
  - Specific. Real numbers, real names, real quotes. "47% abandon at step 3" beats "many users struggle".
9
9
  - Conversational, professional. Contractions ok. Use "we", not "I".
10
- - Action-oriented. Every section helps a reader decide or act.
10
+ - Action-oriented. Every section helps reader decide or act.
11
11
 
12
12
  ## Length
13
13
 
@@ -32,21 +32,21 @@ Dead giveaways for AI fluff:
32
32
 
33
33
  ## Punctuation
34
34
 
35
- - No em-dashes. Use commas, periods, parentheses, or a new sentence.
35
+ - No em-dashes. Use commas, periods, parentheses, or new sentence.
36
36
  - Oxford comma: yes.
37
- - One space after a period.
37
+ - One space after period.
38
38
 
39
39
  ## Negative parallelism
40
40
 
41
- Avoid. Lead with the positive:
41
+ Avoid. Lead with positive:
42
42
  - bad: "Don't use X. Use Y."
43
43
  - good: "Use Y."
44
44
 
45
45
  ## AI-detector test
46
46
 
47
- If a paragraph sounds like a LinkedIn post, rewrite it. Real human writing:
47
+ If paragraph sounds like LinkedIn post, rewrite. Real human writing:
48
48
  - occasional sentence starting with "And" or "But"
49
- - specific details a generic writer would not know
49
+ - specific details generic writer would not know
50
50
  - slightly imperfect rhythm
51
51
  - mid-paragraph asides
52
52
 
@@ -64,7 +64,7 @@ Not:
64
64
 
65
65
  ## Tables
66
66
 
67
- Use when you have comparison across 3+ items, structured data with same fields per row, or decision matrices. Otherwise use bullets.
67
+ Use when you have comparison across 3+ items, structured data with same fields per row, or decision matrices. Otherwise bullets.
68
68
 
69
69
  ## Diagrams
70
70
 
@@ -16,23 +16,23 @@ Mode: hard gate
16
16
 
17
17
  ## Checks
18
18
 
19
- User story format. Every bullet in Solution Overview that starts with "As a" must follow:
19
+ User story format. Every Solution Overview bullet starting with "As a" must follow:
20
20
 
21
21
  ```
22
22
  - As a {persona}, I want {action}, so that {benefit}.
23
23
  ```
24
24
 
25
- Metric completeness. Success Metrics table: every row must have non-empty Baseline, Target, Horizon. Cells with only "TBD", "N/A", "?", or empty fail.
25
+ Metric completeness. Success Metrics table: every row needs non-empty Baseline, Target, Horizon. Cells with only "TBD", "N/A", "?", or empty fail.
26
26
 
27
- Guardrails block. A "Guardrails" line must exist in Success Metrics.
27
+ Guardrails block. "Guardrails" line must exist in Success Metrics.
28
28
 
29
- Kill criteria. A "Kill criteria:" line must exist in Success Metrics and contain at least one digit.
29
+ Kill criteria. "Kill criteria:" line must exist in Success Metrics and contain at least one digit.
30
30
 
31
31
  Rollout phases. Rollout table needs at least 2 rows.
32
32
 
33
- Non-goals listed. Scope and Non-Goals must have a Non-goals subsection with 1-3 bullets.
33
+ Non-goals listed. Scope and Non-Goals needs Non-goals subsection with 1-3 bullets.
34
34
 
35
- Open gaps budget. Count of `NOT FOUND` markers must be 5 or fewer.
35
+ Open gaps budget. `NOT FOUND` marker count must be 5 or fewer.
36
36
 
37
37
  ## On failure
38
38
 
@@ -3,7 +3,7 @@
3
3
  Type: deterministic
4
4
  Mode: hard gate
5
5
 
6
- ## Required sections (all must be present, in order)
6
+ ## Required sections (all present, in order)
7
7
 
8
8
  - Problem and Hypothesis
9
9
  - Customers
@@ -36,4 +36,4 @@ Mode: hard gate
36
36
 
37
37
  ## On failure
38
38
 
39
- Block publish. Return list of missing sections, forbidden tokens, rule violations. Agent regenerates only failed parts.
39
+ Block publish. Return missing sections, forbidden tokens, rule violations. Agent regenerates failed parts only.
@@ -3,7 +3,7 @@
3
3
  Type: deterministic plus heuristic
4
4
  Mode: hard gate
5
5
 
6
- A PRP must give an executor enough context to ship without coming back to ask.
6
+ PRP must give executor enough context to ship without coming back to ask.
7
7
 
8
8
  ## Required sections
9
9
 
@@ -13,9 +13,9 @@ A PRP must give an executor enough context to ship without coming back to ask.
13
13
 
14
14
  ## Checks within Context
15
15
 
16
- File paths referenced. Repos and files touched table must reference at least 2 real file paths. A file path has `/` and a known extension (java, kt, ts, tsx, js, jsx, vue, py, sql, yaml, yml, md, sh, gradle).
16
+ File paths referenced. Repos and files touched table must reference at least 2 real file paths. File path has `/` and known extension (java, kt, ts, tsx, js, jsx, vue, py, sql, yaml, yml, md, sh, gradle).
17
17
 
18
- Patterns with concrete examples. Patterns to follow must have at least 1 pattern. Each pattern must include an `Example in codebase:` line pointing to a concrete path.
18
+ Patterns with concrete examples. Patterns to follow needs at least 1 pattern. Each pattern must include `Example in codebase:` line pointing to concrete path.
19
19
 
20
20
  Known gotchas. Subsection `### Known gotchas` must exist with at least 1 bullet.
21
21
 
@@ -23,9 +23,9 @@ External documentation. Subsection `### External documentation` must exist. Eith
23
23
 
24
24
  ## Checks within Implementation blueprint
25
25
 
26
- Numbered steps in a fenced code block.
26
+ Numbered steps in fenced code block.
27
27
 
28
- At least 1 numbered step references a class, file, or command (uppercase or path-shaped token).
28
+ At least 1 numbered step references class, file, or command (uppercase or path-shaped token).
29
29
 
30
30
  ## Checks within What
31
31
 
@@ -35,7 +35,7 @@ Success criteria must include at least 1 line matching:
35
35
 
36
36
  ## Open gaps budget
37
37
 
38
- Count of `NEEDS REVIEW`, `NOT FOUND`, or `TBD` markers must be 5 or fewer.
38
+ `NEEDS REVIEW`, `NOT FOUND`, or `TBD` marker count must be 5 or fewer.
39
39
 
40
40
  ## On failure
41
41
 
@@ -7,7 +7,7 @@ Implemented in scripts/link-validator.py. Same rules when agent self-checks.
7
7
 
8
8
  ## Hard checks (block)
9
9
 
10
- Source PRD resolvable. The "Source PRD:" line must point to a path that exists under outputs/prd/, relative to the PRP file, or relative to the repo root.
10
+ Source PRD resolvable. "Source PRD:" line must point to path existing under outputs/prd/, relative to PRP file, or relative to repo root.
11
11
 
12
12
  No localhost URLs. `http://localhost` or `http://127.0.0.1` not allowed as pinned references.
13
13
 
@@ -15,7 +15,7 @@ GitHub permalinks. Links like `github.com/.../blob/{main|master|develop}/...` no
15
15
 
16
16
  ## Soft checks (warn)
17
17
 
18
- Path format. Inline code spans that look like file paths (have `/` and end with extension) should end in a recognized extension.
18
+ Path format. Inline code spans looking like file paths (have `/` and end with extension) should end in recognized extension.
19
19
 
20
20
  URL syntax. http/https URLs must be syntactically valid.
21
21
 
@@ -3,7 +3,7 @@
3
3
  Type: deterministic
4
4
  Mode: hard gate
5
5
 
6
- ## Required sections (all must be present, in order)
6
+ ## Required sections (all present, in order)
7
7
 
8
8
  - Goal
9
9
  - Why
@@ -0,0 +1,48 @@
1
+ # Eval: Dev Quality
2
+
3
+ Type: LLM-judge
4
+ Mode: quality gate
5
+ Threshold: weighted total >= 8.0
6
+
7
+ Score each dimension 0-10. Cite file paths or commit SHAs when below 7. Weighted total = sum(score x weight%) / 100.
8
+
9
+ ## Rubric
10
+
11
+ ### Plan fidelity (weight 25%)
12
+ Did implementation cover every step in source plan? Any scope creep or missed items?
13
+
14
+ ### Files touched specificity (weight 15%)
15
+ Real, concrete paths listed (not placeholders)? Match what plan named?
16
+
17
+ ### Commit hygiene (weight 20%)
18
+ Conventional Commits prefix correct. Small commits (1-4 files, < 100 lines ideal). One concern per commit. Messages explain why, not what.
19
+
20
+ ### Test coverage (weight 20%)
21
+ Every new feature/bugfix has matching tests. Edge cases covered. Tests run before commit.
22
+
23
+ ### Convention adherence (weight 10%)
24
+ Coding style, framework version, package layout match `coding-style.md` and `conventions/{area}.md`.
25
+
26
+ ### Blocker quality (weight 10%)
27
+ If blockers exist, specific (file:line, error message). If none, state `none` explicitly.
28
+
29
+ ## On failure (total below 8.0)
30
+
31
+ Retry. Regenerate weakest section only. Max 3 attempts.
32
+
33
+ ## Output
34
+
35
+ ```json
36
+ {
37
+ "scores": {
38
+ "plan_fidelity": 0,
39
+ "files_touched": 0,
40
+ "commit_hygiene": 0,
41
+ "test_coverage": 0,
42
+ "convention_adherence": 0,
43
+ "blocker_quality": 0
44
+ },
45
+ "weighted_total": 0.0,
46
+ "feedback": ["dimension: specific issue with file or sha ref"]
47
+ }
48
+ ```
@@ -9,22 +9,22 @@ Score each dimension 0-10. Cite line numbers when below 7. Weighted total = sum(
9
9
  ## Rubric
10
10
 
11
11
  ### Scope clarity (weight 20%)
12
- Is what is being built clear? Tied to the source PRP?
12
+ What is being built clear? Tied to source PRP?
13
13
 
14
14
  ### Files touched specificity (weight 20%)
15
- Are the files and modules concrete? Real paths, not placeholders?
15
+ Files and modules concrete? Real paths, not placeholders?
16
16
 
17
17
  ### Execution flow (weight 20%)
18
- Are the steps actionable in order? Could a fresh engineer follow them?
18
+ Steps actionable in order? Could fresh engineer follow them?
19
19
 
20
20
  ### Risk awareness (weight 15%)
21
- Are real risks called out with mitigations?
21
+ Real risks called out with mitigations?
22
22
 
23
23
  ### Rollout (weight 15%)
24
- Is there a phased plan or feature flag strategy?
24
+ Phased plan or feature flag strategy?
25
25
 
26
26
  ### Tests (weight 10%)
27
- Does the plan name the test cases to cover?
27
+ Plan names test cases to cover?
28
28
 
29
29
  ## On failure (total below 8.0)
30
30
 
@@ -0,0 +1,48 @@
1
+ # Eval: PR Quality
2
+
3
+ Type: LLM-judge
4
+ Mode: quality gate
5
+ Threshold: weighted total >= 8.0
6
+
7
+ Score each dimension 0-10. Cite line refs from PR body when below 7. Weighted total = sum(score x weight%) / 100.
8
+
9
+ ## Rubric
10
+
11
+ ### Title quality (weight 20%)
12
+ Conventional Commits prefix correct (`feat`, `fix`, `chore`, etc). Scope present if relevant. <= 70 chars. Concrete, not vague.
13
+
14
+ ### Summary clarity (weight 20%)
15
+ Bullets explain what changed and why. Reviewer can grasp change without opening files. No marketing language.
16
+
17
+ ### Test plan completeness (weight 20%)
18
+ Markdown checklist covers golden path and edge cases. Items are checkable actions, not vague ("test it works").
19
+
20
+ ### Links and refs (weight 15%)
21
+ Source plan and dev paths linked. Ticket id (e.g. `PROJ-123`) referenced if branch carried one.
22
+
23
+ ### Risk callouts (weight 15%)
24
+ Migration risks, feature flags, rollback plan named when applicable. Marked `none` if truly nothing.
25
+
26
+ ### Draft / readiness signal (weight 10%)
27
+ Draft status matches stage of work. If `--ready`, CI gates passed and reviewers assignable. If draft, what's still pending is stated.
28
+
29
+ ## On failure (total below 8.0)
30
+
31
+ Retry. Regenerate weakest section only. Max 3 attempts.
32
+
33
+ ## Output
34
+
35
+ ```json
36
+ {
37
+ "scores": {
38
+ "title_quality": 0,
39
+ "summary_clarity": 0,
40
+ "test_plan_completeness": 0,
41
+ "links_and_refs": 0,
42
+ "risk_callouts": 0,
43
+ "draft_readiness": 0
44
+ },
45
+ "weighted_total": 0.0,
46
+ "feedback": ["dimension: specific issue with body line ref"]
47
+ }
48
+ ```
@@ -0,0 +1,48 @@
1
+ # Eval: Test Quality
2
+
3
+ Type: LLM-judge
4
+ Mode: quality gate
5
+ Threshold: weighted total >= 8.0
6
+
7
+ Score each dimension 0-10. Cite test names or output lines when below 7. Weighted total = sum(score x weight%) / 100.
8
+
9
+ ## Rubric
10
+
11
+ ### Result accuracy (weight 25%)
12
+ Does Result field (pass | fail) match exit code and counts? No misreporting.
13
+
14
+ ### Failure detail (weight 25%)
15
+ If failures occurred, failing test names listed with enough context (file:line, assertion message)? If pass, Failures explicitly `none`?
16
+
17
+ ### Coverage of changes (weight 20%)
18
+ Tests run cover files changed in dev phase? Gaps called out?
19
+
20
+ ### Command reproducibility (weight 15%)
21
+ Command field real shell command another engineer could run as-is? Includes filters or args if used.
22
+
23
+ ### Duration sanity (weight 10%)
24
+ Duration reported with unit? Plausible for suite size?
25
+
26
+ ### Regression risk callouts (weight 5%)
27
+ Flakes or slow tests flagged for follow-up?
28
+
29
+ ## On failure (total below 8.0)
30
+
31
+ Retry. Regenerate weakest section only. Max 3 attempts.
32
+
33
+ ## Output
34
+
35
+ ```json
36
+ {
37
+ "scores": {
38
+ "result_accuracy": 0,
39
+ "failure_detail": 0,
40
+ "coverage_of_changes": 0,
41
+ "command_reproducibility": 0,
42
+ "duration_sanity": 0,
43
+ "regression_risk": 0
44
+ },
45
+ "weighted_total": 0.0,
46
+ "feedback": ["dimension: specific issue with test name or output ref"]
47
+ }
48
+ ```