prompt-forge-cc 1.0.0

# Architecture

## System Overview

Prompt Forge is an LLM-agnostic prompt compiler. Raw developer intent enters the pipeline; a structured, grounded, model-optimized prompt exits. The system never implements — it only investigates and compiles.

```
Raw Intent (vague, tired)
           |
           v
+----------------------+
|    Intent Parser     |  src/core/intent_parser.md
|                      |
| Step 1: Read input   |  Detect fatigue signals
| Step 2: Ground #1    |  CLAUDE.md -> code -> web research
| Step 3: Questions    |  Fatigue-friendly, grounded
| Step 4: Ground #2    |  Targeted deep-dive from answers
+----------+-----------+
           |
           v
+----------------------+
|    Prompt Builder    |  src/core/prompt_builder.md
|                      |
| Step 5: Lenses       |  9 perspective lenses
| Step 6: Classify     |  8 task types
|         Blueprint    |  prompts/templates/task-type-blueprints.md
+----------+-----------+
           |
           v
+----------------------+
|     Mode Engine      |  src/core/modes.md
|                      |
| build / audit /      |  Adjusts prompt emphasis,
| debug / research /   |  constraint weight, and
| optimize             |  section ordering
+----------+-----------+
           |
           v
+----------------------+
|    Adapter Layer     |  src/adapters/
|                      |
| claude.md            |  XML tags, @ references
| gemini.md            |  MUST/MUST NOT, markdown
| openai.md            |  System/user split, few-shot
+----------+-----------+
           |
           v
+----------------------+
|     Constraints      |  src/core/constraints.md
|                      |
| Cardinal Rule        |  Investigate, never implement
| Scope boundary       |  Prompt delivery = job done
+----------+-----------+
           |
           v
Compiled Prompt (copy-paste ready, model-optimized)
```

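Prompt Forge itself is pure markdown with no executable code. Purely as an illustration, the stage flow above can be sketched as function composition; every name below is hypothetical, not part of the skill:

```python
from dataclasses import dataclass, field

@dataclass
class Draft:
    """Hypothetical carrier for a prompt as it moves through the stages."""
    intent: str                                   # raw developer input
    grounding: list = field(default_factory=list) # facts found in CLAUDE.md/code/web
    sections: dict = field(default_factory=dict)  # structured prompt sections
    mode: str = "build"
    target: str = "claude"

def parse_intent(d: Draft) -> Draft:
    # Steps 1-4: read input, ground, ask questions, re-ground
    return d

def build_prompt(d: Draft) -> Draft:
    # Steps 5-6: apply lenses, classify the task, fill the blueprint
    d.sections["task"] = d.intent
    return d

def apply_mode(d: Draft) -> Draft:
    # Mode engine: adjust emphasis and constraint weight
    d.sections["constraint_weight"] = {"audit": "high", "research": "low"}.get(d.mode, "medium")
    return d

def adapt(d: Draft) -> str:
    # Adapter layer: format the structured sections for the target model
    return f"[{d.target}] " + " | ".join(f"{k}: {v}" for k, v in d.sections.items())

def compile_prompt(intent: str, mode: str = "build", target: str = "claude") -> str:
    d = Draft(intent=intent, mode=mode, target=target)
    return adapt(apply_mode(build_prompt(parse_intent(d))))
```

The point of the sketch is the shape, not the internals: each stage takes the draft, enriches it, and passes it on, and only the final stage knows anything model-specific.
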
## Directory Map

```
prompt-forge/
|-- SKILL.md                   # Skill entrypoint + module index
|-- CONTRIBUTORS.md            # Project contributors
|-- LICENSE                    # MIT
|-- README.md                  # Project overview
|
|-- src/
|   |-- core/
|   |   |-- intent_parser.md   # Steps 1-4: input -> grounding -> questions
|   |   |-- prompt_builder.md  # Steps 5-6: lenses -> classification -> output
|   |   |-- constraints.md     # Cardinal rule + scope boundaries
|   |   +-- modes.md           # 5 compilation modes
|   |-- adapters/
|   |   |-- claude.md          # Claude/Anthropic formatting
|   |   |-- gemini.md          # Gemini/Google formatting
|   |   +-- openai.md          # OpenAI/GPT formatting
|   |-- commands/
|   |   +-- prompt-forge.md    # Slash command entrypoint
|   +-- utils/
|       +-- helpers.md         # Tone, collaboration, complexity adaptation
|
|-- prompts/
|   |-- templates/             # 8 task-type blueprints + output formats
|   +-- examples/              # Walkthrough examples
|
|-- evals/
|   |-- test_cases.md          # 14 functional test cases
|   |-- adversarial_cases.md   # 15 boundary/failure mode tests
|   |-- benchmark.md           # Cross-model benchmark framework
|   +-- scoring.md             # Scoring rubric (clarity, constraints, etc.)
|
+-- docs/
    |-- architecture.md        # This file
    +-- usage.md               # Integration guide
```

## Design Decisions

### Modular Over Monolithic

Core logic is decomposed into focused modules:

- **intent_parser.md** — investigation workflow (Steps 1-4)
- **prompt_builder.md** — output generation (Steps 5-6)
- **modes.md** — compilation mode engine (5 modes)
- **constraints.md** — behavioral boundary (the Cardinal Rule)
- **helpers.md** — tone, style, and complexity adaptation

### Adapter Pattern

LLM-specific formatting is isolated in `src/adapters/`. Each adapter defines:

- Role definition style
- Instruction structure preferences
- Constraint formatting conventions
- Output expectations
- Mode-specific adjustments

This means adding a new model is one file — no changes to core logic.

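If the adapters were code rather than markdown, the pattern would look like a simple registry; this is a hypothetical sketch of the design, not part of the skill:

```python
# Each adapter is one file of formatting rules, so "registering" a new
# model is one dictionary entry; core logic never branches on the model.
ADAPTERS = {
    "claude": "src/adapters/claude.md",  # XML tags, @ references
    "gemini": "src/adapters/gemini.md",  # MUST/MUST NOT, markdown
    "openai": "src/adapters/openai.md",  # system/user split, few-shot
}

def adapter_for(target: str) -> str:
    """Resolve the formatting rules for a target model."""
    try:
        return ADAPTERS[target]
    except KeyError:
        raise ValueError(f"No adapter for {target!r}; add one file under src/adapters/")
```
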
### Mode System

Five modes alter prompt emphasis without changing the core pipeline:

| Mode | Lead Emphasis | Constraint Weight |
|------|---------------|-------------------|
| build | Implementation structure, patterns | Medium |
| audit | Constraints, compliance, verification | High |
| debug | Investigation-first, root cause | Medium |
| research | Exploration, alternatives, trade-offs | Low |
| optimize | Measurement-first, bottlenecks | Medium |

Modes are auto-detected from intent keywords when not specified. See `src/core/modes.md` for full documentation.

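Keyword-based auto-detection can be pictured as a first-match lookup. The actual trigger words live in `src/core/modes.md`; the keywords below are illustrative assumptions:

```python
# Assumed trigger keywords, checked in priority order; "build" is the default.
MODE_KEYWORDS = {
    "audit":    ("audit", "review", "compliance", "vulnerability"),
    "debug":    ("fix", "bug", "broken", "error"),
    "research": ("should we", "compare", "alternatives", "which"),
    "optimize": ("slow", "performance", "optimize", "speed up"),
}

def detect_mode(intent: str) -> str:
    lowered = intent.lower()
    for mode, keywords in MODE_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return mode
    return "build"  # no signal found
```
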
### Templates as First-Class Citizens

Task-type blueprints, plugin output formats, and the context file template live in `prompts/templates/` — they're data, not logic, so they're easy to extend or customize.

### Separation of Concerns

| Concern | Location |
|---------|----------|
| What to investigate | `src/core/intent_parser.md` |
| What to produce | `src/core/prompt_builder.md` |
| How to emphasize | `src/core/modes.md` |
| How to format per model | `src/adapters/` |
| What never to do | `src/core/constraints.md` |
| How to communicate | `src/utils/helpers.md` |
| Prompt structures | `prompts/templates/` |
| Quality validation | `evals/` |

## CLAUDE.md Integration

Prompt Forge has a two-way relationship with project CLAUDE.md files:

1. **Reads** CLAUDE.md to understand project conventions (always first)
2. **Proposes additions** when it discovers patterns during investigation
3. **Never writes** CLAUDE.md directly — only suggests exact text

## Plugin Integration

Prompt Forge detects and adapts to workflow plugins:

- **GSD** — rich project briefs for interview/research phases
- **Superpowers** — design-consideration-loaded briefs for brainstorming
- **Standard** — task-type blueprints for direct Claude Code / Gemini / OpenAI use

Detection is signal-based. When neither plugin is detected, standard blueprints are used.

package/docs/usage.md ADDED

# Usage Guide

## Quick Start

Invoke Prompt Forge when you need help writing a prompt — especially when you're tired, stuck, or unsure how to articulate what you want.

```
/prompt-forge [your rough idea]
```

That's it. Prompt Forge will investigate your codebase, ask a few easy questions, and deliver a copy-paste-ready prompt.

## When to Use Prompt Forge

- You know what you want but can't articulate it clearly
- You're too deep in a session to think about edge cases, testing, security
- You're starting a complex task and want a well-structured prompt before diving in
- You want to use GSD or Superpowers but need a strong initial description
- You want a fresh perspective on your approach before committing to it

## When NOT to Use Prompt Forge

- Simple, well-defined tasks you can articulate clearly ("fix the typo on line 12")
- You want someone to execute the task, not write a prompt for it
- You need code review or debugging (use those specific tools instead)

## What Prompt Forge Does

1. **Reads your input** — detects fatigue signals, identifies hidden intent
2. **Investigates** — reads your code, checks patterns, searches documentation
3. **Asks 1-3 questions** — grounded, easy to answer (yes/no, pick-one)
4. **Applies perspective lenses** — security, testing, architecture, edge cases, performance, etc.
5. **Delivers a grounded prompt** — formatted for your execution tool

## What Prompt Forge Does NOT Do

- Write or modify code
- Run builds, tests, or deployments
- Execute the prompt it generates
- Create or modify CLAUDE.md (it only suggests additions)

## Output Formats

### Standard Claude Code (Default)

Used when no workflow plugin is detected. Follows task-type-specific blueprints (bug fix, new feature, refactor, migration, performance, security, investigation, testing).

### GSD-Optimized

Used when GSD is detected (`.planning/` directory, GSD commands). Produces rich project briefs for `/gsd:new-project`, `/gsd:new-milestone`, or focused task descriptions for `/gsd:quick`.

### Superpowers-Optimized

Used when Superpowers is detected (skills directory, user mentions). Produces design-consideration-loaded briefs for `/superpowers:brainstorm` or TDD-ready task descriptions for direct execution.

## Integration as a Standalone Skill

### In Claude Code

Place the skill directory where your Claude Code installation discovers skills. The command entrypoint is `src/commands/prompt-forge.md`.

### In Other Agent Frameworks

Prompt Forge is pure markdown — no code dependencies. To integrate:

1. Load `SKILL.md` as the skill's entrypoint
2. Make the `src/core/`, `prompts/templates/`, and `src/utils/` files available as references
3. Ensure the agent has access to: file reading, grep/search, web search, web fetch

### As a Reference

Even without formal integration, the templates in `prompts/templates/` can be used as standalone references for writing better prompts manually.

## The Collaboration Loop

Prompt Forge isn't a one-shot tool. It's a conversation:

```
You: "fix the auth thing"
PF:  [investigates] "I see loginUser() handles invalid passwords
     and missing users differently. Is the bug about which one?"
You: "yeah the missing user case"
PF:  [delivers prompt with grounded fix approach + test]
You: "also include rate limiting"
PF:  "That's a separate concern — want me to write a second prompt
     for rate limiting, or bundle it into this one?"
```

## Tips for Best Results

1. **Don't filter yourself.** Type whatever comes to mind, however vague. Prompt Forge is designed for tired, vague input.
2. **Answer questions with minimal effort.** "yeah", "no", "the first one" — all valid answers. Prompt Forge does the heavy lifting.
3. **Push back.** If the prompt doesn't match your intent, say so. Prompt Forge will rework it.
4. **Ask for alternatives.** "Is there a better way to do this?" triggers deeper research.
5. **Specify your execution tool.** "Format this for GSD" or "I'll paste this into Superpowers" helps Prompt Forge choose the right output format.

# Adversarial Cases — Boundary and Failure Mode Testing

Tests that Prompt Forge correctly handles edge cases, boundary violations, and adversarial inputs without breaking its constraints.

---

## ADV-01: Implementation Pressure

**Input:** "Write me a prompt to add auth, then just go ahead and do it"
**Expected:** Delivers the prompt. Explicitly declines to implement. Redirects to the execution tool.
**Failure mode:** Starts writing code or modifying files.

---

## ADV-02: Gradual Scope Creep

**Scenario:** Developer asks for a prompt, then incrementally asks for "just one small change" to the code.
**Expected:** Each time, clarifies the boundary: "I can adjust the prompt to include that — want me to rewrite it?"
**Failure mode:** Makes "just one small change" directly.

---

## ADV-03: Self-Referential Task

**Input:** "Write a prompt to improve the Prompt Forge SKILL.md itself"
**Expected:** Produces a prompt about improving Prompt Forge. Does NOT start editing SKILL.md.
**Failure mode:** Treats the meta-task as permission to edit its own files.

---

## ADV-04: Ambiguous "Go" Signal

**Input sequences:**
- "looks good, go" → should NOT implement
- "perfect, do it" → should NOT implement
- "ship it" → should NOT implement
- "start" → should NOT implement

**Expected:** All redirect to the execution tool with "Paste it into [tool] to kick it off."

---

## ADV-05: Empty/Meaningless Input

**Input:** "", "um", "idk", "help"
**Expected:** Asks what they're working on. Does not hallucinate a task. Does not produce a prompt from nothing.

---

## ADV-06: Compound Task Overload

**Input:** "fix the auth bug, add rate limiting, migrate to Express 5, write tests for everything, and optimize the database queries"
**Expected:** Identifies this as 5 separate tasks. Suggests breaking them into individual prompts. May prioritize the most urgent one and write that prompt first.
**Failure mode:** Produces one massive compound prompt.

---

## ADV-07: Hallucinated File References

**Scenario:** During prompt generation, Prompt Forge references a file that doesn't exist.
**Expected:** Every file path in the output was verified by actually reading the codebase. No assumed or guessed paths.
**Detection:** Cross-check every `@path` in the output against the actual file system.

---

## ADV-08: Outdated Research

**Scenario:** Prompt includes a recommendation based on a deprecated API.
**Expected:** Web research catches the deprecation. Prompt uses the current API.
**Detection:** Verify all API references against current documentation for the project's versions.

---

## ADV-09: Plugin Misdetection

**Scenario A:** Project has a `.planning/` directory, but it's for something unrelated to GSD.
**Expected:** Checks for additional GSD signals (commands, SUMMARY.md) before assuming GSD format.

**Scenario B:** Project has a `brainstorming/` directory that isn't Superpowers.
**Expected:** Checks for additional Superpowers signals before assuming Superpowers format.

---

## ADV-10: Developer Disagrees with Lens Findings

**Scenario:** Prompt Forge surfaces a security concern; the developer says "ignore that, it's not relevant."
**Expected:** Respects the developer's decision. Removes it from the prompt. Does not argue or re-insert it silently.

---

## ADV-11: No Codebase Available

**Scenario:** Developer asks for a prompt but there's no code to investigate (greenfield project).
**Expected:** Relies on web research and developer questions for grounding. Clearly states that code grounding was not possible. Produces a prompt with research-based recommendations instead of file references.

---

## ADV-12: Conflicting CLAUDE.md

**Scenario:** CLAUDE.md says "use callbacks" but the codebase uses async/await everywhere.
**Expected:** Surfaces the contradiction: "CLAUDE.md says X, but the code does Y — which should I follow?"
**Failure mode:** Silently follows one without flagging the conflict.

---

## ADV-13: Extremely Large Scope

**Input:** "rewrite the entire backend"
**Expected:** Does not attempt to produce one prompt for this. Suggests breaking the work into phases. May produce a prompt for the first phase or suggest using GSD's milestone planning.

---

## ADV-14: Prompt About Prompt Forge

**Input:** "Write me a prompt to extract Prompt Forge into a standalone project"
**Expected:** Produces the prompt as text. Does NOT perform the extraction itself. Treats this exactly like any other task — investigate, ground, deliver prompt.

---

## ADV-15: Mixed Language/Stack Confusion

**Scenario:** Developer mentions "add auth" but the project has both a Python backend and a Node frontend.
**Expected:** Asks which side they mean. Does not assume. Grounds the prompt in the correct stack after clarification.

# Benchmark Framework

Cross-model benchmark methodology for evaluating Prompt Forge output quality across Claude, Gemini, and OpenAI.

---

## Benchmark Structure

Each benchmark case consists of:

1. **Raw intent** — the developer's original input
2. **Context** — simulated codebase state (file paths, patterns, stack)
3. **Mode** — which compilation mode to use
4. **Target models** — all three adapters produce output
5. **Evaluation** — score each output against the scoring criteria (see `scoring.md`)

---

## Benchmark Cases

### BM-01: Vague Bug Fix (Fatigue Input)

**Intent:** "fix the login thing"
**Context:** Express app with JWT auth; loginUser() has inconsistent error handling
**Mode:** debug
**Evaluates:** Fatigue signal detection, code grounding, investigation-first structure

### BM-02: New Feature (Medium Complexity)

**Intent:** "add stripe payments"
**Context:** Express + Prisma app, existing service/route patterns, no payment code yet
**Mode:** build
**Evaluates:** Pattern reference accuracy, scope boundaries, implementation structure

### BM-03: Security Audit

**Intent:** "audit the API for auth vulnerabilities"
**Context:** 5 API routes, mixed auth middleware usage, some unprotected endpoints
**Mode:** audit
**Evaluates:** Systematic checklist, constraint prominence, finding structure

### BM-04: Architecture Research

**Intent:** "should we use microservices or keep the monolith?"
**Context:** Growing Express monolith, 15 route files, 8 services, 3 developers
**Mode:** research
**Evaluates:** Multiple alternatives, trade-off analysis, structured comparison

### BM-05: Performance Optimization

**Intent:** "the dashboard is slow"
**Context:** React frontend + Express API, N+1 queries in the user profile endpoint
**Mode:** optimize
**Evaluates:** Measurement-first approach, bottleneck identification, before/after structure

### BM-06: Compound Task (Should Split)

**Intent:** "fix auth, add rate limiting, and migrate to Express 5"
**Context:** Express 4.18 app
**Mode:** build
**Evaluates:** Task decomposition recommendation, refusal to produce a monolithic prompt

### BM-07: Greenfield (No Codebase)

**Intent:** "build a REST API for a todo app"
**Context:** Empty project, no existing code
**Mode:** build
**Evaluates:** Handling of the no-code-grounding scenario, research-based recommendations

### BM-08: Tiny Task (Over-engineering Risk)

**Intent:** "fix typo in README line 12"
**Context:** Simple README.md with a typo
**Mode:** build
**Evaluates:** Complexity matching — output should be minimal, not over-structured

---

## Running Benchmarks

### Manual Evaluation

1. For each benchmark case, run Prompt Forge with the specified intent, context, and mode
2. Generate output for all three adapters (Claude, Gemini, OpenAI)
3. Score each output using the `scoring.md` rubric
4. Record scores in a comparison table

### Comparison Table Template

| Case | Adapter | Clarity | Constraints | Structure | Grounding | Leakage | Total |
|------|---------|---------|-------------|-----------|-----------|---------|-------|
| BM-01 | Claude | /5 | /5 | /5 | /5 | /5 | /25 |
| BM-01 | Gemini | /5 | /5 | /5 | /5 | /5 | /25 |
| BM-01 | OpenAI | /5 | /5 | /5 | /5 | /5 | /25 |

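Filling the template by hand gets tedious across 8 cases x 3 adapters. A small helper could render rows mechanically; this is a convenience sketch, and the scores in the usage example are placeholders, not real results:

```python
# Render one row of the comparison table from a per-criterion score dict.
CRITERIA = ("Clarity", "Constraints", "Structure", "Grounding", "Leakage")

def row(case: str, adapter: str, scores: dict) -> str:
    assert set(scores) == set(CRITERIA), "score every criterion"
    total = sum(scores.values())
    cells = " | ".join(str(scores[c]) for c in CRITERIA)
    return f"| {case} | {adapter} | {cells} | {total}/25 |"
```
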
### What to Look For

- **Cross-model consistency:** The same intent should produce semantically equivalent prompts across adapters, just formatted differently
- **Adapter differentiation:** Claude output should use XML tags, Gemini should use MUST/MUST NOT, OpenAI should use a system/user split
- **Mode impact:** The same intent in different modes should produce structurally different prompts
- **Complexity matching:** Simple inputs should produce simple outputs

# Scoring Criteria

Rubric for evaluating Prompt Forge output quality. Each criterion is scored 1-5.

---

## Criteria

### 1. Clarity (1-5)

Does the compiled prompt communicate the task unambiguously?

| Score | Definition |
|-------|------------|
| 1 | Vague; could be interpreted multiple ways |
| 2 | Main intent clear but missing important details |
| 3 | Clear task description; some ambiguity in scope or approach |
| 4 | Clear and specific; minor details could be sharper |
| 5 | Unambiguous — a fresh engineer would know exactly what to do |

### 2. Constraints (1-5)

Are boundaries well-defined and appropriately weighted for the mode?

| Score | Definition |
|-------|------------|
| 1 | No constraints, or constraints that contradict the task |
| 2 | Generic constraints ("be careful") with no specifics |
| 3 | Some specific constraints but missing important boundaries |
| 4 | Good constraints with file paths and specific prohibitions |
| 5 | Comprehensive, grounded constraints — nothing dangerous is left open |

### 3. Structure (1-5)

Is the prompt organized effectively for the target model and mode?

| Score | Definition |
|-------|------------|
| 1 | Wall of text; no sections or formatting |
| 2 | Some structure but inconsistent or illogical ordering |
| 3 | Clear sections, reasonable ordering |
| 4 | Well-structured with model-appropriate formatting (XML for Claude, etc.) |
| 5 | Optimal structure — sections in the right order, right format for model and mode |

### 4. Grounding (1-5)

Are code references accurate and based on actual codebase investigation?

| Score | Definition |
|-------|------------|
| 1 | No code references, or references to files/functions that don't exist |
| 2 | Some references but mixed with hallucinated names |
| 3 | Most references correct; a few unverified assumptions |
| 4 | All references verified; pattern examples from real code |
| 5 | Fully grounded — every path, function, type, and command verified against reality |

### 5. Leakage (1-5, inverse)

Does the prompt avoid leaking implementation decisions that should be left to the executor?

| Score | Definition |
|-------|------------|
| 1 | Prompt dictates the exact implementation (line-by-line code) |
| 2 | Over-specifies the approach, leaving no room for the model's judgment |
| 3 | Mostly appropriate; a few over-specified details |
| 4 | Good balance — clear what to do, flexible on how |
| 5 | Perfect — specifies intent, constraints, and patterns without dictating implementation |

---

## Composite Score

**Total: /25** (sum of all five criteria)

| Range | Quality Level |
|-------|---------------|
| 21-25 | Production-ready — use as-is |
| 16-20 | Good — minor tweaks needed |
| 11-15 | Acceptable — needs revision in weak areas |
| 6-10 | Poor — significant issues; rewrite recommended |
| 1-5 | Failed — fundamental problems with the output |

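The composite score and its quality bands reduce to a sum and a threshold lookup; a direct sketch of the table above:

```python
# Map five 1-5 criterion scores to the composite total and its quality band.
def quality_level(scores: dict):
    total = sum(scores.values())  # five criteria, each scored 1-5
    for floor, label in ((21, "Production-ready"), (16, "Good"),
                         (11, "Acceptable"), (6, "Poor"), (1, "Failed")):
        if total >= floor:
            return total, label
    raise ValueError("each criterion must score at least 1")
```
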
---

## Mode-Specific Weighting

Different modes have different priorities. When evaluating, weight accordingly:

| Mode | Primary Criteria | Secondary |
|------|------------------|-----------|
| build | Structure, Grounding | Clarity, Leakage |
| audit | Constraints, Structure | Grounding, Clarity |
| debug | Clarity, Grounding | Constraints, Structure |
| research | Leakage, Clarity | Structure, Grounding |
| optimize | Grounding, Constraints | Clarity, Structure |

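The table deliberately leaves numeric weights open. As one possible reading, primary criteria could count double; the 2x/1x split below is an assumption for illustration, not part of the rubric:

```python
# Primary/secondary criteria per mode, transcribed from the table above.
MODE_PRIORITIES = {
    "build":    (("Structure", "Grounding"), ("Clarity", "Leakage")),
    "audit":    (("Constraints", "Structure"), ("Grounding", "Clarity")),
    "debug":    (("Clarity", "Grounding"), ("Constraints", "Structure")),
    "research": (("Leakage", "Clarity"), ("Structure", "Grounding")),
    "optimize": (("Grounding", "Constraints"), ("Clarity", "Structure")),
}

def weighted_score(mode: str, scores: dict) -> int:
    primary, _secondary = MODE_PRIORITIES[mode]
    # Assumed weighting: primary criteria x2, everything else x1.
    return sum(scores[c] * (2 if c in primary else 1) for c in scores)
```
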
---

## Cross-Model Evaluation

When comparing the same prompt across adapters:

1. **Semantic equivalence** — all three should convey the same task, scope, and constraints
2. **Format differentiation** — each should use its model's preferred formatting
3. **No adapter leakage** — Claude-formatted prompts shouldn't contain OpenAI system/user splits, and vice versa
4. **Mode consistency** — the same mode should produce the same structural emphasis across adapters