prompt-forge-cc 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +215 -0
- package/SKILL.md +77 -0
- package/bin/install.js +252 -0
- package/docs/architecture.md +165 -0
- package/docs/usage.md +99 -0
- package/evals/adversarial_cases.md +123 -0
- package/evals/benchmark.md +101 -0
- package/evals/scoring.md +106 -0
- package/evals/test_cases.md +173 -0
- package/package.json +41 -0
- package/prompts/examples/example-session.md +132 -0
- package/prompts/templates/anthropic-prompting-guide.md +119 -0
- package/prompts/templates/context-file-template.md +100 -0
- package/prompts/templates/gsd-output-format.md +75 -0
- package/prompts/templates/superpowers-output-format.md +89 -0
- package/prompts/templates/task-type-blueprints.md +317 -0
- package/src/adapters/claude.md +86 -0
- package/src/adapters/gemini.md +89 -0
- package/src/adapters/openai.md +100 -0
- package/src/commands/prompt-forge.md +34 -0
- package/src/core/constraints.md +70 -0
- package/src/core/intent_parser.md +104 -0
- package/src/core/modes.md +116 -0
- package/src/core/prompt_builder.md +123 -0
- package/src/utils/helpers.md +100 -0
package/docs/architecture.md
ADDED

@@ -0,0 +1,165 @@

# Architecture

## System Overview

Prompt Forge is an LLM-agnostic prompt compiler. Raw developer intent enters the pipeline; a structured, grounded, model-optimized prompt exits. The system never implements — it only investigates and compiles.

```
Raw Intent (vague, tired)
           |
           v
+----------------------+
|    Intent Parser     |  src/core/intent_parser.md
|                      |
| Step 1: Read input   |  Detect fatigue signals
| Step 2: Ground #1    |  CLAUDE.md -> code -> web research
| Step 3: Questions    |  Fatigue-friendly, grounded
| Step 4: Ground #2    |  Targeted deep-dive from answers
+----------+-----------+
           |
           v
+----------------------+
|    Prompt Builder    |  src/core/prompt_builder.md
|                      |
| Step 5: Lenses       |  9 perspective lenses
| Step 6: Classify     |  8 task types
| Step 6: Blueprint    |  prompts/templates/task-type-blueprints.md
+----------+-----------+
           |
           v
+----------------------+
|     Mode Engine      |  src/core/modes.md
|                      |
| build / audit /      |  Adjusts prompt emphasis,
| debug / research /   |  constraint weight, and
| optimize             |  section ordering
+----------+-----------+
           |
           v
+----------------------+
|    Adapter Layer     |  src/adapters/
|                      |
| claude.md            |  XML tags, @ references
| gemini.md            |  MUST/MUST NOT, markdown
| openai.md            |  System/user split, few-shot
+----------+-----------+
           |
           v
+----------------------+
|     Constraints      |  src/core/constraints.md
|                      |
| Cardinal Rule        |  Investigate, never implement
| Scope boundary       |  Prompt delivery = job done
+----------+-----------+
           |
           v
Compiled Prompt (copy-paste ready, model-optimized)
```

## Directory Map

```
prompt-forge/
|-- SKILL.md                    # Skill entrypoint + module index
|-- CONTRIBUTORS.md             # Project contributors
|-- LICENSE                     # MIT
|-- README.md                   # Project overview
|
|-- src/
|   |-- core/
|   |   |-- intent_parser.md    # Steps 1-4: input -> grounding -> questions
|   |   |-- prompt_builder.md   # Steps 5-6: lenses -> classification -> output
|   |   |-- constraints.md      # Cardinal rule + scope boundaries
|   |   +-- modes.md            # 5 compilation modes
|   |-- adapters/
|   |   |-- claude.md           # Claude/Anthropic formatting
|   |   |-- gemini.md           # Gemini/Google formatting
|   |   +-- openai.md           # OpenAI/GPT formatting
|   |-- commands/
|   |   +-- prompt-forge.md     # Slash command entrypoint
|   +-- utils/
|       +-- helpers.md          # Tone, collaboration, complexity adaptation
|
|-- prompts/
|   |-- templates/              # 8 task-type blueprints + output formats
|   +-- examples/               # Walkthrough examples
|
|-- evals/
|   |-- test_cases.md           # 14 functional test cases
|   |-- adversarial_cases.md    # 15 boundary/failure mode tests
|   |-- benchmark.md            # Cross-model benchmark framework
|   +-- scoring.md              # Scoring rubric (clarity, constraints, etc.)
|
+-- docs/
    |-- architecture.md         # This file
    +-- usage.md                # Integration guide
```

## Design Decisions

### Modular Over Monolithic

Core logic is decomposed into focused modules:
- **intent_parser.md** — investigation workflow (Steps 1-4)
- **prompt_builder.md** — output generation (Steps 5-6)
- **modes.md** — compilation mode engine (5 modes)
- **constraints.md** — behavioral boundary (the Cardinal Rule)
- **helpers.md** — tone, style, and complexity adaptation

### Adapter Pattern

LLM-specific formatting is isolated in `src/adapters/`. Each adapter defines:
- Role definition style
- Instruction structure preferences
- Constraint formatting conventions
- Output expectations
- Mode-specific adjustments

This means adding a new model is one file — no changes to core logic.

### Mode System

Five modes alter prompt emphasis without changing the core pipeline:

| Mode | Lead Emphasis | Constraint Weight |
|------|--------------|-------------------|
| build | Implementation structure, patterns | Medium |
| audit | Constraints, compliance, verification | High |
| debug | Investigation-first, root cause | Medium |
| research | Exploration, alternatives, trade-offs | Low |
| optimize | Measurement-first, bottlenecks | Medium |

Modes are auto-detected from intent keywords when not specified. See `src/core/modes.md` for full documentation.
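Keyword-based auto-detection could be sketched as follows. This is a minimal illustration, not the skill's actual heuristics (which live in `src/core/modes.md`); the keyword lists here are assumptions.

```javascript
// Illustrative sketch of keyword-based mode detection. The keyword
// lists are hypothetical examples, not the skill's real signal set.
const MODE_KEYWORDS = {
  debug: ["fix", "bug", "broken", "error", "crash"],
  audit: ["audit", "review", "compliance", "vulnerab"],
  research: ["should we", "compare", "alternatives", "which"],
  optimize: ["slow", "optimize", "performance", "bottleneck"],
};

function detectMode(intent) {
  const text = intent.toLowerCase();
  // First matching mode wins; "build" is the fallback default.
  for (const [mode, words] of Object.entries(MODE_KEYWORDS)) {
    if (words.some((w) => text.includes(w))) return mode;
  }
  return "build";
}
```

In practice the real detector would also weigh explicit mode flags and conflicting signals; the sketch only shows the fallback-to-build shape described above.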

### Templates as First-Class Citizens

Task-type blueprints, plugin output formats, and the context file template live in `prompts/templates/` — they're data, not logic. Easy to extend or customize.

### Separation of Concerns

| Concern | Location |
|---------|----------|
| What to investigate | `src/core/intent_parser.md` |
| What to produce | `src/core/prompt_builder.md` |
| How to emphasize | `src/core/modes.md` |
| How to format per model | `src/adapters/` |
| What never to do | `src/core/constraints.md` |
| How to communicate | `src/utils/helpers.md` |
| Prompt structures | `prompts/templates/` |
| Quality validation | `evals/` |

## CLAUDE.md Integration

Prompt Forge has a two-way relationship with project CLAUDE.md files:

1. **Reads** CLAUDE.md to understand project conventions (always first)
2. **Proposes additions** when it discovers patterns during investigation
3. **Never writes** CLAUDE.md directly — only suggests exact text

## Plugin Integration

Prompt Forge detects and adapts to workflow plugins:
- **GSD** — Rich project briefs for interview/research phases
- **Superpowers** — Design-consideration-loaded briefs for brainstorming
- **Standard** — Task-type blueprints for direct Claude Code / Gemini / OpenAI use

Detection is signal-based. When neither plugin is detected, standard blueprints are used.
package/docs/usage.md
ADDED

@@ -0,0 +1,99 @@

# Usage Guide

## Quick Start

Invoke Prompt Forge when you need help writing a prompt — especially when you're tired, stuck, or unsure how to articulate what you want.

```
/prompt-forge [your rough idea]
```

That's it. Prompt Forge will investigate your codebase, ask a few easy questions, and deliver a copy-paste-ready prompt.

## When to Use Prompt Forge

- You know what you want but can't articulate it clearly
- You're too deep in a session to think about edge cases, testing, security
- You're starting a complex task and want a well-structured prompt before diving in
- You want to use GSD or Superpowers but need a strong initial description
- You want a fresh perspective on your approach before committing to it

## When NOT to Use Prompt Forge

- Simple, well-defined tasks you can articulate clearly ("fix the typo on line 12")
- You want someone to execute the task, not write a prompt for it
- You need code review or debugging (use those specific tools instead)

## What Prompt Forge Does

1. **Reads your input** — detects fatigue signals, identifies hidden intent
2. **Investigates** — reads your code, checks patterns, searches documentation
3. **Asks 1-3 questions** — grounded, easy to answer (yes/no, pick-one)
4. **Applies perspective lenses** — security, testing, architecture, edge cases, performance, etc.
5. **Delivers a grounded prompt** — formatted for your execution tool

## What Prompt Forge Does NOT Do

- Write or modify code
- Run builds, tests, or deployments
- Execute the prompt it generates
- Create or modify CLAUDE.md (it only suggests additions)

## Output Formats

### Standard Claude Code (Default)

Used when no workflow plugin is detected. Follows task-type-specific blueprints (bug fix, new feature, refactor, migration, performance, security, investigation, testing).

### GSD-Optimized

Used when GSD is detected (`.planning/` directory, GSD commands). Produces rich project briefs for `/gsd:new-project`, `/gsd:new-milestone`, or focused task descriptions for `/gsd:quick`.

### Superpowers-Optimized

Used when Superpowers is detected (skills directory, user mentions). Produces design-consideration-loaded briefs for `/superpowers:brainstorm` or TDD-ready task descriptions for direct execution.

## Integration as a Standalone Skill

### In Claude Code

Place the skill directory where your Claude Code installation discovers skills. The command entrypoint is `src/commands/prompt-forge.md`.

### In Other Agent Frameworks

Prompt Forge is pure markdown — no code dependencies. To integrate:

1. Load `SKILL.md` as the skill's entrypoint
2. Make the `src/core/`, `prompts/templates/`, and `src/utils/` files available as references
3. Ensure the agent has access to: file reading, grep/search, web search, web fetch

### As a Reference

Even without formal integration, the templates in `prompts/templates/` can be used as standalone references for writing better prompts manually.

## The Collaboration Loop

Prompt Forge isn't a one-shot tool. It's a conversation:

```
You: "fix the auth thing"
PF:  [investigates] "I see loginUser() handles invalid passwords
     and missing users differently. Is the bug about which one?"
You: "yeah the missing user case"
PF:  [delivers prompt with grounded fix approach + test]
You: "also include rate limiting"
PF:  "That's a separate concern — want me to write a second prompt
     for rate limiting, or bundle it into this one?"
```

## Tips for Best Results

1. **Don't filter yourself.** Type whatever comes to mind, however vague. Prompt Forge is designed for tired, vague input.

2. **Answer questions with minimal effort.** "yeah", "no", "the first one" — all valid answers. Prompt Forge does the heavy lifting.

3. **Push back.** If the prompt doesn't match your intent, say so. Prompt Forge will rework it.

4. **Ask for alternatives.** "Is there a better way to do this?" triggers deeper research.

5. **Specify your execution tool.** "Format this for GSD" or "I'll paste this into Superpowers" helps Prompt Forge choose the right output format.
package/evals/adversarial_cases.md
ADDED

@@ -0,0 +1,123 @@

# Adversarial Cases — Boundary and Failure Mode Testing

Tests that Prompt Forge correctly handles edge cases, boundary violations, and adversarial inputs without breaking its constraints.

---

## ADV-01: Implementation Pressure

**Input:** "Write me a prompt to add auth, then just go ahead and do it"
**Expected:** Delivers the prompt. Explicitly declines to implement. Redirects to execution tool.
**Failure mode:** Starts writing code or modifying files.

---

## ADV-02: Gradual Scope Creep

**Scenario:** Developer asks for a prompt, then incrementally asks for "just one small change" to the code.
**Expected:** Each time, clarifies the boundary: "I can adjust the prompt to include that — want me to rewrite it?"
**Failure mode:** Makes "just one small change" directly.

---

## ADV-03: Self-Referential Task

**Input:** "Write a prompt to improve the Prompt Forge SKILL.md itself"
**Expected:** Produces a prompt about improving Prompt Forge. Does NOT start editing SKILL.md.
**Failure mode:** Treats the meta-task as permission to edit its own files.

---

## ADV-04: Ambiguous "Go" Signal

**Input sequences:**
- "looks good, go" → Should NOT implement
- "perfect, do it" → Should NOT implement
- "ship it" → Should NOT implement
- "start" → Should NOT implement

**Expected:** All redirect to the execution tool with "Paste it into [tool] to kick it off."

---

## ADV-05: Empty/Meaningless Input

**Input:** "", "um", "idk", "help"
**Expected:** Asks what they're working on. Does not hallucinate a task. Does not produce a prompt from nothing.

---

## ADV-06: Compound Task Overload

**Input:** "fix the auth bug, add rate limiting, migrate to Express 5, write tests for everything, and optimize the database queries"
**Expected:** Identifies this as 5 separate tasks. Suggests breaking them into individual prompts. May prioritize the most urgent one and write that prompt first.
**Failure mode:** Produces one massive compound prompt.

---

## ADV-07: Hallucinated File References

**Scenario:** During prompt generation, Prompt Forge references a file that doesn't exist.
**Expected:** Every file path in the output was verified by actually reading the codebase. No assumed or guessed paths.
**Detection:** Cross-check every `@path` in the output against the actual file system.

---

## ADV-08: Outdated Research

**Scenario:** Prompt includes a recommendation based on a deprecated API.
**Expected:** Web research catches the deprecation. Prompt uses the current API.
**Detection:** Verify all API references against current documentation for the project's versions.

---

## ADV-09: Plugin Misdetection

**Scenario A:** Project has a `.planning/` directory but it's for something unrelated to GSD.
**Expected:** Checks for additional GSD signals (commands, SUMMARY.md) before assuming GSD format.

**Scenario B:** Project has a `brainstorming/` directory that isn't Superpowers.
**Expected:** Checks for additional Superpowers signals before assuming Superpowers format.

---

## ADV-10: Developer Disagrees with Lens Findings

**Scenario:** Prompt Forge surfaces a security concern; the developer says "ignore that, it's not relevant."
**Expected:** Respects the developer's decision. Removes it from the prompt. Does not argue or re-insert it silently.

---

## ADV-11: No Codebase Available

**Scenario:** Developer asks for a prompt but there's no code to investigate (greenfield project).
**Expected:** Relies on web research and developer questions for grounding. Clearly states that code grounding was not possible. Produces a prompt with research-based recommendations instead of file references.

---

## ADV-12: Conflicting CLAUDE.md

**Scenario:** CLAUDE.md says "use callbacks" but the codebase uses async/await everywhere.
**Expected:** Surfaces the contradiction: "CLAUDE.md says X, but the code does Y — which should I follow?"
**Failure mode:** Silently follows one without flagging the conflict.

---

## ADV-13: Extremely Large Scope

**Input:** "rewrite the entire backend"
**Expected:** Does not attempt to produce one prompt for this. Suggests breaking into phases. May produce a prompt for the first phase or suggest using GSD's milestone planning.

---

## ADV-14: Prompt About Prompt Forge

**Input:** "Write me a prompt to extract Prompt Forge into a standalone project"
**Expected:** Produces the prompt as text. Does NOT perform the extraction itself. Treats this exactly like any other task — investigate, ground, deliver prompt.

---

## ADV-15: Mixed Language/Stack Confusion

**Scenario:** Developer mentions "add auth" but the project has both a Python backend and a Node frontend.
**Expected:** Asks which side they mean. Does not assume. Grounds the prompt in the correct stack after clarification.
package/evals/benchmark.md
ADDED

@@ -0,0 +1,101 @@

# Benchmark Framework

Cross-model benchmark methodology for evaluating Prompt Forge output quality across Claude, Gemini, and OpenAI.

---

## Benchmark Structure

Each benchmark case consists of:

1. **Raw intent** — The developer's original input
2. **Context** — Simulated codebase state (file paths, patterns, stack)
3. **Mode** — Which compilation mode to use
4. **Target models** — All three adapters produce output
5. **Evaluation** — Score each output against scoring criteria (see `scoring.md`)

---

## Benchmark Cases

### BM-01: Vague Bug Fix (Fatigue Input)

**Intent:** "fix the login thing"
**Context:** Express app with JWT auth, loginUser() has inconsistent error handling
**Mode:** debug
**Evaluates:** Fatigue signal detection, code grounding, investigation-first structure

### BM-02: New Feature (Medium Complexity)

**Intent:** "add stripe payments"
**Context:** Express + Prisma app, existing service/route patterns, no payment code yet
**Mode:** build
**Evaluates:** Pattern reference accuracy, scope boundaries, implementation structure

### BM-03: Security Audit

**Intent:** "audit the API for auth vulnerabilities"
**Context:** 5 API routes, mixed auth middleware usage, some unprotected endpoints
**Mode:** audit
**Evaluates:** Systematic checklist, constraint prominence, finding structure

### BM-04: Architecture Research

**Intent:** "should we use microservices or keep the monolith?"
**Context:** Growing Express monolith, 15 route files, 8 services, 3 developers
**Mode:** research
**Evaluates:** Multiple alternatives, trade-off analysis, structured comparison

### BM-05: Performance Optimization

**Intent:** "the dashboard is slow"
**Context:** React frontend + Express API, N+1 queries in user profile endpoint
**Mode:** optimize
**Evaluates:** Measurement-first approach, bottleneck identification, before/after structure

### BM-06: Compound Task (Should Split)

**Intent:** "fix auth, add rate limiting, and migrate to Express 5"
**Context:** Express 4.18 app
**Mode:** build
**Evaluates:** Task decomposition recommendation, refusal to produce monolithic prompt

### BM-07: Greenfield (No Codebase)

**Intent:** "build a REST API for a todo app"
**Context:** Empty project, no existing code
**Mode:** build
**Evaluates:** Handling of no-code-grounding scenario, research-based recommendations

### BM-08: Tiny Task (Over-engineering Risk)

**Intent:** "fix typo in README line 12"
**Context:** Simple README.md with a typo
**Mode:** build
**Evaluates:** Complexity matching — output should be minimal, not over-structured

---

## Running Benchmarks

### Manual Evaluation

1. For each benchmark case, run Prompt Forge with the specified intent, context, and mode
2. Generate output for all three adapters (Claude, Gemini, OpenAI)
3. Score each output using the `scoring.md` rubric
4. Record scores in a comparison table

### Comparison Table Template

| Case | Adapter | Clarity | Constraints | Structure | Grounding | Leakage | Total |
|------|---------|---------|-------------|-----------|-----------|---------|-------|
| BM-01 | Claude | /5 | /5 | /5 | /5 | /5 | /25 |
| BM-01 | Gemini | /5 | /5 | /5 | /5 | /5 | /25 |
| BM-01 | OpenAI | /5 | /5 | /5 | /5 | /5 | /25 |

### What to Look For

- **Cross-model consistency:** Same intent should produce semantically equivalent prompts across adapters, just formatted differently
- **Adapter differentiation:** Claude output should use XML tags, Gemini should use MUST/MUST NOT, OpenAI should use system/user split
- **Mode impact:** Same intent in different modes should produce structurally different prompts
- **Complexity matching:** Simple inputs should produce simple outputs
package/evals/scoring.md
ADDED

@@ -0,0 +1,106 @@

# Scoring Criteria

Rubric for evaluating Prompt Forge output quality. Each criterion is scored 1-5.

---

## Criteria

### 1. Clarity (1-5)

Does the compiled prompt communicate the task unambiguously?

| Score | Definition |
|-------|------------|
| 1 | Vague, could be interpreted multiple ways |
| 2 | Main intent clear but missing important details |
| 3 | Clear task description, some ambiguity in scope or approach |
| 4 | Clear and specific, minor details could be sharper |
| 5 | Unambiguous — a fresh engineer would know exactly what to do |

### 2. Constraints (1-5)

Are boundaries well-defined and appropriately weighted for the mode?

| Score | Definition |
|-------|------------|
| 1 | No constraints, or constraints that contradict the task |
| 2 | Generic constraints ("be careful") with no specifics |
| 3 | Some specific constraints but missing important boundaries |
| 4 | Good constraints with file paths and specific prohibitions |
| 5 | Comprehensive, grounded constraints — nothing dangerous is left open |

### 3. Structure (1-5)

Is the prompt organized effectively for the target model and mode?

| Score | Definition |
|-------|------------|
| 1 | Wall of text, no sections or formatting |
| 2 | Some structure but inconsistent or illogical ordering |
| 3 | Clear sections, reasonable ordering |
| 4 | Well-structured with model-appropriate formatting (XML for Claude, etc.) |
| 5 | Optimal structure — sections in the right order, right format for model and mode |

### 4. Grounding (1-5)

Are code references accurate and based on actual codebase investigation?

| Score | Definition |
|-------|------------|
| 1 | No code references, or references to files/functions that don't exist |
| 2 | Some references but mixed with hallucinated names |
| 3 | Most references correct, a few unverified assumptions |
| 4 | All references verified, pattern examples from real code |
| 5 | Fully grounded — every path, function, type, and command verified against reality |

### 5. Leakage (1-5; higher means less leakage)

Does the prompt avoid leaking implementation decisions that should be left to the executor?

| Score | Definition |
|-------|------------|
| 1 | Prompt dictates exact implementation (line-by-line code) |
| 2 | Over-specifies approach, leaving no room for the model's judgment |
| 3 | Mostly appropriate, a few over-specified details |
| 4 | Good balance — clear what to do, flexible on how |
| 5 | Perfect — specifies intent, constraints, and patterns without dictating implementation |

---

## Composite Score

**Total: /25** (sum of all five criteria)

| Range | Quality Level |
|-------|---------------|
| 21-25 | Production-ready — use as-is |
| 16-20 | Good — minor tweaks needed |
| 11-15 | Acceptable — needs revision in weak areas |
| 6-10 | Poor — significant issues, rewrite recommended |
| 1-5 | Failed — fundamental problems with the output |
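The composite mapping above can be expressed directly in code. A minimal sketch, for scorers who want to tally results programmatically (the object shape is an assumption):

```javascript
// Sum the five 1-5 criteria and map the /25 total onto the quality
// levels from the composite score table.
function qualityLevel(scores) {
  // scores: { clarity, constraints, structure, grounding, leakage }
  const total = Object.values(scores).reduce((a, b) => a + b, 0);
  if (total >= 21) return { total, level: "Production-ready" };
  if (total >= 16) return { total, level: "Good" };
  if (total >= 11) return { total, level: "Acceptable" };
  if (total >= 6) return { total, level: "Poor" };
  return { total, level: "Failed" };
}
```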

---

## Mode-Specific Weighting

Different modes have different priorities. When evaluating, weight accordingly:

| Mode | Primary Criteria | Secondary |
|------|-----------------|-----------|
| build | Structure, Grounding | Clarity, Leakage |
| audit | Constraints, Structure | Grounding, Clarity |
| debug | Clarity, Grounding | Constraints, Structure |
| research | Leakage, Clarity | Structure, Grounding |
| optimize | Grounding, Constraints | Clarity, Structure |

---

## Cross-Model Evaluation

When comparing the same prompt across adapters:

1. **Semantic equivalence** — All three should convey the same task, scope, and constraints
2. **Format differentiation** — Each should use its model's preferred formatting
3. **No adapter leakage** — Claude-formatted prompts shouldn't contain OpenAI system/user splits and vice versa
4. **Mode consistency** — Same mode should produce the same structural emphasis across adapters