mdan-cli 2.2.0 → 2.4.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.mcp.json +46 -0
- package/AGENTS.md +246 -0
- package/README.md +32 -7
- package/agents/test.md +60 -2
- package/cli/mdan.js +149 -26
- package/cli/mdan.py +111 -54
- package/cli/mdan.sh +43 -43
- package/install.sh +30 -167
- package/integrations/all-integrations.md +2 -2
- package/integrations/cursor.md +11 -11
- package/integrations/mcp.md +153 -0
- package/integrations/windsurf.md +4 -4
- package/package.json +4 -2
- package/phases/04-verify.md +9 -3
- package/templates/prompts/README.md +108 -0
- package/templates/prompts/dev-agent.yaml +85 -0
- package/templates/prompts/orchestrator.yaml +97 -0
- package/templates/prompts.json +81 -0
- package/templates/tests/evaluations/README.md +80 -0
- package/templates/tests/evaluations/classification_eval.md +136 -0
- package/templates/tests/evaluations/rag_eval.md +116 -0
- package/templates/tests/scenarios/README.md +62 -0
- package/templates/tests/scenarios/basic_authentication.test.md +82 -0
- package/templates/tests/scenarios/user_registration.test.md +107 -0
package/integrations/windsurf.md
CHANGED

@@ -23,14 +23,14 @@ You are operating inside Windsurf IDE with Cascade AI.
 - Use Cascade's flow awareness to maintain MDAN phase context across sessions
 
 ### File Organization
-All MDAN artifacts should be saved to
+All MDAN artifacts should be saved to `mdan/artifacts/` for reference.
 ```
 
 ### Step 2: Copy agents
 
 ```bash
-mkdir -p
-cp agents/*.md
+mkdir -p mdan/agents mdan/artifacts
+cp agents/*.md mdan/agents/
 ```
 
 ### Step 3: Using MDAN with Cascade
@@ -45,4 +45,4 @@ Cascade's multi-step reasoning pairs well with MDAN's structured phases. When st
 
 - Windsurf's Cascade is excellent for the BUILD phase — it can implement entire features autonomously
 - Use MDAN's Feature Briefs as Cascade tasks for predictable, structured implementation
-- Save architecture documents to
+- Save architecture documents to `mdan/artifacts/` so Cascade can reference them in context
package/package.json
CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "mdan-cli",
-  "version": "2.
+  "version": "2.4.0",
   "description": "Multi-Agent Development Agentic Network - A modern, adaptive, LLM-agnostic methodology for building software",
   "main": "cli/mdan.js",
   "bin": {
@@ -39,7 +39,9 @@
     "integrations/",
     "memory/",
     "skills/",
-    "install.sh"
+    "install.sh",
+    ".mcp.json",
+    "AGENTS.md"
   ],
   "dependencies": {
     "@clack/prompts": "^1.0.1",
package/phases/04-verify.md
CHANGED

@@ -38,11 +38,13 @@ Implemented features: [List of completed features]
 
 Expected output:
 1. Complete test plan
-2. Unit tests for all features
+2. Unit tests for all features (80%+ coverage)
 3. Integration tests for critical flows
 4. E2E test scenarios
-5.
-6.
+5. Scenario tests (Better Agents format)
+6. Evaluation datasets (if RAG/ML features)
+7. Performance test criteria
+8. Test results summary
 ```
 
 ---
@@ -77,6 +79,8 @@ Testing:
 [ ] Test coverage meets target (e.g., 80%)
 [ ] Integration tests pass
 [ ] At least 3 E2E scenarios pass
+[ ] Scenario tests pass (Better Agents format)
+[ ] Evaluation benchmarks pass (if RAG/ML features)
 [ ] Performance criteria are met
 
 Security:
@@ -99,3 +103,5 @@ Quality:
 |---|---|---|---|
 | Test Plan | `templates/TEST-PLAN.md` | Test Agent | Complete |
 | Security Report | `templates/SECURITY-REVIEW.md` | Security Agent | Signed off |
+| Scenarios | `templates/tests/scenarios/*.test.md` | Test Agent | Pass |
+| Evaluations | `templates/tests/evaluations/*.md` | Test Agent | Pass thresholds |
package/templates/prompts/README.md
ADDED

@@ -0,0 +1,108 @@
+# Prompts Versioning Index
+
+> Version-controlled prompts for MDAN agents
+
+## Overview
+
+Prompts are versioned using YAML files for better collaboration, tracking, and rollback capabilities.
+
+## Available Prompts
+
+| Prompt | Version | Agent |
+|--------|---------|-------|
+| [orchestrator.yaml](prompts/orchestrator.yaml) | 2.2.0 | MDAN Core |
+| [dev-agent.yaml](prompts/dev-agent.yaml) | 2.0.0 | Haytame (Dev) |
+| [product-agent.yaml](prompts/product-agent.yaml) | 2.0.0 | Khalil (Product) |
+| [architect-agent.yaml](prompts/architect-agent.yaml) | 2.0.0 | Reda (Architect) |
+| [test-agent.yaml](prompts/test-agent.yaml) | 2.0.0 | Youssef (Test) |
+
+## Prompt Format
+
+Each prompt file follows this structure:
+
+```yaml
+handle: agent-handle
+scope: PROJECT # or GLOBAL
+model: openai/gpt-4o
+version: 1.0.0
+last_updated: "2026-02-24"
+maintainer: username
+
+description: Brief description
+
+system_prompt: |
+  Full prompt content with:
+  - Role definition
+  - Capabilities
+  - Constraints
+  - Output format
+
+capabilities:
+  - Capability 1
+  - Capability 2
+
+constraints:
+  - Constraint 1
+  - Constraint 2
+
+changelog:
+  - version: 1.0.0
+    date: "2026-02-24"
+    changes:
+      - Change 1
+      - Change 2
+```
+
+## Using Prompts
+
+### CLI Commands
+
+```bash
+# List all prompts
+mdan prompt list
+
+# Show specific prompt
+mdan prompt show orchestrator
+
+# Compare versions
+mdan prompt diff orchestrator 2.1.0 2.2.0
+
+# Rollback prompt
+mdan prompt rollback orchestrator 2.1.0
+```
+
+### Integration with IDE
+
+Prompts are synced to:
+- `.claude/skills/` - For Claude/Cursor
+- `.windsurfrules` - For Windsurf
+- `.github/copilot-instructions.md` - For Copilot
+
+## Registry
+
+The `prompts.json` file tracks all prompts:
+
+```json
+{
+  "prompts": {
+    "orchestrator": {
+      "version": "2.2.0",
+      "active": true
+    }
+  }
+}
+```
+
+## Best Practices
+
+1. **Version bump on any change** - Even small tweaks warrant a version bump
+2. **Document changes** - Always add changelog entries
+3. **Test prompts** - Validate prompts before releasing
+4. **Use semantic versioning** - MAJOR for breaking, MINOR for features, PATCH for fixes
+
+## Adding New Prompts
+
+1. Create YAML file in `templates/prompts/`
+2. Add entry to `prompts.json`
+3. Update registry version
+4. Commit with descriptive message
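The semantic-versioning rule in the Best Practices list above can be sketched with a small helper. This is illustrative only; `bump_kind` is a hypothetical function, not part of the mdan-cli codebase:

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def bump_kind(old: str, new: str) -> str:
    """Classify a prompt version change as major, minor, patch, or invalid."""
    o, n = parse_semver(old), parse_semver(new)
    if n == o:
        return "invalid"  # every prompt change warrants a bump
    if n[0] != o[0]:
        return "major"    # breaking prompt change
    if n[1] != o[1]:
        return "minor"    # new capability
    return "patch"        # wording fix

print(bump_kind("2.1.0", "2.2.0"))  # minor
```

Such a check could run in CI before a changed prompt file is merged, rejecting edits that reuse an existing version number.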
package/templates/prompts/dev-agent.yaml
ADDED

@@ -0,0 +1,85 @@
+handle: dev-agent
+scope: PROJECT
+model: openai/gpt-4o
+version: 2.0.0
+last_updated: "2026-02-24"
+maintainer: khalilbenaz
+
+description: MDAN Dev Agent (Haytame) - Senior full-stack developer responsible for implementation, code quality, and technical decisions.
+
+system_prompt: |
+  [MDAN-AGENT]
+  NAME: Dev Agent (Haytame)
+  VERSION: 2.0.0
+  ROLE: Senior Full-Stack Developer responsible for implementation
+  PHASE: BUILD
+  REPORTS_TO: MDAN Core
+
+  You are Haytame, a senior full-stack developer with 10+ years of experience.
+  You write clean, maintainable, production-ready code. You care about:
+  - Code readability over cleverness
+  - Testing as a safety net
+  - Security by default
+  - Performance optimization when needed
+
+  Your philosophy:
+  - "Code is read more than written"
+  - "The best code is no code"
+  - "Always consider the next developer"
+  - "Security is not optional"
+
+capabilities:
+  - Implement features from user stories
+  - Write unit, integration, and e2e tests
+  - Create API endpoints
+  - Design database schemas
+  - Set up CI/CD pipelines
+  - Perform code reviews
+  - Refactor existing code
+  - Optimize performance
+  - Fix bugs
+
+constraints:
+  - NEVER skip tests
+  - NEVER commit secrets/keys to repository
+  - NEVER bypass security checks
+  - ALWAYS use type hints (TypeScript/Python)
+  - ALWAYS handle errors explicitly
+  - NEVER expose stack traces to users
+  - ALWAYS use environment variables for config
+
+input_format: |
+  MDAN Core provides:
+  - User stories with acceptance criteria
+  - Architecture document
+  - UX designs/wireframes
+  - Previous implementation context (if any)
+
+output_format: |
+  Produce implementation artifacts:
+  - Implementation Plan (file structure, dependencies, sequence)
+  - Code Files (source, types, configs)
+  - Tests (unit 80%+, integration, e2e)
+  - Setup Instructions (env vars, run locally, migrations)
+
+quality_checklist:
+  - All acceptance criteria implemented
+  - Code follows style guide
+  - No TODO/FIXME in production
+  - All tests pass
+  - No security vulnerabilities
+  - Error handling comprehensive
+  - Logging appropriate
+  - Performance addressed
+
+changelog:
+  - version: 2.0.0
+    date: "2026-02-24"
+    changes:
+      - Added implementation plan format
+      - Added setup instructions section
+      - Added coding standards reference
+  - version: 1.0.0
+    date: "2025-10-01"
+    changes:
+      - Initial release
package/templates/prompts/orchestrator.yaml
ADDED

@@ -0,0 +1,97 @@
+handle: orchestrator
+scope: PROJECT
+model: openai/gpt-4o
+version: 2.2.0
+last_updated: "2026-02-24"
+maintainer: khalilbenaz
+
+description: |
+  MDAN Core v2 - Adaptive central orchestrator that manages the entire
+  development lifecycle. Handles project profiling, session memory,
+  and agent coordination.
+
+system_prompt: |
+  You are MDAN Core v2, the adaptive central orchestrator of the MDAN method.
+
+  You do everything MDAN Core v1 does, plus two critical upgrades:
+
+  1. PROJECT PROFILE DETECTION — You detect the type of project early in DISCOVER
+     and automatically adapt the agent workflow, phase depth, and artifact requirements.
+
+  2. SESSION MEMORY — You can resume any project from a MDAN-STATE.json file,
+     reconstructing full context and picking up exactly where work left off.
+
+  Your responsibilities:
+  - Guide user through 5 phases: DISCOVER, DESIGN, BUILD, VERIFY, SHIP
+  - Activate appropriate agents based on project profile
+  - Maintain session state in MDAN-STATE.json
+  - Ensure quality gates are met before phase transitions
+  - Support slash commands: /mdan-help, /party
+
+  Project profiles you can detect:
+  - MICRO: Solo dev, <1 week, single feature
+  - STANDARD: Small team, 2-8 weeks (default)
+  - PRODUCT: Team 5+, multi-month, stakeholders
+  - ENTERPRISE: Large team, compliance, SLA
+  - API-ONLY: No frontend, developer-focused
+  - DATA/ML: Machine learning, data pipelines
+
+constraints: |
+  - Never skip quality gates
+  - Always ask for user name in first interaction
+  - Use agent names: Khalil (Product), Reda (Architect), Jihane (UX),
+    Haytame (Dev), Youssef (Test), Said (Security), Anas (DevOps), Amina (Doc)
+  - Log all agent activations with version numbers
+
+output_format: |
+  Always include these in your responses:
+  - Current phase indicator
+  - Next recommended action
+  - Quality gate status
+
+examples:
+  - scenario: Start new project
+    user: "I want to build a SaaS app"
+    response: |
+      [MDAN CORE — PROJECT PROFILE DETECTED]
+
+      Based on your description, I'm classifying this as a STANDARD project.
+
+      This means:
+      - Full 5-phase workflow
+      - All agents will be activated
+      - Standard artifact depth
+
+      Does this feel right? Also, before we proceed, what is your name?
+
+  - scenario: Resume session
+    user: "Here's my MDAN-STATE.json"
+    response: |
+      [MDAN CORE — SESSION RESUMED]
+
+      📁 Project: my-saas (STANDARD)
+      📍 Current Phase: DESIGN
+      ...
+
+      → Recommended next action: Complete Architecture document
+
+      Shall I continue?
+
+changelog:
+  - version: 2.2.0
+    date: "2026-02-24"
+    changes:
+      - Added project profile detection
+      - Added session resume protocol
+      - Added agent version tracking
+      - Added quality gates for each profile
+  - version: 2.1.0
+    date: "2025-12-01"
+    changes:
+      - Added /party command
+      - Improved /mdan-help
+  - version: 2.0.0
+    date: "2025-10-01"
+    changes:
+      - Initial v2 release
+      - 5-phase workflow
package/templates/prompts.json
ADDED

@@ -0,0 +1,81 @@
+{
+  "registry_version": "1.0",
+  "last_updated": "2026-02-24",
+  "prompts": {
+    "orchestrator": {
+      "handle": "orchestrator",
+      "version": "2.2.0",
+      "file": "templates/prompts/orchestrator.yaml",
+      "active": true,
+      "model": "openai/gpt-4o",
+      "changelog": [
+        {
+          "version": "2.2.0",
+          "date": "2026-02-24",
+          "changes": ["Added project profile detection", "Added session resume protocol"]
+        }
+      ]
+    },
+    "dev-agent": {
+      "handle": "dev-agent",
+      "version": "2.0.0",
+      "file": "templates/prompts/dev-agent.yaml",
+      "active": true,
+      "model": "openai/gpt-4o",
+      "changelog": [
+        {
+          "version": "2.0.0",
+          "date": "2026-02-24",
+          "changes": ["Added implementation plan format", "Added setup instructions"]
+        }
+      ]
+    },
+    "product-agent": {
+      "handle": "product-agent",
+      "version": "2.0.0",
+      "file": "templates/prompts/product-agent.yaml",
+      "active": true,
+      "model": "openai/gpt-4o",
+      "changelog": [
+        {
+          "version": "2.0.0",
+          "date": "2026-02-24",
+          "changes": ["Initial release with MoSCoW, risk matrix"]
+        }
+      ]
+    },
+    "architect-agent": {
+      "handle": "architect-agent",
+      "version": "2.0.0",
+      "file": "templates/prompts/architect-agent.yaml",
+      "active": true,
+      "model": "openai/gpt-4o",
+      "changelog": [
+        {
+          "version": "2.0.0",
+          "date": "2026-02-24",
+          "changes": ["Added Mermaid diagrams", "Added ADR format"]
+        }
+      ]
+    },
+    "test-agent": {
+      "handle": "test-agent",
+      "version": "2.0.0",
+      "file": "templates/prompts/test-agent.yaml",
+      "active": true,
+      "model": "openai/gpt-4o",
+      "changelog": [
+        {
+          "version": "2.0.0",
+          "date": "2026-02-24",
+          "changes": ["Added scenarios support", "Added evaluations"]
+        }
+      ]
+    }
+  },
+  "sync_settings": {
+    "auto_sync": false,
+    "langwatch_integration": false,
+    "commit_on_change": true
+  }
+}
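A minimal sketch of how a tool might consume the registry format above, using only the fields that appear in `prompts.json`. The `active_prompts` helper is hypothetical, not an mdan-cli API; the abridged registry literal is for illustration:

```python
import json

# A registry shaped like the prompts.json above (abridged to two entries).
registry_text = """
{
  "registry_version": "1.0",
  "prompts": {
    "orchestrator": {"version": "2.2.0", "active": true,
                     "file": "templates/prompts/orchestrator.yaml"},
    "dev-agent":    {"version": "2.0.0", "active": false,
                     "file": "templates/prompts/dev-agent.yaml"}
  }
}
"""

registry = json.loads(registry_text)

def active_prompts(registry: dict) -> dict[str, str]:
    """Map each active prompt handle to its current version."""
    return {
        handle: entry["version"]
        for handle, entry in registry["prompts"].items()
        if entry.get("active")
    }

print(active_prompts(registry))  # {'orchestrator': '2.2.0'}
```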
package/templates/tests/evaluations/README.md
ADDED

@@ -0,0 +1,80 @@
+# Test Evaluations Index
+
+> Structured benchmarking for agent components
+
+## Overview
+
+Evaluations provide quantitative testing for specific components of your agent pipeline. Unlike scenarios (end-to-end), evaluations test isolated components.
+
+## Available Evaluations
+
+| Evaluation | Description | Metrics |
+|------------|-------------|---------|
+| [rag_eval.md](rag_eval.md) | RAG correctness | F1, Precision, Recall, Context Relevance |
+| [classification_eval.md](classification_eval.md) | Routing/categorization | Accuracy, F1, Precision, Recall |
+
+## Adding New Evaluations
+
+1. Copy an existing template:
+```bash
+cp rag_eval.md my_eval.md
+```
+
+2. Customize:
+   - Update dataset format
+   - Set target thresholds
+   - Add domain-specific checks
+
+3. Run:
+```python
+import langwatch
+results = langwatch.evaluate(
+    dataset="my-dataset",
+    evaluator="my_eval",
+    metrics=["accuracy", "f1"]
+)
+```
+
+## Built-in Evaluators
+
+LangWatch provides extensive evaluators:
+
+| Evaluator | Description |
+|-----------|-------------|
+| `rag_correctness` | Retrieval and generation quality |
+| `classification_accuracy` | Routing and categorization |
+| `answer_correctness` | Factual accuracy |
+| `safety_check` | Jailbreak, PII, toxicity |
+| `format_validation` | JSON, XML, markdown structure |
+| `tool_calling` | Correct tool selection and args |
+| `latency` | Response time benchmarking |
+
+## Running Evaluations
+
+```bash
+# Python
+pytest tests/evaluations/ -v
+
+# JavaScript
+npm test -- tests/evaluations/
+
+# With LangWatch
+langwatch evaluate --dataset my-project
+```
+
+## Integration with MDAN
+
+During VERIFY phase:
+1. Test Agent identifies components needing evaluation
+2. Creates evaluation datasets from PRD/user stories
+3. Runs relevant evaluations
+4. Reports metrics in quality gate
+5. Fails if thresholds not met
+
+## Best Practices
+
+- Create evaluation datasets from real user queries
+- Include edge cases and negative examples
+- Set realistic thresholds (not 100%)
+- Track metrics over time (regression detection)
+- Run evaluations in CI/CD pipeline
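The "fails if thresholds not met" quality-gate step in the README above reduces to comparing measured metrics against declared targets. A minimal sketch, where the `gate` helper is hypothetical (not a LangWatch or mdan-cli API) and the sample numbers are illustrative:

```python
def gate(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics that fall below their target threshold."""
    return [name for name, target in thresholds.items()
            if metrics.get(name, 0.0) < target]

# Illustrative measured values vs. the targets an evaluation template declares.
measured = {"accuracy": 0.96, "f1_macro": 0.88}
targets = {"accuracy": 0.95, "f1_macro": 0.90}

failures = gate(measured, targets)
print("PASS" if not failures else f"FAIL: {failures}")  # FAIL: ['f1_macro']
```

In CI, a non-empty failure list would map to a non-zero exit code so the VERIFY phase blocks on regressions.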
package/templates/tests/evaluations/classification_eval.md
ADDED

@@ -0,0 +1,136 @@
+# Classification Evaluation Template
+
+> Benchmark routing accuracy and categorization correctness
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| eval_name | classification_accuracy |
+| version | 1.0.0 |
+| metrics | Accuracy, Precision, Recall, F1 Score |
+
+## Purpose
+
+Evaluate how well the agent correctly routes requests, categorizes inputs, or makes routing decisions.
+
+## Use Cases
+
+- Intent classification (chatbot routing)
+- Content moderation
+- Priority/ticket categorization
+- Language detection
+- Sentiment analysis
+
+## Dataset Format
+
+```json
+[
+  {
+    "input": "I want to get a refund",
+    "expected_category": "refund_request",
+    "expected_confidence": 0.90
+  },
+  {
+    "input": "How do I change my password?",
+    "expected_category": "account_settings",
+    "expected_confidence": 0.85
+  }
+]
+```
+
+## Evaluation Metrics
+
+| Metric | Target | Description |
+|--------|--------|-------------|
+| Accuracy | ≥0.95 | % of correct classifications |
+| Macro F1 | ≥0.90 | F1 averaged across all categories |
+| Precision (per class) | ≥0.85 | True positives / predicted positives |
+| Recall (per class) | ≥0.85 | True positives / actual positives |
+| Confidence Calibration | ≤0.10 | Mean absolute error between confidence and accuracy |
+
+## Evaluation Code
+
+### Python (LangWatch)
+
+```python
+import langwatch
+
+results = langwatch.evaluate(
+    dataset="intent-classification",
+    evaluator="classification_accuracy",
+    metrics=["accuracy", "precision", "recall", "f1_macro", "calibration"]
+)
+
+print(f"Accuracy: {results.accuracy}")
+print(f"Macro F1: {results.f1_macro}")
+print(f"Precision: {results.precision}")
+print(f"Recall: {results.recall}")
+```
+
+### JavaScript/TypeScript
+
+```typescript
+import { evaluate } from "@langwatch/evaluators";
+
+const results = await evaluate({
+  dataset: "intent-classification",
+  evaluator: "classification_accuracy",
+  metrics: ["accuracy", "precision", "recall", "f1_macro"],
+});
+
+console.log(`Accuracy: ${results.accuracy}`);
+```
+
+## Pass/Fail Criteria
+
+| Metric | Threshold | Status |
+|--------|-----------|--------|
+| Accuracy | ≥0.95 | ✅ Pass |
+| Accuracy | 0.90-0.94 | ⚠️ Warning |
+| Accuracy | <0.90 | ❌ Fail |
+| Macro F1 | ≥0.90 | ✅ Pass |
+| Macro F1 | <0.85 | ❌ Fail |
+| Confidence Error | ≤0.10 | ✅ Pass |
+| Confidence Error | >0.15 | ⚠️ Warning |
+
+## Per-Class Analysis
+
+Generate confusion matrix:
+
+| | Predicted A | Predicted B | Predicted C |
+|--|-------------|-------------|-------------|
+| Actual A | 45 | 3 | 2 |
+| Actual B | 5 | 38 | 7 |
+| Actual C | 1 | 4 | 40 |
+
+Identify:
+- **High-confusion pairs**: A↔B need better differentiation
+- **Low-recall classes**: More training data needed
+- **Low-precision classes**: Overlapping with other categories
+
+## Common Issues
+
+### Low Precision (many false positives)
+- Add negative examples
+- Make categories more distinct
+- Add disambiguation prompts
+
+### Low Recall (many false negatives)
+- Add more training data
+- Expand category definitions
+- Check for data quality issues
+
+### Poor Calibration
+- Retrain with temperature scaling
+- Add more diverse examples
+- Use calibration-aware loss
+
+## Integration with MDAN
+
+During VERIFY phase, Test Agent should:
+1. Identify all classification/routing points in the system
+2. Create evaluation datasets from real user queries
+3. Run classification evaluation
+4. Report per-class performance
+5. Fail if overall accuracy < 90%