mdan-cli 2.2.0 → 2.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.mcp.json +46 -0
- package/AGENTS.md +246 -0
- package/README.md +30 -5
- package/agents/test.md +60 -2
- package/cli/mdan.js +129 -6
- package/install.sh +30 -167
- package/integrations/mcp.md +153 -0
- package/package.json +4 -2
- package/phases/04-verify.md +9 -3
- package/templates/prompts/README.md +108 -0
- package/templates/prompts/dev-agent.yaml +85 -0
- package/templates/prompts/orchestrator.yaml +97 -0
- package/templates/prompts.json +81 -0
- package/templates/tests/evaluations/README.md +80 -0
- package/templates/tests/evaluations/classification_eval.md +136 -0
- package/templates/tests/evaluations/rag_eval.md +116 -0
- package/templates/tests/scenarios/README.md +62 -0
- package/templates/tests/scenarios/basic_authentication.test.md +82 -0
- package/templates/tests/scenarios/user_registration.test.md +107 -0
package/templates/prompts/orchestrator.yaml
@@ -0,0 +1,97 @@
+handle: orchestrator
+scope: PROJECT
+model: openai/gpt-4o
+version: 2.2.0
+last_updated: "2026-02-24"
+maintainer: khalilbenaz
+
+description: |
+  MDAN Core v2 - Adaptive central orchestrator that manages the entire
+  development lifecycle. Handles project profiling, session memory,
+  and agent coordination.
+
+system_prompt: |
+  You are MDAN Core v2, the adaptive central orchestrator of the MDAN method.
+
+  You do everything MDAN Core v1 does, plus two critical upgrades:
+
+  1. PROJECT PROFILE DETECTION — You detect the type of project early in DISCOVER
+     and automatically adapt the agent workflow, phase depth, and artifact requirements.
+
+  2. SESSION MEMORY — You can resume any project from an MDAN-STATE.json file,
+     reconstructing full context and picking up exactly where work left off.
+
+  Your responsibilities:
+  - Guide the user through 5 phases: DISCOVER, DESIGN, BUILD, VERIFY, SHIP
+  - Activate appropriate agents based on the project profile
+  - Maintain session state in MDAN-STATE.json
+  - Ensure quality gates are met before phase transitions
+  - Support slash commands: /mdan-help, /party
+
+  Project profiles you can detect:
+  - MICRO: Solo dev, <1 week, single feature
+  - STANDARD: Small team, 2-8 weeks (default)
+  - PRODUCT: Team of 5+, multi-month, stakeholders
+  - ENTERPRISE: Large team, compliance, SLA
+  - API-ONLY: No frontend, developer-focused
+  - DATA/ML: Machine learning, data pipelines
+
+constraints: |
+  - Never skip quality gates
+  - Always ask for the user's name in the first interaction
+  - Use agent names: Khalil (Product), Reda (Architect), Jihane (UX),
+    Haytame (Dev), Youssef (Test), Said (Security), Anas (DevOps), Amina (Doc)
+  - Log all agent activations with version numbers
+
+output_format: |
+  Always include these in your responses:
+  - Current phase indicator
+  - Next recommended action
+  - Quality gate status
+
+examples:
+  - scenario: Start new project
+    user: "I want to build a SaaS app"
+    response: |
+      [MDAN CORE — PROJECT PROFILE DETECTED]
+
+      Based on your description, I'm classifying this as a STANDARD project.
+
+      This means:
+      - Full 5-phase workflow
+      - All agents will be activated
+      - Standard artifact depth
+
+      Does this feel right? Also, before we proceed, what is your name?
+
+  - scenario: Resume session
+    user: "Here's my MDAN-STATE.json"
+    response: |
+      [MDAN CORE — SESSION RESUMED]
+
+      📁 Project: my-saas (STANDARD)
+      📍 Current Phase: DESIGN
+      ...
+
+      → Recommended next action: Complete Architecture document
+
+      Shall I continue?
+
+changelog:
+  - version: 2.2.0
+    date: "2026-02-24"
+    changes:
+      - Added project profile detection
+      - Added session resume protocol
+      - Added agent version tracking
+      - Added quality gates for each profile
+  - version: 2.1.0
+    date: "2025-12-01"
+    changes:
+      - Added /party command
+      - Improved /mdan-help
+  - version: 2.0.0
+    date: "2025-10-01"
+    changes:
+      - Initial v2 release
+      - 5-phase workflow
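The metadata header above lends itself to a mechanical check. A minimal Python sketch, assuming each prompt YAML has already been parsed into a dict (e.g. with PyYAML); the required-key list is an assumption drawn from the fields shown, not something the package defines:

```python
# Required top-level keys for a prompt file (assumed set, inferred from
# the orchestrator.yaml header above).
REQUIRED = {"handle", "scope", "model", "version", "last_updated",
            "maintainer", "system_prompt"}

def missing_keys(prompt: dict) -> set[str]:
    """Return the required keys absent from a parsed prompt dict."""
    return REQUIRED - prompt.keys()

# Header fields only, with system_prompt deliberately left out:
prompt = {"handle": "orchestrator", "scope": "PROJECT",
          "model": "openai/gpt-4o", "version": "2.2.0",
          "last_updated": "2026-02-24", "maintainer": "khalilbenaz"}
print(missing_keys(prompt))  # {'system_prompt'}
```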
package/templates/prompts.json
@@ -0,0 +1,81 @@
+{
+  "registry_version": "1.0",
+  "last_updated": "2026-02-24",
+  "prompts": {
+    "orchestrator": {
+      "handle": "orchestrator",
+      "version": "2.2.0",
+      "file": "templates/prompts/orchestrator.yaml",
+      "active": true,
+      "model": "openai/gpt-4o",
+      "changelog": [
+        {
+          "version": "2.2.0",
+          "date": "2026-02-24",
+          "changes": ["Added project profile detection", "Added session resume protocol"]
+        }
+      ]
+    },
+    "dev-agent": {
+      "handle": "dev-agent",
+      "version": "2.0.0",
+      "file": "templates/prompts/dev-agent.yaml",
+      "active": true,
+      "model": "openai/gpt-4o",
+      "changelog": [
+        {
+          "version": "2.0.0",
+          "date": "2026-02-24",
+          "changes": ["Added implementation plan format", "Added setup instructions"]
+        }
+      ]
+    },
+    "product-agent": {
+      "handle": "product-agent",
+      "version": "2.0.0",
+      "file": "templates/prompts/product-agent.yaml",
+      "active": true,
+      "model": "openai/gpt-4o",
+      "changelog": [
+        {
+          "version": "2.0.0",
+          "date": "2026-02-24",
+          "changes": ["Initial release with MoSCoW, risk matrix"]
+        }
+      ]
+    },
+    "architect-agent": {
+      "handle": "architect-agent",
+      "version": "2.0.0",
+      "file": "templates/prompts/architect-agent.yaml",
+      "active": true,
+      "model": "openai/gpt-4o",
+      "changelog": [
+        {
+          "version": "2.0.0",
+          "date": "2026-02-24",
+          "changes": ["Added Mermaid diagrams", "Added ADR format"]
+        }
+      ]
+    },
+    "test-agent": {
+      "handle": "test-agent",
+      "version": "2.0.0",
+      "file": "templates/prompts/test-agent.yaml",
+      "active": true,
+      "model": "openai/gpt-4o",
+      "changelog": [
+        {
+          "version": "2.0.0",
+          "date": "2026-02-24",
+          "changes": ["Added scenarios support", "Added evaluations"]
+        }
+      ]
+    }
+  },
+  "sync_settings": {
+    "auto_sync": false,
+    "langwatch_integration": false,
+    "commit_on_change": true
+  }
+}
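The registry above pairs a top-level `version` with a `changelog` for each prompt; keeping the two in sync is easy to verify. A hedged Python sketch (the trimmed-down registry dict is illustrative, mirroring the JSON's shape, and assumes the changelog lists its newest entry first, as in the file):

```python
# Consistency check: each prompt's "version" should match the newest
# entry in its "changelog" (assumed newest-first).
def check_registry(registry: dict) -> list[str]:
    problems = []
    for handle, meta in registry["prompts"].items():
        latest = meta["changelog"][0]["version"]
        if meta["version"] != latest:
            problems.append(
                f"{handle}: version {meta['version']} != changelog {latest}")
    return problems

# Trimmed registry in the same shape as prompts.json, with one
# deliberate mismatch to show the check firing:
registry = {
    "registry_version": "1.0",
    "prompts": {
        "orchestrator": {"version": "2.2.0",
                         "changelog": [{"version": "2.2.0"}]},
        "dev-agent": {"version": "2.0.0",
                      "changelog": [{"version": "1.9.0"}]},
    },
}
print(check_registry(registry))  # ['dev-agent: version 2.0.0 != changelog 1.9.0']
```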
package/templates/tests/evaluations/README.md
@@ -0,0 +1,80 @@
+# Test Evaluations Index
+
+> Structured benchmarking for agent components
+
+## Overview
+
+Evaluations provide quantitative testing for specific components of your agent pipeline. Unlike scenarios, which exercise the agent end-to-end, evaluations test isolated components.
+
+## Available Evaluations
+
+| Evaluation | Description | Metrics |
+|------------|-------------|---------|
+| [rag_eval.md](rag_eval.md) | RAG correctness | F1, Precision, Recall, Context Relevance |
+| [classification_eval.md](classification_eval.md) | Routing/categorization | Accuracy, F1, Precision, Recall |
+
+## Adding New Evaluations
+
+1. Copy an existing template:
+   ```bash
+   cp rag_eval.md my_eval.md
+   ```
+
+2. Customize:
+   - Update the dataset format
+   - Set target thresholds
+   - Add domain-specific checks
+
+3. Run:
+   ```python
+   import langwatch
+   results = langwatch.evaluate(
+       dataset="my-dataset",
+       evaluator="my_eval",
+       metrics=["accuracy", "f1"]
+   )
+   ```
+
+## Built-in Evaluators
+
+LangWatch provides an extensive set of evaluators:
+
+| Evaluator | Description |
+|-----------|-------------|
+| `rag_correctness` | Retrieval and generation quality |
+| `classification_accuracy` | Routing and categorization |
+| `answer_correctness` | Factual accuracy |
+| `safety_check` | Jailbreak, PII, toxicity |
+| `format_validation` | JSON, XML, markdown structure |
+| `tool_calling` | Correct tool selection and args |
+| `latency` | Response time benchmarking |
+
+## Running Evaluations
+
+```bash
+# Python
+pytest tests/evaluations/ -v
+
+# JavaScript
+npm test -- tests/evaluations/
+
+# With LangWatch
+langwatch evaluate --dataset my-project
+```
+
+## Integration with MDAN
+
+During the VERIFY phase:
+1. The Test Agent identifies components needing evaluation
+2. Creates evaluation datasets from the PRD/user stories
+3. Runs the relevant evaluations
+4. Reports metrics in the quality gate
+5. Fails the gate if thresholds are not met
+
+## Best Practices
+
+- Create evaluation datasets from real user queries
+- Include edge cases and negative examples
+- Set realistic thresholds (not 100%)
+- Track metrics over time (regression detection)
+- Run evaluations in the CI/CD pipeline
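The VERIFY-phase gating described in that README (run evaluations, fail the gate if any threshold is missed) can be sketched in a few lines of Python; the metric names and thresholds here are illustrative assumptions, not fixed by MDAN:

```python
# Illustrative quality gate: compare reported metrics against target
# thresholds and collect every miss.
THRESHOLDS = {"accuracy": 0.95, "f1_macro": 0.90}

def quality_gate(metrics: dict) -> tuple[bool, list[str]]:
    """Return (passed, list of failed-threshold descriptions)."""
    failures = [f"{name}: {metrics.get(name, 0.0):.2f} < {target:.2f}"
                for name, target in THRESHOLDS.items()
                if metrics.get(name, 0.0) < target]
    return (not failures, failures)

ok, failures = quality_gate({"accuracy": 0.96, "f1_macro": 0.88})
print(ok, failures)  # False: f1_macro is below its target
```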
package/templates/tests/evaluations/classification_eval.md
@@ -0,0 +1,136 @@
+# Classification Evaluation Template
+
+> Benchmark routing accuracy and categorization correctness
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| eval_name | classification_accuracy |
+| version | 1.0.0 |
+| metrics | Accuracy, Precision, Recall, F1 Score |
+
+## Purpose
+
+Evaluate how accurately the agent routes requests and categorizes inputs.
+
+## Use Cases
+
+- Intent classification (chatbot routing)
+- Content moderation
+- Priority/ticket categorization
+- Language detection
+- Sentiment analysis
+
+## Dataset Format
+
+```json
+[
+  {
+    "input": "I want to get a refund",
+    "expected_category": "refund_request",
+    "expected_confidence": 0.90
+  },
+  {
+    "input": "How do I change my password?",
+    "expected_category": "account_settings",
+    "expected_confidence": 0.85
+  }
+]
+```
+
+## Evaluation Metrics
+
+| Metric | Target | Description |
+|--------|--------|-------------|
+| Accuracy | ≥0.95 | % of correct classifications |
+| Macro F1 | ≥0.90 | F1 averaged across all categories |
+| Precision (per class) | ≥0.85 | True positives / predicted positives |
+| Recall (per class) | ≥0.85 | True positives / actual positives |
+| Confidence Calibration | ≤0.10 | Mean absolute error between confidence and accuracy |
+
+## Evaluation Code
+
+### Python (LangWatch)
+
+```python
+import langwatch
+
+results = langwatch.evaluate(
+    dataset="intent-classification",
+    evaluator="classification_accuracy",
+    metrics=["accuracy", "precision", "recall", "f1_macro", "calibration"]
+)
+
+print(f"Accuracy: {results.accuracy}")
+print(f"Macro F1: {results.f1_macro}")
+print(f"Precision: {results.precision}")
+print(f"Recall: {results.recall}")
+```
+
+### JavaScript/TypeScript
+
+```typescript
+import { evaluate } from "@langwatch/evaluators";
+
+const results = await evaluate({
+  dataset: "intent-classification",
+  evaluator: "classification_accuracy",
+  metrics: ["accuracy", "precision", "recall", "f1_macro"],
+});
+
+console.log(`Accuracy: ${results.accuracy}`);
+```
+
+## Pass/Fail Criteria
+
+| Metric | Threshold | Status |
+|--------|-----------|--------|
+| Accuracy | ≥0.95 | ✅ Pass |
+| Accuracy | 0.90-0.94 | ⚠️ Warning |
+| Accuracy | <0.90 | ❌ Fail |
+| Macro F1 | ≥0.90 | ✅ Pass |
+| Macro F1 | <0.85 | ❌ Fail |
+| Confidence Error | ≤0.10 | ✅ Pass |
+| Confidence Error | >0.15 | ⚠️ Warning |
+
+## Per-Class Analysis
+
+Generate a confusion matrix:
+
+| | Predicted A | Predicted B | Predicted C |
+|--|-------------|-------------|-------------|
+| Actual A | 45 | 3 | 2 |
+| Actual B | 5 | 38 | 7 |
+| Actual C | 1 | 4 | 40 |
+
+Identify:
+- **High-confusion pairs**: A↔B need better differentiation
+- **Low-recall classes**: more training data needed
+- **Low-precision classes**: overlapping with other categories
+
+## Common Issues
+
+### Low Precision (many false positives)
+- Add negative examples
+- Make categories more distinct
+- Add disambiguation prompts
+
+### Low Recall (many false negatives)
+- Add more training data
+- Expand category definitions
+- Check for data quality issues
+
+### Poor Calibration
+- Retrain with temperature scaling
+- Add more diverse examples
+- Use a calibration-aware loss
+
+## Integration with MDAN
+
+During the VERIFY phase, the Test Agent should:
+1. Identify all classification/routing points in the system
+2. Create evaluation datasets from real user queries
+3. Run the classification evaluation
+4. Report per-class performance
+5. Fail the gate if overall accuracy < 90%
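The template's headline metrics follow directly from the Per-Class Analysis confusion matrix (rows = actual, columns = predicted); working them out by hand shows the example matrix would fail the gate:

```python
# Derive accuracy and macro F1 from the confusion matrix shown in the
# Per-Class Analysis section (rows = actual, columns = predicted).
matrix = [[45, 3, 2],
          [5, 38, 7],
          [1, 4, 40]]

n = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(3)) / n  # correct / total

f1s = []
for i in range(3):
    tp = matrix[i][i]
    precision = tp / sum(matrix[r][i] for r in range(3))  # column sum
    recall = tp / sum(matrix[i])                          # row sum
    f1s.append(2 * precision * recall / (precision + recall))
macro_f1 = sum(f1s) / len(f1s)

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
# accuracy≈0.848, macro_f1≈0.847 — both below the ≥0.95 / ≥0.90 targets
```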
package/templates/tests/evaluations/rag_eval.md
@@ -0,0 +1,116 @@
+# RAG Evaluation Template
+
+> Benchmark RAG (Retrieval-Augmented Generation) correctness and quality
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| eval_name | rag_correctness |
+| version | 1.0.0 |
+| metrics | F1 Score, Precision, Recall, Context Relevance |
+
+## Purpose
+
+Evaluate how well the RAG pipeline retrieves relevant context and generates accurate answers.
+
+## Dataset Format
+
+```json
+[
+  {
+    "query": "What is the refund policy?",
+    "expected_chunks": [
+      "refund_policy.md: paragraphs 1-3",
+      "faq.md: refund section"
+    ],
+    "expected_answer_contains": ["30 days", "original payment", "processing time"]
+  }
+]
+```
+
+## Evaluation Metrics
+
+### Retrieval Metrics
+
+| Metric | Target | Description |
+|--------|--------|-------------|
+| Recall | ≥0.85 | % of relevant chunks retrieved |
+| Precision | ≥0.90 | % of retrieved chunks that are relevant |
+| F1 Score | ≥0.87 | Harmonic mean of precision/recall |
+
+### Generation Metrics
+
+| Metric | Target | Description |
+|--------|--------|-------------|
+| Context Relevance | ≥0.80 | LLM judge scores context usefulness |
+| Answer Accuracy | ≥0.85 | Answer contains expected information |
+| Hallucination Rate | ≤0.05 | Facts not present in the context |
+
+## Evaluation Code
+
+### Python (LangWatch)
+
+```python
+import langwatch
+
+results = langwatch.evaluate(
+    dataset="customer-support-rag",
+    evaluator="rag_correctness",
+    metrics=["f1_score", "precision", "recall", "context_relevance"]
+)
+
+print(f"F1 Score: {results.f1_score}")
+print(f"Precision: {results.precision}")
+print(f"Recall: {results.recall}")
+print(f"Context Relevance: {results.context_relevance}")
+```
+
+### JavaScript/TypeScript
+
+```typescript
+import { evaluate } from "@langwatch/evaluators";
+
+const results = await evaluate({
+  dataset: "customer-support-rag",
+  evaluator: "rag_correctness",
+  metrics: ["f1_score", "precision", "recall", "context_relevance"],
+});
+
+console.log(`F1 Score: ${results.f1Score}`);
+```
+
+## Pass/Fail Criteria
+
+| Metric | Threshold | Status |
+|--------|-----------|--------|
+| F1 Score | ≥0.87 | ✅ Pass |
+| F1 Score | 0.70-0.86 | ⚠️ Warning |
+| F1 Score | <0.70 | ❌ Fail |
+| Hallucination | ≤0.05 | ✅ Pass |
+| Hallucination | >0.15 | ❌ Fail |
+
+## Troubleshooting
+
+### Low Recall
+- Check chunk size (try 512-1024)
+- Add more overlapping chunks
+- Improve the embedding model
+
+### Low Precision
+- Reduce chunk size
+- Add more specific metadata filters
+- Filter out irrelevant sources
+
+### High Hallucination
+- Add source citations to the prompt
+- Reduce max_tokens
+- Use better context ranking
+
+## Integration with MDAN
+
+During the VERIFY phase, the Test Agent should:
+1. Create a RAG evaluation dataset from the PRD
+2. Run retrieval + generation tests
+3. Report metrics in the quality gate
+4. Fail the gate if thresholds are not met
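The retrieval metrics defined in that template reduce to set arithmetic over chunk IDs. A small Python sketch (the chunk IDs are made up for illustration):

```python
# Retrieval precision / recall / F1 over sets of chunk identifiers.
def retrieval_metrics(retrieved: set, relevant: set) -> dict:
    hits = len(retrieved & relevant)                    # relevant chunks found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

m = retrieval_metrics({"refund_policy#1", "refund_policy#2", "pricing#4"},
                      {"refund_policy#1", "refund_policy#2", "faq#refunds"})
print(m)  # precision, recall, and F1 all 2/3 here
```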
package/templates/tests/scenarios/README.md
@@ -0,0 +1,62 @@
+# Test Scenarios Index
+
+> End-to-end conversational tests for MDAN projects
+
+## Overview
+
+Scenarios are conversation-based tests that validate agent behavior in realistic, multi-turn interactions. Unlike unit tests, they simulate how real users interact with your agent.
+
+## Available Scenarios
+
+| Scenario | Description | Agent |
+|----------|-------------|-------|
+| [basic_authentication.test.md](basic_authentication.test.md) | Login, logout, session management | Dev Agent |
+| [user_registration.test.md](user_registration.test.md) | Signup, validation, confirmation | Dev Agent |
+
+## Adding New Scenarios
+
+1. Copy an existing scenario:
+   ```bash
+   cp basic_authentication.test.md my_new_scenario.test.md
+   ```
+
+2. Edit the scenario:
+   - Update the metadata (name, version, framework)
+   - Write the conversation script
+   - Define success criteria
+   - Add security checks if applicable
+
+3. Run the scenario:
+   ```bash
+   # Python/pytest
+   pytest tests/scenarios/my_new_scenario.test.md -v
+
+   # Node/TypeScript
+   npm test -- tests/scenarios/my_new_scenario.test.ts
+   ```
+
+## Scenario Format
+
+Each scenario includes:
+- **Metadata**: version, framework, duration estimate
+- **Preconditions**: what must be true before testing
+- **Script**: conversation steps with verification points
+- **Success Criteria**: checklist of must-pass conditions
+- **Security Checks**: validation for security requirements
+
+## Integration with MDAN
+
+Scenarios are automatically generated during the VERIFY phase:
+1. The Test Agent reviews implemented features
+2. Creates relevant scenarios for critical flows
+3. Runs the scenarios to validate behavior
+4. Reports results in the quality gate
+
+## Framework Support
+
+| Framework | Command |
+|-----------|---------|
+| Jest | `jest tests/scenarios/` |
+| Pytest | `pytest tests/scenarios/` |
+| Playwright | `playwright test tests/scenarios/` |
+| Vitest | `vitest run tests/scenarios/` |
package/templates/tests/scenarios/basic_authentication.test.md
@@ -0,0 +1,82 @@
+# Scenario: User Authentication
+
+> Test conversation flow for basic authentication functionality
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| scenario_name | basic_authentication |
+| version | 1.0.0 |
+| agent | Dev Agent |
+| framework | [Jest/Pytest/Playwright] |
+| estimated_duration | 30s |
+
+## Description
+
+Test that the authentication system handles login, logout, and session management correctly.
+
+## Preconditions
+
+- User is not logged in
+- Database contains test users
+- Auth service is running
+
+## Script
+
+### Test Case 1: Successful Login
+
+```
+USER: I want to log in with my account
+AGENT: [Should prompt for credentials or display login form]
+USER: My email is test@example.com and password is Test123!
+AGENT: [Should validate credentials]
+-> VERIFY: auth_token received
+-> VERIFY: user object returned with correct email
+USER: What's my username?
+AGENT: [Should return 'test@example.com' from session]
+-> VERIFY: response contains correct username
+```
+
+### Test Case 2: Invalid Credentials
+
+```
+USER: I want to log in
+AGENT: [Should prompt for credentials]
+USER: email: wrong@example.com, password: wrongpass
+AGENT: [Should reject with error message]
+-> VERIFY: error message does NOT reveal if email exists
+-> VERIFY: no auth_token in response
+```
+
+### Test Case 3: Logout
+
+```
+USER: I'm logged in and want to log out
+AGENT: [Should clear session]
+-> VERIFY: session cleared
+-> VERIFY: confirmation message shown
+USER: Can I see my profile?
+AGENT: [Should deny access]
+-> VERIFY: 401 or redirect to login
+```
+
+## Success Criteria
+
+- [ ] Login with valid credentials succeeds
+- [ ] Login with invalid credentials fails with secure error
+- [ ] Logout clears session completely
+- [ ] Protected routes redirect unauthenticated users
+- [ ] Session expires after configured timeout
+
+## Failure Handling
+
+If any step fails, the scenario should:
+1. Capture the actual response
+2. Compare with expected behavior
+3. Log the difference for debugging
+4. Fail the test with descriptive error
+
+## Notes
+
+This scenario tests the authentication flow end-to-end. Use with Playwright for browser-based testing or Pytest for API-based testing.
|