mdan-cli 2.2.0 → 2.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,97 @@
+ handle: orchestrator
+ scope: PROJECT
+ model: openai/gpt-4o
+ version: 2.2.0
+ last_updated: "2026-02-24"
+ maintainer: khalilbenaz
+
+ description: |
+   MDAN Core v2 - Adaptive central orchestrator that manages the entire
+   development lifecycle. Handles project profiling, session memory,
+   and agent coordination.
+
+ system_prompt: |
+   You are MDAN Core v2, the adaptive central orchestrator of the MDAN method.
+
+   You do everything MDAN Core v1 does, plus two critical upgrades:
+
+   1. PROJECT PROFILE DETECTION — You detect the type of project early in DISCOVER
+      and automatically adapt the agent workflow, phase depth, and artifact requirements.
+
+   2. SESSION MEMORY — You can resume any project from an MDAN-STATE.json file,
+      reconstructing full context and picking up exactly where work left off.
+
+   Your responsibilities:
+   - Guide the user through 5 phases: DISCOVER, DESIGN, BUILD, VERIFY, SHIP
+   - Activate appropriate agents based on the project profile
+   - Maintain session state in MDAN-STATE.json
+   - Ensure quality gates are met before phase transitions
+   - Support slash commands: /mdan-help, /party
+
+   Project profiles you can detect:
+   - MICRO: Solo dev, <1 week, single feature
+   - STANDARD: Small team, 2-8 weeks (default)
+   - PRODUCT: Team of 5+, multi-month, stakeholders
+   - ENTERPRISE: Large team, compliance, SLA
+   - API-ONLY: No frontend, developer-focused
+   - DATA/ML: Machine learning, data pipelines
+
+ constraints: |
+   - Never skip quality gates
+   - Always ask for the user's name in the first interaction
+   - Use agent names: Khalil (Product), Reda (Architect), Jihane (UX),
+     Haytame (Dev), Youssef (Test), Said (Security), Anas (DevOps), Amina (Doc)
+   - Log all agent activations with version numbers
+
+ output_format: |
+   Always include these in your responses:
+   - Current phase indicator
+   - Next recommended action
+   - Quality gate status
+
+ examples:
+   - scenario: Start new project
+     user: "I want to build a SaaS app"
+     response: |
+       [MDAN CORE — PROJECT PROFILE DETECTED]
+
+       Based on your description, I'm classifying this as a STANDARD project.
+
+       This means:
+       - Full 5-phase workflow
+       - All agents will be activated
+       - Standard artifact depth
+
+       Does this feel right? Also, before we proceed, what is your name?
+
+   - scenario: Resume session
+     user: "Here's my MDAN-STATE.json"
+     response: |
+       [MDAN CORE — SESSION RESUMED]
+
+       📁 Project: my-saas (STANDARD)
+       📍 Current Phase: DESIGN
+       ...
+
+       → Recommended next action: Complete Architecture document
+
+       Shall I continue?
+
+ changelog:
+   - version: 2.2.0
+     date: "2026-02-24"
+     changes:
+       - Added project profile detection
+       - Added session resume protocol
+       - Added agent version tracking
+       - Added quality gates for each profile
+   - version: 2.1.0
+     date: "2025-12-01"
+     changes:
+       - Added /party command
+       - Improved /mdan-help
+   - version: 2.0.0
+     date: "2025-10-01"
+     changes:
+       - Initial v2 release
+       - 5-phase workflow
@@ -0,0 +1,81 @@
+ {
+   "registry_version": "1.0",
+   "last_updated": "2026-02-24",
+   "prompts": {
+     "orchestrator": {
+       "handle": "orchestrator",
+       "version": "2.2.0",
+       "file": "templates/prompts/orchestrator.yaml",
+       "active": true,
+       "model": "openai/gpt-4o",
+       "changelog": [
+         {
+           "version": "2.2.0",
+           "date": "2026-02-24",
+           "changes": ["Added project profile detection", "Added session resume protocol"]
+         }
+       ]
+     },
+     "dev-agent": {
+       "handle": "dev-agent",
+       "version": "2.0.0",
+       "file": "templates/prompts/dev-agent.yaml",
+       "active": true,
+       "model": "openai/gpt-4o",
+       "changelog": [
+         {
+           "version": "2.0.0",
+           "date": "2026-02-24",
+           "changes": ["Added implementation plan format", "Added setup instructions"]
+         }
+       ]
+     },
+     "product-agent": {
+       "handle": "product-agent",
+       "version": "2.0.0",
+       "file": "templates/prompts/product-agent.yaml",
+       "active": true,
+       "model": "openai/gpt-4o",
+       "changelog": [
+         {
+           "version": "2.0.0",
+           "date": "2026-02-24",
+           "changes": ["Initial release with MoSCoW, risk matrix"]
+         }
+       ]
+     },
+     "architect-agent": {
+       "handle": "architect-agent",
+       "version": "2.0.0",
+       "file": "templates/prompts/architect-agent.yaml",
+       "active": true,
+       "model": "openai/gpt-4o",
+       "changelog": [
+         {
+           "version": "2.0.0",
+           "date": "2026-02-24",
+           "changes": ["Added Mermaid diagrams", "Added ADR format"]
+         }
+       ]
+     },
+     "test-agent": {
+       "handle": "test-agent",
+       "version": "2.0.0",
+       "file": "templates/prompts/test-agent.yaml",
+       "active": true,
+       "model": "openai/gpt-4o",
+       "changelog": [
+         {
+           "version": "2.0.0",
+           "date": "2026-02-24",
+           "changes": ["Added scenarios support", "Added evaluations"]
+         }
+       ]
+     }
+   },
+   "sync_settings": {
+     "auto_sync": false,
+     "langwatch_integration": false,
+     "commit_on_change": true
+   }
+ }
@@ -0,0 +1,80 @@
+ # Test Evaluations Index
+
+ > Structured benchmarking for agent components
+
+ ## Overview
+
+ Evaluations provide quantitative testing for specific components of your agent pipeline. Unlike scenarios (end-to-end), evaluations test isolated components.
+
+ ## Available Evaluations
+
+ | Evaluation | Description | Metrics |
+ |------------|-------------|---------|
+ | [rag_eval.md](rag_eval.md) | RAG correctness | F1, Precision, Recall, Context Relevance |
+ | [classification_eval.md](classification_eval.md) | Routing/categorization | Accuracy, F1, Precision, Recall |
+
+ ## Adding New Evaluations
+
+ 1. Copy an existing template:
+    ```bash
+    cp rag_eval.md my_eval.md
+    ```
+
+ 2. Customize:
+    - Update dataset format
+    - Set target thresholds
+    - Add domain-specific checks
+
+ 3. Run:
+    ```python
+    import langwatch
+    results = langwatch.evaluate(
+        dataset="my-dataset",
+        evaluator="my_eval",
+        metrics=["accuracy", "f1"]
+    )
+    ```
+
+ ## Built-in Evaluators
+
+ LangWatch provides extensive evaluators:
+
+ | Evaluator | Description |
+ |-----------|-------------|
+ | `rag_correctness` | Retrieval and generation quality |
+ | `classification_accuracy` | Routing and categorization |
+ | `answer_correctness` | Factual accuracy |
+ | `safety_check` | Jailbreak, PII, toxicity |
+ | `format_validation` | JSON, XML, markdown structure |
+ | `tool_calling` | Correct tool selection and args |
+ | `latency` | Response time benchmarking |
+
+ ## Running Evaluations
+
+ ```bash
+ # Python
+ pytest tests/evaluations/ -v
+
+ # JavaScript
+ npm test -- tests/evaluations/
+
+ # With LangWatch
+ langwatch evaluate --dataset my-project
+ ```
+
+ ## Integration with MDAN
+
+ During VERIFY phase:
+ 1. Test Agent identifies components needing evaluation
+ 2. Creates evaluation datasets from PRD/user stories
+ 3. Runs relevant evaluations
+ 4. Reports metrics in quality gate
+ 5. Fails if thresholds not met
+ 5. Fails if thresholds not met
73
+
74
+ ## Best Practices
75
+
76
+ - Create evaluation datasets from real user queries
77
+ - Include edge cases and negative examples
78
+ - Set realistic thresholds (not 100%)
79
+ - Track metrics over time (regression detection)
80
+ - Run evaluations in CI/CD pipeline
@@ -0,0 +1,136 @@
+ # Classification Evaluation Template
+
+ > Benchmark routing accuracy and categorization correctness
+
+ ## Metadata
+
+ | Field | Value |
+ |-------|-------|
+ | eval_name | classification_accuracy |
+ | version | 1.0.0 |
+ | metrics | Accuracy, Precision, Recall, F1 Score |
+
+ ## Purpose
+
+ Evaluate how well the agent categorizes inputs and makes routing decisions.
+
+ ## Use Cases
+
+ - Intent classification (chatbot routing)
+ - Content moderation
+ - Priority/ticket categorization
+ - Language detection
+ - Sentiment analysis
+
+ ## Dataset Format
+
+ ```json
+ [
+   {
+     "input": "I want to get a refund",
+     "expected_category": "refund_request",
+     "expected_confidence": 0.90
+   },
+   {
+     "input": "How do I change my password?",
+     "expected_category": "account_settings",
+     "expected_confidence": 0.85
+   }
+ ]
+ ```
+
+ ## Evaluation Metrics
+
+ | Metric | Target | Description |
+ |--------|--------|-------------|
+ | Accuracy | ≥0.95 | % of correct classifications |
+ | Macro F1 | ≥0.90 | F1 averaged across all categories |
+ | Precision (per class) | ≥0.85 | True positives / predicted positives |
+ | Recall (per class) | ≥0.85 | True positives / actual positives |
+ | Confidence Calibration | ≤0.10 | Mean absolute error between confidence and accuracy |
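The calibration metric can be computed as the mean absolute gap between each prediction's reported confidence and whether it was actually correct. This is a simplified per-example form (binned expected-calibration-error is a common alternative); the sample predictions are illustrative:

```python
# Simplified confidence-calibration error: mean |confidence - correct|,
# where correct is 1.0 for a right prediction and 0.0 for a wrong one.
def calibration_error(predictions: list) -> float:
    return sum(
        abs(p["confidence"] - (1.0 if p["predicted"] == p["expected"] else 0.0))
        for p in predictions
    ) / len(predictions)

preds = [
    {"predicted": "refund_request", "expected": "refund_request", "confidence": 0.95},
    {"predicted": "account_settings", "expected": "refund_request", "confidence": 0.60},
]
# per-example gaps: |0.95 - 1.0| = 0.05 and |0.60 - 0.0| = 0.60, mean 0.325
```

A value above the 0.10 target, as here, signals the model is over-confident on wrong answers.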
+
+ ## Evaluation Code
+
+ ### Python (LangWatch)
+
+ ```python
+ import langwatch
+
+ results = langwatch.evaluate(
+     dataset="intent-classification",
+     evaluator="classification_accuracy",
+     metrics=["accuracy", "precision", "recall", "f1_macro", "calibration"]
+ )
+
+ print(f"Accuracy: {results.accuracy}")
+ print(f"Macro F1: {results.f1_macro}")
+ print(f"Precision: {results.precision}")
+ print(f"Recall: {results.recall}")
+ ```
+
+ ### JavaScript/TypeScript
+
+ ```typescript
+ import { evaluate } from "@langwatch/evaluators";
+
+ const results = await evaluate({
+   dataset: "intent-classification",
+   evaluator: "classification_accuracy",
+   metrics: ["accuracy", "precision", "recall", "f1_macro"],
+ });
+
+ console.log(`Accuracy: ${results.accuracy}`);
+ ```
+
+ ## Pass/Fail Criteria
+
+ | Metric | Threshold | Status |
+ |--------|-----------|--------|
+ | Accuracy | ≥0.95 | ✅ Pass |
+ | Accuracy | 0.90-0.94 | ⚠️ Warning |
+ | Accuracy | <0.90 | ❌ Fail |
+ | Macro F1 | ≥0.90 | ✅ Pass |
+ | Macro F1 | <0.85 | ❌ Fail |
+ | Confidence Error | ≤0.10 | ✅ Pass |
+ | Confidence Error | >0.15 | ⚠️ Warning |
+
+ ## Per-Class Analysis
+
+ Generate a confusion matrix:
+
+ | | Predicted A | Predicted B | Predicted C |
+ |--|-------------|-------------|-------------|
+ | Actual A | 45 | 3 | 2 |
+ | Actual B | 5 | 38 | 7 |
+ | Actual C | 1 | 4 | 40 |
+
+ Identify:
+ - **High-confusion pairs**: A↔B need better differentiation
+ - **Low-recall classes**: More training data needed
+ - **Low-precision classes**: Overlapping with other categories
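Per-class precision and recall fall straight out of the matrix: recall is the diagonal cell over its row sum, precision is the diagonal cell over its column sum. A sketch using the example numbers above:

```python
# Per-class precision/recall from a confusion matrix
# (rows = actual class, columns = predicted class).
matrix = [
    [45, 3, 2],   # Actual A
    [5, 38, 7],   # Actual B
    [1, 4, 40],   # Actual C
]

def per_class_metrics(m: list) -> list:
    n = len(m)
    stats = []
    for c in range(n):
        tp = m[c][c]
        predicted = sum(m[r][c] for r in range(n))  # column sum
        actual = sum(m[c])                          # row sum
        stats.append({"precision": tp / predicted, "recall": tp / actual})
    return stats

stats = per_class_metrics(matrix)
accuracy = sum(matrix[i][i] for i in range(3)) / sum(map(sum, matrix))
# e.g. class A recall = 45/50 = 0.90; overall accuracy = 123/145 ≈ 0.848
```

With these numbers the overall accuracy sits below the 0.90 fail line, so class B (recall 38/50 = 0.76) would be the first place to look.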
+
+ ## Common Issues
+
+ ### Low Precision (many false positives)
+ - Add negative examples
+ - Make categories more distinct
+ - Add disambiguation prompts
+
+ ### Low Recall (many false negatives)
+ - Add more training data
+ - Expand category definitions
+ - Check for data quality issues
+
+ ### Poor Calibration
+ - Apply temperature scaling
+ - Add more diverse examples
+ - Use a calibration-aware loss
+
+ ## Integration with MDAN
+
+ During VERIFY phase, Test Agent should:
+ 1. Identify all classification/routing points in the system
+ 2. Create evaluation datasets from real user queries
+ 3. Run classification evaluation
+ 4. Report per-class performance
+ 5. Fail if overall accuracy < 90%
@@ -0,0 +1,116 @@
+ # RAG Evaluation Template
+
+ > Benchmark RAG (Retrieval-Augmented Generation) correctness and quality
+
+ ## Metadata
+
+ | Field | Value |
+ |-------|-------|
+ | eval_name | rag_correctness |
+ | version | 1.0.0 |
+ | metrics | F1 Score, Precision, Recall, Context Relevance |
+
+ ## Purpose
+
+ Evaluate how well the RAG pipeline retrieves relevant context and generates accurate answers.
+
+ ## Dataset Format
+
+ ```json
+ [
+   {
+     "query": "What is the refund policy?",
+     "expected_chunks": [
+       "refund_policy.md: paragraphs 1-3",
+       "faq.md: refund section"
+     ],
+     "expected_answer_contains": ["30 days", "original payment", "processing time"]
+   }
+ ]
+ ```
+
+ ## Evaluation Metrics
+
+ ### Retrieval Metrics
+
+ | Metric | Target | Description |
+ |--------|--------|-------------|
+ | Recall | ≥0.85 | % of relevant chunks retrieved |
+ | Precision | ≥0.90 | % of retrieved chunks that are relevant |
+ | F1 Score | ≥0.87 | Harmonic mean of precision/recall |
+
42
+ ### Generation Metrics
43
+
44
+ | Metric | Target | Description |
45
+ |--------|--------|-------------|
46
+ | Context Relevance | ≥0.80 | LLM judge scores context usefulness |
47
+ | Answer Accuracy | ≥0.85 | Answer contains expected information |
48
+ | Hallucination Rate | ≤0.05 | Facts not in context |
49
+
50
+ ## Evaluation Code
51
+
52
+ ### Python (LangWatch)
53
+
54
+ ```python
55
+ import langwatch
56
+
57
+ results = langwatch.evaluate(
58
+ dataset="customer-support-rag",
59
+ evaluator="rag_correctness",
60
+ metrics=["f1_score", "precision", "recall", "context_relevance"]
61
+ )
62
+
63
+ print(f"F1 Score: {results.f1_score}")
64
+ print(f"Precision: {results.precision}")
65
+ print(f"Recall: {results.recall}")
66
+ print(f"Context Relevance: {results.context_relevance}")
67
+ ```
68
+
69
+ ### JavaScript/TypeScript
70
+
71
+ ```typescript
72
+ import { evaluate } from "@langwatch/evaluators";
73
+
74
+ const results = await evaluate({
75
+ dataset: "customer-support-rag",
76
+ evaluator: "rag_correctness",
77
+ metrics: ["f1_score", "precision", "recall", "context_relevance"],
78
+ });
79
+
80
+ console.log(`F1 Score: ${results.f1Score}`);
81
+ ```
82
+
83
+ ## Pass/Fail Criteria
84
+
85
+ | Metric | Threshold | Status |
86
+ |--------|-----------|--------|
87
+ | F1 Score | ≥0.87 | ✅ Pass |
88
+ | F1 Score | 0.70-0.86 | ⚠️ Warning |
89
+ | F1 Score | <0.70 | ❌ Fail |
90
+ | Hallucination | ≤0.05 | ✅ Pass |
91
+ | Hallucination | >0.15 | ❌ Fail |
92
+
93
+ ## Troubleshooting
94
+
95
+ ### Low Recall
96
+ - Check chunk size (try 512-1024)
97
+ - Add more overlapping chunks
98
+ - Improve embedding model
99
+
100
+ ### Low Precision
101
+ - Reduce chunk size
102
+ - Add more specific metadata filters
103
+ - Filter out irrelevant sources
104
+
105
+ ### High Hallucination
106
+ - Add source citations to prompt
107
+ - Reduce max_tokens
108
+ - Use better context ranking
109
+
110
+ ## Integration with MDAN
111
+
112
+ During VERIFY phase, Test Agent should:
113
+ 1. Create RAG evaluation dataset from PRD
114
+ 2. Run retrieval + generation tests
115
+ 3. Report metrics in quality gate
116
+ 4. Fail if thresholds not met
@@ -0,0 +1,62 @@
+ # Test Scenarios Index
+
+ > End-to-end conversational tests for MDAN projects
+
+ ## Overview
+
+ Scenarios are conversation-based tests that validate agent behavior in realistic, multi-turn interactions. Unlike unit tests, they simulate how real users interact with your agent.
+
+ ## Available Scenarios
+
+ | Scenario | Description | Agent |
+ |----------|-------------|-------|
+ | [basic_authentication.test.md](basic_authentication.test.md) | Login, logout, session management | Dev Agent |
+ | [user_registration.test.md](user_registration.test.md) | Signup, validation, confirmation | Dev Agent |
+
+ ## Adding New Scenarios
+
+ 1. Copy an existing scenario:
+    ```bash
+    cp basic_authentication.test.md my_new_scenario.test.md
+    ```
+
+ 2. Edit the scenario:
+    - Update metadata (name, version, framework)
+    - Write the conversation script
+    - Define success criteria
+    - Add security checks if applicable
+
+ 3. Run the scenario:
+    ```bash
+    # Python/pytest
+    pytest tests/scenarios/my_new_scenario.test.md -v
+
+    # Node/TypeScript
+    npm test -- tests/scenarios/my_new_scenario.test.ts
+    ```
+
+ ## Scenario Format
+
+ Each scenario includes:
+ - **Metadata**: version, framework, duration estimate
+ - **Preconditions**: what must be true before testing
+ - **Script**: conversation steps with verification points
+ - **Success Criteria**: checklist of must-pass conditions
+ - **Security Checks**: validation for security requirements
+
+ ## Integration with MDAN
+
+ Scenarios are automatically generated during the VERIFY phase:
+ 1. Test Agent reviews implemented features
+ 2. Creates relevant scenarios for critical flows
+ 3. Runs scenarios to validate behavior
+ 4. Reports results in quality gate
+
+ ## Framework Support
+
+ | Framework | Command |
+ |-----------|---------|
+ | Jest | `jest tests/scenarios/` |
+ | Pytest | `pytest tests/scenarios/` |
+ | Playwright | `playwright test tests/scenarios/` |
+ | Vitest | `vitest run tests/scenarios/` |
@@ -0,0 +1,82 @@
+ # Scenario: User Authentication
+
+ > Test conversation flow for basic authentication functionality
+
+ ## Metadata
+
+ | Field | Value |
+ |-------|-------|
+ | scenario_name | basic_authentication |
+ | version | 1.0.0 |
+ | agent | Dev Agent |
+ | framework | [Jest/Pytest/Playwright] |
+ | estimated_duration | 30s |
+
+ ## Description
+
+ Test that the authentication system handles login, logout, and session management correctly.
+
+ ## Preconditions
+
+ - User is not logged in
+ - Database contains test users
+ - Auth service is running
+
+ ## Script
+
+ ### Test Case 1: Successful Login
+
+ ```
+ USER: I want to log in with my account
+ AGENT: [Should prompt for credentials or display login form]
+ USER: My email is test@example.com and password is Test123!
+ AGENT: [Should validate credentials]
+ -> VERIFY: auth_token received
+ -> VERIFY: user object returned with correct email
+ USER: What's my username?
+ AGENT: [Should return 'test@example.com' from session]
+ -> VERIFY: response contains correct username
+ ```
+
+ ### Test Case 2: Invalid Credentials
+
+ ```
+ USER: I want to log in
+ AGENT: [Should prompt for credentials]
+ USER: email: wrong@example.com, password: wrongpass
+ AGENT: [Should reject with error message]
+ -> VERIFY: error message does NOT reveal if email exists
+ -> VERIFY: no auth_token in response
+ ```
+
52
+ ### Test Case 3: Logout
53
+
54
+ ```
55
+ USER: I'm logged in and want to log out
56
+ AGENT: [Should clear session]
57
+ -> VERIFY: session cleared
58
+ -> VERIFY: confirmation message shown
59
+ USER: Can I see my profile?
60
+ AGENT: [Should deny access]
61
+ -> VERIFY: 401 or redirect to login
62
+ ```
63
+
64
+ ## Success Criteria
65
+
66
+ - [ ] Login with valid credentials succeeds
67
+ - [ ] Login with invalid credentials fails with secure error
68
+ - [ ] Logout clears session completely
69
+ - [ ] Protected routes redirect unauthenticated users
70
+ - [ ] Session expires after configured timeout
71
+
72
+ ## Failure Handling
73
+
74
+ If any step fails, the scenario should:
75
+ 1. Capture the actual response
76
+ 2. Compare with expected behavior
77
+ 3. Log the difference for debugging
78
+ 4. Fail the test with descriptive error
79
+
80
+ ## Notes
81
+
82
+ This scenario tests the authentication flow end-to-end. Use with Playwright for browser-based testing or Pytest for API-based testing.