mdan-cli 2.2.0 → 2.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -23,14 +23,14 @@ You are operating inside Windsurf IDE with Cascade AI.
  - Use Cascade's flow awareness to maintain MDAN phase context across sessions

  ### File Organization
- All MDAN artifacts should be saved to `.mdan/artifacts/` for reference.
+ All MDAN artifacts should be saved to `mdan/artifacts/` for reference.
  ```

  ### Step 2: Copy agents

  ```bash
- mkdir -p .mdan/agents .mdan/artifacts
- cp agents/*.md .mdan/agents/
+ mkdir -p mdan/agents mdan/artifacts
+ cp agents/*.md mdan/agents/
  ```

  ### Step 3: Using MDAN with Cascade
@@ -45,4 +45,4 @@ Cascade's multi-step reasoning pairs well with MDAN's structured phases. When st

  - Windsurf's Cascade is excellent for the BUILD phase — it can implement entire features autonomously
  - Use MDAN's Feature Briefs as Cascade tasks for predictable, structured implementation
- - Save architecture documents to `.mdan/artifacts/` so Cascade can reference them in context
+ - Save architecture documents to `mdan/artifacts/` so Cascade can reference them in context
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "mdan-cli",
-   "version": "2.2.0",
+   "version": "2.4.0",
    "description": "Multi-Agent Development Agentic Network - A modern, adaptive, LLM-agnostic methodology for building software",
    "main": "cli/mdan.js",
    "bin": {
@@ -39,7 +39,9 @@
    "integrations/",
    "memory/",
    "skills/",
-   "install.sh"
+   "install.sh",
+   ".mcp.json",
+   "AGENTS.md"
  ],
  "dependencies": {
    "@clack/prompts": "^1.0.1",
@@ -38,11 +38,13 @@ Implemented features: [List of completed features]

  Expected output:
  1. Complete test plan
- 2. Unit tests for all features
+ 2. Unit tests for all features (80%+ coverage)
  3. Integration tests for critical flows
  4. E2E test scenarios
- 5. Performance test criteria
- 6. Test results summary
+ 5. Scenario tests (Better Agents format)
+ 6. Evaluation datasets (if RAG/ML features)
+ 7. Performance test criteria
+ 8. Test results summary
  ```

  ---
@@ -77,6 +79,8 @@ Testing:
  [ ] Test coverage meets target (e.g., 80%)
  [ ] Integration tests pass
  [ ] At least 3 E2E scenarios pass
+ [ ] Scenario tests pass (Better Agents format)
+ [ ] Evaluation benchmarks pass (if RAG/ML features)
  [ ] Performance criteria are met

  Security:
@@ -99,3 +103,5 @@ Quality:
  |---|---|---|---|
  | Test Plan | `templates/TEST-PLAN.md` | Test Agent | Complete |
  | Security Report | `templates/SECURITY-REVIEW.md` | Security Agent | Signed off |
+ | Scenarios | `templates/tests/scenarios/*.test.md` | Test Agent | Pass |
+ | Evaluations | `templates/tests/evaluations/*.md` | Test Agent | Pass thresholds |
@@ -0,0 +1,108 @@
+ # Prompts Versioning Index
+
+ > Version-controlled prompts for MDAN agents
+
+ ## Overview
+
+ Prompts are versioned using YAML files for better collaboration, tracking, and rollback capabilities.
+
+ ## Available Prompts
+
+ | Prompt | Version | Agent |
+ |--------|---------|-------|
+ | [orchestrator.yaml](prompts/orchestrator.yaml) | 2.2.0 | MDAN Core |
+ | [dev-agent.yaml](prompts/dev-agent.yaml) | 2.0.0 | Haytame (Dev) |
+ | [product-agent.yaml](prompts/product-agent.yaml) | 2.0.0 | Khalil (Product) |
+ | [architect-agent.yaml](prompts/architect-agent.yaml) | 2.0.0 | Reda (Architect) |
+ | [test-agent.yaml](prompts/test-agent.yaml) | 2.0.0 | Youssef (Test) |
+
+ ## Prompt Format
+
+ Each prompt file follows this structure:
+
+ ```yaml
+ handle: agent-handle
+ scope: PROJECT # or GLOBAL
+ model: openai/gpt-4o
+ version: 1.0.0
+ last_updated: "2026-02-24"
+ maintainer: username
+
+ description: Brief description
+
+ system_prompt: |
+   Full prompt content with:
+   - Role definition
+   - Capabilities
+   - Constraints
+   - Output format
+
+ capabilities:
+   - Capability 1
+   - Capability 2
+
+ constraints:
+   - Constraint 1
+   - Constraint 2
+
+ changelog:
+   - version: 1.0.0
+     date: "2026-02-24"
+     changes:
+       - Change 1
+       - Change 2
+ ```
+
+ ## Using Prompts
+
+ ### CLI Commands
+
+ ```bash
+ # List all prompts
+ mdan prompt list
+
+ # Show specific prompt
+ mdan prompt show orchestrator
+
+ # Compare versions
+ mdan prompt diff orchestrator 2.1.0 2.2.0
+
+ # Rollback prompt
+ mdan prompt rollback orchestrator 2.1.0
+ ```
+
+ ### Integration with IDE
+
+ Prompts are synced to:
+ - `.claude/skills/` - For Claude/Cursor
+ - `.windsurfrules` - For Windsurf
+ - `.github/copilot-instructions.md` - For Copilot
+
+ ## Registry
+
+ The `prompts.json` file tracks all prompts:
+
+ ```json
+ {
+   "prompts": {
+     "orchestrator": {
+       "version": "2.2.0",
+       "active": true
+     }
+   }
+ }
+ ```
+
+ ## Best Practices
+
+ 1. **Version bump on any change** - Even small tweaks warrant a version bump
+ 2. **Document changes** - Always add changelog entries
+ 3. **Test prompts** - Validate prompts before releasing
+ 4. **Use semantic versioning** - MAJOR for breaking, MINOR for features, PATCH for fixes
+
+ ## Adding New Prompts
+
+ 1. Create YAML file in `templates/prompts/`
+ 2. Add entry to `prompts.json`
+ 3. Update registry version
+ 4. Commit with descriptive message
@@ -0,0 +1,85 @@
+ handle: dev-agent
+ scope: PROJECT
+ model: openai/gpt-4o
+ version: 2.0.0
+ last_updated: "2026-02-24"
+ maintainer: khalilbenaz
+
+ description: MDAN Dev Agent (Haytame) - Senior full-stack developer responsible for implementation, code quality, and technical decisions.
+
+ system_prompt: |
+   [MDAN-AGENT]
+   NAME: Dev Agent (Haytame)
+   VERSION: 2.0.0
+   ROLE: Senior Full-Stack Developer responsible for implementation
+   PHASE: BUILD
+   REPORTS_TO: MDAN Core
+
+   You are Haytame, a senior full-stack developer with 10+ years of experience.
+   You write clean, maintainable, production-ready code. You care about:
+   - Code readability over cleverness
+   - Testing as a safety net
+   - Security by default
+   - Performance optimization when needed
+
+   Your philosophy:
+   - "Code is read more than written"
+   - "The best code is no code"
+   - "Always consider the next developer"
+   - "Security is not optional"
+
+ capabilities:
+   - Implement features from user stories
+   - Write unit, integration, and e2e tests
+   - Create API endpoints
+   - Design database schemas
+   - Set up CI/CD pipelines
+   - Perform code reviews
+   - Refactor existing code
+   - Optimize performance
+   - Fix bugs
+
+ constraints:
+   - NEVER skip tests
+   - NEVER commit secrets/keys to repository
+   - NEVER bypass security checks
+   - ALWAYS use type hints (TypeScript/Python)
+   - ALWAYS handle errors explicitly
+   - NEVER expose stack traces to users
+   - ALWAYS use environment variables for config
+
+ input_format: |
+   MDAN Core provides:
+   - User stories with acceptance criteria
+   - Architecture document
+   - UX designs/wireframes
+   - Previous implementation context (if any)
+
+ output_format: |
+   Produce implementation artifacts:
+   - Implementation Plan (file structure, dependencies, sequence)
+   - Code Files (source, types, configs)
+   - Tests (unit 80%+, integration, e2e)
+   - Setup Instructions (env vars, run locally, migrations)
+
+ quality_checklist:
+   - All acceptance criteria implemented
+   - Code follows style guide
+   - No TODO/FIXME in production
+   - All tests pass
+   - No security vulnerabilities
+   - Error handling comprehensive
+   - Logging appropriate
+   - Performance addressed
+
+ changelog:
+   - version: 2.0.0
+     date: "2026-02-24"
+     changes:
+       - Added implementation plan format
+       - Added setup instructions section
+       - Added coding standards reference
+   - version: 1.0.0
+     date: "2025-10-01"
+     changes:
+       - Initial release
@@ -0,0 +1,97 @@
+ handle: orchestrator
+ scope: PROJECT
+ model: openai/gpt-4o
+ version: 2.2.0
+ last_updated: "2026-02-24"
+ maintainer: khalilbenaz
+
+ description: |
+   MDAN Core v2 - Adaptive central orchestrator that manages the entire
+   development lifecycle. Handles project profiling, session memory,
+   and agent coordination.
+
+ system_prompt: |
+   You are MDAN Core v2, the adaptive central orchestrator of the MDAN method.
+
+   You do everything MDAN Core v1 does, plus two critical upgrades:
+
+   1. PROJECT PROFILE DETECTION — You detect the type of project early in DISCOVER
+      and automatically adapt the agent workflow, phase depth, and artifact requirements.
+
+   2. SESSION MEMORY — You can resume any project from an MDAN-STATE.json file,
+      reconstructing full context and picking up exactly where work left off.
+
+   Your responsibilities:
+   - Guide the user through 5 phases: DISCOVER, DESIGN, BUILD, VERIFY, SHIP
+   - Activate appropriate agents based on project profile
+   - Maintain session state in MDAN-STATE.json
+   - Ensure quality gates are met before phase transitions
+   - Support slash commands: /mdan-help, /party
+
+   Project profiles you can detect:
+   - MICRO: Solo dev, <1 week, single feature
+   - STANDARD: Small team, 2-8 weeks (default)
+   - PRODUCT: Team 5+, multi-month, stakeholders
+   - ENTERPRISE: Large team, compliance, SLA
+   - API-ONLY: No frontend, developer-focused
+   - DATA/ML: Machine learning, data pipelines
+
+ constraints: |
+   - Never skip quality gates
+   - Always ask for the user's name in the first interaction
+   - Use agent names: Khalil (Product), Reda (Architect), Jihane (UX),
+     Haytame (Dev), Youssef (Test), Said (Security), Anas (DevOps), Amina (Doc)
+   - Log all agent activations with version numbers
+
+ output_format: |
+   Always include these in your responses:
+   - Current phase indicator
+   - Next recommended action
+   - Quality gate status
+
+ examples:
+   - scenario: Start new project
+     user: "I want to build a SaaS app"
+     response: |
+       [MDAN CORE — PROJECT PROFILE DETECTED]
+
+       Based on your description, I'm classifying this as a STANDARD project.
+
+       This means:
+       - Full 5-phase workflow
+       - All agents will be activated
+       - Standard artifact depth
+
+       Does this feel right? Also, before we proceed, what is your name?
+
+   - scenario: Resume session
+     user: "Here's my MDAN-STATE.json"
+     response: |
+       [MDAN CORE — SESSION RESUMED]
+
+       📁 Project: my-saas (STANDARD)
+       📍 Current Phase: DESIGN
+       ...
+
+       → Recommended next action: Complete Architecture document
+
+       Shall I continue?
+
+ changelog:
+   - version: 2.2.0
+     date: "2026-02-24"
+     changes:
+       - Added project profile detection
+       - Added session resume protocol
+       - Added agent version tracking
+       - Added quality gates for each profile
+   - version: 2.1.0
+     date: "2025-12-01"
+     changes:
+       - Added /party command
+       - Improved /mdan-help
+   - version: 2.0.0
+     date: "2025-10-01"
+     changes:
+       - Initial v2 release
+       - 5-phase workflow
@@ -0,0 +1,81 @@
+ {
+   "registry_version": "1.0",
+   "last_updated": "2026-02-24",
+   "prompts": {
+     "orchestrator": {
+       "handle": "orchestrator",
+       "version": "2.2.0",
+       "file": "templates/prompts/orchestrator.yaml",
+       "active": true,
+       "model": "openai/gpt-4o",
+       "changelog": [
+         {
+           "version": "2.2.0",
+           "date": "2026-02-24",
+           "changes": ["Added project profile detection", "Added session resume protocol"]
+         }
+       ]
+     },
+     "dev-agent": {
+       "handle": "dev-agent",
+       "version": "2.0.0",
+       "file": "templates/prompts/dev-agent.yaml",
+       "active": true,
+       "model": "openai/gpt-4o",
+       "changelog": [
+         {
+           "version": "2.0.0",
+           "date": "2026-02-24",
+           "changes": ["Added implementation plan format", "Added setup instructions"]
+         }
+       ]
+     },
+     "product-agent": {
+       "handle": "product-agent",
+       "version": "2.0.0",
+       "file": "templates/prompts/product-agent.yaml",
+       "active": true,
+       "model": "openai/gpt-4o",
+       "changelog": [
+         {
+           "version": "2.0.0",
+           "date": "2026-02-24",
+           "changes": ["Initial release with MoSCoW, risk matrix"]
+         }
+       ]
+     },
+     "architect-agent": {
+       "handle": "architect-agent",
+       "version": "2.0.0",
+       "file": "templates/prompts/architect-agent.yaml",
+       "active": true,
+       "model": "openai/gpt-4o",
+       "changelog": [
+         {
+           "version": "2.0.0",
+           "date": "2026-02-24",
+           "changes": ["Added Mermaid diagrams", "Added ADR format"]
+         }
+       ]
+     },
+     "test-agent": {
+       "handle": "test-agent",
+       "version": "2.0.0",
+       "file": "templates/prompts/test-agent.yaml",
+       "active": true,
+       "model": "openai/gpt-4o",
+       "changelog": [
+         {
+           "version": "2.0.0",
+           "date": "2026-02-24",
+           "changes": ["Added scenarios support", "Added evaluations"]
+         }
+       ]
+     }
+   },
+   "sync_settings": {
+     "auto_sync": false,
+     "langwatch_integration": false,
+     "commit_on_change": true
+   }
+ }
@@ -0,0 +1,80 @@
+ # Test Evaluations Index
+
+ > Structured benchmarking for agent components
+
+ ## Overview
+
+ Evaluations provide quantitative testing for specific components of your agent pipeline. Unlike scenarios (end-to-end), evaluations test isolated components.
+
+ ## Available Evaluations
+
+ | Evaluation | Description | Metrics |
+ |------------|-------------|---------|
+ | [rag_eval.md](rag_eval.md) | RAG correctness | F1, Precision, Recall, Context Relevance |
+ | [classification_eval.md](classification_eval.md) | Routing/categorization | Accuracy, F1, Precision, Recall |
+
+ ## Adding New Evaluations
+
+ 1. Copy an existing template:
+    ```bash
+    cp rag_eval.md my_eval.md
+    ```
+
+ 2. Customize:
+    - Update dataset format
+    - Set target thresholds
+    - Add domain-specific checks
+
+ 3. Run:
+    ```python
+    import langwatch
+    results = langwatch.evaluate(
+        dataset="my-dataset",
+        evaluator="my_eval",
+        metrics=["accuracy", "f1"]
+    )
+    ```
+
+ ## Built-in Evaluators
+
+ LangWatch provides extensive evaluators:
+
+ | Evaluator | Description |
+ |-----------|-------------|
+ | `rag_correctness` | Retrieval and generation quality |
+ | `classification_accuracy` | Routing and categorization |
+ | `answer_correctness` | Factual accuracy |
+ | `safety_check` | Jailbreak, PII, toxicity |
+ | `format_validation` | JSON, XML, markdown structure |
+ | `tool_calling` | Correct tool selection and args |
+ | `latency` | Response time benchmarking |
+
+ ## Running Evaluations
+
+ ```bash
+ # Python
+ pytest tests/evaluations/ -v
+
+ # JavaScript
+ npm test -- tests/evaluations/
+
+ # With LangWatch
+ langwatch evaluate --dataset my-project
+ ```
+
+ ## Integration with MDAN
+
+ During the VERIFY phase:
+ 1. Test Agent identifies components needing evaluation
+ 2. Creates evaluation datasets from PRD/user stories
+ 3. Runs relevant evaluations
+ 4. Reports metrics in quality gate
+ 5. Fails if thresholds not met
+
+ ## Best Practices
+
+ - Create evaluation datasets from real user queries
+ - Include edge cases and negative examples
+ - Set realistic thresholds (not 100%)
+ - Track metrics over time (regression detection)
+ - Run evaluations in CI/CD pipeline
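Step 5 of the VERIFY integration ("fails if thresholds not met") reduces to a simple gate over a metrics report. A sketch under stated assumptions: `quality_gate` is an illustrative helper rather than an MDAN API, and the metric names and targets are examples drawn from the evaluation templates:

```python
def quality_gate(metrics: dict[str, float],
                 thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    """Compare each measured metric against its minimum threshold.

    Returns (passed, failures) so the VERIFY phase can report which
    metrics blocked the gate instead of a bare pass/fail.
    """
    failures = [
        f"{name}: {metrics.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]
    return (not failures, failures)

measured = {"accuracy": 0.93, "f1_macro": 0.91}
targets = {"accuracy": 0.95, "f1_macro": 0.90}  # example targets

passed, failures = quality_gate(measured, targets)
print(passed, failures)  # → False ['accuracy: 0.93 < 0.95']
```

Missing metrics count as 0.0 here, so an evaluator that never ran fails the gate rather than silently passing.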
@@ -0,0 +1,136 @@
+ # Classification Evaluation Template
+
+ > Benchmark routing accuracy and categorization correctness
+
+ ## Metadata
+
+ | Field | Value |
+ |-------|-------|
+ | eval_name | classification_accuracy |
+ | version | 1.0.0 |
+ | metrics | Accuracy, Precision, Recall, F1 Score |
+
+ ## Purpose
+
+ Evaluate how well the agent routes requests and categorizes inputs.
+
+ ## Use Cases
+
+ - Intent classification (chatbot routing)
+ - Content moderation
+ - Priority/ticket categorization
+ - Language detection
+ - Sentiment analysis
+
+ ## Dataset Format
+
+ ```json
+ [
+   {
+     "input": "I want to get a refund",
+     "expected_category": "refund_request",
+     "expected_confidence": 0.90
+   },
+   {
+     "input": "How do I change my password?",
+     "expected_category": "account_settings",
+     "expected_confidence": 0.85
+   }
+ ]
+ ```
+
+ ## Evaluation Metrics
+
+ | Metric | Target | Description |
+ |--------|--------|-------------|
+ | Accuracy | ≥0.95 | % of correct classifications |
+ | Macro F1 | ≥0.90 | F1 averaged across all categories |
+ | Precision (per class) | ≥0.85 | True positives / predicted positives |
+ | Recall (per class) | ≥0.85 | True positives / actual positives |
+ | Confidence Calibration | ≤0.10 | Mean absolute error between confidence and accuracy |
+
+ ## Evaluation Code
+
+ ### Python (LangWatch)
+
+ ```python
+ import langwatch
+
+ results = langwatch.evaluate(
+     dataset="intent-classification",
+     evaluator="classification_accuracy",
+     metrics=["accuracy", "precision", "recall", "f1_macro", "calibration"]
+ )
+
+ print(f"Accuracy: {results.accuracy}")
+ print(f"Macro F1: {results.f1_macro}")
+ print(f"Precision: {results.precision}")
+ print(f"Recall: {results.recall}")
+ ```
+
+ ### JavaScript/TypeScript
+
+ ```typescript
+ import { evaluate } from "@langwatch/evaluators";
+
+ const results = await evaluate({
+   dataset: "intent-classification",
+   evaluator: "classification_accuracy",
+   metrics: ["accuracy", "precision", "recall", "f1_macro"],
+ });
+
+ console.log(`Accuracy: ${results.accuracy}`);
+ ```
+
+ ## Pass/Fail Criteria
+
+ | Metric | Threshold | Status |
+ |--------|-----------|--------|
+ | Accuracy | ≥0.95 | ✅ Pass |
+ | Accuracy | 0.90-0.94 | ⚠️ Warning |
+ | Accuracy | <0.90 | ❌ Fail |
+ | Macro F1 | ≥0.90 | ✅ Pass |
+ | Macro F1 | <0.85 | ❌ Fail |
+ | Confidence Error | ≤0.10 | ✅ Pass |
+ | Confidence Error | >0.15 | ⚠️ Warning |
+
+ ## Per-Class Analysis
+
+ Generate confusion matrix:
+
+ | | Predicted A | Predicted B | Predicted C |
+ |--|-------------|-------------|-------------|
+ | Actual A | 45 | 3 | 2 |
+ | Actual B | 5 | 38 | 7 |
+ | Actual C | 1 | 4 | 40 |
+
+ Identify:
+ - **High-confusion pairs**: A↔B need better differentiation
+ - **Low-recall classes**: More training data needed
+ - **Low-precision classes**: Overlapping with other categories
+
+ ## Common Issues
+
+ ### Low Precision (many false positives)
+ - Add negative examples
+ - Make categories more distinct
+ - Add disambiguation prompts
+
+ ### Low Recall (many false negatives)
+ - Add more training data
+ - Expand category definitions
+ - Check for data quality issues
+
+ ### Poor Calibration
+ - Retrain with temperature scaling
+ - Add more diverse examples
+ - Use calibration-aware loss
+
+ ## Integration with MDAN
+
+ During the VERIFY phase, the Test Agent should:
+ 1. Identify all classification/routing points in the system
+ 2. Create evaluation datasets from real user queries
+ 3. Run classification evaluation
+ 4. Report per-class performance
+ 5. Fail if overall accuracy < 90%
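The accuracy and macro F1 figures in the metrics table reduce to simple counting if you want them without a hosted evaluator. A pure-Python sketch (not LangWatch's implementation) over pairs of expected and predicted categories, matching the dataset format above:

```python
from collections import defaultdict

def classification_report(pairs: list[tuple[str, str]]) -> dict:
    """pairs = [(expected_category, predicted_category), ...]"""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    correct = 0
    for expected, predicted in pairs:
        if expected == predicted:
            correct += 1
            tp[expected] += 1
        else:
            fp[predicted] += 1  # predicted this class, was wrong
            fn[expected] += 1   # missed the actual class
    labels = set(tp) | set(fp) | set(fn)
    f1s = []
    for label in labels:
        denom_p = tp[label] + fp[label]
        denom_r = tp[label] + fn[label]
        precision = tp[label] / denom_p if denom_p else 0.0
        recall = tp[label] / denom_r if denom_r else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "f1_macro": sum(f1s) / len(f1s),  # F1 averaged across all categories
    }

pairs = [
    ("refund_request", "refund_request"),
    ("account_settings", "account_settings"),
    ("refund_request", "account_settings"),
    ("refund_request", "refund_request"),
]
report = classification_report(pairs)
print(report["accuracy"])  # → 0.75
```

Macro F1 gives every class equal weight, which is why the table tracks it separately from raw accuracy: a dominant class cannot mask a badly routed rare one.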