opencode-multiagent 0.2.1 → 0.3.0-next.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (153)

package/AGENTS.md +62 -0
package/CHANGELOG.md +18 -0
package/CONTRIBUTING.md +36 -0
package/README.md +41 -165
package/README.tr.md +84 -0
package/RELEASE.md +68 -0
package/agents/advisor.md +9 -6
package/agents/auditor.md +8 -6
package/agents/critic.md +19 -10
package/agents/deep-worker.md +11 -7
package/agents/devil.md +3 -1
package/agents/executor.md +20 -19
package/agents/heavy-worker.md +11 -7
package/agents/lead.md +22 -30
package/agents/librarian.md +6 -2
package/agents/planner.md +18 -10
package/agents/qa.md +9 -6
package/agents/quick.md +12 -7
package/agents/reviewer.md +9 -6
package/agents/scout.md +9 -5
package/agents/scribe.md +33 -28
package/agents/strategist.md +10 -7
package/agents/ui-heavy-worker.md +11 -7
package/agents/ui-worker.md +12 -7
package/agents/validator.md +8 -5
package/agents/worker.md +12 -7
package/commands/execute.md +1 -0
package/commands/init-deep.md +1 -0
package/commands/init.md +1 -0
package/commands/inspect.md +1 -0
package/commands/plan.md +1 -0
package/commands/quality.md +1 -0
package/commands/review.md +1 -0
package/commands/status.md +1 -0
package/defaults/opencode-multiagent.json +223 -0
package/defaults/opencode-multiagent.schema.json +249 -0
package/dist/control-plane.d.ts +4 -0
package/dist/control-plane.d.ts.map +1 -0
package/dist/index.d.ts +5 -0
package/dist/index.d.ts.map +1 -0
package/dist/index.js +1583 -0
package/dist/opencode-multiagent/compiler.d.ts +19 -0
package/dist/opencode-multiagent/compiler.d.ts.map +1 -0
package/dist/opencode-multiagent/constants.d.ts +116 -0
package/dist/opencode-multiagent/constants.d.ts.map +1 -0
package/dist/opencode-multiagent/defaults.d.ts +10 -0
package/dist/opencode-multiagent/defaults.d.ts.map +1 -0
package/dist/opencode-multiagent/file-lock.d.ts +15 -0
package/dist/opencode-multiagent/file-lock.d.ts.map +1 -0
package/dist/opencode-multiagent/hooks.d.ts +62 -0
package/dist/opencode-multiagent/hooks.d.ts.map +1 -0
package/dist/opencode-multiagent/log.d.ts +2 -0
package/dist/opencode-multiagent/log.d.ts.map +1 -0
package/dist/opencode-multiagent/markdown.d.ts +8 -0
package/dist/opencode-multiagent/markdown.d.ts.map +1 -0
package/dist/opencode-multiagent/mcp.d.ts +3 -0
package/dist/opencode-multiagent/mcp.d.ts.map +1 -0
package/dist/opencode-multiagent/policy.d.ts +5 -0
package/dist/opencode-multiagent/policy.d.ts.map +1 -0
package/dist/opencode-multiagent/quality.d.ts +14 -0
package/dist/opencode-multiagent/quality.d.ts.map +1 -0
package/dist/opencode-multiagent/runtime.d.ts +7 -0
package/dist/opencode-multiagent/runtime.d.ts.map +1 -0
package/dist/opencode-multiagent/session-tracker.d.ts +32 -0
package/dist/opencode-multiagent/session-tracker.d.ts.map +1 -0
package/dist/opencode-multiagent/skills.d.ts +17 -0
package/dist/opencode-multiagent/skills.d.ts.map +1 -0
package/dist/opencode-multiagent/supervision.d.ts +12 -0
package/dist/opencode-multiagent/supervision.d.ts.map +1 -0
package/dist/opencode-multiagent/task-manager.d.ts +48 -0
package/dist/opencode-multiagent/task-manager.d.ts.map +1 -0
package/dist/opencode-multiagent/telemetry.d.ts +26 -0
package/dist/opencode-multiagent/telemetry.d.ts.map +1 -0
package/dist/opencode-multiagent/tools.d.ts +56 -0
package/dist/opencode-multiagent/tools.d.ts.map +1 -0
package/dist/opencode-multiagent/types.d.ts +36 -0
package/dist/opencode-multiagent/types.d.ts.map +1 -0
package/dist/opencode-multiagent/utils.d.ts +9 -0
package/dist/opencode-multiagent/utils.d.ts.map +1 -0
package/docs/agents.md +260 -0
package/docs/agents.tr.md +260 -0
package/docs/configuration.md +255 -0
package/docs/configuration.tr.md +255 -0
package/docs/usage-guide.md +226 -0
package/docs/usage-guide.tr.md +227 -0
package/examples/opencode.with-overrides.json +1 -5
package/package.json +23 -13
package/skills/advanced-evaluation/SKILL.md +37 -21
package/skills/advanced-evaluation/manifest.json +2 -13
package/skills/cek-context-engineering/SKILL.md +159 -87
package/skills/cek-context-engineering/manifest.json +1 -3
package/skills/cek-prompt-engineering/SKILL.md +13 -10
package/skills/cek-prompt-engineering/manifest.json +1 -3
package/skills/cek-test-prompt/SKILL.md +38 -28
package/skills/cek-test-prompt/manifest.json +1 -3
package/skills/cek-thought-based-reasoning/SKILL.md +75 -21
package/skills/cek-thought-based-reasoning/manifest.json +1 -3
package/skills/context-degradation/SKILL.md +14 -13
package/skills/context-degradation/manifest.json +1 -3
package/skills/debate/SKILL.md +23 -78
package/skills/debate/manifest.json +2 -12
package/skills/design-first/manifest.json +2 -13
package/skills/dispatching-parallel-agents/SKILL.md +14 -3
package/skills/dispatching-parallel-agents/manifest.json +1 -4
package/skills/drift-analysis/SKILL.md +50 -29
package/skills/drift-analysis/manifest.json +2 -12
package/skills/evaluation/manifest.json +2 -12
package/skills/executing-plans/SKILL.md +15 -8
package/skills/executing-plans/manifest.json +1 -3
package/skills/handoff-protocols/manifest.json +2 -12
package/skills/parallel-investigation/SKILL.md +25 -12
package/skills/parallel-investigation/manifest.json +1 -4
package/skills/reflexion-critique/SKILL.md +21 -10
package/skills/reflexion-critique/manifest.json +1 -3
package/skills/reflexion-reflect/SKILL.md +36 -34
package/skills/reflexion-reflect/manifest.json +2 -10
package/skills/root-cause-analysis/manifest.json +2 -13
package/skills/sadd-judge-with-debate/SKILL.md +50 -26
package/skills/sadd-judge-with-debate/manifest.json +1 -3
package/skills/structured-code-review/manifest.json +2 -11
package/skills/task-decomposition/manifest.json +2 -13
package/skills/verification-before-completion/manifest.json +2 -15
package/skills/verification-gates/SKILL.md +27 -19
package/skills/verification-gates/manifest.json +2 -12
package/defaults/agent-settings.json +0 -102
package/defaults/agent-settings.schema.json +0 -25
package/defaults/flags.json +0 -35
package/defaults/flags.schema.json +0 -119
package/defaults/mcp-defaults.json +0 -47
package/defaults/mcp-defaults.schema.json +0 -38
package/defaults/profiles.json +0 -53
package/defaults/profiles.schema.json +0 -60
package/defaults/team-profiles.json +0 -83
package/src/control-plane.ts +0 -21
package/src/index.ts +0 -8
package/src/opencode-multiagent/compiler.ts +0 -168
package/src/opencode-multiagent/constants.ts +0 -178
package/src/opencode-multiagent/file-lock.ts +0 -90
package/src/opencode-multiagent/hooks.ts +0 -599
package/src/opencode-multiagent/log.ts +0 -12
package/src/opencode-multiagent/mailbox.ts +0 -287
package/src/opencode-multiagent/markdown.ts +0 -99
package/src/opencode-multiagent/mcp.ts +0 -35
package/src/opencode-multiagent/policy.ts +0 -67
package/src/opencode-multiagent/quality.ts +0 -140
package/src/opencode-multiagent/runtime.ts +0 -55
package/src/opencode-multiagent/skills.ts +0 -144
package/src/opencode-multiagent/supervision.ts +0 -156
package/src/opencode-multiagent/task-manager.ts +0 -148
package/src/opencode-multiagent/team-manager.ts +0 -219
package/src/opencode-multiagent/team-tools.ts +0 -359
package/src/opencode-multiagent/telemetry.ts +0 -124
package/src/opencode-multiagent/utils.ts +0 -54

package/skills/advanced-evaluation/SKILL.md CHANGED

@@ -28,11 +28,13 @@ Activate this skill when:
 Evaluation approaches fall into two primary categories with distinct reliability profiles:
 **Direct Scoring**: A single LLM rates one response on a defined scale.
 - Best for: Objective criteria (factual accuracy, instruction following, toxicity)
 - Reliability: Moderate to high for well-defined criteria
 - Failure mode: Score calibration drift, inconsistent scale interpretation
 **Pairwise Comparison**: An LLM compares two responses and selects the better one.
 - Best for: Subjective preferences (tone, style, persuasiveness)
 - Reliability: Higher than direct scoring for preferences
 - Failure mode: Position bias, length bias
@@ -57,12 +59,12 @@ LLM judges exhibit systematic biases that must be actively mitigated:
 Choose metrics based on the evaluation task structure:
-| Task Type | Primary Metrics | Secondary Metrics |
-|-----------|-----------------|-------------------|
-| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ |
-| Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
-| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
-| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
+| Task Type                         | Primary Metrics                      | Secondary Metrics          |
+| --------------------------------- | ------------------------------------ | -------------------------- |
+| Binary classification (pass/fail) | Recall, Precision, F1                | Cohen's κ                  |
+| Ordinal scale (1-5 rating)        | Spearman's ρ, Kendall's τ            | Cohen's κ (weighted)       |
+| Pairwise preference               | Agreement rate, Position consistency | Confidence calibration     |
+| Multi-label                       | Macro-F1, Micro-F1                   | Per-label precision/recall |
 The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
@@ -73,6 +75,7 @@ The critical insight: High absolute agreement matters less than systematic disag
 Direct scoring requires three components: clear criteria, a calibrated scale, and structured output format.
 **Criteria Definition Pattern**:
 ```
 Criterion: [Name]
 Description: [What this criterion measures]
@@ -80,11 +83,13 @@ Weight: [Relative importance, 0-1]
 ```
 **Scale Calibration**:
 - 1-3 scales: Binary with neutral option, lowest cognitive load
 - 1-5 scales: Standard Likert, good balance of granularity and reliability
 - 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics
 **Prompt Structure for Direct Scoring**:
 ```
 You are an expert evaluator assessing response quality.
@@ -118,12 +123,14 @@ Respond with structured JSON containing scores, justifications, and summary.
 Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.
 **Position Bias Mitigation Protocol**:
 1. First pass: Response A in first position, Response B in second
 2. Second pass: Response B in first position, Response A in second
 3. Consistency check: If passes disagree, return TIE with reduced confidence
 4. Final verdict: Consistent winner with averaged confidence
 **Prompt Structure for Pairwise Comparison**:
 ```
 You are an expert evaluator comparing two AI responses.
@@ -155,6 +162,7 @@ JSON with per-criterion comparison, overall winner, confidence (0-1), and reason
 ```
 **Confidence Calibration**: Confidence scores should reflect position consistency:
 - Both passes agree: confidence = average of individual confidences
 - Passes disagree: confidence = 0.5, verdict = TIE
@@ -163,6 +171,7 @@ JSON with per-criterion comparison, overall winner, confidence (0-1), and reason
 Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.
 **Rubric Components**:
 1. **Level descriptions**: Clear boundaries for each score level
 2. **Characteristics**: Observable features that define each level
 3. **Examples**: Representative text for each level (optional but valuable)
@@ -170,6 +179,7 @@ Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended
 5. **Scoring guidelines**: General principles for consistent application
 **Strictness Calibration**:
 - **Lenient**: Lower bar for passing scores, appropriate for encouraging iteration
 - **Balanced**: Fair, typical expectations for production use
 - **Strict**: High standards, appropriate for safety-critical or high-stakes evaluation
@@ -218,22 +228,27 @@ Production evaluation systems require multiple layers:
 ### Common Anti-Patterns
 **Anti-pattern: Scoring without justification**
 - Problem: Scores lack grounding, difficult to debug or improve
 - Solution: Always require evidence-based justification before score
 **Anti-pattern: Single-pass pairwise comparison**
 - Problem: Position bias corrupts results
 - Solution: Always swap positions and check consistency
 **Anti-pattern: Overloaded criteria**
 - Problem: Criteria measuring multiple things are unreliable
 - Solution: One criterion = one measurable aspect
 **Anti-pattern: Missing edge case guidance**
 - Problem: Evaluators handle ambiguous cases inconsistently
 - Solution: Include edge cases in rubrics with explicit guidance
 **Anti-pattern: Ignoring confidence calibration**
 - Problem: High-confidence wrong judgments are worse than low-confidence
 - Solution: Calibrate confidence to position consistency and evidence strength
@@ -275,15 +290,17 @@ For high-volume evaluation:
 ### Example 1: Direct Scoring for Accuracy
 **Input**:
 ```
 Prompt: "What causes seasons on Earth?"
-Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
+Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
 different hemispheres receive more direct sunlight at different times of year."
 Criterion: Factual Accuracy (weight: 1.0)
 Scale: 1-5
 ```
 **Output**:
 ```json
 {
   "criterion": "Factual Accuracy",
@@ -293,8 +310,8 @@ Scale: 1-5
     "Correctly explains differential sunlight by hemisphere",
     "No factual errors present"
   ],
-  "justification": "Response accurately explains the cause of seasons with correct
-scientific reasoning. Both the axial tilt and its effect on sunlight distribution
+  "justification": "Response accurately explains the cause of seasons with correct
+scientific reasoning. Both the axial tilt and its effect on sunlight distribution
 are correctly described.",
   "improvement": "Could add the specific tilt angle (23.5°) for completeness."
 }
@@ -303,6 +320,7 @@ are correctly described.",
 ### Example 2: Pairwise Comparison with Position Swap
 **Input**:
 ```
 Prompt: "Explain machine learning to a beginner"
 Response A: [Technical explanation with jargon]
@@ -311,22 +329,27 @@ Criteria: ["clarity", "accessibility"]
 ```
 **First Pass (A first)**:
 ```json
 { "winner": "B", "confidence": 0.8 }
 ```
 **Second Pass (B first)**:
 ```json
 { "winner": "A", "confidence": 0.6 }
 ```
 (Note: Winner is A because B was in first position)
 **Mapped Second Pass**:
 ```json
 { "winner": "B", "confidence": 0.6 }
 ```
 **Final Result**:
 ```json
 {
   "winner": "B",
@@ -342,6 +365,7 @@ Criteria: ["clarity", "accessibility"]
 ### Example 3: Rubric Generation
 **Input**:
 ```
 criterionName: "Code Readability"
 criterionDescription: "How easy the code is to understand and maintain"
@@ -351,6 +375,7 @@ strictness: "balanced"
 ```
 **Output** (abbreviated):
 ```json
 {
   "levels": [
@@ -420,28 +445,20 @@ strictness: "balanced"
 This skill integrates with:
-- **context-fundamentals** - Evaluation prompts require effective context structure
-- **tool-design** - Evaluation tools need proper schemas and error handling
-- **context-optimization** - Evaluation prompts can be optimized for token efficiency
 - **evaluation** (foundational) - This skill extends the foundational evaluation concepts
 ## References
-Internal reference:
-- [LLM-as-Judge Implementation Patterns](./references/implementation-patterns.md)
-- [Bias Mitigation Techniques](./references/bias-mitigation.md)
-- [Metric Selection Guide](./references/metrics-guide.md)
 External research:
 - [Eugene Yan: Evaluating the Effectiveness of LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/)
 - [Judging LLM-as-a-Judge (Zheng et al., 2023)](https://arxiv.org/abs/2306.05685)
 - [G-Eval: NLG Evaluation using GPT-4 (Liu et al., 2023)](https://arxiv.org/abs/2303.16634)
 - [Large Language Models are not Fair Evaluators (Wang et al., 2023)](https://arxiv.org/abs/2305.17926)
 Related skills in this collection:
-- evaluation - Foundational evaluation concepts
-- context-fundamentals - Context structure for evaluation prompts
-- tool-design - Building evaluation tools
+- **evaluation** - Foundational evaluation concepts
 ---
@@ -451,4 +468,3 @@ Related skills in this collection:
 **Last Updated**: 2024-12-24
 **Author**: Muratcan Koylan
 **Version**: 1.0.0
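
Reviewer note: the position-bias mitigation protocol and the confidence-calibration rule quoted in the hunks above can be sketched in a few lines. This is only an illustration under stated assumptions, not code from the package: the `Judge` signature, `mapSwapped`, `compareWithSwap`, and both stub judges are hypothetical names introduced here. The stubs replay Example 2 (B wins with confidence 0.8, then 0.6 after the swap) and a purely position-biased judge.

```typescript
// Hypothetical types for illustration; none of these names come from the package.
type Label = "A" | "B" | "TIE";

interface PassResult {
  winner: Label;
  confidence: number; // 0-1
}

// Assumed judge contract: "A" means the response shown FIRST won, "B" the second.
type Judge = (first: string, second: string) => PassResult;

// Map a verdict from the swapped pass (B shown first) back to the original labels.
function mapSwapped(result: PassResult): PassResult {
  const winner: Label =
    result.winner === "A" ? "B" : result.winner === "B" ? "A" : "TIE";
  return { winner, confidence: result.confidence };
}

// Run both passes; disagreement yields TIE at 0.5, agreement averages confidence.
function compareWithSwap(judge: Judge, responseA: string, responseB: string): PassResult {
  const firstPass = judge(responseA, responseB);
  const secondPass = mapSwapped(judge(responseB, responseA));
  if (firstPass.winner !== secondPass.winner) {
    return { winner: "TIE", confidence: 0.5 };
  }
  return {
    winner: firstPass.winner,
    confidence: (firstPass.confidence + secondPass.confidence) / 2,
  };
}

// Stub judge that always prefers the text "plain", slightly less sure when it is first.
const contentJudge: Judge = (_first, second) =>
  second === "plain"
    ? { winner: "B", confidence: 0.8 }
    : { winner: "A", confidence: 0.6 };

// Stub judge with pure position bias: whatever is shown first wins.
const firstSlotJudge: Judge = (_first, _second) => ({ winner: "A", confidence: 0.9 });

console.log(compareWithSwap(contentJudge, "jargon", "plain"));
console.log(compareWithSwap(firstSlotJudge, "jargon", "plain"));
```

With the content-sensitive stub, both passes agree on B and the confidences are averaged; with the position-biased stub, the mapped second pass contradicts the first, so the result collapses to TIE at 0.5, which is the behaviour the "Single-pass pairwise comparison" anti-pattern is meant to guard against.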

package/skills/advanced-evaluation/manifest.json CHANGED

@@ -2,19 +2,8 @@
   "name": "advanced-evaluation",
   "version": "1.0.0",
   "description": "Advanced evaluation workflows for comparative and bias-aware judgment tasks",
-  "triggers": [
-    "advanced evaluation",
-    "compare outputs",
-    "pairwise",
-    "position bias",
-    "judge"
-  ],
-  "applicable_agents": [
-    "critic",
-    "strategist",
-    "librarian",
-    "reviewer"
-  ],
+  "triggers": ["advanced evaluation", "compare outputs", "pairwise", "position bias", "judge"],
+  "applicable_agents": ["librarian"],
   "max_context_tokens": 2200,
   "entry_file": "SKILL.md"
 }
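
Reviewer note: apart from collapsing the arrays onto one line, this hunk narrows `applicable_agents` from four entries to one. The sketch below makes that concrete by diffing the two lists. The `SkillManifest` type and both helpers are hypothetical, and how the package actually consumes `applicable_agents` is not visible in this diff, so treat the membership check as an assumption.

```typescript
// Hypothetical shape covering only the fields visible in the hunk above.
interface SkillManifest {
  name: string;
  triggers: string[];
  applicable_agents: string[];
}

const triggers = ["advanced evaluation", "compare outputs", "pairwise", "position bias", "judge"];

// Values copied from the removed (-) and added (+) lines of the hunk.
const manifestBefore: SkillManifest = {
  name: "advanced-evaluation",
  triggers,
  applicable_agents: ["critic", "strategist", "librarian", "reviewer"],
};

const manifestAfter: SkillManifest = {
  name: "advanced-evaluation",
  triggers,
  applicable_agents: ["librarian"],
};

// Assumed semantics: a skill applies to an agent when the agent is listed.
function appliesTo(manifest: SkillManifest, agent: string): boolean {
  return manifest.applicable_agents.indexOf(agent) !== -1;
}

// Agents listed in the old manifest but missing from the new one.
function lostAgents(before: SkillManifest, after: SkillManifest): string[] {
  return before.applicable_agents.filter((agent) => !appliesTo(after, agent));
}

console.log(lostAgents(manifestBefore, manifestAfter));
```

Under that assumption, `critic`, `strategist`, and `reviewer` would no longer match this skill after the upgrade, while the trigger phrases are unchanged.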