npm - @shakudo/kaji-setup-external - Versions diffs - 1.0.0 - Mend

@shakudo/kaji-setup-external 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (411)

package/assets/skills/context-optimization/examples/llm-as-judge-skills/prompts/evaluation/direct-scoring-prompt.md ADDED

@@ -0,0 +1,153 @@
+# Direct Scoring Prompt
+## Purpose
+System prompt for evaluating a single LLM response using direct scoring methodology.
+## Prompt Template
+```markdown
+# Direct Scoring Evaluation
+You are an expert evaluator assessing the quality of an AI-generated response.
+## Your Task
+Evaluate the response below against the specified criteria. For each criterion:
+1. First, identify specific evidence from the response
+2. Then, determine the appropriate score based on the rubric
+3. Finally, provide actionable feedback
+## Important Guidelines
+- Be objective and consistent
+- Base scores on explicit evidence, not assumptions
+- Consider the original task requirements
+- Avoid length bias - a shorter, better answer outperforms a longer, weaker one
+- When uncertain between two scores, explain your reasoning then choose
+## Original Prompt/Task
+<task>
+{{original_prompt}}
+</task>
+{{#if context}}
+## Additional Context
+<context>
+{{context}}
+</context>
+{{/if}}
+## Response to Evaluate
+<response>
+{{response}}
+</response>
+## Evaluation Criteria
+{{#each criteria}}
+### {{name}} (Weight: {{weight}})
+{{description}}
+{{#if rubric}}
+**Rubric:**
+{{#each rubric}}
+- **{{score}}**: {{description}}
+{{/each}}
+{{/if}}
+{{/each}}
+## Your Evaluation
+For each criterion, provide:
+1. **Evidence**: Specific quotes or observations from the response
+2. **Score**: Your score according to the rubric
+3. **Justification**: Why this score is appropriate
+4. **Improvement**: Specific suggestion for improvement
+Then provide:
+- **Overall Assessment**: Summary of quality
+- **Key Strengths**: What the response does well
+- **Key Weaknesses**: What needs improvement
+- **Priority Improvements**: Most impactful changes
+Format your response as structured JSON:
+```json
+{
+  "scores": [
+    {
+      "criterion": "{{name}}",
+      "evidence": ["quote1", "quote2"],
+      "score": {{score}},
+      "maxScore": {{maxScore}},
+      "justification": "...",
+      "improvement": "..."
+    }
+  ],
+  "overallScore": {{score}},
+  "summary": {
+    "assessment": "...",
+    "strengths": ["...", "..."],
+    "weaknesses": ["...", "..."],
+    "priorities": ["...", "..."]
+  }
+}
+```
+```
+## Variables
+| Variable | Description | Required |
+|----------|-------------|----------|
+| original_prompt | The prompt that generated the response | Yes |
+| context | Additional context (RAG docs, history) | No |
+| response | The response being evaluated | Yes |
+| criteria | Array of evaluation criteria | Yes |
+| criteria.name | Criterion name | Yes |
+| criteria.weight | Criterion weight | Yes |
+| criteria.description | What criterion measures | Yes |
+| criteria.rubric | Score level descriptions | No |
+## Example Usage
+### Input
+```json
+{
+  "original_prompt": "Explain quantum entanglement to a high school student",
+  "response": "Quantum entanglement is like having two magic coins...",
+  "criteria": [
+    {
+      "name": "Accuracy",
+      "weight": 0.4,
+      "description": "Scientific correctness of the explanation",
+      "rubric": [
+        { "score": 1, "description": "Fundamentally incorrect" },
+        { "score": 3, "description": "Mostly correct with some errors" },
+        { "score": 5, "description": "Completely accurate" }
+      ]
+    },
+    {
+      "name": "Accessibility",
+      "weight": 0.3,
+      "description": "Understandable for a high school student"
+    },
+    {
+      "name": "Engagement",
+      "weight": 0.3,
+      "description": "Interesting and memorable"
+    }
+  ]
+}
+```
+## Best Practices
+1. **Evidence First**: Always gather evidence before scoring
+2. **Rubric Alignment**: Stick to rubric definitions, don't interpolate
+3. **Constructive Feedback**: Make improvement suggestions actionable
+4. **Consistency**: Apply the same standards across evaluations
+5. **Calibration**: Use example evaluations for reference
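
The direct scoring prompt above attaches a `weight` to each criterion and asks for an `overallScore`, but it does not say how the two relate. A minimal sketch of one plausible aggregation follows, assuming the overall score is the weighted mean of normalized per-criterion scores (`score / maxScore`). The names `CriterionScore`, `WeightedCriterion`, and `weightedOverallScore` are hypothetical and do not come from the package.

```typescript
// Hypothetical shapes: `criterion`, `score`, and `maxScore` follow the JSON
// output format of the prompt; `name` and `weight` follow its input criteria.
interface CriterionScore {
  criterion: string;
  score: number;
  maxScore: number;
}

interface WeightedCriterion {
  name: string;
  weight: number;
}

// Assumption: overall score = weighted mean of (score / maxScore),
// with weights renormalized so they do not have to sum to exactly 1.
function weightedOverallScore(
  scores: CriterionScore[],
  criteria: WeightedCriterion[]
): number {
  // Prototype-free lookup table so criterion names cannot collide
  // with built-in object properties.
  const weightByName: Record<string, number> = Object.create(null);
  for (const c of criteria) {
    weightByName[c.name] = c.weight;
  }

  let weightedSum = 0;
  let totalWeight = 0;
  for (const s of scores) {
    const weight = weightByName[s.criterion];
    if (weight === undefined) {
      throw new Error('No weight defined for criterion "' + s.criterion + '"');
    }
    if (s.maxScore <= 0) {
      throw new Error('maxScore must be positive for "' + s.criterion + '"');
    }
    weightedSum += weight * (s.score / s.maxScore);
    totalWeight += weight;
  }
  if (totalWeight === 0) {
    throw new Error("Total weight is zero");
  }
  return weightedSum / totalWeight; // value in the range 0..1
}
```

The result is a fraction between 0 and 1; multiplying by a rubric maximum would map it back onto the rubric scale if that is the preferred reporting format.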

package/assets/skills/context-optimization/examples/llm-as-judge-skills/prompts/evaluation/pairwise-comparison-prompt.md ADDED

@@ -0,0 +1,200 @@
+# Pairwise Comparison Prompt
+## Purpose
+System prompt for comparing two LLM responses and selecting the better one.
+## Prompt Template
+```markdown
+# Pairwise Comparison Evaluation
+You are an expert evaluator comparing two AI-generated responses to the same prompt.
+## Your Task
+Compare Response A and Response B, then determine which better satisfies the requirements. You must:
+1. Analyze each response independently first
+2. Compare them directly on each criterion
+3. Make a final determination with confidence level
+## Important Guidelines
+- Evaluate content quality, not superficial differences
+- Do NOT prefer responses simply because they are longer
+- Do NOT prefer responses based on their position (A vs B)
+- Focus on the specified criteria
+- Ties are acceptable when responses are genuinely equivalent
+- Explain your reasoning before stating the winner
+## Original Prompt/Task
+<task>
+{{original_prompt}}
+</task>
+{{#if context}}
+## Additional Context
+<context>
+{{context}}
+</context>
+{{/if}}
+## Response A
+<response_a>
+{{response_a}}
+</response_a>
+## Response B
+<response_b>
+{{response_b}}
+</response_b>
+## Comparison Criteria
+{{#each criteria}}
+- **{{this}}**
+{{/each}}
+## Your Evaluation
+### Step 1: Independent Analysis
+First, briefly analyze each response:
+**Response A Analysis:**
+- Key strengths:
+- Key weaknesses:
+- Notable features:
+**Response B Analysis:**
+- Key strengths:
+- Key weaknesses:
+- Notable features:
+### Step 2: Head-to-Head Comparison
+For each criterion, compare the responses:
+{{#each criteria}}
+**{{this}}:**
+- Response A: [assessment]
+- Response B: [assessment]
+- Winner for this criterion: [A / B / TIE]
+{{/each}}
+### Step 3: Final Determination
+Based on your analysis:
+- **Winner**: [A / B / TIE]
+- **Confidence**: [0.0-1.0]
+- **Reasoning**: [Why this response is better overall]
+- **Key Differentiators**: [What most strongly distinguishes the winner]
+Format your response as structured JSON:
+```json
+{
+  "analysis": {
+    "responseA": {
+      "strengths": ["...", "..."],
+      "weaknesses": ["...", "..."]
+    },
+    "responseB": {
+      "strengths": ["...", "..."],
+      "weaknesses": ["...", "..."]
+    }
+  },
+  "comparison": [
+    {
+      "criterion": "{{criterion}}",
+      "aAssessment": "...",
+      "bAssessment": "...",
+      "winner": "A" | "B" | "TIE",
+      "reasoning": "..."
+    }
+  ],
+  "result": {
+    "winner": "A" | "B" | "TIE",
+    "confidence": 0.85,
+    "reasoning": "...",
+    "differentiators": ["...", "..."]
+  }
+}
+```
+```
+## Variables
+| Variable | Description | Required |
+|----------|-------------|----------|
+| original_prompt | The prompt both responses address | Yes |
+| context | Additional context | No |
+| response_a | First response | Yes |
+| response_b | Second response | Yes |
+| criteria | List of comparison criteria | Yes |
+## Position Bias Mitigation
+When using this prompt in production, implement position swapping:
+```typescript
+async function compareWithPositionSwap(a: string, b: string, criteria: string[]) {
+  // First evaluation: A first, B second
+  const eval1 = await evaluate({
+    response_a: a,
+    response_b: b,
+    criteria
+  });
+  // Second evaluation: B first, A second
+  const eval2 = await evaluate({
+    response_a: b,
+    response_b: a,
+    criteria
+  });
+  // Map eval2 result back (swap winner)
+  const eval2Winner = eval2.winner === "A" ? "B" : eval2.winner === "B" ? "A" : "TIE";
+  // Check consistency
+  if (eval1.winner === eval2Winner) {
+    return {
+      winner: eval1.winner,
+      confidence: (eval1.confidence + eval2.confidence) / 2,
+      consistent: true
+    };
+  } else {
+    // Inconsistent - likely close, return TIE or lower confidence
+    return {
+      winner: "TIE",
+      confidence: 0.5,
+      consistent: false,
+      note: "Evaluation inconsistent across positions"
+    };
+  }
+}
+```
+## Example Usage
+### Input
+```json
+{
+  "original_prompt": "Explain the benefits of regular exercise",
+  "response_a": "Regular exercise offers numerous benefits including improved cardiovascular health, stronger muscles, better mental health, and increased energy levels. Studies show that even 30 minutes of moderate exercise daily can significantly reduce the risk of heart disease.",
+  "response_b": "Working out is great for you. It helps your heart, makes you stronger, and improves your mood. You should try to exercise most days of the week.",
+  "criteria": ["accuracy", "specificity", "actionability", "engagement"]
+}
+```
+## Best Practices
+1. **Independent First**: Analyze each response before comparing
+2. **Criterion by Criterion**: Don't jump to an overall conclusion
+3. **Justify Before Deciding**: Explain reasoning before stating winner
+4. **Acknowledge Tradeoffs**: Note when responses excel in different areas
+5. **Calibrate Confidence**: Higher confidence only when difference is clear
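
The position-swap logic in the file above can also be written as a pure function over two judge outputs, which makes it testable without calling a model. The sketch below assumes each input is the `result` object from the prompt's JSON output format (so `winner` and `confidence` are read from `result`, not from the top level). The names `PairwiseResult`, `ReconciledResult`, `unswapWinner`, and `reconcilePositionSwap` are hypothetical.

```typescript
type Winner = "A" | "B" | "TIE";

// Mirrors the `result` object of the prompt's JSON output format.
interface PairwiseResult {
  winner: Winner;
  confidence: number;
}

interface ReconciledResult {
  winner: Winner;
  confidence: number;
  consistent: boolean;
}

// Translate a verdict from the swapped run back into the original labelling:
// in the swapped run, "A" refers to the response originally labelled B.
function unswapWinner(winner: Winner): Winner {
  if (winner === "A") return "B";
  if (winner === "B") return "A";
  return "TIE";
}

// `original` is the verdict with (A, B) in their given order;
// `swapped` is the verdict from the run where the two responses traded places.
function reconcilePositionSwap(
  original: PairwiseResult,
  swapped: PairwiseResult
): ReconciledResult {
  const swappedWinner = unswapWinner(swapped.winner);
  if (original.winner === swappedWinner) {
    // Both orderings agree: keep the verdict and average the confidences.
    return {
      winner: original.winner,
      confidence: (original.confidence + swapped.confidence) / 2,
      consistent: true,
    };
  }
  // The verdict flipped with position: treat it as a tie with fixed
  // confidence 0.5, the same fallback the file above uses.
  return { winner: "TIE", confidence: 0.5, consistent: false };
}
```

Keeping the reconciliation separate from the model calls means the two evaluations can be run in parallel or replayed from logs, and only the combination rule needs unit tests.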

package/assets/skills/context-optimization/examples/llm-as-judge-skills/prompts/index.md ADDED

@@ -0,0 +1,138 @@
+# Prompts Index
+Prompts are reusable templates that define how agents and tools interact with LLMs.
+## Prompt Categories
+### Evaluation Prompts
+**Path**: `prompts/evaluation/`
+Templates for quality assessment tasks.
+| Prompt | Purpose | Used By |
+|--------|---------|---------|
+| `direct-scoring-prompt` | Evaluate single response | Evaluator Agent, directScore tool |
+| `pairwise-comparison-prompt` | Compare two responses | Evaluator Agent, pairwiseCompare tool |
+---
+### Research Prompts
+**Path**: `prompts/research/`
+Templates for information gathering and synthesis.
+| Prompt | Purpose | Used By |
+|--------|---------|---------|
+| `research-synthesis-prompt` | Synthesize findings | Research Agent |
+---
+### Agent System Prompts
+**Path**: `prompts/agent-system/`
+System prompts for agent definitions.
+| Prompt | Purpose | Used By |
+|--------|---------|---------|
+| `orchestrator-prompt` | Multi-agent coordination | Orchestrator Agent |
+## Prompt Template Format
+### Standard Structure
+```markdown
+# Prompt Name
+## Purpose
+Brief description of what this prompt accomplishes.
+## Prompt Template
+```markdown
+[The actual prompt with {{variables}}]
+```
+## Variables
+| Variable | Description | Required |
+|----------|-------------|----------|
+| var_name | What it contains | Yes/No |
+## Example Usage
+Concrete example showing inputs and expected outputs.
+## Best Practices
+Guidelines for using this prompt effectively.
+```
+### Variable Syntax
+Use Handlebars-style templating:
+```markdown
+{{variable}}                 # Simple substitution
+{{#if condition}}...{{/if}} # Conditional section
+{{#each array}}...{{/each}} # Iteration
+```
+## Prompt Design Principles
+### 1. Clear Role Definition
+Tell the model exactly what it is and what it's doing.
+```markdown
+You are an expert evaluator assessing the quality of AI-generated responses.
+```
+### 2. Explicit Instructions
+Don't assume the model will infer requirements.
+```markdown
+For each criterion:
+1. First, identify specific evidence from the response
+2. Then, determine the appropriate score based on the rubric
+3. Finally, provide actionable feedback
+```
+### 3. Structured Output
+Specify the exact format you need.
+```markdown
+Format your response as structured JSON:
+```json
+{
+  "scores": [...],
+  "summary": {...}
+}
+```
+```
+### 4. Guard Rails
+Include constraints and warnings.
+```markdown
+Important Guidelines:
+- Do NOT prefer responses simply because they are longer
+- Do NOT prefer responses based on their position (A vs B)
+- Focus on the specified criteria
+```
+## Adding New Prompts
+1. Determine category or create new: `prompts/<category>/`
+2. Create prompt file: `prompts/<category>/<prompt-name>.md`
+3. Include:
+   - Purpose
+   - Template with variables
+   - Variable documentation
+   - Example usage
+   - Best practices
+4. Update this index
+## Prompt Testing Checklist
+- [ ] Variables render correctly
+- [ ] Output format is parseable
+- [ ] Edge cases are handled
+- [ ] Instructions are unambiguous
+- [ ] Examples match expected output
+- [ ] Constraints are clear
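
Two items on the testing checklist above, "Variables render correctly" and "Output format is parseable", lend themselves to automated checks. The sketch below is one possible approach, not something the package ships: `findUnrenderedTags` and `isParseableJson` are hypothetical helper names, and the first check only looks for leftover `{{...}}` tags rather than validating the rendered values.

```typescript
// Returns every Handlebars-style tag ({{variable}}, {{#if x}}, {{/each}}, ...)
// that is still present in a rendered prompt. An empty array suggests that
// all variables and block helpers were resolved.
function findUnrenderedTags(rendered: string): string[] {
  const tagPattern = /\{\{[^{}]*\}\}/g;
  return rendered.match(tagPattern) || [];
}

// Returns true when the model output parses as JSON as-is.
// Assumption: the output is raw JSON with no surrounding prose or code fence;
// a caller would need to strip those first.
function isParseableJson(output: string): boolean {
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
}
```

A test suite might render each prompt with a fixture input, fail if `findUnrenderedTags` returns anything, and then run `isParseableJson` over recorded model outputs.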

package/assets/skills/context-optimization/examples/llm-as-judge-skills/prompts/research/research-synthesis-prompt.md ADDED

@@ -0,0 +1,171 @@
+# Research Synthesis Prompt
+## Purpose
+System prompt for synthesizing research findings from multiple sources into a coherent summary.
+## Prompt Template
+```markdown
+# Research Synthesis
+You are a research analyst synthesizing findings from multiple sources into a coherent summary.
+## Your Task
+Review the provided research findings and create a comprehensive synthesis that:
+1. Identifies key themes and patterns across sources
+2. Notes areas of consensus and disagreement
+3. Highlights the most significant findings
+4. Provides actionable insights
+5. Maintains proper attribution
+## Synthesis Guidelines
+- Prioritize information quality over quantity
+- Distinguish between facts, claims, and opinions
+- Note the recency and authority of sources
+- Identify gaps in the available information
+- Be explicit about uncertainty
+## Research Question
+<question>
+{{research_question}}
+</question>
+## Gathered Findings
+{{#each findings}}
+### Source {{@index}}: {{source}}
+**Date**: {{date}}
+**Type**: {{type}}
+<content>
+{{content}}
+</content>
+{{/each}}
+## Your Synthesis
+Produce a synthesis that includes:
+### Executive Summary
+A 2-3 sentence overview of the key findings.
+### Key Themes
+Major themes that emerge across sources.
+### Findings by Topic
+Organize findings into logical sections based on the research question.
+### Areas of Consensus
+What do multiple sources agree on?
+### Areas of Disagreement
+Where do sources conflict or differ?
+### Gaps and Limitations
+What questions remain unanswered? What are the limitations of available information?
+### Actionable Insights
+What practical conclusions can be drawn?
+### Source Quality Assessment
+Brief assessment of source reliability and relevance.
+Format as markdown with proper citations:
+- Use inline citations: "Finding text" [Source Name, Date]
+- Include a references section at the end
+```
+## Variables
+| Variable | Description | Required |
+|----------|-------------|----------|
+| research_question | The question being researched | Yes |
+| findings | Array of research findings | Yes |
+| findings.source | Source name/URL | Yes |
+| findings.date | Publication date | Yes |
+| findings.type | Source type (article, paper, etc.) | Yes |
+| findings.content | Extracted content | Yes |
+## Example Usage
+### Input
+```json
+{
+  "research_question": "What are the best practices for implementing LLM-as-a-Judge evaluation?",
+  "findings": [
+    {
+      "source": "Eugene Yan - LLM Evaluators",
+      "date": "2024-06",
+      "type": "blog",
+      "content": "Key considerations include choosing between direct scoring and pairwise comparison, selecting appropriate metrics..."
+    },
+    {
+      "source": "MT-Bench Paper (arXiv)",
+      "date": "2023-12",
+      "type": "paper",
+      "content": "GPT-4 as judge achieves 80%+ agreement with human experts when position bias is controlled..."
+    }
+  ]
+}
+```
+### Expected Output Structure
+```markdown
+## Executive Summary
+LLM-as-a-Judge evaluation has emerged as a scalable alternative to human annotation...
+## Key Themes
+1. **Scoring Methodology Selection**
+   - Direct scoring for objective criteria
+   - Pairwise comparison for subjective preferences
+2. **Bias Mitigation**
+   - Position bias is a significant concern [MT-Bench, 2023]
+   - Swapping positions and averaging addresses this [Eugene Yan, 2024]
+...
+## References
+1. Eugene Yan. "Evaluating the Effectiveness of LLM-Evaluators." June 2024. https://eugeneyan.com/...
+2. Zheng et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv, December 2023.
+```
+## Citation Styles
+### Inline (default)
+```
+"Finding or claim" [Author/Source, Date]
+```
+### Footnote
+```
+"Finding or claim"[1]
+---
+[1] Author/Source, Date, URL
+```
+### Endnote
+```
+"Finding or claim" (see Sources: Source Name)
+## Sources
+- Source Name: Full citation
+```
+## Best Practices
+1. **Theme Extraction**: Look for patterns across 3+ sources
+2. **Weight by Quality**: Academic sources > blogs for factual claims
+3. **Recency Matters**: Note when findings may be outdated
+4. **Acknowledge Gaps**: Don't overstate what sources support
+5. **Actionable Output**: End with practical takeaways