@mastra/mcp-docs-server 0.13.39 → 1.0.0-beta.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (494)
  1. package/.docs/organized/changelogs/%40internal%2Fai-sdk-v4.md +1 -0
  2. package/.docs/organized/changelogs/%40internal%2Fchangeset-cli.md +0 -10
  3. package/.docs/organized/changelogs/%40internal%2Fexternal-types.md +0 -10
  4. package/.docs/organized/changelogs/%40internal%2Fstorage-test-utils.md +36 -36
  5. package/.docs/organized/changelogs/%40internal%2Ftypes-builder.md +0 -10
  6. package/.docs/organized/changelogs/%40mastra%2Fagent-builder.md +70 -70
  7. package/.docs/organized/changelogs/%40mastra%2Fai-sdk.md +40 -40
  8. package/.docs/organized/changelogs/%40mastra%2Fastra.md +19 -19
  9. package/.docs/organized/changelogs/%40mastra%2Fauth.md +4 -14
  10. package/.docs/organized/changelogs/%40mastra%2Fchroma.md +18 -18
  11. package/.docs/organized/changelogs/%40mastra%2Fclickhouse.md +199 -199
  12. package/.docs/organized/changelogs/%40mastra%2Fclient-js.md +220 -220
  13. package/.docs/organized/changelogs/%40mastra%2Fcloudflare-d1.md +190 -190
  14. package/.docs/organized/changelogs/%40mastra%2Fcloudflare.md +199 -199
  15. package/.docs/organized/changelogs/%40mastra%2Fcodemod.md +7 -0
  16. package/.docs/organized/changelogs/%40mastra%2Fcore.md +210 -210
  17. package/.docs/organized/changelogs/%40mastra%2Fcouchbase.md +16 -16
  18. package/.docs/organized/changelogs/%40mastra%2Fdeployer-cloud.md +69 -69
  19. package/.docs/organized/changelogs/%40mastra%2Fdeployer-cloudflare.md +67 -67
  20. package/.docs/organized/changelogs/%40mastra%2Fdeployer-netlify.md +70 -70
  21. package/.docs/organized/changelogs/%40mastra%2Fdeployer-vercel.md +67 -67
  22. package/.docs/organized/changelogs/%40mastra%2Fdeployer.md +209 -209
  23. package/.docs/organized/changelogs/%40mastra%2Fdynamodb.md +191 -191
  24. package/.docs/organized/changelogs/%40mastra%2Fevals.md +34 -34
  25. package/.docs/organized/changelogs/%40mastra%2Ffastembed.md +5 -13
  26. package/.docs/organized/changelogs/%40mastra%2Flance.md +182 -182
  27. package/.docs/organized/changelogs/%40mastra%2Flibsql.md +199 -199
  28. package/.docs/organized/changelogs/%40mastra%2Floggers.md +20 -20
  29. package/.docs/organized/changelogs/%40mastra%2Fmcp-docs-server.md +56 -56
  30. package/.docs/organized/changelogs/%40mastra%2Fmcp-registry-registry.md +20 -20
  31. package/.docs/organized/changelogs/%40mastra%2Fmcp.md +65 -65
  32. package/.docs/organized/changelogs/%40mastra%2Fmemory.md +228 -228
  33. package/.docs/organized/changelogs/%40mastra%2Fmongodb.md +199 -199
  34. package/.docs/organized/changelogs/%40mastra%2Fmssql.md +206 -206
  35. package/.docs/organized/changelogs/%40mastra%2Fopensearch.md +19 -19
  36. package/.docs/organized/changelogs/%40mastra%2Fpg.md +197 -197
  37. package/.docs/organized/changelogs/%40mastra%2Fpinecone.md +16 -16
  38. package/.docs/organized/changelogs/%40mastra%2Fplayground-ui.md +216 -216
  39. package/.docs/organized/changelogs/%40mastra%2Fqdrant.md +16 -16
  40. package/.docs/organized/changelogs/%40mastra%2Frag.md +61 -61
  41. package/.docs/organized/changelogs/%40mastra%2Freact.md +66 -66
  42. package/.docs/organized/changelogs/%40mastra%2Fs3vectors.md +9 -17
  43. package/.docs/organized/changelogs/%40mastra%2Fschema-compat.md +6 -30
  44. package/.docs/organized/changelogs/%40mastra%2Fserver.md +203 -203
  45. package/.docs/organized/changelogs/%40mastra%2Fturbopuffer.md +16 -16
  46. package/.docs/organized/changelogs/%40mastra%2Fupstash.md +190 -190
  47. package/.docs/organized/changelogs/%40mastra%2Fvectorize.md +18 -18
  48. package/.docs/organized/changelogs/%40mastra%2Fvoice-azure.md +21 -21
  49. package/.docs/organized/changelogs/%40mastra%2Fvoice-cloudflare.md +20 -20
  50. package/.docs/organized/changelogs/%40mastra%2Fvoice-deepgram.md +20 -20
  51. package/.docs/organized/changelogs/%40mastra%2Fvoice-elevenlabs.md +20 -20
  52. package/.docs/organized/changelogs/%40mastra%2Fvoice-gladia.md +20 -20
  53. package/.docs/organized/changelogs/%40mastra%2Fvoice-google-gemini-live.md +56 -56
  54. package/.docs/organized/changelogs/%40mastra%2Fvoice-google.md +20 -20
  55. package/.docs/organized/changelogs/%40mastra%2Fvoice-murf.md +20 -20
  56. package/.docs/organized/changelogs/%40mastra%2Fvoice-openai-realtime.md +56 -56
  57. package/.docs/organized/changelogs/%40mastra%2Fvoice-openai.md +20 -20
  58. package/.docs/organized/changelogs/%40mastra%2Fvoice-playai.md +20 -20
  59. package/.docs/organized/changelogs/%40mastra%2Fvoice-sarvam.md +20 -20
  60. package/.docs/organized/changelogs/%40mastra%2Fvoice-speechify.md +20 -20
  61. package/.docs/organized/changelogs/create-mastra.md +29 -29
  62. package/.docs/organized/changelogs/mastra.md +93 -93
  63. package/.docs/organized/code-examples/a2a.md +4 -2
  64. package/.docs/organized/code-examples/agui.md +12 -9
  65. package/.docs/organized/code-examples/ai-sdk-useChat.md +12 -18
  66. package/.docs/organized/code-examples/ai-sdk-v5.md +4 -2
  67. package/.docs/organized/code-examples/bird-checker-with-express.md +5 -4
  68. package/.docs/organized/code-examples/bird-checker-with-nextjs-and-eval.md +4 -3
  69. package/.docs/organized/code-examples/bird-checker-with-nextjs.md +4 -3
  70. package/.docs/organized/code-examples/client-side-tools.md +1 -0
  71. package/.docs/organized/code-examples/crypto-chatbot.md +1 -1
  72. package/.docs/organized/code-examples/experimental-auth-weather-agent.md +8 -177
  73. package/.docs/organized/code-examples/fireworks-r1.md +2 -2
  74. package/.docs/organized/code-examples/heads-up-game.md +10 -7
  75. package/.docs/organized/code-examples/mcp-configuration.md +5 -3
  76. package/.docs/organized/code-examples/mcp-registry-registry.md +3 -2
  77. package/.docs/organized/code-examples/memory-per-resource-example.md +4 -2
  78. package/.docs/organized/code-examples/memory-todo-agent.md +1 -0
  79. package/.docs/organized/code-examples/memory-with-context.md +2 -1
  80. package/.docs/organized/code-examples/memory-with-libsql.md +4 -2
  81. package/.docs/organized/code-examples/memory-with-mongodb.md +4 -2
  82. package/.docs/organized/code-examples/memory-with-pg.md +4 -2
  83. package/.docs/organized/code-examples/memory-with-processors.md +13 -8
  84. package/.docs/organized/code-examples/memory-with-upstash.md +5 -3
  85. package/.docs/organized/code-examples/openapi-spec-writer.md +32 -41
  86. package/.docs/organized/code-examples/quick-start.md +5 -32
  87. package/.docs/organized/code-examples/stock-price-tool.md +6 -5
  88. package/.docs/organized/code-examples/weather-agent.md +21 -16
  89. package/.docs/organized/code-examples/workflow-ai-recruiter.md +3 -2
  90. package/.docs/organized/code-examples/workflow-with-inline-steps.md +9 -12
  91. package/.docs/organized/code-examples/workflow-with-memory.md +16 -15
  92. package/.docs/organized/code-examples/workflow-with-separate-steps.md +2 -2
  93. package/.docs/organized/code-examples/workflow-with-suspend-resume.md +3 -2
  94. package/.docs/raw/agents/adding-voice.mdx +27 -22
  95. package/.docs/raw/agents/agent-memory.mdx +24 -16
  96. package/.docs/raw/agents/guardrails.mdx +33 -12
  97. package/.docs/raw/agents/networks.mdx +8 -4
  98. package/.docs/raw/agents/overview.mdx +23 -17
  99. package/.docs/raw/agents/using-tools.mdx +11 -8
  100. package/.docs/raw/auth/auth0.mdx +9 -9
  101. package/.docs/raw/auth/clerk.mdx +7 -7
  102. package/.docs/raw/auth/firebase.mdx +9 -9
  103. package/.docs/raw/auth/index.mdx +6 -6
  104. package/.docs/raw/auth/jwt.mdx +7 -7
  105. package/.docs/raw/auth/supabase.mdx +8 -8
  106. package/.docs/raw/auth/workos.mdx +9 -9
  107. package/.docs/raw/community/contributing-templates.mdx +3 -3
  108. package/.docs/raw/community/discord.mdx +1 -1
  109. package/.docs/raw/course/01-first-agent/03-verifying-installation.md +1 -1
  110. package/.docs/raw/course/01-first-agent/08-exporting-your-agent.md +2 -1
  111. package/.docs/raw/course/01-first-agent/16-adding-memory-to-agent.md +2 -1
  112. package/.docs/raw/course/02-agent-tools-mcp/02-installing-mcp.md +1 -1
  113. package/.docs/raw/course/02-agent-tools-mcp/31-enhancing-memory-configuration.md +2 -0
  114. package/.docs/raw/course/03-agent-memory/03-installing-memory.md +1 -1
  115. package/.docs/raw/course/03-agent-memory/04-creating-basic-memory-agent.md +1 -0
  116. package/.docs/raw/course/03-agent-memory/10-storage-configuration.md +2 -3
  117. package/.docs/raw/course/03-agent-memory/13-vector-store-configuration.md +2 -0
  118. package/.docs/raw/course/03-agent-memory/16-configuring-semantic-recall.md +2 -0
  119. package/.docs/raw/course/03-agent-memory/18-advanced-configuration-semantic-recall.md +1 -0
  120. package/.docs/raw/course/03-agent-memory/21-configuring-working-memory.md +2 -0
  121. package/.docs/raw/course/03-agent-memory/22-custom-working-memory-templates.md +1 -0
  122. package/.docs/raw/course/03-agent-memory/25-combining-memory-features.md +1 -0
  123. package/.docs/raw/course/03-agent-memory/27-creating-learning-assistant.md +1 -0
  124. package/.docs/raw/course/04-workflows/08-running-workflows-programmatically.md +2 -2
  125. package/.docs/raw/deployment/cloud-providers/amazon-ec2.mdx +6 -6
  126. package/.docs/raw/deployment/cloud-providers/aws-lambda.mdx +8 -6
  127. package/.docs/raw/deployment/cloud-providers/azure-app-services.mdx +5 -5
  128. package/.docs/raw/deployment/cloud-providers/digital-ocean.mdx +5 -5
  129. package/.docs/raw/deployment/cloud-providers/index.mdx +11 -8
  130. package/.docs/raw/deployment/monorepo.mdx +2 -2
  131. package/.docs/raw/deployment/overview.mdx +2 -2
  132. package/.docs/raw/deployment/server-deployment.mdx +2 -10
  133. package/.docs/raw/deployment/serverless-platforms/cloudflare-deployer.mdx +5 -5
  134. package/.docs/raw/deployment/serverless-platforms/index.mdx +10 -7
  135. package/.docs/raw/deployment/serverless-platforms/netlify-deployer.mdx +5 -5
  136. package/.docs/raw/deployment/serverless-platforms/vercel-deployer.mdx +5 -5
  137. package/.docs/raw/deployment/web-framework.mdx +8 -8
  138. package/.docs/raw/{scorers → evals}/custom-scorers.mdx +6 -6
  139. package/.docs/raw/evals/off-the-shelf-scorers.mdx +50 -0
  140. package/.docs/raw/{scorers → evals}/overview.mdx +9 -9
  141. package/.docs/raw/evals/running-in-ci.mdx +113 -0
  142. package/.docs/raw/frameworks/agentic-uis/ai-sdk.mdx +26 -25
  143. package/.docs/raw/frameworks/agentic-uis/assistant-ui.mdx +1 -1
  144. package/.docs/raw/frameworks/agentic-uis/copilotkit.mdx +17 -17
  145. package/.docs/raw/frameworks/agentic-uis/openrouter.mdx +4 -1
  146. package/.docs/raw/frameworks/servers/express.mdx +11 -10
  147. package/.docs/raw/frameworks/web-frameworks/astro.mdx +18 -18
  148. package/.docs/raw/frameworks/web-frameworks/next-js.mdx +7 -7
  149. package/.docs/raw/frameworks/web-frameworks/sveltekit.mdx +16 -16
  150. package/.docs/raw/frameworks/web-frameworks/vite-react.mdx +7 -7
  151. package/.docs/raw/getting-started/installation.mdx +26 -25
  152. package/.docs/raw/getting-started/mcp-docs-server.mdx +1 -1
  153. package/.docs/raw/getting-started/project-structure.mdx +4 -4
  154. package/.docs/raw/getting-started/studio.mdx +8 -8
  155. package/.docs/raw/getting-started/templates.mdx +6 -6
  156. package/.docs/raw/guides/guide/ai-recruiter.mdx +264 -0
  157. package/.docs/raw/guides/guide/chef-michel.mdx +271 -0
  158. package/.docs/raw/guides/guide/notes-mcp-server.mdx +450 -0
  159. package/.docs/raw/guides/guide/research-assistant.mdx +380 -0
  160. package/.docs/raw/guides/guide/stock-agent.mdx +185 -0
  161. package/.docs/raw/guides/guide/web-search.mdx +291 -0
  162. package/.docs/raw/guides/index.mdx +43 -0
  163. package/.docs/raw/guides/migrations/agentnetwork.mdx +114 -0
  164. package/.docs/raw/guides/migrations/upgrade-to-v1/_template.mdx +50 -0
  165. package/.docs/raw/guides/migrations/upgrade-to-v1/agent.mdx +265 -0
  166. package/.docs/raw/guides/migrations/upgrade-to-v1/cli.mdx +48 -0
  167. package/.docs/raw/guides/migrations/upgrade-to-v1/client.mdx +153 -0
  168. package/.docs/raw/guides/migrations/upgrade-to-v1/evals.mdx +230 -0
  169. package/.docs/raw/guides/migrations/upgrade-to-v1/mastra.mdx +171 -0
  170. package/.docs/raw/guides/migrations/upgrade-to-v1/mcp.mdx +114 -0
  171. package/.docs/raw/guides/migrations/upgrade-to-v1/memory.mdx +241 -0
  172. package/.docs/raw/guides/migrations/upgrade-to-v1/overview.mdx +83 -0
  173. package/.docs/raw/guides/migrations/upgrade-to-v1/processors.mdx +62 -0
  174. package/.docs/raw/guides/migrations/upgrade-to-v1/storage.mdx +270 -0
  175. package/.docs/raw/guides/migrations/upgrade-to-v1/tools.mdx +115 -0
  176. package/.docs/raw/guides/migrations/upgrade-to-v1/tracing.mdx +280 -0
  177. package/.docs/raw/guides/migrations/upgrade-to-v1/vectors.mdx +23 -0
  178. package/.docs/raw/guides/migrations/upgrade-to-v1/voice.mdx +39 -0
  179. package/.docs/raw/guides/migrations/upgrade-to-v1/workflows.mdx +178 -0
  180. package/.docs/raw/guides/migrations/vnext-to-standard-apis.mdx +367 -0
  181. package/.docs/raw/guides/quickstarts/nextjs.mdx +275 -0
  182. package/.docs/raw/index.mdx +9 -9
  183. package/.docs/raw/{observability/logging.mdx → logging.mdx} +4 -4
  184. package/.docs/raw/mastra-cloud/dashboard.mdx +2 -2
  185. package/.docs/raw/mastra-cloud/observability.mdx +6 -6
  186. package/.docs/raw/mastra-cloud/overview.mdx +2 -2
  187. package/.docs/raw/mastra-cloud/setting-up.mdx +4 -4
  188. package/.docs/raw/memory/conversation-history.mdx +1 -0
  189. package/.docs/raw/memory/memory-processors.mdx +4 -3
  190. package/.docs/raw/memory/overview.mdx +10 -6
  191. package/.docs/raw/memory/semantic-recall.mdx +13 -8
  192. package/.docs/raw/memory/storage/memory-with-libsql.mdx +12 -7
  193. package/.docs/raw/memory/storage/memory-with-pg.mdx +11 -6
  194. package/.docs/raw/memory/storage/memory-with-upstash.mdx +11 -6
  195. package/.docs/raw/memory/threads-and-resources.mdx +11 -13
  196. package/.docs/raw/memory/working-memory.mdx +30 -14
  197. package/.docs/raw/observability/overview.mdx +13 -30
  198. package/.docs/raw/observability/{ai-tracing → tracing}/exporters/arize.mdx +11 -19
  199. package/.docs/raw/observability/{ai-tracing → tracing}/exporters/braintrust.mdx +8 -17
  200. package/.docs/raw/observability/{ai-tracing → tracing}/exporters/cloud.mdx +11 -17
  201. package/.docs/raw/observability/{ai-tracing → tracing}/exporters/default.mdx +16 -20
  202. package/.docs/raw/observability/{ai-tracing → tracing}/exporters/langfuse.mdx +8 -17
  203. package/.docs/raw/observability/{ai-tracing → tracing}/exporters/langsmith.mdx +8 -17
  204. package/.docs/raw/observability/{ai-tracing → tracing}/exporters/otel.mdx +12 -21
  205. package/.docs/raw/observability/{ai-tracing → tracing}/overview.mdx +107 -142
  206. package/.docs/raw/observability/{ai-tracing → tracing}/processors/sensitive-data-filter.mdx +14 -13
  207. package/.docs/raw/rag/chunking-and-embedding.mdx +5 -5
  208. package/.docs/raw/rag/overview.mdx +3 -13
  209. package/.docs/raw/rag/retrieval.mdx +24 -12
  210. package/.docs/raw/rag/vector-databases.mdx +7 -1
  211. package/.docs/raw/reference/agents/agent.mdx +35 -30
  212. package/.docs/raw/reference/agents/generate.mdx +10 -10
  213. package/.docs/raw/reference/agents/generateLegacy.mdx +8 -8
  214. package/.docs/raw/reference/agents/getDefaultGenerateOptions.mdx +21 -15
  215. package/.docs/raw/reference/agents/getDefaultOptions.mdx +69 -0
  216. package/.docs/raw/reference/agents/getDefaultStreamOptions.mdx +22 -16
  217. package/.docs/raw/reference/agents/getDescription.mdx +1 -1
  218. package/.docs/raw/reference/agents/getInstructions.mdx +8 -8
  219. package/.docs/raw/reference/agents/getLLM.mdx +9 -9
  220. package/.docs/raw/reference/agents/getMemory.mdx +9 -9
  221. package/.docs/raw/reference/agents/getModel.mdx +10 -10
  222. package/.docs/raw/reference/agents/getVoice.mdx +8 -8
  223. package/.docs/raw/reference/agents/listAgents.mdx +9 -9
  224. package/.docs/raw/reference/agents/listScorers.mdx +7 -7
  225. package/.docs/raw/reference/agents/listTools.mdx +7 -7
  226. package/.docs/raw/reference/agents/listWorkflows.mdx +7 -7
  227. package/.docs/raw/reference/agents/network.mdx +11 -10
  228. package/.docs/raw/reference/auth/auth0.mdx +4 -4
  229. package/.docs/raw/reference/auth/clerk.mdx +4 -4
  230. package/.docs/raw/reference/auth/firebase.mdx +6 -6
  231. package/.docs/raw/reference/auth/jwt.mdx +4 -4
  232. package/.docs/raw/reference/auth/supabase.mdx +4 -4
  233. package/.docs/raw/reference/auth/workos.mdx +4 -4
  234. package/.docs/raw/reference/cli/create-mastra.mdx +10 -10
  235. package/.docs/raw/reference/cli/mastra.mdx +7 -7
  236. package/.docs/raw/reference/client-js/agents.mdx +6 -2
  237. package/.docs/raw/reference/client-js/mastra-client.mdx +7 -7
  238. package/.docs/raw/reference/client-js/memory.mdx +24 -16
  239. package/.docs/raw/reference/client-js/observability.mdx +11 -11
  240. package/.docs/raw/reference/client-js/workflows.mdx +6 -34
  241. package/.docs/raw/reference/core/getAgent.mdx +1 -1
  242. package/.docs/raw/reference/core/getAgentById.mdx +1 -1
  243. package/.docs/raw/reference/core/getDeployer.mdx +2 -2
  244. package/.docs/raw/reference/core/getLogger.mdx +2 -2
  245. package/.docs/raw/reference/core/getMCPServer.mdx +31 -15
  246. package/.docs/raw/reference/core/getMCPServerById.mdx +81 -0
  247. package/.docs/raw/reference/core/getScorer.mdx +3 -3
  248. package/.docs/raw/reference/core/getScorerById.mdx +79 -0
  249. package/.docs/raw/reference/core/getServer.mdx +2 -2
  250. package/.docs/raw/reference/core/getStorage.mdx +2 -2
  251. package/.docs/raw/reference/core/getTelemetry.mdx +2 -2
  252. package/.docs/raw/reference/core/getVector.mdx +2 -2
  253. package/.docs/raw/reference/core/getWorkflow.mdx +1 -1
  254. package/.docs/raw/reference/core/listAgents.mdx +1 -1
  255. package/.docs/raw/reference/core/listLogs.mdx +2 -2
  256. package/.docs/raw/reference/core/listLogsByRunId.mdx +2 -2
  257. package/.docs/raw/reference/core/listMCPServers.mdx +65 -0
  258. package/.docs/raw/reference/core/listScorers.mdx +3 -3
  259. package/.docs/raw/reference/core/listVectors.mdx +36 -0
  260. package/.docs/raw/reference/core/listWorkflows.mdx +6 -6
  261. package/.docs/raw/reference/core/mastra-class.mdx +3 -2
  262. package/.docs/raw/reference/core/setLogger.mdx +2 -2
  263. package/.docs/raw/reference/core/setStorage.mdx +3 -2
  264. package/.docs/raw/reference/core/setTelemetry.mdx +2 -2
  265. package/.docs/raw/reference/deployer/cloudflare.mdx +2 -2
  266. package/.docs/raw/reference/deployer/deployer.mdx +0 -6
  267. package/.docs/raw/reference/deployer/netlify.mdx +2 -2
  268. package/.docs/raw/reference/deployer/vercel.mdx +3 -3
  269. package/.docs/raw/reference/evals/answer-relevancy.mdx +164 -126
  270. package/.docs/raw/reference/{scorers → evals}/answer-similarity.mdx +27 -27
  271. package/.docs/raw/reference/evals/bias.mdx +149 -115
  272. package/.docs/raw/reference/evals/completeness.mdx +148 -117
  273. package/.docs/raw/reference/evals/content-similarity.mdx +126 -113
  274. package/.docs/raw/reference/evals/context-precision.mdx +290 -133
  275. package/.docs/raw/reference/{scorers → evals}/context-relevance.mdx +6 -6
  276. package/.docs/raw/reference/{scorers → evals}/create-scorer.mdx +69 -60
  277. package/.docs/raw/reference/evals/faithfulness.mdx +163 -121
  278. package/.docs/raw/reference/evals/hallucination.mdx +159 -132
  279. package/.docs/raw/reference/evals/keyword-coverage.mdx +169 -125
  280. package/.docs/raw/reference/{scorers → evals}/mastra-scorer.mdx +7 -5
  281. package/.docs/raw/reference/{scorers → evals}/noise-sensitivity.mdx +9 -9
  282. package/.docs/raw/reference/evals/prompt-alignment.mdx +604 -182
  283. package/.docs/raw/reference/{scorers/run-experiment.mdx → evals/run-evals.mdx} +17 -18
  284. package/.docs/raw/reference/evals/textual-difference.mdx +149 -117
  285. package/.docs/raw/reference/evals/tone-consistency.mdx +149 -125
  286. package/.docs/raw/reference/{scorers → evals}/tool-call-accuracy.mdx +8 -6
  287. package/.docs/raw/reference/evals/toxicity.mdx +152 -96
  288. package/.docs/raw/reference/{observability/logging → logging}/pino-logger.mdx +2 -2
  289. package/.docs/raw/reference/memory/createThread.mdx +5 -5
  290. package/.docs/raw/reference/memory/deleteMessages.mdx +7 -7
  291. package/.docs/raw/reference/memory/getThreadById.mdx +4 -4
  292. package/.docs/raw/reference/memory/listThreadsByResourceId.mdx +110 -0
  293. package/.docs/raw/reference/memory/memory-class.mdx +13 -9
  294. package/.docs/raw/reference/memory/query.mdx +58 -57
  295. package/.docs/raw/reference/memory/recall.mdx +185 -0
  296. package/.docs/raw/reference/observability/tracing/configuration.mdx +245 -0
  297. package/.docs/raw/reference/observability/{ai-tracing → tracing}/exporters/arize.mdx +13 -13
  298. package/.docs/raw/reference/observability/{ai-tracing → tracing}/exporters/braintrust.mdx +11 -8
  299. package/.docs/raw/reference/observability/{ai-tracing → tracing}/exporters/cloud-exporter.mdx +21 -19
  300. package/.docs/raw/reference/observability/{ai-tracing → tracing}/exporters/console-exporter.mdx +49 -17
  301. package/.docs/raw/reference/observability/{ai-tracing → tracing}/exporters/default-exporter.mdx +42 -41
  302. package/.docs/raw/reference/observability/{ai-tracing → tracing}/exporters/langfuse.mdx +10 -7
  303. package/.docs/raw/reference/observability/{ai-tracing → tracing}/exporters/langsmith.mdx +10 -7
  304. package/.docs/raw/reference/observability/{ai-tracing → tracing}/exporters/otel.mdx +5 -5
  305. package/.docs/raw/reference/observability/tracing/instances.mdx +168 -0
  306. package/.docs/raw/reference/observability/{ai-tracing → tracing}/interfaces.mdx +115 -89
  307. package/.docs/raw/reference/observability/{ai-tracing → tracing}/processors/sensitive-data-filter.mdx +3 -3
  308. package/.docs/raw/reference/observability/{ai-tracing/span.mdx → tracing/spans.mdx} +59 -41
  309. package/.docs/raw/reference/processors/batch-parts-processor.mdx +9 -3
  310. package/.docs/raw/reference/processors/language-detector.mdx +9 -3
  311. package/.docs/raw/reference/processors/moderation-processor.mdx +9 -3
  312. package/.docs/raw/reference/processors/pii-detector.mdx +9 -3
  313. package/.docs/raw/reference/processors/prompt-injection-detector.mdx +9 -3
  314. package/.docs/raw/reference/processors/system-prompt-scrubber.mdx +9 -3
  315. package/.docs/raw/reference/processors/token-limiter-processor.mdx +9 -3
  316. package/.docs/raw/reference/processors/unicode-normalizer.mdx +9 -3
  317. package/.docs/raw/reference/rag/chunk.mdx +1 -8
  318. package/.docs/raw/reference/rag/database-config.mdx +7 -7
  319. package/.docs/raw/reference/rag/metadata-filters.mdx +14 -11
  320. package/.docs/raw/reference/storage/cloudflare-d1.mdx +1 -1
  321. package/.docs/raw/reference/storage/cloudflare.mdx +1 -1
  322. package/.docs/raw/reference/storage/dynamodb.mdx +3 -3
  323. package/.docs/raw/reference/storage/lance.mdx +1 -1
  324. package/.docs/raw/reference/storage/libsql.mdx +3 -1
  325. package/.docs/raw/reference/storage/mongodb.mdx +1 -1
  326. package/.docs/raw/reference/storage/mssql.mdx +6 -1
  327. package/.docs/raw/reference/storage/postgresql.mdx +7 -1
  328. package/.docs/raw/reference/storage/upstash.mdx +2 -1
  329. package/.docs/raw/reference/streaming/agents/stream.mdx +12 -12
  330. package/.docs/raw/reference/streaming/agents/streamLegacy.mdx +8 -8
  331. package/.docs/raw/reference/streaming/workflows/observeStream.mdx +3 -3
  332. package/.docs/raw/reference/streaming/workflows/observeStreamVNext.mdx +3 -3
  333. package/.docs/raw/reference/streaming/workflows/resumeStreamVNext.mdx +6 -6
  334. package/.docs/raw/reference/streaming/workflows/stream.mdx +10 -10
  335. package/.docs/raw/reference/streaming/workflows/streamVNext.mdx +11 -11
  336. package/.docs/raw/reference/templates/overview.mdx +3 -3
  337. package/.docs/raw/reference/tools/create-tool.mdx +52 -35
  338. package/.docs/raw/reference/tools/graph-rag-tool.mdx +15 -15
  339. package/.docs/raw/reference/tools/mcp-client.mdx +1 -1
  340. package/.docs/raw/reference/tools/mcp-server.mdx +119 -35
  341. package/.docs/raw/reference/tools/vector-query-tool.mdx +27 -26
  342. package/.docs/raw/reference/vectors/couchbase.mdx +8 -2
  343. package/.docs/raw/reference/vectors/libsql.mdx +2 -1
  344. package/.docs/raw/reference/vectors/mongodb.mdx +7 -1
  345. package/.docs/raw/reference/vectors/pg.mdx +3 -0
  346. package/.docs/raw/reference/vectors/s3vectors.mdx +1 -1
  347. package/.docs/raw/reference/vectors/upstash.mdx +1 -0
  348. package/.docs/raw/reference/voice/google-gemini-live.mdx +1 -1
  349. package/.docs/raw/reference/voice/voice.addTools.mdx +3 -3
  350. package/.docs/raw/reference/workflows/run-methods/cancel.mdx +4 -4
  351. package/.docs/raw/reference/workflows/run-methods/resume.mdx +14 -14
  352. package/.docs/raw/reference/workflows/run-methods/start.mdx +17 -17
  353. package/.docs/raw/reference/workflows/run.mdx +1 -8
  354. package/.docs/raw/reference/workflows/step.mdx +5 -5
  355. package/.docs/raw/reference/workflows/workflow-methods/branch.mdx +2 -2
  356. package/.docs/raw/reference/workflows/workflow-methods/commit.mdx +1 -1
  357. package/.docs/raw/reference/workflows/workflow-methods/create-run.mdx +7 -13
  358. package/.docs/raw/reference/workflows/workflow-methods/dountil.mdx +1 -1
  359. package/.docs/raw/reference/workflows/workflow-methods/dowhile.mdx +1 -1
  360. package/.docs/raw/reference/workflows/workflow-methods/foreach.mdx +1 -1
  361. package/.docs/raw/reference/workflows/workflow-methods/map.mdx +5 -0
  362. package/.docs/raw/reference/workflows/workflow-methods/parallel.mdx +1 -1
  363. package/.docs/raw/reference/workflows/workflow-methods/sendEvent.mdx +2 -2
  364. package/.docs/raw/reference/workflows/workflow-methods/sleep.mdx +1 -1
  365. package/.docs/raw/reference/workflows/workflow-methods/sleepUntil.mdx +1 -1
  366. package/.docs/raw/reference/workflows/workflow-methods/then.mdx +1 -1
  367. package/.docs/raw/reference/workflows/workflow-methods/waitForEvent.mdx +1 -1
  368. package/.docs/raw/reference/workflows/workflow.mdx +1 -1
  369. package/.docs/raw/server-db/custom-api-routes.mdx +2 -2
  370. package/.docs/raw/server-db/mastra-client.mdx +23 -22
  371. package/.docs/raw/server-db/middleware.mdx +7 -7
  372. package/.docs/raw/server-db/production-server.mdx +4 -4
  373. package/.docs/raw/server-db/{runtime-context.mdx → request-context.mdx} +46 -45
  374. package/.docs/raw/server-db/storage.mdx +29 -21
  375. package/.docs/raw/streaming/events.mdx +3 -3
  376. package/.docs/raw/streaming/overview.mdx +5 -5
  377. package/.docs/raw/streaming/tool-streaming.mdx +18 -17
  378. package/.docs/raw/streaming/workflow-streaming.mdx +1 -1
  379. package/.docs/raw/tools-mcp/advanced-usage.mdx +5 -4
  380. package/.docs/raw/tools-mcp/mcp-overview.mdx +33 -20
  381. package/.docs/raw/tools-mcp/overview.mdx +11 -11
  382. package/.docs/raw/voice/overview.mdx +63 -43
  383. package/.docs/raw/voice/speech-to-speech.mdx +5 -3
  384. package/.docs/raw/voice/speech-to-text.mdx +10 -9
  385. package/.docs/raw/voice/text-to-speech.mdx +13 -12
  386. package/.docs/raw/workflows/agents-and-tools.mdx +9 -5
  387. package/.docs/raw/workflows/control-flow.mdx +3 -3
  388. package/.docs/raw/workflows/error-handling.mdx +2 -21
  389. package/.docs/raw/workflows/human-in-the-loop.mdx +7 -4
  390. package/.docs/raw/workflows/inngest-workflow.mdx +3 -3
  391. package/.docs/raw/workflows/input-data-mapping.mdx +107 -0
  392. package/.docs/raw/workflows/overview.mdx +17 -16
  393. package/.docs/raw/workflows/snapshots.mdx +13 -11
  394. package/.docs/raw/workflows/suspend-and-resume.mdx +23 -15
  395. package/CHANGELOG.md +55 -53
  396. package/README.md +11 -2
  397. package/dist/{chunk-TUAHUTTB.js → chunk-5NJC7NRO.js} +3 -0
  398. package/dist/index.d.ts.map +1 -1
  399. package/dist/prepare-docs/copy-raw.d.ts.map +1 -1
  400. package/dist/prepare-docs/prepare.js +1 -1
  401. package/dist/prompts/migration.d.ts +6 -0
  402. package/dist/prompts/migration.d.ts.map +1 -0
  403. package/dist/stdio.js +402 -30
  404. package/dist/tools/migration.d.ts +40 -0
  405. package/dist/tools/migration.d.ts.map +1 -0
  406. package/package.json +8 -12
  407. package/.docs/organized/changelogs/%40mastra%2Fcloud.md +0 -302
  408. package/.docs/raw/observability/nextjs-tracing.mdx +0 -109
  409. package/.docs/raw/observability/otel-tracing.mdx +0 -189
  410. package/.docs/raw/reference/agents/getScorers.mdx +0 -69
  411. package/.docs/raw/reference/agents/getTools.mdx +0 -69
  412. package/.docs/raw/reference/agents/getWorkflows.mdx +0 -69
  413. package/.docs/raw/reference/client-js/workflows-legacy.mdx +0 -143
  414. package/.docs/raw/reference/core/getAgents.mdx +0 -35
  415. package/.docs/raw/reference/core/getLogs.mdx +0 -96
  416. package/.docs/raw/reference/core/getLogsByRunId.mdx +0 -87
  417. package/.docs/raw/reference/core/getMCPServers.mdx +0 -36
  418. package/.docs/raw/reference/core/getMemory.mdx +0 -36
  419. package/.docs/raw/reference/core/getScorerByName.mdx +0 -78
  420. package/.docs/raw/reference/core/getScorers.mdx +0 -43
  421. package/.docs/raw/reference/core/getVectors.mdx +0 -36
  422. package/.docs/raw/reference/core/getWorkflows.mdx +0 -45
  423. package/.docs/raw/reference/evals/context-position.mdx +0 -197
  424. package/.docs/raw/reference/evals/context-relevancy.mdx +0 -196
  425. package/.docs/raw/reference/evals/contextual-recall.mdx +0 -196
  426. package/.docs/raw/reference/evals/summarization.mdx +0 -212
  427. package/.docs/raw/reference/legacyWorkflows/after.mdx +0 -89
  428. package/.docs/raw/reference/legacyWorkflows/afterEvent.mdx +0 -79
  429. package/.docs/raw/reference/legacyWorkflows/commit.mdx +0 -33
  430. package/.docs/raw/reference/legacyWorkflows/createRun.mdx +0 -76
  431. package/.docs/raw/reference/legacyWorkflows/else.mdx +0 -68
  432. package/.docs/raw/reference/legacyWorkflows/events.mdx +0 -305
  433. package/.docs/raw/reference/legacyWorkflows/execute.mdx +0 -110
  434. package/.docs/raw/reference/legacyWorkflows/if.mdx +0 -108
  435. package/.docs/raw/reference/legacyWorkflows/resume.mdx +0 -158
  436. package/.docs/raw/reference/legacyWorkflows/resumeWithEvent.mdx +0 -133
  437. package/.docs/raw/reference/legacyWorkflows/snapshots.mdx +0 -207
  438. package/.docs/raw/reference/legacyWorkflows/start.mdx +0 -87
  439. package/.docs/raw/reference/legacyWorkflows/step-class.mdx +0 -100
  440. package/.docs/raw/reference/legacyWorkflows/step-condition.mdx +0 -137
  441. package/.docs/raw/reference/legacyWorkflows/step-function.mdx +0 -93
  442. package/.docs/raw/reference/legacyWorkflows/step-options.mdx +0 -69
  443. package/.docs/raw/reference/legacyWorkflows/step-retries.mdx +0 -196
  444. package/.docs/raw/reference/legacyWorkflows/suspend.mdx +0 -70
  445. package/.docs/raw/reference/legacyWorkflows/then.mdx +0 -72
  446. package/.docs/raw/reference/legacyWorkflows/until.mdx +0 -168
  447. package/.docs/raw/reference/legacyWorkflows/watch.mdx +0 -124
  448. package/.docs/raw/reference/legacyWorkflows/while.mdx +0 -168
  449. package/.docs/raw/reference/legacyWorkflows/workflow.mdx +0 -234
  450. package/.docs/raw/reference/memory/getThreadsByResourceId.mdx +0 -79
  451. package/.docs/raw/reference/memory/getThreadsByResourceIdPaginated.mdx +0 -110
  452. package/.docs/raw/reference/observability/ai-tracing/ai-tracing.mdx +0 -185
  453. package/.docs/raw/reference/observability/ai-tracing/configuration.mdx +0 -238
  454. package/.docs/raw/reference/observability/otel-tracing/otel-config.mdx +0 -117
  455. package/.docs/raw/reference/observability/otel-tracing/providers/arize-ax.mdx +0 -81
  456. package/.docs/raw/reference/observability/otel-tracing/providers/arize-phoenix.mdx +0 -121
  457. package/.docs/raw/reference/observability/otel-tracing/providers/braintrust.mdx +0 -40
  458. package/.docs/raw/reference/observability/otel-tracing/providers/dash0.mdx +0 -40
  459. package/.docs/raw/reference/observability/otel-tracing/providers/index.mdx +0 -20
  460. package/.docs/raw/reference/observability/otel-tracing/providers/keywordsai.mdx +0 -73
  461. package/.docs/raw/reference/observability/otel-tracing/providers/laminar.mdx +0 -41
  462. package/.docs/raw/reference/observability/otel-tracing/providers/langfuse.mdx +0 -84
  463. package/.docs/raw/reference/observability/otel-tracing/providers/langsmith.mdx +0 -48
  464. package/.docs/raw/reference/observability/otel-tracing/providers/langwatch.mdx +0 -43
  465. package/.docs/raw/reference/observability/otel-tracing/providers/new-relic.mdx +0 -40
  466. package/.docs/raw/reference/observability/otel-tracing/providers/signoz.mdx +0 -40
  467. package/.docs/raw/reference/observability/otel-tracing/providers/traceloop.mdx +0 -40
  468. package/.docs/raw/reference/scorers/answer-relevancy.mdx +0 -227
  469. package/.docs/raw/reference/scorers/bias.mdx +0 -228
  470. package/.docs/raw/reference/scorers/completeness.mdx +0 -214
  471. package/.docs/raw/reference/scorers/content-similarity.mdx +0 -197
  472. package/.docs/raw/reference/scorers/context-precision.mdx +0 -352
  473. package/.docs/raw/reference/scorers/faithfulness.mdx +0 -241
  474. package/.docs/raw/reference/scorers/hallucination.mdx +0 -252
  475. package/.docs/raw/reference/scorers/keyword-coverage.mdx +0 -229
  476. package/.docs/raw/reference/scorers/prompt-alignment.mdx +0 -668
  477. package/.docs/raw/reference/scorers/textual-difference.mdx +0 -203
  478. package/.docs/raw/reference/scorers/tone-consistency.mdx +0 -211
  479. package/.docs/raw/reference/scorers/toxicity.mdx +0 -228
  480. package/.docs/raw/reference/workflows/run-methods/watch.mdx +0 -73
  481. package/.docs/raw/scorers/evals-old-api/custom-eval.mdx +0 -24
  482. package/.docs/raw/scorers/evals-old-api/overview.mdx +0 -106
  483. package/.docs/raw/scorers/evals-old-api/running-in-ci.mdx +0 -85
  484. package/.docs/raw/scorers/evals-old-api/textual-evals.mdx +0 -58
  485. package/.docs/raw/scorers/off-the-shelf-scorers.mdx +0 -50
  486. package/.docs/raw/workflows-legacy/control-flow.mdx +0 -774
  487. package/.docs/raw/workflows-legacy/dynamic-workflows.mdx +0 -239
  488. package/.docs/raw/workflows-legacy/error-handling.mdx +0 -187
  489. package/.docs/raw/workflows-legacy/nested-workflows.mdx +0 -360
  490. package/.docs/raw/workflows-legacy/overview.mdx +0 -182
  491. package/.docs/raw/workflows-legacy/runtime-variables.mdx +0 -156
  492. package/.docs/raw/workflows-legacy/steps.mdx +0 -115
  493. package/.docs/raw/workflows-legacy/suspend-and-resume.mdx +0 -406
  494. package/.docs/raw/workflows-legacy/variables.mdx +0 -318
@@ -1,246 +1,668 @@
  ---
- title: "Reference: PromptAlignmentMetric | Evals | Mastra Docs"
- description: Documentation for the Prompt Alignment Metric in Mastra, which evaluates how well LLM outputs adhere to given prompt instructions.
+ title: "Reference: Prompt Alignment Scorer | Evals | Mastra Docs"
+ description: Documentation for the Prompt Alignment Scorer in Mastra. Evaluates how well agent responses align with user prompt intent, requirements, completeness, and appropriateness using multi-dimensional analysis.
  ---

- # PromptAlignmentMetric
+ import PropertiesTable from "@site/src/components/PropertiesTable";

- :::info Scorers
- This documentation refers to the legacy evals API. For the latest scorer features, see [Scorers](/docs/scorers/overview).
- :::
+ # Prompt Alignment Scorer

- The `PromptAlignmentMetric` class evaluates how strictly an LLM's output follows a set of given prompt instructions. It uses a judge-based system to verify each instruction is followed exactly and provides detailed reasoning for any deviations.
+ The `createPromptAlignmentScorerLLM()` function creates a scorer that evaluates how well agent responses align with user prompts across multiple dimensions: intent understanding, requirement fulfillment, response completeness, and format appropriateness.

- ## Basic Usage
-
- ```typescript
- import { openai } from "@ai-sdk/openai";
- import { PromptAlignmentMetric } from "@mastra/evals/llm";
-
- // Configure the model for evaluation
- const model = openai("gpt-4o-mini");
-
- const instructions = [
-   "Start sentences with capital letters",
-   "End each sentence with a period",
-   "Use present tense",
- ];
-
- const metric = new PromptAlignmentMetric(model, {
-   instructions,
-   scale: 1,
- });
-
- const result = await metric.measure(
-   "describe the weather",
-   "The sun is shining. Clouds float in the sky. A gentle breeze blows.",
- );
-
- console.log(result.score); // Alignment score from 0-1
- console.log(result.info.reason); // Explanation of the score
- ```
-
- ## Constructor Parameters
+ ## Parameters

  <PropertiesTable
    content={[
      {
        name: "model",
-       type: "LanguageModel",
+       type: "MastraModelConfig",
        description:
-         "Configuration for the model used to evaluate instruction alignment",
-       isOptional: false,
+         "The language model to use for evaluating prompt-response alignment",
+       required: true,
      },
      {
        name: "options",
        type: "PromptAlignmentOptions",
-       description: "Configuration options for the metric",
-       isOptional: false,
+       description: "Configuration options for the scorer",
+       required: false,
+       children: [
+         {
+           name: "scale",
+           type: "number",
+           description: "Scale factor to multiply the final score (default: 1)",
+           required: false,
+         },
+         {
+           name: "evaluationMode",
+           type: "'user' | 'system' | 'both'",
+           description:
+             "Evaluation mode - 'user' evaluates user prompt alignment only, 'system' evaluates system compliance only, 'both' evaluates both with weighted scoring (default: 'both')",
+           required: false,
+         },
+       ],
      },
    ]}
  />

- ### PromptAlignmentOptions
+ ## .run() Returns

  <PropertiesTable
    content={[
      {
-       name: "instructions",
-       type: "string[]",
-       description: "Array of instructions that the output should follow",
-       isOptional: false,
+       name: "score",
+       type: "number",
+       description:
+         "Multi-dimensional alignment score between 0 and scale (default 0-1)",
      },
      {
-       name: "scale",
-       type: "number",
-       description: "Maximum score value",
-       isOptional: true,
-       defaultValue: "1",
+       name: "reason",
+       type: "string",
+       description:
+         "Human-readable explanation of the prompt alignment evaluation with detailed breakdown",
      },
    ]}
  />

- ## measure() Parameters
+ `.run()` returns a result in the following shape:

- <PropertiesTable
-   content={[
-     {
-       name: "input",
-       type: "string",
-       description: "The original prompt or query",
-       isOptional: false,
+ ```typescript
+ {
+   runId: string,
+   score: number,
+   reason: string,
+   analyzeStepResult: {
+     intentAlignment: {
+       score: number,
+       primaryIntent: string,
+       isAddressed: boolean,
+       reasoning: string
      },
-     {
-       name: "output",
-       type: "string",
-       description: "The LLM's response to evaluate",
-       isOptional: false,
+     requirementsFulfillment: {
+       requirements: Array<{
+         requirement: string,
+         isFulfilled: boolean,
+         reasoning: string
+       }>,
+       overallScore: number
      },
-   ]}
- />
+     completeness: {
+       score: number,
+       missingElements: string[],
+       reasoning: string
+     },
+     responseAppropriateness: {
+       score: number,
+       formatAlignment: boolean,
+       toneAlignment: boolean,
+       reasoning: string
+     },
+     overallAssessment: string
+   }
+ }
+ ```
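As a quick orientation to this shape, the fields can be read directly off the result. A minimal sketch, assuming `result` holds the value returned by `scorer.run(...)`; the field names follow the documented shape above rather than a verified API:

```typescript
// Sketch: reading the documented .run() result shape (assumes `result` exists)
console.log(result.score); // weighted score, 0 to `scale`
console.log(result.reason); // human-readable explanation

// Per-dimension breakdown from the analyze step
const { intentAlignment, requirementsFulfillment, completeness } =
  result.analyzeStepResult;
console.log(intentAlignment.primaryIntent, intentAlignment.score);
for (const req of requirementsFulfillment.requirements) {
  console.log(`${req.requirement}: ${req.isFulfilled ? "fulfilled" : "missing"}`);
}
console.log("Missing elements:", completeness.missingElements);
```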

- ## Returns
+ ## Scoring Details

- <PropertiesTable
-   content={[
+ ### Scorer configuration
+
+ You can customize the Prompt Alignment Scorer by adjusting the scale parameter and evaluation mode to fit your scoring needs.
+
+ ```typescript showLineNumbers copy
+ const scorer = createPromptAlignmentScorerLLM({
+   model: "openai/gpt-4o-mini",
+   options: {
+     scale: 10, // Score from 0-10 instead of 0-1
+     evaluationMode: "both", // 'user', 'system', or 'both' (default)
+   },
+ });
+ ```
+
+ ### Multi-Dimensional Analysis
+
+ Prompt Alignment evaluates responses across four key dimensions with weighted scoring that adapts based on the evaluation mode:
+
+ #### User Mode ('user')
+
+ Evaluates alignment with user prompts only:
+
+ 1. **Intent Alignment** (40% weight) - Whether the response addresses the user's core request
+ 2. **Requirements Fulfillment** (30% weight) - If all user requirements are met
+ 3. **Completeness** (20% weight) - Whether the response is comprehensive for user needs
+ 4. **Response Appropriateness** (10% weight) - If format and tone match user expectations
+
+ #### System Mode ('system')
+
+ Evaluates compliance with system guidelines only:
+
+ 1. **Intent Alignment** (35% weight) - Whether the response follows system behavioral guidelines
+ 2. **Requirements Fulfillment** (35% weight) - If all system constraints are respected
+ 3. **Completeness** (15% weight) - Whether the response adheres to all system rules
+ 4. **Response Appropriateness** (15% weight) - If format and tone match system specifications
+
+ #### Both Mode ('both' - default)
+
+ Combines evaluation of both user and system alignment:
+
+ - **User alignment**: 70% of final score (using user mode weights)
+ - **System compliance**: 30% of final score (using system mode weights)
+ - Provides balanced assessment of user satisfaction and system adherence
+
+ ### Scoring Formula
+
+ **User Mode:**
+
+ ```
+ Weighted Score = (intent_score × 0.4) + (requirements_score × 0.3) +
+                  (completeness_score × 0.2) + (appropriateness_score × 0.1)
+ Final Score = Weighted Score × scale
+ ```
+
+ **System Mode:**
+
+ ```
+ Weighted Score = (intent_score × 0.35) + (requirements_score × 0.35) +
+                  (completeness_score × 0.15) + (appropriateness_score × 0.15)
+ Final Score = Weighted Score × scale
+ ```
+
+ **Both Mode (default):**
+
+ ```
+ User Score = (user dimensions with user weights)
+ System Score = (system dimensions with system weights)
+ Weighted Score = (User Score × 0.7) + (System Score × 0.3)
+ Final Score = Weighted Score × scale
+ ```
+
+ **Weight Distribution Rationale**:
+
+ - **User Mode**: Prioritizes intent (40%) and requirements (30%) for user satisfaction
+ - **System Mode**: Balances behavioral compliance (35%) and constraints (35%) equally
+ - **Both Mode**: 70/30 split ensures user needs are primary while maintaining system compliance
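The weighted combination described above is simple enough to sanity-check by hand. A minimal sketch of the arithmetic, using plain numbers in the 0-1 range; this mirrors the documented weights, not the library's internal implementation:

```typescript
// Per-dimension scores, each normalized to 0-1
type DimensionScores = {
  intent: number;
  requirements: number;
  completeness: number;
  appropriateness: number;
};

const USER_WEIGHTS: DimensionScores = {
  intent: 0.4,
  requirements: 0.3,
  completeness: 0.2,
  appropriateness: 0.1,
};

const SYSTEM_WEIGHTS: DimensionScores = {
  intent: 0.35,
  requirements: 0.35,
  completeness: 0.15,
  appropriateness: 0.15,
};

// Weighted sum of one mode's dimension scores
function weighted(scores: DimensionScores, weights: DimensionScores): number {
  return (
    scores.intent * weights.intent +
    scores.requirements * weights.requirements +
    scores.completeness * weights.completeness +
    scores.appropriateness * weights.appropriateness
  );
}

// 'both' mode: 70% user alignment + 30% system compliance, then scaled
function finalScoreBoth(
  user: DimensionScores,
  system: DimensionScores,
  scale = 1,
): number {
  return (
    (weighted(user, USER_WEIGHTS) * 0.7 +
      weighted(system, SYSTEM_WEIGHTS) * 0.3) *
    scale
  );
}
```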
+
+ ### Score Interpretation
+
+ - **0.9-1.0** = Excellent alignment across all dimensions
+ - **0.8-0.9** = Very good alignment with minor gaps
+ - **0.7-0.8** = Good alignment but missing some requirements or completeness
+ - **0.6-0.7** = Moderate alignment with noticeable gaps
+ - **0.4-0.6** = Poor alignment with significant issues
+ - **0.0-0.4** = Very poor alignment, response doesn't address the prompt effectively
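Where a numeric score needs a qualitative label (dashboards, CI summaries), the bands above translate directly into a small helper. A hypothetical sketch, assuming a score already normalized to 0-1:

```typescript
// Hypothetical helper mapping a normalized (0-1) score to the bands above
function interpretAlignment(score: number): string {
  if (score >= 0.9) return "excellent";
  if (score >= 0.8) return "very good";
  if (score >= 0.7) return "good";
  if (score >= 0.6) return "moderate";
  if (score >= 0.4) return "poor";
  return "very poor";
}
```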
+
+ ### When to Use Each Mode
+
+ **User Mode (`'user'`)** - Use when:
+
+ - Evaluating customer service responses for user satisfaction
+ - Testing content generation quality from user perspective
+ - Measuring how well responses address user questions
+ - Focusing purely on request fulfillment without system constraints
+
+ **System Mode (`'system'`)** - Use when:
+
+ - Auditing AI safety and compliance with behavioral guidelines
+ - Ensuring agents follow brand voice and tone requirements
+ - Validating adherence to content policies and constraints
+ - Testing system-level behavioral consistency
+
+ **Both Mode (`'both'`)** - Use when (default, recommended):
+
+ - Comprehensive evaluation of overall AI agent performance
+ - Balancing user satisfaction with system compliance
+ - Production monitoring where both user and system requirements matter
+ - Holistic assessment of prompt-response alignment
+
+ ## Common Use Cases
+
+ ### Code Generation Evaluation
+
+ Ideal for evaluating:
+
+ - Programming task completion
+ - Code quality and completeness
+ - Adherence to coding requirements
+ - Format specifications (functions, classes, etc.)
+
+ ```typescript
+ // Example: API endpoint creation
+ const codePrompt =
+   "Create a REST API endpoint with authentication and rate limiting";
+ // Scorer evaluates: intent (API creation), requirements (auth + rate limiting),
+ // completeness (full implementation), format (code structure)
+ ```
+
+ ### Instruction Following Assessment
+
+ Perfect for:
+
+ - Task completion verification
+ - Multi-step instruction adherence
+ - Requirement compliance checking
+ - Educational content evaluation
+
+ ```typescript
+ // Example: Multi-requirement task
+ const taskPrompt =
+   "Write a Python class with initialization, validation, error handling, and documentation";
+ // Scorer tracks each requirement individually and provides detailed breakdown
+ ```
+
+ ### Content Format Validation
+
+ Useful for:
+
+ - Format specification compliance
+ - Style guide adherence
+ - Output structure verification
+ - Response appropriateness checking
+
+ ```typescript
+ // Example: Structured output
+ const formatPrompt =
+   "Explain the differences between let and const in JavaScript using bullet points";
+ // Scorer evaluates content accuracy AND format compliance
+ ```
+
+ ### Agent Response Quality
+
+ Measure how well your AI agents follow user instructions:
+
+ ```typescript
+ const agent = new Agent({
+   name: "CodingAssistant",
+   instructions:
+     "You are a helpful coding assistant. Always provide working code examples.",
+   model: "openai/gpt-4o",
+ });
+
+ // Evaluate comprehensive alignment (default)
+ const scorer = createPromptAlignmentScorerLLM({
+   model: "openai/gpt-4o-mini",
+   options: { evaluationMode: "both" }, // Evaluates both user intent and system guidelines
+ });
+
+ // Evaluate just user satisfaction
+ const userScorer = createPromptAlignmentScorerLLM({
+   model: "openai/gpt-4o-mini",
+   options: { evaluationMode: "user" }, // Focus only on user request fulfillment
+ });
+
+ // Evaluate system compliance
+ const systemScorer = createPromptAlignmentScorerLLM({
+   model: "openai/gpt-4o-mini",
+   options: { evaluationMode: "system" }, // Check adherence to system instructions
+ });
+
+ const result = await scorer.run(agentRun);
+ ```
+
+ ### Prompt Engineering Optimization
+
+ Test different prompts to improve alignment:
+
+ ```typescript
+ const prompts = [
+   "Write a function to calculate factorial",
+   "Create a Python function that calculates factorial with error handling for negative inputs",
+   "Implement a factorial calculator in Python with: input validation, error handling, and docstring",
+ ];
+
+ // Compare alignment scores to find the best prompt
+ for (const prompt of prompts) {
+   const result = await scorer.run(createTestRun(prompt, response));
+   console.log(`Prompt alignment: ${result.score}`);
+ }
+ ```
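Note that `createTestRun` in the loop above is not defined on this page. One plausible shape, matching the `input`/`output` payload used by the `.run()` examples below (a hypothetical helper, not part of `@mastra/evals`):

```typescript
// Hypothetical helper assembling a run payload from a prompt/response pair
function createTestRun(prompt: string, responseText: string) {
  return {
    input: [{ role: "user", content: prompt }],
    output: { role: "assistant", text: responseText },
  };
}
```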
316
+
317
+ ### Multi-Agent System Evaluation
318
+
319
+ Compare different agents or models:
320
+
321
+ ```typescript
322
+ const agents = [agent1, agent2, agent3];
323
+ const testPrompts = [...]; // Array of test prompts
324
+
325
+ for (const agent of agents) {
326
+ let totalScore = 0;
327
+ for (const prompt of testPrompts) {
328
+ const response = await agent.run(prompt);
329
+ const evaluation = await scorer.run({ input: prompt, output: response });
330
+ totalScore += evaluation.score;
331
+ }
332
+ console.log(`${agent.name} average alignment: ${totalScore / testPrompts.length}`);
333
+ }
334
+ ```
335
+
336
+ ## Examples
337
+
338
+ ### Basic Configuration
339
+
340
+ ```typescript
341
+ import { createPromptAlignmentScorerLLM } from "@mastra/evals";
342
+
343
+ const scorer = createPromptAlignmentScorerLLM({
344
+ model: "openai/gpt-4o",
345
+ });
346
+
347
+ // Evaluate a code generation task
348
+ const result = await scorer.run({
349
+ input: [
106
350
  {
107
- name: "score",
108
- type: "number",
109
- description: "Alignment score (0 to scale, default 0-1)",
351
+ role: "user",
352
+ content:
353
+ "Write a Python function to calculate factorial with error handling",
110
354
  },
355
+ ],
356
+ output: {
357
+ role: "assistant",
358
+ text: `def factorial(n):
359
+ if n < 0:
360
+ raise ValueError("Factorial not defined for negative numbers")
361
+ if n == 0:
362
+ return 1
363
+ return n * factorial(n-1)`,
364
+ },
365
+ });
366
+ // Result: { score: 0.95, reason: "Excellent alignment - function addresses intent, includes error handling..." }
367
+ ```
368
+
369
+ ### Custom Configuration Examples
370
+
371
+ ```typescript
372
+ // Configure scale and evaluation mode
373
+ const scorer = createPromptAlignmentScorerLLM({
374
+ model: "openai/gpt-4o",
375
+ options: {
376
+ scale: 10, // Score from 0-10 instead of 0-1
377
+ evaluationMode: "both", // 'user', 'system', or 'both' (default)
378
+ },
379
+ });
380
+
381
+ // User-only evaluation - focus on user satisfaction
382
+ const userScorer = createPromptAlignmentScorerLLM({
383
+ model: "openai/gpt-4o",
384
+ options: { evaluationMode: "user" },
385
+ });
386
+
387
+ // System-only evaluation - focus on compliance
388
+ const systemScorer = createPromptAlignmentScorerLLM({
389
+ model: "openai/gpt-4o",
390
+ options: { evaluationMode: "system" },
391
+ });
392
+
393
+ const result = await scorer.run(testRun);
394
+ // Result: { score: 8.5, reason: "Score: 8.5 out of 10 - Good alignment with both user intent and system guidelines..." }
395
+ ```
396
+
397
+ ### Format-Specific Evaluation
398
+
399
+ ```typescript
400
+ // Evaluate bullet point formatting
401
+ const result = await scorer.run({
402
+ input: [
111
403
  {
112
- name: "info",
113
- type: "object",
114
- description:
115
- "Object containing detailed metrics about instruction compliance",
116
- properties: [
117
- {
118
- type: "string",
119
- parameters: [
120
- {
121
- name: "reason",
122
- type: "string",
123
- description:
124
- "Detailed explanation of the score and instruction compliance",
125
- },
126
- ],
127
- },
128
- ],
404
+ role: "user",
405
+ content: "List the benefits of TypeScript in bullet points",
129
406
  },
130
- ]}
131
- />
407
+ ],
408
+ output: {
409
+ role: "assistant",
410
+ text: "TypeScript provides static typing, better IDE support, and enhanced code reliability.",
411
+ },
412
+ });
413
+ // Result: Lower appropriateness score due to format mismatch (paragraph vs bullet points)
414
+ ```
132
415
 
133
- ## Scoring Details
416
+ ### Excellent alignment example
417
+
418
+ In this example, the response fully addresses the user's prompt with all requirements met.
134
419
 
135
- The metric evaluates instruction alignment through:
420
+ ```typescript title="src/example-excellent-prompt-alignment.ts" showLineNumbers copy
421
+ import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";
136
422
 
137
- - Applicability assessment for each instruction
138
- - Strict compliance evaluation for applicable instructions
139
- - Detailed reasoning for all verdicts
140
- - Proportional scoring based on applicable instructions
423
+ const scorer = createPromptAlignmentScorerLLM({
424
+ model: "openai/gpt-4o-mini",
425
+ });
141
426
 
142
- ### Instruction Verdicts
427
+ const inputMessages = [
428
+ {
429
+ role: "user",
430
+ content:
431
+ "Write a Python function to calculate factorial with error handling for negative numbers",
432
+ },
433
+ ];
143
434
 
144
- Each instruction receives one of three verdicts:
435
+ const outputMessage = {
436
+ text: `def factorial(n):
437
+ """Calculate factorial of a number."""
438
+ if n < 0:
439
+ raise ValueError("Factorial not defined for negative numbers")
440
+ if n == 0 or n == 1:
441
+ return 1
442
+ return n * factorial(n - 1)`,
443
+ };
444
+
445
+ const result = await scorer.run({
446
+ input: inputMessages,
447
+ output: outputMessage,
448
+ });
145
449
 
146
- - "yes": Instruction is applicable and completely followed
147
- - "no": Instruction is applicable but not followed or only partially followed
148
- - "n/a": Instruction is not applicable to the given context
450
+ console.log(result);
451
+ ```
149
452
 
150
- ### Scoring Process
453
+ ### Excellent alignment output
151
454
 
152
- 1. Evaluates instruction applicability:
153
- - Determines if each instruction applies to the context
154
- - Marks irrelevant instructions as "n/a"
155
- - Considers domain-specific requirements
455
+ The output receives a high score because it perfectly addresses the intent, fulfills all requirements, and uses appropriate format.
156
456
 
157
- 2. Assesses compliance for applicable instructions:
158
- - Evaluates each applicable instruction independently
159
- - Requires complete compliance for "yes" verdict
160
- - Documents specific reasons for all verdicts
457
+ ```typescript
458
+ {
459
+ score: 0.95,
460
+ reason: 'The score is 0.95 because the response perfectly addresses the primary intent of creating a factorial function and fulfills all requirements including Python implementation, error handling for negative numbers, and proper documentation. The code format is appropriate and the implementation is complete.'
461
+ }
462
+ ```
161
463
 
162
- 3. Calculates alignment score:
163
- - Counts followed instructions ("yes" verdicts)
164
- - Divides by total applicable instructions (excluding "n/a")
165
- - Scales to configured range
464
+ ### Partial alignment example
166
465
 
167
- Final score: `(followed_instructions / applicable_instructions) * scale`
466
+ In this example, the response addresses the core intent but misses some requirements or has format issues.
168
467
 
169
- ### Important Considerations
468
+ ```typescript title="src/example-partial-prompt-alignment.ts" showLineNumbers copy
469
+ import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";
170
470
 
171
- - Empty outputs:
172
- - All formatting instructions are considered applicable
173
- - Marked as "no" since they cannot satisfy requirements
174
- - Domain-specific instructions:
175
- - Always applicable if about the queried domain
176
- - Marked as "no" if not followed, not "n/a"
177
- - "n/a" verdicts:
178
- - Only used for completely different domains
179
- - Do not affect the final score calculation
471
+ const scorer = createPromptAlignmentScorerLLM({
472
+ model: "openai/gpt-4o-mini",
473
+ });
180
474
 
181
- ### Score interpretation
475
+ const inputMessages = [
476
+ {
477
+ role: "user",
478
+ content: "List the benefits of TypeScript in bullet points",
479
+ },
480
+ ];
481
+
482
+ const outputMessage = {
483
+ text: "TypeScript provides static typing, better IDE support, and enhanced code reliability through compile-time error checking.",
484
+ };
485
+
486
+ const result = await scorer.run({
487
+ input: inputMessages,
488
+ output: outputMessage,
489
+ });
182
490
 
183
- (0 to scale, default 0-1)
491
+ console.log(result);
492
+ ```
184
493
 
185
- - 1.0: All applicable instructions followed perfectly
186
- - 0.7-0.9: Most applicable instructions followed
187
- - 0.4-0.6: Mixed compliance with applicable instructions
188
- - 0.1-0.3: Limited compliance with applicable instructions
189
- - 0.0: No applicable instructions followed
494
+ #### Partial alignment output
190
495
 
191
- ## Example with Analysis
496
+ The output receives a lower score because while the content is accurate, it doesn't follow the requested format (bullet points).
192
497
 
193
498
  ```typescript
194
- import { openai } from "@ai-sdk/openai";
195
- import { PromptAlignmentMetric } from "@mastra/evals/llm";
499
+ {
500
+ score: 0.75,
501
+ reason: 'The score is 0.75 because the response addresses the intent of explaining TypeScript benefits and provides accurate information, but fails to use the requested bullet point format, resulting in lower appropriateness scoring.'
502
+ }
503
+ ```
196
504
 
197
- // Configure the model for evaluation
198
- const model = openai("gpt-4o-mini");
505
+ ### Poor alignment example
199
506
 
200
- const metric = new PromptAlignmentMetric(model, {
201
- instructions: [
202
- "Use bullet points for each item",
203
- "Include exactly three examples",
204
- "End each point with a semicolon"
205
- ],
206
- scale: 1
207
- });
208
-
209
- const result = await metric.measure(
210
- "List three fruits",
211
- "• Apple is red and sweet;
212
- • Banana is yellow and curved;
213
- • Orange is citrus and round.`
214
- );
215
-
216
- // Example output:
217
- // {
218
- // score: 1.0,
219
- // info: {
220
- // reason: "The score is 1.0 because all instructions were followed exactly:
221
- // bullet points were used, exactly three examples were provided, and
222
- // each point ends with a semicolon."
223
- // }
224
- // }
225
-
226
- const result2 = await metric.measure(
227
- "List three fruits",
228
- "1. Apple
229
- 2. Banana
230
- 3. Orange and Grape`
231
- );
232
-
233
- // Example output:
234
- // {
235
- // score: 0.33,
236
- // info: {
237
- // reason: "The score is 0.33 because: numbered lists were used instead of bullet points,
238
- // no semicolons were used, and four fruits were listed instead of exactly three."
239
- // }
240
- // }
507
+ In this example, the response covers only the most basic part of the request and omits the user's specific requirements.
508
+
509
+ ```typescript title="src/example-poor-prompt-alignment.ts" showLineNumbers copy
510
+ import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";
511
+
512
+ const scorer = createPromptAlignmentScorerLLM({
513
+ model: "openai/gpt-4o-mini",
514
+ });
515
+
516
+ const inputMessages = [
517
+ {
518
+ role: "user",
519
+ content:
520
+ "Write a Python class with initialization, validation, error handling, and documentation",
521
+ },
522
+ ];
523
+
524
+ const outputMessage = {
525
+ text: `class Example:
526
+ def __init__(self, value):
527
+ self.value = value`,
528
+ };
529
+
530
+ const result = await scorer.run({
531
+ input: inputMessages,
532
+ output: outputMessage,
533
+ });
534
+
535
+ console.log(result);
536
+ ```
537
+
538
+ ### Poor alignment output
539
+
540
+ The output receives a low score because it only partially fulfills the requirements, missing validation, error handling, and documentation.
541
+
542
+ ```typescript
543
+ {
544
+ score: 0.35,
545
+ reason: 'The score is 0.35 because while the response addresses the basic intent of creating a Python class with initialization, it fails to include validation, error handling, and documentation as specifically requested, resulting in incomplete requirement fulfillment.'
546
+ }
547
+ ```
548
+
549
+ ### Evaluation Mode Examples
550
+
551
+ #### User Mode - Focus on User Prompt Only
552
+
553
+ Evaluates how well the response addresses the user's request, ignoring system instructions:
554
+
555
+ ```typescript title="src/example-user-mode.ts" showLineNumbers copy
556
+ import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";
+
+ const scorer = createPromptAlignmentScorerLLM({
557
+ model: "openai/gpt-4o-mini",
558
+ options: { evaluationMode: "user" },
559
+ });
560
+
561
+ const result = await scorer.run({
562
+ input: {
563
+ inputMessages: [
564
+ {
565
+ role: "user",
566
+ content: "Explain recursion with an example",
567
+ },
568
+ ],
569
+ systemMessages: [
570
+ {
571
+ role: "system",
572
+ content: "Always provide code examples in Python",
573
+ },
574
+ ],
575
+ },
576
+ output: {
577
+ text: "Recursion is when a function calls itself. For example: factorial(5) = 5 * factorial(4)",
578
+ },
579
+ });
580
+ // Scores high for addressing the user's request, even without Python code
581
+ ```
582
+
583
+ #### System Mode - Focus on System Guidelines Only
584
+
585
+ Evaluates compliance with system behavioral guidelines and constraints:
586
+
587
+ ```typescript title="src/example-system-mode.ts" showLineNumbers copy
588
+ import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";
+
+ const scorer = createPromptAlignmentScorerLLM({
589
+ model: "openai/gpt-4o-mini",
590
+ options: { evaluationMode: "system" },
591
+ });
592
+
593
+ const result = await scorer.run({
594
+ input: {
595
+ systemMessages: [
596
+ {
597
+ role: "system",
598
+ content:
599
+ "You are a helpful assistant. Always be polite, concise, and provide examples.",
600
+ },
601
+ ],
602
+ inputMessages: [
603
+ {
604
+ role: "user",
605
+ content: "What is machine learning?",
606
+ },
607
+ ],
608
+ },
609
+ output: {
610
+ text: "Machine learning is a subset of AI where computers learn from data. For example, spam filters learn to identify unwanted emails by analyzing patterns in previously marked spam.",
611
+ },
612
+ });
613
+ // Evaluates politeness, conciseness, and whether an example is provided
241
614
  ```
242
615
 
616
+ #### Both Mode - Combined Evaluation (Default)
617
+
618
+ Evaluates both user intent fulfillment and system compliance with weighted scoring (70% user, 30% system):
619
+
620
+ ```typescript title="src/example-both-mode.ts" showLineNumbers copy
621
+ import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";
+
+ const scorer = createPromptAlignmentScorerLLM({
622
+ model: "openai/gpt-4o-mini",
623
+ options: { evaluationMode: "both" }, // This is the default
624
+ });
625
+
626
+ const result = await scorer.run({
627
+ input: {
628
+ systemMessages: [
629
+ {
630
+ role: "system",
631
+ content:
632
+ "Always provide code examples when explaining programming concepts",
633
+ },
634
+ ],
635
+ inputMessages: [
636
+ {
637
+ role: "user",
638
+ content: "Explain how to reverse a string",
639
+ },
640
+ ],
641
+ },
642
+ output: {
643
+ text: `To reverse a string, you can iterate through it backwards. Here's an example in Python:
644
+
645
+ def reverse_string(s):
646
+ return s[::-1]
647
+
648
+ # Usage: reverse_string("hello") returns "olleh"`,
649
+ },
650
+ });
651
+ // High score for both addressing the user's request AND following system guidelines
652
+ ```
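+
+ The 70/30 split above implies the combined score is a weighted average of the two dimension scores. A sketch for illustration only; `userScore` and `systemScore` are assumed names, not the scorer's actual internals:
+
+ ```typescript
+ // Hypothetical illustration of the documented 70% user / 30% system weighting
+ const userScore = 0.9;   // how well the output addresses the user's request
+ const systemScore = 1.0; // how well it follows the system guidelines
+ const combined = 0.7 * userScore + 0.3 * systemScore; // => 0.93
+ ```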
653
+
654
+ ## Comparison with Other Scorers
655
+
656
+ | Aspect | Prompt Alignment | Answer Relevancy | Faithfulness |
657
+ | -------------- | ------------------------------------------ | ---------------------------- | -------------------------------- |
658
+ | **Focus** | Multi-dimensional prompt adherence | Query-response relevance | Context groundedness |
659
+ | **Evaluation** | Intent, requirements, completeness, format | Semantic similarity to query | Factual consistency with context |
660
+ | **Use Case** | General prompt following | Information retrieval | RAG/context-based systems |
661
+ | **Dimensions** | 4 weighted dimensions | Single relevance dimension | Single faithfulness dimension |
662
+
243
663
  ## Related
244
664
 
245
- - [Answer Relevancy Metric](./answer-relevancy)
246
- - [Keyword Coverage Metric](./keyword-coverage)
665
+ - [Answer Relevancy Scorer](/reference/v1/evals/answer-relevancy) - Evaluates query-response relevance
666
+ - [Faithfulness Scorer](/reference/v1/evals/faithfulness) - Measures context groundedness
667
+ - [Tool Call Accuracy Scorer](/reference/v1/evals/tool-call-accuracy) - Evaluates tool selection
668
+ - [Custom Scorers](/docs/v1/evals/custom-scorers) - Creating your own evaluation metrics