npm - @microsoft/m365-copilot-eval - Versions diffs - 1.1.0-preview.1 → 1.2.0-preview.1 - Mend

@microsoft/m365-copilot-eval 1.1.0-preview.1 → 1.2.0-preview.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (24)

package/README.md +64 -18
package/package.json +4 -2
package/schema/CHANGELOG.md +21 -0
package/schema/v1/eval-document.schema.json +236 -0
package/schema/v1/examples/invalid/empty-items.json +4 -0
package/schema/v1/examples/invalid/invalid-semver.json +8 -0
package/schema/v1/examples/invalid/missing-schema-version.json +7 -0
package/schema/v1/examples/invalid/wrong-type.json +6 -0
package/schema/v1/examples/valid/comprehensive.json +92 -0
package/schema/v1/examples/valid/minimal.json +8 -0
package/schema/version.json +6 -0
package/src/clients/cli/custom_evaluators/CitationsEvaluator.py +77 -33
package/src/clients/cli/main.py +197 -30
package/src/clients/cli/readme.md +5 -5
package/src/clients/cli/requirements.txt +2 -0
package/src/clients/cli/samples/starter.json +13 -10
package/src/clients/cli/schema_handler.py +349 -0
package/src/clients/cli/version_check.py +139 -0
package/src/clients/node-js/bin/runevals.js +34 -103
package/src/clients/node-js/config/default.js +1 -1
package/src/clients/node-js/lib/env-loader.js +126 -0
package/src/clients/node-js/lib/progress.js +36 -36
package/src/clients/node-js/lib/python-runtime.js +4 -6
package/src/clients/node-js/lib/venv-manager.js +65 -32

package/README.md CHANGED

@@ -9,6 +9,8 @@ A **zero-configuration** CLI for evaluating M365 Copilot agents. Send prompts to
 - - Relevance (1–5)
 - - Coherence (1–5)
 - - Groundedness (1–5)
+- - Tool Call Accuracy (1–5)
+- - Citations (0–1)
 - Multiple input modes: command‑line list, JSON file, interactive.
 - Multiple output formats: console (colorized), JSON, CSV, HTML (auto‑opens report).
@@ -26,37 +28,48 @@ A **zero-configuration** CLI for evaluating M365 Copilot agents. Send prompts to
 ### Install the Tool
 1. Make sure you have Node.js
-2. Run `npm install @microsoft/m365-copilot-eval`
+2. Run `npm install -g @microsoft/m365-copilot-eval`
 ### Setup Steps
 Now, set up where you'll store your environment variables:
 **Are you using M365 Agents Toolkit (ATK)?**
-- ✅ **Yes** → You already have `.env.local` in your project with `M365_TITLE_ID`. You'll add Azure OpenAI variables to this file.
-- ✅ **No** → Create a new `env/.env.dev` file in your project directory. You'll add all variables there.
+- - **Yes** → You already have `.env.local` in your project with `M365_TITLE_ID` (automatically used as your agent ID). Keep non-secret config there and put secrets like `AZURE_AI_API_KEY` in `.env.local.user` (never committed).
+- - **No** → Create a new `env/.env.dev` file in your project directory. You'll add all variables there.
 The CLI loads environment variables from multiple sources (in order of precedence):
 1. **`.env.local`** in current directory (auto-detected, ideal for ATK projects)
-2. **`env/.env.{environment}`** via `--env` flag (e.g., `--env dev` loads `env/.env.dev`)
-3. **System environment variables**
+2. **`.env.local.user`** in current directory — or **`env/.env.local.user`** — auto-loaded as a user-specific override (never commit this file; put secrets here)
+3. **`env/.env.{environment}`** via `--env` flag (e.g., `--env dev` loads `env/.env.dev`)
+4. **System environment variables**
 #### Option 1: For M365 Agents Toolkit (ATK) Projects
-If you're working in an ATK project, you already have `.env.local` with `M365_TITLE_ID`. Just add your Azure credentials and tenant ID:
+ATK projects already check in `.env.local` with agent configuration. **Do not put secrets in `.env.local`** — use `.env.local.user` instead, which is loaded automatically and should be added to your `.gitignore`.
 ```bash
-# .env.local (existing ATK project file)
+# .env.local (checked in — no secrets!)
 # Already present from ATK:
 M365_TITLE_ID="T_your-title-id-here"  # Auto-generated by ATK
+```
-# You'll add these (see Getting Variables section below):
+```bash
+# .env.local.user (NOT checked in — secrets go here)
 AZURE_AI_OPENAI_ENDPOINT="<your-azure-openai-endpoint>"
 AZURE_AI_API_KEY="<your-api-key-from-azure-portal>"
 TENANT_ID="<your-tenant-id>"
 ```
+Add `.env.local.user` to your `.gitignore`:
+```gitignore
+# User-specific secrets — never commit
+.env.local.user
+env/.env.local.user
+```
 #### Option 2: For Non-ATK Projects
 Create `env/.env.dev` in your project directory:
@@ -69,16 +82,12 @@ M365_AGENT_ID="your-agent-id"  # e.g., U_0dc4a8a2-b95f-edac-91c8-d802023ec2d4
 # You'll add these (see Getting Variables section below):
 AZURE_AI_OPENAI_ENDPOINT="<your-azure-openai-endpoint>"
 AZURE_AI_API_KEY="<your-api-key-from-azure-portal>"
-TENANT_ID="<your-tenant-id>"
-```
-#### Optional Overrides
-```bash
 AZURE_AI_API_VERSION="2024-12-01-preview"  # default
 AZURE_AI_MODEL_NAME="gpt-4o-mini"           # default
+TENANT_ID="<your-tenant-id>"
 ```
-You can also override the agent ID at runtime: `runevals --agent-id "custom-id"`
+You can also override the agent ID at runtime: `runevals --m365-agent-id "custom-id"`
 ---
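The precedence list in the README hunk above can be sketched as a first-definition-wins merge. This is only an illustration: `resolve_env` and the sample values are hypothetical, and it assumes that a source listed earlier wins over a later one. The actual resolution lives in `src/clients/node-js/lib/env-loader.js`, which is not shown in this excerpt and may behave differently (the README also describes `.env.local.user` as an override).

```python
from typing import Dict, List, Tuple


def resolve_env(sources: List[Tuple[str, Dict[str, str]]]) -> Dict[str, Tuple[str, str]]:
    """Merge env sources listed highest-precedence first.

    Returns a mapping of variable name -> (value, source name).
    Assumption: the first source that defines a key wins.
    """
    resolved: Dict[str, Tuple[str, str]] = {}
    for source_name, values in sources:
        for key, value in values.items():
            # setdefault keeps the value from the earliest (highest-precedence) source
            resolved.setdefault(key, (value, source_name))
    return resolved


# Hypothetical contents, ordered as in the README's numbered list.
sources = [
    (".env.local", {"M365_TITLE_ID": "T_your-title-id-here"}),
    (".env.local.user", {"AZURE_AI_API_KEY": "<your-api-key>", "TENANT_ID": "<your-tenant-id>"}),
    ("env/.env.dev", {"TENANT_ID": "<dev-tenant-id>"}),
    ("system", {"AZURE_AI_MODEL_NAME": "gpt-4o-mini"}),
]
resolved = resolve_env(sources)
```

Recording the source name next to each value makes it easy to see which file supplied a variable when two files define the same key.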
@@ -103,13 +112,13 @@ az account show --query tenantId
 ```
 ### 2. Agent ID
-- If you have created your agent using Agents Toolkit, then the agent-id is the M365_TITLE_ID in .env.local file
+- If you have created your agent using Agents Toolkit, the tool automatically reads `M365_TITLE_ID` from `.env.local` and constructs the agent ID.
 - If you don't know your agent-id, the tool offers agent selection when you try to submit a job. The agent selection has both the name, description, agent-id so that you can select the right agent.
 ### 3. Azure OpenAI Endpoint and API Key
-You need both the endpoint URL and API key from your Azure OpenAI resource for "LLM as a Judge" evaluations.
+You need both the endpoint URL and API key from your Azure OpenAI resource for "LLM as a Judge" evaluations. This Azure OpenAI endpoint can be in any tenant or account, and you will just configure the Evals tool using `AZURE_AI_OPENAI_ENDPOINT` and `AZURE_AI_API_KEY`.
 **How to obtain:**
@@ -226,7 +235,7 @@ When you run an evaluation from your agent project directory, you'll see:
 🚀 M365 Copilot Agent Evaluations CLI
 📂 Loading environment: dev
-🤖 Agent ID (from M365_TITLE_ID): T_my-agent.declarativeAgent
+🤖 Agent ID: T_my-agent.declarativeAgent
 📄 Using prompts file: ./evals/evals.json
 📊 Running evaluations...
@@ -318,7 +327,7 @@ Options:
   --prompts-file <file>         JSON file with prompts
   -o, --output <file>           output file (JSON, CSV, or HTML)
   -i, --interactive             interactive prompt entry mode
-  --agent-id <id>               override agent ID
+  --m365-agent-id <id>          override agent ID
   --env <environment>           environment name (default: dev)
   --init-only                   just setup, don't run evals
   -h, --help                    display help
@@ -393,6 +402,43 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
 For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
 contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
+## Schema Versioning
+The eval document schema is versioned independently from the CLI, following [Semantic Versioning](https://semver.org/). This allows external consumers to depend on a stable contract without coupling to CLI release cycles.
+- **Schema location**: [`schema/v1/eval-document.schema.json`](schema/v1/eval-document.schema.json)
+- **Schema changelog**: [`schema/CHANGELOG.md`](schema/CHANGELOG.md)
+- **Consumer quickstart**: [`specs/wi-6081652-dataset-schema-versioning/quickstart.md`](specs/wi-6081652-dataset-schema-versioning/quickstart.md)
+### Eval Document Format
+Eval documents should include a `schemaVersion` field:
+```json
+{
+  "schemaVersion": "1.0.0",
+  "items": [
+    {
+      "prompt": "What is Microsoft 365?",
+      "expected_response": "Microsoft 365 is a cloud-based productivity suite."
+    }
+  ]
+}
+```
+### Auto-Upgrade Behavior
+When the CLI loads an eval document:
+- **Legacy documents** (missing `schemaVersion`): Automatically upgraded with a timestamped backup (e.g., `file.json.bak.20260205143052`)
+- **Older versions** (same major version): `schemaVersion` field updated without backup
+- **Invalid documents**: CLI exits with an error message and guidance to review the schema changelog
+- **Future versions**: CLI rejects with a message suggesting a CLI update
+### Version Compatibility
+Within a major version (e.g., 1.x.x), backward compatibility is guaranteed. Documents valid against 1.0.0 will remain valid against 1.1.0, 1.2.0, etc.
 ## Trademarks
 This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
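The "Auto-Upgrade Behavior" rules added in this README hunk could be modelled roughly as below. This is a sketch under stated assumptions, not the code in `schema_handler.py` (its contents are not shown here): the names `classify` and `backup_name`, the category labels, and the `"unsupported"` branch for a lower major version are all hypothetical, and the current schema version is assumed to be 1.0.0 as in `schema/version.json`.

```python
import re
from datetime import datetime
from typing import Tuple

CURRENT_SCHEMA_VERSION: Tuple[int, int, int] = (1, 0, 0)  # assumption: from schema/version.json
_SEMVER = re.compile(r"(\d+)\.(\d+)\.(\d+)", re.ASCII)


def classify(document: dict, current: Tuple[int, int, int] = CURRENT_SCHEMA_VERSION) -> str:
    """Return a hypothetical label describing how a loader might treat the document."""
    if "schemaVersion" not in document:
        return "legacy"  # README: upgraded, with a timestamped backup
    raw = document["schemaVersion"]
    match = _SEMVER.fullmatch(raw) if isinstance(raw, str) else None
    if match is None:
        return "invalid"  # README: CLI exits with an error
    version = tuple(int(part) for part in match.groups())
    if version > current:
        return "future"  # README: rejected, suggests a CLI update
    if version[0] < current[0]:
        return "unsupported"  # assumption: the README does not cover older majors
    if version < current:
        return "older"  # README: schemaVersion updated, no backup
    return "current"


def backup_name(path: str, now: datetime) -> str:
    """Build a backup file name in the style of the README example."""
    return f"{path}.bak.{now:%Y%m%d%H%M%S}"
```

For instance, `backup_name("file.json", datetime(2026, 2, 5, 14, 30, 52))` yields `file.json.bak.20260205143052`, the form shown in the README.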

package/package.json CHANGED

@@ -1,8 +1,9 @@
 {
   "name": "@microsoft/m365-copilot-eval",
-  "version": "1.1.0-preview.1",
+  "version": "1.2.0-preview.1",
+  "minCliVersion": "1.0.1-preview.1",
   "description": "Zero-config Node.js wrapper for M365 Copilot Agent Evaluations CLI (Python-based Azure AI Evaluation SDK)",
-  "publishDate": "2026-02-03",
+  "publishDate": "2026-03-11",
   "main": "src/clients/node-js/lib/index.js",
   "type": "module",
   "bin": {
@@ -71,6 +72,7 @@
     "src/clients/cli/**/*.py",
     "src/clients/cli/requirements.txt",
     "src/clients/cli/samples/",
+    "schema/",
     "README.md",
     "LICENSE"
   ],
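This hunk adds a `minCliVersion` field next to `version`. How the CLI uses it is not visible in this excerpt (the added `version_check.py` is not shown), so the following is only a sketch of how a minimum-version comparison could work under Semantic Versioning precedence rules, including prerelease tags such as `-preview.1`. The function names are hypothetical, and build metadata (`+...`) is not handled.

```python
from typing import List, Tuple


def parse(version: str) -> Tuple[Tuple[int, int, int], List[str]]:
    """Split 'X.Y.Z[-pre.release]' into a numeric core and prerelease identifiers."""
    core, _, prerelease = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    identifiers = prerelease.split(".") if prerelease else []
    return (major, minor, patch), identifiers


def compare_prerelease(a: List[str], b: List[str]) -> int:
    """SemVer rule: a release outranks any prerelease of the same core version."""
    if not a or not b:
        return (not a) - (not b)
    for x, y in zip(a, b):
        if x == y:
            continue
        if x.isdigit() and y.isdigit():
            return -1 if int(x) < int(y) else 1
        if x.isdigit() != y.isdigit():
            return -1 if x.isdigit() else 1  # numeric identifiers sort lower
        return -1 if x < y else 1
    return (len(a) > len(b)) - (len(a) < len(b))


def compare(a: str, b: str) -> int:
    """Return -1, 0, or 1 as version a is lower than, equal to, or higher than b."""
    (core_a, pre_a), (core_b, pre_b) = parse(a), parse(b)
    if core_a != core_b:
        return -1 if core_a < core_b else 1
    return compare_prerelease(pre_a, pre_b)


def meets_minimum(installed: str, minimum: str) -> bool:
    return compare(installed, minimum) >= 0
```

With the values from this diff, `meets_minimum("1.2.0-preview.1", "1.0.1-preview.1")` is true, since the core version 1.2.0 is higher than 1.0.1 before prerelease tags are even considered.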

package/schema/CHANGELOG.md ADDED

@@ -0,0 +1,21 @@
+# Changelog - M365 Copilot Eval Document Schema
+All notable changes to the eval document schema will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [1.0.0] - 2026-02-19
+### Added
+- Initial schema version for eval document contract
+- Root document structure: `schemaVersion` (required), `metadata` (optional), `items` (required)
+- `DocumentMetadata` with `name`, `description`, `createdAt`, `createdBy`, `evaluatedAt`, `tags`, `agentId`, and `extensions`
+- `EvalItem` with `prompt` (required), `expected_response`, `response`, `context`, `citations`, `scores`, and `extensions`
+- `ScoreCollection` with `relevance`, `coherence`, `groundedness`, `toolCallAccuracy`, and `citations` scores
+- `EvalScore` standard score structure (1-5 scale) with `score`, `result`, `threshold`, `reason`, `evaluator`
+- `CitationScore` for citation-specific evaluation with `count`, `result`, `threshold`, `format`, `citations`
+- `Citation` reference object with `index`, `text`, `source`
+- Extension points at `metadata.extensions` and `items[].extensions` for forward compatibility
+- `additionalProperties: true` on all objects for forward compatibility within major version

package/schema/v1/eval-document.schema.json ADDED

@@ -0,0 +1,236 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://raw.githubusercontent.com/microsoft/M365-Copilot-Agent-Evals/refs/heads/main/schema/v1/eval-document.schema.json",
+  "title": "M365 Copilot Eval Document",
+  "description": "Schema for evaluation documents used by M365 Copilot Agent Evals CLI. Version 1.0.0.",
+  "type": "object",
+  "required": ["schemaVersion", "items"],
+  "additionalProperties": true,
+  "properties": {
+    "$schema": {
+      "type": "string",
+      "format": "uri",
+      "description": "JSON Schema URI for editor validation support"
+    },
+    "schemaVersion": {
+      "type": "string",
+      "pattern": "^1\\.\\d+\\.\\d+$",
+      "description": "SemVer string identifying the schema version this document conforms to (e.g., '1.0.0')",
+      "examples": ["1.0.0", "1.1.0", "1.2.0"]
+    },
+    "metadata": {
+      "$ref": "#/$defs/DocumentMetadata"
+    },
+    "items": {
+      "type": "array",
+      "minItems": 1,
+      "description": "Array of evaluation items (prompts and optionally responses with scores)",
+      "items": {
+        "$ref": "#/$defs/EvalItem"
+      }
+    }
+  },
+  "$defs": {
+    "DocumentMetadata": {
+      "type": "object",
+      "description": "Optional metadata about the evaluation document",
+      "additionalProperties": true,
+      "properties": {
+        "name": {
+          "type": "string",
+          "description": "Human-readable name for the evaluation set"
+        },
+        "description": {
+          "type": "string",
+          "description": "Description of what this evaluation set tests"
+        },
+        "createdAt": {
+          "type": "string",
+          "format": "date-time",
+          "description": "ISO 8601 timestamp when the document was created"
+        },
+        "createdBy": {
+          "type": "string",
+          "description": "Author or system that created the document"
+        },
+        "evaluatedAt": {
+          "type": "string",
+          "format": "date-time",
+          "description": "ISO 8601 timestamp when evaluation was performed (output documents)"
+        },
+        "tags": {
+          "type": "array",
+          "items": {
+            "type": "string"
+          },
+          "description": "Tags for categorization and filtering"
+        },
+        "agentId": {
+          "type": "string",
+          "description": "M365 Agent ID this evaluation targets"
+        },
+        "extensions": {
+          "type": "object",
+          "additionalProperties": true,
+          "description": "Extension point for custom metadata. Use reverse-domain notation for field names."
+        }
+      }
+    },
+    "EvalItem": {
+      "type": "object",
+      "description": "A single evaluation item containing a prompt and optionally a response with scores",
+      "required": ["prompt"],
+      "additionalProperties": true,
+      "properties": {
+        "prompt": {
+          "type": "string",
+          "minLength": 1,
+          "description": "The input prompt to evaluate"
+        },
+        "expected_response": {
+          "type": "string",
+          "description": "Expected or ideal response for comparison during evaluation"
+        },
+        "response": {
+          "type": "string",
+          "description": "Actual response from the agent (present in output documents)"
+        },
+        "context": {
+          "type": "string",
+          "description": "Additional context for grounding evaluation"
+        },
+        "citations": {
+          "type": "array",
+          "items": {
+            "$ref": "#/$defs/Citation"
+          },
+          "description": "Citations included in the response"
+        },
+        "scores": {
+          "$ref": "#/$defs/ScoreCollection"
+        },
+        "extensions": {
+          "type": "object",
+          "additionalProperties": true,
+          "description": "Extension point for custom item-level fields"
+        }
+      }
+    },
+    "ScoreCollection": {
+      "type": "object",
+      "description": "Collection of evaluation scores for an item",
+      "additionalProperties": true,
+      "properties": {
+        "relevance": {
+          "$ref": "#/$defs/EvalScore",
+          "description": "Relevance score (1-5)"
+        },
+        "coherence": {
+          "$ref": "#/$defs/EvalScore",
+          "description": "Coherence score (1-5)"
+        },
+        "groundedness": {
+          "$ref": "#/$defs/EvalScore",
+          "description": "Groundedness score (1-5)"
+        },
+        "toolCallAccuracy": {
+          "$ref": "#/$defs/EvalScore",
+          "description": "Tool call accuracy score (1-5)"
+        },
+        "citations": {
+          "$ref": "#/$defs/CitationScore",
+          "description": "Citation evaluation results"
+        }
+      }
+    },
+    "EvalScore": {
+      "type": "object",
+      "description": "Standard evaluation score (1-5 scale)",
+      "required": ["score", "result", "threshold"],
+      "additionalProperties": true,
+      "properties": {
+        "score": {
+          "type": "number",
+          "minimum": 1,
+          "maximum": 5,
+          "description": "Numeric score from 1.0 (worst) to 5.0 (best)"
+        },
+        "result": {
+          "type": "string",
+          "enum": ["pass", "fail"],
+          "description": "Pass/fail result based on threshold comparison"
+        },
+        "threshold": {
+          "type": "number",
+          "minimum": 1,
+          "maximum": 5,
+          "description": "Threshold used for pass/fail determination"
+        },
+        "reason": {
+          "type": "string",
+          "description": "Explanation of why this score was assigned"
+        },
+        "evaluator": {
+          "type": "string",
+          "description": "Name or identifier of the evaluator that produced this score"
+        }
+      }
+    },
+    "CitationScore": {
+      "type": "object",
+      "description": "Citation-specific evaluation score",
+      "required": ["count", "result", "threshold"],
+      "additionalProperties": true,
+      "properties": {
+        "count": {
+          "type": "integer",
+          "minimum": 0,
+          "description": "Number of citations found in the response"
+        },
+        "result": {
+          "type": "string",
+          "enum": ["pass", "fail"],
+          "description": "Pass/fail result based on citation count vs threshold"
+        },
+        "threshold": {
+          "type": "integer",
+          "minimum": 0,
+          "description": "Minimum required number of citations for pass"
+        },
+        "format": {
+          "type": "string",
+          "description": "Citation format detected. Known values: 'oai_unicode', 'bracket', 'mixed'. Additional formats may be added.",
+          "examples": ["oai_unicode", "bracket", "mixed"]
+        },
+        "citations": {
+          "type": "array",
+          "items": {
+            "$ref": "#/$defs/Citation"
+          },
+          "description": "Parsed citation objects"
+        }
+      }
+    },
+    "Citation": {
+      "type": "object",
+      "description": "A single citation reference",
+      "required": ["index"],
+      "additionalProperties": true,
+      "properties": {
+        "index": {
+          "type": "integer",
+          "minimum": 1,
+          "description": "Citation index (1-based)"
+        },
+        "text": {
+          "type": "string",
+          "description": "The cited text"
+        },
+        "source": {
+          "type": "string",
+          "description": "Source reference (URL, document name, etc.)"
+        }
+      }
+    }
+  }
+}
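The root-level constraints of the schema above (required `schemaVersion` and `items`, the `1.x.y` version pattern, `minItems: 1`, and a non-empty `prompt` on every item) can be checked by hand with the standard library. This is a simplified sketch, not a full JSON Schema validator and not the logic in `schema_handler.py`; `check_document` and its error strings are hypothetical, and nested definitions such as `EvalScore` are not checked.

```python
import re
from typing import Any, List

# Same intent as the schema's "^1\.\d+\.\d+$" pattern, applied with fullmatch.
_SCHEMA_VERSION = re.compile(r"1\.\d+\.\d+", re.ASCII)


def check_document(doc: Any) -> List[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    if not isinstance(doc, dict):
        return ["document must be a JSON object"]
    errors: List[str] = []

    if "schemaVersion" not in doc:
        errors.append("schemaVersion is required")
    else:
        version = doc["schemaVersion"]
        if not isinstance(version, str) or not _SCHEMA_VERSION.fullmatch(version):
            errors.append("schemaVersion must look like 1.x.y")

    if "items" not in doc:
        errors.append("items is required")
    elif not isinstance(doc["items"], list):
        errors.append("items must be an array")
    elif not doc["items"]:
        errors.append("items must contain at least one entry")
    else:
        for index, item in enumerate(doc["items"]):
            prompt = item.get("prompt") if isinstance(item, dict) else None
            if not isinstance(prompt, str) or not prompt:
                errors.append(f"items[{index}].prompt must be a non-empty string")
    return errors
```

Run against the example files that this release adds under `schema/v1/examples/`, the four `invalid/` documents each produce at least one error and `valid/minimal.json` produces none.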

package/schema/v1/examples/invalid/empty-items.json ADDED

@@ -0,0 +1,4 @@
+{
+  "schemaVersion": "1.0.0",
+  "items": []
+}

package/schema/v1/examples/invalid/invalid-semver.json ADDED

@@ -0,0 +1,8 @@
+{
+  "schemaVersion": "version-one",
+  "items": [
+    {
+      "prompt": "What is Microsoft 365?"
+    }
+  ]
+}

package/schema/v1/examples/invalid/missing-schema-version.json ADDED

@@ -0,0 +1,7 @@
+{
+  "items": [
+    {
+      "prompt": "What is Microsoft 365?"
+    }
+  ]
+}

package/schema/v1/examples/invalid/wrong-type.json ADDED

@@ -0,0 +1,6 @@
+{
+  "schemaVersion": "1.0.0",
+  "items": {
+    "prompt": "This should be an array, not an object"
+  }
+}

package/schema/v1/examples/valid/comprehensive.json ADDED

@@ -0,0 +1,92 @@
+{
+  "$schema": "https://raw.githubusercontent.com/microsoft/M365-Copilot-Agent-Evals/refs/heads/main/schema/v1/eval-document.schema.json",
+  "schemaVersion": "1.0.0",
+  "metadata": {
+    "name": "Graph API Evaluation Set",
+    "description": "Test prompts for Microsoft Graph API knowledge",
+    "createdAt": "2026-01-20T10:00:00Z",
+    "createdBy": "eval-team",
+    "evaluatedAt": "2026-01-20T10:30:00Z",
+    "tags": ["graph", "api", "authentication"],
+    "agentId": "12345678-1234-1234-1234-123456789abc",
+    "extensions": {
+      "com.contoso.department": "engineering",
+      "com.contoso.priority": "high"
+    }
+  },
+  "items": [
+    {
+      "prompt": "What is Microsoft Graph API?",
+      "expected_response": "Microsoft Graph API is a unified endpoint for accessing Microsoft services.",
+      "context": "User is a developer new to Microsoft ecosystem.",
+      "response": "Microsoft Graph API is a gateway to data and intelligence in Microsoft 365.",
+      "scores": {
+        "relevance": {
+          "score": 5.0,
+          "result": "pass",
+          "threshold": 3,
+          "reason": "Response directly addresses the query with accurate information.",
+          "evaluator": "azure-ai-relevance"
+        },
+        "coherence": {
+          "score": 5.0,
+          "result": "pass",
+          "threshold": 3,
+          "reason": "Response is well-structured and easy to follow.",
+          "evaluator": "azure-ai-coherence"
+        },
+        "groundedness": {
+          "score": 4.0,
+          "result": "pass",
+          "threshold": 3,
+          "reason": "Response is grounded in the provided context.",
+          "evaluator": "azure-ai-groundedness"
+        },
+        "toolCallAccuracy": {
+          "score": 5.0,
+          "result": "pass",
+          "threshold": 3,
+          "reason": "Tool calls are accurate and well-formed.",
+          "evaluator": "azure-ai-tool-accuracy"
+        },
+        "citations": {
+          "count": 2,
+          "result": "pass",
+          "threshold": 1,
+          "format": "bracket",
+          "citations": [
+            {
+              "index": 1,
+              "text": "Microsoft Graph is a unified API endpoint",
+              "source": "https://learn.microsoft.com/graph/overview"
+            },
+            {
+              "index": 2,
+              "text": "Access data and intelligence in Microsoft 365",
+              "source": "https://learn.microsoft.com/graph/use-the-api"
+            }
+          ]
+        }
+      },
+      "citations": [
+        {
+          "index": 1,
+          "text": "Microsoft Graph is a unified API endpoint",
+          "source": "https://learn.microsoft.com/graph/overview"
+        },
+        {
+          "index": 2,
+          "text": "Access data and intelligence in Microsoft 365",
+          "source": "https://learn.microsoft.com/graph/use-the-api"
+        }
+      ],
+      "extensions": {
+        "com.contoso.difficulty": "easy"
+      }
+    },
+    {
+      "prompt": "How do I authenticate with Microsoft Graph?",
+      "expected_response": "You can authenticate using OAuth 2.0 or client credentials flow."
+    }
+  ]
+}
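The `result` fields in the example above follow from a score and a threshold. A minimal sketch of that relationship is given below. For `CitationScore` the schema describes `threshold` as the minimum citation count required to pass; for `EvalScore` the schema only says "threshold comparison", so the `>=` direction used here is an assumption. The helper names are hypothetical and are not taken from the package.

```python
from typing import Any, Dict, Optional


def make_eval_score(score: float, threshold: float,
                    reason: Optional[str] = None,
                    evaluator: Optional[str] = None) -> Dict[str, Any]:
    """Build an EvalScore-shaped dict; assumes pass means score >= threshold."""
    for name, value in (("score", score), ("threshold", threshold)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    result: Dict[str, Any] = {
        "score": score,
        "result": "pass" if score >= threshold else "fail",
        "threshold": threshold,
    }
    # Optional fields are only emitted when provided.
    if reason is not None:
        result["reason"] = reason
    if evaluator is not None:
        result["evaluator"] = evaluator
    return result


def make_citation_score(count: int, threshold: int) -> Dict[str, Any]:
    """Build a CitationScore-shaped dict; threshold is the minimum count to pass."""
    if count < 0 or threshold < 0:
        raise ValueError("count and threshold must be non-negative")
    return {
        "count": count,
        "result": "pass" if count >= threshold else "fail",
        "threshold": threshold,
    }
```

Applied to the values in `comprehensive.json`, a groundedness score of 4.0 against a threshold of 3 and a citation count of 2 against a threshold of 1 both come out as `"pass"`, matching the example document.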

package/schema/v1/examples/valid/minimal.json ADDED

@@ -0,0 +1,8 @@
+{
+  "schemaVersion": "1.0.0",
+  "items": [
+    {
+      "prompt": "What is Microsoft 365?"
+    }
+  ]
+}

package/schema/version.json ADDED

@@ -0,0 +1,6 @@
+{
+  "version": "1.0.0",
+  "releaseDate": "2026-02-19",
+  "schemaId": "https://raw.githubusercontent.com/microsoft/M365-Copilot-Agent-Evals/refs/heads/main/schema/v1/eval-document.schema.json",
+  "description": "M365 Copilot Eval Document Schema"
+}