@microsoft/m365-copilot-eval 1.1.0-preview.1 → 1.2.0-preview.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +64 -18
- package/package.json +4 -2
- package/schema/CHANGELOG.md +21 -0
- package/schema/v1/eval-document.schema.json +236 -0
- package/schema/v1/examples/invalid/empty-items.json +4 -0
- package/schema/v1/examples/invalid/invalid-semver.json +8 -0
- package/schema/v1/examples/invalid/missing-schema-version.json +7 -0
- package/schema/v1/examples/invalid/wrong-type.json +6 -0
- package/schema/v1/examples/valid/comprehensive.json +92 -0
- package/schema/v1/examples/valid/minimal.json +8 -0
- package/schema/version.json +6 -0
- package/src/clients/cli/custom_evaluators/CitationsEvaluator.py +77 -33
- package/src/clients/cli/main.py +197 -30
- package/src/clients/cli/readme.md +5 -5
- package/src/clients/cli/requirements.txt +2 -0
- package/src/clients/cli/samples/starter.json +13 -10
- package/src/clients/cli/schema_handler.py +349 -0
- package/src/clients/cli/version_check.py +139 -0
- package/src/clients/node-js/bin/runevals.js +34 -103
- package/src/clients/node-js/config/default.js +1 -1
- package/src/clients/node-js/lib/env-loader.js +126 -0
- package/src/clients/node-js/lib/progress.js +36 -36
- package/src/clients/node-js/lib/python-runtime.js +4 -6
- package/src/clients/node-js/lib/venv-manager.js +65 -32
package/README.md
CHANGED
|
@@ -9,6 +9,8 @@ A **zero-configuration** CLI for evaluating M365 Copilot agents. Send prompts to
|
|
|
9
9
|
- - Relevance (1–5)
|
|
10
10
|
- - Coherence (1–5)
|
|
11
11
|
- - Groundedness (1–5)
|
|
12
|
+
- - Tool Call Accuracy (1–5)
|
|
13
|
+
- - Citations (0–1)
|
|
12
14
|
- Multiple input modes: command‑line list, JSON file, interactive.
|
|
13
15
|
- Multiple output formats: console (colorized), JSON, CSV, HTML (auto‑opens report).
|
|
14
16
|
|
|
@@ -26,37 +28,48 @@ A **zero-configuration** CLI for evaluating M365 Copilot agents. Send prompts to
|
|
|
26
28
|
### Install the Tool
|
|
27
29
|
|
|
28
30
|
1. Make sure you have Node.js
|
|
29
|
-
2. Run `npm install @microsoft/m365-copilot-eval`
|
|
31
|
+
2. Run `npm install -g @microsoft/m365-copilot-eval`
|
|
30
32
|
|
|
31
33
|
### Setup Steps
|
|
32
34
|
|
|
33
35
|
Now, set up where you'll store your environment variables:
|
|
34
36
|
|
|
35
37
|
**Are you using M365 Agents Toolkit (ATK)?**
|
|
36
|
-
-
|
|
37
|
-
-
|
|
38
|
+
- - **Yes** → You already have `.env.local` in your project with `M365_TITLE_ID` (automatically used as your agent ID). Keep non-secret config there and put secrets like `AZURE_AI_API_KEY` in `.env.local.user` (never committed).
|
|
39
|
+
- - **No** → Create a new `env/.env.dev` file in your project directory. You'll add all variables there.
|
|
38
40
|
|
|
39
41
|
The CLI loads environment variables from multiple sources (in order of precedence):
|
|
40
42
|
|
|
41
43
|
1. **`.env.local`** in current directory (auto-detected, ideal for ATK projects)
|
|
42
|
-
2. **`env/.env.
|
|
43
|
-
3.
|
|
44
|
+
2. **`.env.local.user`** in current directory — or **`env/.env.local.user`** — auto-loaded as a user-specific override (never commit this file; put secrets here)
|
|
45
|
+
3. **`env/.env.{environment}`** via `--env` flag (e.g., `--env dev` loads `env/.env.dev`)
|
|
46
|
+
4. **System environment variables**
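The layered lookup listed above can be pictured as a merge of key/value maps. The following is only a sketch in Python, not the CLI's loader (this release adds `src/clients/node-js/lib/env-loader.js`, whose body is not shown in this diff). `parse_env_text` and `merge_layers` are hypothetical names. The README does not state which end of its list has the highest priority, so the sketch leaves the ordering to the caller and simply lets the first layer that defines a key win.

```python
def parse_env_text(text):
    """Parse simple KEY="value"  # comment lines into a dict (hypothetical helper)."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, full-line comments, and anything without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, rest = line.partition("=")
        rest = rest.strip()
        if rest.startswith('"'):
            # Quoted value: take everything up to the closing quote.
            end = rest.find('"', 1)
            value = rest[1:end] if end != -1 else rest[1:]
        else:
            # Unquoted value: drop a trailing inline comment, if any.
            value = rest.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def merge_layers(layers):
    """Merge dicts ordered by the caller; the first layer defining a key wins."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            merged.setdefault(key, value)
    return merged


local = parse_env_text('M365_TITLE_ID="T_your-title-id-here"  # Auto-generated by ATK')
user = parse_env_text('TENANT_ID="<your-tenant-id>"')
print(merge_layers([local, user]))
```

Keeping the parse step separate from the merge step makes it easy to reorder the layers if the real precedence turns out to differ from the assumption made here.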
|
|
44
47
|
|
|
45
48
|
#### Option 1: For M365 Agents Toolkit (ATK) Projects
|
|
46
49
|
|
|
47
|
-
|
|
50
|
+
ATK projects already check in `.env.local` with agent configuration. **Do not put secrets in `.env.local`** — use `.env.local.user` instead, which is loaded automatically and should be added to your `.gitignore`.
|
|
48
51
|
|
|
49
52
|
```bash
|
|
50
|
-
# .env.local (
|
|
53
|
+
# .env.local (checked in — no secrets!)
|
|
51
54
|
# Already present from ATK:
|
|
52
55
|
M365_TITLE_ID="T_your-title-id-here" # Auto-generated by ATK
|
|
56
|
+
```
|
|
53
57
|
|
|
54
|
-
|
|
58
|
+
```bash
|
|
59
|
+
# .env.local.user (NOT checked in — secrets go here)
|
|
55
60
|
AZURE_AI_OPENAI_ENDPOINT="<your-azure-openai-endpoint>"
|
|
56
61
|
AZURE_AI_API_KEY="<your-api-key-from-azure-portal>"
|
|
57
62
|
TENANT_ID="<your-tenant-id>"
|
|
58
63
|
```
|
|
59
64
|
|
|
65
|
+
Add `.env.local.user` to your `.gitignore`:
|
|
66
|
+
|
|
67
|
+
```gitignore
|
|
68
|
+
# User-specific secrets — never commit
|
|
69
|
+
.env.local.user
|
|
70
|
+
env/.env.local.user
|
|
71
|
+
```
|
|
72
|
+
|
|
60
73
|
#### Option 2: For Non-ATK Projects
|
|
61
74
|
|
|
62
75
|
Create `env/.env.dev` in your project directory:
|
|
@@ -69,16 +82,12 @@ M365_AGENT_ID="your-agent-id" # e.g., U_0dc4a8a2-b95f-edac-91c8-d802023ec2d4
|
|
|
69
82
|
# You'll add these (see Getting Variables section below):
|
|
70
83
|
AZURE_AI_OPENAI_ENDPOINT="<your-azure-openai-endpoint>"
|
|
71
84
|
AZURE_AI_API_KEY="<your-api-key-from-azure-portal>"
|
|
72
|
-
TENANT_ID="<your-tenant-id>"
|
|
73
|
-
```
|
|
74
|
-
|
|
75
|
-
#### Optional Overrides
|
|
76
|
-
```bash
|
|
77
85
|
AZURE_AI_API_VERSION="2024-12-01-preview" # default
|
|
78
86
|
AZURE_AI_MODEL_NAME="gpt-4o-mini" # default
|
|
87
|
+
TENANT_ID="<your-tenant-id>"
|
|
79
88
|
```
|
|
80
89
|
|
|
81
|
-
You can also override the agent ID at runtime: `runevals --agent-id "custom-id"`
|
|
90
|
+
You can also override the agent ID at runtime: `runevals --m365-agent-id "custom-id"`
|
|
82
91
|
|
|
83
92
|
---
|
|
84
93
|
|
|
@@ -103,13 +112,13 @@ az account show --query tenantId
|
|
|
103
112
|
```
|
|
104
113
|
|
|
105
114
|
### 2. Agent ID
|
|
106
|
-
- If you have created your agent using Agents Toolkit,
|
|
115
|
+
- If you have created your agent using Agents Toolkit, the tool automatically reads `M365_TITLE_ID` from `.env.local` and constructs the agent ID.
|
|
107
116
|
- If you don't know your agent ID, the tool offers agent selection when you try to submit a job. The selection list shows each agent's name, description, and agent ID so that you can pick the right one.
|
|
108
117
|
|
|
109
118
|
|
|
110
119
|
### 3. Azure OpenAI Endpoint and API Key
|
|
111
120
|
|
|
112
|
-
You need both the endpoint URL and API key from your Azure OpenAI resource for "LLM as a Judge" evaluations.
|
|
121
|
+
You need both the endpoint URL and API key from your Azure OpenAI resource for "LLM as a Judge" evaluations. The Azure OpenAI resource can be in any tenant or account; you only need to point the Evals tool at it with `AZURE_AI_OPENAI_ENDPOINT` and `AZURE_AI_API_KEY`.
|
|
113
122
|
|
|
114
123
|
**How to obtain:**
|
|
115
124
|
|
|
@@ -226,7 +235,7 @@ When you run an evaluation from your agent project directory, you'll see:
|
|
|
226
235
|
🚀 M365 Copilot Agent Evaluations CLI
|
|
227
236
|
|
|
228
237
|
📂 Loading environment: dev
|
|
229
|
-
🤖 Agent ID
|
|
238
|
+
🤖 Agent ID: T_my-agent.declarativeAgent
|
|
230
239
|
📄 Using prompts file: ./evals/evals.json
|
|
231
240
|
|
|
232
241
|
📊 Running evaluations...
|
|
@@ -318,7 +327,7 @@ Options:
|
|
|
318
327
|
--prompts-file <file> JSON file with prompts
|
|
319
328
|
-o, --output <file> output file (JSON, CSV, or HTML)
|
|
320
329
|
-i, --interactive interactive prompt entry mode
|
|
321
|
-
--agent-id <id>
|
|
330
|
+
--m365-agent-id <id> override agent ID
|
|
322
331
|
--env <environment> environment name (default: dev)
|
|
323
332
|
--init-only just setup, don't run evals
|
|
324
333
|
-h, --help display help
|
|
@@ -393,6 +402,43 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
|
|
|
393
402
|
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
|
|
394
403
|
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
|
|
395
404
|
|
|
405
|
+
## Schema Versioning
|
|
406
|
+
|
|
407
|
+
The eval document schema is versioned independently from the CLI, following [Semantic Versioning](https://semver.org/). This allows external consumers to depend on a stable contract without coupling to CLI release cycles.
|
|
408
|
+
|
|
409
|
+
- **Schema location**: [`schema/v1/eval-document.schema.json`](schema/v1/eval-document.schema.json)
|
|
410
|
+
- **Schema changelog**: [`schema/CHANGELOG.md`](schema/CHANGELOG.md)
|
|
411
|
+
- **Consumer quickstart**: [`specs/wi-6081652-dataset-schema-versioning/quickstart.md`](specs/wi-6081652-dataset-schema-versioning/quickstart.md)
|
|
412
|
+
|
|
413
|
+
### Eval Document Format
|
|
414
|
+
|
|
415
|
+
Eval documents should include a `schemaVersion` field:
|
|
416
|
+
|
|
417
|
+
```json
|
|
418
|
+
{
|
|
419
|
+
"schemaVersion": "1.0.0",
|
|
420
|
+
"items": [
|
|
421
|
+
{
|
|
422
|
+
"prompt": "What is Microsoft 365?",
|
|
423
|
+
"expected_response": "Microsoft 365 is a cloud-based productivity suite."
|
|
424
|
+
}
|
|
425
|
+
]
|
|
426
|
+
}
|
|
427
|
+
```
|
|
428
|
+
|
|
429
|
+
### Auto-Upgrade Behavior
|
|
430
|
+
|
|
431
|
+
When the CLI loads an eval document:
|
|
432
|
+
|
|
433
|
+
- **Legacy documents** (missing `schemaVersion`): Automatically upgraded with a timestamped backup (e.g., `file.json.bak.20260205143052`)
|
|
434
|
+
- **Older versions** (same major version): `schemaVersion` field updated without backup
|
|
435
|
+
- **Invalid documents**: CLI exits with an error message and guidance to review the schema changelog
|
|
436
|
+
- **Future versions**: CLI rejects with a message suggesting a CLI update
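The four branches above can be sketched as a small classifier. This is an illustration in Python under stated assumptions, not the logic of `schema_handler.py` (its body is not shown in this diff). `classify` and `backup_name` are hypothetical names, the default `current` version is a placeholder, and "invalid" here covers only a malformed version string, whereas the CLI presumably validates the whole document. The backup timestamp format is inferred from the single example `file.json.bak.20260205143052`.

```python
import re
from datetime import datetime

SEMVER = re.compile(r"(\d+)\.(\d+)\.(\d+)", re.ASCII)


def classify(document, current="1.0.0"):
    """Return which auto-upgrade branch a loaded document would fall into."""
    version = document.get("schemaVersion")
    if version is None:
        return "legacy"        # upgrade after writing a timestamped backup
    match = SEMVER.fullmatch(version) if isinstance(version, str) else None
    if match is None:
        return "invalid"       # exit with an error
    doc = tuple(int(part) for part in match.groups())
    cur = tuple(int(part) for part in SEMVER.fullmatch(current).groups())
    if doc > cur:
        return "future"        # reject and suggest updating the CLI
    if doc[0] != cur[0]:
        return "unsupported"   # older major version: not covered by the README
    if doc < cur:
        return "older"         # bump schemaVersion in place, no backup
    return "current"


def backup_name(path, now):
    """Build a name like file.json.bak.20260205143052 (format inferred from the example)."""
    return f"{path}.bak.{now.strftime('%Y%m%d%H%M%S')}"


print(classify({"items": []}))
print(backup_name("file.json", datetime(2026, 2, 5, 14, 30, 52)))
```

Comparing versions as integer tuples avoids the string-comparison trap where `"1.10.0"` would sort before `"1.2.0"`.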
|
|
437
|
+
|
|
438
|
+
### Version Compatibility
|
|
439
|
+
|
|
440
|
+
Within a major version (e.g., 1.x.x), backward compatibility is guaranteed. Documents valid against 1.0.0 will remain valid against 1.1.0, 1.2.0, etc.
|
|
441
|
+
|
|
396
442
|
## Trademarks
|
|
397
443
|
|
|
398
444
|
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
|
package/package.json
CHANGED
|
@@ -1,8 +1,9 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@microsoft/m365-copilot-eval",
|
|
3
|
-
"version": "1.
|
|
3
|
+
"version": "1.2.0-preview.1",
|
|
4
|
+
"minCliVersion": "1.0.1-preview.1",
|
|
4
5
|
"description": "Zero-config Node.js wrapper for M365 Copilot Agent Evaluations CLI (Python-based Azure AI Evaluation SDK)",
|
|
5
|
-
"publishDate": "2026-
|
|
6
|
+
"publishDate": "2026-03-11",
|
|
6
7
|
"main": "src/clients/node-js/lib/index.js",
|
|
7
8
|
"type": "module",
|
|
8
9
|
"bin": {
|
|
@@ -71,6 +72,7 @@
|
|
|
71
72
|
"src/clients/cli/**/*.py",
|
|
72
73
|
"src/clients/cli/requirements.txt",
|
|
73
74
|
"src/clients/cli/samples/",
|
|
75
|
+
"schema/",
|
|
74
76
|
"README.md",
|
|
75
77
|
"LICENSE"
|
|
76
78
|
],
|
|
package/schema/CHANGELOG.md
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
# Changelog - M365 Copilot Eval Document Schema
|
|
2
|
+
|
|
3
|
+
All notable changes to the eval document schema will be documented in this file.
|
|
4
|
+
|
|
5
|
+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
6
|
+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
7
|
+
|
|
8
|
+
## [1.0.0] - 2026-02-19
|
|
9
|
+
|
|
10
|
+
### Added
|
|
11
|
+
|
|
12
|
+
- Initial schema version for eval document contract
|
|
13
|
+
- Root document structure: `schemaVersion` (required), `metadata` (optional), `items` (required)
|
|
14
|
+
- `DocumentMetadata` with `name`, `description`, `createdAt`, `createdBy`, `evaluatedAt`, `tags`, `agentId`, and `extensions`
|
|
15
|
+
- `EvalItem` with `prompt` (required), `expected_response`, `response`, `context`, `citations`, `scores`, and `extensions`
|
|
16
|
+
- `ScoreCollection` with `relevance`, `coherence`, `groundedness`, `toolCallAccuracy`, and `citations` scores
|
|
17
|
+
- `EvalScore` standard score structure (1-5 scale) with `score`, `result`, `threshold`, `reason`, `evaluator`
|
|
18
|
+
- `CitationScore` for citation-specific evaluation with `count`, `result`, `threshold`, `format`, `citations`
|
|
19
|
+
- `Citation` reference object with `index`, `text`, `source`
|
|
20
|
+
- Extension points at `metadata.extensions` and `items[].extensions` for forward compatibility
|
|
21
|
+
- `additionalProperties: true` on all objects for forward compatibility within major version
|
|
package/schema/v1/eval-document.schema.json
ADDED
|
@@ -0,0 +1,236 @@
|
|
|
1
|
+
{
|
|
2
|
+
"$schema": "https://json-schema.org/draft/2020-12/schema",
|
|
3
|
+
"$id": "https://raw.githubusercontent.com/microsoft/M365-Copilot-Agent-Evals/refs/heads/main/schema/v1/eval-document.schema.json",
|
|
4
|
+
"title": "M365 Copilot Eval Document",
|
|
5
|
+
"description": "Schema for evaluation documents used by M365 Copilot Agent Evals CLI. Version 1.0.0.",
|
|
6
|
+
"type": "object",
|
|
7
|
+
"required": ["schemaVersion", "items"],
|
|
8
|
+
"additionalProperties": true,
|
|
9
|
+
"properties": {
|
|
10
|
+
"$schema": {
|
|
11
|
+
"type": "string",
|
|
12
|
+
"format": "uri",
|
|
13
|
+
"description": "JSON Schema URI for editor validation support"
|
|
14
|
+
},
|
|
15
|
+
"schemaVersion": {
|
|
16
|
+
"type": "string",
|
|
17
|
+
"pattern": "^1\\.\\d+\\.\\d+$",
|
|
18
|
+
"description": "SemVer string identifying the schema version this document conforms to (e.g., '1.0.0')",
|
|
19
|
+
"examples": ["1.0.0", "1.1.0", "1.2.0"]
|
|
20
|
+
},
|
|
21
|
+
"metadata": {
|
|
22
|
+
"$ref": "#/$defs/DocumentMetadata"
|
|
23
|
+
},
|
|
24
|
+
"items": {
|
|
25
|
+
"type": "array",
|
|
26
|
+
"minItems": 1,
|
|
27
|
+
"description": "Array of evaluation items (prompts and optionally responses with scores)",
|
|
28
|
+
"items": {
|
|
29
|
+
"$ref": "#/$defs/EvalItem"
|
|
30
|
+
}
|
|
31
|
+
}
|
|
32
|
+
},
|
|
33
|
+
"$defs": {
|
|
34
|
+
"DocumentMetadata": {
|
|
35
|
+
"type": "object",
|
|
36
|
+
"description": "Optional metadata about the evaluation document",
|
|
37
|
+
"additionalProperties": true,
|
|
38
|
+
"properties": {
|
|
39
|
+
"name": {
|
|
40
|
+
"type": "string",
|
|
41
|
+
"description": "Human-readable name for the evaluation set"
|
|
42
|
+
},
|
|
43
|
+
"description": {
|
|
44
|
+
"type": "string",
|
|
45
|
+
"description": "Description of what this evaluation set tests"
|
|
46
|
+
},
|
|
47
|
+
"createdAt": {
|
|
48
|
+
"type": "string",
|
|
49
|
+
"format": "date-time",
|
|
50
|
+
"description": "ISO 8601 timestamp when the document was created"
|
|
51
|
+
},
|
|
52
|
+
"createdBy": {
|
|
53
|
+
"type": "string",
|
|
54
|
+
"description": "Author or system that created the document"
|
|
55
|
+
},
|
|
56
|
+
"evaluatedAt": {
|
|
57
|
+
"type": "string",
|
|
58
|
+
"format": "date-time",
|
|
59
|
+
"description": "ISO 8601 timestamp when evaluation was performed (output documents)"
|
|
60
|
+
},
|
|
61
|
+
"tags": {
|
|
62
|
+
"type": "array",
|
|
63
|
+
"items": {
|
|
64
|
+
"type": "string"
|
|
65
|
+
},
|
|
66
|
+
"description": "Tags for categorization and filtering"
|
|
67
|
+
},
|
|
68
|
+
"agentId": {
|
|
69
|
+
"type": "string",
|
|
70
|
+
"description": "M365 Agent ID this evaluation targets"
|
|
71
|
+
},
|
|
72
|
+
"extensions": {
|
|
73
|
+
"type": "object",
|
|
74
|
+
"additionalProperties": true,
|
|
75
|
+
"description": "Extension point for custom metadata. Use reverse-domain notation for field names."
|
|
76
|
+
}
|
|
77
|
+
}
|
|
78
|
+
},
|
|
79
|
+
"EvalItem": {
|
|
80
|
+
"type": "object",
|
|
81
|
+
"description": "A single evaluation item containing a prompt and optionally a response with scores",
|
|
82
|
+
"required": ["prompt"],
|
|
83
|
+
"additionalProperties": true,
|
|
84
|
+
"properties": {
|
|
85
|
+
"prompt": {
|
|
86
|
+
"type": "string",
|
|
87
|
+
"minLength": 1,
|
|
88
|
+
"description": "The input prompt to evaluate"
|
|
89
|
+
},
|
|
90
|
+
"expected_response": {
|
|
91
|
+
"type": "string",
|
|
92
|
+
"description": "Expected or ideal response for comparison during evaluation"
|
|
93
|
+
},
|
|
94
|
+
"response": {
|
|
95
|
+
"type": "string",
|
|
96
|
+
"description": "Actual response from the agent (present in output documents)"
|
|
97
|
+
},
|
|
98
|
+
"context": {
|
|
99
|
+
"type": "string",
|
|
100
|
+
"description": "Additional context for grounding evaluation"
|
|
101
|
+
},
|
|
102
|
+
"citations": {
|
|
103
|
+
"type": "array",
|
|
104
|
+
"items": {
|
|
105
|
+
"$ref": "#/$defs/Citation"
|
|
106
|
+
},
|
|
107
|
+
"description": "Citations included in the response"
|
|
108
|
+
},
|
|
109
|
+
"scores": {
|
|
110
|
+
"$ref": "#/$defs/ScoreCollection"
|
|
111
|
+
},
|
|
112
|
+
"extensions": {
|
|
113
|
+
"type": "object",
|
|
114
|
+
"additionalProperties": true,
|
|
115
|
+
"description": "Extension point for custom item-level fields"
|
|
116
|
+
}
|
|
117
|
+
}
|
|
118
|
+
},
|
|
119
|
+
"ScoreCollection": {
|
|
120
|
+
"type": "object",
|
|
121
|
+
"description": "Collection of evaluation scores for an item",
|
|
122
|
+
"additionalProperties": true,
|
|
123
|
+
"properties": {
|
|
124
|
+
"relevance": {
|
|
125
|
+
"$ref": "#/$defs/EvalScore",
|
|
126
|
+
"description": "Relevance score (1-5)"
|
|
127
|
+
},
|
|
128
|
+
"coherence": {
|
|
129
|
+
"$ref": "#/$defs/EvalScore",
|
|
130
|
+
"description": "Coherence score (1-5)"
|
|
131
|
+
},
|
|
132
|
+
"groundedness": {
|
|
133
|
+
"$ref": "#/$defs/EvalScore",
|
|
134
|
+
"description": "Groundedness score (1-5)"
|
|
135
|
+
},
|
|
136
|
+
"toolCallAccuracy": {
|
|
137
|
+
"$ref": "#/$defs/EvalScore",
|
|
138
|
+
"description": "Tool call accuracy score (1-5)"
|
|
139
|
+
},
|
|
140
|
+
"citations": {
|
|
141
|
+
"$ref": "#/$defs/CitationScore",
|
|
142
|
+
"description": "Citation evaluation results"
|
|
143
|
+
}
|
|
144
|
+
}
|
|
145
|
+
},
|
|
146
|
+
"EvalScore": {
|
|
147
|
+
"type": "object",
|
|
148
|
+
"description": "Standard evaluation score (1-5 scale)",
|
|
149
|
+
"required": ["score", "result", "threshold"],
|
|
150
|
+
"additionalProperties": true,
|
|
151
|
+
"properties": {
|
|
152
|
+
"score": {
|
|
153
|
+
"type": "number",
|
|
154
|
+
"minimum": 1,
|
|
155
|
+
"maximum": 5,
|
|
156
|
+
"description": "Numeric score from 1.0 (worst) to 5.0 (best)"
|
|
157
|
+
},
|
|
158
|
+
"result": {
|
|
159
|
+
"type": "string",
|
|
160
|
+
"enum": ["pass", "fail"],
|
|
161
|
+
"description": "Pass/fail result based on threshold comparison"
|
|
162
|
+
},
|
|
163
|
+
"threshold": {
|
|
164
|
+
"type": "number",
|
|
165
|
+
"minimum": 1,
|
|
166
|
+
"maximum": 5,
|
|
167
|
+
"description": "Threshold used for pass/fail determination"
|
|
168
|
+
},
|
|
169
|
+
"reason": {
|
|
170
|
+
"type": "string",
|
|
171
|
+
"description": "Explanation of why this score was assigned"
|
|
172
|
+
},
|
|
173
|
+
"evaluator": {
|
|
174
|
+
"type": "string",
|
|
175
|
+
"description": "Name or identifier of the evaluator that produced this score"
|
|
176
|
+
}
|
|
177
|
+
}
|
|
178
|
+
},
|
|
179
|
+
"CitationScore": {
|
|
180
|
+
"type": "object",
|
|
181
|
+
"description": "Citation-specific evaluation score",
|
|
182
|
+
"required": ["count", "result", "threshold"],
|
|
183
|
+
"additionalProperties": true,
|
|
184
|
+
"properties": {
|
|
185
|
+
"count": {
|
|
186
|
+
"type": "integer",
|
|
187
|
+
"minimum": 0,
|
|
188
|
+
"description": "Number of citations found in the response"
|
|
189
|
+
},
|
|
190
|
+
"result": {
|
|
191
|
+
"type": "string",
|
|
192
|
+
"enum": ["pass", "fail"],
|
|
193
|
+
"description": "Pass/fail result based on citation count vs threshold"
|
|
194
|
+
},
|
|
195
|
+
"threshold": {
|
|
196
|
+
"type": "integer",
|
|
197
|
+
"minimum": 0,
|
|
198
|
+
"description": "Minimum required number of citations for pass"
|
|
199
|
+
},
|
|
200
|
+
"format": {
|
|
201
|
+
"type": "string",
|
|
202
|
+
"description": "Citation format detected. Known values: 'oai_unicode', 'bracket', 'mixed'. Additional formats may be added.",
|
|
203
|
+
"examples": ["oai_unicode", "bracket", "mixed"]
|
|
204
|
+
},
|
|
205
|
+
"citations": {
|
|
206
|
+
"type": "array",
|
|
207
|
+
"items": {
|
|
208
|
+
"$ref": "#/$defs/Citation"
|
|
209
|
+
},
|
|
210
|
+
"description": "Parsed citation objects"
|
|
211
|
+
}
|
|
212
|
+
}
|
|
213
|
+
},
|
|
214
|
+
"Citation": {
|
|
215
|
+
"type": "object",
|
|
216
|
+
"description": "A single citation reference",
|
|
217
|
+
"required": ["index"],
|
|
218
|
+
"additionalProperties": true,
|
|
219
|
+
"properties": {
|
|
220
|
+
"index": {
|
|
221
|
+
"type": "integer",
|
|
222
|
+
"minimum": 1,
|
|
223
|
+
"description": "Citation index (1-based)"
|
|
224
|
+
},
|
|
225
|
+
"text": {
|
|
226
|
+
"type": "string",
|
|
227
|
+
"description": "The cited text"
|
|
228
|
+
},
|
|
229
|
+
"source": {
|
|
230
|
+
"type": "string",
|
|
231
|
+
"description": "Source reference (URL, document name, etc.)"
|
|
232
|
+
}
|
|
233
|
+
}
|
|
234
|
+
}
|
|
235
|
+
}
|
|
236
|
+
}
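To make the top-level rules of the schema above concrete, here is a minimal hand-written check in Python. It is a sketch that covers only a subset of the schema: the `schemaVersion` pattern, the non-empty `items` array, and the required non-empty `prompt`. It does not look at `metadata`, `scores`, or `citations`. `check_document` is a hypothetical name. A real consumer would more likely feed the schema file to a JSON Schema validator that supports draft 2020-12.

```python
import re

# Same shape as the schemaVersion pattern in the schema; fullmatch replaces the ^...$ anchors.
VERSION_PATTERN = re.compile(r"1\.\d+\.\d+", re.ASCII)


def check_document(document):
    """Return a list of problems for the top-level rules; an empty list means none were found."""
    if not isinstance(document, dict):
        return ["document must be an object"]
    problems = []
    version = document.get("schemaVersion")
    if not isinstance(version, str) or not VERSION_PATTERN.fullmatch(version):
        problems.append("schemaVersion must be a string matching 1.x.y")
    items = document.get("items")
    if not isinstance(items, list) or len(items) < 1:
        problems.append("items must be an array with at least one entry")
        return problems
    for position, item in enumerate(items):
        prompt = item.get("prompt") if isinstance(item, dict) else None
        if not isinstance(prompt, str) or len(prompt) < 1:
            problems.append(f"items[{position}].prompt must be a non-empty string")
    return problems


minimal = {"schemaVersion": "1.0.0", "items": [{"prompt": "What is Microsoft 365?"}]}
print(check_document(minimal))
print(check_document({"schemaVersion": "1.0.0", "items": []}))
```

Unknown extra keys are deliberately ignored, mirroring the schema's `additionalProperties: true`.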
|
|
package/schema/v1/examples/valid/comprehensive.json
ADDED
|
@@ -0,0 +1,92 @@
|
|
|
1
|
+
{
|
|
2
|
+
"$schema": "https://raw.githubusercontent.com/microsoft/M365-Copilot-Agent-Evals/refs/heads/main/schema/v1/eval-document.schema.json",
|
|
3
|
+
"schemaVersion": "1.0.0",
|
|
4
|
+
"metadata": {
|
|
5
|
+
"name": "Graph API Evaluation Set",
|
|
6
|
+
"description": "Test prompts for Microsoft Graph API knowledge",
|
|
7
|
+
"createdAt": "2026-01-20T10:00:00Z",
|
|
8
|
+
"createdBy": "eval-team",
|
|
9
|
+
"evaluatedAt": "2026-01-20T10:30:00Z",
|
|
10
|
+
"tags": ["graph", "api", "authentication"],
|
|
11
|
+
"agentId": "12345678-1234-1234-1234-123456789abc",
|
|
12
|
+
"extensions": {
|
|
13
|
+
"com.contoso.department": "engineering",
|
|
14
|
+
"com.contoso.priority": "high"
|
|
15
|
+
}
|
|
16
|
+
},
|
|
17
|
+
"items": [
|
|
18
|
+
{
|
|
19
|
+
"prompt": "What is Microsoft Graph API?",
|
|
20
|
+
"expected_response": "Microsoft Graph API is a unified endpoint for accessing Microsoft services.",
|
|
21
|
+
"context": "User is a developer new to Microsoft ecosystem.",
|
|
22
|
+
"response": "Microsoft Graph API is a gateway to data and intelligence in Microsoft 365.",
|
|
23
|
+
"scores": {
|
|
24
|
+
"relevance": {
|
|
25
|
+
"score": 5.0,
|
|
26
|
+
"result": "pass",
|
|
27
|
+
"threshold": 3,
|
|
28
|
+
"reason": "Response directly addresses the query with accurate information.",
|
|
29
|
+
"evaluator": "azure-ai-relevance"
|
|
30
|
+
},
|
|
31
|
+
"coherence": {
|
|
32
|
+
"score": 5.0,
|
|
33
|
+
"result": "pass",
|
|
34
|
+
"threshold": 3,
|
|
35
|
+
"reason": "Response is well-structured and easy to follow.",
|
|
36
|
+
"evaluator": "azure-ai-coherence"
|
|
37
|
+
},
|
|
38
|
+
"groundedness": {
|
|
39
|
+
"score": 4.0,
|
|
40
|
+
"result": "pass",
|
|
41
|
+
"threshold": 3,
|
|
42
|
+
"reason": "Response is grounded in the provided context.",
|
|
43
|
+
"evaluator": "azure-ai-groundedness"
|
|
44
|
+
},
|
|
45
|
+
"toolCallAccuracy": {
|
|
46
|
+
"score": 5.0,
|
|
47
|
+
"result": "pass",
|
|
48
|
+
"threshold": 3,
|
|
49
|
+
"reason": "Tool calls are accurate and well-formed.",
|
|
50
|
+
"evaluator": "azure-ai-tool-accuracy"
|
|
51
|
+
},
|
|
52
|
+
"citations": {
|
|
53
|
+
"count": 2,
|
|
54
|
+
"result": "pass",
|
|
55
|
+
"threshold": 1,
|
|
56
|
+
"format": "bracket",
|
|
57
|
+
"citations": [
|
|
58
|
+
{
|
|
59
|
+
"index": 1,
|
|
60
|
+
"text": "Microsoft Graph is a unified API endpoint",
|
|
61
|
+
"source": "https://learn.microsoft.com/graph/overview"
|
|
62
|
+
},
|
|
63
|
+
{
|
|
64
|
+
"index": 2,
|
|
65
|
+
"text": "Access data and intelligence in Microsoft 365",
|
|
66
|
+
"source": "https://learn.microsoft.com/graph/use-the-api"
|
|
67
|
+
}
|
|
68
|
+
]
|
|
69
|
+
}
|
|
70
|
+
},
|
|
71
|
+
"citations": [
|
|
72
|
+
{
|
|
73
|
+
"index": 1,
|
|
74
|
+
"text": "Microsoft Graph is a unified API endpoint",
|
|
75
|
+
"source": "https://learn.microsoft.com/graph/overview"
|
|
76
|
+
},
|
|
77
|
+
{
|
|
78
|
+
"index": 2,
|
|
79
|
+
"text": "Access data and intelligence in Microsoft 365",
|
|
80
|
+
"source": "https://learn.microsoft.com/graph/use-the-api"
|
|
81
|
+
}
|
|
82
|
+
],
|
|
83
|
+
"extensions": {
|
|
84
|
+
"com.contoso.difficulty": "easy"
|
|
85
|
+
}
|
|
86
|
+
},
|
|
87
|
+
{
|
|
88
|
+
"prompt": "How do I authenticate with Microsoft Graph?",
|
|
89
|
+
"expected_response": "You can authenticate using OAuth 2.0 or client credentials flow."
|
|
90
|
+
}
|
|
91
|
+
]
|
|
92
|
+
}
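A consumer of an output document shaped like the example above might want a per-item pass/fail summary. The Python sketch below shows one way to read it. `summarize_item` and `citation_result` are hypothetical helper names, not part of the package. The `count >= threshold` rule follows the schema's description of the citation threshold as the minimum number of citations required to pass.

```python
def summarize_item(item):
    """Map each score name found on an item to its recorded pass/fail result."""
    scores = item.get("scores", {})
    return {
        name: entry.get("result")
        for name, entry in scores.items()
        if isinstance(entry, dict)
    }


def citation_result(count, threshold):
    """Pass when the citation count reaches the minimum threshold."""
    return "pass" if count >= threshold else "fail"


# Trimmed-down item using values from the comprehensive example.
item = {
    "prompt": "What is Microsoft Graph API?",
    "scores": {
        "relevance": {"score": 5.0, "result": "pass", "threshold": 3},
        "citations": {"count": 2, "result": "pass", "threshold": 1},
    },
}
print(summarize_item(item))
print(citation_result(2, 1))
```

Items without a `scores` object, such as the second item in the example, yield an empty summary instead of raising an error.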
|