@microsoft/m365-copilot-eval 1.2.1-preview.1 → 1.4.0-preview.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (44)
  1. package/README.md +140 -101
  2. package/package.json +7 -4
  3. package/schema/CHANGELOG.md +8 -0
  4. package/schema/v1/eval-document.schema.json +256 -8
  5. package/schema/v1/examples/invalid/multi-turn-empty-turns.json +8 -0
  6. package/schema/v1/examples/invalid/multi-turn-has-both-prompt-and-turns.json +13 -0
  7. package/schema/v1/examples/invalid/multi-turn-missing-prompt.json +12 -0
  8. package/schema/v1/examples/invalid/multi-turn-typo-in-turn.json +13 -0
  9. package/schema/v1/examples/invalid/multi-turn-unknown-evaluator.json +15 -0
  10. package/schema/v1/examples/valid/comprehensive.json +27 -2
  11. package/schema/v1/examples/valid/mixed-single-and-multi-turn.json +30 -0
  12. package/schema/v1/examples/valid/multi-turn-output.json +59 -0
  13. package/schema/v1/examples/valid/multi-turn-simple.json +21 -0
  14. package/schema/v1/examples/valid/multi-turn-with-evaluators.json +34 -0
  15. package/schema/version.json +2 -2
  16. package/src/clients/cli/api_clients/A2A/__init__.py +3 -0
  17. package/src/clients/cli/api_clients/A2A/a2a_client.py +456 -0
  18. package/src/clients/cli/api_clients/REST/__init__.py +3 -0
  19. package/src/clients/cli/api_clients/REST/sydney_client.py +204 -0
  20. package/src/clients/cli/api_clients/__init__.py +3 -0
  21. package/src/clients/cli/api_clients/base_agent_client.py +78 -0
  22. package/src/clients/cli/cli_logging/__init__.py +0 -0
  23. package/src/clients/cli/cli_logging/console_diagnostics.py +107 -0
  24. package/src/clients/cli/cli_logging/logging_utils.py +144 -0
  25. package/src/clients/cli/common.py +62 -0
  26. package/src/clients/cli/custom_evaluators/CitationsEvaluator.py +3 -3
  27. package/src/clients/cli/custom_evaluators/ExactMatchEvaluator.py +11 -11
  28. package/src/clients/cli/custom_evaluators/PartialMatchEvaluator.py +1 -11
  29. package/src/clients/cli/evaluator_resolver.py +150 -0
  30. package/src/clients/cli/generate_report.py +347 -184
  31. package/src/clients/cli/main.py +1288 -481
  32. package/src/clients/cli/parallel_executor.py +57 -0
  33. package/src/clients/cli/readme.md +14 -7
  34. package/src/clients/cli/requirements.txt +1 -1
  35. package/src/clients/cli/response_extractor.py +30 -14
  36. package/src/clients/cli/retry_policy.py +52 -0
  37. package/src/clients/cli/samples/multiturn_example.json +35 -0
  38. package/src/clients/cli/throttle_gate.py +82 -0
  39. package/src/clients/node-js/bin/runevals.js +134 -41
  40. package/src/clients/node-js/config/default.js +5 -1
  41. package/src/clients/node-js/lib/agent-id.js +12 -0
  42. package/src/clients/node-js/lib/env-loader.js +11 -16
  43. package/src/clients/node-js/lib/eula-manager.js +78 -0
  44. package/src/clients/node-js/lib/progress.js +13 -11
package/README.md CHANGED
@@ -5,21 +5,28 @@
  A **zero-configuration** CLI for evaluating M365 Copilot agents. Send prompts to your agent, get responses, and automatically score them with Azure AI Evaluation metrics (relevance, coherence, groundedness).
  - Send a batch (or interactive set) of prompts to a configured chat API endpoint.
  - Collect agent responses and evaluate them locally using Azure AI Evaluation SDK.
- - Metrics produced per prompt:
-   - Relevance (1–5)
-   - Coherence (1–5)
-   - Groundedness (1–5)
-   - Tool Call Accuracy (1–5)
-   - Citations (0–1)
+ - The CLI supports 7 evaluator types. Evaluators marked with ⭐ are **enabled by default**.
+
+ | Evaluator | Type | Scale | Default Threshold | Default |
+ |-----------|------|-------|-------------------|---------|
+ | **Relevance** ⭐ | LLM-based | 1-5 | 3 | Yes |
+ | **Coherence** ⭐ | LLM-based | 1-5 | 3 | Yes |
+ | **Groundedness** | LLM-based | 1-5 | 3 | No |
+ | **ToolCallAccuracy** | LLM-based | 1-5 | 3 | No |
+ | **Citations** | Count-based | >= 0 | 1 | No |
+ | **ExactMatch** | String match | boolean | N/A | No |
+ | **PartialMatch** | String match | 0.0-1.0 | 0.5 | No |
  - Multiple input modes: command‑line list, JSON file, interactive.
  - Multiple output formats: console (colorized), JSON, CSV, HTML (auto‑opens report).

  ## 📋 Prerequisites

+ - **M365 Copilot License** for your tenant
  - **M365 Copilot Agent** deployed to your tenant (can be created with [M365 Agents Toolkit](https://learn.microsoft.com/en-us/microsoft-365/developer/overview-m365-agents-toolkit) or any other method)
  - **Node.js 24.12.0+** (check: `node --version`)
  - **Environment file** with your credentials and agent ID (see [Environment Setup](#-environment-setup) below)
- - **Your Tenant ID, Azure OpenAI endpoint, and API key** (see [Getting Variables](#-getting-variables) below)
+ - **Your Tenant ID** - find it using the instructions [here](https://learn.microsoft.com/en-us/azure/azure-portal/get-subscription-tenant-id)
+ - **Azure OpenAI endpoint and API key** (see [Getting Variables](#-getting-variables) below)

  > Note: Authentication is currently supported on Windows only. Support for other operating systems is coming soon.
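For intuition on the two string-match evaluators in the table above, here is a toy sketch in Python. The scoring logic is illustrative only; the package's actual `ExactMatchEvaluator`/`PartialMatchEvaluator` in `src/clients/cli/custom_evaluators/` may score differently.

```python
# Toy sketch of the two string-match evaluators from the table above.
# Hypothetical scoring logic, not the package's actual implementation.

def exact_match(response: str, expected: str, case_sensitive: bool = True) -> bool:
    """Boolean: response equals expected (optionally case-insensitive)."""
    if not case_sensitive:
        response, expected = response.lower(), expected.lower()
    return response.strip() == expected.strip()

def partial_match(response: str, expected: str) -> float:
    """0.0-1.0: fraction of expected words found in the response."""
    expected_words = expected.lower().split()
    if not expected_words:
        return 0.0
    found = sum(w in response.lower() for w in expected_words)
    return found / len(expected_words)
```

Under this sketch, a PartialMatch score at or above the default threshold of 0.5 would count as a pass.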
 
@@ -123,16 +130,29 @@ You need both the endpoint URL and API key from your Azure OpenAI resource for "
  **How to obtain:**

  1. Go to [Azure Portal](https://portal.azure.com)
- 2. Navigate to your Azure OpenAI service
-    - **Path:** Portal → All Services → Search "OpenAI" → Select your resource
-    - **Or create new:** Portal → Create a resource → Search "OpenAI"
- 3. In the **Overview** section, copy the **Endpoint** value
-    - Format: `https://YOUR-RESOURCE-NAME.openai.azure.com/`
-    - This is your `AZURE_AI_OPENAI_ENDPOINT`
- 4. In the left sidebar, click **Keys and Endpoint**
- 5. Copy **KEY 1** or **KEY 2**
-    - This is your `AZURE_AI_API_KEY`
- 6. Add both values to your `.env.dev` file as shown in the [Setup Steps](#setup-steps) above
+ 2. Search for "OpenAI" in the search bar and select **Azure OpenAI**.
+ ![Azure Portal search bar showing Azure OpenAI service](docs/images/image.png)
+ 3. Once you have selected Azure OpenAI, create an AI Foundry resource.
+ ![Azure OpenAI service page with Create AI Foundry Resource button](docs/images/image-1.png)
+ 4. On the Create AI Foundry Resource page, fill in the details and click **Review + Create**.
+ ![Create AI Foundry Resource form with Review + Create button](docs/images/image-2.png)
+ 5. Once the resource is deployed, go to the AI Foundry portal.
+ ![Resource deployment complete with link to AI Foundry portal](docs/images/image-3.png)
+ 6. At this point, you should be able to deploy an LLM.
+ 7. Select **Models + Endpoints** on the left rail.
+ ![AI Foundry portal left navigation with Models + Endpoints selected](docs/images/image-4.png)
+ 8. Select **Deploy Model** -> **Deploy base model** (we recommend the `gpt-4o-mini` model).
+ ![Deploy Model dropdown showing Deploy base model option](docs/images/image-5.png)
+ 9. Select **Confirm**, then select **Customize**.
+ ![Model deployment confirmation dialog with Customize button](docs/images/image-6.png)
+ 10. Change the capacity to 50K tokens per minute.
+ ![Model deployment customization showing token capacity setting](docs/images/image-7.png)
+ ![Token capacity set to 50K tokens per minute](docs/images/image-8.png)
+ 11. Hit **Deploy** and wait a few minutes for the model to deploy.
+ 12. Once the deployment finishes, you are redirected to the API endpoint and API key page.
+ 13. Copy the following values from that page.
+ ![API endpoint and API key values on the model deployment page](docs/images/image-10.png)
+ 14. Add all of these values to your `.env.dev` file as shown in the [Setup Steps](#setup-steps) above.

  **Required model:** Ensure you have `gpt-4o-mini` (or similar) deployed in your Azure OpenAI resource.
@@ -165,64 +185,110 @@ runevals --env dev

  ---

- ## 📝 Creating Prompts Files
+ ## 📝 Eval Document Format

- The CLI auto-discovers prompts files in your project:
+ The eval document schema is versioned independently from the CLI, following [Semantic Versioning](https://semver.org/).

- ### Auto-Discovery
+ - **Schema location**: [`schema/v1/eval-document.schema.json`](schema/v1/eval-document.schema.json)
+ - **Schema changelog**: [`schema/CHANGELOG.md`](schema/CHANGELOG.md)
+
+ > **New in Schema v1.2.0**: Multi-turn conversation threads — test context persistence across multiple turns within a shared conversation session. Each thread supports 1-20 turns.
+
+ > **New in Schema v1.1.0**: Per-prompt evaluator overrides with `evaluators_mode` (`extend`/`replace`), file-level `default_evaluators`, and `ExactMatch`/`PartialMatch` evaluators.
+
+ ### Getting Started

- When you run `runevals`, it searches:
+ The CLI auto-discovers prompts files in your project. When you run `runevals`, it searches:
  1. Current directory: `prompts.json`, `evals.json`, `tests.json`
  2. `./evals/` subdirectory: `prompts.json`, `evals.json`, `tests.json`

- **Example project structure:**
- ```
- my-agent/
- ├── .env.local                    # Your credentials
- ├── evals/
- │   └── evals.json                # Your test prompts (auto-discovered!)
- └── .evals/
-     └── 2025-12-03_14-30-45.html  # Generated reports
- ```
+ **No prompts file?** The CLI will offer to create a starter file with example prompts for you.

- ### Starter File Creation
+ A minimal eval document:

- If no file is found:
+ ```json
+ {
+   "schemaVersion": "1.2.0",
+   "items": [
+     {
+       "prompt": "What is Microsoft 365?",
+       "expected_response": "Microsoft 365 is a cloud-based productivity suite..."
+     }
+   ]
+ }
  ```
- ⚠️ No prompts file found in current directory or ./evals/

- Create a starter evals file with sample prompts? (Y/n):
- ```
+ ### Evaluator Configuration

- Answering "Y" creates `./evals/evals.json` with 2 starter prompts:
+ Use `default_evaluators` to set file-level defaults, and per-item `evaluators` with `evaluators_mode` to customize:

  ```json
- [
-   {
-     "prompt": "What is Microsoft 365?",
-     "expected_response": "Microsoft 365 is a cloud-based productivity suite..."
+ {
+   "schemaVersion": "1.2.0",
+   "default_evaluators": {
+     "Relevance": {},
+     "Coherence": {}
    },
-   {
-     "prompt": "How can I share a file in Teams?",
-     "expected_response": "You can share a file in Teams by uploading it..."
-   }
- ]
+   "items": [
+     {
+       "prompt": "What is Microsoft Graph?",
+       "expected_response": "A unified API endpoint for Microsoft services.",
+       "evaluators": {
+         "Citations": { "citation_format": "mixed" }
+       },
+       "evaluators_mode": "extend"
+     },
+     {
+       "name": "Expense policy flow",
+       "turns": [
+         {
+           "prompt": "I spent $250 on dinner. Is that okay?",
+           "expected_response": "The per-diem meal allowance is $200."
+         },
+         {
+           "prompt": "What should I do about the overage?",
+           "expected_response": "Request manager approval.",
+           "evaluators": {
+             "ExactMatch": { "case_sensitive": false }
+           },
+           "evaluators_mode": "replace"
+         }
+       ]
+     }
+   ]
+ }
  ```

- Edit this file with your own prompts and run again!
+ **How evaluator modes work in this example:**

- ### Manual Creation
+ | Item | `evaluators_mode` | Active Evaluators | Why |
+ |------|-------------------|-------------------|-----|
+ | Single-turn (Graph) | `extend` | Relevance, Coherence, Citations | Per-prompt Citations **merged** with defaults |
+ | Multi-turn turn 1 (dinner) | _(none)_ | Relevance, Coherence | **Inherits** file-level defaults |
+ | Multi-turn turn 2 (overage) | `replace` | ExactMatch | Per-turn ExactMatch **replaces** defaults entirely |

- Create `./evals/prompts.json`:
+ ### Evaluator Modes

- ```json
- [
-   {
-     "prompt": "Your test prompt here",
-     "expected_response": "Expected agent response"
-   }
- ]
- ```
+ | Mode | Behavior |
+ |------|----------|
+ | `"extend"` (default) | Per-item evaluators **merge** with defaults. Both run. |
+ | `"replace"` | Per-item evaluators **replace** defaults entirely. Only per-item evaluators run. |
+ | _(none)_ | Inherits file-level `default_evaluators`, or system defaults (Relevance, Coherence) if not set. |
+
+ See `schema/v1/examples/` in the package for more examples including per-turn evaluator overrides, mixed single/multi-turn files, and output format.
+
+ ### Auto-Upgrade Behavior
+
+ When the CLI loads an eval document:
+
+ - **Legacy documents** (missing `schemaVersion`): Automatically upgraded with a timestamped backup (e.g., `file.json.bak.20260205143052`)
+ - **Older versions** (same major version): `schemaVersion` field updated without backup
+ - **Invalid documents**: CLI exits with an error message and guidance to review the schema changelog
+ - **Future versions**: CLI rejects with a message suggesting a CLI update
+
+ ### Version Compatibility
+
+ Within a major version (e.g., 1.x.x), we aim to maintain backward compatibility for documents that conform to the published schema for their version. Compatibility does not extend to undeclared or ad-hoc fields outside the schema definition; review the [schema changelog](schema/CHANGELOG.md) when upgrading between minor versions.
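The `extend`/`replace`/inherit rules in this hunk can be sketched in a few lines of Python. This is an illustrative helper, not the package's actual `evaluator_resolver.py` implementation:

```python
# Sketch of per-item evaluator resolution following the README's rules:
# no per-item config -> inherit defaults; "replace" -> per-item wins
# outright; "extend" (the default mode) -> merge, per-item overrides.
# Names are illustrative, not the package's real API.

SYSTEM_DEFAULTS = {"Relevance": {}, "Coherence": {}}

def resolve_evaluators(item, default_evaluators=None):
    """Return the evaluator config active for one item (or turn)."""
    defaults = default_evaluators if default_evaluators is not None else SYSTEM_DEFAULTS
    per_item = item.get("evaluators")
    mode = item.get("evaluators_mode", "extend")
    if per_item is None:
        return dict(defaults)           # inherit file-level/system defaults
    if mode == "replace":
        return dict(per_item)           # per-item config replaces defaults
    return {**defaults, **per_item}     # extend: merge, per-item overrides

# Mirrors the worked example above:
defaults = {"Relevance": {}, "Coherence": {}}
single = {"evaluators": {"Citations": {"citation_format": "mixed"}},
          "evaluators_mode": "extend"}
turn2 = {"evaluators": {"ExactMatch": {"case_sensitive": False}},
         "evaluators_mode": "replace"}
```

With these inputs, the single-turn item resolves to Relevance, Coherence, and Citations; turn 2 resolves to ExactMatch only; an item with no `evaluators` key inherits the defaults, matching the table above.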

  ## 🎯 Usage Examples
 
@@ -265,10 +331,22 @@ runevals --prompts "What is Microsoft Graph?" --expected "Gateway to M365 data"
  # Interactive mode (enter prompts interactively)
  runevals --interactive

+ # Canonical logging verbosity
+ runevals --log-level debug
+ runevals --log-level info
+ runevals --log-level warning
+ runevals --log-level error
+
+ # Parallel prompt execution control
+ runevals --concurrency 5 --prompts-file ./evals/evals.json
+ runevals --concurrency 1000 --prompts-file ./evals/evals.json  # Python CLI clamps to 5
+
  # Custom output location in your project
  runevals --output ./reports/results.html
  ```

+ > **⚠️ Debug log safety notice:** The `--log-level debug` option is opt-in and may include raw API payloads and response data in console output. Redaction is pattern-based (API keys, tokens, passwords, long mixed-case strings) and **will not catch arbitrary PII or custom credentials** embedded in prompts or responses. Do not share debug-level output publicly without manual review.

  ### Optional: Add Shortcuts to package.json

  You can add shortcuts (npm scripts) to your agent project's `package.json`:
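The concurrency clamp noted in the usage examples above (`--concurrency 1000` is capped at 5) can be sketched with an `asyncio.Semaphore`. This is a hypothetical helper, not the package's actual `parallel_executor.py`:

```python
# Sketch of bounded parallel prompt execution: whatever --concurrency
# value is passed, at most MAX_CONCURRENCY prompts are in flight.
# Hypothetical code; the real CLI's executor may differ.
import asyncio

MAX_CONCURRENCY = 5  # per the README, the Python CLI clamps to 5

async def run_prompts(prompts, send, concurrency):
    limit = max(1, min(concurrency, MAX_CONCURRENCY))  # clamp to 1..5
    sem = asyncio.Semaphore(limit)

    async def run_one(prompt):
        async with sem:                 # at most `limit` concurrent sends
            return await send(prompt)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run_one(p) for p in prompts))

async def fake_send(prompt):
    await asyncio.sleep(0)
    return f"response to {prompt}"

results = asyncio.run(run_prompts(["a", "b", "c"], fake_send, 1000))
```

A semaphore keeps the request pipeline full without exceeding the cap, which matters when the agent endpoint throttles aggressive callers.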
@@ -301,10 +379,9 @@ npm run eval:dev
  ## 📊 Output Formats

  Results are automatically saved to `./evals/YYYY-MM-DD_HH-MM-SS.html` with:
- - **Relevance** score (1-5)
- - **Coherence** score (1-5)
- - **Groundedness** score (1-5)
- - Per-prompt details and aggregate metrics
+ - Per-prompt and per-turn evaluation scores from configured evaluators
+ - Aggregate statistics across all evaluated items
+ - Multi-turn thread summaries (turns passed/failed, overall status)

  Other formats:
  ```bash
@@ -320,8 +397,7 @@ runevals --output results.csv
  ```bash
  Options:
    -V, --version              output version number
-   -v, --verbose              show detailed processing steps
-   -q, --quiet                minimal output
+   --log-level [level]        log level: debug|info|warning|error (bare flag -> info)
    --prompts <prompts...>     inline prompts to evaluate
    --expected <responses...>  expected responses (with --prompts)
    --prompts-file <file>      JSON file with prompts
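The README's debug-log safety notice says redaction is pattern-based and will miss arbitrary PII. A toy sketch shows why; the regexes below are illustrative guesses, not the CLI's actual rules in `cli_logging/`:

```python
# Toy pattern-based redaction, illustrating the limitation the README's
# debug-log notice warns about. These patterns are hypothetical, not the
# CLI's real redaction rules.
import re

PATTERNS = [
    # key=value style secrets: api_key, token, password
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),
    # long mixed-case alphanumeric strings (likely credentials)
    re.compile(r"\b(?=[A-Za-z0-9]{32,})(?=.*[a-z])(?=.*[A-Z])[A-Za-z0-9]+\b"),
]

def redact(text: str) -> str:
    for pattern in PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

A labeled secret like `api_key=sk-12345` is caught, but an email address or a customer name sails straight through, which is exactly why debug output needs manual review before sharing.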
@@ -360,7 +436,7 @@ runevals cache-info

  # Clear and rebuild
  runevals cache-clear
- runevals --init-only --verbose
+ runevals --init-only --log-level debug
  ```

  ### Network/Proxy Issues
@@ -369,7 +445,7 @@ runevals --init-only --verbose
  export HTTPS_PROXY=http://proxy:8080

  # Retry with verbose output
- runevals --init-only --verbose
+ runevals --init-only --log-level debug
  ```

  ### Permission Issues
@@ -403,43 +479,6 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
  For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
  contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

- ## Schema Versioning
-
- The eval document schema is versioned independently from the CLI, following [Semantic Versioning](https://semver.org/). This allows external consumers to depend on a stable contract without coupling to CLI release cycles.
-
- - **Schema location**: [`schema/v1/eval-document.schema.json`](schema/v1/eval-document.schema.json)
- - **Schema changelog**: [`schema/CHANGELOG.md`](schema/CHANGELOG.md)
- - **Consumer quickstart**: [`specs/wi-6081652-dataset-schema-versioning/quickstart.md`](specs/wi-6081652-dataset-schema-versioning/quickstart.md)
-
- ### Eval Document Format
-
- Eval documents should include a `schemaVersion` field:
-
- ```json
- {
-   "schemaVersion": "1.0.0",
-   "items": [
-     {
-       "prompt": "What is Microsoft 365?",
-       "expected_response": "Microsoft 365 is a cloud-based productivity suite."
-     }
-   ]
- }
- ```
-
- ### Auto-Upgrade Behavior
-
- When the CLI loads an eval document:
-
- - **Legacy documents** (missing `schemaVersion`): Automatically upgraded with a timestamped backup (e.g., `file.json.bak.20260205143052`)
- - **Older versions** (same major version): `schemaVersion` field updated without backup
- - **Invalid documents**: CLI exits with an error message and guidance to review the schema changelog
- - **Future versions**: CLI rejects with a message suggesting a CLI update
-
- ### Version Compatibility
-
- Within a major version (e.g., 1.x.x), backward compatibility is guaranteed. Documents valid against 1.0.0 will remain valid against 1.1.0, 1.2.0, etc.
-
  ## Trademarks

  This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
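The auto-upgrade behavior described in the README diff above (timestamped backup for legacy documents, in-place version bump within the same major, rejection of future majors) can be sketched as follows. Function and constant names are hypothetical, not the CLI's real code:

```python
# Sketch of schemaVersion auto-upgrade per the README's rules.
# Illustrative only; the CLI's actual loader differs.
import glob
import json
import os
import tempfile
from datetime import datetime

CLI_SCHEMA_VERSION = "1.2.0"  # assumed current schema version

def load_eval_document(path):
    """Load an eval doc, applying the README's upgrade rules."""
    with open(path) as f:
        doc = json.load(f)
    version = doc.get("schemaVersion")
    if version is None:
        # Legacy document: write a timestamped backup, then stamp it.
        stamp = datetime.now().strftime("%Y%m%d%H%M%S")
        with open(f"{path}.bak.{stamp}", "w") as f:
            json.dump(doc, f, indent=2)
        doc["schemaVersion"] = CLI_SCHEMA_VERSION
    elif int(version.split(".")[0]) > int(CLI_SCHEMA_VERSION.split(".")[0]):
        # Future major version: reject and suggest updating the CLI.
        raise ValueError(f"schemaVersion {version} requires a newer CLI")
    else:
        # Same major, possibly older minor: update the field, no backup.
        doc["schemaVersion"] = CLI_SCHEMA_VERSION
    return doc

# Demo: a legacy document (no schemaVersion) gets backed up and upgraded.
workdir = tempfile.mkdtemp()
legacy = os.path.join(workdir, "evals.json")
with open(legacy, "w") as f:
    json.dump({"items": [{"prompt": "What is Microsoft 365?"}]}, f)
doc = load_eval_document(legacy)
```

The backup-before-mutate step is the key design choice: a legacy file is never silently rewritten without a recoverable copy.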
package/package.json CHANGED
@@ -1,9 +1,9 @@
  {
    "name": "@microsoft/m365-copilot-eval",
-   "version": "1.2.1-preview.1",
+   "version": "1.4.0-preview.1",
    "minCliVersion": "1.0.1-preview.1",
    "description": "Zero-config Node.js wrapper for M365 Copilot Agent Evaluations CLI (Python-based Azure AI Evaluation SDK)",
-   "publishDate": "2026-03-23",
+   "publishDate": "2026-04-22",
    "main": "src/clients/node-js/lib/index.js",
    "type": "module",
    "bin": {
@@ -14,6 +14,7 @@
    "build": "npm run prettier:check && npm run clean && npm run lint",
    "clean": "rimraf node_modules/.cache dist coverage",
    "test": "node --test tests/clients/node-js/**/*.test.js",
+   "test:coverage": "c8 --reporter=cobertura --reporter=text node --test tests/clients/node-js/**/*.test.js",
    "install-credprovider": "artifacts-npm-credprovider && npm ci",
    "set-publish-date": "node scripts/set-publish-date.js",
    "prepublishOnly": "node scripts/set-publish-date.js",
@@ -41,9 +42,10 @@
  },
  "dependencies": {
    "commander": "^12.1.0",
+   "dotenv": "^16.0.0",
+   "https-proxy-agent": "^7.0.5",
    "node-fetch": "^3.3.2",
-   "tar": "^7.5.4",
-   "https-proxy-agent": "^7.0.5"
+   "tar": "^7.5.4"
  },
  "devDependencies": {
    "@microsoft/eslint-config-msgraph": "^5.0.0",
@@ -52,6 +54,7 @@
    "@vitest/coverage-istanbul": "^3.0.0",
    "@vitest/coverage-v8": "^3.0.0",
    "@vitest/ui": "^3.0.0",
+   "c8": "^11.0.0",
    "eslint": "^9.7.0",
    "eslint-config-prettier": "^10.0.0",
    "eslint-plugin-jsdoc": "^50.1.0",
package/schema/CHANGELOG.md CHANGED
@@ -5,6 +5,14 @@ All notable changes to the eval document schema will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+ ## [1.1.0](https://github.com/microsoft/M365-Copilot-Agent-Evals/compare/schema-v1.0.0...schema-v1.1.0) (2026-03-30)
+
+ ### Features
+
+ * **WI-6855059:** add agentName/cliVersion to schema, fix duplicate prompt loss, include default_evaluators in output ([#181](https://github.com/microsoft/M365-Copilot-Agent-Evals/issues/181)) ([9321474](https://github.com/microsoft/M365-Copilot-Agent-Evals/commit/93214746144e9d11f507433eff185aefac4a858a))
+ * **WI-6855059:** implement per-prompt evaluator configuration ([#168](https://github.com/microsoft/M365-Copilot-Agent-Evals/issues/168)) ([eface7e](https://github.com/microsoft/M365-Copilot-Agent-Evals/commit/eface7e7041b118681cd4c68582fe903640bf6c0))
+
  ## [1.0.0] - 2026-02-19

  ### Added