@microsoft/m365-copilot-eval 1.0.1-preview.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (34)
  1. package/LICENSE +21 -0
  2. package/README.md +415 -0
  3. package/TERMS.txt +65 -0
  4. package/package.json +82 -0
  5. package/src/clients/cli/auth/__init__.py +1 -0
  6. package/src/clients/cli/auth/auth_handler.py +262 -0
  7. package/src/clients/cli/custom_evaluators/CitationsEvaluator.py +136 -0
  8. package/src/clients/cli/custom_evaluators/ConcisenessNonLLMEvaluator.py +18 -0
  9. package/src/clients/cli/custom_evaluators/ExactMatchEvaluator.py +25 -0
  10. package/src/clients/cli/custom_evaluators/PII/PII.py +45 -0
  11. package/src/clients/cli/custom_evaluators/PartialMatchEvaluator.py +39 -0
  12. package/src/clients/cli/custom_evaluators/__init__.py +1 -0
  13. package/src/clients/cli/demo_usage.py +83 -0
  14. package/src/clients/cli/generate_report.py +251 -0
  15. package/src/clients/cli/main.py +766 -0
  16. package/src/clients/cli/readme.md +301 -0
  17. package/src/clients/cli/requirements.txt +10 -0
  18. package/src/clients/cli/response_extractor.py +589 -0
  19. package/src/clients/cli/samples/PartnerSuccess.json +122 -0
  20. package/src/clients/cli/samples/example_prompts.json +14 -0
  21. package/src/clients/cli/samples/example_prompts_alt.json +12 -0
  22. package/src/clients/cli/samples/prompts_ambiguity.json +22 -0
  23. package/src/clients/cli/samples/prompts_rag_grounding.json +22 -0
  24. package/src/clients/cli/samples/prompts_security_injection.json +22 -0
  25. package/src/clients/cli/samples/prompts_tool_use_negatives.json +22 -0
  26. package/src/clients/cli/samples/psaSample.json +18 -0
  27. package/src/clients/cli/samples/starter.json +10 -0
  28. package/src/clients/node-js/bin/runevals.js +505 -0
  29. package/src/clients/node-js/config/default.js +25 -0
  30. package/src/clients/node-js/lib/cache-utils.js +119 -0
  31. package/src/clients/node-js/lib/expiry-check.js +164 -0
  32. package/src/clients/node-js/lib/index.js +25 -0
  33. package/src/clients/node-js/lib/python-runtime.js +253 -0
  34. package/src/clients/node-js/lib/venv-manager.js +242 -0
@@ -0,0 +1,301 @@
# M365 Copilot Agent Evaluations (CLI)

This CLI evaluates declarative agent (DA) responses from a configured M365 Copilot endpoint (copilotApi) and scores them locally using Azure AI Evaluation services.

Current evaluation metrics:
- Relevance (1–5)
- Coherence (1–5)
- Groundedness (1–5)
- Citations (count, with pass/fail based on presence)
- Tool Call Accuracy (1–5; evaluates correct tool usage when tools are invoked)

## 📋 Prerequisites

- Python 3.13.10+ - https://www.python.org/downloads/
- Azure subscription with Azure OpenAI access (for evaluation metrics only)

## 🚀 Quick Start (Run CLI)

### 1. Clone and Install Dependencies

```bash
git clone https://github.com/microsoft/M365-Copilot-Agent-Evals.git
cd M365-Copilot-Agent-Evals/src/clients/cli
python -m venv .venv            # Create a virtual Python environment.
.\.venv\Scripts\Activate.ps1    # Activate it (Windows PowerShell).
pip install -r requirements.txt
```

### 2. Set Up Environment Variables

Create a `.env` file in the `src/clients/cli` directory (or export the variables). Choose **one** auth path for the Copilot API: either provide a pre-issued token or use interactive WAM auth (Windows).

```bash
# Azure OpenAI (evaluation models)
AZURE_AI_OPENAI_ENDPOINT="<Your_Azure_AI_OpenAI_Endpoint>"
AZURE_AI_API_KEY="<azure-openai-key>"
AZURE_AI_API_VERSION="2024-12-01-preview"
AZURE_AI_MODEL_NAME="gpt-4o-mini"

# Copilot Chat API (response generation)
COPILOT_API_ENDPOINT="https://substrate.office.com/m365Copilot"  # CLI appends /chat
X_SCENARIO_HEADER="<scenario-header>"                            # e.g., officeweb

# Auth option A: static access token (no prompt)
COPILOT_API_ACCESS_TOKEN="<access-token>"

# Auth option B: interactive WAM auth (used if COPILOT_API_ACCESS_TOKEN is empty)
M365_EVAL_CLIENT_ID="<app-registration-client-id>"
TENANT_ID="<aad-tenant-id>"
COPILOT_SCOPES="https://substrate.office.com/sydney/.default"

# Optional: default agent id (overridable via --agent-id)
AGENT_ID="00000000-0000-0000-0000-000000000000"
```
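
The fallback between the two auth options (a static token wins; otherwise WAM variables are required) can be sketched as a small helper. This is an illustrative sketch only — the function name `resolve_auth_mode` is an assumption, not the CLI's actual code:

```python
def resolve_auth_mode(env: dict) -> str:
    """Pick the Copilot API auth path: a static token wins; otherwise fall
    back to interactive WAM auth, which needs a client id and tenant id.
    Hypothetical helper for illustration only."""
    if env.get("COPILOT_API_ACCESS_TOKEN"):
        return "static_token"
    if env.get("M365_EVAL_CLIENT_ID") and env.get("TENANT_ID"):
        return "wam_interactive"
    raise ValueError(
        "Set COPILOT_API_ACCESS_TOKEN, or M365_EVAL_CLIENT_ID + TENANT_ID"
    )

# Example: only the WAM variables are present, so WAM is chosen.
mode = resolve_auth_mode({"M365_EVAL_CLIENT_ID": "app-id", "TENANT_ID": "tenant"})
print(mode)  # wam_interactive
```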

### 3. Run the Agent Evaluation

The CLI supports multiple ways to run evaluations:

#### Basic Usage (with default prompt set)
```bash
python main.py
```

#### Command Line Prompts
```bash
# Single prompt and expected response
python main.py --prompts "What is Microsoft Graph?" --expected "Microsoft Graph is a gateway to data and intelligence in Microsoft 365."

# Multiple prompts and expected responses
python main.py --prompts "What is Microsoft Graph?" "How does authentication work?" --expected "Microsoft Graph is a gateway..." "Authentication works by..."

# Override the agent configured in environment variables
python main.py --agent-id "00000000-0000-0000-0000-000000000000" --prompts "What is Microsoft Graph?"
```
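
When `--prompts` and `--expected` are both given, each prompt must line up with one expected response. A minimal validation sketch of that pairing (the helper name `pair_prompts` is hypothetical, not taken from `main.py`):

```python
def pair_prompts(prompts: list[str], expected: list[str]) -> list[tuple[str, str]]:
    """Zip prompts with expected responses, refusing mismatched lengths.
    Illustrative helper; not the CLI's actual implementation."""
    if len(prompts) != len(expected):
        raise ValueError(
            f"Got {len(prompts)} prompts but {len(expected)} expected responses"
        )
    return list(zip(prompts, expected))

pairs = pair_prompts(
    ["What is Microsoft Graph?"],
    ["Microsoft Graph is a gateway to data and intelligence in Microsoft 365."],
)
print(pairs[0][0])  # What is Microsoft Graph?
```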

#### Using Prompts from File
```bash
# JSON file with prompts and expected responses
python main.py --prompts-file samples/example_prompts.json

# Output to JSON file
python main.py --prompts-file samples/example_prompts.json --output results.json

# Output to CSV file
python main.py --prompts-file samples/example_prompts.json --output results.csv

# Output to HTML file (opens in browser)
python main.py --prompts-file samples/example_prompts.json --output report.html
```

#### Interactive Mode
```bash
# Enter prompts interactively
python main.py --interactive
```

#### Additional Options
```bash
# Verbose output (shows detailed processing steps)
python main.py --verbose

# Quiet mode (minimal output)
python main.py --quiet

# Get help and see all options
python main.py --help

# Specify / override the Agent ID (takes precedence over the AGENT_ID env var)
python main.py --agent-id "00000000-0000-0000-0000-000000000000"

# Citation format options
python main.py --citation-format oai_unicode    # Default: new OAI format
python main.py --citation-format legacy_bracket # Old [^i^] format
```
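
For the legacy `[^i^]` format, citation presence can be detected with a simple regex. This is a sketch of the idea only, not the actual `CitationsEvaluator` implementation:

```python
import re

# Matches legacy bracket citations such as [^1^] or [^12^].
LEGACY_CITATION = re.compile(r"\[\^(\d+)\^\]")

def count_legacy_citations(text: str) -> int:
    """Count [^i^]-style citation markers in a response (illustrative only)."""
    return len(LEGACY_CITATION.findall(text))

sample = "Graph is the M365 gateway [^1^]; auth uses OAuth 2.0 [^2^]."
print(count_legacy_citations(sample))  # 2
```

A pass/fail signal based on presence then reduces to `count_legacy_citations(text) > 0`.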

#### File Format Examples

**JSON Format 1 (Array of Objects):**
```json
[
  {
    "prompt": "What is Microsoft Graph?",
    "expected_response": "Microsoft Graph is a gateway to data and intelligence in Microsoft 365."
  },
  {
    "prompt": "How do I authenticate with Microsoft Graph?",
    "expected_response": "You can authenticate using OAuth 2.0..."
  }
]
```

**JSON Format 2 (Separate Arrays):**
```json
{
  "prompts": [
    "What is Microsoft Graph?",
    "How do I authenticate with Microsoft Graph?"
  ],
  "expected_responses": [
    "Microsoft Graph is a gateway to data and intelligence in Microsoft 365.",
    "You can authenticate using OAuth 2.0..."
  ]
}
```
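
Both file formats can be normalized to one shape before evaluation. A sketch of such a loader (the function name `normalize_prompts` and the output shape are assumptions; the CLI's real parsing may differ):

```python
import json

def normalize_prompts(data) -> list[dict]:
    """Accept either file format and return a list of
    {'prompt': ..., 'expected_response': ...} dicts. Illustrative sketch."""
    if isinstance(data, list):  # Format 1: array of objects
        return [
            {"prompt": d["prompt"], "expected_response": d.get("expected_response")}
            for d in data
        ]
    # Format 2: separate, parallel arrays
    return [
        {"prompt": p, "expected_response": e}
        for p, e in zip(data["prompts"], data["expected_responses"])
    ]

doc = json.loads('{"prompts": ["Q1"], "expected_responses": ["A1"]}')
print(normalize_prompts(doc))  # [{'prompt': 'Q1', 'expected_response': 'A1'}]
```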

## 🔧 Tool Call Accuracy Evaluation

The CLI also includes tool call accuracy evaluation, which analyzes how effectively the agent uses the tools available to it:

### What It Evaluates
- **Tool Selection**: Whether the agent chooses appropriate tools for the given task
- **Parameter Accuracy**: Correctness of arguments passed to tool functions
- **Tool Usage Patterns**: Overall effectiveness of tool invocation strategies

### How It Works
1. **Response Analysis**: Extracts tool calls and results from conversation telemetry
2. **Tool Definitions**: Captures available tools from conversation metadata
3. **Accuracy Assessment**: Uses the Azure AI Evaluation SDK's ToolCallAccuracyEvaluator
4. **Score Calculation**: Returns a 1–5 score with a pass/fail threshold (default: 3)

### Enhanced Data Extraction
The tool also extracts detailed information from agent responses:
- **Message Flow**: Complete chronological sequence of tool calls and results
- **Tool Definitions**: Available tools and their schemas
- **Internal Filtering**: Removes framework-internal tools for cleaner analysis
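
The internal-filtering step can be pictured as a predicate over extracted tool calls. The prefix convention below is a made-up example; the real filtering rules live in `response_extractor.py`:

```python
def filter_internal_tools(tool_calls: list[dict],
                          internal_prefixes: tuple = ("framework_",)) -> list[dict]:
    """Drop framework-internal tool calls before evaluation.
    The 'framework_' prefix is a hypothetical convention for illustration."""
    return [
        call for call in tool_calls
        if not call.get("name", "").startswith(internal_prefixes)
    ]

calls = [
    {"name": "framework_telemetry", "arguments": {}},
    {"name": "search_documents", "arguments": {"query": "Graph auth"}},
]
print([c["name"] for c in filter_internal_tools(calls)])  # ['search_documents']
```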

### Example Output
```bash
📊 Aggregate Statistics (3 prompts):
════════════════════════════════════════════════════════════
Tool Call Accuracy:
  Pass Rate: 66.7% (2/3 passed)
  Avg Score: 0.75
  Threshold: 0.5
```
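
Aggregates like the pass rate above reduce per-prompt results to summary numbers. A minimal sketch (the `score` field name and return shape are assumptions, not the CLI's schema):

```python
def aggregate(results: list[dict], threshold: float) -> dict:
    """Summarize per-prompt scores into pass rate and average score.
    Field names and the return shape are illustrative only."""
    scores = [r["score"] for r in results]
    passed = sum(1 for s in scores if s >= threshold)
    return {
        "pass_rate": passed / len(scores),
        "avg_score": sum(scores) / len(scores),
        "passed": passed,
        "total": len(scores),
    }

stats = aggregate([{"score": 1.0}, {"score": 0.75}, {"score": 0.5}], threshold=0.75)
print(f"{stats['pass_rate']:.1%} ({stats['passed']}/{stats['total']} passed)")
```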

**Note**: Tool Call Accuracy evaluation only applies when the agent response includes tool invocations. For text-only responses, this metric will not be computed.

## 🔧 Configuration

### Getting Azure AI Foundry Configuration Values

#### 1. Azure OpenAI Endpoint (`AZURE_AI_OPENAI_ENDPOINT`)

1. Open [Azure AI Foundry](https://ai.azure.com/)
2. Go to **Models + endpoints** under **My assets**
3. Select your model (create one if none is available; `gpt-4o-mini` works well with the evaluators)
4. Copy the endpoint URL (format: `https://your-resource.openai.azure.com/`)

#### 2. Azure OpenAI API Key (`AZURE_AI_API_KEY`)

1. Open [Azure AI Foundry](https://ai.azure.com/)
2. Go to **Models + endpoints** under **My assets**
3. Select your model (create one if none is available; `gpt-4o-mini` works well with the evaluators)
4. Copy the key

#### 3. Azure OpenAI Model Name (`AZURE_AI_MODEL_NAME`)

1. Open [Azure AI Foundry](https://ai.azure.com/)
2. Go to **Models + endpoints** under **My assets**
3. Select your model (create one if none is available; `gpt-4o-mini` works well with the evaluators)
4. Copy the model/deployment name (e.g., `gpt-4o-mini`, `gpt-4`, `gpt-35-turbo`)

#### 4. API Version (`AZURE_AI_API_VERSION`)

Use the API version: `2024-12-01-preview`

For the most current version, check the [Azure OpenAI API reference](https://docs.microsoft.com/en-us/azure/cognitive-services/openai/reference).

## 📁 Project Structure

```
M365-Copilot-Agent-Evals/
├── src/
│   └── clients/
│       └── cli/
│           ├── main.py                   # Main evaluation script
│           ├── response_extractor.py     # Enhanced response parsing and tool extraction
│           ├── generate_report.py        # HTML report generation with aggregates
│           ├── requirements.txt          # Python dependencies
│           ├── readme.md                 # This file
│           ├── CHANGELOG.md              # Version history and changes
│           ├── .env.template             # Environment variables template
│           ├── .env                      # Environment variables (create this)
│           ├── custom_evaluators/        # Custom evaluation modules
│           │   ├── __init__.py
│           │   └── CitationsEvaluator.py # Citation detection evaluator
│           └── samples/                  # Example prompt files
│               ├── example_prompts.json
│               └── prompts_*.json        # Various test scenarios
├── tests/                                # Test files
├── CODE_OF_CONDUCT.md                    # Code of conduct
├── LICENSE                               # License file
├── README.md                             # Root project README
├── SECURITY.md                           # Security guidelines
└── SUPPORT.md                            # Support information
```

## 📊 Features

- **Chat Invocation**: Sends prompts to the Sydney chat API
- **Evaluation Metrics**: Relevance, Coherence, Groundedness, Citations, Tool Call Accuracy
- **Multi-format Citation Detection**: Supports both the new OAI Unicode and legacy bracket formats
- **Tool Call Analysis**: Extracts and evaluates tool invocations from conversation telemetry
- **Enhanced Response Extraction**: Detailed parsing of tool calls, results, and message flow
- **Aggregate Statistics**: Summary metrics across multiple prompts with pass/fail rates
- **Colorized Console Output**
- **Multiple Output Formats**: JSON, CSV, HTML with aggregate dashboards
- **Flexible Prompt Input**: Command line, file, or interactive

## 🔒 Security Best Practices

- ✅ All sensitive information is stored in environment variables
- ✅ No hardcoded credentials in source code
- ✅ `.env` files are excluded from version control

## 🔑 Authentication Requirements

- Copilot API: either a static `COPILOT_API_ACCESS_TOKEN` **or** WAM interactive auth via `M365_EVAL_CLIENT_ID` and `TENANT_ID` (optional `COPILOT_SCOPES`)
- Evaluators: Azure OpenAI key (`AZURE_AI_API_KEY`)

## 📚 Useful Resources

- [Azure AI Foundry Documentation](https://docs.microsoft.com/en-us/azure/ai-services/ai-foundry/)
- [Azure AI Projects Python SDK](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/README.md)
- [Azure AI Evaluation Documentation](https://docs.microsoft.com/en-us/azure/ai-services/ai-foundry/how-to/evaluate-models)
- [Azure OpenAI Service Documentation](https://docs.microsoft.com/en-us/azure/cognitive-services/openai/)
- [Azure Identity Library](https://docs.microsoft.com/en-us/python/api/overview/azure/identity-readme)

## 🐛 Troubleshooting

### Common Issues

1. **Authentication Errors**: Ensure your Azure credentials are properly configured and that you have access to the resources.

2. **HTTP 401 / 403**: If using a static token, verify `COPILOT_API_ACCESS_TOKEN` and the required headers. If using WAM, confirm `M365_EVAL_CLIENT_ID`, `X_SCENARIO_HEADER`, `TENANT_ID`, and `COPILOT_SCOPES` are correct.

3. **Endpoint Issues**: Confirm `COPILOT_API_ENDPOINT` (the base URL, without `/chat`) is reachable (try `curl` / `Invoke-WebRequest`).

### Getting Help

- Check the [Azure AI Foundry troubleshooting guide](https://docs.microsoft.com/en-us/azure/ai-services/ai-foundry/troubleshooting)
- Review the Azure AI Projects SDK [troubleshooting section](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/README.md#troubleshooting)

## 📋 Version History

For detailed information about changes, new features, and breaking changes, see [CHANGELOG.md](CHANGELOG.md).

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Ensure all sensitive information uses environment variables
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
@@ -0,0 +1,10 @@
ansible-core==2.19.0
azure-ai-evaluation==1.10.0
azure-ai-projects==1.0.0
msal[broker]>=1.34,<2
msal-extensions>=1.3.1
pip==25.3
python-dotenv==1.1.1
markdown==3.8.2
promptflow>=1.18.1
questionary>=2.1.1