@microsoft/m365-copilot-eval 1.0.1-preview.1
- package/LICENSE +21 -0
- package/README.md +415 -0
- package/TERMS.txt +65 -0
- package/package.json +82 -0
- package/src/clients/cli/auth/__init__.py +1 -0
- package/src/clients/cli/auth/auth_handler.py +262 -0
- package/src/clients/cli/custom_evaluators/CitationsEvaluator.py +136 -0
- package/src/clients/cli/custom_evaluators/ConcisenessNonLLMEvaluator.py +18 -0
- package/src/clients/cli/custom_evaluators/ExactMatchEvaluator.py +25 -0
- package/src/clients/cli/custom_evaluators/PII/PII.py +45 -0
- package/src/clients/cli/custom_evaluators/PartialMatchEvaluator.py +39 -0
- package/src/clients/cli/custom_evaluators/__init__.py +1 -0
- package/src/clients/cli/demo_usage.py +83 -0
- package/src/clients/cli/generate_report.py +251 -0
- package/src/clients/cli/main.py +766 -0
- package/src/clients/cli/readme.md +301 -0
- package/src/clients/cli/requirements.txt +10 -0
- package/src/clients/cli/response_extractor.py +589 -0
- package/src/clients/cli/samples/PartnerSuccess.json +122 -0
- package/src/clients/cli/samples/example_prompts.json +14 -0
- package/src/clients/cli/samples/example_prompts_alt.json +12 -0
- package/src/clients/cli/samples/prompts_ambiguity.json +22 -0
- package/src/clients/cli/samples/prompts_rag_grounding.json +22 -0
- package/src/clients/cli/samples/prompts_security_injection.json +22 -0
- package/src/clients/cli/samples/prompts_tool_use_negatives.json +22 -0
- package/src/clients/cli/samples/psaSample.json +18 -0
- package/src/clients/cli/samples/starter.json +10 -0
- package/src/clients/node-js/bin/runevals.js +505 -0
- package/src/clients/node-js/config/default.js +25 -0
- package/src/clients/node-js/lib/cache-utils.js +119 -0
- package/src/clients/node-js/lib/expiry-check.js +164 -0
- package/src/clients/node-js/lib/index.js +25 -0
- package/src/clients/node-js/lib/python-runtime.js +253 -0
- package/src/clients/node-js/lib/venv-manager.js +242 -0
# M365 Copilot Agent Evaluations (CLI)

This CLI evaluates DA responses from a configured M365 Copilot endpoint (copilotApi) and scores them locally using Azure AI Evaluation services.

Current evaluation metrics:
- Relevance (1–5)
- Coherence (1–5)
- Groundedness (1–5)
- Citations (count, with pass/fail based on presence)
- Tool Call Accuracy (1–5; evaluates correct tool usage when tools are invoked)

## 📋 Prerequisites

- Python 3.13.10+ - https://www.python.org/downloads/
- Azure subscription with Azure OpenAI access (for evaluation metrics only)

## 🚀 Quick Start (Run CLI)

### 1. Clone and Install Dependencies

```bash
git clone https://github.com/microsoft/M365-Copilot-Agent-Evals.git
cd M365-Copilot-Agent-Evals/src/clients/cli
python -m venv .venv            # Create a virtual Python environment.
.\.venv\Scripts\Activate.ps1    # Activate the virtual environment (Windows PowerShell).
pip install -r requirements.txt
```

### 2. Set Up Environment Variables

Create a `.env` file in the `src/clients/cli` directory (or export the variables in your shell). Choose **one** auth path for the Copilot API: either provide a pre-issued token or use interactive WAM auth (Windows).

```bash
# Azure OpenAI (evaluation models)
AZURE_AI_OPENAI_ENDPOINT="<Your_Azure_AI_OpenAI_Endpoint>"
AZURE_AI_API_KEY="<azure-openai-key>"
AZURE_AI_API_VERSION="2024-12-01-preview"
AZURE_AI_MODEL_NAME="gpt-4o-mini"

# Copilot Chat API (response generation)
COPILOT_API_ENDPOINT="https://substrate.office.com/m365Copilot"  # CLI appends /chat
X_SCENARIO_HEADER="<scenario-header>"  # e.g., officeweb

# Auth option A: static access token (no prompt)
COPILOT_API_ACCESS_TOKEN="<access-token>"

# Auth option B: interactive WAM auth (used if COPILOT_API_ACCESS_TOKEN is empty)
M365_EVAL_CLIENT_ID="<app-registration-client-id>"
TENANT_ID="<aad-tenant-id>"
COPILOT_SCOPES="https://substrate.office.com/sydney/.default"

# Optional: default agent id (overridable via --agent-id)
AGENT_ID="00000000-0000-0000-0000-000000000000"
```
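
The token-or-WAM fallback described above can be sketched as follows. This is an illustrative helper, not the CLI's actual code; the function name and return shape are assumptions, and only the environment variable names come from the README.

```python
import os

def resolve_copilot_auth() -> dict:
    """Pick the Copilot API auth path: prefer a static token (option A),
    fall back to interactive WAM auth (option B) when the token is empty.
    Illustrative sketch only; names and return shape are assumptions."""
    token = os.environ.get("COPILOT_API_ACCESS_TOKEN", "").strip()
    if token:
        return {"mode": "static_token", "token": token}

    client_id = os.environ.get("M365_EVAL_CLIENT_ID", "")
    tenant_id = os.environ.get("TENANT_ID", "")
    if not (client_id and tenant_id):
        raise RuntimeError(
            "Set COPILOT_API_ACCESS_TOKEN, or M365_EVAL_CLIENT_ID and TENANT_ID"
        )
    scopes = os.environ.get(
        "COPILOT_SCOPES", "https://substrate.office.com/sydney/.default"
    )
    return {"mode": "wam_interactive", "client_id": client_id,
            "tenant_id": tenant_id, "scopes": scopes}
```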

### 3. Run the Agent Evaluation

The CLI supports multiple ways to run evaluations:

#### Basic Usage (with default prompt set)
```bash
python main.py
```

#### Command Line Prompts
```bash
# Single prompt and expected response
python main.py --prompts "What is Microsoft Graph?" --expected "Microsoft Graph is a gateway to data and intelligence in Microsoft 365."

# Multiple prompts and expected responses
python main.py --prompts "What is Microsoft Graph?" "How does authentication work?" --expected "Microsoft Graph is a gateway..." "Authentication works by..."

# Override the agent configured in environment variables
python main.py --agent-id "00000000-0000-0000-0000-000000000000" --prompts "What is Microsoft Graph?"
```

#### Using Prompts from File
```bash
# JSON file with prompts and expected responses
python main.py --prompts-file samples/example_prompts.json

# Output to JSON file
python main.py --prompts-file samples/example_prompts.json --output results.json

# Output to CSV file
python main.py --prompts-file samples/example_prompts.json --output results.csv

# Output to HTML file (opens in browser)
python main.py --prompts-file samples/example_prompts.json --output report.html
```
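
As the commands above suggest, the output format follows the file extension of `--output`. A minimal sketch of that kind of extension-based dispatch (function name and HTML layout are illustrative assumptions; the real writers, including the full HTML report, live in `generate_report.py`):

```python
import csv
import json
from pathlib import Path

def write_results(results: list[dict], output_path: str) -> str:
    """Write evaluation results, choosing the format from the file extension.
    Illustrative sketch only; returns the format it picked."""
    path = Path(output_path)
    fmt = path.suffix.lower().lstrip(".")
    if fmt == "json":
        path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    elif fmt == "csv":
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(results[0].keys()))
            writer.writeheader()
            writer.writerows(results)
    elif fmt == "html":
        # Bare-bones table; the real report adds aggregates and styling.
        rows = "".join(
            f"<tr><td>{r.get('prompt', '')}</td><td>{r.get('relevance', '')}</td></tr>"
            for r in results
        )
        path.write_text(f"<table>{rows}</table>", encoding="utf-8")
    else:
        raise ValueError(f"Unsupported output format: {fmt}")
    return fmt
```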

#### Interactive Mode
```bash
# Enter prompts interactively
python main.py --interactive
```

#### Additional Options
```bash
# Verbose output (shows detailed processing steps)
python main.py --verbose

# Quiet mode (minimal output)
python main.py --quiet

# Get help and see all options
python main.py --help

# Specify / override the agent ID (takes precedence over the AGENT_ID env var)
python main.py --agent-id "00000000-0000-0000-0000-000000000000"

# Citation format options
python main.py --citation-format oai_unicode    # Default: new OAI format
python main.py --citation-format legacy_bracket # Old [^i^] format
```
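
For the legacy bracket option, citation markers look like `[^1^]`, `[^2^]`, and so on, which makes them easy to count with a regex. A minimal sketch (illustrative only; the packaged `CitationsEvaluator.py` implements the real detection, including the newer OAI Unicode format):

```python
import re

# Legacy bracket citations have the shape [^i^], where i is a number.
LEGACY_BRACKET = re.compile(r"\[\^(\d+)\^\]")

def count_legacy_citations(text: str) -> int:
    """Count legacy [^i^] citation markers in a response (sketch only)."""
    return len(LEGACY_BRACKET.findall(text))
```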

#### File Format Examples

**JSON Format 1 (Array of Objects):**
```json
[
  {
    "prompt": "What is Microsoft Graph?",
    "expected_response": "Microsoft Graph is a gateway to data and intelligence in Microsoft 365."
  },
  {
    "prompt": "How do I authenticate with Microsoft Graph?",
    "expected_response": "You can authenticate using OAuth 2.0..."
  }
]
```

**JSON Format 2 (Separate Arrays):**
```json
{
  "prompts": [
    "What is Microsoft Graph?",
    "How do I authenticate with Microsoft Graph?"
  ],
  "expected_responses": [
    "Microsoft Graph is a gateway to data and intelligence in Microsoft 365.",
    "You can authenticate using OAuth 2.0..."
  ]
}
```
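
Both file formats above carry the same information, so a loader can normalize them into one shape. A sketch, assuming `main.py`'s actual loader may differ in details (the function name here is hypothetical):

```python
import json
from pathlib import Path

def load_prompt_file(path: str) -> list[dict]:
    """Normalize both prompt-file formats into a list of
    {"prompt", "expected_response"} dicts. Sketch only."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, list):  # Format 1: array of objects
        return [
            {"prompt": item["prompt"],
             "expected_response": item.get("expected_response")}
            for item in data
        ]
    # Format 2: separate, index-aligned arrays
    prompts = data["prompts"]
    expected = data.get("expected_responses", [])
    return [
        {"prompt": p,
         "expected_response": expected[i] if i < len(expected) else None}
        for i, p in enumerate(prompts)
    ]
```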

## 🔧 Tool Call Accuracy Evaluation

The CLI includes tool call accuracy evaluation, which analyzes how effectively the agent uses available tools:

### What It Evaluates
- **Tool Selection**: Whether the agent chooses appropriate tools for the given task
- **Parameter Accuracy**: Correctness of arguments passed to tool functions
- **Tool Usage Patterns**: Overall effectiveness of tool invocation strategies

### How It Works
1. **Response Analysis**: Extracts tool calls and results from conversation telemetry
2. **Tool Definitions**: Captures available tools from conversation metadata
3. **Accuracy Assessment**: Uses the Azure AI Evaluation SDK's `ToolCallAccuracyEvaluator`
4. **Score Calculation**: Returns a 1–5 score with a pass/fail threshold (default: 3)

### Enhanced Data Extraction
The tool also extracts detailed information from agent responses:
- **Message Flow**: Complete chronological sequence of tool calls and results
- **Tool Definitions**: Available tools and their schemas
- **Internal Filtering**: Removes framework-internal tools for cleaner analysis
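
The extraction-plus-filtering step above can be sketched roughly as follows. The message shape and the internal-tool prefix list here are illustrative assumptions, not the CLI's actual telemetry schema; the real logic lives in `response_extractor.py`.

```python
def extract_tool_calls(messages, internal_prefixes=("internal_",)):
    """Pull tool calls out of a chronological message flow, dropping
    framework-internal tools. Sketch only; schema is an assumption."""
    calls = []
    for msg in messages:
        for call in msg.get("tool_calls", []):
            name = call.get("name", "")
            if name.startswith(internal_prefixes):
                continue  # internal filtering: skip framework tools
            calls.append(call)
    return calls
```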

### Example Output
```bash
📊 Aggregate Statistics (3 prompts):
════════════════════════════════════════════════════════════
Tool Call Accuracy:
  Pass Rate: 66.7% (2/3 passed)
  Avg Score: 0.75
  Threshold: 0.5
```

**Note**: Tool Call Accuracy evaluation only applies when the agent response includes tool invocations. For text-only responses, this metric will not be computed.
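
The aggregate numbers in the example output are simple per-prompt rollups: a prompt passes when its score meets the threshold, the pass rate is passes over total, and the average is the mean score. A sketch under those assumptions (the function name is hypothetical; the real aggregation lives in the CLI's reporting code):

```python
def aggregate_tool_call_accuracy(scores, threshold):
    """Compute the pass rate and average score shown in the aggregate block.
    Sketch only: scores are per-prompt Tool Call Accuracy values."""
    if not scores:
        return {"pass_rate": None, "avg_score": None, "passed": 0, "total": 0}
    passed = sum(1 for s in scores if s >= threshold)
    return {
        "pass_rate": round(100.0 * passed / len(scores), 1),
        "avg_score": round(sum(scores) / len(scores), 2),
        "passed": passed,
        "total": len(scores),
    }
```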

## 🔧 Configuration

### Getting Azure AI Foundry Configuration Values

#### 1. Azure OpenAI Endpoint (`AZURE_AI_OPENAI_ENDPOINT`)

1. Sign in to [Azure AI Foundry](https://ai.azure.com/)
2. Go to **Models + endpoints** under **My assets**
3. Select your model (create one if you don't have one available; `gpt-4o-mini` offers better compatibility with the evaluators)
4. Copy the endpoint URL (format: `https://your-resource.openai.azure.com/`)

#### 2. Azure OpenAI API Key (`AZURE_AI_API_KEY`)

1. Sign in to [Azure AI Foundry](https://ai.azure.com/)
2. Go to **Models + endpoints** under **My assets**
3. Select your model (create one if you don't have one available; `gpt-4o-mini` offers better compatibility with the evaluators)
4. Copy the key

#### 3. Azure OpenAI Model Name (`AZURE_AI_MODEL_NAME`)

1. Sign in to [Azure AI Foundry](https://ai.azure.com/)
2. Go to **Models + endpoints** under **My assets**
3. Select your model (create one if you don't have one available; `gpt-4o-mini` offers better compatibility with the evaluators)
4. Copy the model/deployment name (e.g., `gpt-4o-mini`, `gpt-4`, `gpt-35-turbo`)

#### 4. API Version (`AZURE_AI_API_VERSION`)

Use the API version `2024-12-01-preview`.

For the most current version, check the [Azure OpenAI API reference](https://docs.microsoft.com/en-us/azure/cognitive-services/openai/reference).

## 📁 Project Structure

```
M365-Copilot-Agent-Evals/
├── src/
│   └── clients/
│       └── cli/
│           ├── main.py                  # Main evaluation script
│           ├── response_extractor.py    # Enhanced response parsing and tool extraction
│           ├── generate_report.py       # HTML report generation with aggregates
│           ├── requirements.txt         # Python dependencies
│           ├── readme.md                # This file
│           ├── CHANGELOG.md             # Version history and changes
│           ├── .env.template            # Environment variables template
│           ├── .env                     # Environment variables (create this)
│           ├── custom_evaluators/       # Custom evaluation modules
│           │   ├── __init__.py
│           │   └── CitationsEvaluator.py  # Citation detection evaluator
│           └── samples/                 # Example prompt files
│               ├── example_prompts.json
│               └── prompts_*.json       # Various test scenarios
├── tests/                               # Test files
├── CODE_OF_CONDUCT.md                   # Code of conduct
├── LICENSE                              # License file
├── README.md                            # Root project README
├── SECURITY.md                          # Security guidelines
└── SUPPORT.md                           # Support information
```

## 📊 Features

- **Chat Invocation**: Sends prompts to the Sydney chat API
- **Evaluation Metrics**: Relevance, Coherence, Groundedness, Citations, Tool Call Accuracy
- **Multi-format Citation Detection**: Supports both the new OAI Unicode and legacy bracket formats
- **Tool Call Analysis**: Extracts and evaluates tool invocations from conversation telemetry
- **Enhanced Response Extraction**: Detailed parsing of tool calls, results, and message flow
- **Aggregate Statistics**: Summary metrics across multiple prompts with pass/fail rates
- **Colorized Console Output**
- **Multiple Output Formats**: JSON, CSV, HTML with aggregate dashboards
- **Flexible Prompt Input**: Command line, file, interactive

## 🔒 Security Best Practices

- ✅ All sensitive information is stored in environment variables
- ✅ No hardcoded credentials in source code
- ✅ `.env` files are excluded from version control

## 🔑 Authentication Requirements

- Copilot API: Either a static `COPILOT_API_ACCESS_TOKEN` **or** WAM interactive auth via `M365_EVAL_CLIENT_ID` and `TENANT_ID` (optional `COPILOT_SCOPES`)
- Evaluators: Azure OpenAI key (`AZURE_AI_API_KEY`)

## 📚 Useful Resources

- [Azure AI Foundry Documentation](https://docs.microsoft.com/en-us/azure/ai-services/ai-foundry/)
- [Azure AI Projects Python SDK](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/README.md)
- [Azure AI Evaluation Documentation](https://docs.microsoft.com/en-us/azure/ai-services/ai-foundry/how-to/evaluate-models)
- [Azure OpenAI Service Documentation](https://docs.microsoft.com/en-us/azure/cognitive-services/openai/)
- [Azure Identity Library](https://docs.microsoft.com/en-us/python/api/overview/azure/identity-readme)

## 🐛 Troubleshooting

### Common Issues

1. **Authentication Errors**: Ensure your Azure credentials are properly configured and you have access to the resources.

2. **HTTP 401 / 403**: If using a static token, verify `COPILOT_API_ACCESS_TOKEN` and the required headers. If using WAM, confirm `M365_EVAL_CLIENT_ID`, `X_SCENARIO_HEADER`, `TENANT_ID`, and `COPILOT_SCOPES` are correct.

3. **Endpoint Issues**: Confirm `COPILOT_API_ENDPOINT` (base URL without `/chat`) is reachable (try `curl` / `Invoke-WebRequest`).

### Getting Help

- Check the [Azure AI Foundry troubleshooting guide](https://docs.microsoft.com/en-us/azure/ai-services/ai-foundry/troubleshooting)
- Review the Azure AI Projects SDK [troubleshooting section](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/ai/azure-ai-projects/README.md#troubleshooting)

## 📋 Version History

For detailed information about changes, new features, and breaking changes, see [CHANGELOG.md](CHANGELOG.md).

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Ensure all sensitive information uses environment variables
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.