agentops-toolkit 0.1.0__tar.gz

Files changed (68)

agentops_toolkit-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,704 @@
+Metadata-Version: 2.3
+Name: agentops-toolkit
+Version: 0.1.0
+Summary: CLI toolkit for evaluating, tracing, and monitoring AI agents on Azure AI Foundry
+Keywords: ai,agent,evaluation,azure,foundry,observability
+Author: DB Lee
+Author-email: DB Lee <donlee@microsoft.com>
+License: MIT
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Software Development :: Testing
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Dist: azure-ai-evaluation>=1.0.0b1,<2.0.0
+Requires-Dist: azure-ai-projects>=1.0.0,<3.0.0
+Requires-Dist: azure-identity>=1.17.0,<2.0.0
+Requires-Dist: typer[all]>=0.12.0,<1.0.0
+Requires-Dist: rich>=13.0.0
+Requires-Dist: ruamel-yaml>=0.18.0
+Requires-Dist: pydantic>=2.5.0,<3.0.0
+Requires-Dist: aiofiles>=24.0.0
+Requires-Dist: httpx>=0.27.0,<1.0.0
+Requires-Dist: pytest>=8.0.0 ; extra == 'dev'
+Requires-Dist: pytest-asyncio>=0.24.0 ; extra == 'dev'
+Requires-Dist: pytest-cov>=5.0.0 ; extra == 'dev'
+Requires-Dist: ruff>=0.8.0 ; extra == 'dev'
+Requires-Dist: mypy>=1.11.0 ; extra == 'dev'
+Requires-Dist: pre-commit>=4.0.0 ; extra == 'dev'
+Requires-Dist: azure-search-documents>=11.6.0b6,<12.0.0 ; extra == 'iq'
+Requires-Dist: azure-monitor-opentelemetry>=1.6.0,<2.0.0 ; extra == 'observability'
+Requires-Dist: opentelemetry-sdk>=1.25.0,<2.0.0 ; extra == 'observability'
+Requires-Dist: azure-core-tracing-opentelemetry>=1.0.0b11,<2.0.0 ; extra == 'observability'
+Requires-Dist: opentelemetry-instrumentation-openai-v2>=2.0.0,<3.0.0 ; extra == 'observability'
+Requires-Dist: opentelemetry-exporter-otlp>=1.25.0,<2.0.0 ; extra == 'observability'
+Requires-Python: >=3.12
+Project-URL: Homepage, https://github.com/mcaps-microsoft/agentops-toolkit
+Project-URL: Documentation, https://github.com/mcaps-microsoft/agentops-toolkit#readme
+Project-URL: Repository, https://github.com/mcaps-microsoft/agentops-toolkit
+Project-URL: Issues, https://github.com/mcaps-microsoft/agentops-toolkit/issues
+Provides-Extra: dev
+Provides-Extra: iq
+Provides-Extra: observability
+Description-Content-Type: text/markdown
+# AgentOps Toolkit
+> **Evaluate, trace, and monitor AI agents — from terminal to production.**
+AgentOps is an open-source CLI toolkit that makes it easy to evaluate AI agent applications using [Azure AI Foundry](https://ai.azure.com) evaluators. It provides prescriptive, best-practice bundles of evaluators for common agent patterns — RAG, tool-using agents, and multi-agent systems — so you can go from prototype to production-grade evaluation in minutes, not days.
+```
+pip install agentops-toolkit
+```
+---
+## Why AgentOps?
+| Problem | AgentOps Solution |
+|---|---|
+| Evaluation setup requires significant glue code | One command: `agentops init` scaffolds everything |
+| No standard approach — every team builds differently | Prescriptive evaluator bundles per use case |
+| No CLI workflow for Foundry evaluation | Full CLI with `agentops eval`, `agentops run`, `agentops report` |
+| Model migration breaks things silently | `agentops eval migrate` compares models side-by-side |
+| Copilot users want natural language | `@agentops evaluate my RAG agent` in VS Code Chat |
+---
+## Feature Overview
+### Phase 1 — Evaluation (GA)
+| Feature | Command | Description |
+|---|---|---|
+| **Project scaffolding** | `agentops init` | One-command setup with config, bundles, sample dataset, and directory structure |
+| **Evaluator bundles** | `agentops bundle list\|show\|create` | 13 pre-built bundles for RAG, agents, multi-agent, and IQ knowledge layers + custom bundle creation |
+| **Dataset management** | `agentops dataset list\|validate\|import\|describe` | JSONL/CSV/JSON datasets with schema validation, column coverage checks, and statistics |
+| **Agent execution** | `agentops run start` | Run agent against dataset with async concurrency, streaming persistence, and crash recovery |
+| **Evaluation engine** | `agentops eval run\|entry` | Score agent outputs using 15 Azure AI Foundry evaluators (quality + safety + agent-specific) |
+| **Reporting** | `agentops report show\|export` | Rich terminal reports, markdown for PRs, HTML for stakeholders, JSON for automation |
+| **Run comparison** | `agentops run compare` | Side-by-side evaluator deltas with regression detection |
+| **Custom evaluators** | `CustomEvaluator` subclass | Add domain-specific metrics (finance accuracy, medical safety) via Python + YAML registration |
+| **Python API** | `AgentOpsClient` | Programmatic access: `client.run()`, `client.evaluate()`, `client.compare()` |
+| **CI/CD gating** | `fail_on_threshold: true` | Exit code 1 when evaluator scores drop below thresholds — CI gate ready |
+| **Run management** | `agentops run list\|show` | List past runs with scores; show detailed run metadata and per-entry status |
+| **Config validation** | `agentops config validate\|show` | 15 validation rules with actionable error messages, resolved config display, and env var resolution |
+### Phase 2 — Observability & Integration
+| Feature | Command | Description |
+|---|---|---|
+| **OpenTelemetry tracing** | `agentops trace init\|instrument\|run\|list` | One-command tracing → Application Insights; auto-instruments OpenAI SDK, OpenAI Agents SDK, Semantic Kernel, LangChain |
+| **Local tracing** | `agentops trace init --local` | Console, Aspire Dashboard, or AI Toolkit for VS Code — no cloud required |
+| **Monitoring setup** | `agentops monitor setup\|status` | Wire Agent Monitoring Dashboard, view monitoring health |
+| **Monitoring dashboards** | `agentops monitor dashboard` | Pre-built Azure Monitor templates: agent-overview, eval-quality, safety-monitor |
+| **Alerting** | `agentops monitor alert create\|list` | Azure Monitor alert rules for evaluation metric regressions |
+| **Continuous evaluation** | `agentops eval --continuous` | Sample production traffic, evaluate with configurable sampling rate |
+| **AI Red Teaming** | `agentops eval --red-team` | Adversarial scanning via Foundry AI Red Teaming Agent |
+| **Model migration** | `agentops eval migrate` | Side-by-side model comparison with statistical significance and confidence intervals |
+| **CI/CD pipeline gen** | `agentops eval cicd` | Generate GitHub Actions / Azure DevOps evaluation pipelines |
+| **IQ dataset generation** | `agentops dataset generate --from-iq` | Generate golden datasets from Foundry IQ knowledge bases or Work IQ M365 data |
+| **IQ-specific evaluators** | `CitationAccuracy`, `PermissionCompliance`, `SourceCoverage`, `TemporalRelevance`, `AttributionAccuracy` | Evaluators for agentic retrieval, citation fidelity, ACL enforcement, data freshness, and M365 source attribution |
+| **Model lifecycle** | `agentops model list\|recommend\|benchmark\|quota\|deploy\|retire` | Browse Foundry catalog, compare benchmarks, check quota, deploy/retire via MCP |
+| **MCP eval backend** | `--via mcp` | Alternative evaluation path via Foundry MCP Server for interactive workflows |
+| **MCP monitoring** | `agentops monitor metrics` | Pull model monitoring metrics directly from Foundry MCP Server |
+### Phase 3 — Copilot & Framework Adapters
+| Feature | Command | Description |
+|---|---|---|
+| **Copilot CLI** | `gh copilot suggest` | Natural-language → `agentops` command generation |
+| **Copilot Extension** | `@agentops` in VS Code Chat | Interactive evaluation, inline reports, guided diagnostics |
+| **Semantic Kernel adapter** | Auto-detected | Plugin discovery, kernel I/O capture, planner step tracing |
+| **AutoGen adapter** | Auto-detected | Multi-agent message stream capture, conversation turn evaluation |
+| **Agent Service adapter** | Configured | Pull agent definitions from Foundry, replay threads |
+| **Generic adapter** | `@agentops.trace` decorator | Instrument any framework with one decorator or HTTP endpoint |
+| **Framework auto-detection** | `agentops init` | Scans `pyproject.toml` / `requirements.txt` to pick the right adapter |
+---
+## Quick Start
+### 1. Initialize your project
+```
+$ agentops init --use-case rag --framework semantic-kernel
+✓ AgentOps initialized for project 'my-rag-agent'
+  Use case:  rag
+  Framework: semantic-kernel (auto-detected)
+  Bundle:    rag_quality (4 evaluators)
+  Next steps:
+    1. Add your test data to agentops/datasets/golden_set.jsonl
+    2. Run: agentops run start
+    3. View results: agentops report show latest
+```
+This creates:
+```
+my-rag-agent/
+├── agentops.yaml              # Configuration — your single source of truth
+├── agentops/
+│   ├── bundles/
+│   │   └── rag_quality.yaml   # Evaluator bundle (groundedness + relevance + coherence + fluency)
+│   ├── datasets/
+│   │   └── golden_set.jsonl   # Sample test dataset (add your real data here)
+│   ├── runs/                  # Captured evaluation runs
+│   └── reports/               # Generated reports
+└── src/
+    └── agent.py               # Your agent code
+```
+### 2. Add your test data
+```jsonl
+{"query": "What is our refund policy?", "context": "Our refund policy allows returns within 30 days...", "ground_truth": "You can return items within 30 days for a full refund."}
+{"query": "How do I reset my password?", "context": "To reset your password, go to Settings > Security...", "ground_truth": "Go to Settings > Security and click Reset Password."}
+```
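Before starting a run, it can help to check that every row carries the fields the sample rows use. The following is a minimal sketch using only the standard library; the `REQUIRED_KEYS` tuple and the `find_invalid_rows` helper are hypothetical names inferred from the two sample rows, not part of the toolkit, and the real check performed by `agentops dataset validate` may differ.

```python
import json

# Assumed required keys, taken from the two sample rows above; the
# toolkit's real schema check may differ.
REQUIRED_KEYS = ("query", "context", "ground_truth")


def find_invalid_rows(jsonl_text, required=REQUIRED_KEYS):
    """Return (line_number, problem) pairs for rows that fail the check."""
    problems = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "not a JSON object"))
            continue
        # A key counts as missing when it is absent or empty.
        missing = [key for key in required if not row.get(key)]
        if missing:
            problems.append((number, "missing: " + ", ".join(missing)))
    return problems


sample = "\n".join([
    '{"query": "What is our refund policy?", "context": "Our refund policy...", "ground_truth": "Within 30 days."}',
    '{"query": "How do I reset my password?", "context": "Go to Settings..."}',
])
print(find_invalid_rows(sample))  # [(2, 'missing: ground_truth')]
```

Reporting line numbers rather than raising on the first bad row makes it easier to fix a large dataset in one pass.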
+### 3. Run evaluation
+```
+$ agentops run start
+Running 'default' on 'golden_set' with bundle 'rag_quality'...
+  Agent:    src/agent.py
+  Dataset:  50 entries
+  Bundle:   rag_quality (4 evaluators)
+  ████████████████████████████  50/50 entries  [100%]  ⏱ 3m 42s
+  ✓ 48 success  ✗ 2 errors  ⏭ 0 skipped
+✓ Run 'default' completed (2026-02-26_a1b2c3d4)
+  Evaluator        Mean    Median  Pass Rate
+  ─────────────────────────────────────────
+  groundedness     4.2     4.0     92%
+  relevance        4.5     5.0     96%
+  coherence        4.1     4.0     88%
+  fluency          4.7     5.0     98%
+  Aggregate score: 4.38 / 5.00
+  Overall pass rate: 88%
+  Full report: agentops report show 2026-02-26_a1b2c3d4
+```
+### 4. View the report
+```
+$ agentops report show latest
+╭─────────────────────────────────────────────────────────╮
+│  AgentOps Evaluation Report                             │
+│  Run: 2026-02-26_a1b2c3d4                               │
+│  Date: 2026-02-26 10:35:42                               │
+│  Dataset: golden_set (50 entries)                        │
+│  Bundle: rag_quality                                     │
+╰─────────────────────────────────────────────────────────╯
+  Evaluator        Mean   Med    Min   Max   StdDev  Pass Rate
+  ──────────────────────────────────────────────────────────────
+  groundedness     4.20   4.0    2.0   5.0   0.80    92% ✓
+  relevance        4.50   5.0    3.0   5.0   0.60    96% ✓
+  coherence        4.10   4.0    2.0   5.0   0.90    88% ✓
+  fluency          4.70   5.0    3.0   5.0   0.50    98% ✓
+  Aggregate: 4.38 / 5.00   Pass rate: 88%
+  ⚠ 3 entries below threshold:
+    Entry #12: groundedness=2.0 (threshold: 3.0)
+    Entry #34: coherence=2.0 (threshold: 3.0)
+    Entry #45: coherence=2.0 (threshold: 3.0)
+  Export: agentops report export latest --format html
+```
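The report columns can be reproduced from per-entry scores with a few lines of arithmetic. The sketch below assumes each evaluator yields one 1-5 score per entry and that an entry passes when its score meets the threshold; `summarize` is a hypothetical helper and the per-entry scores are made up. The mean of the four evaluator means in the report is 4.375, which is consistent with the displayed aggregate of 4.38, although the README does not state how the aggregate is actually computed.

```python
from statistics import mean, median


def summarize(scores, threshold=3.0):
    """Summarize one evaluator's per-entry scores (1-5 scale)."""
    passed = sum(1 for score in scores if score >= threshold)
    return {
        "mean": mean(scores),
        "median": median(scores),
        "pass_rate": passed / len(scores),
    }


# Hypothetical per-entry scores, not taken from a real run.
groundedness = [5, 4, 4, 2, 5]
print(summarize(groundedness))

# Evaluator means copied from the report above.
report_means = {
    "groundedness": 4.20,
    "relevance": 4.50,
    "coherence": 4.10,
    "fluency": 4.70,
}
aggregate = mean(report_means.values())
print(f"aggregate = {aggregate:.3f}")  # aggregate = 4.375
```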
+---
+## Features
+### Evaluator Bundles — prescriptive, per use case
+```
+$ agentops bundle list
+ Bundle                Use Case      Evaluators  Description
+────────────────────────────────────────────────────────────────────
+ rag_quality           rag           4           Core quality for RAG
+ rag_safety            rag           5           Safety evaluators for RAG
+ rag_complete          rag           10          Comprehensive RAG evaluation
+ agent_quality         agent         4           Quality for tool-using agents
+ agent_safety          agent         5           Safety for agents
+ multi_agent_quality   multi-agent   5           Quality for orchestrated agents
+ custom                any           0           Empty template for customization
+ rag_foundry_iq        rag           6           RAG + Foundry IQ knowledge bases
+ rag_agentic_retrieval rag           5           RAG + agentic retrieval pipeline
+ rag_permission_aware  rag           4           RAG + ACL enforcement testing
+ rag_fabric_iq         rag           5           RAG + Fabric IQ ontologies
+ rag_work_iq           rag           5           RAG + M365 collaboration data
+ rag_cross_iq          rag           5           RAG + multi-IQ source evaluation
+```
+```
+$ agentops bundle show rag_quality
+Bundle: rag_quality
+Use Case: rag
+Built-in: ✓
+ Evaluator       Category  Inputs                     Score Range  Threshold
+──────────────────────────────────────────────────────────────────────────────
+ groundedness    Quality   query, response, context   1-5          ≥ 3.0
+ relevance       Quality   query, response, context   1-5          ≥ 3.0
+ coherence       Quality   query, response            1-5          ≥ 3.0
+ fluency         Quality   response                   1-5          ≥ 3.0
+```
+### Create custom bundles
+```
+$ agentops bundle create my_safety --evaluators groundedness,violence,hate_unfairness,jailbreak \
+    --threshold groundedness=4.0 --threshold violence=4.5
+✓ Bundle 'my_safety' created with 4 evaluators
+  Saved to: agentops/bundles/my_safety.yaml
+```
+### Dataset management
+```
+$ agentops dataset list
+ Dataset       Format  Entries  Has Context  Has Ground Truth  Path
+──────────────────────────────────────────────────────────────────────
+ golden_set    jsonl   50       ✓            ✓                 agentops/datasets/golden_set.jsonl
+ edge_cases    jsonl   12       ✓            ✗                 agentops/datasets/edge_cases.jsonl
+```
+```
+$ agentops dataset validate golden_set --bundle rag_quality
+✓ Dataset 'golden_set' is valid
+  50 entries parsed
+  Required columns: query ✓, response ✓, context ✓
+  Optional columns: ground_truth ✓
+  Warnings: 0
+```
+### Compare runs — catch regressions
+```
+$ agentops run compare 2026-02-26_a1b2c3d4 2026-02-25_e5f6g7h8
+Comparing: 2026-02-26_a1b2c3d4 vs 2026-02-25_e5f6g7h8
+ Evaluator       Run A (Feb 26)  Run B (Feb 25)  Delta    Trend
+──────────────────────────────────────────────────────────────────
+ groundedness     4.20           3.90            +0.30    ▲ improved
+ relevance        4.50           4.55            -0.05    ≈ stable
+ coherence        4.10           3.80            +0.30    ▲ improved
+ fluency          4.70           4.60            +0.10    ▲ improved
+ Aggregate        4.38           4.12            +0.26    ▲ improved
+ Regressions: 3 entries scored lower on ≥1 evaluator
+   Entry #12: groundedness 3→2 (▼ regression)
+   Entry #34: relevance 4→3 (▼ regression)
+   Entry #45: coherence 5→3 (▼ regression)
+```
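The delta and trend columns follow from subtracting the two runs' means. Here is a small sketch of that comparison; `compare_runs` is a hypothetical function, and the 0.05 tolerance is an assumption chosen so that the sample table's -0.05 reads as stable and +0.10 as improved. The toolkit's actual rule is not documented here.

```python
def compare_runs(run_a, run_b, tolerance=0.05):
    """Return {evaluator: (delta, trend)} for evaluators present in both runs."""
    rows = {}
    for name in run_a:
        if name not in run_b:
            continue
        # Round to two decimals to match the report and absorb float noise.
        delta = round(run_a[name] - run_b[name], 2)
        if abs(delta) <= tolerance:
            trend = "stable"
        elif delta > 0:
            trend = "improved"
        else:
            trend = "regressed"
        rows[name] = (delta, trend)
    return rows


# Means copied from the comparison table above.
run_a = {"groundedness": 4.20, "relevance": 4.50, "coherence": 4.10, "fluency": 4.70}
run_b = {"groundedness": 3.90, "relevance": 4.55, "coherence": 3.80, "fluency": 4.60}

for name, (delta, trend) in compare_runs(run_a, run_b).items():
    print(f"{name:<14}{delta:+.2f}  {trend}")
```

Note that the means can improve while individual entries regress, which is why the sample output lists per-entry regressions separately.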
+### Model migration — evaluate before you switch
+```
+$ agentops eval migrate --from gpt-4o --to gpt-4.1 --dataset golden_set.jsonl
+Running side-by-side evaluation...
+  Model A (gpt-4o):    ████████████████████  50/50  ⏱ 2m 14s
+  Model B (gpt-4.1):   ████████████████████  50/50  ⏱ 1m 52s
+╭───────────────────────────────────────────────────────────╮
+│  Model Migration Report: gpt-4o → gpt-4.1                │
+╰───────────────────────────────────────────────────────────╯
+ Evaluator       gpt-4o   gpt-4.1   Delta    p-value   Verdict
+──────────────────────────────────────────────────────────────────
+ groundedness     4.20     4.35     +0.15    0.021     ▲ significant
+ relevance        4.50     4.48     -0.02    0.814     ≈ no difference
+ coherence        4.10     4.22     +0.12    0.045     ▲ significant
+ fluency          4.70     4.75     +0.05    0.312     ≈ no difference
+ Latency P95      3.4s     2.1s     -38%               ▲ faster
+ Recommendation: ✓ Safe to migrate — no regressions detected
+```
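The README does not say which statistical test produces the p-value column. As one illustration of how a paired comparison could work, the sketch below runs an exact sign-flip permutation test on per-entry score differences. This is a swapped-in technique, not the toolkit's method, and `paired_permutation_p_value` and the scores are hypothetical.

```python
from itertools import product


def paired_permutation_p_value(scores_a, scores_b):
    """Two-sided exact sign-flip test on paired per-entry score differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same entries")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    total = 0
    extreme = 0
    # Under the null hypothesis each difference is equally likely to have
    # either sign, so enumerate every sign assignment.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(sign * diff for sign, diff in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical per-entry scores for eight dataset entries.
model_a = [4, 3, 4, 5, 3, 4, 4, 3]
model_b = [5, 4, 4, 5, 4, 5, 4, 4]
print(paired_permutation_p_value(model_a, model_b))  # 0.0625
```

Exact enumeration doubles in cost with every entry, so for a 50-entry dataset a random sample of sign assignments would be needed instead.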
+### Configuration — one YAML file
+```yaml
+# agentops.yaml — minimal config
+schema_version: "1.0"
+project:
+  name: my-rag-agent
+foundry:
+  project_connection: ${FOUNDRY_CONNECTION}
+agent:
+  framework: semantic-kernel
+  use_case: rag
+  entry_point: src/agent.py
+```
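The `${FOUNDRY_CONNECTION}` placeholder is resolved from the environment. A minimal sketch of that kind of substitution is shown below; `resolve_env` is a hypothetical helper, and the choice to raise on an unset variable is an assumption rather than documented toolkit behavior.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value, env=None):
    """Replace every ${NAME} in value with env[NAME]; fail if NAME is unset."""
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(substitute, value)


# Hypothetical connection value, used only for illustration.
fake_env = {"FOUNDRY_CONNECTION": "example-connection"}
print(resolve_env("${FOUNDRY_CONNECTION}", fake_env))  # example-connection
```

Failing loudly on a missing variable keeps a misconfigured CI job from running against an empty connection string.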
+```
+$ agentops config validate
+✔ Schema version: 1.0
+✔ Project: my-rag-agent
+✔ Foundry connection: ✔ (resolved from FOUNDRY_CONNECTION)
+✔ Agent framework: semantic-kernel (entry_point exists)
+✔ Datasets: 2 defined (golden_set ✔, edge_cases ✔)
+✔ Bundles: 1 custom (my_safety ✔)
+✔ All 15 validation rules passed
+```
+### CI/CD integration — gate on quality
+```yaml
+# .github/workflows/agent-eval.yml
+name: Agent Evaluation
+on: [pull_request]
+jobs:
+  evaluate:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      # The package metadata requires Python >=3.12, so pin the interpreter.
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install agentops-toolkit
+      - run: agentops run start --no-interactive --format json
+        env:
+          FOUNDRY_CONNECTION: ${{ secrets.FOUNDRY_CONNECTION }}
+      - run: agentops report export latest --format markdown >> "$GITHUB_STEP_SUMMARY"
+```
+### Export reports — any format
+```
+$ agentops report export latest --format markdown --output report.md
+✓ Report exported to report.md
+$ agentops report export latest --format html --output report.html
+✓ Report exported to report.html
+$ agentops report export latest --format json | jq '.summary.pass_rate'
+0.88
+```
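The JSON export can also drive a custom gate in a pipeline. The sketch below relies only on the `summary.pass_rate` field shown in the `jq` example; the `gate` function and the 0.85 default are hypothetical, and the rest of the JSON schema is not assumed.

```python
import json


def gate(report_json, minimum_pass_rate=0.85):
    """Return a process exit code: 0 if the pass rate is acceptable, else 1."""
    report = json.loads(report_json)
    pass_rate = report["summary"]["pass_rate"]
    return 0 if pass_rate >= minimum_pass_rate else 1


# Minimal stand-in for the exported report; only this field is known.
exported = '{"summary": {"pass_rate": 0.88}}'
print(gate(exported))                          # 0
print(gate(exported, minimum_pass_rate=0.90))  # 1
```

In a script, passing the result to `sys.exit` would fail the job in the same way the built-in `fail_on_threshold: true` setting is described as doing.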
+---
+## Phase 2: Observability & Monitoring
+### Tracing — one-command OpenTelemetry setup
+```
+$ agentops trace init --project-endpoint $FOUNDRY_PROJECT_ENDPOINT
+✓ Tracing configured for project 'my-rag-agent'
+  Exporter: Application Insights (Foundry-linked)
+  Service:  my-rag-agent
+  Next: agentops trace run -- python src/agent.py
+```
+```
+$ agentops trace init --local --otlp http://localhost:4317
+✓ Local tracing configured
+  Exporter: OTLP → http://localhost:4317 (Aspire Dashboard)
+  Service:  my-rag-agent
+  Next: agentops trace run -- python src/agent.py
+```
+### Monitoring dashboards — pre-built templates
+```
+$ agentops monitor dashboard --template agent-overview
+✓ Dashboard 'Agent Overview' configured
+  Token usage, latency P50/P95/P99, error rates, throughput
+  View: https://portal.azure.com/...#dashboard/agent-overview
+$ agentops monitor alert create \
+    --name "groundedness-regression" \
+    --metric "eval.groundedness.avg" \
+    --operator lt --threshold 3.5 --severity 2
+✓ Alert 'groundedness-regression' created
+  Fires when eval.groundedness.avg < 3.5 over 15m window
+```
+### Continuous evaluation — sample production traffic
+```
+$ agentops eval --continuous \
+    --agent-id my-agent-001 \
+    --evaluators relevance,groundedness,coherence \
+    --sampling-percent 10
+✓ Continuous evaluation enabled for agent 'my-agent-001'
+  Sampling: 10% of interactions
+  Evaluators: relevance, groundedness, coherence
+  Results: Application Insights → Foundry Portal
+```
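How the 10% sample is drawn is not described. One common approach is deterministic hash-based sampling, sketched below, which gives the same decision for the same interaction on every host. The `is_sampled` function and the idea of an interaction ID are assumptions for illustration, not the toolkit's mechanism.

```python
import hashlib


def is_sampled(interaction_id, sampling_percent):
    """Decide deterministically whether an interaction falls in the sample."""
    digest = hashlib.sha256(interaction_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sampling_percent * 100


# Hypothetical interaction IDs.
ids = [f"interaction-{n}" for n in range(10_000)]
sampled = sum(is_sampled(i, 10) for i in ids)
print(f"{sampled / len(ids):.1%} of interactions sampled")
```

Compared with calling a random number generator, hashing keeps the sample stable across retries and restarts.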
+### AI Red Teaming — adversarial testing
+```
+$ agentops eval --red-team \
+    --target-endpoint https://my-agent.azurewebsites.net/api/chat \
+    --risk-categories violence,hate,self_harm \
+    --iterations 100
+Running AI Red Teaming scan...
+  ████████████████████████████  100/100 attacks  ⏱ 8m 12s
+╭───────────────────────────────────────╮
+│  Red Team Scan Results                │
+╰───────────────────────────────────────╯
+ Category       Attacks  Defused  Breached  Rate
+──────────────────────────────────────────────────
+ violence       34       33       1         97%
+ hate           33       33       0         100%
+ self_harm      33       32       1         97%
+ Overall defense rate: 98%
+ ⚠ 2 breaches found — review in agentops/reports/red-team-latest.json
+```
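The rate column is the share of attacks that were defused, and the overall rate pools all categories. The short check below reproduces the figures from the table; the `defense_rate` helper is a hypothetical name.

```python
def defense_rate(attacks, defused):
    """Fraction of attacks that did not breach the target."""
    return defused / attacks


# (attacks, defused) per category, copied from the scan results above.
results = {"violence": (34, 33), "hate": (33, 33), "self_harm": (33, 32)}

for category, (attacks, defused) in results.items():
    print(f"{category}: {defense_rate(attacks, defused):.0%}")

total_attacks = sum(attacks for attacks, _ in results.values())
total_defused = sum(defused for _, defused in results.values())
overall = defense_rate(total_attacks, total_defused)
print(f"overall: {overall:.0%}")  # overall: 98%
```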
+---
+## Phase 2: IQ Knowledge Layer Integration
+### Generate golden datasets from Foundry IQ
+```
+$ agentops dataset generate --from-iq foundry-iq \
+    --queries queries.txt \
+    --output golden_dataset.jsonl \
+    --include-citations
+Querying knowledge base 'my-kb'...
+  ████████████████████████████  25/25 queries  ⏱ 1m 05s
+✓ Dataset generated: golden_dataset.jsonl
+  25 entries with citations from 3 sources
+  Sources: sharepoint://policies, blob://manuals, web://docs
+```
+### Evaluate with IQ-specific bundles
+```
+$ agentops eval --run latest --bundle rag_foundry_iq
+Evaluating with bundle 'rag_foundry_iq' (6 evaluators)...
+ Evaluator          Mean   Pass Rate
+──────────────────────────────────────
+ groundedness       4.30   94%  ✓
+ relevance          4.50   96%  ✓
+ coherence          4.20   90%  ✓
+ fluency            4.60   98%  ✓
+ citation_accuracy  0.92   92%  ✓
+ source_coverage    0.85   85%  ⚠
+ ⚠ source_coverage below threshold (0.85 < 0.90)
+   5 entries missed expected sources — review details with:
+   agentops report show latest --evaluator source_coverage --verbose
+```
+---
+## Phase 2: Foundry MCP Server Integration
+### Model lifecycle — discover, compare, deploy
+```
+$ agentops model recommend --current gpt-4o
+╭───────────────────────────────────────────────────╮
+│  Model Recommendations for gpt-4o                 │
+╰───────────────────────────────────────────────────╯
+ Model       Provider  Reason                          Est. Impact
+──────────────────────────────────────────────────────────────────────
+ gpt-4.1     OpenAI    15% faster, equivalent quality  ▲ latency
+ gpt-4.1-m   OpenAI    40% cheaper, slight quality     ▼ -0.1 score
+ o4-mini     OpenAI    Reasoning model, new capability  + reasoning
+ Next: agentops model benchmark --from gpt-4o --to gpt-4.1
+```
+```
+$ agentops model quota --model gpt-4.1
+ Resource       Available    Limit     Utilization
+──────────────────────────────────────────────────────
+ TPM            120,000      150,000   80%
+ RPM            600          1,000     60%
+ ✓ Sufficient quota for evaluation + production deployment
+```
+---
+## Phase 3: Copilot Integration
+### Natural language via GitHub Copilot CLI
+```
+$ gh copilot suggest "run groundedness eval on my golden dataset"
+→ agentops run start --bundle rag_quality --dataset golden_set
+$ gh copilot suggest "compare my last two evaluation runs"
+→ agentops run compare latest latest~1
+$ gh copilot suggest "set up monitoring for my RAG agent"
+→ agentops monitor setup --service-name my-rag-agent
+```
+### VS Code Chat — `@agentops`
+```
+You: @agentops evaluate my last run and explain the regressions
+AgentOps: I ran evaluation on run `2026-02-26_a1b2c3d4` with bundle `rag_quality`.
+**Results:**
+| Evaluator    | Score | Δ vs Previous | Status  |
+|--------------|-------|---------------|---------|
+| groundedness | 4.2   | +0.3          | ✓       |
+| relevance    | 4.5   | -0.05         | ≈       |
+| coherence    | 4.1   | +0.3          | ✓       |
+| fluency      | 4.7   | +0.1          | ✓       |
+**3 regressions detected:**
+- Entry #12: groundedness dropped 3→2. The agent hallucinated a return policy
+  detail not in the provided context. Consider adding explicit grounding
+  instructions for policy-related queries.
+- Entry #34: relevance dropped 4→3. The response addressed a related but
+  different question. Check the query routing logic.
+- Entry #45: coherence dropped 5→3. The response has contradictory statements
+  in paragraphs 2 and 3.
+```
+---
+## Supported Frameworks
+| Framework | Integration | Status |
+|---|---|---|
+| **Semantic Kernel** | Auto-discover plugins, capture kernel I/O, planner step tracing | Phase 3 |
+| **AutoGen** | Hook message streams, multi-turn conversation capture | Phase 3 |
+| **Azure AI Agent Service** | Pull agent defs from Foundry, replay threads | Phase 3 |
+| **Custom / any framework** | Generic adapter via `@agentops.trace` decorator or HTTP endpoint | Phase 3 |
+---
+## Architecture
+```
+┌──────────────────────────────────────────────┐
+│  GitHub Copilot CLI / VS Code Chat           │  ← Natural language
+│  "evaluate my RAG agent on golden dataset"   │
+└─────────────────┬────────────────────────────┘
+                  │
+                  ▼
+┌──────────────────────────────────────────────┐
+│            AgentOps Toolkit (CLI)             │
+│                                              │
+│  Bundles → Runs → Evaluation → Reports       │
+│  Tracing → Monitoring → Continuous Eval      │
+│  Model Lifecycle → Migration → Retirement    │
+│                                              │
+│  Agent Framework Adapters:                   │
+│  [Semantic Kernel] [AutoGen] [Agent Service] │
+└─────────────────┬────────────────────────────┘
+                  │
+      ┌───────────┼───────────────┐
+      ▼           ▼               ▼
+┌───────────┐ ┌─────────┐ ┌─────────────┐
+│ Foundry   │ │ Foundry  │ │ IQ Knowledge│
+│ Evaluation│ │ MCP      │ │ Layer       │
+│ SDK       │ │ Server   │ │ (Foundry IQ │
+│           │ │          │ │  Fabric IQ  │
+│ Evaluators│ │ Models   │ │  Work IQ)   │
+│ Dashboard │ │ Eval     │ │             │
+│ Tracing   │ │ Monitor  │ │ Retrieval   │
+│ Continuous│ │ Agents   │ │ Citations   │
+└───────────┘ └─────────┘ └─────────────┘
+```
+---
+## Requirements
+- **Python 3.12+**
+- **Azure AI Foundry project** with evaluation APIs enabled
+- **Azure credentials** — `az login` or `DefaultAzureCredential`
+---
+## Installation
+```bash
+# Core toolkit
+pip install agentops-toolkit
+# With observability support (Phase 2); quotes keep shells such as zsh from expanding the brackets
+pip install "agentops-toolkit[observability]"
+# With IQ knowledge layer support (Phase 2)
+pip install "agentops-toolkit[iq]"
+# Everything
+pip install "agentops-toolkit[observability,iq]"
+```
+---
+## Documentation
+| Document | Description |
+|---|---|
+| [Build Plan](design/BUILD_PLAN.md) | Project plan, phases, sprints, team |
+| [Requirements](design/REQUIREMENTS.md) | 96+ functional & non-functional requirements |
+| [Specifications Index](design/SPECIFICATIONS.md) | 9 technical specifications (SPEC-001–009) |
+| [Review Report](design/REVIEW_REPORT.md) | Design consistency & feasibility review |
+---
+## Contributing
+AgentOps is open source. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
+---
+## License
+[MIT](LICENSE)
+---
+*Built with Azure AI Foundry. Designed for developers who ship agents to production.*