aigroup-workflow 2.2.1 → 2.2.2
This diff shows the changes between two publicly released versions of the package as they appear in their public registry. It is provided for informational purposes only.
- package/.claude/commands/fix-build.md +10 -5
- package/.claude/commands/init-project.md +13 -8
- package/.claude/commands/plan.md +15 -8
- package/.claude/commands/review.md +12 -6
- package/.claude/commands/tdd.md +11 -5
- package/.claude/commands/workflow-start.md +20 -11
- package/.claude/settings.json +28 -0
- package/.codex/agents/architect.toml +207 -0
- package/.codex/agents/build-error-resolver.toml +110 -0
- package/.codex/agents/code-reviewer.toml +233 -0
- package/.codex/agents/doc-updater.toml +103 -0
- package/.codex/agents/e2e-runner.toml +103 -0
- package/.codex/agents/get-current-datetime.toml +23 -0
- package/.codex/agents/init-architect.toml +181 -0
- package/.codex/agents/planner.toml +208 -0
- package/.codex/agents/refactor-cleaner.toml +81 -0
- package/.codex/agents/rust-reviewer.toml +90 -0
- package/.codex/agents/security-reviewer.toml +104 -0
- package/.codex/agents/tdd-guide.toml +87 -0
- package/AGENTS.md +2 -2
- package/CLAUDE.md +23 -1
- package/LICENSE +20 -20
- package/README.md +333 -333
- package/agents/a11y-architect.md +141 -141
- package/agents/architect.md +211 -211
- package/agents/build-error-resolver.md +114 -114
- package/agents/chief-of-staff.md +151 -151
- package/agents/code-architect.md +71 -71
- package/agents/code-explorer.md +69 -69
- package/agents/code-reviewer.md +237 -237
- package/agents/code-simplifier.md +47 -47
- package/agents/comment-analyzer.md +45 -45
- package/agents/conversation-analyzer.md +52 -52
- package/agents/cpp-build-resolver.md +90 -90
- package/agents/cpp-reviewer.md +72 -72
- package/agents/csharp-reviewer.md +101 -101
- package/agents/dart-build-resolver.md +201 -201
- package/agents/database-reviewer.md +91 -91
- package/agents/doc-updater.md +107 -107
- package/agents/docs-lookup.md +68 -68
- package/agents/e2e-runner.md +107 -107
- package/agents/flutter-reviewer.md +243 -243
- package/agents/gan-evaluator.md +209 -209
- package/agents/gan-generator.md +131 -131
- package/agents/gan-planner.md +99 -99
- package/agents/get-current-datetime.md +26 -26
- package/agents/go-build-resolver.md +94 -94
- package/agents/go-reviewer.md +76 -76
- package/agents/harness-optimizer.md +35 -35
- package/agents/healthcare-reviewer.md +83 -83
- package/agents/java-build-resolver.md +153 -153
- package/agents/java-reviewer.md +92 -92
- package/agents/kotlin-build-resolver.md +118 -118
- package/agents/kotlin-reviewer.md +159 -159
- package/agents/loop-operator.md +36 -36
- package/agents/opensource-forker.md +198 -198
- package/agents/opensource-packager.md +249 -249
- package/agents/opensource-sanitizer.md +188 -188
- package/agents/performance-optimizer.md +446 -446
- package/agents/planner.md +212 -212
- package/agents/pr-test-analyzer.md +45 -45
- package/agents/python-reviewer.md +98 -98
- package/agents/pytorch-build-resolver.md +120 -120
- package/agents/refactor-cleaner.md +85 -85
- package/agents/rust-build-resolver.md +148 -148
- package/agents/rust-reviewer.md +94 -94
- package/agents/security-reviewer.md +108 -108
- package/agents/seo-specialist.md +59 -59
- package/agents/silent-failure-hunter.md +50 -50
- package/agents/tdd-guide.md +91 -91
- package/agents/type-design-analyzer.md +41 -41
- package/agents/typescript-reviewer.md +112 -112
- package/cli/commands/update.mjs +1 -1
- package/cli/utils/scaffold.mjs +53 -0
- package/docs/rules/agents.md +166 -50
- package/docs/rules/cpp/coding-style.md +44 -44
- package/docs/rules/cpp/hooks.md +39 -39
- package/docs/rules/cpp/patterns.md +51 -51
- package/docs/rules/cpp/security.md +51 -51
- package/docs/rules/cpp/testing.md +44 -44
- package/docs/rules/csharp/coding-style.md +72 -72
- package/docs/rules/csharp/hooks.md +25 -25
- package/docs/rules/csharp/patterns.md +50 -50
- package/docs/rules/csharp/security.md +58 -58
- package/docs/rules/csharp/testing.md +46 -46
- package/docs/rules/dart/coding-style.md +159 -159
- package/docs/rules/dart/hooks.md +66 -66
- package/docs/rules/dart/patterns.md +261 -261
- package/docs/rules/dart/security.md +135 -135
- package/docs/rules/dart/testing.md +215 -215
- package/docs/rules/golang/coding-style.md +32 -32
- package/docs/rules/golang/hooks.md +17 -17
- package/docs/rules/golang/patterns.md +45 -45
- package/docs/rules/golang/security.md +34 -34
- package/docs/rules/golang/testing.md +31 -31
- package/docs/rules/java/coding-style.md +114 -114
- package/docs/rules/java/hooks.md +18 -18
- package/docs/rules/java/patterns.md +146 -146
- package/docs/rules/java/security.md +100 -100
- package/docs/rules/java/testing.md +131 -131
- package/docs/rules/kotlin/coding-style.md +86 -86
- package/docs/rules/kotlin/hooks.md +17 -17
- package/docs/rules/kotlin/patterns.md +146 -146
- package/docs/rules/kotlin/security.md +82 -82
- package/docs/rules/kotlin/testing.md +128 -128
- package/docs/rules/perl/coding-style.md +46 -46
- package/docs/rules/perl/hooks.md +22 -22
- package/docs/rules/perl/patterns.md +76 -76
- package/docs/rules/perl/security.md +69 -69
- package/docs/rules/perl/testing.md +54 -54
- package/docs/rules/php/coding-style.md +40 -40
- package/docs/rules/php/hooks.md +24 -24
- package/docs/rules/php/patterns.md +33 -33
- package/docs/rules/php/security.md +37 -37
- package/docs/rules/php/testing.md +39 -39
- package/docs/rules/python/coding-style.md +42 -42
- package/docs/rules/python/hooks.md +19 -19
- package/docs/rules/python/patterns.md +39 -39
- package/docs/rules/python/security.md +30 -30
- package/docs/rules/python/testing.md +38 -38
- package/docs/rules/rust/coding-style.md +151 -151
- package/docs/rules/rust/hooks.md +16 -16
- package/docs/rules/rust/patterns.md +168 -168
- package/docs/rules/rust/security.md +141 -141
- package/docs/rules/rust/testing.md +154 -154
- package/docs/rules/swift/coding-style.md +47 -47
- package/docs/rules/swift/hooks.md +20 -20
- package/docs/rules/swift/patterns.md +66 -66
- package/docs/rules/swift/security.md +33 -33
- package/docs/rules/swift/testing.md +45 -45
- package/docs/rules/typescript/coding-style.md +199 -199
- package/docs/rules/typescript/hooks.md +22 -22
- package/docs/rules/typescript/patterns.md +52 -52
- package/docs/rules/typescript/security.md +28 -28
- package/docs/rules/typescript/testing.md +18 -18
- package/docs/rules/web/coding-style.md +96 -96
- package/docs/rules/web/design-quality.md +62 -62
- package/docs/rules/web/hooks.md +120 -120
- package/docs/rules/web/patterns.md +79 -79
- package/docs/rules/web/performance.md +64 -64
- package/docs/rules/web/security.md +57 -57
- package/docs/rules/web/testing.md +55 -55
- package/docs/templates/README.md +36 -36
- package/docs/templates/ai-project-final.md +124 -124
- package/docs/templates/ai-project.md +105 -105
- package/docs/templates/api.md +157 -157
- package/docs/templates/bug.md +62 -62
- package/docs/templates/code-review.md +87 -87
- package/docs/templates/generic.md +116 -116
- package/docs/templates/implementation-plan.md +1 -1
- package/docs/templates/meeting.md +68 -68
- package/docs/templates/prd.md +98 -98
- package/docs/templates/ui.md +134 -134
- package/docs/workflow-pipeline.md +5 -5
- package/package.json +40 -39
- package/skills/SUPERPOWERS-LICENSE +21 -21
- package/skills/ai-ml/fine-tuning-expert/SKILL.md +162 -162
- package/skills/ai-ml/fine-tuning-expert/references/dataset-preparation.md +540 -540
- package/skills/ai-ml/fine-tuning-expert/references/deployment-optimization.md +673 -673
- package/skills/ai-ml/fine-tuning-expert/references/evaluation-metrics.md +597 -597
- package/skills/ai-ml/fine-tuning-expert/references/hyperparameter-tuning.md +565 -565
- package/skills/ai-ml/fine-tuning-expert/references/lora-peft.md +347 -347
- package/skills/ai-ml/ml-pipeline/SKILL.md +159 -159
- package/skills/ai-ml/ml-pipeline/references/experiment-tracking.md +833 -833
- package/skills/ai-ml/ml-pipeline/references/feature-engineering.md +631 -631
- package/skills/ai-ml/ml-pipeline/references/model-validation.md +978 -978
- package/skills/ai-ml/ml-pipeline/references/pipeline-orchestration.md +907 -907
- package/skills/ai-ml/ml-pipeline/references/training-pipelines.md +782 -782
- package/skills/ai-ml/rag-architect/SKILL.md +194 -194
- package/skills/ai-ml/rag-architect/references/chunking-strategies.md +878 -878
- package/skills/ai-ml/rag-architect/references/embedding-models.md +561 -561
- package/skills/ai-ml/rag-architect/references/rag-evaluation.md +833 -833
- package/skills/ai-ml/rag-architect/references/retrieval-optimization.md +795 -795
- package/skills/ai-ml/rag-architect/references/vector-databases.md +589 -589
- package/skills/ai-ml/spark-engineer/SKILL.md +148 -148
- package/skills/ai-ml/spark-engineer/references/partitioning-caching.md +543 -543
- package/skills/ai-ml/spark-engineer/references/performance-tuning.md +544 -544
- package/skills/ai-ml/spark-engineer/references/rdd-operations.md +599 -599
- package/skills/ai-ml/spark-engineer/references/spark-sql-dataframes.md +474 -474
- package/skills/ai-ml/spark-engineer/references/streaming-patterns.md +786 -786
- package/skills/backend/api-designer/SKILL.md +217 -217
- package/skills/backend/api-designer/references/error-handling.md +541 -541
- package/skills/backend/api-designer/references/openapi.md +824 -824
- package/skills/backend/api-designer/references/pagination.md +494 -494
- package/skills/backend/api-designer/references/rest-patterns.md +335 -335
- package/skills/backend/api-designer/references/versioning.md +391 -391
- package/skills/backend/architecture-designer/SKILL.md +117 -117
- package/skills/backend/architecture-designer/references/adr-template.md +116 -116
- package/skills/backend/architecture-designer/references/architecture-patterns.md +111 -111
- package/skills/backend/architecture-designer/references/database-selection.md +102 -102
- package/skills/backend/architecture-designer/references/nfr-checklist.md +112 -112
- package/skills/backend/architecture-designer/references/system-design.md +100 -100
- package/skills/backend/code-documenter/SKILL.md +147 -147
- package/skills/backend/code-documenter/references/api-docs-fastapi-django.md +166 -166
- package/skills/backend/code-documenter/references/api-docs-nestjs-express.md +220 -220
- package/skills/backend/code-documenter/references/coverage-reports.md +125 -125
- package/skills/backend/code-documenter/references/documentation-systems.md +333 -333
- package/skills/backend/code-documenter/references/interactive-api-docs.md +531 -531
- package/skills/backend/code-documenter/references/python-docstrings.md +121 -121
- package/skills/backend/code-documenter/references/typescript-jsdoc.md +145 -145
- package/skills/backend/code-documenter/references/user-guides-tutorials.md +530 -530
- package/skills/backend/debugging-wizard/SKILL.md +105 -105
- package/skills/backend/debugging-wizard/references/common-patterns.md +132 -132
- package/skills/backend/debugging-wizard/references/debugging-tools.md +140 -140
- package/skills/backend/debugging-wizard/references/quick-fixes.md +177 -177
- package/skills/backend/debugging-wizard/references/strategies.md +142 -142
- package/skills/backend/debugging-wizard/references/systematic-debugging.md +367 -367
- package/skills/backend/feature-forge/SKILL.md +98 -98
- package/skills/backend/feature-forge/references/acceptance-criteria.md +104 -104
- package/skills/backend/feature-forge/references/ears-syntax.md +99 -99
- package/skills/backend/feature-forge/references/interview-questions.md +150 -150
- package/skills/backend/feature-forge/references/pre-discovery-subagents.md +54 -54
- package/skills/backend/feature-forge/references/specification-template.md +103 -103
- package/skills/backend/fullstack-guardian/SKILL.md +105 -105
- package/skills/backend/fullstack-guardian/references/api-design-standards.md +307 -307
- package/skills/backend/fullstack-guardian/references/architecture-decisions.md +350 -350
- package/skills/backend/fullstack-guardian/references/backend-patterns.md +237 -237
- package/skills/backend/fullstack-guardian/references/common-patterns.md +134 -134
- package/skills/backend/fullstack-guardian/references/deliverables-checklist.md +354 -354
- package/skills/backend/fullstack-guardian/references/design-template.md +91 -91
- package/skills/backend/fullstack-guardian/references/error-handling.md +135 -135
- package/skills/backend/fullstack-guardian/references/frontend-patterns.md +340 -340
- package/skills/backend/fullstack-guardian/references/integration-patterns.md +333 -333
- package/skills/backend/fullstack-guardian/references/security-checklist.md +106 -106
- package/skills/backend/graphql-architect/SKILL.md +146 -146
- package/skills/backend/graphql-architect/references/federation.md +418 -418
- package/skills/backend/graphql-architect/references/migration-from-rest.md +1141 -1141
- package/skills/backend/graphql-architect/references/resolvers.md +425 -425
- package/skills/backend/graphql-architect/references/schema-design.md +393 -393
- package/skills/backend/graphql-architect/references/security.md +569 -569
- package/skills/backend/graphql-architect/references/subscriptions.md +510 -510
- package/skills/backend/legacy-modernizer/SKILL.md +137 -137
- package/skills/backend/legacy-modernizer/references/legacy-testing.md +381 -381
- package/skills/backend/legacy-modernizer/references/migration-strategies.md +423 -423
- package/skills/backend/legacy-modernizer/references/refactoring-patterns.md +395 -395
- package/skills/backend/legacy-modernizer/references/strangler-fig-pattern.md +281 -281
- package/skills/backend/legacy-modernizer/references/system-assessment.md +487 -487
- package/skills/backend/microservices-architect/SKILL.md +164 -164
- package/skills/backend/microservices-architect/references/communication.md +499 -499
- package/skills/backend/microservices-architect/references/data.md +721 -721
- package/skills/backend/microservices-architect/references/decomposition.md +344 -344
- package/skills/backend/microservices-architect/references/observability.md +805 -805
- package/skills/backend/microservices-architect/references/patterns.md +603 -603
- package/skills/database/database-optimizer/SKILL.md +147 -147
- package/skills/database/database-optimizer/references/index-strategies.md +331 -331
- package/skills/database/database-optimizer/references/monitoring-analysis.md +501 -501
- package/skills/database/database-optimizer/references/mysql-tuning.md +452 -452
- package/skills/database/database-optimizer/references/postgresql-tuning.md +413 -413
- package/skills/database/database-optimizer/references/query-optimization.md +251 -251
- package/skills/database/postgres-pro/SKILL.md +152 -152
- package/skills/database/postgres-pro/references/extensions.md +404 -404
- package/skills/database/postgres-pro/references/jsonb.md +321 -321
- package/skills/database/postgres-pro/references/maintenance.md +481 -481
- package/skills/database/postgres-pro/references/performance.md +265 -265
- package/skills/database/postgres-pro/references/replication.md +446 -446
- package/skills/database/sql-pro/SKILL.md +129 -129
- package/skills/database/sql-pro/references/database-design.md +402 -402
- package/skills/database/sql-pro/references/dialect-differences.md +419 -419
- package/skills/database/sql-pro/references/optimization.md +384 -384
- package/skills/database/sql-pro/references/query-patterns.md +285 -285
- package/skills/database/sql-pro/references/window-functions.md +328 -328
- package/skills/dotnet/csharp-developer/SKILL.md +125 -125
- package/skills/dotnet/csharp-developer/references/aspnet-core.md +394 -394
- package/skills/dotnet/csharp-developer/references/blazor.md +553 -553
- package/skills/dotnet/csharp-developer/references/entity-framework.md +409 -409
- package/skills/dotnet/csharp-developer/references/modern-csharp.md +248 -248
- package/skills/dotnet/csharp-developer/references/performance.md +498 -498
- package/skills/dotnet/dotnet-core-expert/SKILL.md +138 -138
- package/skills/dotnet/dotnet-core-expert/references/authentication.md +546 -546
- package/skills/dotnet/dotnet-core-expert/references/clean-architecture.md +455 -455
- package/skills/dotnet/dotnet-core-expert/references/cloud-native.md +548 -548
- package/skills/dotnet/dotnet-core-expert/references/entity-framework.md +440 -440
- package/skills/dotnet/dotnet-core-expert/references/minimal-apis.md +319 -319
- package/skills/frontend/angular-architect/SKILL.md +152 -152
- package/skills/frontend/angular-architect/references/components.md +297 -297
- package/skills/frontend/angular-architect/references/ngrx.md +401 -401
- package/skills/frontend/angular-architect/references/routing.md +361 -361
- package/skills/frontend/angular-architect/references/rxjs.md +319 -319
- package/skills/frontend/angular-architect/references/testing.md +405 -405
- package/skills/frontend/design-commands/design.md +91 -91
- package/skills/frontend/design-commands/handoff.md +97 -97
- package/skills/frontend/design-commands/prototype.md +120 -120
- package/skills/frontend/design-commands/spec.md +160 -160
- package/skills/frontend/design-commands/style.md +78 -78
- package/skills/frontend/flutter-expert/SKILL.md +138 -138
- package/skills/frontend/flutter-expert/references/bloc-state.md +259 -259
- package/skills/frontend/flutter-expert/references/gorouter-navigation.md +119 -119
- package/skills/frontend/flutter-expert/references/performance.md +99 -99
- package/skills/frontend/flutter-expert/references/project-structure.md +118 -118
- package/skills/frontend/flutter-expert/references/riverpod-state.md +130 -130
- package/skills/frontend/flutter-expert/references/widget-patterns.md +123 -123
- package/skills/frontend/nextjs-developer/SKILL.md +143 -143
- package/skills/frontend/nextjs-developer/references/app-router.md +311 -311
- package/skills/frontend/nextjs-developer/references/data-fetching.md +482 -482
- package/skills/frontend/nextjs-developer/references/deployment.md +545 -545
- package/skills/frontend/nextjs-developer/references/server-actions.md +462 -462
- package/skills/frontend/nextjs-developer/references/server-components.md +384 -384
- package/skills/frontend/react-expert/SKILL.md +149 -149
- package/skills/frontend/react-expert/references/hooks-patterns.md +162 -162
- package/skills/frontend/react-expert/references/migration-class-to-modern.md +1119 -1119
- package/skills/frontend/react-expert/references/performance.md +168 -168
- package/skills/frontend/react-expert/references/react-19-features.md +174 -174
- package/skills/frontend/react-expert/references/server-components.md +143 -143
- package/skills/frontend/react-expert/references/state-management.md +171 -171
- package/skills/frontend/react-expert/references/testing-react.md +174 -174
- package/skills/frontend/react-native-expert/SKILL.md +185 -185
- package/skills/frontend/react-native-expert/references/expo-router.md +187 -187
- package/skills/frontend/react-native-expert/references/list-optimization.md +204 -204
- package/skills/frontend/react-native-expert/references/platform-handling.md +188 -188
- package/skills/frontend/react-native-expert/references/project-structure.md +171 -171
- package/skills/frontend/react-native-expert/references/storage-hooks.md +173 -173
- package/skills/frontend/senior-frontend/SKILL.md +477 -477
- package/skills/frontend/senior-frontend/references/frontend_best_practices.md +806 -806
- package/skills/frontend/senior-frontend/references/nextjs_optimization_guide.md +724 -724
- package/skills/frontend/senior-frontend/references/react_patterns.md +746 -746
- package/skills/frontend/senior-frontend/scripts/bundle_analyzer.py +407 -407
- package/skills/frontend/senior-frontend/scripts/component_generator.py +329 -329
- package/skills/frontend/senior-frontend/scripts/frontend_scaffolder.py +1005 -1005
- package/skills/frontend/ui-ux-pro-max/SKILL.md +386 -386
- package/skills/frontend/ui-ux-pro-max/data/charts.csv +26 -26
- package/skills/frontend/ui-ux-pro-max/data/colors.csv +97 -97
- package/skills/frontend/ui-ux-pro-max/data/icons.csv +101 -101
- package/skills/frontend/ui-ux-pro-max/data/landing.csv +31 -31
- package/skills/frontend/ui-ux-pro-max/data/products.csv +96 -96
- package/skills/frontend/ui-ux-pro-max/data/react-performance.csv +45 -45
- package/skills/frontend/ui-ux-pro-max/data/stacks/astro.csv +54 -54
- package/skills/frontend/ui-ux-pro-max/data/stacks/flutter.csv +53 -53
- package/skills/frontend/ui-ux-pro-max/data/stacks/html-tailwind.csv +56 -56
- package/skills/frontend/ui-ux-pro-max/data/stacks/jetpack-compose.csv +53 -53
- package/skills/frontend/ui-ux-pro-max/data/stacks/nextjs.csv +53 -53
- package/skills/frontend/ui-ux-pro-max/data/stacks/nuxt-ui.csv +51 -51
- package/skills/frontend/ui-ux-pro-max/data/stacks/nuxtjs.csv +59 -59
- package/skills/frontend/ui-ux-pro-max/data/stacks/react-native.csv +52 -52
- package/skills/frontend/ui-ux-pro-max/data/stacks/react.csv +54 -54
- package/skills/frontend/ui-ux-pro-max/data/stacks/shadcn.csv +61 -61
- package/skills/frontend/ui-ux-pro-max/data/stacks/svelte.csv +54 -54
- package/skills/frontend/ui-ux-pro-max/data/stacks/swiftui.csv +51 -51
- package/skills/frontend/ui-ux-pro-max/data/stacks/vue.csv +50 -50
- package/skills/frontend/ui-ux-pro-max/data/styles.csv +68 -68
- package/skills/frontend/ui-ux-pro-max/data/typography.csv +57 -57
- package/skills/frontend/ui-ux-pro-max/data/ui-reasoning.csv +101 -101
- package/skills/frontend/ui-ux-pro-max/data/ux-guidelines.csv +99 -99
- package/skills/frontend/ui-ux-pro-max/data/web-interface.csv +31 -31
- package/skills/frontend/ui-ux-pro-max/scripts/core.py +253 -253
- package/skills/frontend/ui-ux-pro-max/scripts/design_system.py +1067 -1067
- package/skills/frontend/ui-ux-pro-max/scripts/search.py +114 -114
- package/skills/frontend/vue-expert/SKILL.md +98 -98
- package/skills/frontend/vue-expert/references/build-tooling.md +480 -480
- package/skills/frontend/vue-expert/references/components.md +448 -448
- package/skills/frontend/vue-expert/references/composition-api.md +299 -299
- package/skills/frontend/vue-expert/references/mobile-hybrid.md +636 -636
- package/skills/frontend/vue-expert/references/nuxt.md +669 -669
- package/skills/frontend/vue-expert/references/state-management.md +449 -449
- package/skills/frontend/vue-expert/references/typescript.md +584 -584
- package/skills/frontend/vue-expert-js/SKILL.md +167 -167
- package/skills/frontend/vue-expert-js/references/component-architecture.md +219 -219
- package/skills/frontend/vue-expert-js/references/composables-patterns.md +183 -183
- package/skills/frontend/vue-expert-js/references/jsdoc-typing.md +535 -535
- package/skills/frontend/vue-expert-js/references/state-management.md +249 -249
- package/skills/frontend/vue-expert-js/references/testing-patterns.md +237 -237
- package/skills/go-rust-cpp/cpp-pro/SKILL.md +115 -115
- package/skills/go-rust-cpp/cpp-pro/references/build-tooling.md +440 -440
- package/skills/go-rust-cpp/cpp-pro/references/concurrency.md +437 -437
- package/skills/go-rust-cpp/cpp-pro/references/memory-performance.md +397 -397
- package/skills/go-rust-cpp/cpp-pro/references/modern-cpp.md +304 -304
- package/skills/go-rust-cpp/cpp-pro/references/templates.md +357 -357
- package/skills/go-rust-cpp/golang-pro/SKILL.md +122 -122
- package/skills/go-rust-cpp/golang-pro/references/concurrency.md +329 -329
- package/skills/go-rust-cpp/golang-pro/references/generics.md +442 -442
- package/skills/go-rust-cpp/golang-pro/references/interfaces.md +432 -432
- package/skills/go-rust-cpp/golang-pro/references/project-structure.md +477 -477
- package/skills/go-rust-cpp/golang-pro/references/testing.md +451 -451
- package/skills/go-rust-cpp/rust-engineer/SKILL.md +167 -167
- package/skills/go-rust-cpp/rust-engineer/references/async.md +458 -458
- package/skills/go-rust-cpp/rust-engineer/references/error-handling.md +334 -334
- package/skills/go-rust-cpp/rust-engineer/references/ownership.md +278 -278
- package/skills/go-rust-cpp/rust-engineer/references/testing.md +470 -470
- package/skills/go-rust-cpp/rust-engineer/references/traits.md +413 -413
- package/skills/infra/cli-developer/SKILL.md +113 -113
- package/skills/infra/cli-developer/references/design-patterns.md +221 -221
- package/skills/infra/cli-developer/references/go-cli.md +540 -540
- package/skills/infra/cli-developer/references/node-cli.md +383 -383
- package/skills/infra/cli-developer/references/python-cli.md +422 -422
- package/skills/infra/cli-developer/references/ux-patterns.md +448 -448
- package/skills/infra/cloud-architect/SKILL.md +216 -216
- package/skills/infra/cloud-architect/references/aws.md +394 -394
- package/skills/infra/cloud-architect/references/azure.md +562 -562
- package/skills/infra/cloud-architect/references/cost.md +582 -582
- package/skills/infra/cloud-architect/references/gcp.md +633 -633
- package/skills/infra/cloud-architect/references/multi-cloud.md +483 -483
- package/skills/infra/devops-engineer/SKILL.md +144 -144
- package/skills/infra/devops-engineer/references/deployment-strategies.md +241 -241
- package/skills/infra/devops-engineer/references/docker-patterns.md +113 -113
- package/skills/infra/devops-engineer/references/github-actions.md +139 -139
- package/skills/infra/devops-engineer/references/incident-response.md +331 -331
- package/skills/infra/devops-engineer/references/kubernetes.md +154 -154
- package/skills/infra/devops-engineer/references/platform-engineering.md +417 -417
- package/skills/infra/devops-engineer/references/release-automation.md +527 -527
- package/skills/infra/devops-engineer/references/terraform-iac.md +141 -141
- package/skills/infra/kubernetes-specialist/SKILL.md +241 -241
- package/skills/infra/kubernetes-specialist/references/configuration.md +452 -452
- package/skills/infra/kubernetes-specialist/references/cost-optimization.md +458 -458
- package/skills/infra/kubernetes-specialist/references/custom-operators.md +563 -563
- package/skills/infra/kubernetes-specialist/references/gitops.md +530 -530
- package/skills/infra/kubernetes-specialist/references/helm-charts.md +912 -912
- package/skills/infra/kubernetes-specialist/references/multi-cluster.md +507 -507
- package/skills/infra/kubernetes-specialist/references/networking.md +447 -447
- package/skills/infra/kubernetes-specialist/references/service-mesh.md +459 -459
- package/skills/infra/kubernetes-specialist/references/storage.md +535 -535
- package/skills/infra/kubernetes-specialist/references/troubleshooting.md +414 -414
- package/skills/infra/kubernetes-specialist/references/workloads.md +377 -377
- package/skills/infra/mcp-developer/SKILL.md +143 -143
- package/skills/infra/mcp-developer/references/protocol.md +244 -244
- package/skills/infra/mcp-developer/references/python-sdk.md +367 -367
- package/skills/infra/mcp-developer/references/resources.md +554 -554
- package/skills/infra/mcp-developer/references/tools.md +480 -480
- package/skills/infra/mcp-developer/references/typescript-sdk.md +350 -350
- package/skills/infra/monitoring-expert/SKILL.md +176 -176
- package/skills/infra/monitoring-expert/references/alerting-rules.md +141 -141
- package/skills/infra/monitoring-expert/references/application-profiling.md +331 -331
- package/skills/infra/monitoring-expert/references/capacity-planning.md +344 -344
- package/skills/infra/monitoring-expert/references/dashboards.md +126 -126
- package/skills/infra/monitoring-expert/references/opentelemetry.md +123 -123
- package/skills/infra/monitoring-expert/references/performance-testing.md +269 -269
- package/skills/infra/monitoring-expert/references/prometheus-metrics.md +136 -136
- package/skills/infra/monitoring-expert/references/structured-logging.md +142 -142
- package/skills/infra/sre-engineer/SKILL.md +181 -181
- package/skills/infra/sre-engineer/references/automation-toil.md +492 -492
- package/skills/infra/sre-engineer/references/error-budget-policy.md +334 -334
- package/skills/infra/sre-engineer/references/incident-chaos.md +576 -576
- package/skills/infra/sre-engineer/references/monitoring-alerting.md +424 -424
- package/skills/infra/sre-engineer/references/slo-sli-management.md +238 -238
- package/skills/infra/terraform-engineer/SKILL.md +143 -143
- package/skills/infra/terraform-engineer/references/best-practices.md +583 -583
- package/skills/infra/terraform-engineer/references/module-patterns.md +297 -297
- package/skills/infra/terraform-engineer/references/providers.md +452 -452
- package/skills/infra/terraform-engineer/references/state-management.md +371 -371
- package/skills/infra/terraform-engineer/references/testing.md +486 -486
- package/skills/infra/websocket-engineer/SKILL.md +168 -168
- package/skills/infra/websocket-engineer/references/alternatives.md +391 -391
- package/skills/infra/websocket-engineer/references/patterns.md +400 -400
- package/skills/infra/websocket-engineer/references/protocol.md +195 -195
- package/skills/infra/websocket-engineer/references/scaling.md +333 -333
- package/skills/infra/websocket-engineer/references/security.md +474 -474
- package/skills/java/java-architect/SKILL.md +132 -132
- package/skills/java/java-architect/references/jpa-optimization.md +393 -393
- package/skills/java/java-architect/references/reactive-webflux.md +356 -356
- package/skills/java/java-architect/references/spring-boot-setup.md +269 -269
- package/skills/java/java-architect/references/spring-security.md +445 -445
- package/skills/java/java-architect/references/testing-patterns.md +500 -500
- package/skills/java/kotlin-specialist/SKILL.md +147 -147
- package/skills/java/kotlin-specialist/references/android-compose.md +419 -419
- package/skills/java/kotlin-specialist/references/coroutines-flow.md +276 -276
- package/skills/java/kotlin-specialist/references/dsl-idioms.md +421 -421
- package/skills/java/kotlin-specialist/references/ktor-server.md +426 -426
- package/skills/java/kotlin-specialist/references/multiplatform-kmp.md +380 -380
- package/skills/java/spring-boot-engineer/SKILL.md +195 -195
- package/skills/java/spring-boot-engineer/references/cloud.md +498 -498
- package/skills/java/spring-boot-engineer/references/data.md +381 -381
- package/skills/java/spring-boot-engineer/references/security.md +459 -459
- package/skills/java/spring-boot-engineer/references/testing.md +545 -545
- package/skills/java/spring-boot-engineer/references/web.md +295 -295
- package/skills/javascript/javascript-pro/SKILL.md +132 -132
- package/skills/javascript/javascript-pro/references/async-patterns.md +334 -334
- package/skills/javascript/javascript-pro/references/browser-apis.md +398 -398
- package/skills/javascript/javascript-pro/references/modern-syntax.md +272 -272
- package/skills/javascript/javascript-pro/references/modules.md +357 -357
- package/skills/javascript/javascript-pro/references/node-essentials.md +471 -471
- package/skills/javascript/nestjs-expert/SKILL.md +206 -206
- package/skills/javascript/nestjs-expert/references/authentication.md +166 -166
- package/skills/javascript/nestjs-expert/references/controllers-routing.md +111 -111
- package/skills/javascript/nestjs-expert/references/dtos-validation.md +153 -153
- package/skills/javascript/nestjs-expert/references/migration-from-express.md +1237 -1237
- package/skills/javascript/nestjs-expert/references/services-di.md +140 -140
- package/skills/javascript/nestjs-expert/references/testing-patterns.md +186 -186
- package/skills/javascript/typescript-pro/SKILL.md +145 -145
- package/skills/javascript/typescript-pro/references/advanced-types.md +259 -259
- package/skills/javascript/typescript-pro/references/configuration.md +445 -445
- package/skills/javascript/typescript-pro/references/patterns.md +484 -484
- package/skills/javascript/typescript-pro/references/type-guards.md +352 -352
- package/skills/javascript/typescript-pro/references/utility-types.md +329 -329
- package/skills/php/laravel-specialist/SKILL.md +262 -262
- package/skills/php/laravel-specialist/references/eloquent.md +351 -351
- package/skills/php/laravel-specialist/references/livewire.md +512 -512
- package/skills/php/laravel-specialist/references/queues.md +423 -423
- package/skills/php/laravel-specialist/references/routing.md +362 -362
- package/skills/php/laravel-specialist/references/testing.md +522 -522
- package/skills/php/php-pro/SKILL.md +206 -206
- package/skills/php/php-pro/references/async-patterns.md +412 -412
- package/skills/php/php-pro/references/laravel-patterns.md +377 -377
- package/skills/php/php-pro/references/modern-php-features.md +323 -323
- package/skills/php/php-pro/references/symfony-patterns.md +466 -466
- package/skills/php/php-pro/references/testing-quality.md +466 -466
- package/skills/product/competitive-analysis/SKILL.md +257 -257
- package/skills/product/meeting-notes/SKILL.md +266 -266
- package/skills/product/prd-template/SKILL.md +150 -150
- package/skills/product/stakeholder-update/SKILL.md +225 -225
- package/skills/product/user-research-synthesis/SKILL.md +235 -235
- package/skills/python/django-expert/SKILL.md +162 -162
- package/skills/python/django-expert/references/authentication.md +145 -145
- package/skills/python/django-expert/references/drf-serializers.md +148 -148
- package/skills/python/django-expert/references/models-orm.md +151 -151
- package/skills/python/django-expert/references/testing-django.md +204 -204
- package/skills/python/django-expert/references/viewsets-views.md +153 -153
- package/skills/python/fastapi-expert/SKILL.md +185 -185
- package/skills/python/fastapi-expert/references/async-sqlalchemy.md +146 -146
- package/skills/python/fastapi-expert/references/authentication.md +159 -159
- package/skills/python/fastapi-expert/references/endpoints-routing.md +142 -142
- package/skills/python/fastapi-expert/references/migration-from-django.md +996 -996
- package/skills/python/fastapi-expert/references/pydantic-v2.md +135 -135
- package/skills/python/fastapi-expert/references/testing-async.md +159 -159
- package/skills/python/pandas-pro/SKILL.md +178 -178
- package/skills/python/pandas-pro/references/aggregation-groupby.md +545 -545
- package/skills/python/pandas-pro/references/data-cleaning.md +500 -500
- package/skills/python/pandas-pro/references/dataframe-operations.md +420 -420
- package/skills/python/pandas-pro/references/merging-joining.md +596 -596
- package/skills/python/pandas-pro/references/performance-optimization.md +597 -597
- package/skills/python/python-pro/SKILL.md +177 -177
- package/skills/python/python-pro/references/async-patterns.md +356 -356
- package/skills/python/python-pro/references/packaging.md +460 -460
- package/skills/python/python-pro/references/standard-library.md +378 -378
- package/skills/python/python-pro/references/testing.md +404 -404
- package/skills/python/python-pro/references/type-system.md +290 -290
- package/skills/quality/chaos-engineer/SKILL.md +182 -182
- package/skills/quality/chaos-engineer/references/chaos-tools.md +511 -511
- package/skills/quality/chaos-engineer/references/experiment-design.md +229 -229
- package/skills/quality/chaos-engineer/references/game-days.md +434 -434
- package/skills/quality/chaos-engineer/references/infrastructure-chaos.md +348 -348
- package/skills/quality/chaos-engineer/references/kubernetes-chaos.md +432 -432
- package/skills/quality/code-reviewer/SKILL.md +119 -119
- package/skills/quality/code-reviewer/references/common-issues.md +142 -142
- package/skills/quality/code-reviewer/references/feedback-examples.md +144 -144
- package/skills/quality/code-reviewer/references/receiving-feedback.md +238 -238
- package/skills/quality/code-reviewer/references/report-template.md +109 -109
- package/skills/quality/code-reviewer/references/review-checklist.md +88 -88
- package/skills/quality/code-reviewer/references/spec-compliance-review.md +258 -258
- package/skills/quality/playwright-expert/SKILL.md +169 -169
- package/skills/quality/playwright-expert/references/api-mocking.md +140 -140
- package/skills/quality/playwright-expert/references/configuration.md +155 -155
- package/skills/quality/playwright-expert/references/debugging-flaky.md +150 -150
- package/skills/quality/playwright-expert/references/page-object-model.md +152 -152
- package/skills/quality/playwright-expert/references/selectors-locators.md +119 -119
- package/skills/quality/secure-code-guardian/SKILL.md +191 -191
- package/skills/quality/secure-code-guardian/references/authentication.md +136 -136
- package/skills/quality/secure-code-guardian/references/input-validation.md +146 -146
- package/skills/quality/secure-code-guardian/references/owasp-prevention.md +135 -135
- package/skills/quality/secure-code-guardian/references/security-headers.md +133 -133
- package/skills/quality/secure-code-guardian/references/xss-csrf.md +157 -157
- package/skills/quality/security-reviewer/SKILL.md +103 -103
- package/skills/quality/security-reviewer/references/infrastructure-security.md +268 -268
- package/skills/quality/security-reviewer/references/penetration-testing.md +268 -268
- package/skills/quality/security-reviewer/references/report-template.md +170 -170
- package/skills/quality/security-reviewer/references/sast-tools.md +117 -117
- package/skills/quality/security-reviewer/references/secret-scanning.md +125 -125
- package/skills/quality/security-reviewer/references/vulnerability-patterns.md +152 -152
- package/skills/quality/senior-qa/README.md +196 -196
- package/skills/quality/senior-qa/SKILL.md +399 -399
- package/skills/quality/senior-qa/references/qa_best_practices.md +964 -964
- package/skills/quality/senior-qa/references/test_automation_patterns.md +1009 -1009
- package/skills/quality/senior-qa/references/testing_strategies.md +649 -649
- package/skills/quality/senior-qa/scripts/coverage_analyzer.py +836 -836
- package/skills/quality/senior-qa/scripts/e2e_test_scaffolder.py +820 -820
- package/skills/quality/senior-qa/scripts/test_suite_generator.py +605 -605
- package/skills/quality/tdd-guide/HOW_TO_USE.md +313 -313
- package/skills/quality/tdd-guide/README.md +680 -680
- package/skills/quality/tdd-guide/SKILL.md +122 -122
- package/skills/quality/tdd-guide/assets/expected_output.json +77 -77
- package/skills/quality/tdd-guide/assets/sample_input_python.json +39 -39
- package/skills/quality/tdd-guide/assets/sample_input_typescript.json +36 -36
- package/skills/quality/tdd-guide/references/ci-integration.md +195 -195
- package/skills/quality/tdd-guide/references/framework-guide.md +206 -206
- package/skills/quality/tdd-guide/references/tdd-best-practices.md +128 -128
- package/skills/quality/tdd-guide/scripts/coverage_analyzer.py +434 -434
- package/skills/quality/tdd-guide/scripts/fixture_generator.py +440 -440
- package/skills/quality/tdd-guide/scripts/format_detector.py +384 -384
- package/skills/quality/tdd-guide/scripts/framework_adapter.py +428 -428
- package/skills/quality/tdd-guide/scripts/metrics_calculator.py +456 -456
- package/skills/quality/tdd-guide/scripts/output_formatter.py +354 -354
- package/skills/quality/tdd-guide/scripts/tdd_workflow.py +474 -474
- package/skills/quality/tdd-guide/scripts/test_generator.py +438 -438
- package/skills/quality/test-master/SKILL.md +94 -94
- package/skills/quality/test-master/references/automation-frameworks.md +294 -294
- package/skills/quality/test-master/references/e2e-testing.md +128 -128
- package/skills/quality/test-master/references/integration-testing.md +120 -120
- package/skills/quality/test-master/references/performance-testing.md +118 -118
- package/skills/quality/test-master/references/qa-methodology.md +247 -247
- package/skills/quality/test-master/references/security-testing.md +127 -127
- package/skills/quality/test-master/references/tdd-iron-laws.md +174 -174
- package/skills/quality/test-master/references/test-reports.md +104 -104
- package/skills/quality/test-master/references/testing-anti-patterns.md +231 -231
- package/skills/quality/test-master/references/unit-testing.md +113 -113
- package/skills/ruby/rails-expert/SKILL.md +154 -154
- package/skills/ruby/rails-expert/references/active-record.md +244 -244
- package/skills/ruby/rails-expert/references/api-development.md +401 -401
- package/skills/ruby/rails-expert/references/background-jobs.md +272 -272
- package/skills/ruby/rails-expert/references/hotwire-turbo.md +228 -228
- package/skills/ruby/rails-expert/references/rspec-testing.md +367 -367
- package/skills/swift/swift-expert/SKILL.md +163 -163
- package/skills/swift/swift-expert/references/async-concurrency.md +360 -360
- package/skills/swift/swift-expert/references/memory-performance.md +377 -377
- package/skills/swift/swift-expert/references/protocol-oriented.md +354 -354
- package/skills/swift/swift-expert/references/swiftui-patterns.md +291 -291
- package/skills/swift/swift-expert/references/testing-patterns.md +399 -399
- package/skills/workflow/brainstorming/SKILL.md +164 -164
- package/skills/workflow/brainstorming/scripts/frame-template.html +214 -214
- package/skills/workflow/brainstorming/scripts/helper.js +88 -88
- package/skills/workflow/brainstorming/scripts/server.cjs +354 -354
- package/skills/workflow/brainstorming/scripts/start-server.sh +148 -148
- package/skills/workflow/brainstorming/scripts/stop-server.sh +56 -56
- package/skills/workflow/brainstorming/spec-document-reviewer-prompt.md +49 -49
- package/skills/workflow/brainstorming/visual-companion.md +287 -287
- package/skills/workflow/documentation/SKILL.md +45 -45
- package/skills/workflow/entropy-management/SKILL.md +115 -115
- package/skills/workflow/executing-plans/SKILL.md +70 -70
- package/skills/workflow/finishing-a-development-branch/SKILL.md +200 -200
- package/skills/workflow/receiving-code-review/SKILL.md +213 -213
- package/skills/workflow/requesting-code-review/SKILL.md +105 -105
- package/skills/workflow/requesting-code-review/code-reviewer.md +146 -146
- package/skills/workflow/requirement-engineering/SKILL.md +111 -111
- package/skills/workflow/systematic-debugging/CREATION-LOG.md +119 -119
- package/skills/workflow/systematic-debugging/SKILL.md +296 -296
- package/skills/workflow/systematic-debugging/condition-based-waiting-example.ts +158 -158
- package/skills/workflow/systematic-debugging/condition-based-waiting.md +115 -115
- package/skills/workflow/systematic-debugging/defense-in-depth.md +122 -122
- package/skills/workflow/systematic-debugging/find-polluter.sh +63 -63
- package/skills/workflow/systematic-debugging/root-cause-tracing.md +169 -169
- package/skills/workflow/systematic-debugging/test-academic.md +14 -14
- package/skills/workflow/systematic-debugging/test-pressure-1.md +58 -58
- package/skills/workflow/systematic-debugging/test-pressure-2.md +68 -68
- package/skills/workflow/systematic-debugging/test-pressure-3.md +69 -69
- package/skills/workflow/using-git-worktrees/SKILL.md +218 -218
- package/skills/workflow/verification-before-completion/SKILL.md +139 -139
- package/skills/workflow/writing-plans/SKILL.md +151 -151
- package/skills/workflow/writing-plans/plan-document-reviewer-prompt.md +49 -49
- package/skills/workflow/writing-skills/SKILL.md +655 -655
- package/skills/workflow/writing-skills/anthropic-best-practices.md +1150 -1150
- package/skills/workflow/writing-skills/examples/CLAUDE_MD_TESTING.md +189 -189
- package/skills/workflow/writing-skills/persuasion-principles.md +187 -187
- package/skills/workflow/writing-skills/render-graphs.js +168 -168
- package/skills/workflow/writing-skills/testing-skills-with-subagents.md +384 -384
|
@@ -1,833 +1,833 @@
# RAG Evaluation

---

## Evaluation Framework Overview

| Framework | Focus | Strengths | Use Case |
|-----------|-------|-----------|----------|
| **RAGAS** | RAG-specific metrics | Faithfulness, relevance | Production RAG evaluation |
| **TruLens** | LLM app observability | Tracing, feedback functions | Debugging and monitoring |
| **LangSmith** | LangChain ecosystem | Traces, datasets, testing | LangChain projects |
| **Custom** | Specific requirements | Full control | Domain-specific needs |

---

## Core Metrics

### Retrieval Metrics

| Metric | Formula | What It Measures |
|--------|---------|------------------|
| **Precision@k** | Relevant in top-k / k | Are retrieved docs relevant? |
| **Recall@k** | Relevant in top-k / Total relevant | Did we get all relevant docs? |
| **MRR** | Mean over queries of (1 / Rank of first relevant) | How quickly do we find a relevant doc? |
| **NDCG@k** | DCG@k / IDCG@k | Is ranking order correct? |
| **Hit Rate** | Queries with relevant in top-k / Total queries | Binary success rate |

### Generation Metrics

| Metric | What It Measures |
|--------|------------------|
| **Faithfulness** | Is answer grounded in retrieved context? |
| **Answer Relevance** | Does answer address the question? |
| **Context Relevance** | Is retrieved context relevant to question? |
| **Context Utilization** | How much context was actually used? |

---

## Implementing Core Metrics

### Precision, Recall, and Hit Rate

```python
from dataclasses import dataclass

@dataclass
class RetrievalMetrics:
    precision_at_k: float
    recall_at_k: float
    hit_rate: float
    mrr: float

def calculate_retrieval_metrics(
    retrieved_ids: list[str],
    relevant_ids: set[str],
    k: int
) -> RetrievalMetrics:
    """Calculate core retrieval metrics for a single query."""
    top_k = retrieved_ids[:k]
    top_k_set = set(top_k)

    # Precision@k: relevant in top-k / k
    relevant_in_top_k = len(top_k_set & relevant_ids)
    precision = relevant_in_top_k / k if k > 0 else 0.0

    # Recall@k: relevant in top-k / total relevant
    recall = relevant_in_top_k / len(relevant_ids) if relevant_ids else 0.0

    # Hit Rate: 1 if any relevant in top-k, else 0
    hit_rate = 1.0 if relevant_in_top_k > 0 else 0.0

    # Reciprocal rank: 1 / rank of first relevant result
    # (average this value over all queries to obtain MRR)
    mrr = 0.0
    for i, doc_id in enumerate(top_k, 1):
        if doc_id in relevant_ids:
            mrr = 1.0 / i
            break

    return RetrievalMetrics(
        precision_at_k=precision,
        recall_at_k=recall,
        hit_rate=hit_rate,
        mrr=mrr
    )

# Usage
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = {"doc2", "doc5", "doc7"}  # Ground truth

metrics = calculate_retrieval_metrics(retrieved, relevant, k=5)
print(f"Precision@5: {metrics.precision_at_k:.2f}")  # 2/5 = 0.40
print(f"Recall@5: {metrics.recall_at_k:.2f}")        # 2/3 = 0.67
print(f"MRR: {metrics.mrr:.2f}")                     # 1/2 = 0.50
```
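
The function above scores one query at a time, so its `mrr` field is a single reciprocal rank and its `hit_rate` is 0 or 1. A minimal sketch of the corpus-level aggregation follows, assuming one `(retrieved, relevant)` pair per query. The names `reciprocal_rank`, `mean_reciprocal_rank`, `hit_rate`, and the sample data are illustrative and not part of the original file.

```python
from statistics import mean

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Return 1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average the reciprocal rank over all queries."""
    if not runs:
        return 0.0
    return mean(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs)

def hit_rate(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not runs:
        return 0.0
    hits = sum(1 for retrieved, relevant in runs if set(retrieved[:k]) & relevant)
    return hits / len(runs)

# Hypothetical results for three queries
runs = [
    (["doc1", "doc2", "doc3"], {"doc2"}),  # first relevant at rank 2 -> 0.5
    (["doc4", "doc5", "doc6"], {"doc4"}),  # first relevant at rank 1 -> 1.0
    (["doc7", "doc8", "doc9"], {"doc0"}),  # nothing relevant retrieved -> 0.0
]

print(f"MRR: {mean_reciprocal_rank(runs):.2f}")   # (0.5 + 1.0 + 0.0) / 3 = 0.50
print(f"Hit Rate@3: {hit_rate(runs, k=3):.2f}")   # 2/3 = 0.67
```

Returning 0.0 for an empty query set is a choice made for this sketch; raising an error may suit a stricter pipeline better.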

### NDCG (Normalized Discounted Cumulative Gain)

```python
import numpy as np

def dcg_at_k(relevance_scores: list[float], k: int) -> float:
    """Calculate Discounted Cumulative Gain."""
    scores = np.asarray(relevance_scores[:k], dtype=float)
    if len(scores) == 0:
        return 0.0

    # DCG = sum(rel_i / log2(i + 1)) for i in 1..k
    discounts = np.log2(np.arange(2, len(scores) + 2))
    return float(np.sum(scores / discounts))

def ndcg_at_k(
    retrieved_ids: list[str],
    relevance_scores: dict[str, float],
    k: int
) -> float:
    """
    Calculate NDCG@k.
    relevance_scores: dict mapping doc_id to relevance (e.g., 0, 1, 2, 3)
    """
    # Get relevance scores for retrieved docs (unjudged docs count as 0)
    retrieved_relevance = [
        relevance_scores.get(doc_id, 0)
        for doc_id in retrieved_ids[:k]
    ]

    # Calculate DCG for retrieved order
    dcg = dcg_at_k(retrieved_relevance, k)

    # Calculate ideal DCG (perfect ranking of all judged docs)
    ideal_relevance = sorted(relevance_scores.values(), reverse=True)[:k]
    idcg = dcg_at_k(ideal_relevance, k)

    return dcg / idcg if idcg > 0 else 0.0

# Usage with graded relevance
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevance = {
    "doc1": 0,  # Not relevant
    "doc2": 3,  # Highly relevant
    "doc3": 1,  # Somewhat relevant
    "doc5": 2,  # Relevant
    "doc7": 3,  # Highly relevant (not retrieved)
}

ndcg = ndcg_at_k(retrieved, relevance, k=5)
print(f"NDCG@5: {ndcg:.3f}")
```
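
To check the NDCG example by hand, the same arithmetic can be written out without NumPy. This is a sketch using only the standard library. The two relevance lists are copied from the usage example above (the unjudged `doc4` counts as 0), and the helper name `dcg` is illustrative.

```python
import math

def dcg(relevances: list[float]) -> float:
    """Sum rel_i / log2(i + 1) over 1-based positions i."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

# Graded relevance of the five retrieved documents, in retrieved order.
retrieved_relevance = [0, 3, 1, 0, 2]
# Best possible top-5 ordering of all judged documents (includes unretrieved doc7).
ideal_relevance = [3, 3, 2, 1, 0]

dcg_value = dcg(retrieved_relevance)  # 3/log2(3) + 1/log2(4) + 2/log2(6)
idcg_value = dcg(ideal_relevance)     # 3/log2(2) + 3/log2(3) + 2/log2(4) + 1/log2(5)
ndcg_value = dcg_value / idcg_value

print(f"DCG@5:  {dcg_value:.3f}")   # 3.166
print(f"IDCG@5: {idcg_value:.3f}")  # 6.323
print(f"NDCG@5: {ndcg_value:.3f}")  # 0.501
```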

---

## RAGAS Framework

### Installation and Setup

```python
# pip install ragas

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_utilization,
)
from datasets import Dataset

# Prepare evaluation dataset
# Note: the expected column names depend on the installed ragas version;
# check the documentation for your release.
eval_data = {
    "question": [
        "What is the capital of France?",
        "How do I install Python?"
    ],
    "answer": [
        "The capital of France is Paris.",
        "You can install Python by downloading it from python.org."
    ],
    "contexts": [
        ["Paris is the capital and largest city of France."],
        ["Python can be installed from the official website python.org.",
         "You can also use package managers like brew or apt."]
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Install Python from python.org or use a package manager."
    ]
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ]
)

print(results)
# Example output (illustrative values):
# {'faithfulness': 0.95, 'answer_relevancy': 0.88, ...}
```

### Custom RAGAS Evaluation

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Use custom LLM (wrap the LangChain model so ragas can call it)
custom_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# Evaluate with custom settings (`dataset` is the Dataset built in the previous example)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=custom_llm,
    raise_exceptions=False  # Continue on errors
)

# Per-sample scores
for i, row in enumerate(results.to_pandas().itertuples()):
    print(f"Q{i+1}: Faithfulness={row.faithfulness:.2f}, "
          f"Relevancy={row.answer_relevancy:.2f}")
```

### RAGAS Metrics Explained

```python
"""
RAGAS Core Metrics:

1. Faithfulness (0-1):
   - Measures if answer is grounded in context
   - LLM extracts claims from answer, verifies against context
   - High score = answer doesn't hallucinate

2. Answer Relevancy (0-1):
   - Measures if answer addresses the question
   - Generates questions from answer, compares to original
   - High score = answer is on-topic

3. Context Precision (0-1):
   - Measures if retrieved contexts are relevant
   - Ranks contexts by relevance, calculates precision at each rank
   - High score = top contexts are most relevant

4. Context Recall (0-1):
   - Measures if all ground truth info is in context
   - Checks if ground truth sentences are supported by context
   - High score = context contains needed information
"""

# Debugging low scores
def diagnose_ragas_scores(results_df):
    """Identify problematic samples."""
    issues = []

    for idx, row in results_df.iterrows():
        if row.get('faithfulness', 1) < 0.5:
            issues.append({
                "index": idx,
                "issue": "Low faithfulness - answer may contain hallucinations",
                "question": row['question'],
                "answer": row['answer'][:200]
            })

        if row.get('context_recall', 1) < 0.5:
            issues.append({
                "index": idx,
                "issue": "Low context recall - retrieval missing relevant docs",
                "question": row['question']
            })

    return issues
```
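
The arithmetic behind the faithfulness and context precision descriptions above can be sketched in plain Python. In RAGAS the per-claim and per-context verdicts come from an LLM; here they are passed in as hand-labelled booleans. The function names and the empty-input behaviour are assumptions of this sketch, not the library's implementation.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # assumption: an answer with no claims scores 0
    return sum(claim_supported) / len(claim_supported)

def context_precision_score(context_relevant: list[bool]) -> float:
    """Mean of precision@k, taken at each rank k that holds a relevant context."""
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(context_relevant, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits if hits else 0.0

# Two of three claims supported by the context
print(f"{faithfulness_score([True, True, False]):.2f}")       # 0.67

# Same two relevant contexts, ranked differently
print(f"{context_precision_score([True, False, True]):.2f}")  # (1/1 + 2/3) / 2 = 0.83
print(f"{context_precision_score([False, True, True]):.2f}")  # (1/2 + 2/3) / 2 = 0.58
```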

---

## TruLens Evaluation

### Setup and Basic Usage

```python
# pip install trulens-eval

from langchain.chains import RetrievalQA
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI as fOpenAI

# Initialize TruLens
tru = Tru()

# Build your RAG chain first: the context selector below needs the app instance.
# `llm` and `vector_store` are assumed to be defined elsewhere.
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_store.as_retriever()
)

# Create feedback provider
provider = fOpenAI()

# Selector for the context retrieved by this LangChain app
context = TruChain.select_context(rag_chain)

# Define feedback functions
f_groundedness = Feedback(
    provider.groundedness_measure_with_cot_reasons,
    name="Groundedness"
).on(
    context.collect()  # Retrieved context, passed as one collection
).on_output()

f_relevance = Feedback(
    provider.relevance_with_cot_reasons,
    name="Answer Relevance"
).on_input().on_output()

f_context_relevance = Feedback(
    provider.context_relevance_with_cot_reasons,
    name="Context Relevance"
).on_input().on(
    context  # Scored per retrieved chunk
)

# Wrap your RAG chain
tru_recorder = TruChain(
    rag_chain,
    app_id="rag-v1",
    feedbacks=[f_groundedness, f_relevance, f_context_relevance]
)

# Run with recording
with tru_recorder as recording:
    response = rag_chain.invoke({"query": "How do I configure authentication?"})

# View results
tru.run_dashboard()  # Opens web UI
# Or get programmatically (returns the records DataFrame and the feedback column names)
records, feedback_names = tru.get_records_and_feedback(app_ids=["rag-v1"])
```

### Custom Feedback Functions

```python
import re

from trulens_eval import Feedback, Select

def custom_citation_check(response: str, context: str) -> float:
    """Check if response cites sources from context."""
    # Extract citations from response (e.g., [1], [Source: X])
    citations = re.findall(r'\[[\d\w\s:]+\]', response)

    if not citations:
        return 0.0  # No citations

    # Verify citations reference actual context
    valid_citations = sum(1 for c in citations if c.lower() in context.lower())
    return valid_citations / len(citations)

f_citation = Feedback(
    custom_citation_check,
    name="Citation Accuracy"
).on_output().on(Select.RecordCalls.retriever.get_relevant_documents.rets.page_content)
```

---

## Building Custom Evaluation Pipelines

### LLM-as-Judge Evaluation

```python
import json
from dataclasses import dataclass
from typing import Literal

from openai import OpenAI

client = OpenAI()

@dataclass
class EvalResult:
    score: float
    reasoning: str
    criteria: str

def evaluate_with_llm(
    question: str,
    answer: str,
    context: str,
    criteria: Literal["faithfulness", "relevance", "completeness"]
) -> EvalResult:
    """Use LLM as judge for evaluation."""

    criteria_prompts = {
        "faithfulness": """
        Evaluate if the answer is fully supported by the provided context.
        Score 1.0 if every claim in the answer is verifiable from context.
        Score 0.5 if most claims are supported but some are not.
        Score 0.0 if the answer contains significant unsupported claims.
        """,
        "relevance": """
        Evaluate if the answer directly addresses the question.
        Score 1.0 if the answer fully addresses the question.
        Score 0.5 if the answer partially addresses the question.
        Score 0.0 if the answer is off-topic or doesn't address the question.
        """,
        "completeness": """
        Evaluate if the answer covers all aspects of the question.
        Score 1.0 if the answer is comprehensive and complete.
        Score 0.5 if the answer covers main points but misses details.
        Score 0.0 if the answer is significantly incomplete.
        """
    }

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"""You are an expert evaluator for RAG systems.
{criteria_prompts[criteria]}

Respond in JSON format:
{{"score": <0.0-1.0>, "reasoning": "<explanation>"}}"""
            },
            {
                "role": "user",
                "content": f"""Question: {question}

Context:
{context}

Answer: {answer}

Evaluate the answer for {criteria}:"""
            }
        ],
        response_format={"type": "json_object"}
    )

    # The judge is asked for a JSON object; parse it into a typed result
    result = json.loads(response.choices[0].message.content)

    return EvalResult(
        score=float(result["score"]),
        reasoning=result["reasoning"],
        criteria=criteria
    )

# Usage
eval_result = evaluate_with_llm(
    question="How do I configure OAuth2?",
    answer="Configure OAuth2 by setting client_id and client_secret in config.yaml.",
    context="OAuth2 configuration requires client_id, client_secret, and redirect_uri in config.yaml.",
    criteria="faithfulness"
)
print(f"Faithfulness: {eval_result.score:.2f}")
print(f"Reasoning: {eval_result.reasoning}")
```
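
A judge model can return malformed JSON, omit a key, or emit a score outside the requested range. The sketch below shows one way to parse the reply defensively, assuming the `{"score": ..., "reasoning": ...}` shape requested in the prompt above. The helper name `parse_judge_output` and the choice to fall back to a score of 0.0 are assumptions of this example.

```python
import json
import math

def parse_judge_output(raw: str) -> tuple[float, str]:
    """Parse a judge reply into (score, reasoning).

    Falls back to (0.0, <error note>) when the reply cannot be used.
    """
    try:
        data = json.loads(raw)
        score = float(data["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        return 0.0, f"unparseable judge output: {exc}"

    if math.isnan(score):
        return 0.0, "unparseable judge output: score is NaN"

    # Clamp to the 0.0-1.0 range the prompt asks for.
    score = max(0.0, min(1.0, score))
    return score, str(data.get("reasoning", ""))

print(parse_judge_output('{"score": 0.5, "reasoning": "mostly supported"}'))
print(parse_judge_output('{"score": 1.7, "reasoning": "out of range"}'))
print(parse_judge_output("not json"))
```

Mapping failures to 0.0 keeps aggregation simple but hides judge errors inside the mean; returning `None` and filtering it out before averaging is a reasonable alternative.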

### Batch Evaluation Pipeline

```python
import asyncio
from tqdm.asyncio import tqdm_asyncio

async def evaluate_batch(
    test_cases: list[dict],
    retriever,
    generator,
    metrics: tuple[str, ...] = ("precision", "faithfulness", "relevance")
) -> dict:
    """Run batch evaluation on test cases.

    Uses calculate_retrieval_metrics and evaluate_with_llm from the earlier sections.
    `retriever.aretrieve` and `generator.agenerate` are the async interfaces this
    example expects from the objects passed in.
    """

    results = {
        "per_sample": [],
        "aggregated": {}
    }

    async def evaluate_single(case: dict) -> dict:
        # Retrieve
        retrieved = await retriever.aretrieve(case["question"])
        retrieved_ids = [r.id for r in retrieved]
        context_texts = [r.text for r in retrieved]

        # Generate
        answer = await generator.agenerate(
            question=case["question"],
            context=context_texts
        )

        # Calculate metrics
        sample_result = {
            "question": case["question"],
            "answer": answer,
            "retrieved_ids": retrieved_ids
        }

        if "relevant_ids" in case and "precision" in metrics:
            retrieval_metrics = calculate_retrieval_metrics(
                retrieved_ids,
                set(case["relevant_ids"]),
                k=5
            )
            sample_result["precision@5"] = retrieval_metrics.precision_at_k
            sample_result["recall@5"] = retrieval_metrics.recall_at_k

        for criteria in ("faithfulness", "relevance"):
            if criteria in metrics:
                # evaluate_with_llm is synchronous; run it in a worker thread
                # so it does not block the event loop
                judged = await asyncio.to_thread(
                    evaluate_with_llm,
                    case["question"],
                    answer,
                    "\n".join(context_texts),
                    criteria
                )
                sample_result[criteria] = judged.score

        return sample_result

    # Run evaluations concurrently
    tasks = [evaluate_single(case) for case in test_cases]
    results["per_sample"] = await tqdm_asyncio.gather(*tasks)

    # Aggregate results
    for metric in ["precision@5", "recall@5", "faithfulness", "relevance"]:
        scores = [r.get(metric) for r in results["per_sample"] if r.get(metric) is not None]
        if scores:
            results["aggregated"][metric] = {
                "mean": sum(scores) / len(scores),
                "min": min(scores),
                "max": max(scores)
            }

    return results
```
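
Gathering every test case at once can exceed API rate limits on large test sets. A minimal sketch of bounded concurrency with `asyncio.Semaphore` follows. The helper `gather_with_limit` and the stand-in coroutine `fake_evaluate` are hypothetical names used only for this illustration; in the pipeline above, the coroutines would be the `evaluate_single(case)` calls.

```python
import asyncio

async def gather_with_limit(coros, max_concurrency: int = 5) -> list:
    """Await all coroutines, running at most `max_concurrency` at a time.

    Results are returned in the same order as the input.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))

# Stand-in for evaluate_single: sleeps briefly instead of calling a retriever or LLM.
async def fake_evaluate(i: int) -> dict:
    await asyncio.sleep(0.01)
    return {"question": f"q{i}", "faithfulness": 1.0}

demo_results = asyncio.run(
    gather_with_limit([fake_evaluate(i) for i in range(10)], max_concurrency=3)
)
print(len(demo_results))  # 10
```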

---

## Debugging Poor Retrieval

### Retrieval Diagnostics

```python
from sklearn.metrics.pairwise import cosine_similarity

def diagnose_retrieval(
    query: str,
    retrieved_docs: list,
    expected_docs: list,
    embedding_model
) -> dict:
    """Diagnose why retrieval might be failing."""

    query_embedding = embedding_model.encode(query)
    retrieved_embeddings = [embedding_model.encode(d) for d in retrieved_docs]
    expected_embeddings = [embedding_model.encode(d) for d in expected_docs]

    diagnosis = {
        "query": query,
        "issues": []
    }

    # Query-document similarity, computed once and reused below
    retrieved_sims = [
        float(cosine_similarity([query_embedding], [emb])[0][0])
        for emb in retrieved_embeddings
    ]

    # Check query-document similarity
    for i, (doc, sim) in enumerate(zip(retrieved_docs, retrieved_sims)):
        if sim < 0.5:
            diagnosis["issues"].append({
                "type": "low_similarity",
                "doc_index": i,
                "similarity": sim,
                "doc_preview": doc[:100]
            })

    # Check if expected docs would score higher
    # (default guards against an empty retrieved list)
    retrieved_max_sim = max(retrieved_sims, default=float("-inf"))
    for i, (doc, emb) in enumerate(zip(expected_docs, expected_embeddings)):
        sim = float(cosine_similarity([query_embedding], [emb])[0][0])

        if sim > retrieved_max_sim:
            diagnosis["issues"].append({
                "type": "missed_better_doc",
                "expected_doc_index": i,
                "expected_sim": sim,
                "best_retrieved_sim": retrieved_max_sim,
                "doc_preview": doc[:100]
            })

    # Check for vocabulary mismatch
    query_terms = set(query.lower().split())
    for i, doc in enumerate(retrieved_docs):
        doc_terms = set(doc.lower().split())
        overlap = query_terms & doc_terms
        if len(overlap) < len(query_terms) * 0.3:
            diagnosis["issues"].append({
                "type": "vocabulary_mismatch",
                "doc_index": i,
                "query_terms": list(query_terms),
                "overlapping_terms": list(overlap)
            })

    return diagnosis

# Usage (retrieved_texts, expected_texts, and sentence_transformer are assumed
# to be defined elsewhere)
diagnosis = diagnose_retrieval(
    query="How to configure OAuth authentication",
    retrieved_docs=retrieved_texts,
    expected_docs=expected_texts,
    embedding_model=sentence_transformer
)

for issue in diagnosis["issues"]:
    print(f"Issue: {issue['type']}")
    print(f"Details: {issue}")
```
|
|
619
|
-
|
|
620
|
-
### Query Analysis
|
|
621
|
-
|
|
622
|
-
```python
|
|
623
|
-
def analyze_query_performance(
|
|
624
|
-
query_logs: list[dict],
|
|
625
|
-
threshold_precision: float = 0.6
|
|
626
|
-
) -> dict:
|
|
627
|
-
"""Analyze query patterns to find systematic issues."""
|
|
628
|
-
|
|
629
|
-
analysis = {
|
|
630
|
-
"total_queries": len(query_logs),
|
|
631
|
-
"low_performing": [],
|
|
632
|
-
"patterns": {}
|
|
633
|
-
}
|
|
634
|
-
|
|
635
|
-
for log in query_logs:
|
|
636
|
-
if log.get("precision@5", 1.0) < threshold_precision:
|
|
637
|
-
analysis["low_performing"].append(log)
|
|
638
|
-
|
|
639
|
-
# Analyze low-performing queries
|
|
640
|
-
if analysis["low_performing"]:
|
|
641
|
-
# Check for common patterns
|
|
642
|
-
low_perf_queries = [l["query"] for l in analysis["low_performing"]]
|
|
643
|
-
|
|
644
|
-
# Query length analysis
|
|
645
|
-
avg_length = sum(len(q.split()) for q in low_perf_queries) / len(low_perf_queries)
|
|
646
|
-
analysis["patterns"]["avg_low_perf_query_length"] = avg_length
|
|
647
|
-
|
|
648
|
-
# Common terms in failing queries
|
|
649
|
-
from collections import Counter
|
|
650
|
-
all_terms = []
|
|
651
|
-
for q in low_perf_queries:
|
|
652
|
-
all_terms.extend(q.lower().split())
|
|
653
|
-
analysis["patterns"]["common_failing_terms"] = Counter(all_terms).most_common(10)
|
|
654
|
-
|
|
655
|
-
# Question type analysis
|
|
656
|
-
question_words = ["how", "what", "why", "when", "where", "who"]
|
|
657
|
-
question_types = Counter()
|
|
658
|
-
for q in low_perf_queries:
|
|
659
|
-
for qw in question_words:
|
|
660
|
-
if q.lower().startswith(qw):
|
|
661
|
-
question_types[qw] += 1
|
|
662
|
-
break
|
|
663
|
-
else:
|
|
664
|
-
question_types["other"] += 1
|
|
665
|
-
analysis["patterns"]["failing_question_types"] = dict(question_types)
|
|
666
|
-
|
|
667
|
-
return analysis
|
|
668
|
-
```
|
|
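One caveat about the question-type analysis above: `q.lower().startswith(qw)` also matches longer words, so a query beginning with "However" counts as "how" and "Whatever" counts as "what". A minimal sketch that compares the first token instead; `question_type` and `QUESTION_WORDS` are illustrative names, not part of the original code:

```python
import re
from collections import Counter

QUESTION_WORDS = ("how", "what", "why", "when", "where", "who")


def question_type(query: str) -> str:
    """Classify a query by its first alphabetic token; anything else is 'other'."""
    tokens = re.findall(r"[a-z]+", query.lower())
    first = tokens[0] if tokens else ""
    return first if first in QUESTION_WORDS else "other"


# Hypothetical queries for illustration only.
queries = [
    "How do I rotate keys?",
    "However it fails on retry",
    "Whatever token works",
    "reset password",
]
counts = Counter(question_type(q) for q in queries)
print(dict(counts))
```

Splitting on alphabetic runs also lets a contraction such as "What's" classify as "what".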
669
|
-
|
|
670
|
-
---
|
|
671
|
-
|
|
672
|
-
## Continuous Monitoring
|
|
673
|
-
|
|
674
|
-
### Production Metrics Dashboard
|
|
675
|
-
|
|
676
|
-
```python
|
|
677
|
-
import time
|
|
678
|
-
from dataclasses import dataclass, field
|
|
679
|
-
from collections import deque
|
|
680
|
-
from threading import Lock
|
|
681
|
-
|
|
682
|
-
@dataclass
|
|
683
|
-
class RAGMetricsCollector:
|
|
684
|
-
"""Collect and track RAG metrics in production."""
|
|
685
|
-
|
|
686
|
-
window_size: int = 1000
|
|
687
|
-
_latencies: deque = field(default_factory=lambda: deque(maxlen=1000))
|
|
688
|
-
_retrieval_scores: deque = field(default_factory=lambda: deque(maxlen=1000))
|
|
689
|
-
_generation_scores: deque = field(default_factory=lambda: deque(maxlen=1000))
|
|
690
|
-
_lock: Lock = field(default_factory=Lock)
|
|
691
|
-
|
|
692
|
-
def record_query(
|
|
693
|
-
self,
|
|
694
|
-
latency_ms: float,
|
|
695
|
-
retrieval_score: float | None = None,
|
|
696
|
-
generation_score: float | None = None
|
|
697
|
-
):
|
|
698
|
-
"""Record metrics for a single query."""
|
|
699
|
-
with self._lock:
|
|
700
|
-
self._latencies.append(latency_ms)
|
|
701
|
-
if retrieval_score is not None:
|
|
702
|
-
self._retrieval_scores.append(retrieval_score)
|
|
703
|
-
if generation_score is not None:
|
|
704
|
-
self._generation_scores.append(generation_score)
|
|
705
|
-
|
|
706
|
-
def get_summary(self) -> dict:
|
|
707
|
-
"""Get current metrics summary."""
|
|
708
|
-
with self._lock:
|
|
709
|
-
import numpy as np
|
|
710
|
-
|
|
711
|
-
summary = {
|
|
712
|
-
"queries_in_window": len(self._latencies),
|
|
713
|
-
"latency": {
|
|
714
|
-
"p50": np.percentile(self._latencies, 50) if self._latencies else 0,
|
|
715
|
-
"p95": np.percentile(self._latencies, 95) if self._latencies else 0,
|
|
716
|
-
"p99": np.percentile(self._latencies, 99) if self._latencies else 0,
|
|
717
|
-
},
|
|
718
|
-
"retrieval_score": {
|
|
719
|
-
"mean": np.mean(self._retrieval_scores) if self._retrieval_scores else 0,
|
|
720
|
-
"std": np.std(self._retrieval_scores) if self._retrieval_scores else 0,
|
|
721
|
-
},
|
|
722
|
-
"generation_score": {
|
|
723
|
-
"mean": np.mean(self._generation_scores) if self._generation_scores else 0,
|
|
724
|
-
"std": np.std(self._generation_scores) if self._generation_scores else 0,
|
|
725
|
-
}
|
|
726
|
-
}
|
|
727
|
-
|
|
728
|
-
return summary
|
|
729
|
-
|
|
730
|
-
# Usage
|
|
731
|
-
metrics = RAGMetricsCollector()
|
|
732
|
-
|
|
733
|
-
# In your RAG endpoint
|
|
734
|
-
start = time.time()
|
|
735
|
-
response = rag_pipeline.query(question)
|
|
736
|
-
latency = (time.time() - start) * 1000
|
|
737
|
-
|
|
738
|
-
metrics.record_query(
|
|
739
|
-
latency_ms=latency,
|
|
740
|
-
retrieval_score=response.get("retrieval_score"),
|
|
741
|
-
generation_score=response.get("generation_score")
|
|
742
|
-
)
|
|
743
|
-
|
|
744
|
-
# Periodically check
|
|
745
|
-
print(metrics.get_summary())
|
|
746
|
-
```
|
|
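In `RAGMetricsCollector` above, the deques are created with a hard-coded `maxlen=1000`, so passing a different `window_size` does not change the window. A minimal sketch of one way to tie the two together, assuming it is acceptable to build the deque in `__post_init__`; `WindowedScores` is a simplified, hypothetical stand-in for the collector:

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WindowedScores:
    """Rolling buffer whose capacity follows `window_size`."""

    window_size: int = 1000
    _scores: deque = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # Build the deque here so maxlen tracks the configured window.
        self._scores = deque(maxlen=self.window_size)

    def record(self, score: float) -> None:
        self._scores.append(score)

    def mean(self) -> float:
        return sum(self._scores) / len(self._scores) if self._scores else 0.0


window = WindowedScores(window_size=3)
for score in [0.0, 0.0, 1.0, 1.0, 1.0]:
    window.record(score)
print(window.mean())  # only the last three scores remain in the window
```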
747
|
-
|
|
748
|
-
### Alerting on Quality Degradation
|
|
749
|
-
|
|
750
|
-
```python
|
|
751
|
-
class RAGQualityMonitor:
|
|
752
|
-
"""Monitor RAG quality and alert on degradation."""
|
|
753
|
-
|
|
754
|
-
def __init__(
|
|
755
|
-
self,
|
|
756
|
-
baseline_precision: float = 0.8,
|
|
757
|
-
alert_threshold: float = 0.1,  # Alert if the mean falls more than 0.10 below baseline (absolute, not relative)
|
|
758
|
-
window_size: int = 100
|
|
759
|
-
):
|
|
760
|
-
self.baseline = baseline_precision
|
|
761
|
-
self.threshold = alert_threshold
|
|
762
|
-
self.window_size = window_size
|
|
763
|
-
self.recent_scores = deque(maxlen=window_size)
|
|
764
|
-
|
|
765
|
-
def record_score(self, precision: float) -> dict | None:
|
|
766
|
-
"""Record score and return alert if quality degraded."""
|
|
767
|
-
self.recent_scores.append(precision)
|
|
768
|
-
|
|
769
|
-
if len(self.recent_scores) < self.window_size // 2:
|
|
770
|
-
return None # Not enough data
|
|
771
|
-
|
|
772
|
-
current_mean = sum(self.recent_scores) / len(self.recent_scores)
|
|
773
|
-
degradation = self.baseline - current_mean
|
|
774
|
-
|
|
775
|
-
if degradation > self.threshold:
|
|
776
|
-
return {
|
|
777
|
-
"alert": "QUALITY_DEGRADATION",
|
|
778
|
-
"baseline": self.baseline,
|
|
779
|
-
"current": current_mean,
|
|
780
|
-
"degradation": degradation,
|
|
781
|
-
"window_size": len(self.recent_scores)
|
|
782
|
-
}
|
|
783
|
-
|
|
784
|
-
return None
|
|
785
|
-
|
|
786
|
-
# Usage
|
|
787
|
-
monitor = RAGQualityMonitor(baseline_precision=0.85)
|
|
788
|
-
|
|
789
|
-
for query_result in production_queries:
|
|
790
|
-
alert = monitor.record_score(query_result["precision@5"])
|
|
791
|
-
if alert:
|
|
792
|
-
send_alert(alert) # Slack, PagerDuty, etc.
|
|
793
|
-
```
|
|
794
|
-
|
|
795
|
-
---
|
|
796
|
-
|
|
797
|
-
## Evaluation Best Practices
|
|
798
|
-
|
|
799
|
-
| Practice | Description |
|
|
800
|
-
|----------|-------------|
|
|
801
|
-
| **Golden test set** | Maintain 50-200 curated Q&A pairs with ground truth |
|
|
802
|
-
| **Stratified sampling** | Include diverse query types in test set |
|
|
803
|
-
| **Human baselines** | Compare LLM judges against human annotators |
|
|
804
|
-
| **Version control** | Track evaluation results alongside model versions |
|
|
805
|
-
| **Regular re-evaluation** | Re-run golden tests on every retrieval change |
|
|
806
|
-
| **A/B testing** | Compare new retrieval strategies on live traffic |
|
|
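For the "Human baselines" row, one common way to compare an LLM judge with human annotators is chance-corrected agreement. A minimal sketch using Cohen's kappa on binary pass/fail labels, assuming both raters labeled the same items in the same order; the label lists are made up for illustration:

```python
def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty item list")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human = [1, 1, 0, 1, 0, 0, 1, 1]  # hypothetical human verdicts
judge = [1, 1, 0, 0, 0, 0, 1, 1]  # hypothetical LLM-judge verdicts
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")
```

A low kappa suggests the judge prompt or rubric needs revision before its scores are trusted on the golden test set.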
807
|
-
|
|
808
|
-
---
|
|
809
|
-
|
|
810
|
-
## Quick Reference
|
|
811
|
-
|
|
812
|
-
| Goal | Metric | Target |
|
|
813
|
-
|------|--------|--------|
|
|
814
|
-
| Are docs relevant? | Precision@5 | > 0.7 |
|
|
815
|
-
| Did we get all docs? | Recall@5 | > 0.8 |
|
|
816
|
-
| Is ranking good? | NDCG@5 | > 0.7 |
|
|
817
|
-
| Is answer grounded? | Faithfulness | > 0.9 |
|
|
818
|
-
| Does answer fit question? | Answer Relevance | > 0.8 |
|
|
819
|
-
| Is context useful? | Context Relevance | > 0.7 |
|
|
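The targets in the table above can be turned into a simple regression gate. A minimal sketch, assuming aggregated scores arrive as a dict keyed by lowercase metric names; the key names and the example scores are assumptions, not part of the original document:

```python
# Targets copied from the Quick Reference table; each metric should exceed its target.
TARGETS = {
    "precision@5": 0.7,
    "recall@5": 0.8,
    "ndcg@5": 0.7,
    "faithfulness": 0.9,
    "answer_relevance": 0.8,
    "context_relevance": 0.7,
}


def metrics_below_target(scores: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Return {metric: (score, target)} for every reported metric not above its target."""
    return {
        name: (score, TARGETS[name])
        for name, score in scores.items()
        if name in TARGETS and score <= TARGETS[name]
    }


# Hypothetical scores for illustration only.
example_scores = {"precision@5": 0.74, "recall@5": 0.62, "faithfulness": 0.9}
failing = metrics_below_target(example_scores)
print(failing)
```

Because the table states strict thresholds ("> 0.9"), a score exactly equal to its target is reported as failing here.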
820
|
-
|
|
821
|
-
| Framework | Best For |
|
|
822
|
-
|-----------|----------|
|
|
823
|
-
| RAGAS | Quick RAG-specific evaluation |
|
|
824
|
-
| TruLens | Production monitoring and tracing |
|
|
825
|
-
| Custom LLM-judge | Domain-specific criteria |
|
|
826
|
-
| Manual annotation | Ground truth creation |
|
|
827
|
-
|
|
828
|
-
## Related Skills
|
|
829
|
-
|
|
830
|
-
- **RAG Architect** - System design
|
|
831
|
-
- **ML Pipeline** - Evaluation automation
|
|
832
|
-
- **Data Scientist** - Statistical analysis
|
|
833
|
-
- **Monitoring Expert** - Production observability
|
|
1
|
+
# RAG Evaluation
|
|
2
|
+
|
|
3
|
+
---
|
|
4
|
+
|
|
5
|
+
## Evaluation Framework Overview
|
|
6
|
+
|
|
7
|
+
| Framework | Focus | Strengths | Use Case |
|
|
8
|
+
|-----------|-------|-----------|----------|
|
|
9
|
+
| **RAGAS** | RAG-specific metrics | Faithfulness, relevance | Production RAG evaluation |
|
|
10
|
+
| **TruLens** | LLM app observability | Tracing, feedback functions | Debugging and monitoring |
|
|
11
|
+
| **LangSmith** | LangChain ecosystem | Traces, datasets, testing | LangChain projects |
|
|
12
|
+
| **Custom** | Specific requirements | Full control | Domain-specific needs |
|
|
13
|
+
|
|
14
|
+
---
|
|
15
|
+
|
|
16
|
+
## Core Metrics
|
|
17
|
+
|
|
18
|
+
### Retrieval Metrics
|
|
19
|
+
|
|
20
|
+
| Metric | Formula | What It Measures |
|
|
21
|
+
|--------|---------|------------------|
|
|
22
|
+
| **Precision@k** | Relevant in top-k / k | Are retrieved docs relevant? |
|
|
23
|
+
| **Recall@k** | Relevant in top-k / Total relevant | Did we get all relevant docs? |
|
|
24
|
+
| **MRR** | Mean over queries of 1 / rank of first relevant | How quickly do we find a relevant doc? |
|
|
25
|
+
| **NDCG@k** | DCG@k / IDCG@k | Is ranking order correct? |
|
|
26
|
+
| **Hit Rate** | Queries with relevant in top-k / Total queries | Binary success rate |
|
|
27
|
+
|
|
28
|
+
### Generation Metrics
|
|
29
|
+
|
|
30
|
+
| Metric | What It Measures |
|
|
31
|
+
|--------|------------------|
|
|
32
|
+
| **Faithfulness** | Is answer grounded in retrieved context? |
|
|
33
|
+
| **Answer Relevance** | Does answer address the question? |
|
|
34
|
+
| **Context Relevance** | Is retrieved context relevant to question? |
|
|
35
|
+
| **Context Utilization** | How much context was actually used? |
|
|
36
|
+
|
|
37
|
+
---
|
|
38
|
+
|
|
39
|
+
## Implementing Core Metrics
|
|
40
|
+
|
|
41
|
+
### Precision, Recall, and Hit Rate
|
|
42
|
+
|
|
43
|
+
```python
|
|
44
|
+
from dataclasses import dataclass
|
|
45
|
+
from typing import Set
|
|
46
|
+
|
|
47
|
+
@dataclass
|
|
48
|
+
class RetrievalMetrics:
|
|
49
|
+
precision_at_k: float
|
|
50
|
+
recall_at_k: float
|
|
51
|
+
hit_rate: float
|
|
52
|
+
mrr: float
|
|
53
|
+
|
|
54
|
+
def calculate_retrieval_metrics(
|
|
55
|
+
retrieved_ids: list[str],
|
|
56
|
+
relevant_ids: set[str],
|
|
57
|
+
k: int
|
|
58
|
+
) -> RetrievalMetrics:
|
|
59
|
+
"""Calculate core retrieval metrics."""
|
|
60
|
+
top_k = retrieved_ids[:k]
|
|
61
|
+
top_k_set = set(top_k)
|
|
62
|
+
|
|
63
|
+
# Precision@k: relevant in top-k / k
|
|
64
|
+
relevant_in_top_k = len(top_k_set & relevant_ids)
|
|
65
|
+
precision = relevant_in_top_k / k if k > 0 else 0
|
|
66
|
+
|
|
67
|
+
# Recall@k: relevant in top-k / total relevant
|
|
68
|
+
recall = relevant_in_top_k / len(relevant_ids) if relevant_ids else 0
|
|
69
|
+
|
|
70
|
+
# Hit Rate: 1 if any relevant in top-k, else 0
|
|
71
|
+
hit_rate = 1.0 if relevant_in_top_k > 0 else 0.0
|
|
72
|
+
|
|
73
|
+
# Reciprocal rank for this query: 1 / rank of first relevant result (MRR is its mean over queries)
|
|
74
|
+
mrr = 0.0
|
|
75
|
+
for i, doc_id in enumerate(top_k, 1):
|
|
76
|
+
if doc_id in relevant_ids:
|
|
77
|
+
mrr = 1.0 / i
|
|
78
|
+
break
|
|
79
|
+
|
|
80
|
+
return RetrievalMetrics(
|
|
81
|
+
precision_at_k=precision,
|
|
82
|
+
recall_at_k=recall,
|
|
83
|
+
hit_rate=hit_rate,
|
|
84
|
+
mrr=mrr
|
|
85
|
+
)
|
|
86
|
+
|
|
87
|
+
# Usage
|
|
88
|
+
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
|
|
89
|
+
relevant = {"doc2", "doc5", "doc7"} # Ground truth
|
|
90
|
+
|
|
91
|
+
metrics = calculate_retrieval_metrics(retrieved, relevant, k=5)
|
|
92
|
+
print(f"Precision@5: {metrics.precision_at_k:.2f}") # 2/5 = 0.40
|
|
93
|
+
print(f"Recall@5: {metrics.recall_at_k:.2f}") # 2/3 = 0.67
|
|
94
|
+
print(f"MRR: {metrics.mrr:.2f}") # 1/2 = 0.50
|
|
95
|
+
```
|
|
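The function above scores a single query, so its `mrr` and `hit_rate` fields are per-query values: a reciprocal rank and a 0/1 hit. MRR and Hit Rate as defined in the metrics table are averages over a query set. A minimal sketch of that aggregation, assuming each query is a `(retrieved_ids, relevant_ids)` pair; `reciprocal_rank` and `aggregate_over_queries` are illustrative names, not part of the original code:

```python
def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Return 1 / rank of the first relevant doc in the top-k, or 0.0 if none."""
    for rank, doc_id in enumerate(retrieved_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def aggregate_over_queries(
    queries: list[tuple[list[str], set[str]]], k: int
) -> dict[str, float]:
    """Average per-query reciprocal rank and hit indicator over a query set."""
    if not queries:
        return {"mrr": 0.0, "hit_rate": 0.0}
    rr = [reciprocal_rank(retrieved, relevant, k) for retrieved, relevant in queries]
    return {
        "mrr": sum(rr) / len(rr),
        "hit_rate": sum(1 for value in rr if value > 0) / len(rr),
    }


queries = [
    (["doc1", "doc2", "doc3"], {"doc2"}),  # first relevant at rank 2
    (["doc4", "doc5", "doc6"], {"doc4"}),  # first relevant at rank 1
    (["doc7", "doc8", "doc9"], {"docX"}),  # no relevant doc retrieved
]
summary = aggregate_over_queries(queries, k=3)
print(summary)  # mrr = (0.5 + 1.0 + 0.0) / 3, hit_rate = 2 / 3
```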
96
|
+
|
|
97
|
+
### NDCG (Normalized Discounted Cumulative Gain)
|
|
98
|
+
|
|
99
|
+
```python
|
|
100
|
+
import numpy as np
|
|
101
|
+
|
|
102
|
+
def dcg_at_k(relevance_scores: list[float], k: int) -> float:
|
|
103
|
+
"""Calculate Discounted Cumulative Gain."""
|
|
104
|
+
relevance_scores = np.array(relevance_scores[:k])
|
|
105
|
+
if len(relevance_scores) == 0:
|
|
106
|
+
return 0.0
|
|
107
|
+
|
|
108
|
+
# DCG = sum(rel_i / log2(i + 1)) for i in 1..k
|
|
109
|
+
discounts = np.log2(np.arange(2, len(relevance_scores) + 2))
|
|
110
|
+
return np.sum(relevance_scores / discounts)
|
|
111
|
+
|
|
112
|
+
def ndcg_at_k(
|
|
113
|
+
retrieved_ids: list[str],
|
|
114
|
+
relevance_scores: dict[str, float],
|
|
115
|
+
k: int
|
|
116
|
+
) -> float:
|
|
117
|
+
"""
|
|
118
|
+
Calculate NDCG@k.
|
|
119
|
+
relevance_scores: dict mapping doc_id to relevance (e.g., 0, 1, 2, 3)
|
|
120
|
+
"""
|
|
121
|
+
# Get relevance scores for retrieved docs
|
|
122
|
+
retrieved_relevance = [
|
|
123
|
+
relevance_scores.get(doc_id, 0)
|
|
124
|
+
for doc_id in retrieved_ids[:k]
|
|
125
|
+
]
|
|
126
|
+
|
|
127
|
+
# Calculate DCG for retrieved order
|
|
128
|
+
dcg = dcg_at_k(retrieved_relevance, k)
|
|
129
|
+
|
|
130
|
+
# Calculate ideal DCG (perfect ranking)
|
|
131
|
+
ideal_relevance = sorted(relevance_scores.values(), reverse=True)[:k]
|
|
132
|
+
idcg = dcg_at_k(ideal_relevance, k)
|
|
133
|
+
|
|
134
|
+
return dcg / idcg if idcg > 0 else 0.0
|
|
135
|
+
|
|
136
|
+
# Usage with graded relevance
|
|
137
|
+
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
|
|
138
|
+
relevance = {
|
|
139
|
+
"doc1": 0, # Not relevant
|
|
140
|
+
"doc2": 3, # Highly relevant
|
|
141
|
+
"doc3": 1, # Somewhat relevant
|
|
142
|
+
"doc5": 2, # Relevant
|
|
143
|
+
"doc7": 3, # Highly relevant (not retrieved)
|
|
144
|
+
}
|
|
145
|
+
|
|
146
|
+
ndcg = ndcg_at_k(retrieved, relevance, k=5)
|
|
147
|
+
print(f"NDCG@5: {ndcg:.3f}")
|
|
148
|
+
```
|
|
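As a cross-check of the usage example above, the same NDCG@5 can be worked out by hand with only the standard library. This sketch assumes the same relevance judgments as the example (the unjudged `doc4` counts as 0, and the ideal ranking includes the missed `doc7`):

```python
import math


def dcg(relevances: list[float]) -> float:
    """DCG = sum(rel_i / log2(i + 1)) with ranks i starting at 1."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


# Graded relevance of the five retrieved docs, in retrieved order.
retrieved_relevance = [0, 3, 1, 0, 2]
# Best possible top-5 ordering of all judged docs.
ideal_relevance = [3, 3, 2, 1, 0]

dcg_value = dcg(retrieved_relevance)
idcg_value = dcg(ideal_relevance)
ndcg_value = dcg_value / idcg_value
print(f"DCG@5={dcg_value:.3f} IDCG@5={idcg_value:.3f} NDCG@5={ndcg_value:.3f}")
# DCG@5=3.166 IDCG@5=6.323 NDCG@5=0.501
```

The score is only about 0.5 even though three of the five results carry some relevance, because a highly relevant document was missed and the top slot holds an irrelevant one.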
149
|
+
|
|
150
|
+
---
|
|
151
|
+
|
|
152
|
+
## RAGAS Framework
|
|
153
|
+
|
|
154
|
+
### Installation and Setup
|
|
155
|
+
|
|
156
|
+
```python
|
|
157
|
+
# pip install ragas
|
|
158
|
+
|
|
159
|
+
from ragas import evaluate
|
|
160
|
+
from ragas.metrics import (
|
|
161
|
+
faithfulness,
|
|
162
|
+
answer_relevancy,
|
|
163
|
+
context_precision,
|
|
164
|
+
context_recall,
|
|
165
|
+
context_utilization,
|
|
166
|
+
)
|
|
167
|
+
from datasets import Dataset
|
|
168
|
+
|
|
169
|
+
# Prepare evaluation dataset
|
|
170
|
+
eval_data = {
|
|
171
|
+
"question": [
|
|
172
|
+
"What is the capital of France?",
|
|
173
|
+
"How do I install Python?"
|
|
174
|
+
],
|
|
175
|
+
"answer": [
|
|
176
|
+
"The capital of France is Paris.",
|
|
177
|
+
"You can install Python by downloading it from python.org."
|
|
178
|
+
],
|
|
179
|
+
"contexts": [
|
|
180
|
+
["Paris is the capital and largest city of France."],
|
|
181
|
+
["Python can be installed from the official website python.org.",
|
|
182
|
+
"You can also use package managers like brew or apt."]
|
|
183
|
+
],
|
|
184
|
+
"ground_truth": [
|
|
185
|
+
"Paris is the capital of France.",
|
|
186
|
+
"Install Python from python.org or use a package manager."
|
|
187
|
+
]
|
|
188
|
+
}
|
|
189
|
+
|
|
190
|
+
dataset = Dataset.from_dict(eval_data)
|
|
191
|
+
|
|
192
|
+
# Run evaluation
|
|
193
|
+
results = evaluate(
|
|
194
|
+
dataset,
|
|
195
|
+
metrics=[
|
|
196
|
+
faithfulness,
|
|
197
|
+
answer_relevancy,
|
|
198
|
+
context_precision,
|
|
199
|
+
context_recall,
|
|
200
|
+
]
|
|
201
|
+
)
|
|
202
|
+
|
|
203
|
+
print(results)
|
|
204
|
+
# {'faithfulness': 0.95, 'answer_relevancy': 0.88, ...}
|
|
205
|
+
```
|
|
206
|
+
|
|
207
|
+
### Custom RAGAS Evaluation
|
|
208
|
+
|
|
209
|
+
```python
|
|
210
|
+
from ragas.metrics import Metric
|
|
211
|
+
from ragas.llms import LangchainLLM
|
|
212
|
+
from langchain_openai import ChatOpenAI
|
|
213
|
+
|
|
214
|
+
# Use custom LLM
|
|
215
|
+
custom_llm = LangchainLLM(llm=ChatOpenAI(model="gpt-4o-mini"))
|
|
216
|
+
|
|
217
|
+
# Evaluate with custom settings
|
|
218
|
+
results = evaluate(
|
|
219
|
+
dataset,
|
|
220
|
+
metrics=[faithfulness, answer_relevancy],
|
|
221
|
+
llm=custom_llm,
|
|
222
|
+
raise_exceptions=False # Continue on errors
|
|
223
|
+
)
|
|
224
|
+
|
|
225
|
+
# Per-sample scores
|
|
226
|
+
for i, row in enumerate(results.to_pandas().itertuples()):
|
|
227
|
+
print(f"Q{i+1}: Faithfulness={row.faithfulness:.2f}, "
|
|
228
|
+
f"Relevancy={row.answer_relevancy:.2f}")
|
|
229
|
+
```
|
|
230
|
+
|
|
231
|
+
### RAGAS Metrics Explained
|
|
232
|
+
|
|
233
|
+
```python
|
|
234
|
+
"""
|
|
235
|
+
RAGAS Core Metrics:
|
|
236
|
+
|
|
237
|
+
1. Faithfulness (0-1):
|
|
238
|
+
- Measures if answer is grounded in context
|
|
239
|
+
- LLM extracts claims from answer, verifies against context
|
|
240
|
+
- High score = answer doesn't hallucinate
|
|
241
|
+
|
|
242
|
+
2. Answer Relevancy (0-1):
|
|
243
|
+
- Measures if answer addresses the question
|
|
244
|
+
- Generates questions from answer, compares to original
|
|
245
|
+
- High score = answer is on-topic
|
|
246
|
+
|
|
247
|
+
3. Context Precision (0-1):
|
|
248
|
+
- Measures if retrieved contexts are relevant
|
|
249
|
+
- Ranks contexts by relevance, calculates precision at each rank
|
|
250
|
+
- High score = top contexts are most relevant
|
|
251
|
+
|
|
252
|
+
4. Context Recall (0-1):
|
|
253
|
+
- Measures if all ground truth info is in context
|
|
254
|
+
- Checks if ground truth sentences are supported by context
|
|
255
|
+
- High score = context contains needed information
|
|
256
|
+
"""
|
|
257
|
+
|
|
258
|
+
# Debugging low scores
|
|
259
|
+
def diagnose_ragas_scores(results_df):
|
|
260
|
+
"""Identify problematic samples."""
|
|
261
|
+
issues = []
|
|
262
|
+
|
|
263
|
+
for idx, row in results_df.iterrows():
|
|
264
|
+
if row.get('faithfulness', 1) < 0.5:
|
|
265
|
+
issues.append({
|
|
266
|
+
"index": idx,
|
|
267
|
+
"issue": "Low faithfulness - answer may contain hallucinations",
|
|
268
|
+
"question": row['question'],
|
|
269
|
+
"answer": row['answer'][:200]
|
|
270
|
+
})
|
|
271
|
+
|
|
272
|
+
if row.get('context_recall', 1) < 0.5:
|
|
273
|
+
issues.append({
|
|
274
|
+
"index": idx,
|
|
275
|
+
"issue": "Low context recall - retrieval missing relevant docs",
|
|
276
|
+
"question": row['question']
|
|
277
|
+
})
|
|
278
|
+
|
|
279
|
+
return issues
|
|
280
|
+
```
|
|
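The "precision at each rank" description of Context Precision corresponds to an average-precision style calculation. A minimal sketch, assuming a binary relevance verdict per retrieved context is already available (in RAGAS these verdicts come from an LLM judge, and the library's exact implementation may differ between versions); `rank_weighted_precision` is a hypothetical helper name:

```python
def rank_weighted_precision(verdicts: list[int]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant context.

    verdicts[i] is 1 if the context at rank i + 1 was judged relevant, else 0.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            relevant_seen += 1
            weighted_sum += relevant_seen / rank  # precision@rank
    return weighted_sum / relevant_seen if relevant_seen else 0.0


early = rank_weighted_precision([1, 0, 1])  # relevant contexts at ranks 1 and 3
late = rank_weighted_precision([0, 0, 1])   # only relevant context at rank 3
print(f"{early:.3f} {late:.3f}")
```

The second list scores lower because its only relevant context is ranked last, which is the ordering effect this metric is meant to capture.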
281
|
+
|
|
282
|
+
---
|
|
283
|
+
|
|
284
|
+
## TruLens Evaluation
|
|
285
|
+
|
|
286
|
+
### Setup and Basic Usage
|
|
287
|
+
|
|
288
|
+
```python
|
|
289
|
+
# pip install trulens-eval
|
|
290
|
+
|
|
291
|
+
from trulens_eval import Tru, TruChain, Feedback
|
|
292
|
+
from trulens_eval.feedback import Groundedness
|
|
293
|
+
from trulens_eval.feedback.provider import OpenAI as fOpenAI
|
|
294
|
+
|
|
295
|
+
# Initialize TruLens
|
|
296
|
+
tru = Tru()
|
|
297
|
+
|
|
298
|
+
# Create feedback provider
|
|
299
|
+
provider = fOpenAI()
|
|
300
|
+
|
|
301
|
+
# Define feedback functions
|
|
302
|
+
f_groundedness = Feedback(
|
|
303
|
+
provider.groundedness_measure_with_cot_reasons,
|
|
304
|
+
name="Groundedness"
|
|
305
|
+
).on(
|
|
306
|
+
TruChain.select_context().node.text # Retrieved context
|
|
307
|
+
).on_output()
|
|
308
|
+
|
|
309
|
+
f_relevance = Feedback(
|
|
310
|
+
provider.relevance_with_cot_reasons,
|
|
311
|
+
name="Answer Relevance"
|
|
312
|
+
).on_input().on_output()
|
|
313
|
+
|
|
314
|
+
f_context_relevance = Feedback(
|
|
315
|
+
provider.context_relevance_with_cot_reasons,
|
|
316
|
+
name="Context Relevance"
|
|
317
|
+
).on_input().on(
|
|
318
|
+
TruChain.select_context().node.text
|
|
319
|
+
)
|
|
320
|
+
|
|
321
|
+
# Wrap your RAG chain
|
|
322
|
+
from langchain.chains import RetrievalQA
|
|
323
|
+
|
|
324
|
+
rag_chain = RetrievalQA.from_chain_type(
|
|
325
|
+
llm=llm,
|
|
326
|
+
retriever=vector_store.as_retriever()
|
|
327
|
+
)
|
|
328
|
+
|
|
329
|
+
tru_recorder = TruChain(
|
|
330
|
+
rag_chain,
|
|
331
|
+
app_id="rag-v1",
|
|
332
|
+
feedbacks=[f_groundedness, f_relevance, f_context_relevance]
|
|
333
|
+
)
|
|
334
|
+
|
|
335
|
+
# Run with recording
|
|
336
|
+
with tru_recorder as recording:
|
|
337
|
+
response = rag_chain.invoke({"query": "How do I configure authentication?"})
|
|
338
|
+
|
|
339
|
+
# View results
|
|
340
|
+
tru.run_dashboard() # Opens web UI
|
|
341
|
+
# Or get programmatically
|
|
342
|
+
records = tru.get_records_and_feedback(app_ids=["rag-v1"])
|
|
343
|
+
```
|
|
344
|
+
|
|
345
|
+
### Custom Feedback Functions
|
|
346
|
+
|
|
347
|
+
```python
|
|
348
|
+
from trulens_eval import Feedback, Select
|
|
349
|
+
|
|
350
|
+
def custom_citation_check(response: str, context: str) -> float:
|
|
351
|
+
"""Check if response cites sources from context."""
|
|
352
|
+
# Extract citations from response (e.g., [1], [Source: X])
|
|
353
|
+
import re
|
|
354
|
+
citations = re.findall(r'\[[\d\w\s:]+\]', response)
|
|
355
|
+
|
|
356
|
+
if not citations:
|
|
357
|
+
return 0.0 # No citations
|
|
358
|
+
|
|
359
|
+
# Verify citations reference actual context
|
|
360
|
+
valid_citations = sum(1 for c in citations if c.lower() in context.lower())
|
|
361
|
+
return valid_citations / len(citations)
|
|
362
|
+
|
|
363
|
+
f_citation = Feedback(
|
|
364
|
+
custom_citation_check,
|
|
365
|
+
name="Citation Accuracy"
|
|
366
|
+
).on_output().on(Select.RecordCalls.retriever.get_relevant_documents.rets.page_content)
|
|
367
|
+
```
|
|
368
|
+
|
|
369
|
+
---
|
|
370
|
+
|
|
371
|
+
## Building Custom Evaluation Pipelines
|
|
372
|
+
|
|
373
|
+
### LLM-as-Judge Evaluation
|
|
374
|
+
|
|
375
|
+
```python
|
|
376
|
+
from openai import OpenAI
|
|
377
|
+
from dataclasses import dataclass
|
|
378
|
+
from typing import Literal
|
|
379
|
+
|
|
380
|
+
client = OpenAI()
|
|
381
|
+
|
|
382
|
+
@dataclass
|
|
383
|
+
class EvalResult:
|
|
384
|
+
score: float
|
|
385
|
+
reasoning: str
|
|
386
|
+
criteria: str
|
|
387
|
+
|
|
388
|
+
def evaluate_with_llm(
|
|
389
|
+
question: str,
|
|
390
|
+
answer: str,
|
|
391
|
+
context: str,
|
|
392
|
+
criteria: Literal["faithfulness", "relevance", "completeness"]
|
|
393
|
+
) -> EvalResult:
|
|
394
|
+
"""Use LLM as judge for evaluation."""
|
|
395
|
+
|
|
396
|
+
criteria_prompts = {
|
|
397
|
+
"faithfulness": """
|
|
398
|
+
Evaluate if the answer is fully supported by the provided context.
|
|
399
|
+
Score 1.0 if every claim in the answer is verifiable from context.
|
|
400
|
+
Score 0.5 if most claims are supported but some are not.
|
|
401
|
+
Score 0.0 if the answer contains significant unsupported claims.
|
|
402
|
+
""",
|
|
403
|
+
"relevance": """
|
|
404
|
+
Evaluate if the answer directly addresses the question.
|
|
405
|
+
Score 1.0 if the answer fully addresses the question.
|
|
406
|
+
Score 0.5 if the answer partially addresses the question.
|
|
407
|
+
Score 0.0 if the answer is off-topic or doesn't address the question.
|
|
408
|
+
""",
|
|
409
|
+
"completeness": """
|
|
410
|
+
Evaluate if the answer covers all aspects of the question.
|
|
411
|
+
Score 1.0 if the answer is comprehensive and complete.
|
|
412
|
+
Score 0.5 if the answer covers main points but misses details.
|
|
413
|
+
Score 0.0 if the answer is significantly incomplete.
|
|
414
|
+
"""
|
|
415
|
+
}
|
|
416
|
+
|
|
417
|
+
response = client.chat.completions.create(
|
|
418
|
+
model="gpt-4o-mini",
|
|
419
|
+
messages=[
|
|
420
|
+
{
|
|
421
|
+
"role": "system",
|
|
422
|
+
"content": f"""You are an expert evaluator for RAG systems.
|
|
423
|
+
{criteria_prompts[criteria]}
|
|
424
|
+
|
|
425
|
+
Respond in JSON format:
|
|
426
|
+
{{"score": <0.0-1.0>, "reasoning": "<explanation>"}}"""
|
|
427
|
+
},
|
|
428
|
+
{
|
|
429
|
+
"role": "user",
|
|
430
|
+
"content": f"""Question: {question}
|
|
431
|
+
|
|
432
|
+
Context:
|
|
433
|
+
{context}
|
|
434
|
+
|
|
435
|
+
Answer: {answer}
|
|
436
|
+
|
|
437
|
+
Evaluate the answer for {criteria}:"""
|
|
438
|
+
}
|
|
439
|
+
],
|
|
440
|
+
response_format={"type": "json_object"}
|
|
441
|
+
)
|
|
442
|
+
|
|
443
|
+
import json
|
|
444
|
+
result = json.loads(response.choices[0].message.content)
|
|
445
|
+
|
|
446
|
+
return EvalResult(
|
|
447
|
+
score=result["score"],
|
|
448
|
+
reasoning=result["reasoning"],
|
|
449
|
+
criteria=criteria
|
|
450
|
+
)
|
|
451
|
+
|
|
452
|
+
# Usage
|
|
453
|
+
eval_result = evaluate_with_llm(
|
|
454
|
+
question="How do I configure OAuth2?",
|
|
455
|
+
answer="Configure OAuth2 by setting client_id and client_secret in config.yaml.",
|
|
456
|
+
context="OAuth2 configuration requires client_id, client_secret, and redirect_uri in config.yaml.",
|
|
457
|
+
criteria="faithfulness"
|
|
458
|
+
)
|
|
459
|
+
print(f"Faithfulness: {eval_result.score:.2f}")
|
|
460
|
+
print(f"Reasoning: {eval_result.reasoning}")
|
|
461
|
+
```
|
|
462
|
+
|
|
463
|
+
### Batch Evaluation Pipeline
|
|
464
|
+
|
|
465
|
+
```python
|
|
466
|
+
import asyncio
|
|
467
|
+
from tqdm.asyncio import tqdm_asyncio
|
|
468
|
+
|
|
469
|
+
async def evaluate_batch(
|
|
470
|
+
test_cases: list[dict],
|
|
471
|
+
retriever,
|
|
472
|
+
generator,
|
|
473
|
+
metrics: list[str] = ["precision", "faithfulness", "relevance"]
|
|
474
|
+
) -> dict:
|
|
475
|
+
"""Run batch evaluation on test cases."""
|
|
476
|
+
|
|
477
|
+
results = {
|
|
478
|
+
"per_sample": [],
|
|
479
|
+
"aggregated": {}
|
|
480
|
+
}
|
|
481
|
+
|
|
482
|
+
async def evaluate_single(case: dict) -> dict:
|
|
483
|
+
# Retrieve
|
|
484
|
+
retrieved = await retriever.aretrieve(case["question"])
|
|
485
|
+
retrieved_ids = [r.id for r in retrieved]
|
|
486
|
+
|
|
487
|
+
# Generate
|
|
488
|
+
answer = await generator.agenerate(
|
|
489
|
+
question=case["question"],
|
|
490
|
+
context=[r.text for r in retrieved]
|
|
491
|
+
)
|
|
492
|
+
|
|
493
|
+
# Calculate metrics
|
|
494
|
+
sample_result = {
|
|
495
|
+
"question": case["question"],
|
|
496
|
+
"answer": answer,
|
|
497
|
+
"retrieved_ids": retrieved_ids
|
|
498
|
+
}
|
|
499
|
+
|
|
500
|
+
if "relevant_ids" in case and "precision" in metrics:
|
|
501
|
+
retrieval_metrics = calculate_retrieval_metrics(
|
|
502
|
+
retrieved_ids,
|
|
503
|
+
set(case["relevant_ids"]),
|
|
504
|
+
k=5
|
|
505
|
+
)
|
|
506
|
+
sample_result["precision@5"] = retrieval_metrics.precision_at_k
|
|
507
|
+
sample_result["recall@5"] = retrieval_metrics.recall_at_k
|
|
508
|
+
|
|
509
|
+
if "faithfulness" in metrics:
|
|
510
|
+
faith_eval = evaluate_with_llm(
|
|
511
|
+
case["question"],
|
|
512
|
+
answer,
|
|
513
|
+
"\n".join([r.text for r in retrieved]),
|
|
514
|
+
"faithfulness"
|
|
515
|
+
)
|
|
516
|
+
sample_result["faithfulness"] = faith_eval.score
|
|
517
|
+
|
|
518
|
+
return sample_result
|
|
519
|
+
|
|
520
|
+
# Run evaluations concurrently
|
|
521
|
+
tasks = [evaluate_single(case) for case in test_cases]
|
|
522
|
+
results["per_sample"] = await tqdm_asyncio.gather(*tasks)
|
|
523
|
+
|
|
524
|
+
# Aggregate results
|
|
525
|
+
for metric in ["precision@5", "recall@5", "faithfulness"]:
|
|
526
|
+
scores = [r.get(metric) for r in results["per_sample"] if r.get(metric) is not None]
|
|
527
|
+
if scores:
|
|
528
|
+
results["aggregated"][metric] = {
|
|
529
|
+
"mean": sum(scores) / len(scores),
|
|
530
|
+
"min": min(scores),
|
|
531
|
+
"max": max(scores)
|
|
532
|
+
}
|
|
533
|
+
|
|
534
|
+
return results
|
|
535
|
+
```
|
|
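In `evaluate_batch` above, `evaluate_with_llm` is a synchronous call made inside a coroutine, so each judge request blocks the event loop, and `gather` starts every test case at once. A minimal sketch of one way to address both, assuming Python 3.9+ for `asyncio.to_thread`; `slow_judge` is a hypothetical stand-in for the blocking judge call and `max_concurrency` is an illustrative parameter:

```python
import asyncio
import time


def slow_judge(question: str) -> float:
    """Hypothetical stand-in for a blocking LLM-judge call."""
    time.sleep(0.01)
    return float(len(question) % 2)


async def judge_all(questions: list[str], max_concurrency: int = 4) -> list[float]:
    """Run blocking judge calls in worker threads, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def judge_one(question: str) -> float:
        async with semaphore:
            # to_thread keeps the event loop free while the blocking call runs.
            return await asyncio.to_thread(slow_judge, question)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(judge_one(q) for q in questions))


scores = asyncio.run(judge_all(["a", "bb", "ccc"]))
print(scores)
```

Bounding concurrency also helps stay under provider rate limits when the test set grows.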
536
|
+
|
|
537
|
+
---
|
|
538
|
+
|
|
539
|
+
## Debugging Poor Retrieval
|
|
540
|
+
|
|
541
|
+
### Retrieval Diagnostics
|
|
542
|
+
|
|
543
|
+
```python
|
|
544
|
+
def diagnose_retrieval(
|
|
545
|
+
query: str,
|
|
546
|
+
retrieved_docs: list,
|
|
547
|
+
expected_docs: list,
|
|
548
|
+
embedding_model
|
|
549
|
+
) -> dict:
|
|
550
|
+
"""Diagnose why retrieval might be failing."""
|
|
551
|
+
|
|
552
|
+
query_embedding = embedding_model.encode(query)
|
|
553
|
+
retrieved_embeddings = [embedding_model.encode(d) for d in retrieved_docs]
|
|
554
|
+
expected_embeddings = [embedding_model.encode(d) for d in expected_docs]
|
|
555
|
+
|
|
556
|
+
from sklearn.metrics.pairwise import cosine_similarity
|
|
557
|
+
import numpy as np
|
|
558
|
+
|
|
559
|
+
diagnosis = {
|
|
560
|
+
"query": query,
|
|
561
|
+
"issues": []
|
|
562
|
+
}
|
|
563
|
+
|
|
564
|
+
# Check query-document similarity
|
|
565
|
+
for i, (doc, emb) in enumerate(zip(retrieved_docs, retrieved_embeddings)):
|
|
566
|
+
sim = cosine_similarity([query_embedding], [emb])[0][0]
|
|
567
|
+
if sim < 0.5:
|
|
568
|
+
diagnosis["issues"].append({
|
|
569
|
+
"type": "low_similarity",
|
|
570
|
+
"doc_index": i,
|
|
571
|
+
"similarity": float(sim),
|
|
572
|
+
"doc_preview": doc[:100]
|
|
573
|
+
})
|
|
574
|
+
|
|
575
|
+
# Check if expected docs would score higher
|
|
576
|
+
for i, (doc, emb) in enumerate(zip(expected_docs, expected_embeddings)):
|
|
577
|
+
sim = cosine_similarity([query_embedding], [emb])[0][0]
|
|
578
|
+
retrieved_max_sim = max(
|
|
579
|
+
cosine_similarity([query_embedding], [e])[0][0]
|
|
580
|
+
for e in retrieved_embeddings
|
|
581
|
+
)
|
|
582
|
+
|
|
583
|
+
if sim > retrieved_max_sim:
|
|
584
|
+
diagnosis["issues"].append({
|
|
585
|
+
"type": "missed_better_doc",
|
|
586
|
+
"expected_doc_index": i,
|
|
587
|
+
"expected_sim": float(sim),
|
|
588
|
+
"best_retrieved_sim": float(retrieved_max_sim),
|
|
589
|
+
"doc_preview": doc[:100]
|
|
590
|
+
})
|
|
591
|
+
|
|
592
|
+
# Check for vocabulary mismatch
|
|
593
|
+
query_terms = set(query.lower().split())
|
|
594
|
+
for i, doc in enumerate(retrieved_docs):
|
|
595
|
+
doc_terms = set(doc.lower().split())
|
|
596
|
+
overlap = query_terms & doc_terms
|
|
597
|
+
if len(overlap) < len(query_terms) * 0.3:
|
|
598
|
+
diagnosis["issues"].append({
|
|
599
|
+
"type": "vocabulary_mismatch",
|
|
600
|
+
"doc_index": i,
|
|
601
|
+
"query_terms": list(query_terms),
|
|
602
|
+
"overlapping_terms": list(overlap)
|
|
603
|
+
})
|
|
604
|
+
|
|
605
|
+
return diagnosis
|
|
606
|
+
|
|
607
|
+
# Usage
|
|
608
|
+
diagnosis = diagnose_retrieval(
|
|
609
|
+
query="How to configure OAuth authentication",
|
|
610
|
+
retrieved_docs=retrieved_texts,
|
|
611
|
+
expected_docs=expected_texts,
|
|
612
|
+
embedding_model=sentence_transformer
|
|
613
|
+
)
|
|
614
|
+
|
|
615
|
+
for issue in diagnosis["issues"]:
|
|
616
|
+
print(f"Issue: {issue['type']}")
|
|
617
|
+
print(f"Details: {issue}")
|
|
618
|
+
```
|
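The vocabulary check above splits on whitespace, so a token with attached punctuation (`OAuth?` versus `oauth`) never counts as overlap. Below is a minimal sketch of a punctuation-tolerant variant. `tokenize` and `vocabulary_overlap` are hypothetical helper names, not part of the original code, and the 0.3 cutoff is carried over from the check above.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and keep alphanumeric runs, dropping punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def vocabulary_overlap(query: str, doc: str) -> float:
    """Fraction of distinct query terms that also appear in the document."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0  # an empty query has nothing to overlap with
    return len(query_terms & tokenize(doc)) / len(query_terms)


ratio = vocabulary_overlap(
    "How to configure OAuth?",
    "Configure OAuth: set the client ID and secret.",
)
# Same 0.3 cutoff as the whitespace-based check (an assumption, tune as needed)
print(f"overlap={ratio:.2f}, mismatch={ratio < 0.3}")
```

This handles ASCII text only. Stemming or a real tokenizer would be needed for inflected or non-English queries.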
|
619
|
+
|
|
620
|
+
### Query Analysis
|
|
621
|
+
|
|
622
|
+
```python
|
|
623
|
+
def analyze_query_performance(
|
|
624
|
+
query_logs: list[dict],
|
|
625
|
+
threshold_precision: float = 0.6
|
|
626
|
+
) -> dict:
|
|
627
|
+
"""Analyze query patterns to find systematic issues."""
|
|
628
|
+
|
|
629
|
+
analysis = {
|
|
630
|
+
"total_queries": len(query_logs),
|
|
631
|
+
"low_performing": [],
|
|
632
|
+
"patterns": {}
|
|
633
|
+
}
|
|
634
|
+
|
|
635
|
+
for log in query_logs:
|
|
636
|
+
if log.get("precision@5", 1.0) < threshold_precision:
|
|
637
|
+
analysis["low_performing"].append(log)
|
|
638
|
+
|
|
639
|
+
# Analyze low-performing queries
|
|
640
|
+
if analysis["low_performing"]:
|
|
641
|
+
# Check for common patterns
|
|
642
|
+
low_perf_queries = [entry["query"] for entry in analysis["low_performing"]]
|
|
643
|
+
|
|
644
|
+
# Query length analysis
|
|
645
|
+
avg_length = sum(len(q.split()) for q in low_perf_queries) / len(low_perf_queries)
|
|
646
|
+
analysis["patterns"]["avg_low_perf_query_length"] = avg_length
|
|
647
|
+
|
|
648
|
+
# Common terms in failing queries
|
|
649
|
+
from collections import Counter
|
|
650
|
+
all_terms = []
|
|
651
|
+
for q in low_perf_queries:
|
|
652
|
+
all_terms.extend(q.lower().split())
|
|
653
|
+
analysis["patterns"]["common_failing_terms"] = Counter(all_terms).most_common(10)
|
|
654
|
+
|
|
655
|
+
# Question type analysis
|
|
656
|
+
question_words = ["how", "what", "why", "when", "where", "who"]
|
|
657
|
+
question_types = Counter()
|
|
658
|
+
for q in low_perf_queries:
|
|
659
|
+
for qw in question_words:
|
|
660
|
+
if q.lower().startswith(qw):
|
|
661
|
+
question_types[qw] += 1
|
|
662
|
+
break
|
|
663
|
+
else:
|
|
664
|
+
question_types["other"] += 1
|
|
665
|
+
analysis["patterns"]["failing_question_types"] = dict(question_types)
|
|
666
|
+
|
|
667
|
+
return analysis
|
|
668
|
+
```
|
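Raw term counts over failing queries tend to be dominated by words that are common in every query ("how", "to"). One possible refinement, sketched below, is to compare how often each term appears in failing queries with how often it appears overall. `term_failure_rates` and `min_count` are hypothetical names. The log shape (`query`, `precision@5`) follows the function above.

```python
from collections import Counter


def term_failure_rates(
    query_logs: list[dict],
    threshold_precision: float = 0.6,
    min_count: int = 2,
) -> list[tuple[str, float, int]]:
    """Return (term, failure_rate, query_count), highest failure rate first."""
    total = Counter()    # number of queries containing each term
    failing = Counter()  # number of low-precision queries containing each term
    for log in query_logs:
        terms = set(log["query"].lower().split())
        total.update(terms)
        if log.get("precision@5", 1.0) < threshold_precision:
            failing.update(terms)

    rates = [
        (term, failing[term] / count, count)
        for term, count in total.items()
        if count >= min_count  # skip terms too rare to say anything about
    ]
    # Sort by failure rate, then by support, then alphabetically for stability
    return sorted(rates, key=lambda r: (-r[1], -r[2], r[0]))


logs = [
    {"query": "how to rotate api keys", "precision@5": 0.2},
    {"query": "how to rotate tokens", "precision@5": 0.4},
    {"query": "how to configure oauth", "precision@5": 0.9},
    {"query": "how to reset password", "precision@5": 0.8},
]
for term, rate, count in term_failure_rates(logs):
    print(f"{term}: {rate:.2f} ({count} queries)")
```

In this toy log, "rotate" appears only in failing queries while "how" and "to" fail half the time, so the topic term ranks first.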
|
669
|
+
|
|
670
|
+
---
|
|
671
|
+
|
|
672
|
+
## Continuous Monitoring
|
|
673
|
+
|
|
674
|
+
### Production Metrics Dashboard
|
|
675
|
+
|
|
676
|
+
```python
|
|
677
|
+
import time
|
|
678
|
+
from dataclasses import dataclass, field
|
|
679
|
+
from collections import deque
|
|
680
|
+
from threading import Lock
|
|
681
|
+
|
|
682
|
+
@dataclass
|
|
683
|
+
class RAGMetricsCollector:
|
|
684
|
+
"""Collect and track RAG metrics in production."""
|
|
685
|
+
|
|
686
|
+
window_size: int = 1000  # NOTE: not wired to the deques below, which hard-code maxlen=1000
|
|
687
|
+
_latencies: deque = field(default_factory=lambda: deque(maxlen=1000))
|
|
688
|
+
_retrieval_scores: deque = field(default_factory=lambda: deque(maxlen=1000))
|
|
689
|
+
_generation_scores: deque = field(default_factory=lambda: deque(maxlen=1000))
|
|
690
|
+
_lock: Lock = field(default_factory=Lock)
|
|
691
|
+
|
|
692
|
+
def record_query(
|
|
693
|
+
self,
|
|
694
|
+
latency_ms: float,
|
|
695
|
+
retrieval_score: float | None = None,
|
|
696
|
+
generation_score: float | None = None
|
|
697
|
+
):
|
|
698
|
+
"""Record metrics for a single query."""
|
|
699
|
+
with self._lock:
|
|
700
|
+
self._latencies.append(latency_ms)
|
|
701
|
+
if retrieval_score is not None:
|
|
702
|
+
self._retrieval_scores.append(retrieval_score)
|
|
703
|
+
if generation_score is not None:
|
|
704
|
+
self._generation_scores.append(generation_score)
|
|
705
|
+
|
|
706
|
+
def get_summary(self) -> dict:
|
|
707
|
+
"""Get current metrics summary."""
|
|
708
|
+
with self._lock:
|
|
709
|
+
import numpy as np
|
|
710
|
+
|
|
711
|
+
summary = {
|
|
712
|
+
"queries_in_window": len(self._latencies),
|
|
713
|
+
"latency": {
|
|
714
|
+
"p50": np.percentile(self._latencies, 50) if self._latencies else 0,
|
|
715
|
+
"p95": np.percentile(self._latencies, 95) if self._latencies else 0,
|
|
716
|
+
"p99": np.percentile(self._latencies, 99) if self._latencies else 0,
|
|
717
|
+
},
|
|
718
|
+
"retrieval_score": {
|
|
719
|
+
"mean": np.mean(self._retrieval_scores) if self._retrieval_scores else 0,
|
|
720
|
+
"std": np.std(self._retrieval_scores) if self._retrieval_scores else 0,
|
|
721
|
+
},
|
|
722
|
+
"generation_score": {
|
|
723
|
+
"mean": np.mean(self._generation_scores) if self._generation_scores else 0,
|
|
724
|
+
"std": np.std(self._generation_scores) if self._generation_scores else 0,
|
|
725
|
+
}
|
|
726
|
+
}
|
|
727
|
+
|
|
728
|
+
return summary
|
|
729
|
+
|
|
730
|
+
# Usage
|
|
731
|
+
metrics = RAGMetricsCollector()
|
|
732
|
+
|
|
733
|
+
# In your RAG endpoint
|
|
734
|
+
start = time.perf_counter()  # monotonic clock; unaffected by system clock adjustments
|
|
735
|
+
response = rag_pipeline.query(question)
|
|
736
|
+
latency = (time.perf_counter() - start) * 1000
|
|
737
|
+
|
|
738
|
+
metrics.record_query(
|
|
739
|
+
latency_ms=latency,
|
|
740
|
+
retrieval_score=response.get("retrieval_score"),
|
|
741
|
+
generation_score=response.get("generation_score")
|
|
742
|
+
)
|
|
743
|
+
|
|
744
|
+
# Periodically check
|
|
745
|
+
print(metrics.get_summary())
|
|
746
|
+
```
|
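In the collector above, `window_size` is declared but the deques are created with a fixed `maxlen=1000`. Here is a minimal sketch of one way to size the window from the field, using `__post_init__`. `WindowedLatencies` and `timed` are hypothetical names, and locking is left out for brevity, so this version is not thread-safe.

```python
import time
from collections import deque
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class WindowedLatencies:
    """Hypothetical helper: rolling latency window sized by window_size."""

    window_size: int = 1000
    _values: deque = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # Size the deque from the field instead of a hard-coded maxlen
        self._values = deque(maxlen=self.window_size)

    @contextmanager
    def timed(self):
        """Time the wrapped block with a monotonic clock and record milliseconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self._values.append((time.perf_counter() - start) * 1000)

    def __len__(self) -> int:
        return len(self._values)


window = WindowedLatencies(window_size=3)
for _ in range(5):
    with window.timed():
        sum(range(1000))  # stand-in for rag_pipeline.query(question)
print(len(window))  # only the 3 most recent samples are kept
```

The context manager records latency even when the wrapped call raises, which keeps failed requests visible in the percentiles.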
|
747
|
+
|
|
748
|
+
### Alerting on Quality Degradation
|
|
749
|
+
|
|
750
|
+
```python
|
|
751
|
+
class RAGQualityMonitor:
|
|
752
|
+
"""Monitor RAG quality and alert on degradation."""
|
|
753
|
+
|
|
754
|
+
def __init__(
|
|
755
|
+
self,
|
|
756
|
+
baseline_precision: float = 0.8,
|
|
757
|
+
alert_threshold: float = 0.1,  # Alert if the mean drops more than 0.1 (absolute) below baseline
|
|
758
|
+
window_size: int = 100
|
|
759
|
+
):
|
|
760
|
+
self.baseline = baseline_precision
|
|
761
|
+
self.threshold = alert_threshold
|
|
762
|
+
self.window_size = window_size
|
|
763
|
+
self.recent_scores = deque(maxlen=window_size)
|
|
764
|
+
|
|
765
|
+
def record_score(self, precision: float) -> dict | None:
|
|
766
|
+
"""Record score and return alert if quality degraded."""
|
|
767
|
+
self.recent_scores.append(precision)
|
|
768
|
+
|
|
769
|
+
if len(self.recent_scores) < self.window_size // 2:
|
|
770
|
+
return None # Not enough data
|
|
771
|
+
|
|
772
|
+
current_mean = sum(self.recent_scores) / len(self.recent_scores)
|
|
773
|
+
degradation = self.baseline - current_mean
|
|
774
|
+
|
|
775
|
+
if degradation > self.threshold:
|
|
776
|
+
return {
|
|
777
|
+
"alert": "QUALITY_DEGRADATION",
|
|
778
|
+
"baseline": self.baseline,
|
|
779
|
+
"current": current_mean,
|
|
780
|
+
"degradation": degradation,
|
|
781
|
+
"window_size": len(self.recent_scores)
|
|
782
|
+
}
|
|
783
|
+
|
|
784
|
+
return None
|
|
785
|
+
|
|
786
|
+
# Usage
|
|
787
|
+
monitor = RAGQualityMonitor(baseline_precision=0.85)
|
|
788
|
+
|
|
789
|
+
for query_result in production_queries:
|
|
790
|
+
alert = monitor.record_score(query_result["precision@5"])
|
|
791
|
+
if alert:
|
|
792
|
+
send_alert(alert) # Slack, PagerDuty, etc.
|
|
793
|
+
```
|
|
794
|
+
|
|
795
|
+
---
|
|
796
|
+
|
|
797
|
+
## Evaluation Best Practices
|
|
798
|
+
|
|
799
|
+
| Practice | Description |
|
|
800
|
+
|----------|-------------|
|
|
801
|
+
| **Golden test set** | Maintain 50-200 curated Q&A pairs with ground truth |
|
|
802
|
+
| **Stratified sampling** | Include diverse query types in test set |
|
|
803
|
+
| **Human baselines** | Compare LLM judges against human annotators |
|
|
804
|
+
| **Version control** | Track evaluation results alongside model versions |
|
|
805
|
+
| **Regular re-evaluation** | Re-run golden tests on every retrieval change |
|
|
806
|
+
| **A/B testing** | Compare new retrieval strategies on live traffic |
|
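The golden test set and regular re-evaluation practices can be combined into a small regression gate. The sketch below assumes each golden case has a `query` and a list of `relevant_ids`, and that `retrieve` returns ranked document IDs. Those field names, the helper names, and the 0.02 tolerance are all hypothetical.

```python
from typing import Callable


def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Relevant documents among the top-k results, divided by k."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def run_golden_set(
    golden_set: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Mean precision@k of `retrieve` over the golden set."""
    scores = [
        precision_at_k(retrieve(case["query"]), set(case["relevant_ids"]), k)
        for case in golden_set
    ]
    return sum(scores) / len(scores) if scores else 0.0


def regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when the current score falls more than `tolerance` below the baseline."""
    return baseline - current > tolerance


golden_set = [
    {"query": "configure oauth", "relevant_ids": ["d1", "d2"]},
    {"query": "rotate api keys", "relevant_ids": ["d7"]},
]
# Stand-in for a real retriever: a fixed query -> ranked IDs lookup
fake_index = {"configure oauth": ["d1", "d2"], "rotate api keys": ["d3", "d7"]}
score = run_golden_set(golden_set, lambda q: fake_index[q], k=2)
print(score, regressed(score, baseline=0.85))
```

Storing the baseline score next to the retrieval configuration it was measured on makes a flagged regression traceable to a specific change.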
|
807
|
+
|
|
808
|
+
---
|
|
809
|
+
|
|
810
|
+
## Quick Reference
|
|
811
|
+
|
|
812
|
+
| Goal | Metric | Target |
|
|
813
|
+
|------|--------|--------|
|
|
814
|
+
| Are docs relevant? | Precision@5 | > 0.7 |
|
|
815
|
+
| Did we get all docs? | Recall@5 | > 0.8 |
|
|
816
|
+
| Is ranking good? | NDCG@5 | > 0.7 |
|
|
817
|
+
| Is answer grounded? | Faithfulness | > 0.9 |
|
|
818
|
+
| Does answer fit question? | Answer Relevance | > 0.8 |
|
|
819
|
+
| Is context useful? | Context Relevance | > 0.7 |
|
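The targets in the table above can be kept as data and checked mechanically after each evaluation run. A minimal sketch follows. Apart from `precision@5`, which the code above already uses, the metric key names are assumptions, as is the `check_targets` helper.

```python
# Targets copied from the table above; each is a strict lower bound
TARGETS = {
    "precision@5": 0.7,
    "recall@5": 0.8,
    "ndcg@5": 0.7,
    "faithfulness": 0.9,
    "answer_relevance": 0.8,
    "context_relevance": 0.7,
}


def check_targets(
    measured: dict[str, float],
    targets: dict[str, float] = TARGETS,
) -> dict[str, str]:
    """Label each target metric as 'pass', 'fail', or 'missing'."""
    report = {}
    for name, target in targets.items():
        if name not in measured:
            report[name] = "missing"
        elif measured[name] > target:
            report[name] = "pass"
        else:
            report[name] = "fail"
    return report


measured = {"precision@5": 0.74, "recall@5": 0.80, "ndcg@5": 0.65, "faithfulness": 0.93}
report = check_targets(measured)
for name, status in report.items():
    print(f"{name}: {status}")
```

Because the table states targets as strictly greater than the threshold, a recall of exactly 0.80 is reported as a failure here.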
|
820
|
+
|
|
821
|
+
| Framework | Best For |
|
|
822
|
+
|-----------|----------|
|
|
823
|
+
| RAGAS | Quick RAG-specific evaluation |
|
|
824
|
+
| TruLens | Production monitoring and tracing |
|
|
825
|
+
| Custom LLM-judge | Domain-specific criteria |
|
|
826
|
+
| Manual annotation | Ground truth creation |
|
|
827
|
+
|
|
828
|
+
## Related Skills
|
|
829
|
+
|
|
830
|
+
- **RAG Architect** - System design
|
|
831
|
+
- **ML Pipeline** - Evaluation automation
|
|
832
|
+
- **Data Scientist** - Statistical analysis
|
|
833
|
+
- **Monitoring Expert** - Production observability
|