evalvault 1.64.0__tar.gz → 1.66.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (844)
  1. {evalvault-1.64.0 → evalvault-1.66.0}/PKG-INFO +25 -1
  2. {evalvault-1.64.0 → evalvault-1.66.0}/README.md +23 -0
  3. evalvault-1.66.0/config/ragas_prompts_override.yaml +11 -0
  4. {evalvault-1.64.0 → evalvault-1.66.0}/docs/INDEX.md +7 -3
  5. {evalvault-1.64.0 → evalvault-1.66.0}/docs/ROADMAP.md +9 -1
  6. {evalvault-1.64.0 → evalvault-1.66.0}/docs/STATUS.md +14 -1
  7. evalvault-1.66.0/docs/guides/CLI_PARALLEL_FEATURES_SPEC.md +315 -0
  8. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/EVALVAULT_RUN_EXCEL_SHEETS.md +16 -0
  9. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/EVALVAULT_WORK_PLAN.md +3 -4
  10. evalvault-1.66.0/docs/guides/Extension_2.md +114 -0
  11. evalvault-1.66.0/docs/guides/Extension_Data_Difficulty_Profiling_Custom_Judge_Model.md +1412 -0
  12. evalvault-1.66.0/docs/guides/INSURANCE_SUMMARY_METRICS_PLAN.md +152 -0
  13. evalvault-1.66.0/docs/guides/PARALLEL_WORK_APPROVAL_RULES.md +51 -0
  14. evalvault-1.66.0/docs/guides/PROJECT_STATUS_AND_PLAN.md +291 -0
  15. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/RAGAS_HUMAN_FEEDBACK_CALIBRATION_GUIDE.md +77 -2
  16. evalvault-1.66.0/docs/guides/RAG_CLI_WORKFLOW_TEMPLATES.md +318 -0
  17. evalvault-1.66.0/docs/guides/RAG_NOISE_REDUCTION_GUIDE.md +284 -0
  18. evalvault-1.66.0/docs/guides/RAG_PERFORMANCE_IMPLEMENTATION_LOG.md +180 -0
  19. evalvault-1.66.0/docs/guides/RAG_PERFORMANCE_IMPROVEMENT_PROPOSAL.md +477 -0
  20. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/USER_GUIDE.md +200 -7
  21. evalvault-1.66.0/docs/guides/refactoring_strategy.md +63 -0
  22. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/01_overview.md +1 -0
  23. evalvault-1.66.0/docs/refactor/REFAC_000_master_plan.md +161 -0
  24. evalvault-1.66.0/docs/refactor/REFAC_010_agent_playbook.md +83 -0
  25. evalvault-1.66.0/docs/refactor/REFAC_020_logging_policy.md +61 -0
  26. evalvault-1.66.0/docs/refactor/REFAC_030_phase0_responsibility_map.md +82 -0
  27. evalvault-1.66.0/docs/refactor/REFAC_040_wbs_parallel_plan.md +117 -0
  28. evalvault-1.66.0/docs/refactor/logs/phase-0-baseline.md +26 -0
  29. evalvault-1.66.0/docs/refactor/logs/phase-1-evaluator.md +26 -0
  30. evalvault-1.66.0/docs/refactor/logs/phase-2-cli-run.md +26 -0
  31. evalvault-1.66.0/docs/refactor/logs/phase-3-analysis.md +26 -0
  32. evalvault-1.66.0/docs/templates/eval_report_templates.md +117 -0
  33. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/App.tsx +2 -0
  34. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/components/InsightSpacePanel.tsx +33 -1
  35. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/components/Layout.tsx +3 -1
  36. evalvault-1.66.0/frontend/src/components/ai-elements/Conversation.tsx +23 -0
  37. evalvault-1.66.0/frontend/src/components/ai-elements/Message.tsx +48 -0
  38. evalvault-1.66.0/frontend/src/components/ai-elements/PromptInput.tsx +64 -0
  39. evalvault-1.66.0/frontend/src/components/ai-elements/Response.tsx +14 -0
  40. evalvault-1.66.0/frontend/src/components/ai-elements/index.ts +4 -0
  41. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/config/ui.ts +17 -0
  42. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/AnalysisLab.tsx +36 -0
  43. evalvault-1.66.0/frontend/src/pages/Chat.tsx +207 -0
  44. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/CompareRuns.tsx +42 -1
  45. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/EvaluationStudio.tsx +22 -12
  46. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/RunDetails.tsx +75 -3
  47. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/Visualization.tsx +32 -5
  48. {evalvault-1.64.0 → evalvault-1.66.0}/pyproject.toml +2 -1
  49. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/api/adapter.py +14 -0
  50. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/api/main.py +14 -4
  51. evalvault-1.66.0/src/evalvault/adapters/inbound/api/routers/chat.py +543 -0
  52. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/__init__.py +14 -7
  53. evalvault-1.66.0/src/evalvault/adapters/inbound/cli/commands/artifacts.py +107 -0
  54. evalvault-1.66.0/src/evalvault/adapters/inbound/cli/commands/calibrate_judge.py +283 -0
  55. evalvault-1.66.0/src/evalvault/adapters/inbound/cli/commands/compare.py +290 -0
  56. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/history.py +13 -85
  57. evalvault-1.66.0/src/evalvault/adapters/inbound/cli/commands/ops.py +110 -0
  58. evalvault-1.66.0/src/evalvault/adapters/inbound/cli/commands/profile_difficulty.py +160 -0
  59. evalvault-1.66.0/src/evalvault/adapters/inbound/cli/commands/regress.py +251 -0
  60. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/run.py +14 -0
  61. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/run_helpers.py +21 -2
  62. evalvault-1.66.0/src/evalvault/adapters/outbound/analysis/comparison_pipeline_adapter.py +49 -0
  63. evalvault-1.66.0/src/evalvault/adapters/outbound/artifact_fs.py +16 -0
  64. evalvault-1.66.0/src/evalvault/adapters/outbound/filesystem/__init__.py +3 -0
  65. evalvault-1.66.0/src/evalvault/adapters/outbound/filesystem/difficulty_profile_writer.py +50 -0
  66. evalvault-1.66.0/src/evalvault/adapters/outbound/filesystem/ops_snapshot_writer.py +13 -0
  67. evalvault-1.66.0/src/evalvault/adapters/outbound/judge_calibration_adapter.py +36 -0
  68. evalvault-1.66.0/src/evalvault/adapters/outbound/judge_calibration_reporter.py +57 -0
  69. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/report/llm_report_generator.py +13 -1
  70. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/storage/base_sql.py +41 -1
  71. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/tracker/langfuse_adapter.py +13 -7
  72. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/tracker/mlflow_adapter.py +5 -0
  73. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/tracker/phoenix_adapter.py +68 -14
  74. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/config/settings.py +21 -0
  75. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/__init__.py +10 -0
  76. evalvault-1.66.0/src/evalvault/domain/entities/judge_calibration.py +50 -0
  77. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/prompt.py +1 -1
  78. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/stage.py +11 -3
  79. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/metrics/__init__.py +8 -0
  80. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/metrics/registry.py +39 -3
  81. evalvault-1.66.0/src/evalvault/domain/metrics/summary_accuracy.py +189 -0
  82. evalvault-1.66.0/src/evalvault/domain/metrics/summary_needs_followup.py +45 -0
  83. evalvault-1.66.0/src/evalvault/domain/metrics/summary_non_definitive.py +41 -0
  84. evalvault-1.66.0/src/evalvault/domain/metrics/summary_risk_coverage.py +45 -0
  85. evalvault-1.66.0/src/evalvault/domain/services/artifact_lint_service.py +268 -0
  86. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/benchmark_runner.py +1 -6
  87. evalvault-1.66.0/src/evalvault/domain/services/custom_metric_snapshot.py +233 -0
  88. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/dataset_preprocessor.py +26 -0
  89. evalvault-1.66.0/src/evalvault/domain/services/difficulty_profile_reporter.py +25 -0
  90. evalvault-1.66.0/src/evalvault/domain/services/difficulty_profiling_service.py +304 -0
  91. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/evaluator.py +282 -27
  92. evalvault-1.66.0/src/evalvault/domain/services/judge_calibration_service.py +495 -0
  93. evalvault-1.66.0/src/evalvault/domain/services/ops_snapshot_service.py +159 -0
  94. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/prompt_registry.py +39 -10
  95. evalvault-1.66.0/src/evalvault/domain/services/regression_gate_service.py +199 -0
  96. evalvault-1.66.0/src/evalvault/domain/services/run_comparison_service.py +159 -0
  97. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/stage_event_builder.py +6 -1
  98. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/stage_metric_service.py +83 -18
  99. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/threshold_profiles.py +4 -0
  100. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/visual_space_service.py +79 -4
  101. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/__init__.py +4 -0
  102. evalvault-1.66.0/src/evalvault/ports/outbound/artifact_fs_port.py +12 -0
  103. evalvault-1.66.0/src/evalvault/ports/outbound/comparison_pipeline_port.py +22 -0
  104. evalvault-1.66.0/src/evalvault/ports/outbound/difficulty_profile_port.py +15 -0
  105. evalvault-1.66.0/src/evalvault/ports/outbound/judge_calibration_port.py +22 -0
  106. evalvault-1.66.0/src/evalvault/ports/outbound/ops_snapshot_port.py +8 -0
  107. evalvault-1.66.0/tests/fixtures/e2e/callcenter_summary_5cases.json +74 -0
  108. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/summary_eval_minimal.json +15 -3
  109. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/services/test_dataset_preprocessor.py +45 -0
  110. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/services/test_evaluator_comprehensive.py +34 -2
  111. evalvault-1.66.0/tests/unit/domain/services/test_judge_calibration_service.py +95 -0
  112. evalvault-1.66.0/tests/unit/domain/services/test_ops_snapshot_service.py +52 -0
  113. evalvault-1.66.0/tests/unit/domain/services/test_regression_gate_service.py +68 -0
  114. evalvault-1.66.0/tests/unit/test_artifact_lint_service.py +68 -0
  115. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_cli.py +110 -20
  116. evalvault-1.66.0/tests/unit/test_cli_artifacts.py +37 -0
  117. evalvault-1.66.0/tests/unit/test_cli_calibrate_judge.py +111 -0
  118. evalvault-1.66.0/tests/unit/test_cli_ops.py +14 -0
  119. evalvault-1.66.0/tests/unit/test_difficulty_profiling_service.py +120 -0
  120. evalvault-1.66.0/tests/unit/test_regress_cli.py +104 -0
  121. evalvault-1.66.0/tests/unit/test_run_comparison_service.py +86 -0
  122. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_stage_metric_service.py +58 -0
  123. {evalvault-1.64.0 → evalvault-1.66.0}/uv.lock +1250 -108
  124. evalvault-1.64.0/config/ragas_prompts_override.yaml +0 -4
  125. {evalvault-1.64.0 → evalvault-1.66.0}/.dockerignore +0 -0
  126. {evalvault-1.64.0 → evalvault-1.66.0}/.env.example +0 -0
  127. {evalvault-1.64.0 → evalvault-1.66.0}/.github/workflows/ci.yml +0 -0
  128. {evalvault-1.64.0 → evalvault-1.66.0}/.github/workflows/release.yml +0 -0
  129. {evalvault-1.64.0 → evalvault-1.66.0}/.github/workflows/stale.yml +0 -0
  130. {evalvault-1.64.0 → evalvault-1.66.0}/.gitignore +0 -0
  131. {evalvault-1.64.0 → evalvault-1.66.0}/.pre-commit-config.yaml +0 -0
  132. {evalvault-1.64.0 → evalvault-1.66.0}/.python-version +0 -0
  133. {evalvault-1.64.0 → evalvault-1.66.0}/AGENTS.md +0 -0
  134. {evalvault-1.64.0 → evalvault-1.66.0}/CHANGELOG.md +0 -0
  135. {evalvault-1.64.0 → evalvault-1.66.0}/CLAUDE.md +0 -0
  136. {evalvault-1.64.0 → evalvault-1.66.0}/CODE_OF_CONDUCT.md +0 -0
  137. {evalvault-1.64.0 → evalvault-1.66.0}/CONTRIBUTING.md +0 -0
  138. {evalvault-1.64.0 → evalvault-1.66.0}/Dockerfile +0 -0
  139. {evalvault-1.64.0 → evalvault-1.66.0}/LICENSE.md +0 -0
  140. {evalvault-1.64.0 → evalvault-1.66.0}/README.en.md +0 -0
  141. {evalvault-1.64.0 → evalvault-1.66.0}/SECURITY.md +0 -0
  142. {evalvault-1.64.0 → evalvault-1.66.0}/agent/README.md +0 -0
  143. {evalvault-1.64.0 → evalvault-1.66.0}/agent/agent.py +0 -0
  144. {evalvault-1.64.0 → evalvault-1.66.0}/agent/client.py +0 -0
  145. {evalvault-1.64.0 → evalvault-1.66.0}/agent/config.py +0 -0
  146. {evalvault-1.64.0 → evalvault-1.66.0}/agent/main.py +0 -0
  147. {evalvault-1.64.0 → evalvault-1.66.0}/agent/memory/README.md +0 -0
  148. {evalvault-1.64.0 → evalvault-1.66.0}/agent/memory/shared/decisions.md +0 -0
  149. {evalvault-1.64.0 → evalvault-1.66.0}/agent/memory/shared/dependencies.md +0 -0
  150. {evalvault-1.64.0 → evalvault-1.66.0}/agent/memory/templates/coordinator_guide.md +0 -0
  151. {evalvault-1.64.0 → evalvault-1.66.0}/agent/memory/templates/work_log_template.md +0 -0
  152. {evalvault-1.64.0 → evalvault-1.66.0}/agent/memory_integration.py +0 -0
  153. {evalvault-1.64.0 → evalvault-1.66.0}/agent/progress.py +0 -0
  154. {evalvault-1.64.0 → evalvault-1.66.0}/agent/prompts/app_spec.txt +0 -0
  155. {evalvault-1.64.0 → evalvault-1.66.0}/agent/prompts/baseline.txt +0 -0
  156. {evalvault-1.64.0 → evalvault-1.66.0}/agent/prompts/coding_prompt.md +0 -0
  157. {evalvault-1.64.0 → evalvault-1.66.0}/agent/prompts/existing_project_prompt.md +0 -0
  158. {evalvault-1.64.0 → evalvault-1.66.0}/agent/prompts/improvement/architecture_prompt.md +0 -0
  159. {evalvault-1.64.0 → evalvault-1.66.0}/agent/prompts/improvement/base_prompt.md +0 -0
  160. {evalvault-1.64.0 → evalvault-1.66.0}/agent/prompts/improvement/coordinator_prompt.md +0 -0
  161. {evalvault-1.64.0 → evalvault-1.66.0}/agent/prompts/improvement/observability_prompt.md +0 -0
  162. {evalvault-1.64.0 → evalvault-1.66.0}/agent/prompts/initializer_prompt.md +0 -0
  163. {evalvault-1.64.0 → evalvault-1.66.0}/agent/prompts/prompt_manifest.json +0 -0
  164. {evalvault-1.64.0 → evalvault-1.66.0}/agent/prompts/system.txt +0 -0
  165. {evalvault-1.64.0 → evalvault-1.66.0}/agent/prompts.py +0 -0
  166. {evalvault-1.64.0 → evalvault-1.66.0}/agent/requirements.txt +0 -0
  167. {evalvault-1.64.0 → evalvault-1.66.0}/agent/security.py +0 -0
  168. {evalvault-1.64.0 → evalvault-1.66.0}/config/domains/insurance/memory.yaml +0 -0
  169. {evalvault-1.64.0 → evalvault-1.66.0}/config/domains/insurance/terms_dictionary_en.json +0 -0
  170. {evalvault-1.64.0 → evalvault-1.66.0}/config/domains/insurance/terms_dictionary_ko.json +0 -0
  171. {evalvault-1.64.0 → evalvault-1.66.0}/config/methods.yaml +0 -0
  172. {evalvault-1.64.0 → evalvault-1.66.0}/config/models.yaml +0 -0
  173. {evalvault-1.64.0 → evalvault-1.66.0}/config/regressions/default.json +0 -0
  174. {evalvault-1.64.0 → evalvault-1.66.0}/config/regressions/ux.json +0 -0
  175. {evalvault-1.64.0 → evalvault-1.66.0}/config/stage_metric_playbook.yaml +0 -0
  176. {evalvault-1.64.0 → evalvault-1.66.0}/config/stage_metric_thresholds.json +0 -0
  177. {evalvault-1.64.0 → evalvault-1.66.0}/data/datasets/dummy_test_dataset.json +0 -0
  178. {evalvault-1.64.0 → evalvault-1.66.0}/data/datasets/insurance_qa_korean.csv +0 -0
  179. {evalvault-1.64.0 → evalvault-1.66.0}/data/datasets/insurance_qa_korean.json +0 -0
  180. {evalvault-1.64.0 → evalvault-1.66.0}/data/datasets/insurance_qa_korean_2.json +0 -0
  181. {evalvault-1.64.0 → evalvault-1.66.0}/data/datasets/insurance_qa_korean_3.json +0 -0
  182. {evalvault-1.64.0 → evalvault-1.66.0}/data/datasets/ragas_ko90_en10.json +0 -0
  183. {evalvault-1.64.0 → evalvault-1.66.0}/data/datasets/sample.json +0 -0
  184. {evalvault-1.64.0 → evalvault-1.66.0}/data/datasets/visualization_20q_cluster_map.csv +0 -0
  185. {evalvault-1.64.0 → evalvault-1.66.0}/data/datasets/visualization_20q_korean.json +0 -0
  186. {evalvault-1.64.0 → evalvault-1.66.0}/data/datasets/visualization_2q_cluster_map.csv +0 -0
  187. {evalvault-1.64.0 → evalvault-1.66.0}/data/datasets/visualization_2q_korean.json +0 -0
  188. {evalvault-1.64.0 → evalvault-1.66.0}/data/kg/knowledge_graph.json +0 -0
  189. {evalvault-1.64.0 → evalvault-1.66.0}/data/raw/The Complete Guide to Mastering Suno Advanced Strategies for Professional Music Generation.md +0 -0
  190. {evalvault-1.64.0 → evalvault-1.66.0}/data/raw/edge_cases.json +0 -0
  191. {evalvault-1.64.0 → evalvault-1.66.0}/data/raw/run_mode_full_domain_memory.json +0 -0
  192. {evalvault-1.64.0 → evalvault-1.66.0}/data/raw/sample_rag_knowledge.txt +0 -0
  193. {evalvault-1.64.0 → evalvault-1.66.0}/dataset_templates/dataset_template.csv +0 -0
  194. {evalvault-1.64.0 → evalvault-1.66.0}/dataset_templates/dataset_template.json +0 -0
  195. {evalvault-1.64.0 → evalvault-1.66.0}/dataset_templates/dataset_template.xlsx +0 -0
  196. {evalvault-1.64.0 → evalvault-1.66.0}/dataset_templates/method_input_template.json +0 -0
  197. {evalvault-1.64.0 → evalvault-1.66.0}/docker-compose.langfuse.yml +0 -0
  198. {evalvault-1.64.0 → evalvault-1.66.0}/docker-compose.phoenix.yaml +0 -0
  199. {evalvault-1.64.0 → evalvault-1.66.0}/docker-compose.yml +0 -0
  200. {evalvault-1.64.0 → evalvault-1.66.0}/docs/README.ko.md +0 -0
  201. {evalvault-1.64.0 → evalvault-1.66.0}/docs/api/adapters/inbound.md +0 -0
  202. {evalvault-1.64.0 → evalvault-1.66.0}/docs/api/adapters/outbound.md +0 -0
  203. {evalvault-1.64.0 → evalvault-1.66.0}/docs/api/config.md +0 -0
  204. {evalvault-1.64.0 → evalvault-1.66.0}/docs/api/domain/entities.md +0 -0
  205. {evalvault-1.64.0 → evalvault-1.66.0}/docs/api/domain/metrics.md +0 -0
  206. {evalvault-1.64.0 → evalvault-1.66.0}/docs/api/domain/services.md +0 -0
  207. {evalvault-1.64.0 → evalvault-1.66.0}/docs/api/ports/inbound.md +0 -0
  208. {evalvault-1.64.0 → evalvault-1.66.0}/docs/api/ports/outbound.md +0 -0
  209. {evalvault-1.64.0 → evalvault-1.66.0}/docs/architecture/open-rag-trace-collector.md +0 -0
  210. {evalvault-1.64.0 → evalvault-1.66.0}/docs/architecture/open-rag-trace-spec.md +0 -0
  211. {evalvault-1.64.0 → evalvault-1.66.0}/docs/getting-started/INSTALLATION.md +0 -0
  212. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/AGENTS_SYSTEM_GUIDE.md +0 -0
  213. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/CLI_MCP_PLAN.md +0 -0
  214. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/DEV_GUIDE.md +0 -0
  215. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/EVALVAULT_DIAGNOSTIC_PLAYBOOK.md +0 -0
  216. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/EXTERNAL_TRACE_API_SPEC.md +0 -0
  217. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/LENA_MVP_IMPLEMENTATION_PLAN.md +0 -0
  218. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/LENA_RAGAS_CALIBRATION_DEV_PLAN.md +0 -0
  219. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/OPEN_RAG_TRACE_INTERNAL_ADAPTER.md +0 -0
  220. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/OPEN_RAG_TRACE_SAMPLES.md +0 -0
  221. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/PRD_LENA.md +0 -0
  222. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/RELEASE_CHECKLIST.md +0 -0
  223. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/WEBUI_CLI_ROLLOUT_PLAN.md +0 -0
  224. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/prompt_suggestions_design.md +0 -0
  225. {evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/rag_human_feedback_calibration_implementation_plan.md +0 -0
  226. {evalvault-1.64.0 → evalvault-1.66.0}/docs/mapping/component-to-whitepaper.yaml +0 -0
  227. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/00_frontmatter.md +0 -0
  228. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/02_architecture.md +0 -0
  229. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/03_data_flow.md +0 -0
  230. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/04_components.md +0 -0
  231. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/05_expert_lenses.md +0 -0
  232. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/06_implementation.md +0 -0
  233. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/07_advanced.md +0 -0
  234. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/08_customization.md +0 -0
  235. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/09_quality.md +0 -0
  236. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/10_performance.md +0 -0
  237. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/11_security.md +0 -0
  238. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/12_operations.md +0 -0
  239. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/13_standards.md +0 -0
  240. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/14_roadmap.md +0 -0
  241. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/INDEX.md +0 -0
  242. {evalvault-1.64.0 → evalvault-1.66.0}/docs/new_whitepaper/STYLE_GUIDE.md +0 -0
  243. {evalvault-1.64.0 → evalvault-1.66.0}/docs/security_audit_worklog.md +0 -0
  244. {evalvault-1.64.0 → evalvault-1.66.0}/docs/stylesheets/extra.css +0 -0
  245. {evalvault-1.64.0 → evalvault-1.66.0}/docs/templates/dataset_template.csv +0 -0
  246. {evalvault-1.64.0 → evalvault-1.66.0}/docs/templates/dataset_template.json +0 -0
  247. {evalvault-1.64.0 → evalvault-1.66.0}/docs/templates/dataset_template.xlsx +0 -0
  248. {evalvault-1.64.0 → evalvault-1.66.0}/docs/templates/kg_template.json +0 -0
  249. {evalvault-1.64.0 → evalvault-1.66.0}/docs/templates/otel_openinference_trace_example.json +0 -0
  250. {evalvault-1.64.0 → evalvault-1.66.0}/docs/templates/ragas_dataset_example_ko90_en10.json +0 -0
  251. {evalvault-1.64.0 → evalvault-1.66.0}/docs/templates/retriever_docs_template.json +0 -0
  252. {evalvault-1.64.0 → evalvault-1.66.0}/docs/tools/generate-whitepaper.py +0 -0
  253. {evalvault-1.64.0 → evalvault-1.66.0}/docs/web_ui_analysis_migration_plan.md +0 -0
  254. {evalvault-1.64.0 → evalvault-1.66.0}/dummy_test_dataset.json +0 -0
  255. {evalvault-1.64.0 → evalvault-1.66.0}/examples/README.md +0 -0
  256. {evalvault-1.64.0 → evalvault-1.66.0}/examples/benchmarks/README.md +0 -0
  257. {evalvault-1.64.0 → evalvault-1.66.0}/examples/benchmarks/korean_rag/faithfulness_test.json +0 -0
  258. {evalvault-1.64.0 → evalvault-1.66.0}/examples/benchmarks/korean_rag/insurance_qa_100.json +0 -0
  259. {evalvault-1.64.0 → evalvault-1.66.0}/examples/benchmarks/korean_rag/keyword_extraction_test.json +0 -0
  260. {evalvault-1.64.0 → evalvault-1.66.0}/examples/benchmarks/korean_rag/retrieval_test.json +0 -0
  261. {evalvault-1.64.0 → evalvault-1.66.0}/examples/benchmarks/output/comparison.json +0 -0
  262. {evalvault-1.64.0 → evalvault-1.66.0}/examples/benchmarks/output/full_results.json +0 -0
  263. {evalvault-1.64.0 → evalvault-1.66.0}/examples/benchmarks/output/leaderboard.json +0 -0
  264. {evalvault-1.64.0 → evalvault-1.66.0}/examples/benchmarks/output/results_mteb.json +0 -0
  265. {evalvault-1.64.0 → evalvault-1.66.0}/examples/benchmarks/output/retrieval_result.json +0 -0
  266. {evalvault-1.64.0 → evalvault-1.66.0}/examples/benchmarks/run_korean_benchmark.py +0 -0
  267. {evalvault-1.64.0 → evalvault-1.66.0}/examples/kg_generator_demo.py +0 -0
  268. {evalvault-1.64.0 → evalvault-1.66.0}/examples/method_plugin_template/README.md +0 -0
  269. {evalvault-1.64.0 → evalvault-1.66.0}/examples/method_plugin_template/pyproject.toml +0 -0
  270. {evalvault-1.64.0 → evalvault-1.66.0}/examples/method_plugin_template/src/method_plugin_template/__init__.py +0 -0
  271. {evalvault-1.64.0 → evalvault-1.66.0}/examples/method_plugin_template/src/method_plugin_template/methods.py +0 -0
  272. {evalvault-1.64.0 → evalvault-1.66.0}/examples/stage_events.jsonl +0 -0
  273. {evalvault-1.64.0 → evalvault-1.66.0}/examples/usecase/comprehensive_workflow_test.py +0 -0
  274. {evalvault-1.64.0 → evalvault-1.66.0}/examples/usecase/insurance_eval_dataset.json +0 -0
  275. {evalvault-1.64.0 → evalvault-1.66.0}/examples/usecase/output/comprehensive_report.html +0 -0
  276. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/.env.example +0 -0
  277. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/.gitignore +0 -0
  278. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/README.md +0 -0
  279. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/e2e/analysis-compare.spec.ts +0 -0
  280. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/e2e/analysis-lab.spec.ts +0 -0
  281. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/e2e/compare-runs.spec.ts +0 -0
  282. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/e2e/dashboard.spec.ts +0 -0
  283. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/e2e/domain-memory.spec.ts +0 -0
  284. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/e2e/evaluation-studio.spec.ts +0 -0
  285. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/e2e/knowledge-base.spec.ts +0 -0
  286. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/e2e/mocks/intents.json +0 -0
  287. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/e2e/mocks/run_details.json +0 -0
  288. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/e2e/mocks/runs.json +0 -0
  289. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/e2e/run-details.spec.ts +0 -0
  290. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/eslint.config.js +0 -0
  291. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/index.html +0 -0
  292. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/package-lock.json +0 -0
  293. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/package.json +0 -0
  294. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/playwright.config.ts +0 -0
  295. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/public/vite.svg +0 -0
  296. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/App.css +0 -0
  297. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/assets/react.svg +0 -0
  298. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/components/AnalysisNodeOutputs.tsx +0 -0
  299. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/components/MarkdownContent.tsx +0 -0
  300. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/components/PrioritySummaryPanel.tsx +0 -0
  301. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/components/SpaceLegend.tsx +0 -0
  302. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/components/SpacePlot2D.tsx +0 -0
  303. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/components/SpacePlot3D.tsx +0 -0
  304. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/components/StatusBadge.tsx +0 -0
  305. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/components/VirtualizedText.tsx +0 -0
  306. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/config.ts +0 -0
  307. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/hooks/useInsightSpace.ts +0 -0
  308. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/index.css +0 -0
  309. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/main.tsx +0 -0
  310. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/AnalysisCompareView.tsx +0 -0
  311. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/AnalysisResultView.tsx +0 -0
  312. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/ComprehensiveAnalysis.tsx +0 -0
  313. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/CustomerReport.tsx +0 -0
  314. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/Dashboard.tsx +0 -0
  315. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/DomainMemory.tsx +0 -0
  316. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/KnowledgeBase.tsx +0 -0
  317. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/Settings.tsx +0 -0
  318. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/pages/VisualizationHome.tsx +0 -0
  319. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/services/api.ts +0 -0
  320. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/types/plotly.d.ts +0 -0
  321. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/utils/format.ts +0 -0
  322. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/utils/phoenix.ts +0 -0
  323. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/utils/runAnalytics.ts +0 -0
  324. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/utils/score.ts +0 -0
  325. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/src/utils/summaryMetrics.ts +0 -0
  326. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/tailwind.config.js +0 -0
  327. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/tsconfig.app.json +0 -0
  328. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/tsconfig.json +0 -0
  329. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/tsconfig.node.json +0 -0
  330. {evalvault-1.64.0 → evalvault-1.66.0}/frontend/vite.config.ts +0 -0
  331. {evalvault-1.64.0 → evalvault-1.66.0}/mkdocs.yml +0 -0
  332. {evalvault-1.64.0 → evalvault-1.66.0}/package-lock.json +0 -0
  333. {evalvault-1.64.0 → evalvault-1.66.0}/prompts/system_override.txt +0 -0
  334. {evalvault-1.64.0 → evalvault-1.66.0}/reports/.gitkeep +0 -0
  335. {evalvault-1.64.0 → evalvault-1.66.0}/reports/README.md +0 -0
  336. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_0aa9fab0_f1287e90/final_output.json +0 -0
  337. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_0aa9fab0_f1287e90/index.json +0 -0
  338. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_0aa9fab0_f1287e90/load_runs.json +0 -0
  339. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_0aa9fab0_f1287e90/report.json +0 -0
  340. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_0aa9fab0_f1287e90/run_change_detection.json +0 -0
  341. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_0aa9fab0_f1287e90/run_metric_comparison.json +0 -0
  342. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_8f825b22_4516d358/final_output.json +0 -0
  343. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_8f825b22_4516d358/index.json +0 -0
  344. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_8f825b22_4516d358/load_runs.json +0 -0
  345. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_8f825b22_4516d358/report.json +0 -0
  346. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_8f825b22_4516d358/run_change_detection.json +0 -0
  347. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_8f825b22_4516d358/run_metric_comparison.json +0 -0
  348. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_f1287e90_8f825b22/final_output.json +0 -0
  349. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_f1287e90_8f825b22/index.json +0 -0
  350. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_f1287e90_8f825b22/load_runs.json +0 -0
  351. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_f1287e90_8f825b22/report.json +0 -0
  352. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_f1287e90_8f825b22/run_change_detection.json +0 -0
  353. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/artifacts/comparison_f1287e90_8f825b22/run_metric_comparison.json +0 -0
  354. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/comparison_0aa9fab0_9fbf4776.json +0 -0
  355. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/comparison_0aa9fab0_9fbf4776.md +0 -0
  356. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/comparison_0aa9fab0_f1287e90.json +0 -0
  357. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/comparison_0aa9fab0_f1287e90.md +0 -0
  358. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/comparison_8f825b22_4516d358.json +0 -0
  359. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/comparison_8f825b22_4516d358.md +0 -0
  360. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/comparison_9fbf4776_a491fa0e.json +0 -0
  361. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/comparison_9fbf4776_a491fa0e.md +0 -0
  362. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/comparison_f1287e90_8f825b22.json +0 -0
  363. {evalvault-1.64.0 → evalvault-1.66.0}/reports/comparison/comparison_f1287e90_8f825b22.md +0 -0
  364. {evalvault-1.64.0 → evalvault-1.66.0}/reports/debug_report_r1_smoke.md +0 -0
  365. {evalvault-1.64.0 → evalvault-1.66.0}/reports/debug_report_r2_graphrag.md +0 -0
  366. {evalvault-1.64.0 → evalvault-1.66.0}/reports/debug_report_r2_graphrag_openai.md +0 -0
  367. {evalvault-1.64.0 → evalvault-1.66.0}/reports/debug_report_r3_bm25.md +0 -0
  368. {evalvault-1.64.0 → evalvault-1.66.0}/reports/debug_report_r3_bm25_langfuse3.md +0 -0
  369. {evalvault-1.64.0 → evalvault-1.66.0}/reports/debug_report_r3_dense_faiss.md +0 -0
  370. {evalvault-1.64.0 → evalvault-1.66.0}/reports/improvement_1d91a667-4288-4742-be3a-a8f5310c5140.md +0 -0
  371. {evalvault-1.64.0 → evalvault-1.66.0}/reports/r2_graphrag_openai_stage_events.jsonl +0 -0
  372. {evalvault-1.64.0 → evalvault-1.66.0}/reports/r2_graphrag_openai_stage_report.txt +0 -0
  373. {evalvault-1.64.0 → evalvault-1.66.0}/reports/r2_graphrag_stage_events.jsonl +0 -0
  374. {evalvault-1.64.0 → evalvault-1.66.0}/reports/r2_graphrag_stage_report.txt +0 -0
  375. {evalvault-1.64.0 → evalvault-1.66.0}/reports/r3_bm25_langfuse2_stage_events.jsonl +0 -0
  376. {evalvault-1.64.0 → evalvault-1.66.0}/reports/r3_bm25_langfuse3_stage_events.jsonl +0 -0
  377. {evalvault-1.64.0 → evalvault-1.66.0}/reports/r3_bm25_langfuse_stage_events.jsonl +0 -0
  378. {evalvault-1.64.0 → evalvault-1.66.0}/reports/r3_bm25_phoenix_stage_events.jsonl +0 -0
  379. {evalvault-1.64.0 → evalvault-1.66.0}/reports/r3_bm25_stage_events.jsonl +0 -0
  380. {evalvault-1.64.0 → evalvault-1.66.0}/reports/r3_bm25_stage_report.txt +0 -0
  381. {evalvault-1.64.0 → evalvault-1.66.0}/reports/r3_dense_faiss_stage_events.jsonl +0 -0
  382. {evalvault-1.64.0 → evalvault-1.66.0}/reports/r3_dense_faiss_stage_report.txt +0 -0
  383. {evalvault-1.64.0 → evalvault-1.66.0}/reports/retrieval_benchmark_smoke_precision.csv +0 -0
  384. {evalvault-1.64.0 → evalvault-1.66.0}/reports/retrieval_benchmark_smoke_precision_graphrag.csv +0 -0
  385. {evalvault-1.64.0 → evalvault-1.66.0}/reports/retrieval_benchmark_smoke_precision_multi.csv +0 -0
  386. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/benchmark/download_kmmlu.py +0 -0
  387. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/dev/open_rag_trace_demo.py +0 -0
  388. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/dev/open_rag_trace_integration_template.py +0 -0
  389. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/dev/otel-collector-config.yaml +0 -0
  390. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/dev/start_web_ui_with_phoenix.sh +0 -0
  391. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/dev/validate_open_rag_trace.py +0 -0
  392. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/dev_seed_pipeline_results.py +0 -0
  393. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/docs/__init__.py +0 -0
  394. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/docs/analyzer/__init__.py +0 -0
  395. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/docs/analyzer/ast_scanner.py +0 -0
  396. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/docs/analyzer/confidence_scorer.py +0 -0
  397. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/docs/analyzer/graph_builder.py +0 -0
  398. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/docs/analyzer/side_effect_detector.py +0 -0
  399. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/docs/generate_api_docs.py +0 -0
  400. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/docs/models/__init__.py +0 -0
  401. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/docs/models/schema.py +0 -0
  402. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/docs/renderer/__init__.py +0 -0
  403. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/docs/renderer/html_generator.py +0 -0
  404. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/ops/phoenix_watch.py +0 -0
  405. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/perf/backfill_langfuse_trace_url.py +0 -0
  406. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/perf/r3_dense_smoke.py +0 -0
  407. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/perf/r3_evalvault_run_dataset.json +0 -0
  408. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/perf/r3_retriever_docs.json +0 -0
  409. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/perf/r3_smoke_real.jsonl +0 -0
  410. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/perf/r3_stage_events_sample.jsonl +0 -0
  411. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/pipeline_template_inspect.py +0 -0
  412. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/reports/generate_release_notes.py +0 -0
  413. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/run_with_timeout.py +0 -0
  414. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/test_full_evaluation.py +0 -0
  415. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/tests/run_regressions.py +0 -0
  416. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/tests/run_retriever_stage_report_smoke.sh +0 -0
  417. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/validate_tutorials.py +0 -0
  418. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/verify_ragas_compliance.py +0 -0
  419. {evalvault-1.64.0 → evalvault-1.66.0}/scripts/verify_workflows.py +0 -0
  420. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/__init__.py +0 -0
  421. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/__init__.py +0 -0
  422. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/__init__.py +0 -0
  423. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/api/__init__.py +0 -0
  424. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/api/routers/__init__.py +0 -0
  425. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/api/routers/benchmark.py +0 -0
  426. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/api/routers/config.py +0 -0
  427. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/api/routers/domain.py +0 -0
  428. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/api/routers/knowledge.py +0 -0
  429. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/api/routers/pipeline.py +0 -0
  430. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/api/routers/runs.py +0 -0
  431. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/__init__.py +0 -0
  432. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/app.py +0 -0
  433. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/agent.py +0 -0
  434. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/analyze.py +0 -0
  435. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/api.py +0 -0
  436. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/benchmark.py +0 -0
  437. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/calibrate.py +0 -0
  438. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/config.py +0 -0
  439. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/debug.py +0 -0
  440. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/domain.py +0 -0
  441. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/experiment.py +0 -0
  442. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/gate.py +0 -0
  443. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/generate.py +0 -0
  444. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/init.py +0 -0
  445. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/kg.py +0 -0
  446. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/langfuse.py +0 -0
  447. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/method.py +0 -0
  448. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/phoenix.py +0 -0
  449. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/pipeline.py +0 -0
  450. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/prompts.py +0 -0
  451. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/commands/stage.py +0 -0
  452. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/utils/__init__.py +0 -0
  453. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/utils/analysis_io.py +0 -0
  454. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/utils/console.py +0 -0
  455. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/utils/errors.py +0 -0
  456. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/utils/formatters.py +0 -0
  457. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/utils/options.py +0 -0
  458. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/utils/presets.py +0 -0
  459. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/utils/progress.py +0 -0
  460. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/cli/utils/validators.py +0 -0
  461. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/mcp/__init__.py +0 -0
  462. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/mcp/schemas.py +0 -0
  463. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/inbound/mcp/tools.py +0 -0
  464. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/__init__.py +0 -0
  465. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/__init__.py +0 -0
  466. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/analysis_report_module.py +0 -0
  467. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/base_module.py +0 -0
  468. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/bm25_searcher_module.py +0 -0
  469. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/causal_adapter.py +0 -0
  470. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/causal_analyzer_module.py +0 -0
  471. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/common.py +0 -0
  472. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/comparison_report_module.py +0 -0
  473. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/data_loader_module.py +0 -0
  474. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/detailed_report_module.py +0 -0
  475. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/diagnostic_playbook_module.py +0 -0
  476. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/embedding_analyzer_module.py +0 -0
  477. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/embedding_distribution_module.py +0 -0
  478. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/embedding_searcher_module.py +0 -0
  479. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/hybrid_rrf_module.py +0 -0
  480. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/hybrid_weighted_module.py +0 -0
  481. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/hypothesis_generator_module.py +0 -0
  482. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/llm_report_module.py +0 -0
  483. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/low_performer_extractor_module.py +0 -0
  484. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/model_analyzer_module.py +0 -0
  485. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/morpheme_analyzer_module.py +0 -0
  486. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/morpheme_quality_checker_module.py +0 -0
  487. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/network_analyzer_module.py +0 -0
  488. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/nlp_adapter.py +0 -0
  489. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/nlp_analyzer_module.py +0 -0
  490. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/pattern_detector_module.py +0 -0
  491. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/pipeline_factory.py +0 -0
  492. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/pipeline_helpers.py +0 -0
  493. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/priority_summary_module.py +0 -0
  494. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/ragas_evaluator_module.py +0 -0
  495. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/retrieval_analyzer_module.py +0 -0
  496. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/retrieval_benchmark_module.py +0 -0
  497. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/retrieval_quality_checker_module.py +0 -0
  498. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/root_cause_analyzer_module.py +0 -0
  499. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/run_analyzer_module.py +0 -0
  500. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/run_change_detector_module.py +0 -0
  501. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/run_comparator_module.py +0 -0
  502. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/run_loader_module.py +0 -0
  503. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/run_metric_comparator_module.py +0 -0
  504. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/search_comparator_module.py +0 -0
  505. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/statistical_adapter.py +0 -0
  506. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/statistical_analyzer_module.py +0 -0
  507. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/statistical_comparator_module.py +0 -0
  508. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/summary_report_module.py +0 -0
  509. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/time_series_analyzer_module.py +0 -0
  510. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/timeseries_advanced_module.py +0 -0
  511. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/trend_detector_module.py +0 -0
  512. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/analysis/verification_report_module.py +0 -0
  513. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/benchmark/__init__.py +0 -0
  514. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/benchmark/lm_eval_adapter.py +0 -0
  515. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/cache/__init__.py +0 -0
  516. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/cache/hybrid_cache.py +0 -0
  517. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/cache/memory_cache.py +0 -0
  518. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/dataset/__init__.py +0 -0
  519. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/dataset/base.py +0 -0
  520. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/dataset/csv_loader.py +0 -0
  521. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/dataset/excel_loader.py +0 -0
  522. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/dataset/json_loader.py +0 -0
  523. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/dataset/loader_factory.py +0 -0
  524. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/dataset/method_input_loader.py +0 -0
  525. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/dataset/streaming_loader.py +0 -0
  526. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/dataset/templates.py +0 -0
  527. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/dataset/thresholds.py +0 -0
  528. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/debug/__init__.py +0 -0
  529. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/debug/report_renderer.py +0 -0
  530. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/documents/__init__.py +0 -0
  531. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/documents/ocr/__init__.py +0 -0
  532. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/documents/ocr/paddleocr_backend.py +0 -0
  533. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/documents/pdf_extractor.py +0 -0
  534. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/documents/versioned_loader.py +0 -0
  535. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/domain_memory/__init__.py +0 -0
  536. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/domain_memory/domain_memory_schema.sql +0 -0
  537. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/domain_memory/sqlite_adapter.py +0 -0
  538. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/improvement/__init__.py +0 -0
  539. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/improvement/insight_generator.py +0 -0
  540. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/improvement/pattern_detector.py +0 -0
  541. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/improvement/playbook_loader.py +0 -0
  542. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/improvement/stage_metric_playbook_loader.py +0 -0
  543. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/kg/__init__.py +0 -0
  544. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/kg/graph_rag_retriever.py +0 -0
  545. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/kg/networkx_adapter.py +0 -0
  546. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/kg/parallel_kg_builder.py +0 -0
  547. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/kg/query_strategies.py +0 -0
  548. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/llm/__init__.py +0 -0
  549. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/llm/anthropic_adapter.py +0 -0
  550. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/llm/azure_adapter.py +0 -0
  551. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/llm/base.py +0 -0
  552. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/llm/factory.py +0 -0
  553. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/llm/instructor_factory.py +0 -0
  554. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/llm/llm_relation_augmenter.py +0 -0
  555. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/llm/ollama_adapter.py +0 -0
  556. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/llm/openai_adapter.py +0 -0
  557. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/llm/token_aware_chat.py +0 -0
  558. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/llm/vllm_adapter.py +0 -0
  559. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/methods/__init__.py +0 -0
  560. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/methods/baseline_oracle.py +0 -0
  561. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/methods/external_command.py +0 -0
  562. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/methods/registry.py +0 -0
  563. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/nlp/__init__.py +0 -0
  564. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/nlp/korean/__init__.py +0 -0
  565. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/nlp/korean/bm25_retriever.py +0 -0
  566. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/nlp/korean/dense_retriever.py +0 -0
  567. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/nlp/korean/document_chunker.py +0 -0
  568. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/nlp/korean/hybrid_retriever.py +0 -0
  569. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/nlp/korean/kiwi_tokenizer.py +0 -0
  570. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/nlp/korean/korean_evaluation.py +0 -0
  571. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/nlp/korean/korean_stopwords.py +0 -0
  572. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/nlp/korean/toolkit.py +0 -0
  573. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/nlp/korean/toolkit_factory.py +0 -0
  574. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/phoenix/sync_service.py +0 -0
  575. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/report/__init__.py +0 -0
  576. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/report/dashboard_generator.py +0 -0
  577. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/report/markdown_adapter.py +0 -0
  578. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/storage/__init__.py +0 -0
  579. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/storage/benchmark_storage_adapter.py +0 -0
  580. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/storage/postgres_adapter.py +0 -0
  581. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/storage/postgres_schema.sql +0 -0
  582. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/storage/schema.sql +0 -0
  583. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/storage/sqlite_adapter.py +0 -0
  584. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/tracer/__init__.py +0 -0
  585. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/tracer/open_rag_log_handler.py +0 -0
  586. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/tracer/open_rag_trace_adapter.py +0 -0
  587. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/tracer/open_rag_trace_decorators.py +0 -0
  588. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/tracer/open_rag_trace_helpers.py +0 -0
  589. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/tracer/phoenix_tracer_adapter.py +0 -0
  590. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/tracker/__init__.py +0 -0
  591. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/adapters/outbound/tracker/log_sanitizer.py +0 -0
  592. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/config/__init__.py +0 -0
  593. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/config/agent_types.py +0 -0
  594. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/config/domain_config.py +0 -0
  595. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/config/instrumentation.py +0 -0
  596. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/config/langfuse_support.py +0 -0
  597. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/config/model_config.py +0 -0
  598. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/config/phoenix_support.py +0 -0
  599. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/config/playbooks/improvement_playbook.yaml +0 -0
  600. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/config/secret_manager.py +0 -0
  601. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/debug_ragas.py +0 -0
  602. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/debug_ragas_real.py +0 -0
  603. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/__init__.py +0 -0
  604. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/analysis.py +0 -0
  605. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/analysis_pipeline.py +0 -0
  606. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/benchmark.py +0 -0
  607. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/benchmark_run.py +0 -0
  608. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/dataset.py +0 -0
  609. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/debug.py +0 -0
  610. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/experiment.py +0 -0
  611. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/feedback.py +0 -0
  612. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/improvement.py +0 -0
  613. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/kg.py +0 -0
  614. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/memory.py +0 -0
  615. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/method.py +0 -0
  616. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/prompt_suggestion.py +0 -0
  617. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/rag_trace.py +0 -0
  618. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/entities/result.py +0 -0
  619. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/metrics/analysis_registry.py +0 -0
  620. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/metrics/confidence.py +0 -0
  621. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/metrics/contextual_relevancy.py +0 -0
  622. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/metrics/entity_preservation.py +0 -0
  623. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/metrics/insurance.py +0 -0
  624. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/metrics/no_answer.py +0 -0
  625. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/metrics/retrieval_rank.py +0 -0
  626. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/metrics/terms_dictionary.json +0 -0
  627. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/metrics/text_match.py +0 -0
  628. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/__init__.py +0 -0
  629. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/analysis_service.py +0 -0
  630. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/async_batch_executor.py +0 -0
  631. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/batch_executor.py +0 -0
  632. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/benchmark_report_service.py +0 -0
  633. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/benchmark_service.py +0 -0
  634. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/cache_metrics.py +0 -0
  635. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/cluster_map_builder.py +0 -0
  636. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/debug_report_service.py +0 -0
  637. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/document_chunker.py +0 -0
  638. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/document_versioning.py +0 -0
  639. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/domain_learning_hook.py +0 -0
  640. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/embedding_overlay.py +0 -0
  641. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/entity_extractor.py +0 -0
  642. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/experiment_comparator.py +0 -0
  643. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/experiment_manager.py +0 -0
  644. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/experiment_reporter.py +0 -0
  645. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/experiment_repository.py +0 -0
  646. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/experiment_statistics.py +0 -0
  647. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/holdout_splitter.py +0 -0
  648. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/improvement_guide_service.py +0 -0
  649. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/intent_classifier.py +0 -0
  650. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/kg_generator.py +0 -0
  651. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/memory_aware_evaluator.py +0 -0
  652. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/memory_based_analysis.py +0 -0
  653. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/method_runner.py +0 -0
  654. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/pipeline_orchestrator.py +0 -0
  655. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/pipeline_template_registry.py +0 -0
  656. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/prompt_candidate_service.py +0 -0
  657. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/prompt_manifest.py +0 -0
  658. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/prompt_scoring_service.py +0 -0
  659. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/prompt_status.py +0 -0
  660. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/prompt_suggestion_reporter.py +0 -0
  661. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/ragas_prompt_overrides.py +0 -0
  662. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/retrieval_metrics.py +0 -0
  663. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/retriever_context.py +0 -0
  664. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/satisfaction_calibration_service.py +0 -0
  665. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/stage_metric_guide_service.py +0 -0
  666. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/stage_summary_service.py +0 -0
  667. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/synthetic_qa_generator.py +0 -0
  668. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/testset_generator.py +0 -0
  669. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/domain/services/unified_report_service.py +0 -0
  670. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/mkdocs_helpers.py +0 -0
  671. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/__init__.py +0 -0
  672. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/inbound/__init__.py +0 -0
  673. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/inbound/analysis_pipeline_port.py +0 -0
  674. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/inbound/evaluator_port.py +0 -0
  675. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/inbound/learning_hook_port.py +0 -0
  676. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/inbound/web_port.py +0 -0
  677. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/analysis_cache_port.py +0 -0
  678. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/analysis_module_port.py +0 -0
  679. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/analysis_port.py +0 -0
  680. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/benchmark_port.py +0 -0
  681. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/causal_analysis_port.py +0 -0
  682. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/dataset_port.py +0 -0
  683. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/domain_memory_port.py +0 -0
  684. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/embedding_port.py +0 -0
  685. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/improvement_port.py +0 -0
  686. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/intent_classifier_port.py +0 -0
  687. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/korean_nlp_port.py +0 -0
  688. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/llm_factory_port.py +0 -0
  689. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/llm_port.py +0 -0
  690. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/method_port.py +0 -0
  691. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/nlp_analysis_port.py +0 -0
  692. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/relation_augmenter_port.py +0 -0
  693. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/report_port.py +0 -0
  694. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/stage_storage_port.py +0 -0
  695. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/storage_port.py +0 -0
  696. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/tracer_port.py +0 -0
  697. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/ports/outbound/tracker_port.py +0 -0
  698. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/reports/__init__.py +0 -0
  699. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/reports/release_notes.py +0 -0
  700. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/scripts/__init__.py +0 -0
  701. {evalvault-1.64.0 → evalvault-1.66.0}/src/evalvault/scripts/regression_runner.py +0 -0
  702. {evalvault-1.64.0 → evalvault-1.66.0}/tests/__init__.py +0 -0
  703. {evalvault-1.64.0 → evalvault-1.66.0}/tests/conftest.py +0 -0
  704. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/README.md +0 -0
  705. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/benchmark/retrieval_ground_truth_min.json +0 -0
  706. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/benchmark/retrieval_ground_truth_multi.json +0 -0
  707. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/auto_insurance_qa_korean_full.json +0 -0
  708. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/comprehensive_dataset.json +0 -0
  709. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/edge_cases.json +0 -0
  710. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/edge_cases.xlsx +0 -0
  711. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/evaluation_test_sample.json +0 -0
  712. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/graphrag_retriever_docs.json +0 -0
  713. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/graphrag_smoke.json +0 -0
  714. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/insurance_document.txt +0 -0
  715. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/insurance_qa_english.csv +0 -0
  716. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/insurance_qa_english.json +0 -0
  717. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/insurance_qa_english.xlsx +0 -0
  718. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/insurance_qa_korean.csv +0 -0
  719. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/insurance_qa_korean.json +0 -0
  720. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/insurance_qa_korean.xlsx +0 -0
  721. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/insurance_qa_korean_versioned_pdf.json +0 -0
  722. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/run_mode_full_domain_memory.json +0 -0
  723. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/e2e/run_mode_simple.json +0 -0
  724. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/kg/minimal_graph.json +0 -0
  725. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/sample_dataset.csv +0 -0
  726. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/sample_dataset.json +0 -0
  727. {evalvault-1.64.0 → evalvault-1.66.0}/tests/fixtures/sample_dataset.xlsx +0 -0
  728. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/__init__.py +0 -0
  729. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/benchmark/test_benchmark_service_integration.py +0 -0
  730. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/conftest.py +0 -0
  731. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/test_cli_integration.py +0 -0
  732. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/test_data_flow.py +0 -0
  733. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/test_e2e_scenarios.py +0 -0
  734. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/test_evaluation_flow.py +0 -0
  735. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/test_full_workflow.py +0 -0
  736. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/test_langfuse_flow.py +0 -0
  737. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/test_phoenix_flow.py +0 -0
  738. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/test_pipeline_api_contracts.py +0 -0
  739. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/test_storage_flow.py +0 -0
  740. {evalvault-1.64.0 → evalvault-1.66.0}/tests/integration/test_summary_eval_fixture.py +0 -0
  741. {evalvault-1.64.0 → evalvault-1.66.0}/tests/optional_deps.py +0 -0
  742. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/__init__.py +0 -0
  743. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/adapters/inbound/mcp/test_execute_tools.py +0 -0
  744. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/adapters/inbound/mcp/test_read_tools.py +0 -0
  745. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/adapters/outbound/documents/test_pdf_extractor.py +0 -0
  746. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/adapters/outbound/documents/test_versioned_loader.py +0 -0
  747. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/adapters/outbound/improvement/__init__.py +0 -0
  748. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/adapters/outbound/improvement/test_insight_generator.py +0 -0
  749. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/adapters/outbound/improvement/test_pattern_detector.py +0 -0
  750. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/adapters/outbound/improvement/test_playbook_loader.py +0 -0
  751. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/adapters/outbound/improvement/test_stage_metric_playbook_loader.py +0 -0
  752. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/adapters/outbound/kg/test_graph_rag_retriever.py +0 -0
  753. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/adapters/outbound/kg/test_parallel_kg_builder.py +0 -0
  754. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/adapters/outbound/storage/test_benchmark_storage_adapter.py +0 -0
  755. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/config/test_phoenix_support.py +0 -0
  756. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/conftest.py +0 -0
  757. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/metrics/test_analysis_metric_registry.py +0 -0
  758. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/metrics/test_confidence.py +0 -0
  759. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/metrics/test_contextual_relevancy.py +0 -0
  760. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/metrics/test_entity_preservation.py +0 -0
  761. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/metrics/test_metric_registry.py +0 -0
  762. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/metrics/test_no_answer.py +0 -0
  763. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/metrics/test_retrieval_rank.py +0 -0
  764. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/metrics/test_text_match.py +0 -0
  765. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/services/test_cache_metrics.py +0 -0
  766. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/services/test_claim_level.py +0 -0
  767. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/services/test_document_versioning.py +0 -0
  768. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/services/test_holdout_splitter.py +0 -0
  769. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/services/test_improvement_guide_service.py +0 -0
  770. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/services/test_retrieval_metrics.py +0 -0
  771. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/services/test_retriever_context.py +0 -0
  772. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/services/test_stage_event_builder.py +0 -0
  773. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/services/test_stage_metric_guide_service.py +0 -0
  774. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/services/test_synthetic_qa_generator.py +0 -0
  775. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/test_embedding_overlay.py +0 -0
  776. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/test_prompt_manifest.py +0 -0
  777. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/domain/test_prompt_status.py +0 -0
  778. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/reports/test_release_notes.py +0 -0
  779. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/scripts/test_regression_runner.py +0 -0
  780. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_agent_types.py +0 -0
  781. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_analysis_entities.py +0 -0
  782. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_analysis_modules.py +0 -0
  783. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_analysis_pipeline.py +0 -0
  784. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_analysis_service.py +0 -0
  785. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_anthropic_adapter.py +0 -0
  786. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_async_batch_executor.py +0 -0
  787. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_azure_adapter.py +0 -0
  788. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_benchmark_helpers.py +0 -0
  789. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_benchmark_runner.py +0 -0
  790. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_causal_adapter.py +0 -0
  791. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_cli_domain.py +0 -0
  792. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_cli_init.py +0 -0
  793. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_cli_progress.py +0 -0
  794. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_cli_utils.py +0 -0
  795. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_data_loaders.py +0 -0
  796. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_domain_config.py +0 -0
  797. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_domain_memory.py +0 -0
  798. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_entities.py +0 -0
  799. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_entities_kg.py +0 -0
  800. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_entity_extractor.py +0 -0
  801. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_evaluator.py +0 -0
  802. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_experiment.py +0 -0
  803. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_hybrid_cache.py +0 -0
  804. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_instrumentation.py +0 -0
  805. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_insurance_metric.py +0 -0
  806. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_intent_classifier.py +0 -0
  807. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_kg_generator.py +0 -0
  808. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_kg_networkx.py +0 -0
  809. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_kiwi_tokenizer.py +0 -0
  810. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_kiwi_warning_suppression.py +0 -0
  811. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_korean_dense.py +0 -0
  812. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_korean_evaluation.py +0 -0
  813. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_korean_retrieval.py +0 -0
  814. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_langfuse_tracker.py +0 -0
  815. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_llm_relation_augmenter.py +0 -0
  816. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_lm_eval_adapter.py +0 -0
  817. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_markdown_report.py +0 -0
  818. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_memory_cache.py +0 -0
  819. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_memory_services.py +0 -0
  820. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_method_plugins.py +0 -0
  821. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_mlflow_tracker.py +0 -0
  822. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_model_config.py +0 -0
  823. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_nlp_adapter.py +0 -0
  824. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_nlp_entities.py +0 -0
  825. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_ollama_adapter.py +0 -0
  826. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_openai_adapter.py +0 -0
  827. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_phoenix_adapter.py +0 -0
  828. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_pipeline_orchestrator.py +0 -0
  829. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_ports.py +0 -0
  830. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_postgres_storage.py +0 -0
  831. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_prompt_candidate_service.py +0 -0
  832. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_rag_trace_entities.py +0 -0
  833. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_run_memory_helpers.py +0 -0
  834. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_run_mode_fixtures.py +0 -0
  835. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_settings.py +0 -0
  836. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_sqlite_storage.py +0 -0
  837. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_stage_cli.py +0 -0
  838. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_stage_storage.py +0 -0
  839. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_stage_summary_service.py +0 -0
  840. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_statistical_adapter.py +0 -0
  841. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_streaming_loader.py +0 -0
  842. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_summary_eval_fixture.py +0 -0
  843. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_testset_generator.py +0 -0
  844. {evalvault-1.64.0 → evalvault-1.66.0}/tests/unit/test_web_adapter.py +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: evalvault
3
- Version: 1.64.0
3
+ Version: 1.66.0
4
4
  Summary: RAG evaluation system using Ragas with Phoenix/Langfuse tracing
5
5
  Project-URL: Homepage, https://github.com/ntts9990/EvalVault
6
6
  Project-URL: Documentation, https://github.com/ntts9990/EvalVault#readme
@@ -25,6 +25,7 @@ Classifier: Topic :: Software Development :: Quality Assurance
25
25
  Classifier: Topic :: Software Development :: Testing
26
26
  Classifier: Typing :: Typed
27
27
  Requires-Python: >=3.12
28
+ Requires-Dist: chainlit>=2.9.5
28
29
  Requires-Dist: chardet
29
30
  Requires-Dist: fastapi>=0.128.0
30
31
  Requires-Dist: instructor
@@ -137,12 +138,17 @@ English version? See `README.en.md`.
137
138
  ## Quick Links
138
139
 
139
140
  - Documentation hub: `docs/INDEX.md`
141
+ - CLI run scenario guide: `docs/guides/RAG_CLI_WORKFLOW_TEMPLATES.md`
140
142
  - User guide: `docs/guides/USER_GUIDE.md`
141
143
  - Developer guide: `docs/guides/DEV_GUIDE.md`
142
144
  - Status/roadmap: `docs/STATUS.md`, `docs/ROADMAP.md`
143
145
  - Development whitepaper (design/operations/quality standards): `docs/new_whitepaper/INDEX.md`
144
146
  - Open RAG Trace: `docs/architecture/open-rag-trace-spec.md`
145
147
 
148
+ ### Notes on upcoming improvement work
149
+ - Insurance summary metrics expansion plan: `docs/guides/INSURANCE_SUMMARY_METRICS_PLAN.md`
150
+ - Plan for repeated prompt application: `docs/guides/repeat_query.md`
151
+
146
152
  ---
147
153
 
148
154
  ## Problems EvalVault solves
@@ -470,6 +476,24 @@ npm run dev
470
476
  - Ragas family: `faithfulness`, `answer_relevancy`, `context_precision`, `context_recall`, `factual_correctness`, `semantic_similarity`
471
477
  - Custom example (domain): `insurance_term_accuracy`
472
478
 
479
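+ ### Design rationale for summary metrics (summary_score, summary_faithfulness, entity_preservation)
480
+
481
+ ### Custom metric snapshot (recording evaluation method/process/results)
482
+ - Records the evaluation method, inputs/outputs, rules, and implementation file hash in `run.tracker_metadata.custom_metric_snapshot`.
483
+ - Also saved to the Excel `CustomMetrics` sheet and to Langfuse/Phoenix/MLflow artifacts.
484
+
485
+ - `summary_faithfulness`: Evaluates whether every claim in the summary is grounded in the context. Directly measures hallucination/distortion risk.
486
+ - `summary_score`: Evaluates the balance between preserving key information and conciseness relative to the context. Reduces the bias of relying on a single reference summary.
487
+ - `entity_preservation`: Measures whether entities that matter in insurance policy terms (amounts, periods, conditions, exclusions) are retained in the summary.
488
+
489
+ **Rationale for insurance-domain specialization**
490
+ - Critical elements of insurance policy terms (exclusions, deductibles, limits, conditions, etc.) are reflected directly as keywords, and the metrics are designed to preserve key entities such as amounts, periods, and ratios.
491
+ - Because general-purpose rules (numbers/periods/amounts) are combined with insurance-specific keywords, the current state is most accurately described as "weak domain specialization centered on insurance risk."
492
+
493
+ **Interpretation caveats**
494
+ - All three metrics depend heavily on the quality of `contexts`. Inaccurate or excessive context can lower the scores.
495
+ - `summary_score` is keyphrase-based, so differently worded summaries may score lower.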
496
+
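As a rough illustration of the entity-preservation idea in the README text above (not EvalVault's actual implementation, which this diff does not show), the sketch below extracts numeric entities and a few insurance keywords from the context and reports the fraction that also appears in the summary. The function names, the regex, and the keyword list are all assumptions made for this example.

```python
import re

# Hypothetical keyword list; the real metric's keywords are not shown in this diff.
INSURANCE_KEYWORDS = ("exclusion", "deductible", "limit", "condition")

# Numbers with an optional unit suffix, e.g. "30 days", "5%", "1,000,000 KRW".
_NUMBER_RE = re.compile(
    r"\d[\d,]*(?:\.\d+)?\s*(?:%|days?|months?|years?|KRW|USD)?",
    re.IGNORECASE,
)


def extract_entities(text: str) -> set[str]:
    """Collect normalized numeric entities and insurance keywords found in text."""
    entities = set()
    for match in _NUMBER_RE.finditer(text):
        # Drop inner whitespace and any trailing comma so "30 days" == "30days".
        entities.add(re.sub(r"\s+", "", match.group(0)).lower().rstrip(","))
    lowered = text.lower()
    entities.update(k for k in INSURANCE_KEYWORDS if k in lowered)
    return entities


def entity_preservation(context: str, summary: str) -> float:
    """Fraction of context entities that also appear in the summary."""
    context_entities = extract_entities(context)
    if not context_entities:
        return 1.0  # nothing to preserve
    summary_entities = extract_entities(summary)
    return len(context_entities & summary_entities) / len(context_entities)
```

A score below 1.0 then points at entities (for example a waiting period) that the summary dropped, which matches the caveat that the metric depends on the quality of `contexts`.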
473
497
  Exact options and operational recipes are kept up to date in `docs/guides/USER_GUIDE.md`.
474
498
 
475
499
  ---
@@ -14,12 +14,17 @@ English version? See `README.en.md`.
14
14
  ## Quick Links
15
15
 
16
16
  - Documentation hub: `docs/INDEX.md`
17
+ - CLI run scenario guide: `docs/guides/RAG_CLI_WORKFLOW_TEMPLATES.md`
17
18
  - User guide: `docs/guides/USER_GUIDE.md`
18
19
  - Developer guide: `docs/guides/DEV_GUIDE.md`
19
20
  - Status/roadmap: `docs/STATUS.md`, `docs/ROADMAP.md`
20
21
  - Development whitepaper (design/operations/quality standards): `docs/new_whitepaper/INDEX.md`
21
22
  - Open RAG Trace: `docs/architecture/open-rag-trace-spec.md`
22
23
 
24
+ ### Notes on upcoming improvement work
25
+ - Insurance summary metrics expansion plan: `docs/guides/INSURANCE_SUMMARY_METRICS_PLAN.md`
26
+ - Plan for repeated prompt application: `docs/guides/repeat_query.md`
27
+
23
28
  ---
24
29
 
25
30
  ## Problems EvalVault solves
@@ -347,6 +352,24 @@ npm run dev
347
352
  - Ragas family: `faithfulness`, `answer_relevancy`, `context_precision`, `context_recall`, `factual_correctness`, `semantic_similarity`
348
353
  - Custom example (domain): `insurance_term_accuracy`
349
354
 
355
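+ ### Design rationale for summary metrics (summary_score, summary_faithfulness, entity_preservation)
356
+
357
+ ### Custom metric snapshot (recording evaluation method/process/results)
358
+ - Records the evaluation method, inputs/outputs, rules, and implementation file hash in `run.tracker_metadata.custom_metric_snapshot`.
359
+ - Also saved to the Excel `CustomMetrics` sheet and to Langfuse/Phoenix/MLflow artifacts.
360
+
361
+ - `summary_faithfulness`: Evaluates whether every claim in the summary is grounded in the context. Directly measures hallucination/distortion risk.
362
+ - `summary_score`: Evaluates the balance between preserving key information and conciseness relative to the context. Reduces the bias of relying on a single reference summary.
363
+ - `entity_preservation`: Measures whether entities that matter in insurance policy terms (amounts, periods, conditions, exclusions) are retained in the summary.
364
+
365
+ **Rationale for insurance-domain specialization**
366
+ - Critical elements of insurance policy terms (exclusions, deductibles, limits, conditions, etc.) are reflected directly as keywords, and the metrics are designed to preserve key entities such as amounts, periods, and ratios.
367
+ - Because general-purpose rules (numbers/periods/amounts) are combined with insurance-specific keywords, the current state is most accurately described as "weak domain specialization centered on insurance risk."
368
+
369
+ **Interpretation caveats**
370
+ - All three metrics depend heavily on the quality of `contexts`. Inaccurate or excessive context can lower the scores.
371
+ - `summary_score` is keyphrase-based, so differently worded summaries may score lower.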
372
+
350
373
  Exact options and operational recipes are kept up to date in `docs/guides/USER_GUIDE.md`.
351
374
 
352
375
  ---
@@ -0,0 +1,11 @@
1
+ faithfulness: |
2
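+   You are an evaluator. Based on the CONTEXT below, determine whether each STATEMENT can be
3
+   directly inferred.
4
+   - Output verdict only as the integer 1 or 0 (without quotes).
5
+   - 1: directly supported by the context, 0: not supported.
6
+   - Return JSON only.
7
+
8
+ answer_relevancy: |
9
+   You are an evaluator. Rate how relevant the answer is to the question on a scale of 0 to 1.
10
+   - The output must include a numeric score and a brief rationale.
11
+   - Assign a low score if much of the content is unrelated to the question.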
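One way a caller might consume verdicts produced under the `faithfulness` prompt above is sketched here. The JSON shape (a list of objects with a `verdict` field) is an assumption for illustration; the override file itself only requires integer verdicts of 1 or 0 returned as JSON, and this is not Ragas's own parsing code.

```python
import json


def faithfulness_from_verdicts(raw_json: str) -> float:
    """Return the share of statements judged as supported (verdict == 1).

    Assumes a hypothetical payload like:
    [{"statement": "...", "verdict": 1}, {"statement": "...", "verdict": 0}]
    """
    items = json.loads(raw_json)
    verdicts = []
    for item in items:
        verdict = item["verdict"]
        # The prompt asks for a bare integer 1 or 0; reject quoted or other values.
        if (
            isinstance(verdict, bool)
            or not isinstance(verdict, int)
            or verdict not in (0, 1)
        ):
            raise ValueError(f"verdict must be integer 0 or 1, got {verdict!r}")
        verdicts.append(verdict)
    if not verdicts:
        raise ValueError("no statements to score")
    return sum(verdicts) / len(verdicts)
```

Rejecting quoted verdicts early makes a prompt-format regression visible instead of silently skewing the score.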
@@ -13,13 +13,17 @@
13
13
  ## Quick links
14
14
 
15
15
  - Installation: `getting-started/INSTALLATION.md`
16
+ - CLI run scenario guide: `guides/RAG_CLI_WORKFLOW_TEMPLATES.md`
16
17
  - User guide (including operations): `guides/USER_GUIDE.md`
17
18
  - Development/contributing: `guides/DEV_GUIDE.md`
18
- - CLI→MCP porting plan: `guides/CLI_MCP_PLAN.md`
19
- - Web UI expansion design document: `guides/WEBUI_CLI_ROLLOUT_PLAN.md` (includes the phase 1 implementation file list)
20
- - RAGAS human feedback calibration: `guides/RAGAS_HUMAN_FEEDBACK_CALIBRATION_GUIDE.md`
21
19
  - Diagnostic playbook: `guides/EVALVAULT_DIAGNOSTIC_PLAYBOOK.md` (problem→analysis→interpretation→action flow)
20
+ - RAG performance improvement proposal: `guides/RAG_PERFORMANCE_IMPROVEMENT_PROPOSAL.md` (purpose/mission, KPIs, roadmap)
21
+ - RAGAS human feedback calibration: `guides/RAGAS_HUMAN_FEEDBACK_CALIBRATION_GUIDE.md`
22
22
  - Summary of run-result Excel sheets: `guides/EVALVAULT_RUN_EXCEL_SHEETS.md`
23
+ - Evaluation report templates: `templates/eval_report_templates.md`
24
+ - CLI→MCP porting plan: `guides/CLI_MCP_PLAN.md`
25
+ - Web UI expansion design document: `guides/WEBUI_CLI_ROLLOUT_PLAN.md`
26
+ - Documentation refresh work plan: `guides/DOCS_REFRESH_PLAN.md`
23
27
  - Release checklist: `guides/RELEASE_CHECKLIST.md`
24
28
  - Status summary: `STATUS.md`
25
29
  - Roadmap: `ROADMAP.md`
@@ -1,6 +1,6 @@
1
1
  # EvalVault Roadmap
2
2
 
3
- > Last Updated: 2026-01-11
3
+ > Last Updated: 2026-01-18
4
4
 
5
5
  This document briefly shares **"what we will do next, and why"** from an external (user/contributor) perspective.
6
6
 
@@ -19,10 +19,18 @@
19
19
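  ### P1 (Usability)
20
20
  - Improve the completeness of the core workflow (Evaluation → History → Reports) in the Web UI
21
21
  - Consistently surface the shared CLI/web DB and artifact path conventions in docs and UX
22
+ - Strengthen integration between the Run detail tabs (Staging/Prompts/Gate/Debug) and the Analysis Lab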
22
23
 
23
24
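  ### P2 (Observability/Standards)
24
25
  - Incrementally extend the Open RAG Trace spec/samples to match real operational needs (following the versioning policy)
25
26
  - Strengthen guidance on Collector configuration and data retention (artifact separation, PII masking)
27
+ - Standardize the minimal Stage Events schema and sync the documentation
28
+
29
+ ### P3 (Performance improvement roadmap)
30
+ - Establish KPIs, the evaluation protocol, and the roadmap based on the RAG performance improvement proposal
31
+ - Integrate retrieval/reranking/GraphRAG experiments with operational metrics
32
+ - Refine the improvement loop based on expert perspectives (cognition/UX/operations)
33
+ - Establish operating criteria for noise reduction and ordering_warning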
26
34
 
27
35
  ## Work tracking
28
36
 
@@ -1,7 +1,7 @@
1
1
  # EvalVault Status Summary
2
2
 
3
3
  > Audience: users · developers · operators
4
- > Last Updated: 2026-01-11
4
+ > Last Updated: 2026-01-18
5
5
 
6
6
  EvalVault's goal is to **connect RAG evaluation, analysis, and tracing into a single Run unit** so that the experiment→regression→improvement loop runs quickly.
7
7
 
@@ -12,6 +12,19 @@ EvalVault의 목표는 **RAG 평가/분석/추적을 하나의 Run 단위로 연
12
12
  - **Observability**: Phoenix (OpenTelemetry/OpenInference) and, optionally, Langfuse/MLflow
13
13
  - **Profile-based model switching**: OpenAI/Ollama/vLLM/Anthropic and others via `config/models.yaml` + `.env`
14
14
  - **Open RAG Trace standard**: instrument and collect from external/internal RAG systems using a standard schema
15
+ - **Performance improvement framework**: KPIs, evaluation, and roadmap summarized in `guides/RAG_PERFORMANCE_IMPROVEMENT_PROPOSAL.md`
16
+
17
+ ## Recently completed
18
+
19
+ - **CLI parallel command set completed**: compare/calibrate-judge/profile-difficulty/regress/artifacts lint/ops snapshot
20
+ - **Noise reduction pipeline strengthened**: improved dataset_preprocessor/evaluator/stage_metric_service
21
+ - **ordering_warning introduced**: order restoration/warning metric, plus documented runbook and strict criteria
22
+ - **Web UI updates**: warning indicators and runbook links added to RunDetails/CompareRuns/AnalysisLab
23
+
24
+ ## Quality/verification status
25
+
26
+ - Python unit smoke: dataset_preprocessor/evaluator_comprehensive/stage_metric_service PASS
27
+ - Frontend lint/build: eslint PASS, vite build PASS (bundle size warning only)
15
28
 
16
29
  ## Current constraints (transparent disclosure)
17
30
 
@@ -0,0 +1,315 @@
1
+ # CLI Parallel Features Spec (Draft)
2
+
3
+ > Audience: CLI/Platform contributors
4
+ > Purpose: Future CLI features aligned with SOLID, BDD, hexagonal & clean architecture
5
+ > Last Updated: 2026-01-18
6
+
7
+ ## 1. Overview
8
+
9
+ This document specifies new CLI features that are parallel-by-default, deterministic, and cleanly separated by ports/adapters. The scope is design-level documentation with stable JSON outputs and BDD scenarios.
10
+
11
+ Design goals:
12
+ - SOLID: each command = one use-case orchestrator; dependencies injected via ports
13
+ - Clean/Hexagonal: CLI is an inbound adapter; domain services depend on outbound ports only
14
+ - Parallel execution: bounded concurrency with deterministic aggregation
15
+ - BDD: user-visible behavior is defined via Gherkin scenarios
16
+
17
+ Collaboration rules (conflict avoidance):
18
+ - Each stream modifies different files only.
19
+ - Shared schemas or interfaces change only after explicit agreement.
20
+ - Documentation edits are assigned to a single owner to avoid merge conflicts.
21
+
22
+ ## 1.1 Parallel Agent Implementation Plan (Execution)
23
+
24
+ Scope:
25
+ - Implement all commands below in parallel (CLI + domain services + ports + adapters).
26
+ - Each command is owned by exactly one agent end-to-end.
27
+
28
+ Ownership:
29
+ - Agent Compare: `evalvault compare`
30
+ - Agent Calibrate: `evalvault calibrate-judge`
31
+ - Agent Difficulty: `evalvault profile-difficulty`
32
+ - Agent Regress: `evalvault regress`
33
+ - Agent Artifacts: `evalvault artifacts lint`
34
+ - Agent Ops: `evalvault ops snapshot`
35
+
36
+ File boundaries (default):
37
+ - CLI command module for the command
38
+ - Domain service (one use-case service per command)
39
+ - Outbound port interfaces needed by that service
40
+ - Outbound adapters for storage/reporting/FS as needed
41
+ - Tests for the command/service
42
+
43
+ Shared files (change only with explicit agreement):
44
+ - `adapters/inbound/cli/app.py`
45
+ - `adapters/inbound/cli/commands/__init__.py`
46
+ - Common JSON envelope schema or report templates
47
+ - `domain/services/async_batch_executor.py`
48
+
49
+ Definition of done (per agent):
50
+ - CLI command registered and functional with `--help` and a basic run path
51
+ - Domain service + ports/adapters implemented for the use-case
52
+ - Tests added for core logic and CLI wiring
53
+ - Tests and lint pass with the standard project commands
54
+
55
+ Test commands (standard project flow):
56
+ - `uv run ruff check src/ tests/`
57
+ - `uv run ruff format src/ tests/`
58
+ - `uv run pytest tests -v`
59
+
60
+ ## 2. Command Specs
61
+
62
+ ### 2.1 `evalvault compare`
63
+
64
+ Purpose:
65
+ - Compare two runs (metrics, prompts/config diffs, difficulty distribution) and output a unified report.
66
+
67
+ Synopsis:
68
+ ```
69
+ uv run evalvault compare RUN_A RUN_B \
70
+ --db data/db/evalvault.db \
71
+ --metrics faithfulness,answer_relevancy \
72
+ --test t-test \
73
+ --format table \
74
+ --output reports/comparison/comparison_RUNA_RUNB.json \
75
+ --report reports/comparison/comparison_RUNA_RUNB.md \
76
+ --output-dir reports/comparison \
77
+ --artifacts-dir reports/comparison/artifacts/comparison_RUNA_RUNB \
78
+ --parallel --concurrency 8
79
+ ```
80
+
81
+ Options:
82
+ - `--db, -D <path>`: SQLite database path
83
+ - `--metrics, -m <csv>`: allowlist of metrics
84
+ - `--test, -t <t-test|mann-whitney>`
85
+ - `--format, -f <table|json>`
86
+ - `--output, -o <path>`
87
+ - `--report <path>`
88
+ - `--output-dir <path>`
89
+ - `--artifacts-dir <path>`
90
+ - `--parallel/--no-parallel`, `--concurrency <int>`
91
+
92
+ Exit codes:
93
+ - `0`: success
94
+ - `1`: invalid args or missing run
95
+ - `2`: report generation degraded
96
+
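The `--test t-test` option implies a two-sample comparison of per-case metric scores between the two runs. As a sketch only (the spec does not say which t-test variant is used, and this is not the project's code), Welch's t statistic can be computed with the standard library. The name `welch_t` is illustrative.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(scores_a: list[float], scores_b: list[float]) -> float:
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    if len(scores_a) < 2 or len(scores_b) < 2:
        raise ValueError("each run needs at least two scores")
    # Standard error of the difference in means, using sample variances.
    standard_error = sqrt(
        variance(scores_a) / len(scores_a) + variance(scores_b) / len(scores_b)
    )
    if standard_error == 0:
        raise ValueError("both samples have zero variance")
    return (mean(scores_a) - mean(scores_b)) / standard_error
```

Turning the statistic into a p-value needs the t distribution (for example from SciPy), which is left out here to keep the sketch dependency-free.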
97
+ ### 2.2 `evalvault calibrate-judge`
98
+
99
+ Purpose:
100
+ - Calibrate judge scores and emit reliability summary.
101
+
102
+ Synopsis:
103
+ ```
104
+ uv run evalvault calibrate-judge RUN_ID \
105
+ --db data/db/evalvault.db \
106
+ --labels-source feedback \
107
+ --method isotonic \
108
+ --metric faithfulness \
109
+ --holdout-ratio 0.2 \
110
+ --seed 42 \
111
+ --write-back \
112
+ --output reports/calibration/judge_calibration_RUNID.json \
113
+ --parallel --concurrency 8
114
+ ```
115
+
116
+ Options:
117
+ - `--labels-source <feedback|gold|hybrid>`
118
+ - `--method <platt|isotonic|temperature|none>`
119
+ - `--metric <name>` (repeatable)
120
+ - `--holdout-ratio <float>`
121
+ - `--seed <int>`
122
+ - `--write-back`
123
+ - `--output, -o <path>`
124
+ - `--artifacts-dir <path>`
125
+ - `--parallel/--no-parallel`, `--concurrency <int>`
126
+
127
+ Exit codes:
128
+ - `0`: success
129
+ - `1`: labels missing / invalid args
130
+ - `2`: calibration quality below gate
131
+
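The `--holdout-ratio` and `--seed` options suggest a reproducible train/holdout split before a calibrator is fitted. A minimal sketch of such a split follows; the function name and the choice to sort IDs before shuffling are assumptions, not behavior confirmed by the spec.

```python
import random


def split_holdout(
    case_ids: list[str], holdout_ratio: float, seed: int
) -> tuple[list[str], list[str]]:
    """Return (train_ids, holdout_ids) using a seeded, order-independent shuffle."""
    if not 0.0 <= holdout_ratio < 1.0:
        raise ValueError("holdout_ratio must be in [0, 1)")
    # Sort first so the split depends only on the set of IDs and the seed.
    shuffled = sorted(case_ids)
    random.Random(seed).shuffle(shuffled)
    holdout_size = round(len(shuffled) * holdout_ratio)
    return shuffled[holdout_size:], shuffled[:holdout_size]
```

Using a private `random.Random(seed)` instance keeps the split independent of any other code that touches the global random state.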
132
+ ### 2.3 `evalvault profile-difficulty`
133
+
134
+ Purpose:
135
+ - Compute difficulty buckets for a dataset or a run.
136
+
137
+ Synopsis:
138
+ ```
139
+ uv run evalvault profile-difficulty \
140
+ --db data/db/evalvault.db \
141
+ --dataset-name insurance-qa \
142
+ --limit-runs 50 \
143
+ --metrics faithfulness,answer_relevancy \
144
+ --bucket-count 5 \
145
+ --output reports/difficulty/difficulty_insurance-qa.json \
146
+ --parallel --concurrency 8
147
+ ```
148
+
149
+ Options:
150
+ - `--dataset-name <string>` or `--run-id <id>`
151
+ - `--limit-runs <int>`
152
+ - `--metrics, -m <csv>`
153
+ - `--bucket-count <int>`
154
+ - `--min-samples <int>`
155
+ - `--output, -o <path>`
156
+ - `--artifacts-dir <path>`
157
+ - `--parallel/--no-parallel`, `--concurrency <int>`
158
+
159
+ Exit codes:
160
+ - `0`: success
161
+ - `1`: insufficient history or invalid args
162
+
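The `--bucket-count` option implies that cases are grouped into difficulty buckets, but the spec does not define the bucketing rule. The sketch below assumes equal-sized rank buckets over per-case mean scores, where a lower mean score means a harder case. All names are hypothetical.

```python
def difficulty_buckets(mean_scores: dict[str, float], bucket_count: int) -> dict[str, int]:
    """Assign each case a bucket index: 0 is hardest, bucket_count - 1 is easiest."""
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")
    # Tie-break on case ID so the assignment is deterministic.
    ranked = sorted(mean_scores, key=lambda case_id: (mean_scores[case_id], case_id))
    total = len(ranked)
    return {
        case_id: min(rank * bucket_count // total, bucket_count - 1)
        for rank, case_id in enumerate(ranked)
    }
```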
163
+ ### 2.4 `evalvault regress`
164
+
165
+ Purpose:
166
+ - CI-grade regression gate vs baseline run.
167
+
168
+ Synopsis:
169
+ ```
170
+ uv run evalvault regress RUN_CANDIDATE \
171
+ --db data/db/evalvault.db \
172
+ --baseline RUN_BASELINE \
173
+ --fail-on-regression 0.05 \
174
+ --test t-test \
175
+ --metrics faithfulness,answer_relevancy \
176
+ --format github-actions \
177
+ --output reports/regress/regress_RUNCAND.json \
178
+ --parallel --concurrency 8
179
+ ```
180
+
181
+ Exit codes:
182
+ - `0`: pass
183
+ - `1`: invalid input
184
+ - `2`: regression detected
185
+ - `3`: internal error
186
+
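The exit codes above could be derived from a per-metric comparison roughly as follows. This sketch assumes that `--fail-on-regression` is an absolute drop threshold on the mean metric score, which the spec does not state, and it leaves out the statistical test. Exit code `3` (internal error) would come from an outer exception handler and is not modeled here.

```python
EXIT_PASS = 0
EXIT_INVALID_INPUT = 1
EXIT_REGRESSION = 2


def regression_exit_code(
    baseline: dict[str, float], candidate: dict[str, float], threshold: float
) -> int:
    """Map mean metric scores of two runs to a CI exit code."""
    shared_metrics = sorted(set(baseline) & set(candidate))
    if not shared_metrics or threshold < 0:
        return EXIT_INVALID_INPUT
    for metric in shared_metrics:
        # A drop larger than the threshold on any shared metric fails the gate.
        if baseline[metric] - candidate[metric] > threshold:
            return EXIT_REGRESSION
    return EXIT_PASS
```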
187
+ ### 2.5 `evalvault artifacts lint`
188
+
189
+ Purpose:
190
+ - Validate required artifacts and schema invariants.
191
+
192
+ Synopsis:
193
+ ```
194
+ uv run evalvault artifacts lint ARTIFACT_DIR \
195
+ --strict \
196
+ --format json \
197
+ --output reports/artifacts_lint/lint_RUNID.json \
198
+ --parallel --concurrency 16
199
+ ```
200
+
201
+ Checks:
202
+ - `index.json` presence
203
+ - required paths exist
204
+ - JSON schema validation
205
+
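The first two checks can be sketched with `pathlib` alone. The list of required paths is passed in explicitly because the layout of `index.json` is not shown in this spec; JSON schema validation is omitted, and `lint_artifacts` is a hypothetical name.

```python
import json
from pathlib import Path


def lint_artifacts(artifact_dir: Path, required_paths: list[str]) -> list[str]:
    """Return a deterministic list of lint issues (an empty list means clean)."""
    issues: list[str] = []
    index_path = artifact_dir / "index.json"
    if not index_path.is_file():
        issues.append("missing index.json")
    else:
        try:
            json.loads(index_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            issues.append("index.json is not valid JSON")
    # Sorted iteration keeps the report stable across runs.
    for relative in sorted(required_paths):
        if not (artifact_dir / relative).exists():
            issues.append(f"missing required path: {relative}")
    return issues
```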
206
+ ### 2.6 `evalvault ops snapshot`
207
+
208
+ Purpose:
209
+ - Collect reproducibility metadata (profile, model config, env redactions).
210
+
211
+ Synopsis:
212
+ ```
213
+ uv run evalvault ops snapshot \
214
+ --profile dev \
215
+ --db data/db/evalvault.db \
216
+ --run-id RUN_ID \
217
+ --include-model-config \
218
+ --include-env \
219
+ --redact OPENAI_API_KEY \
220
+ --output reports/ops/snapshot_RUNID.json
221
+ ```
222
+
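A minimal sketch of what `--include-env` combined with `--redact` might do: copy environment variables into the snapshot while masking the named ones. The placeholder string and function name are assumptions; the spec does not define the redaction format.

```python
import os

REDACTED = "***REDACTED***"  # hypothetical placeholder value


def snapshot_env(env: dict[str, str], redact: set[str]) -> dict[str, str]:
    """Return a name-sorted copy of env with the listed variables masked."""
    return {
        name: (REDACTED if name in redact else value)
        for name, value in sorted(env.items())
    }


# Example call against the real process environment:
# snapshot_env(dict(os.environ), {"OPENAI_API_KEY"})
```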
223
+ ## 3. Architecture Alignment
224
+
225
+ ### 3.1 SOLID
226
+ - SRP: each command orchestrates a single use-case service
227
+ - OCP: add new commands via new registrars without modifying core command modules
228
+ - DIP: domain services depend on ports (StoragePort, ReportPort, FileSystemPort)
229
+
230
+ ### 3.2 Hexagonal/Clean
231
+ - Inbound adapter: `adapters/inbound/cli/commands/*`
232
+ - Domain services: `domain/services/*` for use-cases
233
+ - Outbound ports: `ports/outbound/*`
234
+ - Outbound adapters: sqlite storage, report writers, LLM providers
235
+
236
+ ### 3.3 Proposed Services (Draft)
237
+ - `RunComparisonService`
238
+ - `JudgeCalibrationService`
239
+ - `DifficultyProfilingService`
240
+ - `RegressionGateService`
241
+ - `ArtifactLintService`
242
+ - `OpsSnapshotService`
243
+
244
+ ## 4. Parallel Execution Model
245
+
246
+ - Use bounded concurrency (`--concurrency`) and deterministic aggregation.
247
+ - Candidate base utility: `domain/services/async_batch_executor.py`.
248
+ - Parallelize per-metric/per-case computations; merge results with stable sorting.
249
+ - LLM calls default to sequential unless explicitly enabled.
250
+
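Bounded concurrency with deterministic aggregation can be sketched with `asyncio.Semaphore` and `asyncio.gather`. This is an illustrative sketch only: `run_bounded` is a hypothetical helper and is not the API of `domain/services/async_batch_executor.py`.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    concurrency: int = 8,
) -> list[R]:
    """Run `worker` over `items` with at most `concurrency` calls in flight."""
    if concurrency < 1:
        raise ValueError("concurrency must be >= 1")
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    # gather() returns results in input order, regardless of completion order,
    # so the aggregation step stays deterministic.
    return await asyncio.gather(*(guarded(item) for item in items))
```

If results are later merged across metrics or cases, sorting by a stable key such as `(metric_name, case_id)` before writing keeps output files diff-friendly. Setting `concurrency=1` gives the sequential default described for LLM calls.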
251
+ ## 5. JSON Output Envelope
252
+
253
+ Common envelope (recommended):
254
+ ```
255
+ {
256
+ "command": "compare",
257
+ "version": 1,
258
+ "status": "ok",
259
+ "started_at": "2026-01-18T00:00:00Z",
260
+ "finished_at": "2026-01-18T00:00:05Z",
261
+ "duration_ms": 5000,
262
+ "artifacts": {
263
+ "dir": "reports/.../artifacts/...",
264
+ "index": "reports/.../artifacts/.../index.json"
265
+ },
266
+ "data": {}
267
+ }
268
+ ```
269
+
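A helper that produces this envelope might look as follows. The field names come from the example above; the function name `build_envelope` and its signature are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone
from typing import Any, Optional


def build_envelope(
    command: str,
    started_at: datetime,
    finished_at: datetime,
    data: dict[str, Any],
    artifacts: Optional[dict[str, str]] = None,
    status: str = "ok",
) -> dict[str, Any]:
    """Build the common JSON envelope from timezone-aware timestamps."""
    if started_at.tzinfo is None or finished_at.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")

    def iso(ts: datetime) -> str:
        # Normalize to UTC and use the trailing "Z" form shown in the spec.
        return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    return {
        "command": command,
        "version": 1,
        "status": status,
        "started_at": iso(started_at),
        "finished_at": iso(finished_at),
        "duration_ms": round((finished_at - started_at).total_seconds() * 1000),
        "artifacts": artifacts or {},
        "data": data,
    }
```

The result is a plain dictionary, so `json.dumps(envelope, ensure_ascii=False, indent=2)` is enough to emit it for `--format json`.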
270
+ ## 6. BDD Scenarios (Gherkin)
271
+
272
+ ### compare
273
+ ```
274
+ Feature: Compare two evaluation runs
275
+ Scenario: Compare two runs with shared metrics
276
+ Given a database with runs "run_a" and "run_b"
277
+ When I run "evalvault compare run_a run_b --format json"
278
+ Then the command exits with code 0
279
+ And the JSON output contains "run_ids" ["run_a", "run_b"]
280
+ ```
281
+
282
+ ### calibrate-judge
283
+ ```
284
+ Feature: Calibrate judge scoring
285
+ Scenario: Calibrate judge scores using feedback labels
286
+ Given a run "run_x" with feedback labels in storage
287
+ When I run "evalvault calibrate-judge run_x --labels-source feedback"
288
+ Then the command exits with code 0
289
+ ```
290
+
291
+ ### regress
292
+ ```
293
+ Feature: Regression gate for CI
294
+ Scenario: Regression detected
295
+ Given a candidate run "run_new" and baseline "run_base"
296
+ When I run "evalvault regress run_new --baseline run_base"
297
+ Then the command exits with code 2
298
+ ```
299
+
300
+ ## 7. Non-goals
301
+ - No distributed execution or multi-node scheduling
302
+ - No new scoring algorithms; only orchestration and reporting
303
+ - No breaking change to existing CLI
304
+
305
+ ## 8. Risks
306
+ - Provider rate limits with parallel LLM calls
307
+ - DB contention under high concurrency
308
+ - Schema drift in artifacts without linting
309
+
310
+ ## 9. Mapping to Existing Modules (Evidence)
311
+ - CLI app: `adapters/inbound/cli/app.py`
312
+ - Command registration: `adapters/inbound/cli/commands/__init__.py`
313
+ - Existing compare pipeline: `adapters/inbound/cli/commands/analyze.py`
314
+ - Artifact utilities: `adapters/inbound/cli/utils/analysis_io.py`
315
+ - Async batch executor: `domain/services/async_batch_executor.py`
@@ -65,6 +65,22 @@
65
65
  - `samples`: number of samples
66
66
  - Sample: `avg_score=0.7200`, `pass_rate=0.6`, `samples=30`
67
67
 
68
+ ## CustomMetrics
69
+ - Column descriptions
70
+ - `schema_version`: snapshot schema version
71
+ - `metric_name`: metric name
72
+ - `source`: metric source (custom)
73
+ - `description`: metric description
74
+ - `evaluation_method`: evaluation method
75
+ - `inputs`: list of input fields
76
+ - `output`: score range / decision rule
77
+ - `evaluation_process`: summary of the evaluation process
78
+ - `rules`: keywords, regular expressions, weights, etc.
79
+ - `notes`: domain-specific notes and interpretation caveats
80
+ - `implementation_path`: path of the implementation file
81
+ - `implementation_hash`: hash of the implementation file
82
+ - Sample: `metric_name=entity_preservation`, `evaluation_method=rule-based`
83
+
68
84
  ## RunPromptSets
69
85
  - Column descriptions
70
86
  - `run_id`: run ID
@@ -1,10 +1,9 @@
1
- # EvalVault Work Plan (RAGAS/Tracing/Prompt Override)
1
+ # EvalVault Work Plan (Archived)
2
2
 
3
3
  ## 0) Purpose
4
4
 
5
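- - Confirm **correct operation** from RAGAS evaluation result storage and Phoenix tracing → further analysis → report (Markdown)
6
- - Branch external log API input (assumed to be JSON) into **RAGAS-style/unstructured** paths and analyze each
7
- - Override the RAGAS prompts and the system prompt **separately** and verify with an actual run
5
+ - This document is classified as a past work plan and is kept for archival purposes only.
6
+ - For current execution scenarios, refer to `docs/guides/RAG_CLI_WORKFLOW_TEMPLATES.md`.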
8
7
 
9
8
  ## 1) Assumptions and Scope
10
9
 
@@ -0,0 +1,114 @@
1
+ # Data Difficulty Assessment for RAG Systems and a Fine-Tuning Strategy for Evaluation LLMs (A Realistic View)
2
+
3
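+ ## 1. Data Difficulty Assessment Framework: Evidence Exists, but Preconditions Matter
4
+
5
+ ### 1.1 Core Premises
6
+ - Difficulty is determined by the interaction among question, context, and answer, and is hard to capture with a single metric.
7
+ - There is evidence that Retrieval Complexity (RC), as a measure of question difficulty, correlates with QA performance and expert judgment.
8
+ - However, difficulty is a “proxy metric”, and its correlation with real production data must be validated first.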
9
+
10
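+ ### 1.2 Difficulty Axes (Recommended)
11
+ - Question complexity: compound questions, multi-step reasoning, presence of temporal/conditional context
12
+ - Retrieval difficulty: whether the required evidence is spread across multiple documents; completeness of the retrieved set
13
+ - Answer quality signals: ground-truth labels/judgment scores, faithfulness/answer relevancy
14
+ - Noise/out-of-domain: no retrieval results, low confidence from the domain classifier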
15
+
16
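+ ### 1.3 Phased Implementation (Realistic)
17
+ 1. v0 (heuristic): query length, multi-hop flag, retrieval success/failure, top-k score distribution
18
+ 2. v1 (RC-based): apply an RRCP-style pipeline to estimate RC; validate the correlation between difficulty and error rate
19
+ 3. v2 (operationalized difficulty): manage difficulty-distribution drift as a KPI; use separate thresholds per difficulty band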
20
+
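The v0 heuristic could be prototyped as a weighted combination of the listed signals. Everything below is an assumption chosen for illustration: the `QueryFeatures` fields, the 30-word length cap, and the 0.2/0.3/0.5 weights. None of it has been validated against error rates, which is exactly the check the phased plan calls for.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class QueryFeatures:
    query: str
    is_multi_hop: bool
    # Top-k similarity scores in [0, 1]; empty when retrieval returned nothing.
    retrieval_scores: list[float] = field(default_factory=list)


def difficulty_v0(features: QueryFeatures) -> float:
    """Heuristic difficulty in [0, 1]; higher means harder (weights are guesses)."""
    # Whitespace tokenization is a crude length proxy, capped at 30 tokens.
    length_term = min(len(features.query.split()) / 30.0, 1.0)
    hop_term = 1.0 if features.is_multi_hop else 0.0
    if not features.retrieval_scores:
        retrieval_term = 1.0  # failed retrieval counts as maximally hard
    else:
        avg = max(0.0, min(mean(features.retrieval_scores), 1.0))
        retrieval_term = 1.0 - avg
    return round(0.2 * length_term + 0.3 * hop_term + 0.5 * retrieval_term, 4)
```

Storing the three terms alongside the final score would make it easier to check later which signal actually tracks the error rate.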
21
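+ ### 1.4 Handling Noisy/Erroneous Inputs
22
+ - Classify as noise: retrieval similarity below a floor, zero results, or low-confidence domain classification
23
+ - Tag noise cases separately and handle them downstream with a safe response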
24
+
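The noise rules above translate into a small tagging function. The tag names and the thresholds (`min_similarity=0.3`, `min_domain_confidence=0.5`) are placeholders chosen for illustration; real values would need to be tuned on production data.

```python
def noise_tags(
    top_scores: list[float],
    domain_confidence: float,
    min_similarity: float = 0.3,
    min_domain_confidence: float = 0.5,
) -> list[str]:
    """Return noise tags for one query; an empty list means no noise signal."""
    tags: list[str] = []
    if not top_scores:
        tags.append("no_results")
    elif max(top_scores) < min_similarity:
        # Even the best hit falls below the similarity floor.
        tags.append("low_similarity")
    if domain_confidence < min_domain_confidence:
        tags.append("low_domain_confidence")
    return tags
```

A downstream step can then route any case with a non-empty tag list to the safe response path instead of scoring it as a normal answer.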
25
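+ ### 1.5 EvalVault Integration
26
+ - Store difficulty scores as run_id artifacts so that performance trends by difficulty can be compared.
27
+ - Verify whether changes in the difficulty distribution track quality degradation, to confirm whether they are the “real cause”.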
28
+
29
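+ ### 1.6 Domain Examples (Insurance/Nuclear Power)
30
+ - Insurance
31
+ - Easy: “What is the eligible age for auto insurance?” (stated in a single document)
32
+ - Medium: “How does the premium change when the driver scope is changed?” (combination of rule and exception)
33
+ - Hard: “What does indemnity health insurance cover when a specific treatment is non-reimbursable?” (multi-document/conditional reasoning)
34
+ - Nuclear power
35
+ - Easy: “What is the difference between the primary and secondary systems?” (definitional question)
36
+ - Medium: “What are the step-by-step requirements of the maintenance procedure?” (combination of procedure and conditions)
37
+ - Hard: “In a specific accident scenario, what is the actuation sequence of the safety systems, and what is the basis for it?” (multi-step reasoning)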
38
+
39
+ ---
40
+
41
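+ ## 2. Fine-Tuning an Evaluation LLM (as-a-judge): Cost Savings Are Possible, but Generalization Risk Remains
42
+
43
+ ### 2.1 Basic Principles
44
+ - Cost savings are achievable, but small judges are weak in generalization, fairness, and domain transfer.
45
+ - Judge quality depends more on label quality and calibration than on model size.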
46
+
47
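+ ### 2.2 Data Composition (Required)
48
+ - Human labels: question-context-answer triples with a score (1-5) or a grade label
49
+ - Preferences (pairwise): A/B comparison data (with rationale where possible)
50
+ - Expert reference answers: assessment of agreement with, and omissions relative to, the reference answer
51
+ - Production logs: thumbs up/down, re-queries, dissatisfaction signals (weak labels)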
52
+
53
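+ ### 2.3 Training Strategy (Recommended)
54
+ - Start with SFT; once enough preference data is available, add DPO or SLiC-HF
55
+ - Fix the output format to a JSON schema to keep judgments stable
56
+ - Validate using both the agreement rate with a GPT-4-class judge and the correlation with human evaluation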
57
+
58
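+ ### 2.4 Operational Guardrails
59
+ - Cascade evaluation: process the bulk with a small judge and escalate only borderline cases to a stronger model
60
+ - Calibration: correct scores with a small set of human labels and report confidence intervals
61
+ - Bias mitigation: swap/format randomization tests for position, format, and knowledge biases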
62
+
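Cascade evaluation can be sketched as a two-stage call in which only scores inside a borderline band are re-judged. The names (`cascade_judge`, `small_judge`, `large_judge`) and the band limits `low=0.4` and `high=0.6` are hypothetical. In practice the band would come from calibration against human labels.

```python
from typing import Any, Callable

Judge = Callable[[Any], float]


def cascade_judge(
    sample: Any,
    small_judge: Judge,
    large_judge: Judge,
    low: float = 0.4,
    high: float = 0.6,
) -> tuple[float, str]:
    """Score with the small judge; escalate only when the score is borderline."""
    score = small_judge(sample)
    if low <= score <= high:
        # Borderline case: pay for the stronger judge and use its score.
        return large_judge(sample), "large"
    return score, "small"
```

Returning which judge produced the score makes the escalation rate measurable, which is one way to track whether the cost saving is holding up.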
63
+ ---
64
+
65
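+ ## 3. Recent Fine-Tuning/Efficiency Techniques: Judge “Efficiency” and “Evaluation Quality” Separately
66
+
67
+ ### 3.1 When to Apply
68
+ - QLoRA/LoRA+/LoftQ help memory efficiency, but any gain in evaluation quality requires separate validation
69
+ - LongLoRA/Cartridges/MQA help long-context and serving efficiency but do not guarantee judge performance
70
+ - GaLore offers memory savings and the possibility of full updates, at the cost of added operational complexity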
71
+
72
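+ ### 3.2 Recommended Order of Adoption
73
+ 1. Start with QLoRA + LoRA (or LoRA+)
74
+ 2. Consider extended techniques only after calibration and consistency are established
75
+ 3. Apply long-context optimization only when a bottleneck is confirmed in real long-context workloads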
76
+
77
+ ---
78
+
79
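+ ## 4. Conclusion
80
+ - Difficulty profiling is valid, but “correlation validation + operational KPIs” are essential preconditions.
81
+ - Small judges help reduce cost but carry significant generalization, bias, and consistency risks, so calibration and cascade operation are essential.
82
+ - Recent fine-tuning techniques are tools for improving efficiency; they do not guarantee better evaluation quality.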
83
+
84
+ ---
85
+
86
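+ ## 5. Execution Checklist
87
+ - Data difficulty
88
+ - Check whether the v0 difficulty metrics correlate significantly with the error rate
89
+ - Verify whether difficulty-distribution drift tracks actual quality degradation
90
+ - Judge
91
+ - Obtain 3–5% human labels and generate a calibration report
92
+ - Define cascade escalation conditions (low-confidence/borderline cases)
93
+ - Operations
94
+ - Confirm that difficulty and judgment rationale are stored in run_id artifacts
95
+ - Document per-difficulty thresholds and response policies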
96
+
97
+ ---
98
+
99
+ ## References
100
+ - RC metric: https://aclanthology.org/2024.findings-acl.872/
101
+ - GRADE difficulty matrix: https://arxiv.org/abs/2508.16994
102
+ - QLoRA: https://arxiv.org/abs/2305.14314
103
+ - LoftQ: https://arxiv.org/abs/2310.08659
104
+ - LoRA+: https://arxiv.org/abs/2402.12354
105
+ - LongLoRA: https://arxiv.org/abs/2309.12307
106
+ - DPO: https://arxiv.org/abs/2305.18290
107
+ - SLiC-HF: https://arxiv.org/abs/2305.10425
108
+ - GaLore: https://arxiv.org/abs/2403.03507
109
+ - Cartridges: https://arxiv.org/abs/2506.06266
110
+ - MQA: https://arxiv.org/abs/1911.02150
111
+ - JudgeLM: https://arxiv.org/abs/2310.17631
112
+ - Fine-tuned judge limits: https://aclanthology.org/2025.findings-acl.306/
113
+ - LLM judge reliability: https://arxiv.org/abs/2412.12509
114
+ - LLM judge bias: https://llm-judge-bias.github.io/