PyPI - evalvault - Versions diffs - 1.64.0__tar.gz → 1.66.0__tar.gz - Mend

evalvault 1.64.0tar.gz → 1.66.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (844)

{evalvault-1.64.0 → evalvault-1.66.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: evalvault
-Version: 1.64.0
+Version: 1.66.0
 Summary: RAG evaluation system using Ragas with Phoenix/Langfuse tracing
 Project-URL: Homepage, https://github.com/ntts9990/EvalVault
 Project-URL: Documentation, https://github.com/ntts9990/EvalVault#readme
@@ -25,6 +25,7 @@ Classifier: Topic :: Software Development :: Quality Assurance
 Classifier: Topic :: Software Development :: Testing
 Classifier: Typing :: Typed
 Requires-Python: >=3.12
+Requires-Dist: chainlit>=2.9.5
 Requires-Dist: chardet
 Requires-Dist: fastapi>=0.128.0
 Requires-Dist: instructor
@@ -137,12 +138,17 @@ English version? See `README.en.md`.
 ## Quick Links
 - Documentation hub: `docs/INDEX.md`
+- CLI run scenario guide: `docs/guides/RAG_CLI_WORKFLOW_TEMPLATES.md`
 - User guide: `docs/guides/USER_GUIDE.md`
 - Developer guide: `docs/guides/DEV_GUIDE.md`
 - Status/roadmap: `docs/STATUS.md`, `docs/ROADMAP.md`
 - Development whitepaper (design/operations/quality standards): `docs/new_whitepaper/INDEX.md`
 - Open RAG Trace: `docs/architecture/open-rag-trace-spec.md`
+### Notes on upcoming improvements
+- Insurance summary metrics expansion plan: `docs/guides/INSURANCE_SUMMARY_METRICS_PLAN.md`
+- Repeated prompt application plan: `docs/guides/repeat_query.md`
 ---
 ## Problems EvalVault solves
@@ -470,6 +476,24 @@ npm run dev
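 - Ragas family: `faithfulness`, `answer_relevancy`, `context_precision`, `context_recall`, `factual_correctness`, `semantic_similarity`
 - Custom example (domain): `insurance_term_accuracy`
+### Design rationale for the summary metrics (summary_score, summary_faithfulness, entity_preservation)
+### Custom metric snapshot (records evaluation method/process/results)
+- Records the evaluation method, inputs/outputs, rules, and implementation file hash in `run.tracker_metadata.custom_metric_snapshot`.
+- The snapshot is also saved to the Excel `CustomMetrics` sheet and to Langfuse/Phoenix/MLflow artifacts.
+- `summary_faithfulness`: evaluates whether every claim in the summary is grounded in the context. It directly measures hallucination/distortion risk.
+- `summary_score`: evaluates the balance between preserving key information and conciseness relative to the context. It reduces the bias of relying on a single reference summary.
+- `entity_preservation`: measures whether entities that matter in insurance policy terms (amounts, periods, conditions, exclusions) are retained in the summary.
+**Rationale for insurance-domain specialization**
+- Critical elements of insurance terms (exclusions, deductibles, limits, conditions, etc.) are encoded directly as keywords, and the metrics are designed to preserve key entities such as amounts, periods, and ratios.
+- Because generic rules (numbers/periods/amounts) are combined with insurance-specific keywords, the current state is most accurately described as “weak domain specialization centered on insurance risk.”
+**Interpretation caveats**
+- All three metrics depend heavily on the quality of `contexts`. Scores can drop when the context is inaccurate or excessive.
+- `summary_score` is keyphrase-based, so scores can come out lower when the wording differs.
 For exact options and operational recipes, `docs/guides/USER_GUIDE.md` is kept as the up-to-date reference.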
 ---

{evalvault-1.64.0 → evalvault-1.66.0}/README.md RENAMED

@@ -14,12 +14,17 @@ English version? See `README.en.md`.
 ## Quick Links
 - Documentation hub: `docs/INDEX.md`
+- CLI run scenario guide: `docs/guides/RAG_CLI_WORKFLOW_TEMPLATES.md`
 - User guide: `docs/guides/USER_GUIDE.md`
 - Developer guide: `docs/guides/DEV_GUIDE.md`
 - Status/roadmap: `docs/STATUS.md`, `docs/ROADMAP.md`
 - Development whitepaper (design/operations/quality standards): `docs/new_whitepaper/INDEX.md`
 - Open RAG Trace: `docs/architecture/open-rag-trace-spec.md`
+### Notes on upcoming improvements
+- Insurance summary metrics expansion plan: `docs/guides/INSURANCE_SUMMARY_METRICS_PLAN.md`
+- Repeated prompt application plan: `docs/guides/repeat_query.md`
 ---
 ## Problems EvalVault solves
@@ -347,6 +352,24 @@ npm run dev
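 - Ragas family: `faithfulness`, `answer_relevancy`, `context_precision`, `context_recall`, `factual_correctness`, `semantic_similarity`
 - Custom example (domain): `insurance_term_accuracy`
+### Design rationale for the summary metrics (summary_score, summary_faithfulness, entity_preservation)
+### Custom metric snapshot (records evaluation method/process/results)
+- Records the evaluation method, inputs/outputs, rules, and implementation file hash in `run.tracker_metadata.custom_metric_snapshot`.
+- The snapshot is also saved to the Excel `CustomMetrics` sheet and to Langfuse/Phoenix/MLflow artifacts.
+- `summary_faithfulness`: evaluates whether every claim in the summary is grounded in the context. It directly measures hallucination/distortion risk.
+- `summary_score`: evaluates the balance between preserving key information and conciseness relative to the context. It reduces the bias of relying on a single reference summary.
+- `entity_preservation`: measures whether entities that matter in insurance policy terms (amounts, periods, conditions, exclusions) are retained in the summary.
+**Rationale for insurance-domain specialization**
+- Critical elements of insurance terms (exclusions, deductibles, limits, conditions, etc.) are encoded directly as keywords, and the metrics are designed to preserve key entities such as amounts, periods, and ratios.
+- Because generic rules (numbers/periods/amounts) are combined with insurance-specific keywords, the current state is most accurately described as “weak domain specialization centered on insurance risk.”
+**Interpretation caveats**
+- All three metrics depend heavily on the quality of `contexts`. Scores can drop when the context is inaccurate or excessive.
+- `summary_score` is keyphrase-based, so scores can come out lower when the wording differs.
 For exact options and operational recipes, `docs/guides/USER_GUIDE.md` is kept as the up-to-date reference.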
 ---
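The README hunk above describes `entity_preservation` as a rule-based check that amounts, periods, and risk keywords from the context survive in the summary. The following is a minimal sketch of that idea, not EvalVault's actual implementation: the regex, the keyword list, and the function names are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern and keywords, for illustration only; the real rules
# used by EvalVault are not shown in this diff.
ENTITY_PATTERN = re.compile(
    r"\d[\d,]*(?:\.\d+)?\s*(?:%|won|days?|months?|years?)", re.IGNORECASE
)
RISK_KEYWORDS = ("exclusion", "deductible", "limit")


def extract_entities(text: str) -> set[str]:
    """Collect normalized numeric entities and risk keywords found in the text."""
    lowered = text.lower()
    entities = {re.sub(r"\s+", "", m.group(0)) for m in ENTITY_PATTERN.finditer(lowered)}
    entities.update(keyword for keyword in RISK_KEYWORDS if keyword in lowered)
    return entities


def entity_preservation(contexts: list[str], summary: str) -> float:
    """Fraction of context entities that also appear in the summary (1.0 if none)."""
    source = set().union(*(extract_entities(context) for context in contexts))
    if not source:
        return 1.0
    kept = source & extract_entities(summary)
    return len(kept) / len(source)
```

Under these assumptions, a summary that drops one of five context entities (for example a "30 days" period) scores 0.8, which illustrates why the caveats above stress that scores depend on the quality of `contexts`.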

evalvault-1.66.0/config/ragas_prompts_override.yaml ADDED

@@ -0,0 +1,11 @@
+faithfulness: |
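+  You are an evaluator. Based on the CONTEXT below, decide whether each STATEMENT
+  can be directly inferred from it.
+  - Output the verdict strictly as the integer 1 or 0 (without quotes).
+  - 1: directly supported by the context, 0: not supported.
+  - Return JSON only.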
+answer_relevancy: |
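+  You are an evaluator. Rate how relevant the answer is to the question with a score from 0 to 1.
+  - The output must include a numeric score and a brief rationale.
+  - Assign a low score if much of the content is unrelated to the question.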
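The `faithfulness` override above requires the judge to return JSON with an unquoted integer verdict of 1 or 0. A consumer of that output could enforce the rule as in the sketch below. The JSON shape (a `statements` list whose items carry a `verdict` field) and both function names are assumptions made for illustration; the diff does not show how EvalVault or Ragas parses the response.

```python
import json


def parse_verdicts(raw: str) -> list[int]:
    """Parse judge output and accept only strict integer verdicts of 0 or 1."""
    payload = json.loads(raw)
    verdicts = []
    for item in payload["statements"]:  # "statements" is an assumed key
        verdict = item["verdict"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(verdict, bool) or not isinstance(verdict, int) or verdict not in (0, 1):
            raise ValueError(f"verdict must be the integer 0 or 1, got {verdict!r}")
        verdicts.append(verdict)
    return verdicts


def supported_ratio(verdicts: list[int]) -> float:
    """Share of statements judged as supported by the context (0.0 when empty)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

Rejecting quoted values such as `"1"` at parse time surfaces prompt drift early instead of silently coercing strings into scores.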

{evalvault-1.64.0 → evalvault-1.66.0}/docs/INDEX.md RENAMED

@@ -13,13 +13,17 @@
 ## Quick links
 - Installation: `getting-started/INSTALLATION.md`
+- CLI run scenario guide: `guides/RAG_CLI_WORKFLOW_TEMPLATES.md`
 - User guide (including operations): `guides/USER_GUIDE.md`
 - Development/contributing: `guides/DEV_GUIDE.md`
-- CLI→MCP porting plan: `guides/CLI_MCP_PLAN.md`
-- Web UI expansion design doc: `guides/WEBUI_CLI_ROLLOUT_PLAN.md` (includes the list of phase 1 implementation files)
-- RAGAS human feedback calibration: `guides/RAGAS_HUMAN_FEEDBACK_CALIBRATION_GUIDE.md`
 - Diagnostic playbook: `guides/EVALVAULT_DIAGNOSTIC_PLAYBOOK.md` (problem → analysis → interpretation → action flow)
+- RAG performance improvement proposal: `guides/RAG_PERFORMANCE_IMPROVEMENT_PROPOSAL.md` (purpose/mission, KPIs, roadmap)
+- RAGAS human feedback calibration: `guides/RAGAS_HUMAN_FEEDBACK_CALIBRATION_GUIDE.md`
 - Summary of run result Excel sheets: `guides/EVALVAULT_RUN_EXCEL_SHEETS.md`
+- Evaluation report templates: `templates/eval_report_templates.md`
+- CLI→MCP porting plan: `guides/CLI_MCP_PLAN.md`
+- Web UI expansion design doc: `guides/WEBUI_CLI_ROLLOUT_PLAN.md`
+- Documentation refresh work plan: `guides/DOCS_REFRESH_PLAN.md`
 - Release checklist: `guides/RELEASE_CHECKLIST.md`
 - Status summary: `STATUS.md`
 - Roadmap: `ROADMAP.md`

{evalvault-1.64.0 → evalvault-1.66.0}/docs/ROADMAP.md RENAMED

@@ -1,6 +1,6 @@
 # EvalVault Roadmap
-> Last Updated: 2026-01-11
+> Last Updated: 2026-01-18
 This document briefly shares **"what we will do next, and why"** from an external (user/contributor) perspective.
@@ -19,10 +19,18 @@
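 ### P1 (Usability)
 - Improve the completeness of the core workflow (Evaluation → History → Reports) in the Web UI
 - Expose the shared CLI/web DB and artifact path conventions consistently in docs and UX
+- Strengthen the integration between the Run detail tabs (Staging/Prompts/Gate/Debug) and the Analysis Lab
 ### P2 (Observability/Standards)
 - Incrementally extend the Open RAG Trace spec/samples to match real operational needs (following the versioning policy)
 - Strengthen guidance on Collector configuration and data retention (artifact separation, PII masking)
+- Standardize the minimal Stage Events schema and keep the docs in sync
+### P3 (Performance improvement roadmap)
+- Establish KPIs, an evaluation protocol, and a roadmap based on the RAG performance improvement proposal
+- Integrate Retrieval/reranking/GraphRAG experiments with operational metrics
+- Advance the improvement loop based on expert perspectives (cognitive/UX/operations)
+- Establish operating criteria for noise reduction and ordering_warning
 ## Work tracking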

{evalvault-1.64.0 → evalvault-1.66.0}/docs/STATUS.md RENAMED

@@ -1,7 +1,7 @@
 # EvalVault Status Summary
 > Audience: users · developers · operators
-> Last Updated: 2026-01-11
+> Last Updated: 2026-01-18
 EvalVault's goal is to **connect RAG evaluation, analysis, and tracing in a single Run unit**, so that the experiment → regression → improvement loop runs quickly.
@@ -12,6 +12,19 @@ EvalVault의 목표는 **RAG 평가/분석/추적을 하나의 Run 단위로 연
 - **Observability**: Phoenix (OpenTelemetry/OpenInference) and, optionally, Langfuse/MLflow
 - **Profile-based model switching**: OpenAI/Ollama/vLLM/Anthropic, etc. via `config/models.yaml` + `.env`
 - **Open RAG Trace standard**: instrument and collect external/internal RAG systems with a standard schema
+- **Performance improvement frame**: KPIs, evaluation, and roadmap are organized in `guides/RAG_PERFORMANCE_IMPROVEMENT_PROPOSAL.md`
+## Recently completed
+- **CLI parallel command group completed**: compare/calibrate-judge/profile-difficulty/regress/artifacts lint/ops snapshot
+- **Noise reduction pipeline strengthened**: improvements to dataset_preprocessor/evaluator/stage_metric_service
+- **ordering_warning introduced**: order restoration/warning metric, plus a documented runbook and strict criteria
+- **Web UI updates**: warning display and runbook links added to RunDetails/CompareRuns/AnalysisLab
+## Quality/verification status
+- Python unit smoke: dataset_preprocessor/evaluator_comprehensive/stage_metric_service PASS
+- Frontend lint/build: eslint PASS, vite build PASS (bundle size warning only)
 ## Current constraints (disclosed transparently)

evalvault-1.66.0/docs/guides/CLI_PARALLEL_FEATURES_SPEC.md ADDED

@@ -0,0 +1,315 @@
+# CLI Parallel Features Spec (Draft)
+> Audience: CLI/Platform contributors
+> Purpose: Future CLI features aligned with SOLID, BDD, hexagonal & clean architecture
+> Last Updated: 2026-01-18
+## 1. Overview
+This document specifies new CLI features that are parallel-by-default, deterministic, and cleanly separated by ports/adapters. The scope is design-level documentation with stable JSON outputs and BDD scenarios.
+Design goals:
+- SOLID: each command = one use-case orchestrator; dependencies injected via ports
+- Clean/Hexagonal: CLI is an inbound adapter; domain services depend on outbound ports only
+- Parallel execution: bounded concurrency with deterministic aggregation
+- BDD: user-visible behavior is defined via Gherkin scenarios
+Collaboration rules (conflict avoidance):
+- Each stream modifies different files only.
+- Shared schemas or interfaces change only after explicit agreement.
+- Documentation edits are assigned to a single owner to avoid merge conflicts.
+## 1.1 Parallel Agent Implementation Plan (Execution)
+Scope:
+- Implement all commands below in parallel (CLI + domain services + ports + adapters).
+- Each command is owned by exactly one agent end-to-end.
+Ownership:
+- Agent Compare: `evalvault compare`
+- Agent Calibrate: `evalvault calibrate-judge`
+- Agent Difficulty: `evalvault profile-difficulty`
+- Agent Regress: `evalvault regress`
+- Agent Artifacts: `evalvault artifacts lint`
+- Agent Ops: `evalvault ops snapshot`
+File boundaries (default):
+- CLI command module for the command
+- Domain service (one use-case service per command)
+- Outbound port interfaces needed by that service
+- Outbound adapters for storage/reporting/FS as needed
+- Tests for the command/service
+Shared files (change only with explicit agreement):
+- `adapters/inbound/cli/app.py`
+- `adapters/inbound/cli/commands/__init__.py`
+- Common JSON envelope schema or report templates
+- `domain/services/async_batch_executor.py`
+Definition of done (per agent):
+- CLI command registered and functional with `--help` and a basic run path
+- Domain service + ports/adapters implemented for the use-case
+- Tests added for core logic and CLI wiring
+- Tests and lint pass with the standard project commands
+Test commands (standard project flow):
+- `uv run ruff check src/ tests/`
+- `uv run ruff format src/ tests/`
+- `uv run pytest tests -v`
+## 2. Command Specs
+### 2.1 `evalvault compare`
+Purpose:
+- Compare two runs (metrics, prompts/config diffs, difficulty distribution) and output a unified report.
+Synopsis:
+```
+uv run evalvault compare RUN_A RUN_B \
+  --db data/db/evalvault.db \
+  --metrics faithfulness,answer_relevancy \
+  --test t-test \
+  --format table \
+  --output reports/comparison/comparison_RUNA_RUNB.json \
+  --report reports/comparison/comparison_RUNA_RUNB.md \
+  --output-dir reports/comparison \
+  --artifacts-dir reports/comparison/artifacts/comparison_RUNA_RUNB \
+  --parallel --concurrency 8
+```
+Options:
+- `--db, -D <path>`: sqlite db path
+- `--metrics, -m <csv>`: allowlist of metrics
+- `--test, -t <t-test|mann-whitney>`
+- `--format, -f <table|json>`
+- `--output, -o <path>`
+- `--report <path>`
+- `--output-dir <path>`
+- `--artifacts-dir <path>`
+- `--parallel/--no-parallel`, `--concurrency <int>`
+Exit codes:
+- `0`: success
+- `1`: invalid args or missing run
+- `2`: report generation degraded
+### 2.2 `evalvault calibrate-judge`
+Purpose:
+- Calibrate judge scores and emit reliability summary.
+Synopsis:
+```
+uv run evalvault calibrate-judge RUN_ID \
+  --db data/db/evalvault.db \
+  --labels-source feedback \
+  --method isotonic \
+  --metric faithfulness \
+  --holdout-ratio 0.2 \
+  --seed 42 \
+  --write-back \
+  --output reports/calibration/judge_calibration_RUNID.json \
+  --parallel --concurrency 8
+```
+Options:
+- `--labels-source <feedback|gold|hybrid>`
+- `--method <platt|isotonic|temperature|none>`
+- `--metric <name>` (repeatable)
+- `--holdout-ratio <float>`
+- `--seed <int>`
+- `--write-back`
+- `--output, -o <path>`
+- `--artifacts-dir <path>`
+- `--parallel/--no-parallel`, `--concurrency <int>`
+Exit codes:
+- `0`: success
+- `1`: labels missing / invalid args
+- `2`: calibration quality below gate
+### 2.3 `evalvault profile-difficulty`
+Purpose:
+- Compute difficulty buckets for a dataset or a run.
+Synopsis:
+```
+uv run evalvault profile-difficulty \
+  --db data/db/evalvault.db \
+  --dataset-name insurance-qa \
+  --limit-runs 50 \
+  --metrics faithfulness,answer_relevancy \
+  --bucket-count 5 \
+  --output reports/difficulty/difficulty_insurance-qa.json \
+  --parallel --concurrency 8
+```
+Options:
+- `--dataset-name <string>` or `--run-id <id>`
+- `--limit-runs <int>`
+- `--metrics, -m <csv>`
+- `--bucket-count <int>`
+- `--min-samples <int>`
+- `--output, -o <path>`
+- `--artifacts-dir <path>`
+- `--parallel/--no-parallel`, `--concurrency <int>`
+Exit codes:
+- `0`: success
+- `1`: insufficient history or invalid args
+### 2.4 `evalvault regress`
+Purpose:
+- CI-grade regression gate vs baseline run.
+Synopsis:
+```
+uv run evalvault regress RUN_CANDIDATE \
+  --db data/db/evalvault.db \
+  --baseline RUN_BASELINE \
+  --fail-on-regression 0.05 \
+  --test t-test \
+  --metrics faithfulness,answer_relevancy \
+  --format github-actions \
+  --output reports/regress/regress_RUNCAND.json \
+  --parallel --concurrency 8
+```
+Exit codes:
+- `0`: pass
+- `1`: invalid input
+- `2`: regression detected
+- `3`: internal error
+### 2.5 `evalvault artifacts lint`
+Purpose:
+- Validate required artifacts and schema invariants.
+Synopsis:
+```
+uv run evalvault artifacts lint ARTIFACT_DIR \
+  --strict \
+  --format json \
+  --output reports/artifacts_lint/lint_RUNID.json \
+  --parallel --concurrency 16
+```
+Checks:
+- `index.json` presence
+- required paths exist
+- JSON schema validation
+### 2.6 `evalvault ops snapshot`
+Purpose:
+- Collect reproducibility metadata (profile, model config, env redactions).
+Synopsis:
+```
+uv run evalvault ops snapshot \
+  --profile dev \
+  --db data/db/evalvault.db \
+  --run-id RUN_ID \
+  --include-model-config \
+  --include-env \
+  --redact OPENAI_API_KEY \
+  --output reports/ops/snapshot_RUNID.json
+```
+## 3. Architecture Alignment
+### 3.1 SOLID
+- SRP: each command orchestrates a single use-case service
+- OCP: add new commands via new registrars without modifying core command modules
+- DIP: domain services depend on ports (StoragePort, ReportPort, FileSystemPort)
+### 3.2 Hexagonal/Clean
+- Inbound adapter: `adapters/inbound/cli/commands/*`
+- Domain services: `domain/services/*` for use-cases
+- Outbound ports: `ports/outbound/*`
+- Outbound adapters: sqlite storage, report writers, LLM providers
+### 3.3 Proposed Services (Draft)
+- `RunComparisonService`
+- `JudgeCalibrationService`
+- `DifficultyProfilingService`
+- `RegressionGateService`
+- `ArtifactLintService`
+- `OpsSnapshotService`
+## 4. Parallel Execution Model
+- Use bounded concurrency (`--concurrency`) and deterministic aggregation.
+- Candidate base utility: `domain/services/async_batch_executor.py`.
+- Parallelize per-metric/per-case computations; merge results with stable sorting.
+- LLM calls default to sequential unless explicitly enabled.
+## 5. JSON Output Envelope
+Common envelope (recommended):
+```
+{
+  "command": "compare",
+  "version": 1,
+  "status": "ok",
+  "started_at": "2026-01-18T00:00:00Z",
+  "finished_at": "2026-01-18T00:00:05Z",
+  "duration_ms": 5000,
+  "artifacts": {
+    "dir": "reports/.../artifacts/...",
+    "index": "reports/.../artifacts/.../index.json"
+  },
+  "data": {}
+}
+```
+## 6. BDD Scenarios (Gherkin)
+### compare
+```
+Feature: Compare two evaluation runs
+  Scenario: Compare two runs with shared metrics
+    Given a database with runs "run_a" and "run_b"
+    When I run "evalvault compare run_a run_b --format json"
+    Then the command exits with code 0
+    And the JSON output contains "run_ids" ["run_a", "run_b"]
+```
+### calibrate-judge
+```
+Feature: Calibrate judge scoring
+  Scenario: Calibrate judge scores using feedback labels
+    Given a run "run_x" with feedback labels in storage
+    When I run "evalvault calibrate-judge run_x --labels-source feedback"
+    Then the command exits with code 0
+```
+### regress
+```
+Feature: Regression gate for CI
+  Scenario: Regression detected
+    Given a candidate run "run_new" and baseline "run_base"
+    When I run "evalvault regress run_new --baseline run_base"
+    Then the command exits with code 2
+```
+## 7. Non-goals
+- No distributed execution or multi-node scheduling
+- No new scoring algorithms; only orchestration and reporting
+- No breaking change to existing CLI
+## 8. Risks
+- Provider rate limits with parallel LLM calls
+- DB contention under high concurrency
+- Schema drift in artifacts without linting
+## 9. Mapping to Existing Modules (Evidence)
+- CLI app: `adapters/inbound/cli/app.py`
+- Command registration: `adapters/inbound/cli/commands/__init__.py`
+- Existing compare pipeline: `adapters/inbound/cli/commands/analyze.py`
+- Artifact utilities: `adapters/inbound/cli/utils/analysis_io.py`
+- Async batch executor: `domain/services/async_batch_executor.py`
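Sections 4 and 5 of the spec above call for bounded concurrency with deterministic aggregation and a common JSON envelope. The sketch below shows one way those two pieces could fit together using only `asyncio`. It is not the project's `async_batch_executor.py`; `run_bounded`, `make_envelope`, and `score_metric` are hypothetical names, and timestamps are assumed to be timezone-aware UTC values.

```python
import asyncio
from datetime import datetime, timezone


async def score_metric(name: str) -> float:
    """Hypothetical stand-in for a per-metric computation."""
    await asyncio.sleep(0)
    return round(len(name) / 100, 2)


async def run_bounded(items, worker, concurrency: int = 8) -> dict:
    """Run worker(item) for every item with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            return item, await worker(item)

    pairs = await asyncio.gather(*(guarded(item) for item in items))
    # gather keeps input order; sorting by key additionally makes the
    # aggregate independent of the order in which items were supplied.
    return dict(sorted(pairs, key=lambda pair: pair[0]))


def make_envelope(command: str, data: dict, started_at: datetime, finished_at: datetime) -> dict:
    """Wrap results in the envelope fields listed in section 5 (artifacts omitted)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # assumes both datetimes are in UTC
    return {
        "command": command,
        "version": 1,
        "status": "ok",
        "started_at": started_at.strftime(fmt),
        "finished_at": finished_at.strftime(fmt),
        "duration_ms": int((finished_at - started_at).total_seconds() * 1000),
        "data": data,
    }
```

A caller could run `asyncio.run(run_bounded(metrics, score_metric, concurrency=8))` and pass the resulting dict as `data`; keeping the key order stable makes JSON outputs diffable across runs.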

{evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/EVALVAULT_RUN_EXCEL_SHEETS.md RENAMED

@@ -65,6 +65,22 @@
   - `samples`: number of samples
 - Sample: `avg_score=0.7200`, `pass_rate=0.6`, `samples=30`
+## CustomMetrics
+- Column descriptions
+  - `schema_version`: snapshot schema version
+  - `metric_name`: metric name
+  - `source`: metric source (custom)
+  - `description`: metric description
+  - `evaluation_method`: evaluation method
+  - `inputs`: list of input fields
+  - `output`: score range/decision rule
+  - `evaluation_process`: summary of the evaluation process
+  - `rules`: keywords/regexes/weights, etc.
+  - `notes`: domain specialization/interpretation caveats
+  - `implementation_path`: implementation file path
+  - `implementation_hash`: implementation file hash
+- Sample: `metric_name=entity_preservation`, `evaluation_method=rule-based`
 ## RunPromptSets
 - Column descriptions
   - `run_id`: run ID
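The `CustomMetrics` sheet above records an `implementation_path` together with an `implementation_hash`. A row of that kind could be assembled as in the following sketch. The diff does not state which hash algorithm EvalVault uses, so SHA-256 is an assumption here, as are the function names and the `schema_version` value.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_snapshot_row(metric_name: str, evaluation_method: str, implementation_path: Path) -> dict:
    """Build a partial CustomMetrics row; only a subset of the columns is filled."""
    return {
        "schema_version": 1,  # assumed value, not taken from the source
        "metric_name": metric_name,
        "source": "custom",
        "evaluation_method": evaluation_method,
        "implementation_path": str(implementation_path),
        "implementation_hash": file_sha256(implementation_path),
    }
```

Storing the hash next to the path lets a reviewer tell whether two runs with the same metric name were scored by the same implementation.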

{evalvault-1.64.0 → evalvault-1.66.0}/docs/guides/EVALVAULT_WORK_PLAN.md RENAMED

@@ -1,10 +1,9 @@
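-# EvalVault Work Plan (RAGAS/Tracing/Prompt Override)
+# EvalVault Work Plan (Archived)
 ## 0) Purpose
-- Confirm that RAGAS evaluation → result storage → Phoenix tracing → further analysis → report (Markdown) **works correctly** end to end
-- Branch external log API input (assumed to be JSON) into **RAGAS-shaped/unstructured** and run the analysis
-- **Override** the RAGAS prompts and the system prompt **separately** and verify with actual runs
+- This document is classified as a past work plan and is retained for archival purposes only.
+- For the latest run scenarios, refer to `docs/guides/RAG_CLI_WORKFLOW_TEMPLATES.md`.
 ## 1) Assumptions and scope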

evalvault-1.66.0/docs/guides/Extension_2.md ADDED

@@ -0,0 +1,114 @@
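+# Data difficulty assessment for RAG systems and a fine-tuning strategy for evaluator LLMs (a realistic view)
+## 1. Data difficulty assessment framework: evidence exists, but the preconditions matter
+### 1.1 Core premises
+- Difficulty is determined by the interaction among question, context, and response, and is hard to capture with a single metric.
+- There is evidence that Retrieval Complexity (RC) shows a correlation between question difficulty and QA performance/expert judgment.
+- However, difficulty is a “proxy metric,” and its correlation with real operational data must be validated first.
+### 1.2 Difficulty axes (recommended)
+- Question complexity: compound questions, multi-step reasoning, presence of temporal/conditional context
+- Retrieval difficulty: whether the required evidence is spread across multiple documents; completeness of the retrieved set
+- Answer quality signals: gold labels/judge scores, faithfulness/answer relevancy
+- Noise/out-of-domain: no retrieval results, low confidence from the domain classifier
+### 1.3 Phased implementation (realistic)
+1. v0 (heuristic): query length, multi-hop flag, retrieval success/failure, top-k score distribution
+2. v1 (RC-based): apply an RRCP-style pipeline to estimate RC, and validate the correlation between difficulty and error rate
+3. v2 (difficulty in operations): manage difficulty distribution drift as a KPI, and set separate thresholds per difficulty band
+### 1.4 Handling noisy/erroneous input
+- Classify as noise: retrieval similarity below a floor, zero results, or low domain-classifier confidence
+- Tag noise cases separately and handle them downstream with a safe response
+### 1.5 Integration with EvalVault
+- Store difficulty scores as run_id artifacts so that performance trends can be compared by difficulty.
+- Verify whether changes in the difficulty distribution track quality degradation, to confirm whether they are the “real cause.”
+### 1.6 Domain examples (insurance/nuclear power)
+- Insurance
+  - Easy: “What is the eligible age for auto insurance?” (stated in a single document)
+  - Medium: “How does the premium change when the driver scope is changed?” (combination of rules and exceptions)
+  - Hard: “What is the coverage under indemnity medical insurance when a specific treatment is a non-benefit item?” (multi-document/conditional reasoning)
+- Nuclear power
+  - Easy: “What is the difference between the primary and secondary systems?” (definitional question)
+  - Medium: “What are the requirements at each step of the maintenance procedure?” (combination of procedures and conditions)
+  - Hard: “In a specific accident scenario, what is the actuation sequence of the safety systems, and on what basis?” (multi-step reasoning)
+---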
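+## 2. Fine-tuning an evaluator LLM (as-a-judge): cost savings are possible, but generalization risk exists
+### 2.1 Basic principles
+- Cost savings are achievable, but small judges are weak in generalization, fairness, and domain transfer.
+- Judge quality depends more on label quality and calibration than on model size.
+### 2.2 Data composition (required)
+- Human labels: question-context-response with a score (1 to 5) or a grade label
+- Preferences (pairwise): A/B comparison data (with reasons where possible)
+- Expert answers: assessment of agreement with, and omissions from, the reference answer
+- Operational logs: thumbs up/down, re-queries, dissatisfaction signals (weak labels)
+### 2.3 Training strategy (recommended)
+- Start with SFT, then add DPO or SLiC-HF once enough preference data is available
+- Fix the output format to a JSON schema to stabilize verdicts
+- For validation, check both the agreement rate with a GPT-4-class judge and the correlation with human evaluation
+### 2.4 Operational guardrails
+- Cascade evaluation: process the bulk with a small judge, then escalate only borderline cases to a stronger model
+- Calibration: correct scores with a small set of human labels and provide confidence intervals
+- Bias mitigation: swap/format randomization tests for position, format, and knowledge bias
+---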
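+## 3. Recent fine-tuning/efficiency techniques: assess “efficiency” and “evaluation quality” separately
+### 3.1 Guidance on when to apply
+- QLoRA/LoRA+/LoftQ help memory efficiency, but any gain in evaluation quality needs separate validation
+- LongLoRA/Cartridges/MQA help long-context/serving efficiency, but do not imply guaranteed judge performance
+- GaLore offers memory savings and the possibility of full updates, but increases operational complexity
+### 3.2 Recommended order of adoption
+1. Start with QLoRA + LoRA (or LoRA+)
+2. Consider extension techniques only after calibration and consistency are in place
+3. Apply long-context optimization only when a bottleneck is confirmed in real long-document workloads
+---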
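+## 4. Conclusion
+- Difficulty profiling is valid, but “correlation validation + operational KPIs” is a mandatory precondition.
+- Small judges help reduce cost but carry significant generalization/bias/consistency risk, so calibration and cascade operation are essential.
+- Recent fine-tuning techniques are tools for improving efficiency; they do not guarantee better evaluation quality.
+---
+## 5. Execution checklist
+- Data difficulty
+  - Check whether the v0 difficulty indicators correlate meaningfully with the error rate
+  - Verify whether difficulty distribution drift tracks actual quality drops
+- judge
+  - Obtain 3–5% human labels and generate a calibration report
+  - Define cascade escalation criteria (low-confidence/borderline cases)
+- Operations
+  - Confirm that difficulty and verdict rationale are stored in run_id artifacts
+  - Document per-difficulty thresholds and response policies
+---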
+## References
+- RC metric: https://aclanthology.org/2024.findings-acl.872/
+- GRADE difficulty matrix: https://arxiv.org/abs/2508.16994
+- QLoRA: https://arxiv.org/abs/2305.14314
+- LoftQ: https://arxiv.org/abs/2310.08659
+- LoRA+: https://arxiv.org/abs/2402.12354
+- LongLoRA: https://arxiv.org/abs/2309.12307
+- DPO: https://arxiv.org/abs/2305.18290
+- SLiC-HF: https://arxiv.org/abs/2305.10425
+- GaLore: https://arxiv.org/abs/2403.03507
+- Cartridges: https://arxiv.org/abs/2506.06266
+- MQA: https://arxiv.org/abs/1911.02150
+- JudgeLM: https://arxiv.org/abs/2310.17631
+- Fine-tuned judge limits: https://aclanthology.org/2025.findings-acl.306/
+- LLM judge reliability: https://arxiv.org/abs/2412.12509
+- LLM judge bias: https://llm-judge-bias.github.io/
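Section 2.4 of the document above recommends cascade evaluation: a small judge handles the bulk, and only borderline cases are escalated to a stronger model. The sketch below illustrates that routing rule under stated assumptions: judges are plain callables returning a score in [0, 1], and the borderline band of 0.4 to 0.6 is a hypothetical choice, not a value from the source.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    score: float     # judge score, assumed to lie in [0, 1]
    judged_by: str   # "small" or "large"


def cascade_judge(
    sample: str,
    small_judge: Callable[[str], float],
    large_judge: Callable[[str], float],
    low: float = 0.4,   # hypothetical lower edge of the borderline band
    high: float = 0.6,  # hypothetical upper edge of the borderline band
) -> Verdict:
    """Score with the small judge; escalate only when the score is borderline."""
    score = small_judge(sample)
    if low <= score <= high:
        return Verdict(score=large_judge(sample), judged_by="large")
    return Verdict(score=score, judged_by="small")
```

In practice the band would be tuned against human labels, in line with the calibration step in the checklist, so that the escalation rate matches the available budget for the stronger judge.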
