PyPI - evalvault - Versions diffs - 1.64.0__tar.gz → 1.65.0__tar.gz - Mend

evalvault 1.64.0__tar.gz → 1.65.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (828)

{evalvault-1.64.0 → evalvault-1.65.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: evalvault
-Version: 1.64.0
+Version: 1.65.0
 Summary: RAG evaluation system using Ragas with Phoenix/Langfuse tracing
 Project-URL: Homepage, https://github.com/ntts9990/EvalVault
 Project-URL: Documentation, https://github.com/ntts9990/EvalVault#readme

{evalvault-1.64.0 → evalvault-1.65.0}/docs/INDEX.md RENAMED

@@ -19,7 +19,10 @@
 - Web UI expansion design doc: `guides/WEBUI_CLI_ROLLOUT_PLAN.md` (includes the list of phase 1 implementation files)
 - RAGAS human feedback calibration: `guides/RAGAS_HUMAN_FEEDBACK_CALIBRATION_GUIDE.md`
 - Diagnostic playbook: `guides/EVALVAULT_DIAGNOSTIC_PLAYBOOK.md` (problem → analysis → interpretation → action flow)
+- RAG performance improvement proposal: `guides/RAG_PERFORMANCE_IMPROVEMENT_PROPOSAL.md` (purpose/mission, KPIs, roadmap)
+- CLI parallel features spec: `guides/CLI_PARALLEL_FEATURES_SPEC.md`
 - Run result Excel sheet summary: `guides/EVALVAULT_RUN_EXCEL_SHEETS.md`
+- Evaluation report templates: `templates/eval_report_templates.md`
 - Release checklist: `guides/RELEASE_CHECKLIST.md`
 - Status summary: `STATUS.md`
 - Roadmap: `ROADMAP.md`

{evalvault-1.64.0 → evalvault-1.65.0}/docs/ROADMAP.md RENAMED

@@ -24,6 +24,11 @@
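 - Incrementally extend the Open RAG Trace spec/samples to match real operational requirements (following the versioning policy)
 - Strengthen the guides for Collector configuration and data retention (artifact separation, PII masking)
+### P3 (Performance Improvement Roadmap)
+- Establish KPIs, the evaluation protocol, and the roadmap based on the RAG performance improvement proposal
+- Integrate retrieval/reranking/GraphRAG experiments with operational metrics
+- Refine the improvement loop from expert perspectives (cognitive/UX/operations)
 ## Work Tracking
 - Concrete plans at the issue/PR level are managed in GitHub Issues/PRs.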

{evalvault-1.64.0 → evalvault-1.65.0}/docs/STATUS.md RENAMED

@@ -12,6 +12,7 @@ EvalVault's goal is **RAG evaluation/analysis/tracing in a single Run unit
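 - **Observability**: Phoenix (OpenTelemetry/OpenInference) and, optionally, Langfuse/MLflow
 - **Profile-based model switching**: OpenAI/Ollama/vLLM/Anthropic and others via `config/models.yaml` + `.env`
 - **Open RAG Trace standard**: instrument and collect external/internal RAG systems with a standard schema
+- **Performance improvement framework**: KPIs, evaluation, and roadmap are organized in `guides/RAG_PERFORMANCE_IMPROVEMENT_PROPOSAL.md`
 ## Current Constraints (Transparent Disclosure)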

evalvault-1.65.0/docs/guides/CLI_PARALLEL_FEATURES_SPEC.md ADDED

@@ -0,0 +1,315 @@
+# CLI Parallel Features Spec (Draft)
+> Audience: CLI/Platform contributors
+> Purpose: Future CLI features aligned with SOLID, BDD, hexagonal & clean architecture
+> Last Updated: 2026-01-18
+## 1. Overview
+This document specifies new CLI features that are parallel-by-default, deterministic, and cleanly separated by ports/adapters. The scope is design-level documentation with stable JSON outputs and BDD scenarios.
+Design goals:
+- SOLID: each command = one use-case orchestrator; dependencies injected via ports
+- Clean/Hexagonal: CLI is an inbound adapter; domain services depend on outbound ports only
+- Parallel execution: bounded concurrency with deterministic aggregation
+- BDD: user-visible behavior is defined via Gherkin scenarios
+Collaboration rules (conflict avoidance):
+- Each stream modifies only its own files; no two streams touch the same file.
+- Shared schemas or interfaces change only after explicit agreement.
+- Documentation edits are assigned to a single owner to avoid merge conflicts.
+## 1.1 Parallel Agent Implementation Plan (Execution)
+Scope:
+- Implement all commands below in parallel (CLI + domain services + ports + adapters).
+- Each command is owned by exactly one agent end-to-end.
+Ownership:
+- Agent Compare: `evalvault compare`
+- Agent Calibrate: `evalvault calibrate-judge`
+- Agent Difficulty: `evalvault profile-difficulty`
+- Agent Regress: `evalvault regress`
+- Agent Artifacts: `evalvault artifacts lint`
+- Agent Ops: `evalvault ops snapshot`
+File boundaries (default):
+- CLI command module for the command
+- Domain service (one use-case service per command)
+- Outbound port interfaces needed by that service
+- Outbound adapters for storage/reporting/FS as needed
+- Tests for the command/service
+Shared files (change only with explicit agreement):
+- `adapters/inbound/cli/app.py`
+- `adapters/inbound/cli/commands/__init__.py`
+- Common JSON envelope schema or report templates
+- `domain/services/async_batch_executor.py`
+Definition of done (per agent):
+- CLI command registered and functional with `--help` and a basic run path
+- Domain service + ports/adapters implemented for the use-case
+- Tests added for core logic and CLI wiring
+- Tests and lint pass with the standard project commands
+Test commands (standard project flow):
+- `uv run ruff check src/ tests/`
+- `uv run ruff format src/ tests/`
+- `uv run pytest tests -v`
+## 2. Command Specs
+### 2.1 `evalvault compare`
+Purpose:
+- Compare two runs (metrics, prompts/config diffs, difficulty distribution) and output a unified report.
+Synopsis:
+```
+uv run evalvault compare RUN_A RUN_B \
+  --db data/db/evalvault.db \
+  --metrics faithfulness,answer_relevancy \
+  --test t-test \
+  --format table \
+  --output reports/comparison/comparison_RUNA_RUNB.json \
+  --report reports/comparison/comparison_RUNA_RUNB.md \
+  --output-dir reports/comparison \
+  --artifacts-dir reports/comparison/artifacts/comparison_RUNA_RUNB \
+  --parallel --concurrency 8
+```
+Options:
+- `--db, -D <path>`: SQLite database path
+- `--metrics, -m <csv>`: allowlist of metrics
+- `--test, -t <t-test|mann-whitney>`
+- `--format, -f <table|json>`
+- `--output, -o <path>`
+- `--report <path>`
+- `--output-dir <path>`
+- `--artifacts-dir <path>`
+- `--parallel/--no-parallel`, `--concurrency <int>`
+Exit codes:
+- `0`: success
+- `1`: invalid args or missing run
+- `2`: report generation degraded
+### 2.2 `evalvault calibrate-judge`
+Purpose:
+- Calibrate judge scores and emit reliability summary.
+Synopsis:
+```
+uv run evalvault calibrate-judge RUN_ID \
+  --db data/db/evalvault.db \
+  --labels-source feedback \
+  --method isotonic \
+  --metric faithfulness \
+  --holdout-ratio 0.2 \
+  --seed 42 \
+  --write-back \
+  --output reports/calibration/judge_calibration_RUNID.json \
+  --parallel --concurrency 8
+```
+Options:
+- `--labels-source <feedback|gold|hybrid>`
+- `--method <platt|isotonic|temperature|none>`
+- `--metric <name>` (repeatable)
+- `--holdout-ratio <float>`
+- `--seed <int>`
+- `--write-back`
+- `--output, -o <path>`
+- `--artifacts-dir <path>`
+- `--parallel/--no-parallel`, `--concurrency <int>`
+Exit codes:
+- `0`: success
+- `1`: labels missing / invalid args
+- `2`: calibration quality below gate
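To illustrate the `--method isotonic` option, isotonic calibration can be fitted with the pool-adjacent-violators (PAV) algorithm. The following is a minimal stdlib-only sketch, not EvalVault's implementation: the function names `fit_isotonic` and `calibrate` are hypothetical, and the step-function lookup without interpolation is an assumption.

```python
from bisect import bisect_right


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function with pool-adjacent-violators (PAV)."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [label_sum, count]
    for _, label in pairs:
        blocks.append([float(label), 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
    fitted = []
    for label_sum, count in blocks:
        fitted.extend([label_sum / count] * count)
    xs = [score for score, _ in pairs]
    return xs, fitted


def calibrate(model, raw_score):
    """Map a raw judge score to the fitted value of the nearest lower knot."""
    xs, fitted = model
    position = bisect_right(xs, raw_score) - 1
    return fitted[max(position, 0)]
```

In this sketch, raw scores below the smallest training score fall back to the first fitted value; a real calibrator may prefer interpolation between knots.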
+### 2.3 `evalvault profile-difficulty`
+Purpose:
+- Compute difficulty buckets for a dataset or a run.
+Synopsis:
+```
+uv run evalvault profile-difficulty \
+  --db data/db/evalvault.db \
+  --dataset-name insurance-qa \
+  --limit-runs 50 \
+  --metrics faithfulness,answer_relevancy \
+  --bucket-count 5 \
+  --output reports/difficulty/difficulty_insurance-qa.json \
+  --parallel --concurrency 8
+```
+Options:
+- `--dataset-name <string>` or `--run-id <id>`
+- `--limit-runs <int>`
+- `--metrics, -m <csv>`
+- `--bucket-count <int>`
+- `--min-samples <int>`
+- `--output, -o <path>`
+- `--artifacts-dir <path>`
+- `--parallel/--no-parallel`, `--concurrency <int>`
+Exit codes:
+- `0`: success
+- `1`: insufficient history or invalid args
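One way to read `--bucket-count` is as equal-frequency bucketing over per-case difficulty scores. The sketch below assumes exactly that; the function name `assign_buckets` and the tie-break on case ID are hypothetical choices made so that the output stays deterministic.

```python
def assign_buckets(difficulty_by_case, bucket_count=5):
    """Assign each case to one of `bucket_count` equal-frequency buckets (0 = easiest)."""
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")
    # Sort by difficulty, then by case ID, so ties never reorder between runs.
    ranked = sorted(difficulty_by_case.items(), key=lambda item: (item[1], item[0]))
    total = len(ranked)
    return {
        case_id: rank * bucket_count // total
        for rank, (case_id, _) in enumerate(ranked)
    }
```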
+### 2.4 `evalvault regress`
+Purpose:
+- CI-grade regression gate vs baseline run.
+Synopsis:
+```
+uv run evalvault regress RUN_CANDIDATE \
+  --db data/db/evalvault.db \
+  --baseline RUN_BASELINE \
+  --fail-on-regression 0.05 \
+  --test t-test \
+  --metrics faithfulness,answer_relevancy \
+  --format github-actions \
+  --output reports/regress/regress_RUNCAND.json \
+  --parallel --concurrency 8
+```
+Exit codes:
+- `0`: pass
+- `1`: invalid input
+- `2`: regression detected
+- `3`: internal error
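The exit codes above can be sketched as a small gate function. This is an illustration only: the spec does not define how `--fail-on-regression 0.05` is applied, so the sketch assumes it is an absolute drop in the mean metric score, and it skips the statistical test selected by `--test`. The names `RegressExit` and `regression_exit_code` are hypothetical.

```python
from enum import IntEnum
from statistics import mean


class RegressExit(IntEnum):
    PASS = 0
    INVALID_INPUT = 1
    REGRESSION = 2
    INTERNAL_ERROR = 3


def regression_exit_code(baseline, candidate, threshold=0.05):
    """Map per-metric score lists of two runs to an exit code.

    Assumption: a regression is a drop in the mean score larger than `threshold`.
    """
    try:
        if not baseline or set(baseline) != set(candidate):
            return RegressExit.INVALID_INPUT
        # Iterate in sorted order so the first reported regression is stable.
        for metric in sorted(baseline):
            if not baseline[metric] or not candidate[metric]:
                return RegressExit.INVALID_INPUT
            drop = mean(baseline[metric]) - mean(candidate[metric])
            if drop > threshold:
                return RegressExit.REGRESSION
        return RegressExit.PASS
    except Exception:
        return RegressExit.INTERNAL_ERROR
```

A CLI wrapper could pass the result to `sys.exit(int(code))` so that CI systems see the documented codes.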
+### 2.5 `evalvault artifacts lint`
+Purpose:
+- Validate required artifacts and schema invariants.
+Synopsis:
+```
+uv run evalvault artifacts lint ARTIFACT_DIR \
+  --strict \
+  --format json \
+  --output reports/artifacts_lint/lint_RUNID.json \
+  --parallel --concurrency 16
+```
+Checks:
+- `index.json` presence
+- required paths exist
+- JSON schema validation
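The first two checks can be sketched with the standard library alone. The layout of `index.json` is not given in this diff, so the sketch assumes a hypothetical top-level `"files"` list of relative paths. JSON schema validation is left out because the standard library has no schema validator.

```python
import json
from pathlib import Path


def lint_artifacts(artifact_dir):
    """Check index.json presence and the paths it lists (hypothetical 'files' key)."""
    root = Path(artifact_dir)
    index_path = root / "index.json"
    if not index_path.is_file():
        return {"ok": False, "issues": ["index.json is missing"]}
    try:
        index = json.loads(index_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return {"ok": False, "issues": [f"index.json is not valid JSON: {exc.msg}"]}
    issues = []
    files = index.get("files") if isinstance(index, dict) else None
    if not isinstance(files, list):
        issues.append("index.json has no 'files' list")
        files = []
    # Sort so the issue list is identical across runs and platforms.
    for relative in sorted(str(entry) for entry in files):
        if not (root / relative).exists():
            issues.append(f"missing path: {relative}")
    return {"ok": not issues, "issues": issues}
```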
+### 2.6 `evalvault ops snapshot`
+Purpose:
+- Collect reproducibility metadata (profile, model config, env redactions).
+Synopsis:
+```
+uv run evalvault ops snapshot \
+  --profile dev \
+  --db data/db/evalvault.db \
+  --run-id RUN_ID \
+  --include-model-config \
+  --include-env \
+  --redact OPENAI_API_KEY \
+  --output reports/ops/snapshot_RUNID.json
+```
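The `--include-env` and `--redact` pair suggests that environment values are captured with selected keys masked. A minimal sketch of that idea follows; the function name `snapshot_env` and the placeholder string are assumptions, not part of the spec.

```python
def snapshot_env(env, redact_keys, placeholder="***REDACTED***"):
    """Return a sorted copy of `env` with the values of `redact_keys` masked."""
    redact = set(redact_keys)
    return {
        key: (placeholder if key in redact else value)
        for key, value in sorted(env.items())
    }
```

Calling `snapshot_env(dict(os.environ), ["OPENAI_API_KEY"])` would keep the key name in the snapshot while hiding its value, which preserves the fact that the variable was set.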
+## 3. Architecture Alignment
+### 3.1 SOLID
+- SRP: each command orchestrates a single use-case service
+- OCP: add new commands via new registrars without modifying core command modules
+- DIP: domain services depend on ports (StoragePort, ReportPort, FileSystemPort)
+### 3.2 Hexagonal/Clean
+- Inbound adapter: `adapters/inbound/cli/commands/*`
+- Domain services: `domain/services/*` for use-cases
+- Outbound ports: `ports/outbound/*`
+- Outbound adapters: sqlite storage, report writers, LLM providers
+### 3.3 Proposed Services (Draft)
+- `RunComparisonService`
+- `JudgeCalibrationService`
+- `DifficultyProfilingService`
+- `RegressionGateService`
+- `ArtifactLintService`
+- `OpsSnapshotService`
+## 4. Parallel Execution Model
+- Use bounded concurrency (`--concurrency`) and deterministic aggregation.
+- Candidate base utility: `domain/services/async_batch_executor.py`.
+- Parallelize per-metric/per-case computations; merge results with stable sorting.
+- LLM calls default to sequential unless explicitly enabled.
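The model above (bounded concurrency plus deterministic aggregation) can be sketched with `asyncio`. This is an illustration under stated assumptions and does not reproduce `async_batch_executor.py`, whose contents are not shown in this diff; `run_bounded`, `score_metric`, and `compute` are hypothetical names, and `score_metric` is a stand-in for a real per-metric computation.

```python
import asyncio


async def run_bounded(items, worker, concurrency=8):
    """Run `worker` over `items` with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(index, item):
        async with semaphore:
            return index, await worker(item)

    pairs = await asyncio.gather(*(guarded(i, item) for i, item in enumerate(items)))
    # Sort by input index so the merged output does not depend on completion order.
    return [result for _, result in sorted(pairs, key=lambda pair: pair[0])]


async def score_metric(name):
    """Stand-in worker: sleeps a name-dependent time, then returns a dummy score."""
    await asyncio.sleep(0.01 * (len(name) % 3))
    return {"metric": name, "score": len(name) / 20}


def compute(metrics, concurrency=8):
    return asyncio.run(run_bounded(metrics, score_metric, concurrency))
```

With this shape, `--no-parallel` could map to `concurrency=1`, which yields the same output as any higher setting.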
+## 5. JSON Output Envelope
+Common envelope (recommended):
+```
+{
+  "command": "compare",
+  "version": 1,
+  "status": "ok",
+  "started_at": "2026-01-18T00:00:00Z",
+  "finished_at": "2026-01-18T00:00:05Z",
+  "duration_ms": 5000,
+  "artifacts": {
+    "dir": "reports/.../artifacts/...",
+    "index": "reports/.../artifacts/.../index.json"
+  },
+  "data": {}
+}
+```
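A helper that produces this envelope might look like the sketch below. The field names come from the example above; the function name `build_envelope` and the rule that `index` is `<dir>/index.json` are assumptions.

```python
import json
from datetime import datetime, timezone


def _iso_utc(moment):
    """Format an aware datetime as a UTC timestamp with a trailing 'Z'."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_envelope(command, data, started_at, finished_at, artifacts_dir, status="ok"):
    """Wrap command-specific `data` in the common JSON envelope."""
    duration_ms = int((finished_at - started_at).total_seconds() * 1000)
    return {
        "command": command,
        "version": 1,
        "status": status,
        "started_at": _iso_utc(started_at),
        "finished_at": _iso_utc(finished_at),
        "duration_ms": duration_ms,
        "artifacts": {
            "dir": artifacts_dir,
            "index": f"{artifacts_dir}/index.json",
        },
        "data": data,
    }


def dump_envelope(envelope):
    """Serialize with sorted keys so repeated runs produce byte-identical output."""
    return json.dumps(envelope, sort_keys=True, ensure_ascii=False, indent=2)
```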
+## 6. BDD Scenarios (Gherkin)
+### compare
+```
+Feature: Compare two evaluation runs
+  Scenario: Compare two runs with shared metrics
+    Given a database with runs "run_a" and "run_b"
+    When I run "evalvault compare run_a run_b --format json"
+    Then the command exits with code 0
+    And the JSON output contains "run_ids" ["run_a", "run_b"]
+```
+### calibrate-judge
+```
+Feature: Calibrate judge scoring
+  Scenario: Calibrate judge scores using feedback labels
+    Given a run "run_x" with feedback labels in storage
+    When I run "evalvault calibrate-judge run_x --labels-source feedback"
+    Then the command exits with code 0
+```
+### regress
+```
+Feature: Regression gate for CI
+  Scenario: Regression detected
+    Given a candidate run "run_new" and baseline "run_base"
+    When I run "evalvault regress run_new --baseline run_base"
+    Then the command exits with code 2
+```
+## 7. Non-goals
+- No distributed execution or multi-node scheduling
+- No new scoring algorithms; only orchestration and reporting
+- No breaking change to existing CLI
+## 8. Risks
+- Provider rate limits with parallel LLM calls
+- DB contention under high concurrency
+- Schema drift in artifacts without linting
+## 9. Mapping to Existing Modules (Evidence)
+- CLI app: `adapters/inbound/cli/app.py`
+- Command registration: `adapters/inbound/cli/commands/__init__.py`
+- Existing compare pipeline: `adapters/inbound/cli/commands/analyze.py`
+- Artifact utilities: `adapters/inbound/cli/utils/analysis_io.py`
+- Async batch executor: `domain/services/async_batch_executor.py`

evalvault-1.65.0/docs/guides/Extension_2.md ADDED

@@ -0,0 +1,114 @@
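+# RAG System Data Difficulty Assessment and Evaluator-LLM Fine-Tuning Strategy (A Realistic View)
+## 1. Data Difficulty Assessment Framework: Evidence Exists, but the Preconditions Matter
+### 1.1 Core Premises
+- Difficulty is determined by the interaction among question, context, and response, and is hard to capture with a single metric.
+- There is evidence that Retrieval Complexity (RC), as a measure of question difficulty, correlates with QA performance and expert judgment.
+- However, difficulty is a "proxy metric", so its correlation with real operational data must be validated first.
+### 1.2 Difficulty Axes (Recommended)
+- Question complexity: compound questions, multi-step reasoning, whether temporal/conditional context is involved
+- Retrieval difficulty: whether the required evidence is spread across multiple documents, completeness of the retrieved set
+- Answer quality signals: gold labels/judge scores, faithfulness/answer relevancy
+- Noise/out-of-domain: no retrieval results, low confidence from the domain classifier
+### 1.3 Phased Implementation (Realistic)
+1. v0 (heuristic): query length, multi-hop flag, retrieval success/failure, top-k score distribution
+2. v1 (RC-based): apply an RRCP-style pipeline to estimate RC, validate the correlation between difficulty and error rate
+3. v2 (difficulty in operations): manage difficulty distribution drift as a KPI, use separate thresholds per difficulty band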
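The v0 heuristic stage can be sketched as a weighted sum of the four signals it lists. Every number here (the weights, the 40-word cap, the 0.25 spread scale) is a hypothetical placeholder that would need the correlation check described for v1; the function name `difficulty_v0` is also an assumption.

```python
from statistics import pstdev


def difficulty_v0(query, is_multi_hop, retrieval_ok, top_k_scores):
    """Return a heuristic difficulty in [0, 1] from the four v0 signals."""
    # Longer queries count as harder, capped at 40 whitespace-separated tokens.
    length_term = min(len(query.split()) / 40, 1.0)
    hop_term = 1.0 if is_multi_hop else 0.0
    miss_term = 0.0 if retrieval_ok else 1.0
    # A flat top-k score distribution means no passage clearly stands out.
    if len(top_k_scores) >= 2:
        flat_term = 1.0 - min(pstdev(top_k_scores) / 0.25, 1.0)
    else:
        flat_term = 1.0
    score = 0.2 * length_term + 0.3 * hop_term + 0.3 * miss_term + 0.2 * flat_term
    return round(score, 4)
```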
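+### 1.4 Handling Noisy/Erroneous Inputs
+- Classify the following as noise: retrieval similarity below a floor, zero results, low-confidence domain classification
+- Separate noise cases with a dedicated tag and handle them downstream with a safe response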
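The two rules above can be sketched as a tagging step followed by a routing step. The threshold defaults and the tag names are hypothetical; the source only states which conditions count as noise.

```python
def noise_tags(top_k_scores, domain_confidence,
               min_similarity=0.3, min_domain_confidence=0.5):
    """Return noise tags for one case; an empty list means the case is clean."""
    tags = []
    if not top_k_scores:
        tags.append("no_results")
    elif max(top_k_scores) < min_similarity:
        tags.append("low_similarity")
    if domain_confidence < min_domain_confidence:
        tags.append("low_domain_confidence")
    return tags


def route(tags):
    """Send any tagged case to the safe-response path."""
    return "safe_response" if tags else "answer"
```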
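+### 1.5 EvalVault Integration
+- Store difficulty scores as run_id artifacts so that performance trends can be compared by difficulty.
+- Verify whether changes in the difficulty distribution track quality degradation, to confirm whether they are the "real cause".
+### 1.6 Domain Examples (Insurance/Nuclear Power)
+- Insurance
+  - Easy: "What is the eligible age for auto insurance?" (stated explicitly in a single document)
+  - Medium: "How does the premium change when the covered driver scope changes?" (combination of rules and exceptions)
+  - Hard: "What is the coverage under indemnity health insurance when a specific treatment is not covered by national health insurance?" (multi-document/conditional reasoning)
+- Nuclear power
+  - Easy: "What is the difference between the primary and secondary systems?" (definitional question)
+  - Medium: "What are the requirements at each step of the maintenance procedure?" (combination of procedures and conditions)
+  - Hard: "In a specific accident scenario, what is the actuation sequence of the safety systems and its rationale?" (multi-step reasoning)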
+---
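+## 2. Fine-Tuning an Evaluator LLM (as-a-judge): Cost Savings Are Possible, Generalization Risk Remains
+### 2.1 Basic Principles
+- Cost savings are possible, but small judges are weak in generalization, fairness, and domain transfer.
+- Judge quality depends more on label quality and calibration than on model size.
+### 2.2 Data Composition (Required)
+- Human labels: question-context-response triples with a score (1–5) or a grade label
+- Preference (pairwise): A/B comparison data (with reasons where possible)
+- Expert answers: assessment of agreement with, and omissions from, the reference answer
+- Operational logs: thumbs up/down, re-queries, dissatisfaction signals (weak labels)
+### 2.3 Training Strategy (Recommended)
+- Start with SFT, then add DPO or SLiC-HF once there is enough preference data
+- Fix the output format to a JSON schema to keep judgments stable
+- For validation, check both the agreement rate with a GPT-4-class judge and the correlation with human evaluation
+### 2.4 Operational Guardrails
+- Cascade evaluation: process the bulk with a small judge, then escalate only boundary cases to a larger model
+- Calibration: correct scores with a small set of human labels and provide confidence intervals
+- Bias mitigation: swap/format randomization tests for position/format/knowledge bias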
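The cascade guardrail can be sketched as a routing loop: the small judge scores every case, and only scores inside a boundary band are re-scored by the larger judge. The band `[0.4, 0.6]` and the callable interface (`small_judge`, `large_judge` taking a case and returning a score in `[0, 1]`) are assumptions for illustration.

```python
def cascade_judge(cases, small_judge, large_judge, low=0.4, high=0.6):
    """Score all cases with the small judge; escalate boundary cases to the large one."""
    results = []
    for case in cases:
        score = small_judge(case)
        if low <= score <= high:
            # Boundary case: the small judge is not decisive, so escalate.
            results.append({"case": case, "score": large_judge(case), "judge": "large"})
        else:
            results.append({"case": case, "score": score, "judge": "small"})
    return results


def escalation_rate(results):
    """Share of cases that needed the larger judge (a rough cost indicator)."""
    if not results:
        return 0.0
    return sum(1 for item in results if item["judge"] == "large") / len(results)
```

Tracking `escalation_rate` over time gives a rough view of how much of the cost saving from the small judge is actually realized.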
+---
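+## 3. Recent Fine-Tuning/Efficiency Techniques: Judge "Efficiency" and "Evaluation Quality" Separately
+### 3.1 When to Apply
+- QLoRA/LoRA+/LoftQ help memory efficiency, but any gain in evaluation quality needs separate validation
+- LongLoRA/Cartridges/MQA help long-context/serving efficiency but do not guarantee judge performance
+- GaLore offers memory savings and the possibility of full updates, at the cost of more operational complexity
+### 3.2 Recommended Order of Adoption
+1. Start with QLoRA + LoRA (or LoRA+)
+2. Consider extended techniques only after calibration and consistency are in place
+3. Apply long-context optimization only when a bottleneck is confirmed in real long-context workloads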
+---
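+## 4. Conclusion
+- Difficulty profiling is valid, but "correlation validation + turning it into an operational KPI" is a mandatory precondition.
+- Small judges help reduce cost but carry significant generalization/bias/consistency risks, so calibration and cascade operation are mandatory.
+- Recent fine-tuning techniques are tools for improving efficiency and do not guarantee better evaluation quality.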
+---
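+## 5. Execution Checklist
+- Data difficulty
+  - Confirm that the v0 difficulty metrics correlate meaningfully with the error rate
+  - Verify that difficulty distribution drift tracks actual quality decline
+- judge
+  - Secure human labels for 3–5% of cases and generate a calibration report
+  - Define cascade escalation conditions (low-confidence/boundary cases)
+- Operations
+  - Confirm that difficulty and judgment rationale are stored in run_id artifacts
+  - Document per-difficulty thresholds and response policies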
+---
+## References
+- RC metric: https://aclanthology.org/2024.findings-acl.872/
+- GRADE difficulty matrix: https://arxiv.org/abs/2508.16994
+- QLoRA: https://arxiv.org/abs/2305.14314
+- LoftQ: https://arxiv.org/abs/2310.08659
+- LoRA+: https://arxiv.org/abs/2402.12354
+- LongLoRA: https://arxiv.org/abs/2309.12307
+- DPO: https://arxiv.org/abs/2305.18290
+- SLiC-HF: https://arxiv.org/abs/2305.10425
+- GaLore: https://arxiv.org/abs/2403.03507
+- Cartridges: https://arxiv.org/abs/2506.06266
+- MQA: https://arxiv.org/abs/1911.02150
+- JudgeLM: https://arxiv.org/abs/2310.17631
+- Fine-tuned judge limits: https://aclanthology.org/2025.findings-acl.306/
+- LLM judge reliability: https://arxiv.org/abs/2412.12509
+- LLM judge bias: https://llm-judge-bias.github.io/
