eval-ai-library 0.3.3__py3-none-any.whl → 0.3.10__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of eval-ai-library might be problematic.

@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: eval-ai-library
3
- Version: 0.3.3
3
+ Version: 0.3.10
4
4
  Summary: Comprehensive AI Model Evaluation Framework with support for multiple LLM providers
5
5
  Author-email: Aleksandr Meshkov <alekslynx90@gmail.com>
6
6
  License: MIT
@@ -45,6 +45,7 @@ Requires-Dist: html2text>=2020.1.16
45
45
  Requires-Dist: markdown>=3.4.0
46
46
  Requires-Dist: pandas>=2.0.0
47
47
  Requires-Dist: striprtf>=0.0.26
48
+ Requires-Dist: flask>=3.0.0
48
49
  Provides-Extra: dev
49
50
  Requires-Dist: pytest>=7.0.0; extra == "dev"
50
51
  Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
@@ -807,6 +808,170 @@ response, cost = await chat_complete(
807
808
  )
808
809
  ```
809
810
 
811
+ ## Dashboard
812
+
813
+ The library includes an interactive web dashboard for visualizing evaluation results. Evaluation results are automatically saved to a local cache and can be viewed in the browser.
814
+
815
+ ### Features
816
+
817
+ - 📊 **Interactive Charts**: Visual representation of metrics with Chart.js
818
+ - 📈 **Metrics Summary**: Aggregate statistics across all evaluations
819
+ - 🔍 **Detailed View**: Drill down into individual test cases and metric results
820
+ - 💾 **Session History**: Access past evaluation runs
821
+ - 🎨 **Beautiful UI**: Modern, responsive interface with color-coded results
822
+ - 🔄 **Real-time Updates**: Refresh to see new evaluation results
823
+
824
+ ### Starting the Dashboard
825
+
826
+ The dashboard runs as a separate server that you start once and keep running:
827
+ ```bash
828
+ # Start dashboard server (from your project directory)
829
+ eval-lib dashboard
830
+
831
+ # Custom port if 14500 is busy
832
+ eval-lib dashboard --port 8080
833
+
834
+ # Custom cache directory
835
+ eval-lib dashboard --cache-dir /path/to/cache
836
+ ```
837
+
838
+ Once started, the dashboard is available at `http://localhost:14500` (or the port you passed with `--port`).
839
+
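+ The dashboard server also exposes the small JSON API that the page itself uses: `/api/latest`, `/api/sessions`, `/api/session/<session_id>`, and `/api/clear` (see `eval_lib/cli.py` later in this diff). A minimal sketch for polling it from Python, assuming the default port and using only the standard library:
+ ```python
+ import json
+ import urllib.request
+ from urllib.error import HTTPError
+ 
+ try:
+     # Fetch the most recent cached session from a running dashboard
+     with urllib.request.urlopen("http://localhost:14500/api/latest") as resp:
+         latest = json.load(resp)
+     print(latest["session_id"], latest["timestamp"])
+ except HTTPError:
+     # The endpoint returns 404 when nothing has been cached yet
+     print("No results cached yet")
+ ```
+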
840
+ ### Saving Results to Dashboard
841
+
842
+ Enable dashboard cache saving in your evaluation:
843
+ ```python
844
+ import asyncio
845
+ from eval_lib import (
846
+     evaluate,
847
+     EvalTestCase,
848
+     AnswerRelevancyMetric,
849
+     FaithfulnessMetric
850
+ )
851
+
852
+ async def evaluate_with_dashboard():
853
+     test_cases = [
854
+         EvalTestCase(
855
+             input="What is the capital of France?",
856
+             actual_output="Paris is the capital.",
857
+             expected_output="Paris",
858
+             retrieval_context=["Paris is the capital of France."]
859
+         )
860
+     ]
861
+
862
+     metrics = [
863
+         AnswerRelevancyMetric(model="gpt-4o-mini", threshold=0.7),
864
+         FaithfulnessMetric(model="gpt-4o-mini", threshold=0.8)
865
+     ]
866
+
867
+     # Results are saved to .eval_cache/ for dashboard viewing
868
+     results = await evaluate(
869
+         test_cases=test_cases,
870
+         metrics=metrics,
871
+         show_dashboard=True,  # ← Enable dashboard cache
872
+         session_name="My First Evaluation"  # Optional session name
873
+     )
874
+
875
+     return results
876
+
877
+ asyncio.run(evaluate_with_dashboard())
878
+ ```
879
+
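+ Under the hood, `show_dashboard=True` calls `save_results_to_cache()` from `eval_lib.dashboard_server` (added later in this diff), which appends one session entry to `.eval_cache/results.json`. The entry is shaped roughly like this sketch of the structure built by `_parse_results` (field values here are illustrative only):
+ ```python
+ # Illustrative session entry; the values are made up
+ {
+     "session_id": "My First Evaluation",
+     "timestamp": "2024-01-01 12:00:00",
+     "data": {
+         "test_cases": [...],       # per-test inputs/outputs and metric results
+         "metrics_summary": {...},  # avg_score, passed/failed, success_rate per metric
+         "total_cost": 0.00123,
+         "total_tests": 1,
+     },
+ }
+ ```
+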
880
+ ### Typical Workflow
881
+
882
+ **Terminal 1 - Start Dashboard (once):**
883
+ ```bash
884
+ cd ~/my_project
885
+ eval-lib dashboard
886
+ # Leave this terminal open - dashboard stays running
887
+ ```
888
+
889
+ **Terminal 2 - Run Evaluations (multiple times):**
890
+ ```python
891
+ # Run evaluation 1
892
+ results1 = await evaluate(
893
+     test_cases=test_cases1,
894
+     metrics=metrics,
895
+     show_dashboard=True,
896
+     session_name="Evaluation 1"
897
+ )
898
+
899
+ # Run evaluation 2
900
+ results2 = await evaluate(
901
+     test_cases=test_cases2,
902
+     metrics=metrics,
903
+     show_dashboard=True,
904
+     session_name="Evaluation 2"
905
+ )
906
+
907
+ # All results are cached and viewable in the dashboard
908
+ ```
909
+
910
+ **Browser:**
911
+ - Open `http://localhost:14500`
912
+ - Refresh page (F5) to see new evaluation results
913
+ - Switch between different evaluation sessions using the dropdown
914
+
915
+ ### Dashboard Views
916
+
917
+ **Summary Cards:**
918
+ - Total test cases evaluated
919
+ - Total cost across all evaluations
920
+ - Number of metrics used
921
+
922
+ **Metrics Overview:**
923
+ - Average scores per metric
924
+ - Pass/fail counts
925
+ - Success rates
926
+ - Model used for evaluation
927
+ - Total cost per metric
928
+
929
+ **Detailed Results Table:**
930
+ - Test case inputs and outputs
931
+ - Individual metric scores
932
+ - Pass/fail status
933
+ - Click "View Details" for full information including:
934
+ - Complete input/output/expected output
935
+ - Full retrieval context
936
+ - Detailed evaluation reasoning
937
+ - Complete evaluation logs
938
+
939
+ **Charts:**
940
+ - Bar chart: Average scores by metric
941
+ - Doughnut chart: Success rate distribution
942
+
943
+ ### Cache Management
944
+
945
+ Results are stored in `.eval_cache/results.json` in your project directory:
946
+ ```bash
947
+ # View cache contents
948
+ cat .eval_cache/results.json
949
+
950
+ # Clear cache via dashboard
951
+ # Click "Clear Cache" button in dashboard UI
952
+
953
+ # Or manually delete cache
954
+ rm -rf .eval_cache/
955
+ ```
956
+
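+ You can also inspect the cache programmatically through the `DashboardCache` class (defined in `eval_lib/dashboard_server.py`, shown later in this diff); a minimal sketch:
+ ```python
+ from eval_lib.dashboard_server import DashboardCache
+ 
+ # Loads .eval_cache/results.json from the current directory by default
+ cache = DashboardCache()
+ 
+ for session in cache.get_all():
+     data = session["data"]
+     print(session["session_id"], session["timestamp"], data["total_tests"])
+ 
+ # cache.clear()  # removes all sessions, like the dashboard's "Clear Cache" button
+ ```
+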
957
+ ### CLI Commands
958
+ ```bash
959
+ # Start dashboard with defaults
960
+ eval-lib dashboard
961
+
962
+ # Custom port
963
+ eval-lib dashboard --port 8080
964
+
965
+ # Custom cache directory
966
+ eval-lib dashboard --cache-dir /path/to/project/.eval_cache
967
+
968
+ # Check library version
969
+ eval-lib version
970
+
971
+ # Help
972
+ eval-lib help
973
+ ```
974
+
810
975
  ## Custom LLM Providers
811
976
 
812
977
  The library supports custom LLM providers through the `CustomLLMClient` abstract base class. This allows you to integrate any LLM provider, including internal corporate models, locally-hosted models, or custom endpoints.
@@ -1,7 +1,10 @@
1
- eval_ai_library-0.3.3.dist-info/licenses/LICENSE,sha256=rK9uLDgWNrCHNdp-Zma_XghDE7Fs0u0kDi3WMcmYx6w,1074
2
- eval_lib/__init__.py,sha256=ySdAQb2DQma2y-ERuFv3VQEAq3S8d8G4vORfo__aqfk,3087
3
- eval_lib/evaluate.py,sha256=GjlXZb5dnl44LCaJwdkyGCYcC50zoNZn3NrofzNAVJ0,11490
1
+ eval_ai_library-0.3.10.dist-info/licenses/LICENSE,sha256=rK9uLDgWNrCHNdp-Zma_XghDE7Fs0u0kDi3WMcmYx6w,1074
2
+ eval_lib/__init__.py,sha256=OMrncAoUbbrJXfaYf8k2wJEGw1e2r9k-s1uXkerZ9mE,3204
3
+ eval_lib/cli.py,sha256=Fvnj6HgCQ3lhx28skweALgHSm3FMEpavQCB3o_sQhtE,4731
4
+ eval_lib/dashboard_server.py,sha256=6ND7ujtzN0PdMyVmJFnKDWrIf4kaodnetLZRPUhYHas,6751
5
+ eval_lib/evaluate.py,sha256=LEjwPsuuPGpdwes-xXesCKtKlBFFMF5X1CpIGJIrZ20,12630
4
6
  eval_lib/evaluation_schema.py,sha256=7IDd_uozqewhh7k0p1hKut_20udvRxxkV6thclxKUg0,1904
7
+ eval_lib/html.py,sha256=_tBTtwxZpjIwc3TVOyLGDw2VFD77aAeA47JdovoZ0CI,24094
5
8
  eval_lib/llm_client.py,sha256=eeTVhCLR1uYbhqOEOSBt3wWPKuzgzA9v8m0F9f-4Gqg,14910
6
9
  eval_lib/metric_pattern.py,sha256=wULgMNDeAqJC_Qjglo7bYzY2eGhA_PmY_hA_qGfg0sI,11730
7
10
  eval_lib/price.py,sha256=jbmkkUTxPuXrkSHuaJYPl7jSzfDIzQ9p_swWWs26UJ0,1986
@@ -28,7 +31,8 @@ eval_lib/metrics/faithfulness_metric/faithfulness.py,sha256=OqamlhTOps7d-NOStSIK
28
31
  eval_lib/metrics/geval/geval.py,sha256=mNciHXnqU2drOJsWlYmbwftGiKM89-Ykw2f6XneIGBM,10629
29
32
  eval_lib/metrics/restricted_refusal_metric/restricted_refusal.py,sha256=4QqYgGMcp6W9Lw-v4s0AlUhMSOKvBOEgnLvhqVXaT9I,4286
30
33
  eval_lib/metrics/toxicity_metric/toxicity.py,sha256=rBE1_fvpbCRdBpBep1y1LTIhofKR8GD4Eh76EOYzxL0,4076
31
- eval_ai_library-0.3.3.dist-info/METADATA,sha256=S6nodzMnFB5T1Gvtsg19qi1TEwxGtwc9CqLaBWxgPnM,43879
32
- eval_ai_library-0.3.3.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
33
- eval_ai_library-0.3.3.dist-info/top_level.txt,sha256=uQHpEd2XI0oZgq1eCww9zMvVgDJgwXMWkCD45fYUzEg,9
34
- eval_ai_library-0.3.3.dist-info/RECORD,,
34
+ eval_ai_library-0.3.10.dist-info/METADATA,sha256=pevxrimXqbreKbRwHZ0GBu_VXsfGhles6OMN2SBOJHo,47969
35
+ eval_ai_library-0.3.10.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
36
+ eval_ai_library-0.3.10.dist-info/entry_points.txt,sha256=VTDuJiTezDkBLQw1NWcRoOOuZPHqYgOCcVIoYno-L00,47
37
+ eval_ai_library-0.3.10.dist-info/top_level.txt,sha256=uQHpEd2XI0oZgq1eCww9zMvVgDJgwXMWkCD45fYUzEg,9
38
+ eval_ai_library-0.3.10.dist-info/RECORD,,
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ eval-lib = eval_lib.cli:main
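
This new `entry_points.txt` is what installs the `eval-lib` command: on install, pip generates a console script that imports `eval_lib.cli` and calls `main()`. As a sketch, the same entry point can be resolved programmatically (assumes the wheel is installed and Python 3.10+ for the `group` keyword):

```python
from importlib.metadata import entry_points

# Find the console script this wheel registers and load its callable
matches = [ep for ep in entry_points(group="console_scripts") if ep.name == "eval-lib"]
if matches:
    main = matches[0].load()  # -> eval_lib.cli:main
```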
eval_lib/__init__.py CHANGED
@@ -7,7 +7,7 @@ A powerful library for evaluating AI models with support for multiple LLM provid
7
7
  and a wide range of evaluation metrics for RAG systems and AI agents.
8
8
  """
9
9
 
10
- __version__ = "0.3.3"
10
+ __version__ = "0.3.10"
11
11
  __author__ = "Aleksandr Meshkov"
12
12
 
13
13
  # Core evaluation functions
@@ -66,6 +66,10 @@ from eval_lib.agent_metrics import (
66
66
  KnowledgeRetentionMetric
67
67
  )
68
68
 
69
+ from .dashboard_server import (
70
+ DashboardCache
71
+ )
72
+
69
73
 
70
74
  def __getattr__(name):
71
75
  """
@@ -136,4 +140,7 @@ __all__ = [
136
140
  # Utils
137
141
  "score_agg",
138
142
  "extract_json_block",
143
+
144
+ # Dashboard
145
+ 'start_dashboard',
146
+ 'DashboardCache',
139
147
  ]
eval_lib/cli.py ADDED
@@ -0,0 +1,166 @@
1
+ # eval_lib/cli.py
2
+ """
3
+ Command-line interface for Eval AI Library
4
+ """
5
+
6
+ import argparse
7
+ import sys
8
+ from pathlib import Path
9
+
10
+
11
+ def run_dashboard():
12
+ """Run dashboard server from CLI"""
13
+ parser = argparse.ArgumentParser(
14
+ description='Eval AI Library Dashboard Server',
15
+ prog='eval-lib dashboard'
16
+ )
17
+ parser.add_argument(
18
+ '--port',
19
+ type=int,
20
+ default=14500,
21
+ help='Port to run dashboard on (default: 14500)'
22
+ )
23
+ parser.add_argument(
24
+ '--host',
25
+ type=str,
26
+ default='0.0.0.0',
27
+ help='Host to bind to (default: 0.0.0.0)'
28
+ )
29
+ parser.add_argument(
30
+ '--cache-dir',
31
+ type=str,
32
+ default='.eval_cache',
33
+ help='Path to cache directory (default: .eval_cache)'
34
+ )
35
+
36
+ args = parser.parse_args(sys.argv[2:]) # Skip 'eval-lib' and 'dashboard'
37
+
38
+ # Import here to avoid loading everything for --help
39
+ from eval_lib.dashboard_server import DashboardCache
40
+ from eval_lib.html import HTML_TEMPLATE
41
+ from flask import Flask, render_template_string, jsonify
42
+
43
+ # Create cache with custom directory
44
+ def get_fresh_cache():
45
+ """Reload cache from disk"""
46
+ return DashboardCache(cache_dir=args.cache_dir)
47
+
48
+ cache = get_fresh_cache()
49
+
50
+ print("="*70)
51
+ print("📊 Eval AI Library - Dashboard Server")
52
+ print("="*70)
53
+
54
+ # Check cache
55
+ latest = cache.get_latest()
56
+ if latest:
57
+ print(f"\n✅ Found cached results:")
58
+ print(f" Latest session: {latest['session_id']}")
59
+ print(f" Timestamp: {latest['timestamp']}")
60
+ print(f" Total sessions: {len(cache.get_all())}")
61
+ else:
62
+ print("\n⚠️ No cached results found")
63
+ print(" Run an evaluation with show_dashboard=True to populate cache")
64
+
65
+ print(f"\n🚀 Starting server...")
66
+ print(f" URL: http://localhost:{args.port}")
67
+ print(f" Host: {args.host}")
68
+ print(f" Cache: {Path(args.cache_dir).absolute()}")
69
+ print(f"\n💡 Keep this terminal open to keep the server running")
70
+ print(f" Press Ctrl+C to stop\n")
71
+ print("="*70 + "\n")
72
+
73
+ app = Flask(__name__)
74
+ app.config['WTF_CSRF_ENABLED'] = False
75
+
76
+ @app.route('/')
77
+ def index():
78
+ return render_template_string(HTML_TEMPLATE)
79
+
80
+ @app.route('/favicon.ico')
81
+ def favicon():
82
+ return '', 204
83
+
84
+ @app.after_request
85
+ def after_request(response):
86
+ response.headers['Access-Control-Allow-Origin'] = '*'
87
+ response.headers['Access-Control-Allow-Methods'] = 'GET, POST, OPTIONS'
88
+ response.headers['Access-Control-Allow-Headers'] = 'Content-Type'
89
+ return response
90
+
91
+ @app.route('/api/latest')
92
+ def api_latest():
93
+ cache = get_fresh_cache()
94
+ latest = cache.get_latest()
95
+ if latest:
96
+ return jsonify(latest)
97
+ return jsonify({'error': 'No results available'}), 404
98
+
99
+ @app.route('/api/sessions')
100
+ def api_sessions():
101
+ cache = get_fresh_cache()
102
+ sessions = [
103
+ {
104
+ 'session_id': s['session_id'],
105
+ 'timestamp': s['timestamp'],
106
+ 'total_tests': s['data']['total_tests']
107
+ }
108
+ for s in cache.get_all()
109
+ ]
110
+ return jsonify(sessions)
111
+
112
+ @app.route('/api/session/<session_id>')
113
+ def api_session(session_id):
114
+ cache = get_fresh_cache()
115
+ session = cache.get_by_session(session_id)
116
+ if session:
117
+ return jsonify(session)
118
+ return jsonify({'error': 'Session not found'}), 404
119
+
120
+ @app.route('/api/clear')
121
+ def api_clear():
122
+ cache = get_fresh_cache()
123
+ cache.clear()
124
+ return jsonify({'message': 'Cache cleared'})
125
+
126
+ try:
127
+ app.run(
128
+ host=args.host,
129
+ port=args.port,
130
+ debug=False,
131
+ use_reloader=False,
132
+ threaded=True
133
+ )
134
+ except KeyboardInterrupt:
135
+ print("\n\n🛑 Dashboard server stopped")
136
+
137
+
138
+ def main():
139
+ """Main CLI entry point"""
140
+ parser = argparse.ArgumentParser(
141
+ description='Eval AI Library CLI',
142
+ usage='eval-lib <command> [options]'
143
+ )
144
+ parser.add_argument(
145
+ 'command',
146
+ help='Command to run (dashboard, version, help)'
147
+ )
148
+
149
+ # Parse only the command
150
+ args = parser.parse_args(sys.argv[1:2])
151
+
152
+ if args.command == 'dashboard':
153
+ run_dashboard()
154
+ elif args.command == 'version':
155
+ from eval_lib import __version__
156
+ print(f"Eval AI Library v{__version__}")
157
+ elif args.command == 'help':
158
+ parser.print_help()
159
+ else:
160
+ print(f"Unknown command: {args.command}")
161
+ print("Available commands: dashboard, version, help")
162
+ sys.exit(1)
163
+
164
+
165
+ if __name__ == '__main__':
166
+ main()
@@ -0,0 +1,172 @@
1
+ # eval_lib/dashboard_server.py
2
+
3
+ import json
4
+ from pathlib import Path
5
+ from typing import List, Dict, Any, Optional
6
+ from datetime import datetime
7
+
8
+
9
+ class DashboardCache:
10
+ """Cache to store evaluation results for the dashboard"""
11
+
12
+ def __init__(self, cache_dir: str = ".eval_cache"):
13
+ self.cache_dir = Path(cache_dir)
14
+ self.cache_dir.mkdir(exist_ok=True)
15
+ self.cache_file = self.cache_dir / "results.json"
16
+ self.results_history = []
17
+ self._load_cache()
18
+
19
+ def _load_cache(self):
20
+ """Load cache from file"""
21
+ if self.cache_file.exists():
22
+ try:
23
+ with open(self.cache_file, 'r', encoding='utf-8') as f:
24
+ self.results_history = json.load(f)
25
+ except Exception as e:
26
+ print(f"Warning: Could not load cache: {e}")
27
+ self.results_history = []
28
+
29
+ def _save_cache(self):
30
+ """Save cache to file"""
31
+ try:
32
+ with open(self.cache_file, 'w', encoding='utf-8') as f:
33
+ json.dump(self.results_history, f,
34
+ indent=2, ensure_ascii=False)
35
+ except Exception as e:
36
+ print(f"Warning: Could not save cache: {e}")
37
+
38
+ def add_results(self, results: List[tuple], session_name: Optional[str] = None) -> str:
39
+ """Add new results to the cache"""
40
+ import time
41
+ session_id = session_name or f"session_{int(time.time())}"
42
+ parsed_data = self._parse_results(results)
43
+
44
+ session_data = {
45
+ 'session_id': session_id,
46
+ 'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
47
+ 'data': parsed_data
48
+ }
49
+
50
+ self.results_history.append(session_data)
51
+ self._save_cache()
52
+
53
+ return session_id
54
+
55
+ def get_latest(self) -> Optional[Dict[str, Any]]:
56
+ """Get latest results"""
57
+ if self.results_history:
58
+ return self.results_history[-1]
59
+ return None
60
+
61
+ def get_all(self) -> List[Dict[str, Any]]:
62
+ """Get all results"""
63
+ return self.results_history
64
+
65
+ def get_by_session(self, session_id: str) -> Optional[Dict[str, Any]]:
66
+ """Get results by session_id"""
67
+ for session in self.results_history:
68
+ if session['session_id'] == session_id:
69
+ return session
70
+ return None
71
+
72
+ def clear(self):
73
+ """Clear the cache"""
74
+ self.results_history = []
75
+ self._save_cache()
76
+
77
+ def _parse_results(self, results: List[tuple]) -> Dict[str, Any]:
78
+ """Parse raw results into structured format for dashboard"""
79
+
80
+ test_cases = []
81
+ metrics_summary = {}
82
+ total_cost = 0.0
83
+
84
+ for test_idx, test_results in results:
85
+ for result in test_results:
86
+ test_case_data = {
87
+ 'test_index': test_idx,
88
+ 'input': result.input[:100] + '...' if len(result.input) > 100 else result.input,
89
+ 'input_full': result.input,
90
+ 'actual_output': result.actual_output[:200] if result.actual_output else '',
91
+ 'actual_output_full': result.actual_output,
92
+ 'expected_output': result.expected_output[:200] if result.expected_output else '',
93
+ 'expected_output_full': result.expected_output,
94
+ 'retrieval_context': result.retrieval_context if result.retrieval_context else [],
95
+ 'metrics': []
96
+ }
97
+
98
+ for metric_data in result.metrics_data:
99
+ # Determine model name
100
+ if isinstance(metric_data.evaluation_model, str):
101
+ model_name = metric_data.evaluation_model
102
+ else:
103
+ # For CustomLLMClient
104
+ try:
105
+ model_name = metric_data.evaluation_model.get_model_name()
106
+ except Exception:
107
+ model_name = str(
108
+ type(metric_data.evaluation_model).__name__)
109
+
110
+ test_case_data['metrics'].append({
111
+ 'name': metric_data.name,
112
+ 'score': round(metric_data.score, 3),
113
+ 'success': metric_data.success,
114
+ 'threshold': metric_data.threshold,
115
+ 'reason': metric_data.reason[:300] if metric_data.reason else '',
116
+ 'reason_full': metric_data.reason,
117
+ 'evaluation_model': model_name,
118
+ 'evaluation_cost': metric_data.evaluation_cost,
119
+ 'evaluation_log': metric_data.evaluation_log
120
+ })
121
+
122
+ if metric_data.name not in metrics_summary:
123
+ metrics_summary[metric_data.name] = {
124
+ 'scores': [],
125
+ 'passed': 0,
126
+ 'failed': 0,
127
+ 'threshold': metric_data.threshold,
128
+ 'total_cost': 0.0,
129
+ 'model': model_name
130
+ }
131
+
132
+ metrics_summary[metric_data.name]['scores'].append(
133
+ metric_data.score)
134
+ if metric_data.success:
135
+ metrics_summary[metric_data.name]['passed'] += 1
136
+ else:
137
+ metrics_summary[metric_data.name]['failed'] += 1
138
+
139
+ if metric_data.evaluation_cost:
140
+ total_cost += metric_data.evaluation_cost
141
+ metrics_summary[metric_data.name]['total_cost'] += metric_data.evaluation_cost
142
+
143
+ test_cases.append(test_case_data)
144
+
145
+ for metric_name, data in metrics_summary.items():
146
+ data['avg_score'] = sum(data['scores']) / \
147
+ len(data['scores']) if data['scores'] else 0
148
+ data['success_rate'] = (data['passed'] / (data['passed'] + data['failed'])
149
+ * 100) if (data['passed'] + data['failed']) > 0 else 0
150
+
151
+ return {
152
+ 'test_cases': test_cases,
153
+ 'metrics_summary': metrics_summary,
154
+ 'total_cost': total_cost,
155
+ 'total_tests': len(test_cases)
156
+ }
157
+
158
+
159
+ def save_results_to_cache(results: List[tuple], session_name: Optional[str] = None) -> str:
160
+ """
161
+ Save evaluation results to cache for dashboard viewing.
162
+ Cache is always saved to .eval_cache/ in the current directory.
163
+
164
+ Args:
165
+ results: Evaluation results from evaluate()
166
+ session_name: Optional name for the session
167
+
168
+ Returns:
169
+ Session ID
170
+ """
171
+ cache = DashboardCache()
172
+ return cache.add_results(results, session_name)
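
`_parse_results` expects the tuples produced by `evaluate()`: each item is `(test_index, [results...])`, where every result carries `metrics_data` entries with `name`, `score`, `success`, `threshold`, `reason`, `evaluation_model`, `evaluation_cost`, and `evaluation_log`. A smoke-test sketch using hypothetical stand-in objects (not the library's real classes) that mirror those attributes:

```python
from types import SimpleNamespace

from eval_lib.dashboard_server import DashboardCache

# Hypothetical stand-ins; the real objects come from evaluate()
metric = SimpleNamespace(
    name="AnswerRelevancy", score=0.91, success=True, threshold=0.7,
    reason="Answer addresses the question.", evaluation_model="gpt-4o-mini",
    evaluation_cost=0.0004, evaluation_log={},
)
result = SimpleNamespace(
    input="What is the capital of France?",
    actual_output="Paris is the capital.",
    expected_output="Paris",
    retrieval_context=["Paris is the capital of France."],
    metrics_data=[metric],
)

cache = DashboardCache()
print(cache.add_results([(0, [result])], session_name="smoke-test"))
```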
eval_lib/evaluate.py CHANGED
@@ -68,7 +68,9 @@ def _print_summary(results: List, total_cost: float, total_time: float, passed:
68
68
  async def evaluate(
69
69
  test_cases: List[EvalTestCase],
70
70
  metrics: List[MetricPattern],
71
- verbose: bool = True
71
+ verbose: bool = True,
72
+ show_dashboard: bool = False,
73
+ session_name: str = None,
72
74
  ) -> List[Tuple[None, List[TestCaseResult]]]:
73
75
  """
74
76
  Evaluate test cases with multiple metrics.
@@ -77,6 +79,8 @@ async def evaluate(
77
79
  test_cases: List of test cases to evaluate
78
80
  metrics: List of metrics to apply
79
81
  verbose: Enable detailed logging (default: True)
82
+ show_dashboard: Launch interactive web dashboard (default: False)
83
+ dashboard_port: Port for dashboard server (default: 14500)
84
+ session_name: Name for this evaluation session
85
+ cache_dir: Directory to store cache (default: .eval_cache)
80
86
 
81
87
  Returns:
82
88
  List of evaluation results
@@ -183,6 +189,23 @@ async def evaluate(
183
189
  _print_summary(results, total_cost, total_time,
184
190
  total_passed, total_tests)
185
191
 
192
+ if show_dashboard:
193
+ from eval_lib.dashboard_server import save_results_to_cache
194
+
195
+ session_id = save_results_to_cache(results, session_name)
196
+
197
+ if verbose:
198
+ print(f"\n{Colors.BOLD}{Colors.GREEN}{'='*70}{Colors.ENDC}")
199
+ print(f"{Colors.BOLD}{Colors.GREEN}📊 DASHBOARD{Colors.ENDC}")
200
+ print(f"{Colors.BOLD}{Colors.GREEN}{'='*70}{Colors.ENDC}")
201
+ print(
202
+ f"\n✅ Results saved to cache: {Colors.CYAN}{session_id}{Colors.ENDC}")
203
+ print(f"\n💡 To view results, run:")
204
+ print(f" {Colors.YELLOW}eval-lib dashboard{Colors.ENDC}")
205
+ print(
206
+ f"\n Then open: {Colors.CYAN}http://localhost:14500{Colors.ENDC}")
207
+ print(f"\n{Colors.BOLD}{Colors.GREEN}{'='*70}{Colors.ENDC}\n")
208
+
186
209
  return results
187
210
 
188
211
 
eval_lib/html.py ADDED
@@ -0,0 +1,736 @@
1
+ HTML_TEMPLATE = """
2
+ <!DOCTYPE html>
3
+ <html lang="en">
4
+ <head>
5
+ <meta charset="UTF-8">
6
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
7
+ <title>Eval AI Library - Interactive Dashboard</title>
8
+ <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
9
+ <style>
10
+ * {
11
+ margin: 0;
12
+ padding: 0;
13
+ box-sizing: border-box;
14
+ }
15
+
16
+ body {
17
+ font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
18
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
19
+ padding: 20px;
20
+ min-height: 100vh;
21
+ }
22
+
23
+ .container {
24
+ max-width: 1400px;
25
+ margin: 0 auto;
26
+ background: white;
27
+ border-radius: 20px;
28
+ padding: 30px;
29
+ box-shadow: 0 20px 60px rgba(0,0,0,0.3);
30
+ }
31
+
32
+ header {
33
+ display: flex;
34
+ justify-content: space-between;
35
+ align-items: center;
36
+ margin-bottom: 40px;
37
+ padding-bottom: 20px;
38
+ border-bottom: 3px solid #667eea;
39
+ }
40
+
41
+ h1 {
42
+ color: #667eea;
43
+ font-size: 2.5em;
44
+ }
45
+
46
+ .controls {
47
+ display: flex;
48
+ gap: 10px;
49
+ align-items: center;
50
+ }
51
+
52
+ select, button {
53
+ padding: 10px 20px;
54
+ border-radius: 8px;
55
+ border: 2px solid #667eea;
56
+ background: white;
57
+ color: #667eea;
58
+ font-weight: 600;
59
+ cursor: pointer;
60
+ transition: all 0.3s;
61
+ }
62
+
63
+ button:hover {
64
+ background: #667eea;
65
+ color: white;
66
+ }
67
+
68
+ .timestamp {
69
+ color: #666;
70
+ font-size: 0.9em;
71
+ margin-left: 20px;
72
+ }
73
+
74
+ .summary {
75
+ display: grid;
76
+ grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
77
+ gap: 20px;
78
+ margin-bottom: 40px;
79
+ }
80
+
81
+ .summary-card {
82
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
83
+ color: white;
84
+ padding: 25px;
85
+ border-radius: 15px;
86
+ text-align: center;
87
+ box-shadow: 0 5px 15px rgba(0,0,0,0.2);
88
+ transition: transform 0.3s;
89
+ }
90
+
91
+ .summary-card:hover {
92
+ transform: translateY(-5px);
93
+ }
94
+
95
+ .summary-card h3 {
96
+ font-size: 0.9em;
97
+ margin-bottom: 10px;
98
+ opacity: 0.9;
99
+ }
100
+
101
+ .summary-card .value {
102
+ font-size: 2em;
103
+ font-weight: bold;
104
+ }
105
+
106
+ .metrics-grid {
107
+ display: grid;
108
+ grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
109
+ gap: 20px;
110
+ margin-bottom: 40px;
111
+ }
112
+
113
+ .metric-card {
114
+ background: #f8f9fa;
115
+ border-radius: 15px;
116
+ padding: 20px;
117
+ box-shadow: 0 3px 10px rgba(0,0,0,0.1);
118
+ transition: transform 0.3s;
119
+ }
120
+
121
+ .metric-card:hover {
122
+ transform: translateY(-5px);
123
+ }
124
+
125
+ .metric-card h3 {
126
+ color: #667eea;
127
+ margin-bottom: 15px;
128
+ font-size: 1.1em;
129
+ }
130
+
131
+ .metric-score {
132
+ font-size: 2.5em;
133
+ font-weight: bold;
134
+ color: #764ba2;
135
+ margin-bottom: 15px;
136
+ }
137
+
138
+ .metric-details p {
139
+ margin: 8px 0;
140
+ color: #555;
141
+ font-size: 0.9em;
142
+ }
143
+
144
+ .charts {
145
+ display: grid;
146
+ grid-template-columns: repeat(auto-fit, minmax(400px, 1fr));
147
+ gap: 30px;
148
+ margin-bottom: 40px;
149
+ }
150
+
151
+ .chart-container {
152
+ background: #f8f9fa;
153
+ border-radius: 15px;
154
+ padding: 20px;
155
+ box-shadow: 0 3px 10px rgba(0,0,0,0.1);
156
+ }
157
+
158
+ .chart-container h2 {
159
+ color: #667eea;
160
+ margin-bottom: 20px;
161
+ font-size: 1.3em;
162
+ }
163
+
164
+ table {
165
+ width: 100%;
166
+ border-collapse: collapse;
167
+ background: white;
168
+ border-radius: 10px;
169
+ overflow: hidden;
170
+ box-shadow: 0 3px 10px rgba(0,0,0,0.1);
171
+ }
172
+
173
+ th {
174
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
175
+ color: white;
176
+ padding: 15px;
177
+ text-align: left;
178
+ font-weight: 600;
179
+ cursor: pointer;
180
+ user-select: none;
181
+ }
182
+
183
+ th:hover {
184
+ background: linear-gradient(135deg, #5568d3 0%, #653a8b 100%);
185
+ }
186
+
187
+ td {
188
+ padding: 12px 15px;
189
+ border-bottom: 1px solid #eee;
190
+ font-size: 0.9em;
191
+ }
192
+
193
+ tr.success {
194
+ background: #f0fdf4;
195
+ }
196
+
197
+ tr.failed {
198
+ background: #fef2f2;
199
+ }
200
+
201
+ tr:hover {
202
+ background: #f8f9fa !important;
203
+ }
204
+
205
+ .reason {
206
+ max-width: 300px;
207
+ color: #666;
208
+ }
209
+
210
+ .view-details-btn {
211
+ background: #667eea;
212
+ color: white;
213
+ border: none;
214
+ padding: 5px 12px;
215
+ border-radius: 5px;
216
+ cursor: pointer;
217
+ font-size: 0.85em;
218
+ transition: all 0.3s;
219
+ }
220
+
221
+ .view-details-btn:hover {
222
+ background: #5568d3;
223
+ transform: scale(1.05);
224
+ }
225
+
226
+ /* Modal styles */
227
+ .modal {
228
+ display: none;
229
+ position: fixed;
230
+ z-index: 1000;
231
+ left: 0;
232
+ top: 0;
233
+ width: 100%;
234
+ height: 100%;
235
+ overflow: auto;
236
+ background-color: rgba(0,0,0,0.7);
237
+ animation: fadeIn 0.3s;
238
+ }
239
+
240
+ @keyframes fadeIn {
241
+ from { opacity: 0; }
242
+ to { opacity: 1; }
243
+ }
244
+
245
+ .modal-content {
246
+ background-color: #fefefe;
247
+ margin: 2% auto;
248
+ padding: 30px;
249
+ border-radius: 15px;
250
+ width: 90%;
251
+ max-width: 900px;
252
+ max-height: 90vh;
253
+ overflow-y: auto;
254
+ box-shadow: 0 20px 60px rgba(0,0,0,0.3);
255
+ animation: slideIn 0.3s;
256
+ }
257
+
258
+ @keyframes slideIn {
259
+ from {
260
+ transform: translateY(-50px);
261
+ opacity: 0;
262
+ }
263
+ to {
264
+ transform: translateY(0);
265
+ opacity: 1;
266
+ }
267
+ }
268
+
269
+ .modal-header {
270
+ display: flex;
271
+ justify-content: space-between;
272
+ align-items: center;
273
+ margin-bottom: 20px;
274
+ padding-bottom: 15px;
275
+ border-bottom: 2px solid #667eea;
276
+ }
277
+
278
+ .modal-header h2 {
279
+ color: #667eea;
280
+ margin: 0;
281
+ }
282
+
283
+ .close {
284
+ color: #aaa;
285
+ font-size: 35px;
286
+ font-weight: bold;
287
+ cursor: pointer;
288
+ transition: color 0.3s;
289
+ }
290
+
291
+ .close:hover {
292
+ color: #667eea;
293
+ }
294
+
295
+ .detail-section {
296
+ margin: 20px 0;
297
+ padding: 15px;
298
+ background: #f8f9fa;
299
+ border-radius: 10px;
300
+ border-left: 4px solid #667eea;
301
+ }
302
+
303
+ .detail-section h3 {
304
+ color: #667eea;
305
+ margin-bottom: 10px;
306
+ font-size: 1.1em;
307
+ }
308
+
309
+ .detail-section pre {
310
+ background: white;
311
+ padding: 15px;
312
+ border-radius: 8px;
313
+ overflow-x: auto;
314
+ font-size: 0.85em;
315
+ line-height: 1.5;
316
+ }
317
+
318
+ .detail-section p {
319
+ margin: 8px 0;
320
+ color: #555;
321
+ line-height: 1.6;
322
+ }
323
+
324
+ .badge {
325
+ display: inline-block;
326
+ padding: 4px 10px;
327
+ border-radius: 12px;
328
+ font-size: 0.8em;
329
+ font-weight: 600;
330
+ margin-right: 8px;
331
+ }
332
+
333
+ .badge-success {
334
+ background: #d1fae5;
335
+ color: #065f46;
336
+ }
337
+
338
+ .badge-failed {
339
+ background: #fee2e2;
340
+ color: #991b1b;
341
+ }
342
+
343
+ .loading {
344
+ text-align: center;
345
+ padding: 40px;
346
+ color: #667eea;
347
+ font-size: 1.2em;
348
+ }
349
+
350
+ .no-data {
351
+ text-align: center;
352
+ padding: 60px;
353
+ color: #999;
354
+ }
355
+
356
+ .no-data h2 {
357
+ color: #667eea;
358
+ margin-bottom: 20px;
359
+ }
360
+
361
+ @media (max-width: 768px) {
362
+ .charts {
363
+ grid-template-columns: 1fr;
364
+ }
365
+
366
+ .metrics-grid {
367
+ grid-template-columns: 1fr;
368
+ }
369
+
370
+ header {
371
+ flex-direction: column;
372
+ gap: 15px;
373
+ }
374
+
375
+ .modal-content {
376
+ width: 95%;
377
+ margin: 5% auto;
378
+ padding: 20px;
379
+ }
380
+ }
381
+ </style>
382
+ </head>
383
+ <body>
384
+ <div class="container">
385
+ <header>
386
+ <div>
387
+ <h1>📊 Eval AI Library Dashboard</h1>
388
+ <span class="timestamp" id="timestamp">Loading...</span>
389
+ </div>
390
+ <div class="controls">
391
+ <select id="sessionSelect" onchange="loadSession()">
392
+ <option value="">Loading sessions...</option>
393
+ </select>
394
+ <button onclick="refreshData()">🔄 Refresh</button>
395
+ <button onclick="clearCache()">🗑️ Clear Cache</button>
396
+ </div>
397
+ </header>
398
+
399
+ <div id="content" class="loading">
400
+ Loading data...
401
+ </div>
402
+ </div>
403
+
404
+ <!-- Modal for detailed information -->
405
+ <div id="detailsModal" class="modal">
406
+ <div class="modal-content">
407
+ <div class="modal-header">
408
+ <h2>📋 Evaluation Details</h2>
409
+ <span class="close" onclick="closeModal()">&times;</span>
410
+ </div>
411
+ <div id="modalBody"></div>
412
+ </div>
413
+ </div>
414
+
415
+ <script>
416
+ let currentData = null;
417
+ let scoresChart = null;
418
+ let successChart = null;
419
+
420
+ // Load the list of sessions
421
+ async function loadSessions() {
422
+ try {
423
+ const response = await fetch('/api/sessions');
424
+ const sessions = await response.json();
425
+
426
+ const select = document.getElementById('sessionSelect');
427
+ select.innerHTML = '<option value="latest">Latest Results</option>';
428
+
429
+ sessions.reverse().forEach(session => {
430
+ const option = document.createElement('option');
431
+ option.value = session.session_id;
432
+ option.textContent = `${session.session_id} (${session.timestamp}) - ${session.total_tests} tests`;
433
+ select.appendChild(option);
434
+ });
435
+ } catch (error) {
436
+ console.error('Error loading sessions:', error);
437
+ }
438
+ }
439
+
440
+ // Load session data
441
+ async function loadSession() {
442
+ const select = document.getElementById('sessionSelect');
443
+ const sessionId = select.value;
444
+
445
+ try {
446
+ let url = '/api/latest';
447
+ if (sessionId && sessionId !== 'latest') {
448
+ url = `/api/session/${sessionId}`;
449
+ }
450
+
451
+ const response = await fetch(url);
452
+ if (!response.ok) {
453
+ showNoData();
454
+ return;
455
+ }
456
+
457
+ const session = await response.json();
458
+ currentData = session.data;
459
+ renderDashboard(session);
460
+ } catch (error) {
461
+ console.error('Error loading session:', error);
462
+ showNoData();
463
+ }
464
+ }
465
+
466
+ // Show the "no data" message
467
+ function showNoData() {
468
+ document.getElementById('content').innerHTML = `
469
+ <div class="no-data">
470
+ <h2>No evaluation results available</h2>
471
+ <p>Run an evaluation with <code>show_dashboard=True</code> to see results here.</p>
472
+ </div>
473
+ `;
474
+ }
475
+
476
+ // Render the dashboard
477
+ function renderDashboard(session) {
478
+ const data = session.data;
479
+ document.getElementById('timestamp').textContent = `Generated: ${session.timestamp}`;
480
+
481
+ const metricsLabels = Object.keys(data.metrics_summary);
482
+ const metricsScores = metricsLabels.map(m => data.metrics_summary[m].avg_score);
483
+ const metricsSuccessRates = metricsLabels.map(m => data.metrics_summary[m].success_rate);
484
+
485
+ let metricCards = '';
486
+ for (const [metricName, metricData] of Object.entries(data.metrics_summary)) {
487
+ metricCards += `
488
+ <div class="metric-card">
489
+ <h3>${metricName}</h3>
490
+ <div class="metric-score">${metricData.avg_score.toFixed(3)}</div>
491
+ <div class="metric-details">
492
+ <p>✅ Passed: ${metricData.passed}</p>
493
+ <p>❌ Failed: ${metricData.failed}</p>
494
+ <p>📊 Success Rate: ${metricData.success_rate.toFixed(1)}%</p>
495
+ <p>🎯 Threshold: ${metricData.threshold}</p>
496
+ <p>🤖 Model: ${metricData.model}</p>
497
+ <p>💰 Total Cost: $${metricData.total_cost.toFixed(6)}</p>
498
+ </div>
499
+ </div>
500
+ `;
501
+ }
502
+
503
+ let tableRows = '';
504
+ data.test_cases.forEach((testCase, tcIdx) => {
505
+ testCase.metrics.forEach((metric, mIdx) => {
506
+ const statusEmoji = metric.success ? '✅' : '❌';
507
+ const statusClass = metric.success ? 'success' : 'failed';
508
+
509
+ tableRows += `
510
+ <tr class="${statusClass}">
511
+ <td>${testCase.test_index}</td>
512
+ <td>${testCase.input}</td>
513
+ <td>${metric.name}</td>
514
+ <td>${metric.score.toFixed(3)}</td>
515
+ <td>${metric.threshold}</td>
516
+ <td>${statusEmoji}</td>
517
+ <td>${metric.evaluation_model}</td>
518
+ <td>$${(metric.evaluation_cost || 0).toFixed(6)}</td>
519
+ <td>
520
+ <button class="view-details-btn" onclick="showDetails(${tcIdx}, ${mIdx})">
521
+ View Details
522
+ </button>
523
+ </td>
524
+ </tr>
525
+ `;
526
+ });
527
+ });
528
+
529
+ document.getElementById('content').innerHTML = `
530
+ <div class="summary">
531
+ <div class="summary-card">
532
+ <h3>Total Tests</h3>
533
+ <div class="value">${data.total_tests}</div>
534
+ </div>
535
+ <div class="summary-card">
536
+ <h3>Total Cost</h3>
537
+ <div class="value">$${data.total_cost.toFixed(6)}</div>
538
+ </div>
539
+ <div class="summary-card">
540
+ <h3>Metrics</h3>
541
+ <div class="value">${metricsLabels.length}</div>
542
+ </div>
543
+ </div>
544
+
545
+ <h2 style="color: #667eea; margin-bottom: 20px;">📈 Metrics Summary</h2>
546
+ <div class="metrics-grid">
547
+ ${metricCards}
548
+ </div>
549
+
550
+ <h2 style="color: #667eea; margin-bottom: 20px;">📊 Charts</h2>
551
+ <div class="charts">
552
+ <div class="chart-container">
553
+ <h2>Average Scores by Metric</h2>
554
+ <canvas id="scoresChart"></canvas>
555
+ </div>
556
+ <div class="chart-container">
557
+ <h2>Success Rate by Metric</h2>
558
+ <canvas id="successChart"></canvas>
559
+ </div>
560
+ </div>
561
+
562
+ <h2 style="color: #667eea; margin: 40px 0 20px 0;">📋 Detailed Results</h2>
563
+ <table>
564
+ <thead>
565
+ <tr>
566
+ <th>Test #</th>
567
+ <th>Input</th>
568
+ <th>Metric</th>
569
+ <th>Score</th>
570
+ <th>Threshold</th>
571
+ <th>Status</th>
572
+ <th>Model</th>
573
+ <th>Cost</th>
574
+ <th>Actions</th>
575
+ </tr>
576
+ </thead>
577
+ <tbody>
578
+ ${tableRows}
579
+ </tbody>
580
+ </table>
581
+ `;
582
+
583
+ renderCharts(metricsLabels, metricsScores, metricsSuccessRates);
584
+ }
585
+
586
+ // Show details in the modal window
587
+ function showDetails(testCaseIdx, metricIdx) {
588
+ const testCase = currentData.test_cases[testCaseIdx];
589
+ const metric = testCase.metrics[metricIdx];
590
+
591
+ const statusBadge = metric.success
592
+ ? '<span class="badge badge-success">✅ PASSED</span>'
593
+ : '<span class="badge badge-failed">❌ FAILED</span>';
594
+
595
+ let modalContent = `
596
+ <div class="detail-section">
597
+ <h3>Test Case #${testCase.test_index}</h3>
598
+ <p><strong>Input:</strong> ${testCase.input_full}</p>
599
+ <p><strong>Actual Output:</strong> ${testCase.actual_output_full || 'N/A'}</p>
600
+ <p><strong>Expected Output:</strong> ${testCase.expected_output_full || 'N/A'}</p>
601
+ </div>
602
+
603
+ <div class="detail-section">
604
+ <h3>Metric: ${metric.name}</h3>
605
+ ${statusBadge}
606
+ <p><strong>Score:</strong> ${metric.score.toFixed(3)} / ${metric.threshold}</p>
607
+ <p><strong>Model:</strong> ${metric.evaluation_model}</p>
608
+ <p><strong>Cost:</strong> $${(metric.evaluation_cost || 0).toFixed(6)}</p>
609
+ </div>
610
+
611
+ <div class="detail-section">
612
+ <h3>Reason</h3>
613
+ <p>${metric.reason_full || metric.reason}</p>
614
+ </div>
615
+ `;
616
+
617
+ // Add the retrieval context if present
618
+ if (testCase.retrieval_context && testCase.retrieval_context.length > 0) {
619
+ modalContent += `
620
+ <div class="detail-section">
621
+ <h3>Retrieval Context (${testCase.retrieval_context.length} chunks)</h3>
622
+ ${testCase.retrieval_context.map((ctx, idx) => `
623
+ <p><strong>Chunk ${idx + 1}:</strong></p>
624
+ <p style="margin-left: 20px; color: #666;">${ctx.substring(0, 300)}${ctx.length > 300 ? '...' : ''}</p>
625
+ `).join('')}
626
+ </div>
627
+ `;
628
+ }
629
+
630
+ // Add the evaluation log if present
631
+ if (metric.evaluation_log) {
632
+ modalContent += `
633
+ <div class="detail-section">
634
+ <h3>Evaluation Log</h3>
635
+ <pre>${JSON.stringify(metric.evaluation_log, null, 2)}</pre>
636
+ </div>
637
+ `;
638
+ }
639
+
640
+ document.getElementById('modalBody').innerHTML = modalContent;
641
+ document.getElementById('detailsModal').style.display = 'block';
642
+ }
643
+
644
+ // Close the modal window
645
+ function closeModal() {
646
+ document.getElementById('detailsModal').style.display = 'none';
647
+ }
648
+
649
+ // Close when clicking outside the modal
650
+ window.onclick = function(event) {
651
+ const modal = document.getElementById('detailsModal');
652
+ if (event.target == modal) {
653
+ closeModal();
654
+ }
655
+ }
656
+
657
+ // Render the charts
658
+ function renderCharts(labels, scores, successRates) {
659
+ if (scoresChart) scoresChart.destroy();
660
+ if (successChart) successChart.destroy();
661
+
662
+ const scoresCtx = document.getElementById('scoresChart').getContext('2d');
663
+ scoresChart = new Chart(scoresCtx, {
664
+ type: 'bar',
665
+ data: {
666
+ labels: labels,
667
+ datasets: [{
668
+ label: 'Average Score',
669
+ data: scores,
670
+ backgroundColor: 'rgba(102, 126, 234, 0.8)',
671
+ borderColor: 'rgba(102, 126, 234, 1)',
672
+ borderWidth: 2
673
+ }]
674
+ },
675
+ options: {
676
+ responsive: true,
677
+ scales: {
678
+ y: {
679
+ beginAtZero: true,
680
+ max: 1.0
681
+ }
682
+ }
683
+ }
684
+ });
685
+
686
+ const successCtx = document.getElementById('successChart').getContext('2d');
687
+ successChart = new Chart(successCtx, {
688
+ type: 'doughnut',
689
+ data: {
690
+ labels: labels,
691
+ datasets: [{
692
+ label: 'Success Rate (%)',
693
+ data: successRates,
694
+ backgroundColor: [
695
+ 'rgba(102, 126, 234, 0.8)',
696
+ 'rgba(118, 75, 162, 0.8)',
697
+ 'rgba(237, 100, 166, 0.8)',
698
+ 'rgba(255, 154, 158, 0.8)',
699
+ 'rgba(250, 208, 196, 0.8)'
700
+ ],
701
+ borderWidth: 2
702
+ }]
703
+ },
704
+ options: {
705
+ responsive: true
706
+ }
707
+ });
708
+ }
709
+
710
+ // Refresh the data
711
+ function refreshData() {
712
+ loadSessions();
713
+ loadSession();
714
+ }
715
+
716
+ // Clear the cache
717
+ async function clearCache() {
718
+ if (confirm('Are you sure you want to clear all cached results?')) {
719
+ try {
720
+ await fetch('/api/clear');
721
+ alert('Cache cleared!');
722
+ refreshData();
723
+ } catch (error) {
724
+ console.error('Error clearing cache:', error);
725
+ alert('Error clearing cache');
726
+ }
727
+ }
728
+ }
729
+
730
+ // Initialization
731
+ loadSessions();
732
+ loadSession();
733
+ </script>
734
+ </body>
735
+ </html>
736
+ """