promptum 0.0.2__tar.gz → 0.0.3__tar.gz

This diff shows the content of publicly available package versions released to a supported registry. It is provided for informational purposes only and reflects the changes between the two versions as published.
Files changed (52)
  1. {promptum-0.0.2 → promptum-0.0.3}/PKG-INFO +43 -35
  2. {promptum-0.0.2 → promptum-0.0.3}/README.md +42 -34
  3. {promptum-0.0.2 → promptum-0.0.3}/pyproject.toml +1 -1
  4. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/__init__.py +3 -2
  5. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/benchmark/__init__.py +2 -1
  6. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/benchmark/report.py +13 -26
  7. promptum-0.0.3/src/promptum/benchmark/summary.py +14 -0
  8. {promptum-0.0.2 → promptum-0.0.3}/tests/benchmark/test_report_filtering.py +0 -7
  9. promptum-0.0.3/tests/benchmark/test_report_summary.py +28 -0
  10. {promptum-0.0.2 → promptum-0.0.3}/uv.lock +1 -1
  11. promptum-0.0.2/tests/benchmark/test_report_summary.py +0 -24
  12. {promptum-0.0.2 → promptum-0.0.3}/.coveragerc +0 -0
  13. {promptum-0.0.2 → promptum-0.0.3}/.github/workflows/lint.yml +0 -0
  14. {promptum-0.0.2 → promptum-0.0.3}/.github/workflows/publish-test.yml +0 -0
  15. {promptum-0.0.2 → promptum-0.0.3}/.github/workflows/publish.yml +0 -0
  16. {promptum-0.0.2 → promptum-0.0.3}/.github/workflows/test.yml +0 -0
  17. {promptum-0.0.2 → promptum-0.0.3}/.github/workflows/typecheck.yml +0 -0
  18. {promptum-0.0.2 → promptum-0.0.3}/.gitignore +0 -0
  19. {promptum-0.0.2 → promptum-0.0.3}/.python-version +0 -0
  20. {promptum-0.0.2 → promptum-0.0.3}/CONTRIBUTING.md +0 -0
  21. {promptum-0.0.2 → promptum-0.0.3}/Justfile +0 -0
  22. {promptum-0.0.2 → promptum-0.0.3}/LICENSE +0 -0
  23. {promptum-0.0.2 → promptum-0.0.3}/pytest.ini +0 -0
  24. {promptum-0.0.2 → promptum-0.0.3}/ruff.toml +0 -0
  25. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/benchmark/benchmark.py +0 -0
  26. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/benchmark/result.py +0 -0
  27. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/benchmark/runner.py +0 -0
  28. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/benchmark/test_case.py +0 -0
  29. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/providers/__init__.py +0 -0
  30. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/providers/metrics.py +0 -0
  31. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/providers/openrouter.py +0 -0
  32. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/providers/protocol.py +0 -0
  33. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/providers/retry.py +0 -0
  34. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/py.typed +0 -0
  35. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/validation/__init__.py +0 -0
  36. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/validation/protocol.py +0 -0
  37. {promptum-0.0.2 → promptum-0.0.3}/src/promptum/validation/validators.py +0 -0
  38. {promptum-0.0.2 → promptum-0.0.3}/tests/__init__.py +0 -0
  39. {promptum-0.0.2 → promptum-0.0.3}/tests/benchmark/__init__.py +0 -0
  40. {promptum-0.0.2 → promptum-0.0.3}/tests/benchmark/conftest.py +0 -0
  41. {promptum-0.0.2 → promptum-0.0.3}/tests/benchmark/test_test_case.py +0 -0
  42. {promptum-0.0.2 → promptum-0.0.3}/tests/conftest.py +0 -0
  43. {promptum-0.0.2 → promptum-0.0.3}/tests/providers/__init__.py +0 -0
  44. {promptum-0.0.2 → promptum-0.0.3}/tests/providers/conftest.py +0 -0
  45. {promptum-0.0.2 → promptum-0.0.3}/tests/providers/test_metrics.py +0 -0
  46. {promptum-0.0.2 → promptum-0.0.3}/tests/providers/test_retry.py +0 -0
  47. {promptum-0.0.2 → promptum-0.0.3}/tests/validation/__init__.py +0 -0
  48. {promptum-0.0.2 → promptum-0.0.3}/tests/validation/conftest.py +0 -0
  49. {promptum-0.0.2 → promptum-0.0.3}/tests/validation/test_contains.py +0 -0
  50. {promptum-0.0.2 → promptum-0.0.3}/tests/validation/test_exact_match.py +0 -0
  51. {promptum-0.0.2 → promptum-0.0.3}/tests/validation/test_json_schema.py +0 -0
  52. {promptum-0.0.2 → promptum-0.0.3}/tests/validation/test_regex.py +0 -0
{promptum-0.0.2 → promptum-0.0.3}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: promptum
- Version: 0.0.2
+ Version: 0.0.3
  Summary: Async LLM benchmarking library with protocol-based extensibility
  Project-URL: Homepage, https://github.com/deyna256/promptum
  Project-URL: Repository, https://github.com/deyna256/promptum
@@ -46,7 +46,7 @@ Description-Content-Type: text/markdown
  ![Async](https://img.shields.io/badge/Async-First-green?style=for-the-badge)
  ![License: MIT](https://img.shields.io/badge/License-MIT-yellow?style=for-the-badge)
 
- **Benchmark LLMs Like a Pro. In 5 Lines of Code.**
+ **Benchmark LLMs Like a Pro.**
 
  Stop writing boilerplate to test LLMs. Start getting results.
 
@@ -56,11 +56,12 @@ Stop writing boilerplate to test LLMs. Start getting results.
 
  ## What's This?
 
- A dead-simple Python library for benchmarking LLM providers. Write tests once, run them across any model, get beautiful reports.
+ A dead-simple Python library for benchmarking LLM providers. Write tests once, run them across any model, get structured results.
 
  ```python
  benchmark = Benchmark(provider=client, name="my_test")
  benchmark.add_test(TestCase(
+     name="basic_math",
      prompt="What is 2+2?",
      model="gpt-3.5-turbo",
      validator=Contains("4")
@@ -130,9 +131,9 @@ async def main():
      report = await benchmark.run_async()
      summary = report.get_summary()
 
-     print(f"✓ {summary['passed']}/{summary['total']} tests passed")
-     print(f"⚡ {summary['avg_latency_ms']:.0f}ms average")
-     print(f"💰 ${summary['total_cost_usd']:.6f} total cost")
+     print(f"✓ {summary.passed}/{summary.total} tests passed")
+     print(f"⚡ {summary.avg_latency_ms:.0f}ms average")
+     print(f"💰 ${summary.total_cost_usd:.6f} total cost")
 
  asyncio.run(main())
  ```
@@ -146,7 +147,7 @@ python your_script.py
 
  ## What You Get
 
- - [x] **One API for 100+ Models** - OpenRouter support out of the box (OpenAI, Anthropic, Google, etc.)
+ - [x] **100+ Models via OpenRouter** - One client for OpenAI, Anthropic, Google, and more
  - [x] **Smart Validation** - ExactMatch, Contains, Regex, JsonSchema, or write your own
  - [x] **Automatic Retries** - Exponential/linear backoff with configurable attempts
  - [x] **Metrics Tracking** - Latency, tokens, cost - automatically captured
@@ -161,35 +162,42 @@ python your_script.py
  Compare GPT-4 vs Claude on your tasks:
 
  ```python
- from promptum import Benchmark, TestCase, ExactMatch, Contains, Regex
-
- tests = [
-     TestCase(
-         name="json_output",
-         prompt='Output JSON: {"status": "ok"}',
-         model="openai/gpt-4",
-         validator=Regex(r'\{"status":\s*"ok"\}')
-     ),
-     TestCase(
-         name="json_output",
-         prompt='Output JSON: {"status": "ok"}',
-         model="anthropic/claude-3-5-sonnet",
-         validator=Regex(r'\{"status":\s*"ok"\}')
-     ),
-     TestCase(
-         name="creative_writing",
-         prompt="Write a haiku about Python",
-         model="openai/gpt-4",
-         validator=Contains("Python", case_sensitive=False)
-     ),
- ]
-
- benchmark.add_tests(tests)
- report = await benchmark.run_async()
+ import asyncio
+ from promptum import Benchmark, TestCase, Contains, Regex, OpenRouterClient
+
+ async def main():
+     async with OpenRouterClient(api_key="your-key") as client:
+         benchmark = Benchmark(provider=client, name="model_comparison")
+
+         benchmark.add_tests([
+             TestCase(
+                 name="json_output_gpt4",
+                 prompt='Output JSON: {"status": "ok"}',
+                 model="openai/gpt-4",
+                 validator=Regex(r'\{"status":\s*"ok"\}')
+             ),
+             TestCase(
+                 name="json_output_claude",
+                 prompt='Output JSON: {"status": "ok"}',
+                 model="anthropic/claude-3-5-sonnet",
+                 validator=Regex(r'\{"status":\s*"ok"\}')
+             ),
+             TestCase(
+                 name="creative_writing",
+                 prompt="Write a haiku about Python",
+                 model="openai/gpt-4",
+                 validator=Contains("Python", case_sensitive=False)
+             ),
+         ])
 
- # Side-by-side model comparison
- for model, summary in report.compare_models().items():
-     print(f"{model}: {summary['pass_rate']:.0%} pass rate, {summary['avg_latency_ms']:.0f}ms avg")
+         report = await benchmark.run_async()
+
+         # Side-by-side model comparison
+         for model, model_report in report.group_by(lambda r: r.test_case.model).items():
+             summary = model_report.get_summary()
+             print(f"{model}: {summary.pass_rate:.0%} pass rate, {summary.avg_latency_ms:.0f}ms avg")
+
+ asyncio.run(main())
  ```
 
  ---
{promptum-0.0.2 → promptum-0.0.3}/README.md

@@ -6,7 +6,7 @@
  ![Async](https://img.shields.io/badge/Async-First-green?style=for-the-badge)
  ![License: MIT](https://img.shields.io/badge/License-MIT-yellow?style=for-the-badge)
 
- **Benchmark LLMs Like a Pro. In 5 Lines of Code.**
+ **Benchmark LLMs Like a Pro.**
 
  Stop writing boilerplate to test LLMs. Start getting results.
 
@@ -16,11 +16,12 @@ Stop writing boilerplate to test LLMs. Start getting results.
 
  ## What's This?
 
- A dead-simple Python library for benchmarking LLM providers. Write tests once, run them across any model, get beautiful reports.
+ A dead-simple Python library for benchmarking LLM providers. Write tests once, run them across any model, get structured results.
 
  ```python
  benchmark = Benchmark(provider=client, name="my_test")
  benchmark.add_test(TestCase(
+     name="basic_math",
      prompt="What is 2+2?",
      model="gpt-3.5-turbo",
      validator=Contains("4")
@@ -90,9 +91,9 @@ async def main():
      report = await benchmark.run_async()
      summary = report.get_summary()
 
-     print(f"✓ {summary['passed']}/{summary['total']} tests passed")
-     print(f"⚡ {summary['avg_latency_ms']:.0f}ms average")
-     print(f"💰 ${summary['total_cost_usd']:.6f} total cost")
+     print(f"✓ {summary.passed}/{summary.total} tests passed")
+     print(f"⚡ {summary.avg_latency_ms:.0f}ms average")
+     print(f"💰 ${summary.total_cost_usd:.6f} total cost")
 
  asyncio.run(main())
  ```
@@ -106,7 +107,7 @@ python your_script.py
 
  ## What You Get
 
- - [x] **One API for 100+ Models** - OpenRouter support out of the box (OpenAI, Anthropic, Google, etc.)
+ - [x] **100+ Models via OpenRouter** - One client for OpenAI, Anthropic, Google, and more
  - [x] **Smart Validation** - ExactMatch, Contains, Regex, JsonSchema, or write your own
  - [x] **Automatic Retries** - Exponential/linear backoff with configurable attempts
  - [x] **Metrics Tracking** - Latency, tokens, cost - automatically captured
@@ -121,35 +122,42 @@ python your_script.py
  Compare GPT-4 vs Claude on your tasks:
 
  ```python
- from promptum import Benchmark, TestCase, ExactMatch, Contains, Regex
-
- tests = [
-     TestCase(
-         name="json_output",
-         prompt='Output JSON: {"status": "ok"}',
-         model="openai/gpt-4",
-         validator=Regex(r'\{"status":\s*"ok"\}')
-     ),
-     TestCase(
-         name="json_output",
-         prompt='Output JSON: {"status": "ok"}',
-         model="anthropic/claude-3-5-sonnet",
-         validator=Regex(r'\{"status":\s*"ok"\}')
-     ),
-     TestCase(
-         name="creative_writing",
-         prompt="Write a haiku about Python",
-         model="openai/gpt-4",
-         validator=Contains("Python", case_sensitive=False)
-     ),
- ]
-
- benchmark.add_tests(tests)
- report = await benchmark.run_async()
+ import asyncio
+ from promptum import Benchmark, TestCase, Contains, Regex, OpenRouterClient
+
+ async def main():
+     async with OpenRouterClient(api_key="your-key") as client:
+         benchmark = Benchmark(provider=client, name="model_comparison")
+
+         benchmark.add_tests([
+             TestCase(
+                 name="json_output_gpt4",
+                 prompt='Output JSON: {"status": "ok"}',
+                 model="openai/gpt-4",
+                 validator=Regex(r'\{"status":\s*"ok"\}')
+             ),
+             TestCase(
+                 name="json_output_claude",
+                 prompt='Output JSON: {"status": "ok"}',
+                 model="anthropic/claude-3-5-sonnet",
+                 validator=Regex(r'\{"status":\s*"ok"\}')
+             ),
+             TestCase(
+                 name="creative_writing",
+                 prompt="Write a haiku about Python",
+                 model="openai/gpt-4",
+                 validator=Contains("Python", case_sensitive=False)
+             ),
+         ])
 
- # Side-by-side model comparison
- for model, summary in report.compare_models().items():
-     print(f"{model}: {summary['pass_rate']:.0%} pass rate, {summary['avg_latency_ms']:.0f}ms avg")
+         report = await benchmark.run_async()
+
+         # Side-by-side model comparison
+         for model, model_report in report.group_by(lambda r: r.test_case.model).items():
+             summary = model_report.get_summary()
+             print(f"{model}: {summary.pass_rate:.0%} pass rate, {summary.avg_latency_ms:.0f}ms avg")
+
+ asyncio.run(main())
  ```
 
  ---
{promptum-0.0.2 → promptum-0.0.3}/pyproject.toml

@@ -1,6 +1,6 @@
  [project]
  name = "promptum"
- version = "0.0.2"
+ version = "0.0.3"
  description = "Async LLM benchmarking library with protocol-based extensibility"
  readme = "README.md"
  requires-python = ">=3.13"
{promptum-0.0.2 → promptum-0.0.3}/src/promptum/__init__.py

@@ -1,4 +1,4 @@
- from promptum.benchmark import Benchmark, Report, Runner, TestCase, TestResult
+ from promptum.benchmark import Benchmark, Report, Runner, Summary, TestCase, TestResult
  from promptum.providers import LLMProvider, Metrics, OpenRouterClient, RetryConfig, RetryStrategy
  from promptum.validation import (
      Contains,
@@ -8,11 +8,12 @@ from promptum.validation import (
      Validator,
  )
 
- __version__ = "0.0.1"
+ __version__ = "0.0.3"
 
  __all__ = [
      "TestCase",
      "TestResult",
+     "Summary",
      "Metrics",
      "RetryConfig",
      "RetryStrategy",
{promptum-0.0.2 → promptum-0.0.3}/src/promptum/benchmark/__init__.py

@@ -2,6 +2,7 @@ from promptum.benchmark.benchmark import Benchmark
  from promptum.benchmark.report import Report
  from promptum.benchmark.result import TestResult
  from promptum.benchmark.runner import Runner
+ from promptum.benchmark.summary import Summary
  from promptum.benchmark.test_case import TestCase
 
- __all__ = ["Benchmark", "Report", "Runner", "TestCase", "TestResult"]
+ __all__ = ["Benchmark", "Report", "Runner", "Summary", "TestCase", "TestResult"]
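With `Summary` re-exported from both `promptum.benchmark` and the package root (per the `__init__.py` hunks above), downstream code can type-annotate report handling directly against it. A minimal sketch assuming only the exports shown in this diff; `format_summary` is a hypothetical user-side helper, not part of the package:

```python
from promptum import Report, Summary


def format_summary(report: Report) -> str:
    """Render a one-line digest of a benchmark report (illustrative helper)."""
    summary: Summary = report.get_summary()  # returns a Summary dataclass as of 0.0.3
    return f"{summary.passed}/{summary.total} passed ({summary.pass_rate:.0%})"
```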
{promptum-0.0.2 → promptum-0.0.3}/src/promptum/benchmark/report.py

@@ -1,15 +1,15 @@
  from collections.abc import Callable, Sequence
  from dataclasses import dataclass
- from typing import Any
 
  from promptum.benchmark.result import TestResult
+ from promptum.benchmark.summary import Summary
 
 
  @dataclass(frozen=True, slots=True)
  class Report:
      results: Sequence[TestResult]
 
-     def get_summary(self) -> dict[str, Any]:
+     def get_summary(self) -> Summary:
          total = len(self.results)
          passed = sum(1 for r in self.results if r.passed)
 
@@ -17,18 +17,17 @@ class Report:
          total_cost = sum(r.metrics.cost_usd or 0 for r in self.results if r.metrics)
          total_tokens = sum(r.metrics.total_tokens or 0 for r in self.results if r.metrics)
 
-         return {
-             "total": total,
-             "passed": passed,
-             "failed": total - passed,
-             "pass_rate": passed / total if total > 0 else 0,
-             "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0,
-             "p50_latency_ms": self._percentile(latencies, 0.5) if latencies else 0,
-             "p95_latency_ms": self._percentile(latencies, 0.95) if latencies else 0,
-             "p99_latency_ms": self._percentile(latencies, 0.99) if latencies else 0,
-             "total_cost_usd": total_cost,
-             "total_tokens": total_tokens,
-         }
+         return Summary(
+             total=total,
+             passed=passed,
+             failed=total - passed,
+             pass_rate=passed / total if total > 0 else 0,
+             avg_latency_ms=sum(latencies) / len(latencies) if latencies else 0,
+             min_latency_ms=min(latencies) if latencies else 0,
+             max_latency_ms=max(latencies) if latencies else 0,
+             total_cost_usd=total_cost,
+             total_tokens=total_tokens,
+         )
 
      def filter(
          self,
@@ -60,15 +59,3 @@ class Report:
              groups[group_key].append(result)
 
          return {k: Report(results=v) for k, v in groups.items()}
-
-     def compare_models(self) -> dict[str, dict[str, Any]]:
-         by_model = self.group_by(lambda r: r.test_case.model)
-         return {model: report.get_summary() for model, report in by_model.items()}
-
-     @staticmethod
-     def _percentile(values: list[float], p: float) -> float:
-         if not values:
-             return 0
-         sorted_values = sorted(values)
-         index = int((len(sorted_values) - 1) * p)
-         return sorted_values[index]
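`Report.compare_models()` and the p50/p95/p99 percentile fields are removed in 0.0.3. Callers that relied on the per-model comparison can rebuild it from `group_by()` and the new `Summary`-returning `get_summary()`, both of which remain in the diff above. A minimal sketch under that assumption; `compare_models` here is a user-side helper, not a library method:

```python
from promptum import Report, Summary


def compare_models(report: Report) -> dict[str, Summary]:
    """User-side replacement for the removed Report.compare_models()."""
    # Group results by the model recorded on each test case, then summarise
    # each per-model sub-report.
    by_model = report.group_by(lambda r: r.test_case.model)
    return {model: sub.get_summary() for model, sub in by_model.items()}
```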
promptum-0.0.3/src/promptum/benchmark/summary.py

@@ -0,0 +1,14 @@
+ from dataclasses import dataclass
+
+
+ @dataclass(frozen=True, slots=True)
+ class Summary:
+     total: int
+     passed: int
+     failed: int
+     pass_rate: float
+     avg_latency_ms: float
+     min_latency_ms: float
+     max_latency_ms: float
+     total_cost_usd: float
+     total_tokens: int
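Because `Summary` is a frozen, slotted dataclass rather than the 0.0.2 dict, results are read by attribute; code still written against the old dict shape can get a mapping via `dataclasses.asdict`. A small sketch assuming only the fields declared above, with made-up numbers for illustration:

```python
from dataclasses import asdict

from promptum import Summary

summary = Summary(
    total=3, passed=2, failed=1, pass_rate=2 / 3,
    avg_latency_ms=123.3, min_latency_ms=100.0, max_latency_ms=150.0,
    total_cost_usd=0.045, total_tokens=420,
)

print(summary.pass_rate)           # attribute access replaces summary["pass_rate"]
print(asdict(summary)["passed"])   # dict view for code still expecting the 0.0.2 shape
```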
{promptum-0.0.2 → promptum-0.0.3}/tests/benchmark/test_report_filtering.py

@@ -42,10 +42,3 @@ def test_report_group_by_model(sample_report: Report) -> None:
      assert len(grouped["model2"].results) == 1
 
 
- def test_report_compare_models(sample_report: Report) -> None:
-     comparison = sample_report.compare_models()
-
-     assert "model1" in comparison
-     assert "model2" in comparison
-     assert comparison["model1"]["total"] == 2
-     assert comparison["model2"]["total"] == 1
promptum-0.0.3/tests/benchmark/test_report_summary.py

@@ -0,0 +1,28 @@
+ from promptum.benchmark import Report
+
+
+ def test_report_summary(sample_report: Report) -> None:
+     summary = sample_report.get_summary()
+
+     assert summary.total == 3
+     assert summary.passed == 2
+     assert summary.failed == 1
+     assert summary.pass_rate == 2 / 3
+     assert summary.total_cost_usd == 0.045
+     assert summary.avg_latency_ms == 123.33333333333333
+     assert summary.min_latency_ms == 100.0
+     assert summary.max_latency_ms == 150.0
+
+
+ def test_report_summary_empty() -> None:
+     report = Report(results=[])
+     summary = report.get_summary()
+
+     assert summary.total == 0
+     assert summary.passed == 0
+     assert summary.failed == 0
+     assert summary.pass_rate == 0
+     assert summary.total_cost_usd == 0
+     assert summary.avg_latency_ms == 0
+     assert summary.min_latency_ms == 0
+     assert summary.max_latency_ms == 0
{promptum-0.0.2 → promptum-0.0.3}/uv.lock

@@ -168,7 +168,7 @@ wheels = [
 
  [[package]]
  name = "promptum"
- version = "0.0.1"
+ version = "0.0.3"
  source = { editable = "." }
  dependencies = [
      { name = "httpx" },
@@ -1,24 +0,0 @@
1
- from promptum.benchmark import Report
2
-
3
-
4
- def test_report_summary(sample_report: Report) -> None:
5
- summary = sample_report.get_summary()
6
-
7
- assert summary["total"] == 3
8
- assert summary["passed"] == 2
9
- assert summary["failed"] == 1
10
- assert summary["pass_rate"] == 2 / 3
11
- assert summary["total_cost_usd"] == 0.045
12
- assert summary["avg_latency_ms"] == 123.33333333333333
13
-
14
-
15
- def test_report_summary_empty() -> None:
16
- report = Report(results=[])
17
- summary = report.get_summary()
18
-
19
- assert summary["total"] == 0
20
- assert summary["passed"] == 0
21
- assert summary["failed"] == 0
22
- assert summary["pass_rate"] == 0
23
- assert summary["total_cost_usd"] == 0
24
- assert summary["avg_latency_ms"] == 0