inferencebench-code 0.0.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

inferencebench_code-0.0.2/.gitignore ADDED

@@ -0,0 +1,137 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# uv / virtualenv
+.venv/
+venv/
+env/
+ENV/
+uv.lock.tmp
+.python-version
+# Testing / coverage
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+htmlcov/
+# Type checking
+.mypy_cache/
+.dmypy.json
+dmypy.json
+.pyre/
+.pytype/
+# Ruff
+.ruff_cache/
+# IDE / editor
+.idea/
+.vscode/
+*.swp
+*.swo
+*~
+.DS_Store
+# OS
+Thumbs.db
+desktop.ini
+# Secrets / env
+.env
+.env.*
+!.env.example
+.envrc
+# Bench-specific local caches
+~/.cache/inferencebench/
+.cache/inferencebench/
+.inferencebench/
+# Sigstore dev keys (never commit private keys)
+cosign.key
+cosign-*.key
+cosign-*.pub
+.bench/*.key
+# Local benchmark working dirs (kept local; published outputs land under validation-runs/)
+envelopes-voice/
+envelopes-*/
+*.pem
+!tests/fixtures/**/*.pem
+# Real-GPU validation artifacts (kept locally, never pushed)
+# Use slash-star (not trailing slash) so individual subpaths can be re-included below.
+validation-runs/*
+# ...except the canonical published marathon corpus — small, public, used by docs + CI
+!validation-runs/2026-05-18-multi-vendor-marathon
+validation-runs/2026-05-18-multi-vendor-marathon/*
+!validation-runs/2026-05-18-multi-vendor-marathon/marathon
+validation-runs/2026-05-18-multi-vendor-marathon/marathon/*
+!validation-runs/2026-05-18-multi-vendor-marathon/marathon/all
+!validation-runs/2026-05-18-multi-vendor-marathon/marathon/all/*.json
+# Voice ASR validation envelopes (small, public, used by leaderboard build)
+!validation-runs/2026-05-25-voice-rtx4000ada
+!validation-runs/2026-05-25-voice-rtx4000ada/*.json
+!validation-runs/2026-05-29-voice-testbm-h100
+!validation-runs/2026-05-29-voice-testbm-h100/*.json
+# Model weights / datasets (use Git LFS or S3)
+*.bin
+*.safetensors
+*.pt
+*.pth
+*.gguf
+*.onnx
+*.parquet
+!tests/fixtures/**/*.parquet
+# Logs
+*.log
+logs/
+# Documentation build
+docs/_build/
+site/
+# Internal-only files (Claude Code context + planning) — kept locally, not pushed
+/CLAUDE.md
+/INDEX.md
+/PROJECT_PLAN.md
+/CONVENTIONS.md
+/HUMAN_REVIEW_GATES.md
+**/CLAUDE.md
+memory/
+skills/
+agents/
+.claude/
+TICKETS/

inferencebench_code-0.0.2/PKG-INFO ADDED

@@ -0,0 +1,68 @@
+Metadata-Version: 2.4
+Name: inferencebench-code
+Version: 0.0.2
+Summary: Code-generation plugin for InferenceBench Suite (HumanEval-style execution-based scoring).
+Project-URL: Homepage, https://github.com/yobitelcomm/bench
+Author-email: Yobitel Communications <bench@yobitel.com>
+License: Apache-2.0
+Keywords: ai,benchmark,code-generation,humaneval,llm,ml
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.12
+Requires-Dist: inferencebench-envelope
+Requires-Dist: inferencebench-harness
+Requires-Dist: pydantic~=2.9
+Requires-Dist: pyyaml~=6.0
+Description-Content-Type: text/markdown
+# inferencebench-code
+Code-generation plugin for the InferenceBench Suite.
+HumanEval-style execution-based benchmarks: the plugin sends a function-signature
+prompt to the model, extracts the Python code from its response, executes it
+against bundled unit tests in a subprocess, and reports `pass_at_1`.
+Suite ID: `code.generation`
+Bundled benchmarks:
+- `code.generation.humaneval-mini` — 5 stdlib-only Python tasks, `pass_at_1`
+  scoring with a 5-second per-task wall-clock timeout.
+## SAFETY WARNING — read before running
+**This plugin executes model-generated code.** Every run prints a yellow banner
+reminding you of that. The execution layer is *best-effort* defence-in-depth,
+not a real sandbox:
+- Each task's solution + tests are written to a temp file and invoked with
+  `python -I` (isolated mode) under a `subprocess.run(timeout=...)` wall clock.
+- A cheap substring pre-scan refuses any solution that contains `subprocess`,
+  `os.system`, `socket`, `urllib`, `multiprocessing`, or `ctypes`.
+- The bundled fixtures are stdlib-only, no I/O, no network.
+This is **deliberately not airtight**. Phase 2 adds real isolation (firejail /
+nsjail / container-per-task). Until then: only run code-generation benchmarks
+against models you trust, on machines you can afford to throw away, and never
+replace the bundled fixtures with untrusted input.
+## Metrics
+The envelope's `metrics` block includes:
+| Metric                | Direction        | Meaning                                   |
+| --------------------- | ---------------- | ----------------------------------------- |
+| `pass_at_1`           | higher is better | mean of per-task passed booleans          |
+| `pass_at_1_p05/50/95` | higher is better | bootstrap quantiles of per-sample scores  |
+| `timeout_rate`        | lower is better  | fraction of tasks that hit the wall clock |
+| `ttft_p50_ms`         | -                | model time-to-first-token, median         |
+| `total_p50_ms`        | -                | model total request time, median          |
+| `tokens_out_total`    | -                | total generated tokens across the run     |
+| `ok_rate`             | -                | fraction of model calls that succeeded    |
+| `n_samples`           | -                | fixture row count                         |
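
The scoring rows of the metrics table can be illustrated with a short sketch. This is not the plugin's code: `summarize`, its parameters, and the quantile rule are assumptions made for illustration, and "bootstrap quantiles of per-sample scores" is read here as quantiles of the resampled mean, which is only one plausible interpretation.

```python
import random
from statistics import mean


def summarize(
    passed: list[bool],
    timed_out: list[bool],
    n_boot: int = 1000,
    seed: int = 0,
) -> dict[str, float]:
    """Turn per-task outcomes into pass_at_1, its bootstrap quantiles, and timeout_rate.

    Hypothetical helper: the real plugin may compute these differently.
    """
    if not passed or len(passed) != len(timed_out):
        raise ValueError("need one passed flag and one timed_out flag per task")

    scores = [1.0 if p else 0.0 for p in passed]
    n = len(scores)

    # Resample the per-task scores with replacement and record each resample's mean.
    rng = random.Random(seed)
    boot = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_boot))

    def quantile(p: float) -> float:
        # Nearest-rank style lookup into the sorted bootstrap means.
        return boot[min(n_boot - 1, int(p * n_boot))]

    return {
        "pass_at_1": mean(scores),
        "pass_at_1_p05": quantile(0.05),
        "pass_at_1_p50": quantile(0.50),
        "pass_at_1_p95": quantile(0.95),
        "timeout_rate": sum(timed_out) / n,
        "n_samples": float(n),
    }
```

With only five fixture rows the bootstrap interval is necessarily wide, so the quantiles are more useful as a reminder of the small sample size than as a precise estimate.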

inferencebench_code-0.0.2/README.md ADDED

@@ -0,0 +1,46 @@
+# inferencebench-code
+Code-generation plugin for the InferenceBench Suite.
+HumanEval-style execution-based benchmarks: the plugin sends a function-signature
+prompt to the model, extracts the Python code from its response, executes it
+against bundled unit tests in a subprocess, and reports `pass_at_1`.
+Suite ID: `code.generation`
+Bundled benchmarks:
+- `code.generation.humaneval-mini` — 5 stdlib-only Python tasks, `pass_at_1`
+  scoring with a 5-second per-task wall-clock timeout.
+## SAFETY WARNING — read before running
+**This plugin executes model-generated code.** Every run prints a yellow banner
+reminding you of that. The execution layer is *best-effort* defence-in-depth,
+not a real sandbox:
+- Each task's solution + tests are written to a temp file and invoked with
+  `python -I` (isolated mode) under a `subprocess.run(timeout=...)` wall clock.
+- A cheap substring pre-scan refuses any solution that contains `subprocess`,
+  `os.system`, `socket`, `urllib`, `multiprocessing`, or `ctypes`.
+- The bundled fixtures are stdlib-only, no I/O, no network.
+This is **deliberately not airtight**. Phase 2 adds real isolation (firejail /
+nsjail / container-per-task). Until then: only run code-generation benchmarks
+against models you trust, on machines you can afford to throw away, and never
+replace the bundled fixtures with untrusted input.
+## Metrics
+The envelope's `metrics` block includes:
+| Metric                | Direction        | Meaning                                   |
+| --------------------- | ---------------- | ----------------------------------------- |
+| `pass_at_1`           | higher is better | mean of per-task passed booleans          |
+| `pass_at_1_p05/50/95` | higher is better | bootstrap quantiles of per-sample scores  |
+| `timeout_rate`        | lower is better  | fraction of tasks that hit the wall clock |
+| `ttft_p50_ms`         | -                | model time-to-first-token, median         |
+| `total_p50_ms`        | -                | model total request time, median          |
+| `tokens_out_total`    | -                | total generated tokens across the run     |
+| `ok_rate`             | -                | fraction of model calls that succeeded    |
+| `n_samples`           | -                | fixture row count                         |
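
The execution layer described in the README's safety warning (substring pre-scan, temp file, `python -I`, `subprocess.run(timeout=...)`) might look roughly like the sketch below. The function name `run_task`, the status strings, and the exact blocklist handling are assumptions; the package's actual implementation is not shown in this diff. Like the original, this is not a sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Substrings copied from the README's list; the real pre-scan may differ.
BLOCKED = ("subprocess", "os.system", "socket", "urllib", "multiprocessing", "ctypes")


def run_task(solution: str, tests: str, timeout_s: float = 5.0) -> str:
    """Return 'refused', 'timeout', 'passed', or 'failed' for one task (hypothetical API)."""
    # Cheap pre-scan: a plain substring check, trivially bypassable by design.
    if any(token in solution for token in BLOCKED):
        return "refused"

    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "task.py"
        script.write_text(solution + "\n" + tests, encoding="utf-8")
        try:
            # -I runs the interpreter in isolated mode (no user site, no PYTHON* env vars).
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return "timeout"

    # A failing assert raises AssertionError, so the child exits non-zero.
    return "passed" if proc.returncode == 0 else "failed"
```

Isolated mode and a wall-clock limit stop accidental interference and runaway loops, but generated code can still read and write files or exhaust memory, which is why the README defers real isolation to a later phase.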

inferencebench_code-0.0.2/pyproject.toml ADDED

@@ -0,0 +1,43 @@
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[project]
+name = "inferencebench-code"
+version = "0.0.2"
+description = "Code-generation plugin for InferenceBench Suite (HumanEval-style execution-based scoring)."
+readme = "README.md"
+requires-python = ">=3.12"
+license = { text = "Apache-2.0" }
+authors = [
+    { name = "Yobitel Communications", email = "bench@yobitel.com" },
+]
+keywords = ["benchmark", "llm", "code-generation", "humaneval", "ai", "ml"]
+classifiers = [
+    "Development Status :: 2 - Pre-Alpha",
+    "Intended Audience :: Developers",
+    "Intended Audience :: Science/Research",
+    "License :: OSI Approved :: Apache Software License",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.12",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+]
+dependencies = [
+    "inferencebench-envelope",
+    "inferencebench-harness",
+    "pydantic~=2.9",
+    "pyyaml~=6.0",
+]
+[project.entry-points."inferencebench.plugins"]
+"code.generation" = "inferencebench_code.plugin:CodeGenerationPlugin"
+[project.urls]
+Homepage = "https://github.com/yobitelcomm/bench"
+[tool.hatch.build.targets.wheel]
+packages = ["src/inferencebench_code"]
+[tool.uv.sources]
+inferencebench-envelope = { workspace = true }
+inferencebench-harness = { workspace = true }

inferencebench_code-0.0.2/src/inferencebench_code/__init__.py ADDED

@@ -0,0 +1,12 @@
+"""InferenceBench code-generation plugin."""
+from inferencebench_code.plugin import EXPECTED_METRICS, CodeGenerationPlugin
+from inferencebench_code.schemas import BenchmarkSpec, EngineKind, RunContext
+__all__ = [
+    "EXPECTED_METRICS",
+    "BenchmarkSpec",
+    "CodeGenerationPlugin",
+    "EngineKind",
+    "RunContext",
+]

inferencebench_code-0.0.2/src/inferencebench_code/benchmarks/humaneval-mini.yaml ADDED

@@ -0,0 +1,15 @@
+benchmark_id: code.generation.humaneval-mini
+suite_version: 1.0.0
+description: Five stdlib-only Python tasks, pass@1 with a 5-second wall-clock timeout.
+modality: code
+kind: generation
+dataset:
+  id: builtin-humaneval-mini
+  path: humaneval-mini.jsonl
+slo_template: code.generation.standard
+warmup:
+  discard_runs: 0
+language: python
+scoring: pass_at_1
+k: 1
+timeout_s: 5.0

inferencebench_code-0.0.2/src/inferencebench_code/benchmarks/mbpp-mini.yaml ADDED

@@ -0,0 +1,15 @@
+benchmark_id: code.generation.mbpp-mini
+suite_version: 1.0.0
+description: Mostly Basic Python Problems (MBPP)-style stdlib-only tasks, pass@1 with a 5-second wall-clock timeout.
+modality: code
+kind: generation
+dataset:
+  id: builtin-mbpp-mini
+  path: mbpp-mini.jsonl
+slo_template: code.generation.standard
+warmup:
+  discard_runs: 0
+language: python
+scoring: pass_at_1
+k: 1
+timeout_s: 5.0
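
Both benchmark specs share the same keys, so a loader can validate them with one small model. The package depends on `pydantic` and `pyyaml`, and its `BenchmarkSpec` schema is not shown here; the sketch below is a stdlib-only stand-in in which `CodeBenchmarkSketch` and its checks are assumptions, and the input mapping mirrors a subset of what a YAML parser would produce for `mbpp-mini.yaml`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeBenchmarkSketch:
    """Hypothetical, trimmed-down view of a code.generation benchmark spec."""

    benchmark_id: str
    dataset_path: str
    language: str
    scoring: str
    k: int
    timeout_s: float

    @classmethod
    def from_mapping(cls, raw: dict) -> "CodeBenchmarkSketch":
        spec = cls(
            benchmark_id=str(raw["benchmark_id"]),
            dataset_path=str(raw["dataset"]["path"]),
            language=str(raw["language"]),
            scoring=str(raw["scoring"]),
            k=int(raw["k"]),
            timeout_s=float(raw["timeout_s"]),
        )
        # Illustrative sanity checks; the real schema may enforce other rules.
        if not spec.benchmark_id.startswith("code.generation."):
            raise ValueError(f"unexpected benchmark_id: {spec.benchmark_id}")
        if spec.k < 1 or spec.timeout_s <= 0:
            raise ValueError("k must be >= 1 and timeout_s must be positive")
        return spec


# Subset of the keys in mbpp-mini.yaml, written as the parsed mapping.
raw = {
    "benchmark_id": "code.generation.mbpp-mini",
    "dataset": {"id": "builtin-mbpp-mini", "path": "mbpp-mini.jsonl"},
    "language": "python",
    "scoring": "pass_at_1",
    "k": 1,
    "timeout_s": 5.0,
}
spec = CodeBenchmarkSketch.from_mapping(raw)
```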

inferencebench_code-0.0.2/src/inferencebench_code/datasets/humaneval-mini.jsonl ADDED

@@ -0,0 +1,5 @@
+{"task_id": "add_two_numbers", "entry_point": "add", "prompt": "def add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n", "canonical_solution": "def add(a, b):\n    return a + b\n", "tests": "assert add(1, 2) == 3\nassert add(0, 0) == 0\nassert add(-1, 1) == 0\nassert add(-5, -7) == -12\nassert add(100, 250) == 350\nassert add(2.5, 0.5) == 3.0\nassert add(-100, 100) == 0\n"}
+{"task_id": "reverse_string", "entry_point": "reverse_string", "prompt": "def reverse_string(s):\n    \"\"\"Return the reverse of the string s.\"\"\"\n", "canonical_solution": "def reverse_string(s):\n    return s[::-1]\n", "tests": "assert reverse_string('') == ''\nassert reverse_string('a') == 'a'\nassert reverse_string('hello') == 'olleh'\nassert reverse_string('abcde') == 'edcba'\nassert reverse_string('  spaces  ') == '  secaps  '\nassert reverse_string('racecar') == 'racecar'\nassert reverse_string('AbC') == 'CbA'\n"}
+{"task_id": "fibonacci_iter", "entry_point": "fib", "prompt": "def fib(n):\n    \"\"\"Return the n-th Fibonacci number (0-indexed: fib(0)=0, fib(1)=1).\"\"\"\n", "canonical_solution": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n", "tests": "assert fib(0) == 0\nassert fib(1) == 1\nassert fib(2) == 1\nassert fib(3) == 2\nassert fib(5) == 5\nassert fib(10) == 55\nassert fib(15) == 610\n"}
+{"task_id": "count_vowels", "entry_point": "count_vowels", "prompt": "def count_vowels(s):\n    \"\"\"Count the number of vowels (a, e, i, o, u) in the lowercase string s.\"\"\"\n", "canonical_solution": "def count_vowels(s):\n    return sum(1 for c in s if c in 'aeiou')\n", "tests": "assert count_vowels('') == 0\nassert count_vowels('bcdfg') == 0\nassert count_vowels('aeiou') == 5\nassert count_vowels('hello') == 2\nassert count_vowels('python') == 1\nassert count_vowels('queue') == 4\nassert count_vowels('rhythm') == 0\n"}
+{"task_id": "is_palindrome", "entry_point": "is_palindrome", "prompt": "def is_palindrome(s):\n    \"\"\"Return True if s is a palindrome (case-insensitive). Empty string is a palindrome.\"\"\"\n", "canonical_solution": "def is_palindrome(s):\n    t = s.lower()\n    return t == t[::-1]\n", "tests": "assert is_palindrome('') is True\nassert is_palindrome('a') is True\nassert is_palindrome('racecar') is True\nassert is_palindrome('Racecar') is True\nassert is_palindrome('hello') is False\nassert is_palindrome('Level') is True\nassert is_palindrome('abba') is True\n"}
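
Each fixture row carries a `canonical_solution` and a `tests` string, so the dataset can be self-checked by running one against the other. The sketch below is an assumption about how such a check could be written, not code from the package; `check_row` is a hypothetical name, and `ROW` is an abbreviated copy of the first row (prompt omitted, fewer asserts). It uses in-process `exec`, which is acceptable only for trusted fixture content, never for model output.

```python
import json

# Abbreviated copy of the add_two_numbers row; a raw string keeps the JSON escapes intact.
ROW = r'{"task_id": "add_two_numbers", "entry_point": "add", "canonical_solution": "def add(a, b):\n    return a + b\n", "tests": "assert add(1, 2) == 3\nassert add(-5, -7) == -12\n"}'

REQUIRED = {"task_id", "entry_point", "canonical_solution", "tests"}


def check_row(line: str) -> bool:
    """Return True if the row's canonical solution passes the row's own tests."""
    row = json.loads(line)
    missing = REQUIRED - row.keys()
    if missing:
        raise KeyError(f"row is missing fields: {sorted(missing)}")

    namespace: dict[str, object] = {}
    try:
        # Define the function, then run the assert statements in the same namespace.
        exec(row["canonical_solution"] + "\n" + row["tests"], namespace)
    except AssertionError:
        return False
    # The declared entry point should exist and be callable.
    return callable(namespace.get(row["entry_point"]))
```

Running this over every line of a fixture file would catch a row whose tests disagree with its reference solution before that row is used to score a model.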

inferencebench_code-0.0.2/src/inferencebench_code/datasets/mbpp-mini.jsonl ADDED

@@ -0,0 +1,5 @@
+{"task_id": "mbpp-001", "prompt": "Write a function sum_list(items) that returns the sum of a list of integers.", "tests": "assert sum_list([1,2,3]) == 6\nassert sum_list([]) == 0\nassert sum_list([-1, -2, 3]) == 0\nassert sum_list([10, 20, 30, 40]) == 100\n", "canonical_solution": "def sum_list(items):\n    return sum(items)\n", "entry_point": "sum_list"}
+{"task_id": "mbpp-002", "prompt": "Write a function max_of_three(a, b, c) that returns the largest of three numbers.", "tests": "assert max_of_three(1, 2, 3) == 3\nassert max_of_three(5, 2, 4) == 5\nassert max_of_three(-1, -2, -3) == -1\nassert max_of_three(7, 7, 7) == 7\n", "canonical_solution": "def max_of_three(a, b, c):\n    return max(a, b, c)\n", "entry_point": "max_of_three"}
+{"task_id": "mbpp-003", "prompt": "Write a function count_evens(nums) that returns the count of even integers in the list nums.", "tests": "assert count_evens([1, 2, 3, 4]) == 2\nassert count_evens([]) == 0\nassert count_evens([2, 4, 6, 8]) == 4\nassert count_evens([1, 3, 5]) == 0\n", "canonical_solution": "def count_evens(nums):\n    return sum(1 for n in nums if n % 2 == 0)\n", "entry_point": "count_evens"}
+{"task_id": "mbpp-004", "prompt": "Write a function gcd(a, b) that returns the greatest common divisor of two non-negative integers.", "tests": "assert gcd(12, 18) == 6\nassert gcd(7, 5) == 1\nassert gcd(100, 75) == 25\nassert gcd(0, 9) == 9\n", "canonical_solution": "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n", "entry_point": "gcd"}
+{"task_id": "mbpp-005", "prompt": "Write a function unique_sorted(items) that returns a sorted list of the unique values in items.", "tests": "assert unique_sorted([3, 1, 2, 3, 1]) == [1, 2, 3]\nassert unique_sorted([]) == []\nassert unique_sorted([5]) == [5]\nassert unique_sorted([2, 2, 2, 2]) == [2]\n", "canonical_solution": "def unique_sorted(items):\n    return sorted(set(items))\n", "entry_point": "unique_sorted"}
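
Unlike the HumanEval-style rows, these prompts are natural-language instructions, so a model's reply will often wrap the function in a Markdown code fence with surrounding prose. The README says the plugin "extracts the Python code from its response" but the diff does not show how; the sketch below is one simple way to do it, with `extract_code` and the fallback behaviour being assumptions.

```python
import re

# Built at runtime so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# Optional "python"/"py" language tag, then everything up to the next fence.
_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code(response: str) -> str:
    """Return the first fenced block's contents, or the whole response if unfenced."""
    match = _BLOCK.search(response)
    code = match.group(1) if match else response
    return code.strip() + "\n"


reply = (
    "Here is the function:\n"
    + FENCE + "python\n"
    + "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n"
    + FENCE + "\nLet me know if you need anything else."
)
code = extract_code(reply)
```

Taking only the first fenced block is a deliberate simplification; a reply that spreads its answer over several blocks would need a different rule.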