PyPI - nl-code - Versions diffs - 0.4.1__tar.gz → 0.5.0__tar.gz - Mend

nl-code 0.4.1__tar.gz → 0.5.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (145)

nl_code-0.5.0/.env.example ADDED

@@ -0,0 +1,27 @@
+# Database
+DR_LLM_DATABASE_URL=postgresql://localhost/nl_latents
+# Optional API credentials
+# OPENROUTER_API_KEY=
+# GOOGLE_API_KEY=
+# Code execution — worker limits (inside container/subprocess)
+DR_DOCKER_WORKER_MAX_STDIN_BYTES=52428800
+DR_DOCKER_WORKER_MAX_STDOUT_BYTES=1048576
+DR_DOCKER_WORKER_CPU_SECONDS=20
+DR_DOCKER_WORKER_MEMORY_BYTES=4294967296
+DR_DOCKER_WORKER_FILE_BYTES=10485760
+DR_DOCKER_WORKER_NPROC=256
+DR_DOCKER_WORKER_SKIP_LIMITS=false
+# Code execution — runner limits (host side)
+NL_CODE_EVAL_WORKER_MAX_STDOUT_BYTES=1048576
+NL_CODE_EVAL_WORKER_MAX_STDERR_BYTES=1048576
+# Code execution — Docker container limits
+DR_DOCKER_MEMORY=4g
+DR_DOCKER_CPUS=1.0
+DR_DOCKER_PIDS_LIMIT=256
+DR_DOCKER_NOFILE=1024
+DR_DOCKER_FSIZE_BYTES=10485760
+DR_DOCKER_TMPFS_SIZE=64m
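
Reviewer note: the limits above mix raw byte counts (`DR_DOCKER_WORKER_MEMORY_BYTES=4294967296`) with Docker-style size strings (`DR_DOCKER_MEMORY=4g`). The sketch below is not taken from the package. `parse_size` and `env_int` are hypothetical helpers, and the 1024-based multipliers are an assumption. It only illustrates that the two notations in this file describe the same quantities.

```python
import os

# Hypothetical helpers, not part of nl-code or dr-docker.
# Assumption: size suffixes use binary (1024-based) multipliers.
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(value: str) -> int:
    """Convert a size string such as "64m" or "4g" to a byte count."""
    text = value.strip().lower()
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)  # already a plain byte count, e.g. "1048576"


def env_int(name: str, default: int) -> int:
    """Read an integer limit from the environment, or fall back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw else default


if __name__ == "__main__":
    # Under the stated assumption, the container limit equals the worker limit.
    print(parse_size("4g") == 4294967296)
    print(env_int("DR_DOCKER_WORKER_CPU_SECONDS", 20))
```

Under that assumption, `DR_DOCKER_MEMORY=4g` and `DR_DOCKER_WORKER_MEMORY_BYTES=4294967296` are the same limit expressed twice, once for the container and once for the worker process.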

{nl_code-0.4.1 → nl_code-0.5.0}/.gitignore RENAMED

@@ -1,3 +1,4 @@
+.env
 # Python
 __pycache__/
 *.pyc

{nl_code-0.4.1 → nl_code-0.5.0}/AGENTS.md RENAMED

@@ -11,8 +11,12 @@ uv run ruff format .
 uv run ruff check .
 uv run ty check
 uv run pytest
+uv run nl-code-test docker
 ```
+`uv run pytest` runs the default non-Docker suite. Run `uv run nl-code-test docker`
+separately to execute the `@pytest.mark.docker` integration tests.
 ### Frontend (from ui/dataset-explorer/frontend/)
 ```bash

{nl_code-0.4.1 → nl_code-0.5.0}/CLAUDE.md RENAMED

@@ -11,8 +11,12 @@ uv run ruff format .
 uv run ruff check .
 uv run ty check
 uv run pytest
+uv run nl-code-test docker
 ```
+`uv run pytest` runs the default non-Docker suite. Run `uv run nl-code-test docker`
+separately to execute the `@pytest.mark.docker` integration tests.
 ### Frontend (from ui/dataset-explorer/frontend/)
 ```bash

nl_code-0.5.0/PKG-INFO ADDED

@@ -0,0 +1,132 @@
+Metadata-Version: 2.4
+Name: nl-code
+Version: 0.5.0
+Summary: Primitives for research into LLMs and code
+Author-email: Danielle Rothermel <danielle.rothermel@gmail.com>
+Requires-Python: >=3.12
+Requires-Dist: datasets>=3.0.0
+Requires-Dist: dr-docker==0.4.5
+Requires-Dist: fastapi>=0.135.3
+Requires-Dist: pydantic>=2.12.0
+Requires-Dist: python-dotenv>=1.2.2
+Requires-Dist: typer>=0.24.1
+Requires-Dist: uvicorn[standard]>=0.44.0
+Provides-Extra: bigcodebench
+Requires-Dist: beautifulsoup4>=4.12; extra == 'bigcodebench'
+Requires-Dist: gensim>=4.3; extra == 'bigcodebench'
+Requires-Dist: holidays>=0.60; extra == 'bigcodebench'
+Requires-Dist: matplotlib>=3.9; extra == 'bigcodebench'
+Requires-Dist: nltk>=3.9; extra == 'bigcodebench'
+Requires-Dist: numpy>=1.26; extra == 'bigcodebench'
+Requires-Dist: openpyxl>=3.1; extra == 'bigcodebench'
+Requires-Dist: pandas>=2.2; extra == 'bigcodebench'
+Requires-Dist: pypdf2>=3.0; extra == 'bigcodebench'
+Requires-Dist: python-dateutil>=2.9; extra == 'bigcodebench'
+Requires-Dist: python-docx>=1.1; extra == 'bigcodebench'
+Requires-Dist: pytz>=2024.1; extra == 'bigcodebench'
+Requires-Dist: regex>=2024.4; extra == 'bigcodebench'
+Requires-Dist: reportlab>=4.2; extra == 'bigcodebench'
+Requires-Dist: scikit-learn>=1.5; extra == 'bigcodebench'
+Requires-Dist: scipy>=1.14; extra == 'bigcodebench'
+Requires-Dist: seaborn>=0.13; extra == 'bigcodebench'
+Requires-Dist: statsmodels>=0.14; extra == 'bigcodebench'
+Provides-Extra: docker
+Requires-Dist: dr-docker>=0.4.5; extra == 'docker'
+Description-Content-Type: text/markdown
+# nl-code
+Primitives for research into LLMs and code generation. Provides dataset loading, code execution (with Docker isolation), code analysis, and a dataset explorer UI.
+## Install
+```bash
+uv add nl-code                # core
+uv add nl-code[docker]        # + Docker execution via dr-docker
+uv add nl-code[bigcodebench]  # + scientific libs for BigCodeBench/ClassEval
+```
+## Code Execution
+Execute generated code in isolated Docker containers.
+Three execution modes covering all supported dataset test formats:
+- **function_call** — call a named function with inputs, compare return values (HumanEval)
+- **assertion** — exec code + assertion-based test code (HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro)
+- **unittest** — exec code + unittest.TestCase classes (ClassEval)
+Batch variants (`batch_run_test_cases`, `batch_run_assertion_tests`, `batch_run_unittest_tests`) process many code samples in a single container with auto-chunking.
+### Build The Docker Image
+Build the execution image from the repo root:
+```bash
+docker build -t nl-code/code-eval-scientific:v1 -f docker/scientific.Dockerfile .
+```
+This is the default runtime image used by the execution pipeline. The Dockerfile
+installs both the `bigcodebench` dependency set and the pinned `dr-docker`
+runtime dependency directly from `pyproject.toml`, so the image stays aligned
+with the repo's declared execution requirements.
+### Run The Docker Test Tier
+Docker-dependent tests are marked with `@pytest.mark.docker` and are excluded
+from the default `pytest` run.
+Run them explicitly with:
+```bash
+uv run nl-code-test docker
+```
+You can pass extra pytest arguments through after `docker`, for example:
+```bash
+uv run nl-code-test docker -q tests/test_execution_runner.py
+```
+## Datasets
+Loaders for HumanEval, HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro, and ClassEval. Datasets are fetched from HuggingFace, parsed into `Task` objects, and cached locally. `DatasetSlice` supports filtering, seeded shuffling, and limit.
+## Dataset Explorer
+A FastAPI + React app for browsing and comparing datasets. Run from `ui/dataset-explorer/`.
+## Headless validation runs
+General dataset validation/debugging commands that import `matplotlib` should run headlessly with:
+```bash
+MPLBACKEND=Agg uv run python ...
+```
+## Rebuild Dataset Caches
+Run the Docker-backed cache rebuilds with:
+```bash
+uv run python -m nl_code.datasets.cache_cli rebuild humaneval-plus
+uv run python -m nl_code.datasets.cache_cli rebuild humaneval-pro
+uv run python -m nl_code.datasets.cache_cli rebuild mbpp-pro
+uv run python -m nl_code.datasets.cache_cli rebuild class-eval
+uv run python -m nl_code.datasets.cache_cli rebuild bigcodebench-lite-pro
+```
+`cache_cli rebuild` sets `MPLBACKEND=Agg` automatically.
+Current observed results with the default execution image and env limits:
+```text
+humaneval-plus: cached 163 tasks (163 raw, 1 flawed)
+humaneval-pro: cached 164 tasks (164 raw, 0 flawed)
+mbpp-pro: cached 375 tasks (375 raw, 3 flawed)
+class-eval: cached 98 tasks (98 raw, 2 flawed)
+bigcodebench-lite-pro: cached 54 tasks (54 raw, 3 flawed)
+```
+The remaining flawed samples above are dataset-level failures, not Docker
+runtime failures.

nl_code-0.5.0/README.md ADDED

@@ -0,0 +1,96 @@
+# nl-code
+Primitives for research into LLMs and code generation. Provides dataset loading, code execution (with Docker isolation), code analysis, and a dataset explorer UI.
+## Install
+```bash
+uv add nl-code                # core
+uv add nl-code[docker]        # + Docker execution via dr-docker
+uv add nl-code[bigcodebench]  # + scientific libs for BigCodeBench/ClassEval
+```
+## Code Execution
+Execute generated code in isolated Docker containers.
+Three execution modes covering all supported dataset test formats:
+- **function_call** — call a named function with inputs, compare return values (HumanEval)
+- **assertion** — exec code + assertion-based test code (HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro)
+- **unittest** — exec code + unittest.TestCase classes (ClassEval)
+Batch variants (`batch_run_test_cases`, `batch_run_assertion_tests`, `batch_run_unittest_tests`) process many code samples in a single container with auto-chunking.
+### Build The Docker Image
+Build the execution image from the repo root:
+```bash
+docker build -t nl-code/code-eval-scientific:v1 -f docker/scientific.Dockerfile .
+```
+This is the default runtime image used by the execution pipeline. The Dockerfile
+installs both the `bigcodebench` dependency set and the pinned `dr-docker`
+runtime dependency directly from `pyproject.toml`, so the image stays aligned
+with the repo's declared execution requirements.
+### Run The Docker Test Tier
+Docker-dependent tests are marked with `@pytest.mark.docker` and are excluded
+from the default `pytest` run.
+Run them explicitly with:
+```bash
+uv run nl-code-test docker
+```
+You can pass extra pytest arguments through after `docker`, for example:
+```bash
+uv run nl-code-test docker -q tests/test_execution_runner.py
+```
+## Datasets
+Loaders for HumanEval, HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro, and ClassEval. Datasets are fetched from HuggingFace, parsed into `Task` objects, and cached locally. `DatasetSlice` supports filtering, seeded shuffling, and limit.
+## Dataset Explorer
+A FastAPI + React app for browsing and comparing datasets. Run from `ui/dataset-explorer/`.
+## Headless validation runs
+General dataset validation/debugging commands that import `matplotlib` should run headlessly with:
+```bash
+MPLBACKEND=Agg uv run python ...
+```
+## Rebuild Dataset Caches
+Run the Docker-backed cache rebuilds with:
+```bash
+uv run python -m nl_code.datasets.cache_cli rebuild humaneval-plus
+uv run python -m nl_code.datasets.cache_cli rebuild humaneval-pro
+uv run python -m nl_code.datasets.cache_cli rebuild mbpp-pro
+uv run python -m nl_code.datasets.cache_cli rebuild class-eval
+uv run python -m nl_code.datasets.cache_cli rebuild bigcodebench-lite-pro
+```
+`cache_cli rebuild` sets `MPLBACKEND=Agg` automatically.
+Current observed results with the default execution image and env limits:
+```text
+humaneval-plus: cached 163 tasks (163 raw, 1 flawed)
+humaneval-pro: cached 164 tasks (164 raw, 0 flawed)
+mbpp-pro: cached 375 tasks (375 raw, 3 flawed)
+class-eval: cached 98 tasks (98 raw, 2 flawed)
+bigcodebench-lite-pro: cached 54 tasks (54 raw, 3 flawed)
+```
+The remaining flawed samples above are dataset-level failures, not Docker
+runtime failures.
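
Reviewer note: the README describes the `function_call` and `assertion` modes only in prose. The sketch below shows the idea in plain Python with no isolation. It does not use the package's API: `run_function_call_sketch` and `run_assertion_sketch` are hypothetical names, and passing each input as a single positional argument is an assumption.

```python
from typing import Any


def run_function_call_sketch(
    code: str, function_name: str, cases: list[tuple[Any, Any]]
) -> list[bool]:
    """Exec the code, call the named function per input, compare return values."""
    namespace: dict[str, Any] = {}
    exec(code, namespace)  # no sandbox here; the package runs this step in Docker
    func = namespace[function_name]
    results: list[bool] = []
    for input_value, expected in cases:
        try:
            results.append(func(input_value) == expected)
        except Exception:
            results.append(False)  # a runtime error counts as a failed case
    return results


def run_assertion_sketch(code: str, test_code: str) -> bool:
    """Exec the code, then exec assertion-based test code in the same namespace."""
    namespace: dict[str, Any] = {}
    try:
        exec(code, namespace)
        exec(test_code, namespace)
    except Exception:  # includes AssertionError
        return False
    return True


if __name__ == "__main__":
    sample = "def double(x):\n    return x * 2\n"
    print(run_function_call_sketch(sample, "double", [(2, 4), (3, 7)]))
    print(run_assertion_sketch(sample, "assert double(5) == 10"))
```

Running untrusted code with a bare `exec` like this is exactly what the Docker isolation in the diff is meant to avoid, so the sketch is for reading, not for evaluation runs.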

nl_code-0.5.0/docker/scientific.Dockerfile ADDED

@@ -0,0 +1,67 @@
+# Python code evaluation sandbox.
+# Uses the scientific dependency set from pyproject.toml.
+#
+# Build:
+#   docker build -t nl-code/code-eval-scientific:v1 -f docker/scientific.Dockerfile .
+FROM python:3.12-slim
+LABEL org.opencontainers.image.title="nl-code/code-eval-scientific"
+LABEL org.opencontainers.image.description="Python code evaluation sandbox (scientific)"
+LABEL org.opencontainers.image.version="v1"
+COPY pyproject.toml /tmp/nl-code/pyproject.toml
+RUN python - <<'PY' > /tmp/bigcodebench-requirements.txt
+from pathlib import Path
+import tomllib
+pyproject = tomllib.loads(Path("/tmp/nl-code/pyproject.toml").read_text())
+requirements = pyproject["project"]["optional-dependencies"]["bigcodebench"]
+print("\n".join(requirements))
+PY
+RUN python - <<'PY' > /tmp/dr-docker-requirements.txt
+from pathlib import Path
+import tomllib
+pyproject = tomllib.loads(Path("/tmp/nl-code/pyproject.toml").read_text())
+requirements = [
+    requirement
+    for requirement in pyproject["project"]["dependencies"]
+    if requirement.startswith("dr-docker")
+]
+if len(requirements) != 1:
+    raise SystemExit(
+        f"expected exactly one dr-docker requirement, found {len(requirements)}"
+    )
+print(requirements[0])
+PY
+RUN pip install --no-cache-dir \
+        -r /tmp/bigcodebench-requirements.txt \
+        -r /tmp/dr-docker-requirements.txt \
+    && rm -rf \
+        /root/.cache/pip \
+        /tmp/bigcodebench-requirements.txt \
+        /tmp/dr-docker-requirements.txt \
+        /tmp/nl-code
+# Preload NLTK resources needed by ClassEval tasks so evaluation does not
+# attempt network downloads inside the sandbox at runtime.
+ENV NLTK_DATA=/usr/local/share/nltk_data
+RUN python -m nltk.downloader -d "${NLTK_DATA}" \
+    averaged_perceptron_tagger \
+    averaged_perceptron_tagger_eng \
+    punkt \
+    punkt_tab \
+    wordnet \
+    omw-1.4
+RUN useradd -m -s /bin/bash evaluser \
+    && mkdir -p /sandbox \
+    && chown evaluser:evaluser /sandbox
+COPY --chown=evaluser:evaluser src/nl_code/code_execution/worker.py /sandbox/worker.py
+USER evaluser
+WORKDIR /tmp

{nl_code-0.4.1 → nl_code-0.5.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "nl-code"
-version = "0.4.1"
+version = "0.5.0"
 description = "Primitives for research into LLMs and code"
 readme = "README.md"
 authors = [
@@ -12,9 +12,15 @@ dependencies = [
     "datasets>=3.0.0",
     "fastapi>=0.135.3",
     "uvicorn[standard]>=0.44.0",
+    "typer>=0.24.1",
+    "python-dotenv>=1.2.2",
+    "dr-docker==0.4.5",
 ]
 [project.optional-dependencies]
+docker = [
+    "dr-docker>=0.4.5",
+]
 bigcodebench = [
     "numpy>=1.26",
     "pandas>=2.2",
@@ -27,8 +33,18 @@ bigcodebench = [
     "python-dateutil>=2.9",
     "holidays>=0.60",
     "regex>=2024.4",
+    "beautifulsoup4>=4.12",
+    "python-docx>=1.1",
+    "openpyxl>=3.1",
+    "gensim>=4.3",
+    "nltk>=3.9",
+    "PyPDF2>=3.0",
+    "reportlab>=4.2",
 ]
+[project.scripts]
+nl-code-test = "nl_code.test_cli:app"
 [build-system]
 requires = ["hatchling"]
 build-backend = "hatchling.build"
@@ -41,6 +57,12 @@ dev = [
     "ty>=0.0.29",
 ]
+[tool.pytest.ini_options]
+markers = [
+    "docker: tests that require a running Docker daemon",
+]
+addopts = "-m 'not docker'"
 [tool.ruff]
 include = ["src/**/*.py", "tests/**/*.py"]
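
Reviewer note: `addopts = "-m 'not docker'"` excludes the Docker tier by default, and the new `nl-code-test` script presumably re-enables it. The source of `nl_code.test_cli` is not part of this excerpt, so the sketch below is only one plausible way such a wrapper could build its pytest arguments. `docker_pytest_args` is a hypothetical name, and the override relies on pytest honouring the last `-m` it receives.

```python
def docker_pytest_args(extra_args: list[str]) -> list[str]:
    """Build pytest arguments that select only tests marked `docker`.

    Assumption: options from `addopts` are applied before command-line
    arguments, so this later `-m docker` replaces `-m 'not docker'`.
    """
    return ["-m", "docker", *extra_args]


if __name__ == "__main__":
    # Would correspond to: uv run nl-code-test docker -q tests/test_execution_runner.py
    print(docker_pytest_args(["-q", "tests/test_execution_runner.py"]))
```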

{nl_code-0.4.1 → nl_code-0.5.0}/src/nl_code/__init__.py RENAMED

@@ -1 +1,5 @@
 """Primitives for research into LLMs and code."""
+from dotenv import load_dotenv
+load_dotenv()

nl_code-0.5.0/src/nl_code/code_execution/__init__.py ADDED

@@ -0,0 +1,27 @@
+"""Code execution with Docker isolation."""
+from nl_code.code_execution.models import (
+    DEFAULT_CODE_EVAL_IMAGE as DEFAULT_CODE_EVAL_IMAGE,
+    SCIENTIFIC_CODE_EVAL_IMAGE as SCIENTIFIC_CODE_EVAL_IMAGE,
+    AssertionBatchItem as AssertionBatchItem,
+    AssertionTestResult as AssertionTestResult,
+    CodeExecutionInfrastructureError as CodeExecutionInfrastructureError,
+    ExecutionResult as ExecutionResult,
+    FunctionCallBatchItem as FunctionCallBatchItem,
+    TestCase as TestCase,
+    TestCaseResult as TestCaseResult,
+    UnittestBatchItem as UnittestBatchItem,
+    UnittestResult as UnittestResult,
+    UnittestTestDetail as UnittestTestDetail,
+)
+from nl_code.code_execution.runner import (
+    EXEC_MODE_DOCKER as EXEC_MODE_DOCKER,
+    batch_run_assertion_tests as batch_run_assertion_tests,
+    batch_run_test_cases as batch_run_test_cases,
+    batch_run_unittest_tests as batch_run_unittest_tests,
+    check_compiles as check_compiles,
+    run_assertion_test as run_assertion_test,
+    run_function_batch as run_function_batch,
+    run_test_cases as run_test_cases,
+    run_unittest_test as run_unittest_test,
+)

nl_code-0.5.0/src/nl_code/code_execution/models.py ADDED

@@ -0,0 +1,152 @@
+from typing import Any
+from pydantic import BaseModel, Field
+DEFAULT_CODE_EVAL_IMAGE = "nl-code/code-eval-scientific:v1"
+SCIENTIFIC_CODE_EVAL_IMAGE = DEFAULT_CODE_EVAL_IMAGE
+class CodeExecutionInfrastructureError(RuntimeError):
+    """Raised when the execution platform itself fails.
+    This error means Docker/subprocess infrastructure could not run the code.
+    It is NEVER raised for code-level failures (syntax errors, runtime
+    exceptions, wrong answers) — those are returned in result objects.
+    Callers should ``try/except CodeExecutionInfrastructureError`` to handle
+    platform issues and inspect result objects for code-level failures.
+    """
+    def __init__(
+        self,
+        *,
+        stage: str,
+        execution_mode: str,
+        detail: str,
+    ) -> None:
+        self.stage = stage
+        self.execution_mode = execution_mode
+        self.detail = detail
+        super().__init__(
+            f"execution infrastructure failure "
+            f"(stage={stage}, mode={execution_mode}): {detail}"
+        )
+# ---------------------------------------------------------------------------
+# Function-call execution models
+# ---------------------------------------------------------------------------
+class ExecutionResult(BaseModel):
+    """Result of executing a function with a single input."""
+    input_value: Any
+    return_value: Any | None = None
+    return_type: str | None = None
+    stdout: str = ""
+    stdout_truncated: bool = False
+    error: str | None = None
+    compile_success: bool | None = None
+    compile_error: str | None = None
+class TestCase(BaseModel):
+    """A single test case: input and expected output."""
+    __test__ = False
+    input_value: Any
+    expected_output: Any
+class TestCaseResult(BaseModel):
+    """Result of comparing execution output to expected output."""
+    __test__ = False
+    input_value: Any
+    expected_output: Any
+    actual_output: Any | None = None
+    passed: bool = False
+    error: str | None = None
+    compile_success: bool | None = None
+    compile_error: str | None = None
+# ---------------------------------------------------------------------------
+# Assertion execution models (Pro datasets)
+# ---------------------------------------------------------------------------
+class AssertionTestResult(BaseModel):
+    """Result of running code against assertion-based test code."""
+    __test__ = False
+    passed: bool
+    error: str | None = None
+    stdout: str = ""
+    compile_success: bool | None = None
+    compile_error: str | None = None
+# ---------------------------------------------------------------------------
+# Unittest execution models (ClassEval)
+# ---------------------------------------------------------------------------
+class UnittestTestDetail(BaseModel):
+    """Result of running a single unittest.TestCase class."""
+    __test__ = False
+    test_class_name: str
+    tests_run: int
+    tests_passed: int
+    tests_failed: int
+    tests_errored: int
+    tests_skipped: int = 0
+    failures: list[str] = Field(default_factory=list)
+    errors: list[str] = Field(default_factory=list)
+    passed: bool
+class UnittestResult(BaseModel):
+    """Aggregated result across all unittest test classes."""
+    all_passed: bool
+    total_tests_run: int
+    total_tests_passed: int
+    total_tests_failed: int
+    total_tests_errored: int
+    per_test_class: list[UnittestTestDetail]
+    error: str | None = None
+# ---------------------------------------------------------------------------
+# Batch item models
+# ---------------------------------------------------------------------------
+class FunctionCallBatchItem(BaseModel):
+    """A single item for batch_run_test_cases."""
+    code: str
+    function_name: str
+    test_cases: list[TestCase]
+class AssertionBatchItem(BaseModel):
+    """A single item for batch_run_assertion_tests."""
+    code: str
+    test_code: str
+class UnittestBatchItem(BaseModel):
+    """A single item for batch_run_unittest_tests."""
+    code: str
+    test_code: str
+    test_class_names: list[str]
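
Reviewer note: the `CodeExecutionInfrastructureError` docstring separates platform failures (raised) from code-level failures (returned in result objects). The sketch below illustrates that calling pattern. The exception class is copied from the diff so the example is self-contained. `hypothetical_run`, `evaluate`, the `container_start` stage name, and the dict-shaped result are all assumptions made for illustration, not the package's runner.

```python
class CodeExecutionInfrastructureError(RuntimeError):
    """Copied from the diff: raised only when the execution platform fails."""

    def __init__(self, *, stage: str, execution_mode: str, detail: str) -> None:
        self.stage = stage
        self.execution_mode = execution_mode
        self.detail = detail
        super().__init__(
            f"execution infrastructure failure "
            f"(stage={stage}, mode={execution_mode}): {detail}"
        )


def hypothetical_run(code: str, docker_available: bool) -> dict:
    """Stand-in runner: raises for platform failures, returns code failures."""
    if not docker_available:
        raise CodeExecutionInfrastructureError(
            stage="container_start",  # hypothetical stage name
            execution_mode="docker",
            detail="daemon unreachable",
        )
    try:
        compile(code, "<candidate>", "exec")
    except SyntaxError as exc:
        # Code-level failure: reported in the result, never raised.
        return {"compile_success": False, "compile_error": str(exc)}
    return {"compile_success": True, "compile_error": None}


def evaluate(code: str, docker_available: bool = True) -> str:
    try:
        result = hypothetical_run(code, docker_available)
    except CodeExecutionInfrastructureError as exc:
        return f"retry later: {exc.stage}"  # platform issue, not the sample's fault
    return "ok" if result["compile_success"] else "code failure"


if __name__ == "__main__":
    print(evaluate("x = 1"))
    print(evaluate("x = ("))
    print(evaluate("x = 1", docker_available=False))
```

Keeping the two failure channels separate means a flaky Docker daemon cannot be mistaken for a wrong answer when scoring generated code.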
