distenum 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- distenum-0.1.0/LICENSE +21 -0
- distenum-0.1.0/PKG-INFO +129 -0
- distenum-0.1.0/README.md +104 -0
- distenum-0.1.0/distenum/__init__.py +5 -0
- distenum-0.1.0/distenum/parser.py +348 -0
- distenum-0.1.0/distenum.egg-info/PKG-INFO +129 -0
- distenum-0.1.0/distenum.egg-info/SOURCES.txt +11 -0
- distenum-0.1.0/distenum.egg-info/dependency_links.txt +1 -0
- distenum-0.1.0/distenum.egg-info/requires.txt +7 -0
- distenum-0.1.0/distenum.egg-info/top_level.txt +1 -0
- distenum-0.1.0/pyproject.toml +49 -0
- distenum-0.1.0/setup.cfg +4 -0
- distenum-0.1.0/tests/test_parser.py +722 -0
distenum-0.1.0/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 Karan Taneja
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
distenum-0.1.0/PKG-INFO
ADDED
|
@@ -0,0 +1,129 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: distenum
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Parse JSON outputs with logprobs from the OpenAI API to convert enum fields into probability distributions.
|
|
5
|
+
Author: Karan Taneja
|
|
6
|
+
License-Expression: MIT
|
|
7
|
+
Project-URL: Repository, https://github.com/ktaneja6/distenum
|
|
8
|
+
Classifier: Development Status :: 3 - Alpha
|
|
9
|
+
Classifier: Intended Audience :: Developers
|
|
10
|
+
Classifier: Programming Language :: Python :: 3
|
|
11
|
+
Classifier: Programming Language :: Python :: 3.9
|
|
12
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
13
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
14
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
15
|
+
Classifier: Topic :: Scientific/Engineering
|
|
16
|
+
Requires-Python: >=3.9
|
|
17
|
+
Description-Content-Type: text/markdown
|
|
18
|
+
License-File: LICENSE
|
|
19
|
+
Provides-Extra: openai
|
|
20
|
+
Requires-Dist: openai>=1.0; extra == "openai"
|
|
21
|
+
Provides-Extra: dev
|
|
22
|
+
Requires-Dist: pytest; extra == "dev"
|
|
23
|
+
Requires-Dist: ruff; extra == "dev"
|
|
24
|
+
Dynamic: license-file
|
|
25
|
+
|
|
26
|
+
# distenum
|
|
27
|
+
|
|
28
|
+
**distenum** parses JSON outputs with logprobs from the OpenAI API and converts `enum`-type string fields into probability distributions instead of a single label.
|
|
29
|
+
|
|
30
|
+
## What it does
|
|
31
|
+
|
|
32
|
+
With structured outputs, the API returns a single enum value (e.g. `"positive"`). If you request `logprobs=True`, you also get token-level logprobs. **distenum** turns those logprobs into a probability distribution over your enum options so you can see how confident the model was in each choice.
|
|
33
|
+
|
|
34
|
+
**Example:** For a sentiment field with `enum: ["positive", "negative", "neutral"]`:
|
|
35
|
+
|
|
36
|
+
| From the API (content only) | With distenum (using logprobs) |
|
|
37
|
+
|-----------------------------|---------------------------------|
|
|
38
|
+
| `"sentiment": "positive"` | `"sentiment": {"positive": 0.72, "negative": 0.18, "neutral": 0.10}` |
|
|
39
|
+
|
|
40
|
+
So instead of a single label, you get a distribution you can use for uncertainty, ranking, or thresholding (e.g. only accept when `positive` probability > 0.8).
|
|
41
|
+
|
|
42
|
+
## Install
|
|
43
|
+
|
|
44
|
+
```bash
|
|
45
|
+
pip install distenum
|
|
46
|
+
```
|
|
47
|
+
|
|
48
|
+
To run the example script (calls the OpenAI API):
|
|
49
|
+
|
|
50
|
+
```bash
|
|
51
|
+
pip install "distenum[openai]"
|
|
52
|
+
```
|
|
53
|
+
|
|
54
|
+
## Quick start
|
|
55
|
+
|
|
56
|
+
Install the package and the OpenAI client: `pip install distenum[openai]`. Then call the API with `logprobs=True` and `top_logprobs=20`, and pass the response logprobs into distenum:
|
|
57
|
+
|
|
58
|
+
```python
|
|
59
|
+
from openai import OpenAI
|
|
60
|
+
from distenum import parse_using_schema_and_logprobs
|
|
61
|
+
|
|
62
|
+
schema = {
|
|
63
|
+
"type": "object",
|
|
64
|
+
"properties": {
|
|
65
|
+
"sentiment": {
|
|
66
|
+
"type": "string",
|
|
67
|
+
"enum": ["positive", "negative", "neutral"]
|
|
68
|
+
}
|
|
69
|
+
}
|
|
70
|
+
}
|
|
71
|
+
|
|
72
|
+
client = OpenAI()
|
|
73
|
+
response = client.chat.completions.create(
|
|
74
|
+
model="gpt-4o-2024-08-06",
|
|
75
|
+
messages=[{"role": "user", "content": "Your prompt"}],
|
|
76
|
+
response_format={
|
|
77
|
+
"type": "json_schema",
|
|
78
|
+
"json_schema": {"name": "my_schema", "strict": True, "schema": schema}
|
|
79
|
+
},
|
|
80
|
+
logprobs=True,
|
|
81
|
+
top_logprobs=20,
|
|
82
|
+
)
|
|
83
|
+
|
|
84
|
+
logprobs_data = response.choices[0].logprobs
|
|
85
|
+
parsed = parse_using_schema_and_logprobs(schema, logprobs_data)
|
|
86
|
+
# parsed["sentiment"] might be: {"positive": 0.72, "negative": 0.18, "neutral": 0.10}
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
## Enum design tips
|
|
90
|
+
|
|
91
|
+
- **Different prefixes:** Enum values are matched to token logprobs by **prefix**. Prefer enum labels that do not share a common prefix (e.g. `"positive"`, `"negative"`, `"neutral"` are good; `"pos"` and `"positive"` can blur probabilities).
|
|
92
|
+
- **Fewer is better:** The API returns at most **20** logprobs per token (`top_logprobs=20`). With many enum values, most will get no mass; keep enums small for meaningful distributions.
|
|
93
|
+
|
|
94
|
+
## API
|
|
95
|
+
|
|
96
|
+
- **`parse_using_schema_and_logprobs(schema_dict, logprobs_data)`**
|
|
97
|
+
Parses the logprobs stream according to the JSON Schema. Fields of type `string` with an `enum` are returned as a dict mapping each enum label to a probability (non-negative, summing to 1). Other fields are parsed as normal JSON values.
|
|
98
|
+
|
|
99
|
+
- **`tokenize(logprobs_data)`**
|
|
100
|
+
Low-level generator that yields tokens and their top-logprobs from the OpenAI logprobs content.
|
|
101
|
+
|
|
102
|
+
## Performance
|
|
103
|
+
|
|
104
|
+
The parser walks token-level logprobs and builds probability distributions for enum fields, so it is slower than parsing the same JSON with the standard library. A rough comparison (same logical structure, 100k iterations):
|
|
105
|
+
|
|
106
|
+
| Parser | Time (100k parses) | Throughput | Avg per parse |
|
|
107
|
+
|---------------|--------------------|--------------|---------------|
|
|
108
|
+
| `json.loads` | ~0.16 s | ~630k/sec | ~1.6 µs |
|
|
109
|
+
| distenum | ~3.0 s | ~33k/sec | ~30 µs |
|
|
110
|
+
|
|
111
|
+
So distenum is typically **about 15–20× slower** than `json.loads` for the same structure. In absolute terms, **~30 µs per parse** is negligible compared to an OpenAI API call (typically hundreds of milliseconds to several seconds). Parsing a single response adds no meaningful latency.
|
|
112
|
+
|
|
113
|
+
To run the benchmark yourself from the repo root:
|
|
114
|
+
|
|
115
|
+
```bash
|
|
116
|
+
PYTHONPATH=. python scripts/benchmark_parser.py
|
|
117
|
+
```
|
|
118
|
+
|
|
119
|
+
## Example script
|
|
120
|
+
|
|
121
|
+
From the repo root (with `distenum[openai]` installed):
|
|
122
|
+
|
|
123
|
+
```bash
|
|
124
|
+
python example_sentiment_openai.py
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
## License
|
|
128
|
+
|
|
129
|
+
MIT
|
distenum-0.1.0/README.md
ADDED
|
@@ -0,0 +1,104 @@
|
|
|
1
|
+
# distenum
|
|
2
|
+
|
|
3
|
+
**distenum** parses JSON outputs with logprobs from the OpenAI API and converts `enum`-type string fields into probability distributions instead of a single label.
|
|
4
|
+
|
|
5
|
+
## What it does
|
|
6
|
+
|
|
7
|
+
With structured outputs, the API returns a single enum value (e.g. `"positive"`). If you request `logprobs=True`, you also get token-level logprobs. **distenum** turns those logprobs into a probability distribution over your enum options so you can see how confident the model was in each choice.
|
|
8
|
+
|
|
9
|
+
**Example:** For a sentiment field with `enum: ["positive", "negative", "neutral"]`:
|
|
10
|
+
|
|
11
|
+
| From the API (content only) | With distenum (using logprobs) |
|
|
12
|
+
|-----------------------------|---------------------------------|
|
|
13
|
+
| `"sentiment": "positive"` | `"sentiment": {"positive": 0.72, "negative": 0.18, "neutral": 0.10}` |
|
|
14
|
+
|
|
15
|
+
So instead of a single label, you get a distribution you can use for uncertainty, ranking, or thresholding (e.g. only accept when `positive` probability > 0.8).
|
|
16
|
+
|
|
17
|
+
## Install
|
|
18
|
+
|
|
19
|
+
```bash
|
|
20
|
+
pip install distenum
|
|
21
|
+
```
|
|
22
|
+
|
|
23
|
+
To run the example script (calls the OpenAI API):
|
|
24
|
+
|
|
25
|
+
```bash
|
|
26
|
+
pip install "distenum[openai]"
|
|
27
|
+
```
|
|
28
|
+
|
|
29
|
+
## Quick start
|
|
30
|
+
|
|
31
|
+
Install the package and the OpenAI client: `pip install distenum[openai]`. Then call the API with `logprobs=True` and `top_logprobs=20`, and pass the response logprobs into distenum:
|
|
32
|
+
|
|
33
|
+
```python
|
|
34
|
+
from openai import OpenAI
|
|
35
|
+
from distenum import parse_using_schema_and_logprobs
|
|
36
|
+
|
|
37
|
+
schema = {
|
|
38
|
+
"type": "object",
|
|
39
|
+
"properties": {
|
|
40
|
+
"sentiment": {
|
|
41
|
+
"type": "string",
|
|
42
|
+
"enum": ["positive", "negative", "neutral"]
|
|
43
|
+
}
|
|
44
|
+
}
|
|
45
|
+
}
|
|
46
|
+
|
|
47
|
+
client = OpenAI()
|
|
48
|
+
response = client.chat.completions.create(
|
|
49
|
+
model="gpt-4o-2024-08-06",
|
|
50
|
+
messages=[{"role": "user", "content": "Your prompt"}],
|
|
51
|
+
response_format={
|
|
52
|
+
"type": "json_schema",
|
|
53
|
+
"json_schema": {"name": "my_schema", "strict": True, "schema": schema}
|
|
54
|
+
},
|
|
55
|
+
logprobs=True,
|
|
56
|
+
top_logprobs=20,
|
|
57
|
+
)
|
|
58
|
+
|
|
59
|
+
logprobs_data = response.choices[0].logprobs
|
|
60
|
+
parsed = parse_using_schema_and_logprobs(schema, logprobs_data)
|
|
61
|
+
# parsed["sentiment"] might be: {"positive": 0.72, "negative": 0.18, "neutral": 0.10}
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
## Enum design tips
|
|
65
|
+
|
|
66
|
+
- **Different prefixes:** Enum values are matched to token logprobs by **prefix**. Prefer enum labels that do not share a common prefix (e.g. `"positive"`, `"negative"`, `"neutral"` are good; `"pos"` and `"positive"` can blur probabilities).
|
|
67
|
+
- **Fewer is better:** The API returns at most **20** logprobs per token (`top_logprobs=20`). With many enum values, most will get no mass; keep enums small for meaningful distributions.
|
|
68
|
+
|
|
69
|
+
## API
|
|
70
|
+
|
|
71
|
+
- **`parse_using_schema_and_logprobs(schema_dict, logprobs_data)`**
|
|
72
|
+
Parses the logprobs stream according to the JSON Schema. Fields of type `string` with an `enum` are returned as a dict mapping each enum label to a probability (non-negative, summing to 1). Other fields are parsed as normal JSON values.
|
|
73
|
+
|
|
74
|
+
- **`tokenize(logprobs_data)`**
|
|
75
|
+
Low-level generator that yields tokens and their top-logprobs from the OpenAI logprobs content.
|
|
76
|
+
|
|
77
|
+
## Performance
|
|
78
|
+
|
|
79
|
+
The parser walks token-level logprobs and builds probability distributions for enum fields, so it is slower than parsing the same JSON with the standard library. A rough comparison (same logical structure, 100k iterations):
|
|
80
|
+
|
|
81
|
+
| Parser | Time (100k parses) | Throughput | Avg per parse |
|
|
82
|
+
|---------------|--------------------|--------------|---------------|
|
|
83
|
+
| `json.loads` | ~0.16 s | ~630k/sec | ~1.6 µs |
|
|
84
|
+
| distenum | ~3.0 s | ~33k/sec | ~30 µs |
|
|
85
|
+
|
|
86
|
+
So distenum is typically **about 15–20× slower** than `json.loads` for the same structure. In absolute terms, **~30 µs per parse** is negligible compared to an OpenAI API call (typically hundreds of milliseconds to several seconds). Parsing a single response adds no meaningful latency.
|
|
87
|
+
|
|
88
|
+
To run the benchmark yourself from the repo root:
|
|
89
|
+
|
|
90
|
+
```bash
|
|
91
|
+
PYTHONPATH=. python scripts/benchmark_parser.py
|
|
92
|
+
```
|
|
93
|
+
|
|
94
|
+
## Example script
|
|
95
|
+
|
|
96
|
+
From the repo root (with `distenum[openai]` installed):
|
|
97
|
+
|
|
98
|
+
```bash
|
|
99
|
+
python example_sentiment_openai.py
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
## License
|
|
103
|
+
|
|
104
|
+
MIT
|
|
@@ -0,0 +1,348 @@
|
|
|
1
|
+
"""Parse JSON logprobs from the OpenAI API and convert enum fields to probability distributions."""
|
|
2
|
+
|
|
3
|
+
import itertools
|
|
4
|
+
import math
|
|
5
|
+
|
|
6
|
+
|
|
7
|
+
def tokenize(logprobs_data):
    """Yield tokens and their top_logprobs from OpenAI logprobs content.

    Walks the character stream formed by concatenating every
    ``logprobs_data.content[i].token`` and yields, in order:

    - a ``top_logprobs`` list each time the cursor enters a new content
      entry (downstream code uses these for enum probability scoring), and
    - JSON lexical tokens as strings: structural characters (``{ } [ ] : ,``),
      quoted strings with JSON escapes decoded (re-wrapped in double quotes),
      number literals, and the keywords ``true``/``false``/``null``.

    Args:
        logprobs_data: OpenAI logprobs object (e.g.
            ``response.choices[0].logprobs``) with a non-empty ``content``.

    Raises:
        ValueError: if ``logprobs_data`` is None, lacks ``content``, has an
            empty ``content``, or the stream contains an invalid JSON token.
    """
    if logprobs_data is None:
        raise ValueError(
            "logprobs_data is None. Pass the logprobs object from the response, e.g. response.choices[0].logprobs"
        )
    if not hasattr(logprobs_data, "content"):
        raise ValueError(
            "logprobs_data has no 'content' attribute. Expected an OpenAI logprobs object (e.g. response.choices[0].logprobs)."
        )
    logprobs_sequence = logprobs_data.content
    if not logprobs_sequence:
        raise ValueError(
            "logprobs_data.content is empty. Ensure the API response was requested with logprobs=True and has content."
        )

    n = len(logprobs_sequence)
    yield logprobs_sequence[0].top_logprobs
    li = 0  # index of the current content entry
    ci = 0  # character index within logprobs_sequence[li].token

    # JSON escape decoding table (\uXXXX is handled separately below).
    # Hoisted out of the string loop: it is invariant across the whole stream.
    _escape_map = {
        '"': '"',
        "\\": "\\",
        "/": "/",
        "b": "\b",
        "f": "\f",
        "n": "\n",
        "r": "\r",
        "t": "\t",
    }

    def increment_ci(increase=1):
        # Advance the (li, ci) cursor by `increase` characters. Whenever the
        # cursor crosses into a new content entry, yield that entry's
        # top_logprobs so the parser can attach per-token probabilities.
        nonlocal li, ci
        for _ in range(increase):
            if ci == len(logprobs_sequence[li].token) - 1:
                li += 1
                ci = 0
                # Bug fix: guard the end of the stream. Previously this
                # indexed logprobs_sequence[n] after the final character,
                # raising IndexError when the generator was fully exhausted
                # (e.g. list(tokenize(...))) and making the truncation
                # guards below unreachable.
                if li < n:
                    yield logprobs_sequence[li].top_logprobs
            else:
                ci += 1

    while li < n:
        char = logprobs_sequence[li].token[ci]

        # --- Whitespace ---
        if char.isspace():
            yield from increment_ci()
            continue

        # --- Structural Tokens ---
        if char in ["{", "}", "[", "]", ":", ","]:
            yield char
            yield from increment_ci()
            continue

        # --- String Token (with JSON escape decoding: \", \\, \/, \b, \f, \n, \r, \t, \uXXXX) ---
        if char == '"':
            yield from increment_ci()
            start_to_end_string = ""
            while li < n:
                c = logprobs_sequence[li].token[ci]
                if c == "\\":
                    yield from increment_ci()
                    if li >= n:
                        # Stream truncated mid-escape; emit what we have.
                        break
                    esc = logprobs_sequence[li].token[ci]
                    if esc == "u":
                        yield from increment_ci()
                        hex_str = ""
                        for _ in range(4):
                            if li < n:
                                hex_str += logprobs_sequence[li].token[ci]
                                yield from increment_ci()
                        if len(hex_str) == 4:
                            start_to_end_string += chr(int(hex_str, 16))
                    else:
                        # Unknown escapes fall back to the escaped char itself.
                        start_to_end_string += _escape_map.get(esc, esc)
                        yield from increment_ci()
                elif c == '"':
                    break
                else:
                    start_to_end_string += c
                    yield from increment_ci()
            yield '"' + start_to_end_string + '"'
            yield from increment_ci()
            continue

        # --- Number/Keyword Tokens (incl. scientific notation: 1e-5, 2.5E+10) ---
        if char.isdigit() or char == "-":
            start_to_end_number = ""
            while li < n:
                c = logprobs_sequence[li].token[ci]
                if c.isdigit() or c == "." or c == "-" or c in "eE+":
                    start_to_end_number += c
                    yield from increment_ci()
                else:
                    break
            yield start_to_end_number
            continue

        # --- Keywords (true, false, null) ---
        if char in "tfn":
            # NOTE(review): keywords are matched within a single content
            # entry; a keyword split across entries (e.g. "tr" + "ue") is not
            # handled and raises below — confirm the API never splits them.
            if logprobs_sequence[li].token[ci : ci + 4] == "true":
                yield "true"
                yield from increment_ci(4)
            elif logprobs_sequence[li].token[ci : ci + 5] == "false":
                yield "false"
                yield from increment_ci(5)
            elif logprobs_sequence[li].token[ci : ci + 4] == "null":
                yield "null"
                yield from increment_ci(4)
            else:
                raise ValueError(
                    f"Invalid keyword in logprobs stream at content index {li}, char index {ci}: "
                    f"got {logprobs_sequence[li].token!r}. Expected one of: true, false, null."
                )
            continue

        raise ValueError(
            f"Unexpected character in logprobs stream at content index {li}, char index {ci}: "
            f"{char!r} in {logprobs_sequence[li].token!r}. Expected a valid JSON token (string, number, true, false, null, or {{ }} [ ] : ,)."
        )
|
|
134
|
+
|
|
135
|
+
|
|
136
|
+
def get_next_token(tokens, allow_list=False):
    """Pull the next item from the token stream.

    Interleaved top-logprobs lists are skipped unless ``allow_list`` is
    True, in which case the very first item is returned as-is.
    """
    while True:
        token = next(tokens)
        if allow_list or not isinstance(token, list):
            return token
|
|
141
|
+
|
|
142
|
+
|
|
143
|
+
def parse_value(tokens, schema_dict):
    """Parse one JSON value from the token stream, guided by ``schema_dict``.

    Handles objects, arrays, strings (enum string fields become probability
    distributions), numbers, integers, booleans, and null. A top-logprobs
    list preceding the value token is captured and used for enum scoring.
    """
    top_logprobs = None
    token = get_next_token(tokens, allow_list=True)
    if isinstance(token, list):
        # A logprobs list precedes the value token; keep it for enum scoring.
        top_logprobs = token
        token = get_next_token(tokens)

    declared = schema_dict.get("type")

    if token == "{" and declared == "object":
        return parse_object(tokens, schema_dict)

    # A missing "type" is treated as array when the token opens one.
    if token == "[" and schema_dict.get("type", "array") == "array":
        return parse_array(tokens, schema_dict)

    if token.startswith('"') and declared == "string":
        if "enum" not in schema_dict:
            # Plain string: strip the surrounding double quotes.
            return token[1:-1]
        if not top_logprobs:
            raise ValueError(
                "Enum field in schema but no logprobs available for this token. "
                "Ensure the API was called with logprobs=True and top_logprobs (e.g. 20). "
                f"Schema enum: {schema_dict['enum']}."
            )
        # Prefix-match each enum label against the candidate tokens and
        # accumulate probability mass (exp of each logprob).
        masses = {}
        for label in schema_dict["enum"]:
            label_lower = label.lower()
            masses[label] = sum(
                math.exp(candidate.logprob)
                for candidate in top_logprobs
                if label_lower.startswith(candidate.token.strip().lower())
            )
        total = sum(masses.values())
        if total == 0:
            raise ValueError(
                "Enum field: no logprob mass matched any enum value. "
                f"Schema enum: {schema_dict['enum']}. "
                "Try enum labels with distinct prefixes, or increase top_logprobs."
            )
        # Consume the rest of the string token if we only saw the opening quote
        if token == '"':
            get_next_token(tokens)
        # Normalize so the distribution sums to 1.
        return {label: mass / total for label, mass in masses.items()}

    if token == "null" and declared == "null":
        return None

    if declared == "boolean" and token in ("true", "false", "null"):
        return {"true": True, "false": False, "null": None}[token]

    if declared == "integer":
        try:
            return int(token)
        except ValueError:
            raise ValueError(
                f"Expected an integer for schema type 'integer', got {token!r}. "
                "Check that the logprobs content matches the schema."
            ) from None

    if declared == "number":
        try:
            # Floats carry a decimal point or an exponent; otherwise int.
            return float(token) if ("." in token or "e" in token.lower()) else int(token)
        except ValueError:
            raise ValueError(
                f"Expected a number for schema type 'number', got {token!r}. "
                "Check that the logprobs content matches the schema."
            ) from None

    schema_type = schema_dict.get("type", "(missing)")
    raise ValueError(
        f"Unexpected token {token!r} for schema type {schema_type!r}. "
        f"Schema expects one of: object, array, string, number, integer, boolean, null. "
        "Ensure the logprobs_data content matches the JSON structure implied by the schema."
    )
|
|
218
|
+
|
|
219
|
+
|
|
220
|
+
def parse_object(tokens, schema_dict):
    """Parse a JSON object from the token stream into a Python dict.

    Each property value is parsed with the sub-schema found under the
    schema's "properties"; keys not present there are rejected.
    """
    if "properties" not in schema_dict:
        raise ValueError(
            "Schema type is 'object' but schema has no 'properties'. "
            "Add a 'properties' dict, e.g. {\"type\": \"object\", \"properties\": {\"key\": {\"type\": \"string\"}}}."
        )
    properties = schema_dict["properties"]
    parsed = {}

    current = get_next_token(tokens)
    if current == "}":
        # Empty object.
        return parsed

    while True:
        if not current.startswith('"'):
            raise ValueError(
                f"Expected a quoted string key (object property name), got {current!r}. "
                "JSON object keys must be double-quoted strings."
            )
        # Tokenizer yields '"' then '"key"'; consume full key token when needed
        if current == '"':
            current = get_next_token(tokens)
        key = current[1:-1]

        colon = get_next_token(tokens)
        if colon != ":":
            raise ValueError(
                f"Expected ':' after object key {key!r}, got {colon!r}. "
                "Check that the logprobs content is valid JSON."
            )

        if key not in properties:
            allowed = list(properties.keys())
            raise ValueError(
                f"Unknown key {key!r}. Schema only allows: {allowed}. "
                "Ensure the response matches the schema, or add this key to schema properties."
            )

        parsed[key] = parse_value(tokens, properties[key])

        separator = get_next_token(tokens)
        if separator == "}":
            return parsed
        if separator != ",":
            raise ValueError(
                f"Expected ',' or '}}' after object value, got {separator!r}. "
                "Check that the logprobs content is valid JSON."
            )

        current = get_next_token(tokens)
|
|
272
|
+
|
|
273
|
+
|
|
274
|
+
def parse_array(tokens, schema_dict):
    """Parse a JSON array from the token stream into a Python list.

    Every element is parsed with the sub-schema under the schema's "items".
    """
    if "items" not in schema_dict:
        raise ValueError(
            "Schema type is 'array' but schema has no 'items'. "
            "Add an 'items' schema, e.g. {\"type\": \"array\", \"items\": {\"type\": \"string\"}}."
        )
    item_schema = schema_dict["items"]
    elements = []

    head = next(tokens)
    if isinstance(head, list):
        # Skip a leading top-logprobs list.
        head = next(tokens)
    if head == "]":
        # Empty array.
        return elements

    # Push the already-consumed first token back in front of the stream so
    # parse_value sees it as the start of the first element; once consumed,
    # the chained stream is equivalent to the underlying one.
    stream = itertools.chain([head], tokens)
    while True:
        elements.append(parse_value(stream, item_schema))

        separator = next(stream)
        if isinstance(separator, list):
            separator = next(stream)

        if separator == "]":
            return elements
        if separator != ",":
            raise ValueError(
                f"Expected ',' or ']' after array element, got {separator!r}. "
                "Check that the logprobs content is valid JSON."
            )
|
|
308
|
+
|
|
309
|
+
|
|
310
|
+
def _tokenizer_wrapper(logprobs_data):
    """Thin indirection over tokenize so the parser consumes one token source."""
    for item in tokenize(logprobs_data):
        yield item
|
|
312
|
+
|
|
313
|
+
|
|
314
|
+
def parse_using_schema_and_logprobs(schema_dict, logprobs_data):
    """Parse OpenAI logprobs content according to a JSON schema.

    String fields carrying an "enum" come back as a probability distribution
    — a dict mapping every enum label to a probability in [0, 1] — instead of
    a single string. Every other field parses to the usual Python value for
    its JSON type.

    Args:
        schema_dict: JSON Schema dict (e.g. type, properties, items, enum).
        logprobs_data: OpenAI response logprobs object (e.g. response.choices[0].logprobs).

    Returns:
        Parsed structure (dict/list/primitives) with enum fields as probability dicts.

    Raises:
        ValueError: on an invalid schema or when the token stream ends
            before the schema's structure is complete.
    """
    if schema_dict is None:
        raise ValueError("schema_dict is None. Pass a JSON Schema dict with at least 'type'.")
    if not isinstance(schema_dict, dict):
        raise ValueError(
            f"schema_dict must be a dict, got {type(schema_dict).__name__}. "
            "Pass a JSON Schema dict (e.g. {\"type\": \"object\", \"properties\": {...}})."
        )
    if "type" not in schema_dict:
        raise ValueError(
            "schema_dict must contain 'type'. Example: {\"type\": \"object\", \"properties\": {...}}."
        )

    token_stream = _tokenizer_wrapper(logprobs_data)
    try:
        return parse_value(token_stream, schema_dict)
    except StopIteration:
        # A premature end of tokens means the content and schema disagree.
        raise ValueError(
            "Logprobs stream ended unexpectedly. The logprobs_data.content may not match the expected JSON structure, "
            "or the response may be truncated. Ensure logprobs=True and top_logprobs is set (e.g. 20)."
        ) from None
|