inference-autopsy 0.1.0__tar.gz

Files changed (176)

inference_autopsy-0.1.0/.github/workflows/ci.yml ADDED

@@ -0,0 +1,62 @@
+name: CI
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+jobs:
+  test:
+    name: Test Python ${{ matrix.python-version }}
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version:
+          - "3.11"
+          - "3.12"
+          - "3.13"
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v6
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install package
+        run: python -m pip install -e ".[dev]"
+      - name: Lint
+        run: ruff check .
+      - name: Test
+        run: pytest
+  build:
+    name: Build package
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v6
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.13"
+      - name: Install build backend
+        run: python -m pip install build
+      - name: Build distributions
+        run: python -m build
+      - name: Upload distributions
+        uses: actions/upload-artifact@v4
+        with:
+          name: python-distributions
+          path: dist/

inference_autopsy-0.1.0/.github/workflows/pages.yml ADDED

@@ -0,0 +1,43 @@
+name: Deploy GitHub Pages
+on:
+  push:
+    branches:
+      - main
+    paths:
+      - docs/**
+      - .github/workflows/pages.yml
+  workflow_dispatch:
+permissions:
+  contents: read
+  pages: write
+  id-token: write
+concurrency:
+  group: pages
+  cancel-in-progress: false
+jobs:
+  deploy:
+    name: Deploy docs site
+    runs-on: ubuntu-latest
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v6
+      - name: Configure Pages
+        uses: actions/configure-pages@v5
+      - name: Upload Pages artifact
+        uses: actions/upload-pages-artifact@v4
+        with:
+          path: docs
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4

inference_autopsy-0.1.0/.github/workflows/publish-pypi.yml ADDED

@@ -0,0 +1,34 @@
+name: Publish to PyPI
+on:
+  release:
+    types:
+      - published
+  workflow_dispatch:
+jobs:
+  publish:
+    name: Build and publish package
+    runs-on: ubuntu-latest
+    environment: pypi
+    permissions:
+      contents: read
+      id-token: write
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v6
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.13"
+      - name: Install build backend
+        run: python -m pip install build
+      - name: Build distributions
+        run: python -m build
+      - name: Publish package distributions to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1

inference_autopsy-0.1.0/.gitignore ADDED

@@ -0,0 +1,13 @@
+.env
+.venv/
+.ruff_cache/
+.pytest_cache/
+.tmp/
+__pycache__/
+*.py[cod]
+*.egg-info/
+build/
+dist/
+runs/
+reports/
+*.log

inference_autopsy-0.1.0/CHANGELOG.md ADDED

@@ -0,0 +1,10 @@
+# Changelog
+## 0.1.0 - Initial Public Alpha
+- Added OpenAI-compatible single request and workload benchmarking commands.
+- Added JSONL trace recording with derived inference latency metrics.
+- Added diagnosis rules for TTFT, decode, tail latency, stream stalls, rate limits, cache effects, and concurrency pressure.
+- Added static HTML report generation.
+- Added baseline/candidate diffing with CI-friendly regression gates.
+- Added privacy-aware shape replay and exact replay refusal for hash-only traces.

inference_autopsy-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Ho Kei Ching
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

inference_autopsy-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,530 @@
+Metadata-Version: 2.4
+Name: inference-autopsy
+Version: 0.1.0
+Summary: Trace-first profiler and regression tester for AI inference systems.
+Project-URL: Homepage, https://github.com/kaseyho/Inference-Autopsy
+Project-URL: Repository, https://github.com/kaseyho/Inference-Autopsy
+Project-URL: Issues, https://github.com/kaseyho/Inference-Autopsy/issues
+Project-URL: Demo, https://kaseyho.github.io/Inference-Autopsy/
+Author: Kasey Ho
+License-Expression: MIT
+License-File: LICENSE
+Keywords: benchmarking,inference,latency,llm,openai-compatible,profiling
+Classifier: Development Status :: 3 - Alpha
+Classifier: Environment :: Console
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Software Development :: Testing
+Classifier: Topic :: System :: Benchmark
+Requires-Python: >=3.11
+Requires-Dist: httpx>=0.27
+Requires-Dist: pydantic>=2
+Requires-Dist: rich>=13
+Requires-Dist: typer>=0.12
+Provides-Extra: dev
+Requires-Dist: build>=1.2; extra == 'dev'
+Requires-Dist: pytest>=8; extra == 'dev'
+Requires-Dist: ruff>=0.5; extra == 'dev'
+Requires-Dist: twine>=5; extra == 'dev'
+Description-Content-Type: text/markdown
+# Inference Autopsy
+**Who killed my TTFT?**
+[![CI](https://github.com/kaseyho/Inference-Autopsy/actions/workflows/ci.yml/badge.svg)](https://github.com/kaseyho/Inference-Autopsy/actions/workflows/ci.yml)
+[![PyPI](https://img.shields.io/pypi/v/inference-autopsy.svg)](https://pypi.org/project/inference-autopsy/)
+[![Demo](https://img.shields.io/badge/demo-sample%20report-b6244f)](https://kaseyho.github.io/Inference-Autopsy/)
+Inference Autopsy is an open-source black-box profiler, workload replayer, and
+regression tester for OpenAI-compatible LLM inference endpoints.
+It records request-level and token-level traces, measures TTFT, ITL, tail
+latency, throughput, streaming stalls, and error rates, then turns those
+measurements into reproducible reports, baseline diffs, and CI regression
+gates.
+> Your LLM endpoint got slow. We found the body in the token stream.
+The goal is a polished local CLI plus static HTML reports, not a hosted SaaS
+dashboard.
+## Public Demo
+- Live project page: <https://kaseyho.github.io/Inference-Autopsy/>
+- Sample report: <https://kaseyho.github.io/Inference-Autopsy/sample-report.html>
+- PyPI package: <https://pypi.org/project/inference-autopsy/>
+The hosted report is generated from synthetic traces, so it is safe to share
+publicly and does not expose private prompts, API keys, or endpoints.
+## Install
+```bash
+pip install inference-autopsy
+```
+For local development:
+```bash
+python -m venv .venv
+source .venv/bin/activate
+pip install -e ".[dev]"
+pytest
+ruff check .
+```
+## Why This Exists
+LLM inference latency is not just one number.
+A request can be slow because of first-token delay, slow decode, long prompts,
+tail latency, rate limits, stream stalls, output bloat, or concurrency collapse.
+Aggregate benchmark numbers can tell you that something changed. Inference
+Autopsy is designed to help answer:
+1. How fast is this endpoint?
+2. Why is it slow?
+3. Can I reproduce the workload?
+4. Did my model or deployment regress?
+5. Can I explain the failure in one memorable line?
+The wedge is:
+```txt
+benchmark -> trace -> diagnose -> report -> replay -> regression gate
+```
+Existing tools such as
+[NVIDIA GenAI-Perf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html),
+[GuideLLM](https://github.com/vllm-project/guidellm), and
+[LLMPerf](https://github.com/ray-project/llmperf) measure important LLM serving
+metrics. Inference Autopsy focuses on trace-level reproducibility, diagnosis,
+human-readable reports, and CI gates for OpenAI-compatible endpoints.
+## What It Measures
+Inference Autopsy focuses on externally visible inference behavior:
+| Metric | Meaning |
+| --- | --- |
+| TTFT | Time from request start to first generated token |
+| TTFB | Time from request start to first response byte |
+| ITL | Inter-token latency between generated output tokens |
+| Request latency | Time from request start to final token or response end |
+| Output TPS | Generated output tokens per second |
+| Stream stalls | Token gaps above a configurable threshold |
+| Error rate | Failed requests divided by total requests |
+| Timeout rate | Timed-out requests divided by total requests |
+| Tail ratio | p99 latency divided by p50 latency |
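Reviewer's sketch (not part of the package diff): the per-request rows of the table above can be derived from client-side timestamps alone. The function name `derive_metrics`, the 150 ms default stall threshold, and the choice to divide output tokens by the full request duration are assumptions made for illustration; the package may define these differently.

```python
from statistics import mean


def derive_metrics(request_start_ms, request_end_ms, token_times_ms,
                   stall_threshold_ms=150.0):
    """Derive per-request metrics from timestamps measured on the client (ms).

    Hypothetical helper: the threshold default and the TPS window are assumptions.
    """
    if not token_times_ms:
        raise ValueError("at least one token timestamp is required")

    duration_ms = request_end_ms - request_start_ms
    # Gaps between consecutive output tokens are the inter-token latencies.
    gaps = [later - earlier
            for earlier, later in zip(token_times_ms, token_times_ms[1:])]

    return {
        "ttft_ms": token_times_ms[0] - request_start_ms,
        "request_latency_ms": duration_ms,
        "itl_mean_ms": mean(gaps) if gaps else 0.0,
        # Assumption: throughput is counted over the whole request.
        "output_tps": len(token_times_ms) / (duration_ms / 1000.0),
        # A stall is any token gap above the configurable threshold.
        "stall_count": sum(gap > stall_threshold_ms for gap in gaps),
    }


metrics = derive_metrics(0, 1300, [942, 971, 1001, 1033, 1208])
print(metrics)
```

With the five token timestamps shown later in the trace example, the gaps are 29, 30, 32, and 175 ms, so only the last gap counts as a stall under the assumed threshold.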
+Percentiles are first-class:
+```txt
+p50, p90, p95, p99
+```
+Because median latency is where demos look good. Tail latency is where systems
+start telling the truth.
+## Target CLI
+### Run a benchmark
+```bash
+autopsy bench \
+  --base-url http://localhost:8000/v1 \
+  --model meta-llama/Llama-3.1-8B-Instruct \
+  --profile rag-long \
+  --concurrency 1,4,8,16 \
+  --max-requests 200 \
+  --output runs/rag_long_v1.jsonl
+```
+### Generate a report
+```bash
+autopsy report \
+  runs/rag_long_v1.jsonl \
+  --html reports/rag_long_v1.html
+```
+### Diagnose a trace file
+```bash
+autopsy diagnose runs/rag_long_v1.jsonl
+```
+### Compare two runs
+```bash
+autopsy diff \
+  runs/baseline.jsonl \
+  runs/candidate.jsonl
+```
+### Fail CI on regression
+```bash
+autopsy diff \
+  runs/baseline.jsonl \
+  runs/candidate.jsonl \
+  --fail-if "ttft_p95 > +20%" \
+  --fail-if "itl_p95 > +15%" \
+  --fail-if "error_rate > 1%"
+```
+Gate examples:
+```txt
+ttft_p95 > +20%        # relative regression from baseline
+latency_p95 > +500ms   # absolute latency increase
+error_rate > 1%        # absolute candidate ceiling
+error_rate > +2pp      # percentage-point increase
+tail_ratio > 3x        # absolute ratio ceiling
+output_tps_p50 < -15%  # relative throughput drop
+```
+Exit codes:
+```txt
+0  all gates passed
+1  valid comparison, but one or more gates failed
+2  invalid gate, unreadable trace, or malformed input
+```
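Reviewer's sketch (not part of the package diff): one way the gate expressions and exit codes above could be evaluated. The regular expression, the function names `gate_fails` and `exit_code`, and the convention that rates are stored as fractions between 0 and 1 are all assumptions; only the gate syntax and the meaning of each exit code come from the README.

```python
import re

# metric, comparison, optional sign, number, unit
GATE_RE = re.compile(r"^\s*(\w+)\s*([<>])\s*([+-]?)(\d+(?:\.\d+)?)\s*(%|pp|ms|x)\s*$")


def gate_fails(gate, baseline, candidate):
    """Return True when the gate condition holds, meaning the gate fails."""
    match = GATE_RE.match(gate)
    if match is None:
        raise ValueError(f"invalid gate: {gate!r}")
    metric, op, sign, number, unit = match.groups()
    threshold = float(number) * (-1.0 if sign == "-" else 1.0)
    base, cand = baseline[metric], candidate[metric]

    if sign and unit == "%":        # relative change from baseline
        observed = (cand - base) / base * 100.0
    elif sign and unit == "pp":     # percentage-point change (rates as fractions)
        observed = (cand - base) * 100.0
    elif sign and unit == "ms":     # absolute latency change
        observed = cand - base
    elif not sign and unit == "%":  # absolute candidate ceiling for a rate
        observed = cand * 100.0
    elif not sign and unit in ("ms", "x"):  # absolute candidate value
        observed = cand
    else:
        raise ValueError(f"unsupported sign/unit combination in {gate!r}")
    return observed > threshold if op == ">" else observed < threshold


def exit_code(gates, baseline, candidate):
    """Map gate results to the exit codes documented above."""
    try:
        results = [gate_fails(g, baseline, candidate) for g in gates]
    except (ValueError, KeyError, ZeroDivisionError):
        return 2  # invalid gate or malformed input
    return 1 if any(results) else 0


baseline = {"ttft_p95": 840.0, "error_rate": 0.001, "output_tps_p50": 41.2}
candidate = {"ttft_p95": 1210.0, "error_rate": 0.018, "output_tps_p50": 35.7}
print(exit_code(["ttft_p95 > +20%", "error_rate > 1%"], baseline, candidate))
```

In this sketch a signed threshold always compares the change from baseline, and an unsigned threshold compares the candidate value on its own, which matches the comments in the gate examples.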
+### Replay a captured workload
+```bash
+autopsy replay \
+  runs/baseline.jsonl \
+  --base-url http://localhost:11434/v1 \
+  --model qwen3:8b \
+  --mode shape \
+  --output runs/replay_ollama.jsonl
+```
+Replay is privacy-aware:
+```txt
+shape replay  regenerates comparable prompts from saved workload metadata
+exact replay  requires recoverable full prompts and is refused for hash-only traces
+```
+## Example Output
+```txt
+Regression detected.
+Metric                 Baseline     Candidate    Change
+TTFT p95              840ms        1210ms       +44.0%
+ITL p95               39ms         47ms         +20.5%
+Request p99           6.2s         9.8s         +58.1%
+Error rate            0.1%         1.8%         +1.7pp
+Output TPS            41.2         35.7         -13.3%
+Failed gates:
+- ttft_p95 > +20%
+- error_rate > 1%
+Cause of death:
+Tail Wizard + Rate Limit MegaKnight
+```
+## Failure Arena
+Each bad run gets a memorable diagnosis backed by hard metrics.
+| Cause of death | Serious meaning |
+| --- | --- |
+| TTFT Pekka | First-token latency dominates request time |
+| Decode Barbarian | Inter-token latency is high |
+| Tail Wizard | p99 latency explodes while median looks fine |
+| Context Golem | Long prompts crush prefill performance |
+| Stream Wall Breaker | Streaming has large token gaps |
+| Rate Limit MegaKnight | 429s, throttling, or retries dominate |
+| Output Electro Dragon | Output length inflated unexpectedly |
+| JSON Skeleton Army | Structured output mode causes failures or latency |
+| Retry Witch | Hidden retries inflate latency |
+| Queue Queen | Endpoint collapses under parallel load |
+Example diagnosis:
+```txt
+Cause of death: Context Golem
+Severity: High
+Evidence:
+- TTFT p95 rises from 620ms at 512-token prompts to 3120ms at 8192-token prompts.
+- ITL stays mostly flat.
+- Request latency increase is concentrated before the first token.
+Likely driver:
+Long prompt prefill dominates latency.
+Suggested next tests:
+- Bucket prompts by input length.
+- Compare with prefix caching if the backend supports it.
+- Test prompt compression.
+```
+## Workload Profiles
+Inference Autopsy uses workload profiles instead of one toy prompt. Different
+workloads expose different bottlenecks.
+Planned built-in profiles:
+| Profile | Purpose |
+| --- | --- |
+| short-chat | Basic latency and endpoint overhead |
+| rag-long | Long-context RAG-style TTFT sensitivity |
+| code-completion | Long decode and output throughput |
+| agent-json | JSON reliability and structured-output latency |
+| long-context | Context-window and prefill stress |
+| mixed-realistic | Blended production-like workload |
+Example profile shape:
+```yaml
+name: rag-long
+description: Long-context RAG-style prompts with moderate outputs.
+input_tokens:
+  distribution: bucket
+  values: [2000, 4000, 8000]
+  weights: [0.4, 0.4, 0.2]
+output_tokens:
+  distribution: normal
+  mean: 256
+  std: 64
+  min: 64
+  max: 512
+sampling:
+  temperature: 0.2
+  max_tokens: 512
+messages:
+  system: "You answer questions using the provided context."
+  user_template: |
+    Context:
+    {{ generated_context }}
+    Question:
+    {{ generated_question }}
+```
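Reviewer's sketch (not part of the package diff): how a runner might draw a request shape from the profile above. It assumes that `bucket` means a weighted choice over `values`, and that the `normal` output distribution is clamped to `min` and `max` rather than resampled. The function name `sample_request_shape` is hypothetical, and the profile is written as a plain dict to avoid a YAML dependency.

```python
import random

# The rag-long profile from the YAML above, reduced to the sampled fields.
RAG_LONG = {
    "input_tokens": {"distribution": "bucket",
                     "values": [2000, 4000, 8000],
                     "weights": [0.4, 0.4, 0.2]},
    "output_tokens": {"distribution": "normal",
                      "mean": 256, "std": 64, "min": 64, "max": 512},
}


def sample_request_shape(profile, rng):
    """Return (input_tokens, output_tokens) for one synthetic request."""
    inp = profile["input_tokens"]
    out = profile["output_tokens"]

    # Assumed meaning of "bucket": pick one value according to its weight.
    input_tokens = rng.choices(inp["values"], weights=inp["weights"], k=1)[0]

    # Assumed meaning of "normal": draw, round, then clamp into [min, max].
    drawn = round(rng.gauss(out["mean"], out["std"]))
    output_tokens = min(out["max"], max(out["min"], drawn))
    return input_tokens, output_tokens


rng = random.Random(42)  # seeded so a workload can be regenerated
print([sample_request_shape(RAG_LONG, rng) for _ in range(3)])
```

Seeding the generator is what makes a shape-based workload repeatable: the same seed and profile yield the same sequence of request shapes.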
+## Trace Format
+Each request is saved as one JSONL line.
+```json
+{
+  "schema_version": "0.1",
+  "run_id": "run_2026_05_23_001",
+  "request_id": "req_00042",
+  "profile": "rag-long",
+  "model": "llama-3.1-8b",
+  "base_url_hash": "endpoint_a",
+  "input_tokens_estimated": 4096,
+  "output_tokens": 261,
+  "status": "success",
+  "timings_ms": {
+    "request_start": 0,
+    "first_byte": 817,
+    "first_token": 942,
+    "request_end": 7610
+  },
+  "token_times_ms": [942, 971, 1001, 1033, 1208],
+  "metrics": {
+    "ttft_ms": 942,
+    "request_latency_ms": 7610,
+    "itl_mean_ms": 28.4,
+    "itl_p95_ms": 71.2,
+    "output_tps": 38.6,
+    "stall_count": 2
+  },
+  "error": null
+}
+```
+JSONL is append-friendly, easy to inspect, easy to upload as a CI artifact, and
+simple to process with Python, DuckDB, Polars, or shell tools.
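Reviewer's sketch (not part of the package diff): processing a trace file with plain Python. It reads the `status` and `metrics` fields shown in the record above and uses a nearest-rank percentile. The percentile method, the `summarize` name, and the `"error"` status value for failed requests are assumptions, since the README does not specify them.

```python
import json
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(jsonl_text):
    """Aggregate one run from JSONL text, one request record per line."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    # Assumption: any status other than "success" counts as a failed request.
    ok = [r for r in records if r["status"] == "success"]
    latency = [r["metrics"]["request_latency_ms"] for r in ok]
    ttft = [r["metrics"]["ttft_ms"] for r in ok]
    p50, p99 = percentile(latency, 50), percentile(latency, 99)
    return {
        "requests": len(records),
        "error_rate": (len(records) - len(ok)) / len(records),
        "ttft_p95": percentile(ttft, 95),
        "latency_p50": p50,
        "latency_p99": p99,
        "tail_ratio": p99 / p50,  # p99 latency divided by p50 latency
    }


sample = "\n".join(
    json.dumps({"status": "success",
                "metrics": {"ttft_ms": 100 * n, "request_latency_ms": 1000 * n}})
    for n in range(1, 5)
) + "\n" + json.dumps({"status": "error", "metrics": None})
print(summarize(sample))
```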
+## HTML Reports
+The static report is the main demo artifact.
+Implemented first-pass sections:
+- Executive summary
+- Diagnosis cards with evidence
+- Summary metric cards
+- Overall percentile table
+- Static charts
+- Profile breakdown
+- Concurrency breakdown
+- Cache summary
+- Worst requests by latency and TTFT
+- Methodology notes
+The report is designed to be shareable without running a server.
+## CI Usage
+Planned GitHub Actions workflow:
+```yaml
+name: LLM Inference Regression Test
+on:
+  pull_request:
+jobs:
+  inference-autopsy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install Inference Autopsy
+        run: pip install inference-autopsy
+      - name: Run benchmark
+        run: |
+          autopsy bench \
+            --base-url ${{ secrets.LLM_BASE_URL }} \
+            --api-key ${{ secrets.LLM_API_KEY }} \
+            --model ${{ vars.LLM_MODEL }} \
+            --profile short-chat \
+            --max-requests 50 \
+            --output candidate.jsonl
+      - name: Check regression
+        run: |
+          autopsy diff baseline.jsonl candidate.jsonl \
+            --fail-if "ttft_p95 > +20%" \
+            --fail-if "itl_p95 > +15%" \
+            --fail-if "error_rate > 1%"
+```
+## Compatibility Goal
+Inference Autopsy targets OpenAI-compatible chat completion endpoints,
+especially:
+- vLLM OpenAI-compatible server
+- Ollama OpenAI-compatible API
+- LiteLLM proxy
+- hosted OpenAI-compatible inference providers
+- internal company deployments using OpenAI-style APIs
+Focus:
+- `/v1/chat/completions`
+- `stream=true`
+- `stream=false`
+- request-level JSONL traces
+- exact replay from saved prompts
+## Architecture
+```txt
+Typer CLI
+  -> async workload runner
+  -> OpenAI-compatible HTTP client
+  -> streaming parser
+  -> JSONL trace recorder
+  -> metrics engine
+  -> diagnosis engine
+  -> report generator
+  -> diff and CI gate engine
+  -> replay engine
+```
+Planned Python stack:
+- Typer for the CLI
+- httpx for async HTTP
+- Pydantic for schemas
+- orjson for fast JSON
+- Rich for terminal output
+- Polars or plain Python for metrics
+- Standard-library HTML rendering for the first static report
+- Optional Jinja2 and Plotly later when template or chart complexity justifies it
+- pytest for tests
+## Limitations
+Inference Autopsy is a black-box endpoint profiler. It identifies externally
+visible symptoms and likely bottlenecks, not definitive backend internals.
+It does not directly observe GPU kernel time, scheduler state, KV-cache pressure,
+batching internals, or prefill/decode implementation details unless a backend
+exposes those signals.
+Other known limitations:
+- Token counting may be approximate when providers do not return usage metadata.
+- OpenAI-compatible streaming formats vary across servers.
+- Hosted endpoint measurements include network and provider-side variance.
+- Replay preserves workload shape and prompts but cannot guarantee deterministic model output.
+- Static reports are not a replacement for production observability.
+## Benchmark Methodology
+Reports should include:
+- endpoint and model
+- hardware or provider
+- concurrency
+- request count
+- warmup policy
+- timeout policy
+- streaming mode
+- token counting method
+- retry policy
+- prompt generation method
+- percentile calculation method
+No trace, no reproducibility.
+## Development
+```bash
+python -m venv .venv
+source .venv/bin/activate
+pip install -e ".[dev]"
+pytest
+ruff check .
+mypy autopsy
+```
+## Release
+Public releases are designed to run through GitHub Actions:
+1. Push changes to `main`.
+2. Confirm the `CI` workflow passes.
+3. Confirm the `Deploy GitHub Pages` workflow publishes the docs site.
+4. Create a GitHub release such as `v0.1.0`.
+5. The `Publish to PyPI` workflow builds and publishes the package through PyPI Trusted Publishing.
+The PyPI Trusted Publisher must be configured once in PyPI with:
+```txt
+Repository owner: kaseyho
+Repository name: Inference-Autopsy
+Workflow name: publish-pypi.yml
+Environment name: pypi
+```
+## License
+MIT License. See [LICENSE](LICENSE).