inference-autopsy 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (176)
  1. inference_autopsy-0.1.0/.github/workflows/ci.yml +62 -0
  2. inference_autopsy-0.1.0/.github/workflows/pages.yml +43 -0
  3. inference_autopsy-0.1.0/.github/workflows/publish-pypi.yml +34 -0
  4. inference_autopsy-0.1.0/.gitignore +13 -0
  5. inference_autopsy-0.1.0/CHANGELOG.md +10 -0
  6. inference_autopsy-0.1.0/LICENSE +21 -0
  7. inference_autopsy-0.1.0/PKG-INFO +530 -0
  8. inference_autopsy-0.1.0/README.md +496 -0
  9. inference_autopsy-0.1.0/anthropic-evals.md +28 -0
  10. inference_autopsy-0.1.0/autopsy/__init__.py +3 -0
  11. inference_autopsy-0.1.0/autopsy/cli.py +348 -0
  12. inference_autopsy-0.1.0/autopsy/client/__init__.py +1 -0
  13. inference_autopsy-0.1.0/autopsy/client/openai_compatible.py +376 -0
  14. inference_autopsy-0.1.0/autopsy/client/stream_parser.py +75 -0
  15. inference_autopsy-0.1.0/autopsy/diagnosis/__init__.py +1 -0
  16. inference_autopsy-0.1.0/autopsy/diagnosis/labels.py +38 -0
  17. inference_autopsy-0.1.0/autopsy/diagnosis/rules.py +372 -0
  18. inference_autopsy-0.1.0/autopsy/diff/__init__.py +2 -0
  19. inference_autopsy-0.1.0/autopsy/diff/compare.py +237 -0
  20. inference_autopsy-0.1.0/autopsy/diff/gates.py +207 -0
  21. inference_autopsy-0.1.0/autopsy/fake/__init__.py +0 -0
  22. inference_autopsy-0.1.0/autopsy/fake/generate.py +225 -0
  23. inference_autopsy-0.1.0/autopsy/metrics/__init__.py +0 -0
  24. inference_autopsy-0.1.0/autopsy/metrics/aggregation.py +81 -0
  25. inference_autopsy-0.1.0/autopsy/metrics/cache.py +46 -0
  26. inference_autopsy-0.1.0/autopsy/metrics/core.py +132 -0
  27. inference_autopsy-0.1.0/autopsy/metrics/saturation.py +53 -0
  28. inference_autopsy-0.1.0/autopsy/metrics/summary.py +66 -0
  29. inference_autopsy-0.1.0/autopsy/replay/__init__.py +2 -0
  30. inference_autopsy-0.1.0/autopsy/replay/replay.py +271 -0
  31. inference_autopsy-0.1.0/autopsy/runner/__init__.py +1 -0
  32. inference_autopsy-0.1.0/autopsy/runner/workload_runner.py +299 -0
  33. inference_autopsy-0.1.0/autopsy/traces/__init__.py +0 -0
  34. inference_autopsy-0.1.0/autopsy/traces/derive.py +27 -0
  35. inference_autopsy-0.1.0/autopsy/traces/jsonl.py +42 -0
  36. inference_autopsy-0.1.0/autopsy/traces/schema.py +88 -0
  37. inference_autopsy-0.1.0/autopsy/workloads/__init__.py +1 -0
  38. inference_autopsy-0.1.0/autopsy/workloads/profiles.py +96 -0
  39. inference_autopsy-0.1.0/docs/.nojekyll +1 -0
  40. inference_autopsy-0.1.0/docs/LearningOutcomes.md +349 -0
  41. inference_autopsy-0.1.0/docs/README.md +205 -0
  42. inference_autopsy-0.1.0/docs/Software_Architecture.png +0 -0
  43. inference_autopsy-0.1.0/docs/ai/agent-rules.md +54 -0
  44. inference_autopsy-0.1.0/docs/ai/anti-patterns.md +48 -0
  45. inference_autopsy-0.1.0/docs/ai/coding-contract.md +61 -0
  46. inference_autopsy-0.1.0/docs/ai/definition-of-done.md +47 -0
  47. inference_autopsy-0.1.0/docs/ai/ide-integration.md +62 -0
  48. inference_autopsy-0.1.0/docs/ai/pr-review-checklist.md +59 -0
  49. inference_autopsy-0.1.0/docs/ai/task-execution-protocol.md +64 -0
  50. inference_autopsy-0.1.0/docs/context/business-rules.md +35 -0
  51. inference_autopsy-0.1.0/docs/context/current-project-state.md +99 -0
  52. inference_autopsy-0.1.0/docs/context/domain-knowledge.md +33 -0
  53. inference_autopsy-0.1.0/docs/context/glossary.md +28 -0
  54. inference_autopsy-0.1.0/docs/context/known-issues.md +42 -0
  55. inference_autopsy-0.1.0/docs/context/technical-debt.md +38 -0
  56. inference_autopsy-0.1.0/docs/documentation/adr-guidelines.md +34 -0
  57. inference_autopsy-0.1.0/docs/documentation/changelog-guidelines.md +32 -0
  58. inference_autopsy-0.1.0/docs/documentation/documentation-quality.md +36 -0
  59. inference_autopsy-0.1.0/docs/enforcement/README.md +33 -0
  60. inference_autopsy-0.1.0/docs/enforcement/ai-validation.md +35 -0
  61. inference_autopsy-0.1.0/docs/enforcement/architecture-fitness.md +30 -0
  62. inference_autopsy-0.1.0/docs/enforcement/ci-checks.md +37 -0
  63. inference_autopsy-0.1.0/docs/enforcement/code-ownership.md +36 -0
  64. inference_autopsy-0.1.0/docs/engineering/api-design.md +43 -0
  65. inference_autopsy-0.1.0/docs/engineering/architecture.md +79 -0
  66. inference_autopsy-0.1.0/docs/engineering/code-style.md +43 -0
  67. inference_autopsy-0.1.0/docs/engineering/database-guidelines.md +39 -0
  68. inference_autopsy-0.1.0/docs/engineering/dependency-policy.md +43 -0
  69. inference_autopsy-0.1.0/docs/engineering/error-handling.md +44 -0
  70. inference_autopsy-0.1.0/docs/engineering/logging-monitoring.md +49 -0
  71. inference_autopsy-0.1.0/docs/engineering/naming-conventions.md +45 -0
  72. inference_autopsy-0.1.0/docs/engineering/performance-guidelines.md +47 -0
  73. inference_autopsy-0.1.0/docs/engineering/principles.md +45 -0
  74. inference_autopsy-0.1.0/docs/engineering/project-structure.md +68 -0
  75. inference_autopsy-0.1.0/docs/engineering/refactoring-guidelines.md +41 -0
  76. inference_autopsy-0.1.0/docs/engineering/security-guidelines.md +37 -0
  77. inference_autopsy-0.1.0/docs/engineering/state-management.md +40 -0
  78. inference_autopsy-0.1.0/docs/engineering/testing-strategy.md +76 -0
  79. inference_autopsy-0.1.0/docs/frontend/accessibility.md +40 -0
  80. inference_autopsy-0.1.0/docs/frontend/animation-guidelines.md +28 -0
  81. inference_autopsy-0.1.0/docs/frontend/component-architecture.md +42 -0
  82. inference_autopsy-0.1.0/docs/frontend/content-design.md +44 -0
  83. inference_autopsy-0.1.0/docs/frontend/design-system.md +54 -0
  84. inference_autopsy-0.1.0/docs/frontend/forms-and-validation.md +32 -0
  85. inference_autopsy-0.1.0/docs/frontend/frontend-performance.md +39 -0
  86. inference_autopsy-0.1.0/docs/frontend/interaction-guidelines.md +35 -0
  87. inference_autopsy-0.1.0/docs/frontend/responsive-design.md +36 -0
  88. inference_autopsy-0.1.0/docs/frontend/ui-principles.md +47 -0
  89. inference_autopsy-0.1.0/docs/governance/change-management.md +34 -0
  90. inference_autopsy-0.1.0/docs/governance/contribution-guidelines.md +43 -0
  91. inference_autopsy-0.1.0/docs/governance/decision-framework.md +43 -0
  92. inference_autopsy-0.1.0/docs/governance/dependency-approval.md +45 -0
  93. inference_autopsy-0.1.0/docs/governance/tech-stack-policy.md +33 -0
  94. inference_autopsy-0.1.0/docs/index.html +217 -0
  95. inference_autopsy-0.1.0/docs/product/decision-log.md +48 -0
  96. inference_autopsy-0.1.0/docs/product/feature-prioritization.md +41 -0
  97. inference_autopsy-0.1.0/docs/product/feature-spec-template.md +61 -0
  98. inference_autopsy-0.1.0/docs/product/kpi-framework.md +36 -0
  99. inference_autopsy-0.1.0/docs/product/owner-project-guide.md +2088 -0
  100. inference_autopsy-0.1.0/docs/product/product-principles.md +43 -0
  101. inference_autopsy-0.1.0/docs/product/project-spec.md +1109 -0
  102. inference_autopsy-0.1.0/docs/product/roadmap.md +38 -0
  103. inference_autopsy-0.1.0/docs/product/user-personas.md +36 -0
  104. inference_autopsy-0.1.0/docs/research/README.md +13 -0
  105. inference_autopsy-0.1.0/docs/research/concepts-to-master.md +607 -0
  106. inference_autopsy-0.1.0/docs/research/implementation-takeaways.md +480 -0
  107. inference_autopsy-0.1.0/docs/research/reading-map.md +19 -0
  108. inference_autopsy-0.1.0/docs/research/sources/aiperf.md +28 -0
  109. inference_autopsy-0.1.0/docs/research/sources/guidellm.md +33 -0
  110. inference_autopsy-0.1.0/docs/research/sources/httpx-asyncio-typer.md +58 -0
  111. inference_autopsy-0.1.0/docs/research/sources/openai-evals.md +27 -0
  112. inference_autopsy-0.1.0/docs/research/sources/opentelemetry-genai.md +27 -0
  113. inference_autopsy-0.1.0/docs/research/sources/pagedattention.md +25 -0
  114. inference_autopsy-0.1.0/docs/research/sources/source-notes-template.md +39 -0
  115. inference_autopsy-0.1.0/docs/research/sources/vllm-openai-prefix-caching.md +41 -0
  116. inference_autopsy-0.1.0/docs/research/talking-points.md +84 -0
  117. inference_autopsy-0.1.0/docs/sample-report.html +1248 -0
  118. inference_autopsy-0.1.0/docs/templates/adr-template.md +34 -0
  119. inference_autopsy-0.1.0/docs/templates/api-addition.md +38 -0
  120. inference_autopsy-0.1.0/docs/templates/architecture-rfc.md +44 -0
  121. inference_autopsy-0.1.0/docs/templates/bug-fix.md +40 -0
  122. inference_autopsy-0.1.0/docs/templates/feature-implementation.md +50 -0
  123. inference_autopsy-0.1.0/docs/templates/incident-remediation.md +38 -0
  124. inference_autopsy-0.1.0/docs/templates/migration.md +41 -0
  125. inference_autopsy-0.1.0/docs/templates/performance-optimization.md +41 -0
  126. inference_autopsy-0.1.0/docs/templates/postmortem-template.md +42 -0
  127. inference_autopsy-0.1.0/docs/templates/refactoring.md +41 -0
  128. inference_autopsy-0.1.0/docs/templates/ui-change.md +40 -0
  129. inference_autopsy-0.1.0/docs/workflow/code-review-process.md +33 -0
  130. inference_autopsy-0.1.0/docs/workflow/debugging-guide.md +34 -0
  131. inference_autopsy-0.1.0/docs/workflow/deployment-checklists.md +39 -0
  132. inference_autopsy-0.1.0/docs/workflow/development-lifecycle.md +30 -0
  133. inference_autopsy-0.1.0/docs/workflow/local-development.md +41 -0
  134. inference_autopsy-0.1.0/docs/workflow/phase-1-build-guide.md +1077 -0
  135. inference_autopsy-0.1.0/docs/workflow/phase-2-learning-notes.md +293 -0
  136. inference_autopsy-0.1.0/docs/workflow/phase-2-testing-guide.md +554 -0
  137. inference_autopsy-0.1.0/docs/workflow/phase-3-testing-guide.md +114 -0
  138. inference_autopsy-0.1.0/docs/workflow/phase-4-build-guide.md +734 -0
  139. inference_autopsy-0.1.0/docs/workflow/phase-4-learning-notes.md +206 -0
  140. inference_autopsy-0.1.0/docs/workflow/phase-4-testing-guide.md +283 -0
  141. inference_autopsy-0.1.0/docs/workflow/phase-5-build-guide.md +221 -0
  142. inference_autopsy-0.1.0/docs/workflow/phase-5-learning-notes.md +166 -0
  143. inference_autopsy-0.1.0/docs/workflow/phase-5-testing-guide.md +134 -0
  144. inference_autopsy-0.1.0/docs/workflow/phase-6-build-guide.md +184 -0
  145. inference_autopsy-0.1.0/docs/workflow/phase-6-learning-notes.md +166 -0
  146. inference_autopsy-0.1.0/docs/workflow/phase-6-testing-guide.md +133 -0
  147. inference_autopsy-0.1.0/docs/workflow/phase-7-build-guide.md +189 -0
  148. inference_autopsy-0.1.0/docs/workflow/phase-7-learning-notes.md +155 -0
  149. inference_autopsy-0.1.0/docs/workflow/phase-7-testing-guide.md +121 -0
  150. inference_autopsy-0.1.0/docs/workflow/phase-8-build-guide.md +146 -0
  151. inference_autopsy-0.1.0/docs/workflow/phase-8-learning-notes.md +112 -0
  152. inference_autopsy-0.1.0/docs/workflow/phase-8-testing-guide.md +90 -0
  153. inference_autopsy-0.1.0/docs/workflow/release-process.md +87 -0
  154. inference_autopsy-0.1.0/docs/writeups/phase-3-mid-progress-writeup.md +141 -0
  155. inference_autopsy-0.1.0/examples/traces/connection-error.jsonl +1 -0
  156. inference_autopsy-0.1.0/examples/traces/fake.jsonl +50 -0
  157. inference_autopsy-0.1.0/examples/traces/single-nonstream.jsonl +1 -0
  158. inference_autopsy-0.1.0/examples/traces/single.jsonl +1 -0
  159. inference_autopsy-0.1.0/examples/traces/timeout.jsonl +1 -0
  160. inference_autopsy-0.1.0/pyproject.toml +61 -0
  161. inference_autopsy-0.1.0/tests/test_diagnosis_rules.py +134 -0
  162. inference_autopsy-0.1.0/tests/test_diff_cli.py +59 -0
  163. inference_autopsy-0.1.0/tests/test_diff_compare.py +43 -0
  164. inference_autopsy-0.1.0/tests/test_diff_gates.py +80 -0
  165. inference_autopsy-0.1.0/tests/test_jsonl.py +19 -0
  166. inference_autopsy-0.1.0/tests/test_metrics_aggregation.py +114 -0
  167. inference_autopsy-0.1.0/tests/test_metrics_cache.py +39 -0
  168. inference_autopsy-0.1.0/tests/test_metrics_core.py +169 -0
  169. inference_autopsy-0.1.0/tests/test_openai_compatible.py +90 -0
  170. inference_autopsy-0.1.0/tests/test_profiles.py +24 -0
  171. inference_autopsy-0.1.0/tests/test_replay.py +112 -0
  172. inference_autopsy-0.1.0/tests/test_report_html.py +86 -0
  173. inference_autopsy-0.1.0/tests/test_schema.py +50 -0
  174. inference_autopsy-0.1.0/tests/test_stream_parser.py +41 -0
  175. inference_autopsy-0.1.0/tests/test_summary.py +46 -0
  176. inference_autopsy-0.1.0/tests/test_workload_runner.py +79 -0
@@ -0,0 +1,62 @@
1
+ name: CI
2
+
3
+ on:
4
+ push:
5
+ branches:
6
+ - main
7
+ pull_request:
8
+
9
+ jobs:
10
+ test:
11
+ name: Test Python ${{ matrix.python-version }}
12
+ runs-on: ubuntu-latest
13
+ strategy:
14
+ fail-fast: false
15
+ matrix:
16
+ python-version:
17
+ - "3.11"
18
+ - "3.12"
19
+ - "3.13"
20
+
21
+ steps:
22
+ - name: Checkout
23
+ uses: actions/checkout@v6
24
+
25
+ - name: Set up Python
26
+ uses: actions/setup-python@v5
27
+ with:
28
+ python-version: ${{ matrix.python-version }}
29
+
30
+ - name: Install package
31
+ run: python -m pip install -e ".[dev]"
32
+
33
+ - name: Lint
34
+ run: ruff check .
35
+
36
+ - name: Test
37
+ run: pytest
38
+
39
+ build:
40
+ name: Build package
41
+ runs-on: ubuntu-latest
42
+
43
+ steps:
44
+ - name: Checkout
45
+ uses: actions/checkout@v6
46
+
47
+ - name: Set up Python
48
+ uses: actions/setup-python@v5
49
+ with:
50
+ python-version: "3.13"
51
+
52
+ - name: Install build backend
53
+ run: python -m pip install build
54
+
55
+ - name: Build distributions
56
+ run: python -m build
57
+
58
+ - name: Upload distributions
59
+ uses: actions/upload-artifact@v4
60
+ with:
61
+ name: python-distributions
62
+ path: dist/
@@ -0,0 +1,43 @@
1
+ name: Deploy GitHub Pages
2
+
3
+ on:
4
+ push:
5
+ branches:
6
+ - main
7
+ paths:
8
+ - docs/**
9
+ - .github/workflows/pages.yml
10
+ workflow_dispatch:
11
+
12
+ permissions:
13
+ contents: read
14
+ pages: write
15
+ id-token: write
16
+
17
+ concurrency:
18
+ group: pages
19
+ cancel-in-progress: false
20
+
21
+ jobs:
22
+ deploy:
23
+ name: Deploy docs site
24
+ runs-on: ubuntu-latest
25
+ environment:
26
+ name: github-pages
27
+ url: ${{ steps.deployment.outputs.page_url }}
28
+
29
+ steps:
30
+ - name: Checkout
31
+ uses: actions/checkout@v6
32
+
33
+ - name: Configure Pages
34
+ uses: actions/configure-pages@v5
35
+
36
+ - name: Upload Pages artifact
37
+ uses: actions/upload-pages-artifact@v4
38
+ with:
39
+ path: docs
40
+
41
+ - name: Deploy to GitHub Pages
42
+ id: deployment
43
+ uses: actions/deploy-pages@v4
@@ -0,0 +1,34 @@
1
+ name: Publish to PyPI
2
+
3
+ on:
4
+ release:
5
+ types:
6
+ - published
7
+ workflow_dispatch:
8
+
9
+ jobs:
10
+ publish:
11
+ name: Build and publish package
12
+ runs-on: ubuntu-latest
13
+ environment: pypi
14
+ permissions:
15
+ contents: read
16
+ id-token: write
17
+
18
+ steps:
19
+ - name: Checkout
20
+ uses: actions/checkout@v6
21
+
22
+ - name: Set up Python
23
+ uses: actions/setup-python@v5
24
+ with:
25
+ python-version: "3.13"
26
+
27
+ - name: Install build backend
28
+ run: python -m pip install build
29
+
30
+ - name: Build distributions
31
+ run: python -m build
32
+
33
+ - name: Publish package distributions to PyPI
34
+ uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,13 @@
1
+ .env
2
+ .venv/
3
+ .ruff_cache/
4
+ .pytest_cache/
5
+ .tmp/
6
+ __pycache__/
7
+ *.py[cod]
8
+ *.egg-info/
9
+ build/
10
+ dist/
11
+ runs/
12
+ reports/
13
+ *.log
@@ -0,0 +1,10 @@
1
+ # Changelog
2
+
3
+ ## 0.1.0 - Initial Public Alpha
4
+
5
+ - Added OpenAI-compatible single request and workload benchmarking commands.
6
+ - Added JSONL trace recording with derived inference latency metrics.
7
+ - Added diagnosis rules for TTFT, decode, tail latency, stream stalls, rate limits, cache effects, and concurrency pressure.
8
+ - Added static HTML report generation.
9
+ - Added baseline/candidate diffing with CI-friendly regression gates.
10
+ - Added privacy-aware shape replay and exact replay refusal for hash-only traces.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Ho Kei Ching
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,530 @@
1
+ Metadata-Version: 2.4
2
+ Name: inference-autopsy
3
+ Version: 0.1.0
4
+ Summary: Trace-first profiler and regression tester for AI inference systems.
5
+ Project-URL: Homepage, https://github.com/kaseyho/Inference-Autopsy
6
+ Project-URL: Repository, https://github.com/kaseyho/Inference-Autopsy
7
+ Project-URL: Issues, https://github.com/kaseyho/Inference-Autopsy/issues
8
+ Project-URL: Demo, https://kaseyho.github.io/Inference-Autopsy/
9
+ Author: Kasey Ho
10
+ License-Expression: MIT
11
+ License-File: LICENSE
12
+ Keywords: benchmarking,inference,latency,llm,openai-compatible,profiling
13
+ Classifier: Development Status :: 3 - Alpha
14
+ Classifier: Environment :: Console
15
+ Classifier: Intended Audience :: Developers
16
+ Classifier: License :: OSI Approved :: MIT License
17
+ Classifier: Programming Language :: Python :: 3
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Programming Language :: Python :: 3.13
21
+ Classifier: Topic :: Software Development :: Testing
22
+ Classifier: Topic :: System :: Benchmark
23
+ Requires-Python: >=3.11
24
+ Requires-Dist: httpx>=0.27
25
+ Requires-Dist: pydantic>=2
26
+ Requires-Dist: rich>=13
27
+ Requires-Dist: typer>=0.12
28
+ Provides-Extra: dev
29
+ Requires-Dist: build>=1.2; extra == 'dev'
30
+ Requires-Dist: pytest>=8; extra == 'dev'
31
+ Requires-Dist: ruff>=0.5; extra == 'dev'
32
+ Requires-Dist: twine>=5; extra == 'dev'
33
+ Description-Content-Type: text/markdown
34
+
35
+ # Inference Autopsy
36
+
37
+ **Who killed my TTFT?**
38
+
39
+ [![CI](https://github.com/kaseyho/Inference-Autopsy/actions/workflows/ci.yml/badge.svg)](https://github.com/kaseyho/Inference-Autopsy/actions/workflows/ci.yml)
40
+ [![PyPI](https://img.shields.io/pypi/v/inference-autopsy.svg)](https://pypi.org/project/inference-autopsy/)
41
+ [![Demo](https://img.shields.io/badge/demo-sample%20report-b6244f)](https://kaseyho.github.io/Inference-Autopsy/)
42
+
43
+ Inference Autopsy is an open-source black-box profiler, workload replayer, and
44
+ regression tester for OpenAI-compatible LLM inference endpoints.
45
+
46
+ It records request-level and token-level traces, measures TTFT, ITL, tail
47
+ latency, throughput, streaming stalls, and error rates, then turns those
48
+ measurements into reproducible reports, baseline diffs, and CI regression
49
+ gates.
50
+
51
+ > Your LLM endpoint got slow. We found the body in the token stream.
52
+
53
+ The goal is a polished local CLI plus static HTML reports, not a hosted SaaS
54
+ dashboard.
55
+
56
+ ## Public Demo
57
+
58
+ - Live project page: <https://kaseyho.github.io/Inference-Autopsy/>
59
+ - Sample report: <https://kaseyho.github.io/Inference-Autopsy/sample-report.html>
60
+ - PyPI package: <https://pypi.org/project/inference-autopsy/>
61
+
62
+ The hosted report is generated from synthetic traces, so it is safe to share
63
+ publicly and does not expose private prompts, API keys, or endpoints.
64
+
65
+ ## Install
66
+
67
+ ```bash
68
+ pip install inference-autopsy
69
+ ```
70
+
71
+ For local development:
72
+
73
+ ```bash
74
+ python -m venv .venv
75
+ source .venv/bin/activate
76
+ pip install -e ".[dev]"
77
+ pytest
78
+ ruff check .
79
+ ```
80
+
81
+ ## Why This Exists
82
+
83
+ LLM inference latency is not just one number.
84
+
85
+ A request can be slow because of first-token delay, slow decode, long prompts,
86
+ tail latency, rate limits, stream stalls, output bloat, or concurrency collapse.
87
+ Aggregate benchmark numbers can tell you that something changed. Inference
88
+ Autopsy is designed to help answer:
89
+
90
+ 1. How fast is this endpoint?
91
+ 2. Why is it slow?
92
+ 3. Can I reproduce the workload?
93
+ 4. Did my model or deployment regress?
94
+ 5. Can I explain the failure in one memorable line?
95
+
96
+ The wedge is:
97
+
98
+ ```txt
99
+ benchmark -> trace -> diagnose -> report -> replay -> regression gate
100
+ ```
101
+
102
+ Existing tools such as
103
+ [NVIDIA GenAI-Perf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html),
104
+ [GuideLLM](https://github.com/vllm-project/guidellm), and
105
+ [LLMPerf](https://github.com/ray-project/llmperf) measure important LLM serving
106
+ metrics. Inference Autopsy focuses on trace-level reproducibility, diagnosis,
107
+ human-readable reports, and CI gates for OpenAI-compatible endpoints.
108
+
109
+ ## What It Measures
110
+
111
+ Inference Autopsy focuses on externally visible inference behavior:
112
+
113
+ | Metric | Meaning |
114
+ | --- | --- |
115
+ | TTFT | Time from request start to first generated token |
116
+ | TTFB | Time from request start to first response byte |
117
+ | ITL | Inter-token latency between generated output tokens |
118
+ | Request latency | Time from request start to final token or response end |
119
+ | Output TPS | Generated output tokens per second |
120
+ | Stream stalls | Token gaps above a configurable threshold |
121
+ | Error rate | Failed requests divided by total requests |
122
+ | Timeout rate | Timed-out requests divided by total requests |
123
+ | Tail ratio | p99 latency divided by p50 latency |
124
+
125
+ Percentiles are first-class:
126
+
127
+ ```txt
128
+ p50, p90, p95, p99
129
+ ```
130
+
131
+ Because median latency is where demos look good. Tail latency is where systems
132
+ start telling the truth.
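
To make the percentile and tail-ratio definitions above concrete, here is a minimal sketch. It assumes the nearest-rank percentile method, which may not be the method the package uses, and the helper names `percentile` and `latency_summary` are hypothetical rather than part of the `autopsy` API.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Return p50/p90/p95/p99 plus the tail ratio (p99 divided by p50)."""
    summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 95, 99)}
    summary["tail_ratio"] = summary["p99"] / summary["p50"]
    return summary


# Synthetic data: 95 fast requests, 3 slow ones, 2 very slow ones.
latencies = [100.0] * 95 + [400.0] * 3 + [900.0] * 2
print(latency_summary(latencies))
```

In this synthetic sample p50, p90, and p95 all sit at 100 ms while p99 is 900 ms, so the median alone would hide a 9x tail ratio.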
133
+
134
+ ## Target CLI
135
+
136
+ ### Run a benchmark
137
+
138
+ ```bash
139
+ autopsy bench \
140
+ --base-url http://localhost:8000/v1 \
141
+ --model meta-llama/Llama-3.1-8B-Instruct \
142
+ --profile rag-long \
143
+ --concurrency 1,4,8,16 \
144
+ --max-requests 200 \
145
+ --output runs/rag_long_v1.jsonl
146
+ ```
147
+
148
+ ### Generate a report
149
+
150
+ ```bash
151
+ autopsy report \
152
+ runs/rag_long_v1.jsonl \
153
+ --html reports/rag_long_v1.html
154
+ ```
155
+
156
+ ### Diagnose a trace file
157
+
158
+ ```bash
159
+ autopsy diagnose runs/rag_long_v1.jsonl
160
+ ```
161
+
162
+ ### Compare two runs
163
+
164
+ ```bash
165
+ autopsy diff \
166
+ runs/baseline.jsonl \
167
+ runs/candidate.jsonl
168
+ ```
169
+
170
+ ### Fail CI on regression
171
+
172
+ ```bash
173
+ autopsy diff \
174
+ runs/baseline.jsonl \
175
+ runs/candidate.jsonl \
176
+ --fail-if "ttft_p95 > +20%" \
177
+ --fail-if "itl_p95 > +15%" \
178
+ --fail-if "error_rate > 1%"
179
+ ```
180
+
181
+ Gate examples:
182
+
183
+ ```txt
184
+ ttft_p95 > +20% # relative regression from baseline
185
+ latency_p95 > +500ms # absolute latency increase
186
+ error_rate > 1% # absolute candidate ceiling
187
+ error_rate > +2pp # percentage-point increase
188
+ tail_ratio > 3x # absolute ratio ceiling
189
+ output_tps_p50 < -15% # relative throughput drop
190
+ ```
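
The gate semantics listed above can be sketched as a small parser and evaluator. This is an illustration inferred only from the six examples, not the package's `autopsy/diff/gates.py` implementation. The names `Gate`, `parse_gate`, and `gate_failed` are hypothetical, and the sketch assumes rates are stored as fractions (0.018 means 1.8%) and latencies in milliseconds.

```python
import re
from dataclasses import dataclass

# metric, comparison operator, optional sign, number, optional unit
GATE_RE = re.compile(r"^\s*(\w+)\s*([<>])\s*([+-]?)(\d+(?:\.\d+)?)\s*(%|pp|ms|x)?\s*$")


@dataclass(frozen=True)
class Gate:
    metric: str
    op: str
    signed: bool  # True when the threshold has an explicit sign (change vs. baseline)
    value: float
    unit: str


def parse_gate(text):
    match = GATE_RE.match(text)
    if match is None:
        raise ValueError(f"invalid gate: {text!r}")
    metric, op, sign, number, unit = match.groups()
    value = float(number) * (-1.0 if sign == "-" else 1.0)
    return Gate(metric, op, bool(sign), value, unit or "")


def gate_failed(gate, baseline, candidate):
    """Return True when the candidate value trips the gate."""
    if gate.signed and gate.unit == "%":
        if baseline == 0:
            raise ValueError("relative gate needs a non-zero baseline")
        observed = (candidate - baseline) / baseline * 100.0  # relative change in percent
    elif gate.signed:
        observed = candidate - baseline  # absolute change (ms, or fraction for pp)
        if gate.unit == "pp":
            observed *= 100.0  # fraction -> percentage points
    else:
        # Unsigned threshold: absolute ceiling on the candidate value itself.
        observed = candidate * 100.0 if gate.unit == "%" else candidate
    return observed > gate.value if gate.op == ">" else observed < gate.value


print(gate_failed(parse_gate("ttft_p95 > +20%"), 840.0, 1210.0))
print(gate_failed(parse_gate("error_rate > +2pp"), 0.001, 0.018))
```

The key design point is that an explicit sign switches the gate from "ceiling on the candidate" to "change relative to the baseline", which is why `error_rate > 1%` and `error_rate > +2pp` can disagree on the same pair of runs.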
191
+
192
+ Exit codes:
193
+
194
+ ```txt
195
+ 0 all gates passed
196
+ 1 valid comparison, but one or more gates failed
197
+ 2 invalid gate, unreadable trace, or malformed input
198
+ ```
199
+
200
+ ### Replay a captured workload
201
+
202
+ ```bash
203
+ autopsy replay \
204
+ runs/baseline.jsonl \
205
+ --base-url http://localhost:11434/v1 \
206
+ --model qwen3:8b \
207
+ --mode shape \
208
+ --output runs/replay_ollama.jsonl
209
+ ```
210
+
211
+ Replay is privacy-aware:
212
+
213
+ ```txt
214
+ shape replay regenerates comparable prompts from saved workload metadata
215
+ exact replay requires recoverable full prompts and is refused for hash-only traces
216
+ ```
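
The two replay modes above could be sketched as follows. The field name `prompt` and the helper names are hypothetical, since the diff shown here does not include `autopsy/replay/replay.py`. Only `input_tokens_estimated` comes from the documented trace format, and treating one filler word as one token is a crude assumption.

```python
class ReplayRefused(Exception):
    """Raised when exact replay is requested but the trace holds no recoverable prompt."""


def build_replay_prompt(record, mode):
    if mode == "exact":
        prompt = record.get("prompt")  # hypothetical field; hash-only traces lack it
        if prompt is None:
            raise ReplayRefused("exact replay needs the full prompt; this trace is hash-only")
        return prompt
    if mode == "shape":
        # Regenerate a prompt of comparable size from saved metadata only.
        return " ".join(["lorem"] * record["input_tokens_estimated"])
    raise ValueError(f"unknown replay mode: {mode!r}")


hash_only = {"input_tokens_estimated": 8}
print(build_replay_prompt(hash_only, "shape"))
```

Refusing rather than silently falling back to shape replay keeps a run labelled "exact" from being compared against prompts that were never actually replayed.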
217
+
218
+ ## Example Output
219
+
220
+ ```txt
221
+ Regression detected.
222
+
223
+ Metric Baseline Candidate Change
224
+ TTFT p95 840ms 1210ms +44.0%
225
+ ITL p95 39ms 47ms +20.5%
226
+ Request p99 6.2s 9.8s +58.1%
227
+ Error rate 0.1% 1.8% +1.7pp
228
+ Output TPS 41.2 35.7 -13.3%
229
+
230
+ Failed gates:
231
+ - ttft_p95 > +20%
232
+ - error_rate > 1%
233
+
234
+ Cause of death:
235
+ Tail Wizard + Rate Limit MegaKnight
236
+ ```
237
+
238
+ ## Failure Arena
239
+
240
+ Each bad run gets a memorable diagnosis backed by hard metrics.
241
+
242
+ | Cause of death | Serious meaning |
243
+ | --- | --- |
244
+ | TTFT Pekka | First-token latency dominates request time |
245
+ | Decode Barbarian | Inter-token latency is high |
246
+ | Tail Wizard | p99 latency explodes while median looks fine |
247
+ | Context Golem | Long prompts crush prefill performance |
248
+ | Stream Wall Breaker | Streaming has large token gaps |
249
+ | Rate Limit MegaKnight | 429s, throttling, or retries dominate |
250
+ | Output Electro Dragon | Output length inflated unexpectedly |
251
+ | JSON Skeleton Army | Structured output mode causes failures or latency |
252
+ | Retry Witch | Hidden retries inflate latency |
253
+ | Queue Queen | Endpoint collapses under parallel load |
254
+
255
+ Example diagnosis:
256
+
257
+ ```txt
258
+ Cause of death: Context Golem
259
+ Severity: High
260
+
261
+ Evidence:
262
+ - TTFT p95 rises from 620ms at 512-token prompts to 3120ms at 8192-token prompts.
263
+ - ITL stays mostly flat.
264
+ - Request latency increase is concentrated before the first token.
265
+
266
+ Likely driver:
267
+ Long prompt prefill dominates latency.
268
+
269
+ Suggested next tests:
270
+ - Bucket prompts by input length.
271
+ - Compare with prefix caching if the backend supports it.
272
+ - Test prompt compression.
273
+ ```
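
The first suggested test, bucketing prompts by input length, can be sketched in a few lines. The function name `ttft_by_input_bucket` and the bucket edges are hypothetical, and the records are synthetic. Only the field names `input_tokens_estimated` and `metrics.ttft_ms` follow the documented trace format.

```python
from collections import defaultdict
from statistics import median


def ttft_by_input_bucket(records, edges=(1024, 4096)):
    """Group TTFT by prompt-size bucket and return the median TTFT per bucket."""
    buckets = defaultdict(list)
    for record in records:
        tokens = record["input_tokens_estimated"]
        # First edge the prompt fits under, else the overflow bucket.
        label = next((f"<={edge}" for edge in edges if tokens <= edge), f">{edges[-1]}")
        buckets[label].append(record["metrics"]["ttft_ms"])
    return {label: median(values) for label, values in buckets.items()}


records = [
    {"input_tokens_estimated": 512, "metrics": {"ttft_ms": 600}},
    {"input_tokens_estimated": 512, "metrics": {"ttft_ms": 640}},
    {"input_tokens_estimated": 8192, "metrics": {"ttft_ms": 3000}},
    {"input_tokens_estimated": 8192, "metrics": {"ttft_ms": 3240}},
]
print(ttft_by_input_bucket(records))
```

If TTFT climbs steeply across buckets while ITL stays flat, that pattern is consistent with the prefill-dominated diagnosis described above.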
274
+
275
+ ## Workload Profiles
276
+
277
+ Inference Autopsy uses workload profiles instead of one toy prompt. Different
278
+ workloads expose different bottlenecks.
279
+
280
+ Planned built-in profiles:
281
+
282
+ | Profile | Purpose |
283
+ | --- | --- |
284
+ | short-chat | Basic latency and endpoint overhead |
285
+ | rag-long | Long-context RAG-style TTFT sensitivity |
286
+ | code-completion | Long decode and output throughput |
287
+ | agent-json | JSON reliability and structured-output latency |
288
+ | long-context | Context-window and prefill stress |
289
+ | mixed-realistic | Blended production-like workload |
290
+
291
+ Example profile shape:
292
+
293
+ ```yaml
294
+ name: rag-long
295
+ description: Long-context RAG-style prompts with moderate outputs.
296
+
297
+ input_tokens:
298
+ distribution: bucket
299
+ values: [2000, 4000, 8000]
300
+ weights: [0.4, 0.4, 0.2]
301
+
302
+ output_tokens:
303
+ distribution: normal
304
+ mean: 256
305
+ std: 64
306
+ min: 64
307
+ max: 512
308
+
309
+ sampling:
310
+ temperature: 0.2
311
+ max_tokens: 512
312
+
313
+ messages:
314
+ system: "You answer questions using the provided context."
315
+ user_template: |
316
+ Context:
317
+ {{ generated_context }}
318
+
319
+ Question:
320
+ {{ generated_question }}
321
+ ```
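
As a rough sketch of how a profile like this could drive request generation, the snippet below samples one request shape from the `rag-long` distributions. The function name `sample_request_shape` is hypothetical, and clamping the normal draw to `[min, max]` is an assumption about how the bounds are applied.

```python
import random


def sample_request_shape(rng):
    """Draw (input_tokens, output_tokens) for one request from the rag-long profile."""
    # Bucket distribution: pick one prompt size according to the weights.
    input_tokens = rng.choices([2000, 4000, 8000], weights=[0.4, 0.4, 0.2], k=1)[0]
    # Normal distribution (mean 256, std 64), clamped to the profile's min and max.
    output_tokens = round(rng.gauss(256, 64))
    output_tokens = min(512, max(64, output_tokens))
    return input_tokens, output_tokens


rng = random.Random(0)  # fixed seed so a workload can be regenerated
shapes = [sample_request_shape(rng) for _ in range(5)]
print(shapes)
```

Seeding the generator is what lets the same workload shape be reproduced later without storing any prompt text.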
322
+
323
+ ## Trace Format
324
+
325
+ Each request is saved as one JSONL line.
326
+
327
+ ```json
328
+ {
329
+ "schema_version": "0.1",
330
+ "run_id": "run_2026_05_23_001",
331
+ "request_id": "req_00042",
332
+ "profile": "rag-long",
333
+ "model": "llama-3.1-8b",
334
+ "base_url_hash": "endpoint_a",
335
+ "input_tokens_estimated": 4096,
336
+ "output_tokens": 261,
337
+ "status": "success",
338
+ "timings_ms": {
339
+ "request_start": 0,
340
+ "first_byte": 817,
341
+ "first_token": 942,
342
+ "request_end": 7610
343
+ },
344
+ "token_times_ms": [942, 971, 1001, 1033, 1208],
345
+ "metrics": {
346
+ "ttft_ms": 942,
347
+ "request_latency_ms": 7610,
348
+ "itl_mean_ms": 28.4,
349
+ "itl_p95_ms": 71.2,
350
+ "output_tps": 38.6,
351
+ "stall_count": 2
352
+ },
353
+ "error": null
354
+ }
355
+ ```
356
+
357
+ JSONL is append-friendly, easy to inspect, easy to upload as a CI artifact, and
358
+ simple to process with Python, DuckDB, Polars, or shell tools.
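
For example, a single trace line can be processed with the standard library alone. The sketch below recomputes a few metrics from the raw timings. The helper name `derive_metrics` and the 100 ms stall threshold are assumptions, and because the sample `token_times_ms` above lists only five of the 261 token timestamps, the recomputed ITL and stall count differ from the stored `metrics` values.

```python
import json
from statistics import mean


def derive_metrics(record, stall_threshold_ms=100.0):
    """Recompute basic latency metrics from one trace record's raw timings."""
    timings = record["timings_ms"]
    tokens = record["token_times_ms"]
    # Inter-token gaps between consecutive token arrival times.
    gaps = [later - earlier for earlier, later in zip(tokens, tokens[1:])]
    return {
        "ttft_ms": timings["first_token"] - timings["request_start"],
        "request_latency_ms": timings["request_end"] - timings["request_start"],
        "itl_mean_ms": mean(gaps) if gaps else None,
        "stall_count": sum(1 for gap in gaps if gap > stall_threshold_ms),
    }


line = (
    '{"timings_ms": {"request_start": 0, "first_byte": 817, '
    '"first_token": 942, "request_end": 7610}, '
    '"token_times_ms": [942, 971, 1001, 1033, 1208]}'
)
print(derive_metrics(json.loads(line)))
```

Reading a whole file is the same call applied to each line, which is what makes the one-record-per-line layout convenient for streaming and partial reads.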
359
+
360
+ ## HTML Reports
361
+
362
+ The static report is the main demo artifact.
363
+
364
+ Implemented first-pass sections:
365
+
366
+ - Executive summary
367
+ - Diagnosis cards with evidence
368
+ - Summary metric cards
369
+ - Overall percentile table
370
+ - Static charts
371
+ - Profile breakdown
372
+ - Concurrency breakdown
373
+ - Cache summary
374
+ - Worst requests by latency and TTFT
375
+ - Methodology notes
376
+
377
+ The report is designed to be shareable without running a server.
378
+
379
+ ## CI Usage
380
+
381
+ Planned GitHub Actions workflow:
382
+
383
+ ```yaml
384
+ name: LLM Inference Regression Test
385
+
386
+ on:
387
+ pull_request:
388
+
389
+ jobs:
390
+ inference-autopsy:
391
+ runs-on: ubuntu-latest
392
+ steps:
393
+ - uses: actions/checkout@v4
394
+
395
+ - name: Install Inference Autopsy
396
+ run: pip install inference-autopsy
397
+
398
+ - name: Run benchmark
399
+ run: |
400
+ autopsy bench \
401
+ --base-url ${{ secrets.LLM_BASE_URL }} \
402
+ --api-key ${{ secrets.LLM_API_KEY }} \
403
+ --model ${{ vars.LLM_MODEL }} \
404
+ --profile short-chat \
405
+ --max-requests 50 \
406
+ --output candidate.jsonl
407
+
408
+ - name: Check regression
409
+ run: |
410
+ autopsy diff baseline.jsonl candidate.jsonl \
411
+ --fail-if "ttft_p95 > +20%" \
412
+ --fail-if "itl_p95 > +15%" \
413
+ --fail-if "error_rate > 1%"
414
+ ```
415
+
416
+ ## Compatibility Goal
417
+
418
+ Inference Autopsy targets OpenAI-compatible chat completion endpoints,
419
+ especially:
420
+
421
+ - vLLM OpenAI-compatible server
422
+ - Ollama OpenAI-compatible API
423
+ - LiteLLM proxy
424
+ - hosted OpenAI-compatible inference providers
425
+ - internal company deployments using OpenAI-style APIs
426
+
427
+
428
+ Focus:
429
+ - `/v1/chat/completions`
430
+ - `stream=true`
431
+ - `stream=false`
432
+ - request-level JSONL traces
433
+ - exact replay from saved prompts
434
+
435
+ ## Architecture
436
+
437
+ ```txt
438
+ Typer CLI
439
+ -> async workload runner
440
+ -> OpenAI-compatible HTTP client
441
+ -> streaming parser
442
+ -> JSONL trace recorder
443
+ -> metrics engine
444
+ -> diagnosis engine
445
+ -> report generator
446
+ -> diff and CI gate engine
447
+ -> replay engine
448
+ ```
449
+
450
+ Planned Python stack:
451
+
452
+ - Typer for the CLI
453
+ - httpx for async HTTP
454
+ - Pydantic for schemas
455
+ - orjson for fast JSON
456
+ - Rich for terminal output
457
+ - Polars or plain Python for metrics
458
+ - Standard-library HTML rendering for the first static report
459
+ - Optional Jinja2 and Plotly later when template or chart complexity justifies it
460
+ - pytest for tests
461
+
462
+
463
+ ## Limitations
464
+
465
+ Inference Autopsy is a black-box endpoint profiler. It identifies externally
466
+ visible symptoms and likely bottlenecks, not definitive backend internals.
467
+
468
+ It does not directly observe GPU kernel time, scheduler state, KV-cache pressure,
469
+ batching internals, or prefill/decode implementation details unless a backend
470
+ exposes those signals.
471
+
472
+ Other known limitations:
473
+
474
+ - Token counting may be approximate when providers do not return usage metadata.
475
+ - OpenAI-compatible streaming formats vary across servers.
476
+ - Hosted endpoint measurements include network and provider-side variance.
477
+ - Replay preserves workload shape and prompts, but it cannot guarantee deterministic model output.
478
+ - Static reports are not a replacement for production observability.
479
+
480
+ ## Benchmark Methodology
481
+
482
+ Reports should include:
483
+
484
+ - endpoint and model
485
+ - hardware or provider
486
+ - concurrency
487
+ - request count
488
+ - warmup policy
489
+ - timeout policy
490
+ - streaming mode
491
+ - token counting method
492
+ - retry policy
493
+ - prompt generation method
494
+ - percentile calculation method
495
+
496
+ No trace, no reproducibility.
497
+
498
+ ## Development
499
+
500
+ ```bash
501
+ python -m venv .venv
502
+ source .venv/bin/activate
503
+ pip install -e ".[dev]"
504
+ pytest
505
+ ruff check .
506
+ mypy autopsy
507
+ ```
508
+
509
+ ## Release
510
+
511
+ Public releases are designed to run through GitHub Actions:
512
+
513
+ 1. Push changes to `main`.
514
+ 2. Confirm the `CI` workflow passes.
515
+ 3. Confirm the `Deploy GitHub Pages` workflow publishes the docs site.
516
+ 4. Create a GitHub release such as `v0.1.0`.
517
+ 5. The `Publish to PyPI` workflow builds and publishes the package through PyPI Trusted Publishing.
518
+
519
+ The PyPI Trusted Publisher must be configured once in PyPI with:
520
+
521
+ ```txt
522
+ Repository owner: kaseyho
523
+ Repository name: Inference-Autopsy
524
+ Workflow name: publish-pypi.yml
525
+ Environment name: pypi
526
+ ```
527
+
528
+ ## License
529
+
530
+ MIT License. See [LICENSE](LICENSE).