stata-code 0.4.0__tar.gz → 0.6.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (57)
  1. {stata_code-0.4.0 → stata_code-0.6.0}/.gitignore +13 -0
  2. {stata_code-0.4.0 → stata_code-0.6.0}/CHANGELOG.md +146 -0
  3. {stata_code-0.4.0 → stata_code-0.6.0}/LICENSE-POLICY.md +4 -2
  4. stata_code-0.6.0/PKG-INFO +443 -0
  5. stata_code-0.6.0/README.md +405 -0
  6. {stata_code-0.4.0 → stata_code-0.6.0}/SCHEMA.md +44 -4
  7. {stata_code-0.4.0 → stata_code-0.6.0}/pyproject.toml +2 -2
  8. {stata_code-0.4.0 → stata_code-0.6.0}/schema/run_result.schema.json +67 -0
  9. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/__init__.py +1 -1
  10. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/core/_pool.py +56 -7
  11. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/core/_refs.py +12 -0
  12. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/core/_runtime.py +117 -2
  13. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/core/log_artifacts.py +7 -0
  14. stata_code-0.6.0/stata_code/core/notebook.py +962 -0
  15. stata_code-0.6.0/stata_code/core/run_index.py +319 -0
  16. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/core/runner.py +23 -1
  17. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/core/schema.py +17 -0
  18. stata_code-0.6.0/stata_code/kernel/assets/logo-32x32.png +0 -0
  19. stata_code-0.6.0/stata_code/kernel/assets/logo-64x64.png +0 -0
  20. stata_code-0.6.0/stata_code/kernel/assets/logo-svg.svg +41 -0
  21. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/kernel/kernel.py +54 -10
  22. stata_code-0.6.0/stata_code/mcp/server.py +2023 -0
  23. {stata_code-0.4.0 → stata_code-0.6.0}/tests/test_cancel.py +5 -9
  24. {stata_code-0.4.0 → stata_code-0.6.0}/tests/test_kernel.py +72 -1
  25. {stata_code-0.4.0 → stata_code-0.6.0}/tests/test_log_artifacts.py +30 -0
  26. stata_code-0.6.0/tests/test_mcp.py +561 -0
  27. stata_code-0.6.0/tests/test_notebook.py +301 -0
  28. stata_code-0.6.0/tests/test_notebook_phase2.py +479 -0
  29. stata_code-0.6.0/tests/test_run_index.py +351 -0
  30. {stata_code-0.4.0 → stata_code-0.6.0}/tests/test_runner.py +1 -0
  31. stata_code-0.6.0/tests/test_runtime_discovery.py +81 -0
  32. stata_code-0.4.0/PKG-INFO +0 -719
  33. stata_code-0.4.0/README.md +0 -681
  34. stata_code-0.4.0/stata_code/mcp/server.py +0 -416
  35. stata_code-0.4.0/tests/test_mcp.py +0 -254
  36. {stata_code-0.4.0 → stata_code-0.6.0}/LICENSE +0 -0
  37. {stata_code-0.4.0 → stata_code-0.6.0}/PUBLISHING.md +0 -0
  38. {stata_code-0.4.0 → stata_code-0.6.0}/docs/design/hard_timeout.md +0 -0
  39. {stata_code-0.4.0 → stata_code-0.6.0}/examples/01-basic-regression.md +0 -0
  40. {stata_code-0.4.0 → stata_code-0.6.0}/examples/02-did-card-krueger.md +0 -0
  41. {stata_code-0.4.0 → stata_code-0.6.0}/examples/03-graphs.md +0 -0
  42. {stata_code-0.4.0 → stata_code-0.6.0}/examples/04-multi-session.md +0 -0
  43. {stata_code-0.4.0 → stata_code-0.6.0}/examples/05-large-matrix.md +0 -0
  44. {stata_code-0.4.0 → stata_code-0.6.0}/examples/README.md +0 -0
  45. {stata_code-0.4.0 → stata_code-0.6.0}/scripts/export_schema.py +0 -0
  46. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/core/__init__.py +0 -0
  47. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/core/errors.py +0 -0
  48. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/kernel/__init__.py +0 -0
  49. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/kernel/__main__.py +0 -0
  50. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/mcp/__init__.py +0 -0
  51. {stata_code-0.4.0 → stata_code-0.6.0}/stata_code/mcp/__main__.py +0 -0
  52. {stata_code-0.4.0 → stata_code-0.6.0}/tests/__init__.py +0 -0
  53. {stata_code-0.4.0 → stata_code-0.6.0}/tests/fixtures/.gitkeep +0 -0
  54. {stata_code-0.4.0 → stata_code-0.6.0}/tests/test_errors.py +0 -0
  55. {stata_code-0.4.0 → stata_code-0.6.0}/tests/test_pool.py +0 -0
  56. {stata_code-0.4.0 → stata_code-0.6.0}/tests/test_schema.py +0 -0
  57. {stata_code-0.4.0 → stata_code-0.6.0}/tests/test_schema_artifact.py +0 -0
@@ -223,3 +223,16 @@ log-files/
223
223
  *.smcl
224
224
  *.dta
225
225
  !tests/fixtures/*.dta
226
+
227
+ # macOS
228
+ .DS_Store
229
+ **/.DS_Store
230
+
231
+ # VS Code workspace settings (contain user-machine absolute paths)
232
+ .vscode/
233
+
234
+ # Claude Code scratch (worktrees, transcripts)
235
+ .claude/
236
+
237
+ # Demo / scratch notebooks (real tests live under tests/)
238
+ demo-tests/*.ipynb
@@ -4,6 +4,152 @@ All notable changes to `stata-code` are documented here. The format follows
4
4
  [Keep a Changelog](https://keepachangelog.com/en/1.1.0/); the project adheres
5
5
  to semver-major.minor for the result schema (see `SCHEMA.md` §6).
6
6
 
7
+ ## Unreleased
8
+
9
+ ## [0.6.0] — 2026-05-08
10
+
11
+ ### Added
12
+
13
+ - **Notebook navigation tools.** New MCP tools `notebook_outline` and
14
+ `notebook_get_cell` let agents inspect a `.ipynb` without pulling the whole
15
+ file into context. `notebook_outline` returns a per-cell index (cell id,
16
+ type, source preview, line/char counts, execution_count, has-error flag);
17
+ `notebook_get_cell` returns one cell's full source plus a token-economic
18
+ outputs summary (head/tail of stream/text outputs, error ename/evalue,
19
+ truncated traceback, image presence flag). Cell identity follows nbformat
20
+ 4.5+; pre-4.5 cells get a synthesised `synth-<index>-<hash>` id flagged via
21
+ `id_synthesized`. Read-only.
22
+ - **Notebook search.** `notebook_locate` finds cells by literal `snippet`
23
+ (whitespace-tolerant fallback), `regex` (Python regex, multiline), or
24
+ pasted `error_text` (longest code-like line is used as a fingerprint).
25
+ Returns ranked candidates with `cell_id`, `line_in_cell`, and a small
26
+ preview. Read-only.
27
+ - **Atomic notebook edits.** New `notebook_edit_cell`,
28
+ `notebook_insert_cell`, and `notebook_delete_cell` mutate cells via a
29
+ temp-file + `os.replace` write so the on-disk `.ipynb` is never partially
30
+ written. Edits preserve `cell.id` and metadata; for code cells they clear
31
+ outputs and `execution_count`. Optional `expected_source` is an
32
+ optimistic-concurrency guard that fails with `edit_source_drift` /
33
+ `delete_source_drift` if the user changed the cell between the agent's
34
+ read and write. Insertion against a pre-4.5 synth id auto-upgrades the
35
+ anchor to a real UUID so its id stays valid after the index shift.
36
+ - **Origin echo on `RunResult`.** New optional `origin_cell_id` input
37
+ joins the existing `origin_path` / `origin_kind` / `origin_label` and
38
+ is round-tripped on `result.origin` plus the run-bundle manifest. The
39
+ execution path stays cell-agnostic: the runner does not interpret these
40
+ fields, only forwards them, so agents can correlate `stata_run` calls
41
+ with notebook cells without the protocol becoming notebook-aware.
42
+ - **Run-bundle index.** New MCP tool `list_runs` queries the on-disk
43
+ `manifest.json` files written by `persist_log_files=true` runs. Filters
44
+ compose: `cell_id`, `origin_path`, `session_id`, `ok`, `since` (ISO 8601
45
+ UTC, lexicographic compare, inclusive). Returns newest-first compact
46
+ summaries with `directory`, `manifest_path`, and `log_path` so callers
47
+ read the full manifest from disk if needed. Read-only.
48
+ - **Notebook MCP prompts.** New `run_notebook_cell_and_report` and
49
+ `fix_and_rerun_notebook_cell` prompts wire the per-cell repair recipe
50
+ (read → run with `origin_cell_id` → on failure, edit with
51
+ `expected_source` guard → rerun → recommend kernel restart after the
52
+ retry budget is exhausted) so users can `/mcp prompts` it directly.
53
+ - **Capability advertising.** `stata_info.capabilities` now lists
54
+ `notebook_navigation`, `notebook_search`, `notebook_edit`, `run_index`,
55
+ and `origin_echo` so clients can feature-detect the Phase 1-3 surface.
56
+ - **Schema-level mutex constraints.** `notebook_locate` and
57
+ `notebook_insert_cell` inputSchemas now use `oneOf` to express the
58
+ "exactly one of snippet / regex / error_text" and "exactly one anchor"
59
+ rules so strict MCP clients catch them before dispatch.
60
+ - **VSCode language layer.** The extension now ships Stata TextMate syntax
61
+ highlighting and language configuration, plus an Outline provider for
62
+ `**#` hierarchical sections and `program define` blocks.
63
+ - **VSCode section ergonomics.** `Stata: Run Current Section`,
64
+ `Cmd/Ctrl+Shift+Enter`, and `▶ Run Section` code lenses run from the
65
+ current heading to the next equal/higher heading. Existing `Cmd/Ctrl+Enter`
66
+ also runs a section when the cursor is on a section heading.
67
+ - **VSCode editing aids.** Added Stata command/function/variable completion,
68
+ configured custom command completions, conservative F2 variable rename
69
+ (skips line/block comments, string literals, and ``` `macro' ``` references),
70
+ `Stata: Open Help for Selection`, and `Stata: Insert Line Continuation`
71
+ for `///` blocks.
72
+ - **Runtime discovery.** `pystata` discovery now honors
73
+ `STATA_CODE_PYSTATA_PATH`, `PYSTATA_PATH`, `STATA_HOME`, `STATA_PATH`,
74
+ and `STATA_CLI`, and includes Stata 19 / StataNow default locations.
75
+ - **Capability advertising.** `stata_info` now lists `subprocess_timeout`
76
+ alongside the existing capability strings so clients can detect that the
77
+ server isolates Stata in a worker process and enforces hard wall-clock
78
+ timeouts.
79
+ - **Agent-native MCP surface.** The MCP server now advertises tool
80
+ `outputSchema` metadata, returns `structuredContent` alongside JSON text for
81
+ compatibility, exposes RunResult/log/graph/matrix/session resources, and
82
+ ships workflow prompts for validation, debugging, repair loops, replication
83
+ audits, and estimation summaries.
84
+
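Reviewer note on the "Atomic notebook edits" entry above: the temp-file plus `os.replace` write and the `expected_source` guard can be sketched roughly as follows. This is only an illustration of the general technique, not the code in `stata_code/core/notebook.py`; the function name `edit_cell_atomically` and the `SourceDrift` exception are hypothetical.

```python
import json
import os
import tempfile


class SourceDrift(Exception):
    """Hypothetical error: the cell changed between the agent's read and write."""


def edit_cell_atomically(path, cell_id, new_source, expected_source=None):
    """Replace one cell's source, writing the notebook via temp file + os.replace."""
    with open(path, encoding="utf-8") as fh:
        nb = json.load(fh)

    # Find the target cell by its nbformat 4.5+ id.
    for cell in nb["cells"]:
        if cell.get("id") == cell_id:
            break
    else:
        raise KeyError(cell_id)

    current = cell["source"]
    if isinstance(current, list):  # nbformat allows a list of lines
        current = "".join(current)
    # Optimistic-concurrency guard: refuse to overwrite a cell the user changed.
    if expected_source is not None and current != expected_source:
        raise SourceDrift(cell_id)

    cell["source"] = new_source  # id and metadata are left untouched
    if cell.get("cell_type") == "code":
        cell["outputs"] = []
        cell["execution_count"] = None

    # Write to a temp file in the same directory, then swap it into place,
    # so readers see either the old file or the new one, never a partial write.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(nb, fh, indent=1)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Creating the temp file in the notebook's own directory keeps the final `os.replace` on one filesystem, which is what makes the swap atomic.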
85
+ ### Changed
86
+
87
+ - **`stata_info` payload is richer.** It now returns a nested `stata`
88
+ object with version/edition/backend plus the supported capabilities list,
89
+ while retaining the older flat aliases for compatibility. Operational
90
+ failures (worker timeout / crash) now report `available: false` together
91
+ with an `error` field so callers can tell them apart from genuine
92
+ "Stata not installed".
93
+ - **Jupyter completions inspect live context.** The kernel now completes
94
+ variables from the last result's dataset and `do_inspect` reports variable
95
+ type/label metadata when available.
96
+ - **`_summarise_outputs` is now streaming.** A cell with many large stream
97
+ outputs no longer materialises the full concatenation in memory before
98
+ truncating to 4 KB; we accumulate `text_chars_total` as we go but stop
99
+ appending to `text_preview` once the budget is hit.
100
+
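Reviewer note on the "`_summarise_outputs` is now streaming" entry above: a minimal sketch of the idea, assuming the 4 KB budget is counted in characters. The function name and return shape are hypothetical; only the names `text_preview` and `text_chars_total` come from the changelog.

```python
def summarise_stream_outputs(chunks, budget=4096):
    """Return (text_preview, text_chars_total) without joining all chunks first."""
    preview_parts = []
    used = 0   # characters already placed in the preview
    total = 0  # characters seen across every chunk
    for chunk in chunks:
        total += len(chunk)
        remaining = budget - used
        if remaining > 0:
            piece = chunk[:remaining]
            preview_parts.append(piece)
            used += len(piece)
    return "".join(preview_parts), total
```

Memory for the preview stays bounded by `budget` no matter how many chunks arrive, while the total count still reflects the full output.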
101
+ ### Fixed
102
+
103
+ - **`_pool._utc_iso_ms` race across the second boundary.** The fallback
104
+ pool helper that builds `started_at` for synthetic timeout / crash
105
+ results called `datetime.now()` twice; if the two calls straddled a
106
+ second boundary it could produce timestamps like `T23:59:59.000Z`
107
+ (correct seconds, wrong milliseconds) and silently break lexicographic
108
+ compare in `list_runs`'s `since` filter. Captured `now` once.
109
+ - **`limit=True` accepted as `limit=1` in `list_runs`.** Python booleans
110
+ are a subclass of `int`; the `isinstance(limit, int)` check was passing
111
+ through `True` / `False`. Both `list_runs` and the MCP dispatcher now
112
+ reject `bool` explicitly with `limit_invalid`.
113
+
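Reviewer note on the two "Fixed" entries above: both bugs are easy to reproduce in plain Python. The sketch below shows one way to capture the clock once and to reject booleans before the integer check. The helper names are hypothetical, the `limit < 1` rule is an assumption, and raising `ValueError("limit_invalid")` merely stands in for however the package reports that error.

```python
from datetime import datetime, timezone


def utc_iso_ms(now=None):
    """Format one captured instant, e.g. 2026-05-08T23:59:59.123Z."""
    if now is None:
        now = datetime.now(timezone.utc)  # captured exactly once
    now = now.astimezone(timezone.utc)
    # Seconds and milliseconds come from the same instant, so they always agree.
    return now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"


def validate_limit(limit):
    """Reject bool explicitly, since isinstance(True, int) is True in Python."""
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
        raise ValueError("limit_invalid")
    return limit
```

Because the format is fixed-width and always UTC, plain string comparison orders these timestamps chronologically, which is the property a lexicographic `since` filter depends on.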
114
+ ## [0.5.0] — 2026-05-08
115
+
116
+ ### Added
117
+
118
+ - **Bundled Jupyter kernel logos.** `stata-code-kernel install --user`
119
+ now copies `stata_code/kernel/assets/{logo-32x32.png,logo-64x64.png,
120
+ logo-svg.svg}` into the kernelspec source dir before
121
+ `KernelSpecManager().install_kernel_spec` runs. VS Code's Jupyter
122
+ extension filters out kernelspecs that lack logo files, so prior
123
+ releases were invisible in its kernel picker; v0.5 fixes that without
124
+ affecting JupyterLab or classic Jupyter (which both already worked).
125
+ - **TestPyPI publishing step in `release.yml`.** Tag `v*` now publishes
126
+ to TestPyPI (via OIDC trusted publishing, environment `testpypi`)
127
+ before publishing to PyPI proper. `continue-on-error: true` keeps
128
+ PyPI + GitHub Release on the happy path even when TestPyPI is
129
+ misconfigured. Setup mirrors the PyPI trusted publisher and is
130
+ documented in [CLAUDE.md](CLAUDE.md).
131
+
132
+ ### Changed
133
+
134
+ - **`stata_run` tool description and README** clarify the boundary
135
+ between non-mutating execution and the optional agent "fix and
136
+ rerun" repair loop. The tool itself never rewrites your `.do` file
137
+ — but the submitted Stata code can still produce logs, graphs, and
138
+ output files as usual. Repair loops require explicit user opt-in;
139
+ failed runs are diagnostics first, not automatic rewrite permission.
140
+ - **VSCode MCP-client handshake version aligned to 0.5.0** (was a
141
+ stale 0.3.2 since the v0.3.2 release).
142
+
143
+ ### Fixed
144
+
145
+ - **`install_kernel` no longer `.resolve()`s `sys.executable`.** On
146
+ macOS Homebrew venvs (and other layouts that use a `python` symlink
147
+ outside the venv's `bin/` to a Cellar-style real interpreter),
148
+ resolving the symlink pointed Jupyter at an interpreter that
149
+ couldn't import `stata_code`. The kernelspec now keeps the
150
+ unresolved `sys.executable`, so the venv's `python` (with
151
+ `stata_code` on its `sys.path`) launches the kernel.
152
+
7
153
  ## [0.4.0] — 2026-05-07
8
154
 
9
155
  ### Added
@@ -35,10 +35,11 @@ MIT, BSD, Apache 2.0, ISC. Reading source is allowed; copying must follow the li
35
35
 
36
36
  - `kylebarron/stata-enhanced` — MIT (TextMate grammar; we do not reuse it).
37
37
  - `kylebarron/stata-exec` — MIT (Atom; not reused).
38
- - `kylebarron/language-stata` — MIT (Atom grammar; not reused).
38
+ - `kylebarron/language-stata` / Stata Enhanced grammar lineage — MIT (bundled in the VS Code extension with notice).
39
39
  - `hanlulong/stata-mcp` — MIT (we do not consult its source; see §4).
40
- - `lbraglia/RStata` — design reference only.
41
40
  - `euglevi/stata-language-server` — MIT.
41
+ - `ZihaoVistonWang/stata-outline` — MIT.
42
+ - `ZihaoVistonWang/stata-all-in-one` — MIT; incorporates Stata Enhanced grammar under MIT.
42
43
 
43
44
  ### 2.3 Copyleft projects (source code forbidden)
44
45
 
@@ -49,6 +50,7 @@ MIT, BSD, Apache 2.0, ISC. Reading source is allowed; copying must follow the li
49
50
  - `tmonk/stata-workbench` — AGPL-3.0
50
51
  - `kylebarron/stata_kernel` — GPL-3.0
51
52
  - `hugetim/nbstata` — GPL-3.0
53
+ - `lbraglia/RStata` — GPL-3.0
52
54
 
53
55
  If new copyleft Stata projects appear, add them here in the same PR that first references them.
54
56
 
@@ -0,0 +1,443 @@
1
+ Metadata-Version: 2.4
2
+ Name: stata-code
3
+ Version: 0.6.0
4
+ Summary: Agent-native Stata bridge — one core, multiple frontends (MCP, Jupyter, VSCode)
5
+ Project-URL: Homepage, https://github.com/brycewang-stanford/stata-code
6
+ Project-URL: Repository, https://github.com/brycewang-stanford/stata-code
7
+ Project-URL: Issues, https://github.com/brycewang-stanford/stata-code/issues
8
+ Project-URL: Changelog, https://github.com/brycewang-stanford/stata-code/blob/main/CHANGELOG.md
9
+ Author-email: Bryce Wang <brycewang@stanford.edu>
10
+ License-Expression: MIT
11
+ License-File: LICENSE
12
+ License-File: LICENSE-POLICY.md
13
+ Keywords: causal-inference,jupyter,mcp,pystata,stata,vscode
14
+ Classifier: Development Status :: 3 - Alpha
15
+ Classifier: Intended Audience :: Developers
16
+ Classifier: License :: OSI Approved :: MIT License
17
+ Classifier: Programming Language :: Python :: 3
18
+ Classifier: Programming Language :: Python :: 3.10
19
+ Classifier: Programming Language :: Python :: 3.11
20
+ Classifier: Programming Language :: Python :: 3.12
21
+ Classifier: Topic :: Scientific/Engineering :: Mathematics
22
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
23
+ Requires-Python: >=3.10
24
+ Requires-Dist: pydantic>=2.0
25
+ Provides-Extra: all
26
+ Requires-Dist: ipykernel>=6.0; extra == 'all'
27
+ Requires-Dist: mcp>=1.27; extra == 'all'
28
+ Provides-Extra: dev
29
+ Requires-Dist: mypy>=1.8; extra == 'dev'
30
+ Requires-Dist: pytest-cov>=4.0; extra == 'dev'
31
+ Requires-Dist: pytest>=8.0; extra == 'dev'
32
+ Requires-Dist: ruff>=0.4.0; extra == 'dev'
33
+ Provides-Extra: kernel
34
+ Requires-Dist: ipykernel>=6.0; extra == 'kernel'
35
+ Provides-Extra: mcp
36
+ Requires-Dist: mcp>=1.27; extra == 'mcp'
37
+ Description-Content-Type: text/markdown
38
+
39
+ <p align="center">
40
+ <img src="branding/logo/horizontal@1024.png" alt="stata-code logo" width="520" />
41
+ </p>
42
+
43
+ <p align="center">
44
+ <a href="README.md"><strong>English</strong></a> | <a href="README.zh.md">中文</a>
45
+ </p>
46
+
47
+ # stata-code
48
+
49
+ [![PyPI](https://img.shields.io/pypi/v/stata-code.svg)](https://pypi.org/project/stata-code/)
50
+ [![Python](https://img.shields.io/pypi/pyversions/stata-code.svg)](https://pypi.org/project/stata-code/)
51
+ [![License](https://img.shields.io/pypi/l/stata-code.svg)](https://github.com/brycewang-stanford/stata-code/blob/main/LICENSE)
52
+ [![CI](https://img.shields.io/github/actions/workflow/status/brycewang-stanford/stata-code/test.yml?branch=main&label=tests)](https://github.com/brycewang-stanford/stata-code/actions/workflows/test.yml)
53
+ [![Downloads](https://static.pepy.tech/badge/stata-code/month)](https://pepy.tech/project/stata-code)
54
+ [![VS Code](https://img.shields.io/visual-studio-marketplace/v/brycewang-stanford.stata-code-vscode.svg?label=vscode)](https://marketplace.visualstudio.com/items?itemName=brycewang-stanford.stata-code-vscode)
55
+ [![VS Code Installs](https://img.shields.io/visual-studio-marketplace/i/brycewang-stanford.stata-code-vscode.svg)](https://marketplace.visualstudio.com/items?itemName=brycewang-stanford.stata-code-vscode)
56
+ [![VS Code Downloads](https://img.shields.io/visual-studio-marketplace/d/brycewang-stanford.stata-code-vscode.svg?label=vscode%20downloads)](https://marketplace.visualstudio.com/items?itemName=brycewang-stanford.stata-code-vscode)
57
+ [![Rating](https://img.shields.io/visual-studio-marketplace/r/brycewang-stanford.stata-code-vscode.svg)](https://marketplace.visualstudio.com/items?itemName=brycewang-stanford.stata-code-vscode)
58
+ [![GitHub release](https://img.shields.io/github/v/release/brycewang-stanford/stata-code)](https://github.com/brycewang-stanford/stata-code/releases)
59
+ [![GitHub stars](https://img.shields.io/github/stars/brycewang-stanford/stata-code?style=social)](https://github.com/brycewang-stanford/stata-code)
60
+
61
+ <p align="center">
62
+ <img src="branding/github-instructions.png" alt="stata-code: agent-native Stata bridge — one Python core, multiple frontends (Jupyter kernel, MCP server, VS Code extension)" width="720" />
63
+ </p>
64
+
65
+ > Agent-native Stata bridge — **one Python core, multiple frontends**.
66
+
67
+ `stata-code` lets you drive Stata from modern environments: an LLM agent (Claude Code, Cursor, Claude Desktop), a Jupyter notebook, or a VS Code editor session. All frontends share one Python core and return a stable, structured, **agent-friendly** result schema.
68
+
69
+ ```text
70
+ ┌────────────────────────────────────────┐
71
+ │ stata-code core (Python) │
72
+ │ │
73
+ │ • pystata adapter (Stata 17+) │
74
+ │ • v1.0 unified result schema │
75
+ │ • token-economy defaults │
76
+ │ • multi-session via Stata frames │
77
+ │ • typed errors + suggestions │
78
+ └────────────────────────────────────────┘
79
+ ↑ ↑ ↑
80
+ ┌────────┴────┐ ┌──────┴─────┐ ┌────┴────────────┐
81
+ │ Jupyter │ │ MCP │ │ VS Code │
82
+ │ kernel │ │ server │ │ extension │
83
+ └─────────────┘ └────────────┘ └─────────────────┘
84
+ ```
85
+
86
+ **Status: v0.6 (May 2026)** — the core, MCP server, Jupyter kernel, and VS Code extension work end-to-end against Stata 18 MP. Current test suite: 310 passing tests across schema, runner, MCP, kernel, notebook, and run-index modules. License: **MIT**.
87
+
88
+ Two workflows v0.6 explicitly supports for end users:
89
+
90
+ - **Run Stata code from a Jupyter notebook.** `pip install "stata-code[kernel]"` + `stata-code-kernel install --user` registers a **Stata** kernel that the Jupyter Notebook UI, JupyterLab, and the VS Code Jupyter extension all pick up by name. Cells render Stata logs, graphs, and warnings inline (the kernel logo bundled since v0.5 makes it appear in VS Code's kernel picker too). See [As a Jupyter Kernel](#as-a-jupyter-kernel).
91
+ - **Optional agent "fix and rerun" loop.** `stata_run` returns typed `error.kind/line/context` plus `suggestions` on every failure. By default Claude Code only reports diagnostics — but if you explicitly say "fix this and rerun until it passes", the agent uses the same fields to edit your `.do` file and re-call `stata_run` until the run is green. The repair loop is **opt-in**: failed runs are diagnostics first, not automatic rewrite permission. See [Error Recovery in Agent Workflows](#error-recovery-in-agent-workflows).
92
+
93
+ ---
94
+
95
+ ## Why this exists
96
+
97
+ The Stata AI / agent tooling landscape is fragmented; see [References-tools.md](References-tools.md):
98
+
99
+ - Existing MCP servers ([SepineTam/stata-mcp](https://github.com/sepinetam/stata-mcp), [tmonk/mcp-stata](https://github.com/tmonk/mcp-stata)) are **AGPL-3.0**, which is not a fit for closed-source or commercial integration.
100
+ - The popular VS Code AI extension ([hanlulong/stata-mcp](https://github.com/hanlulong/stata-mcp)) is MIT, but it bundles the MCP server inside the extension, making standalone reuse awkward.
101
+ - Each tool wraps `pystata` with its own result shape, so agents have to special-case each integration.
102
+ - Many existing tools were designed for humans first and then bolted onto MCP; they often dump long logs and base64 graph blobs into every reply, burning tokens by default.
103
+
104
+ `stata-code` is designed to fill that gap:
105
+
106
+ 1. **MIT-licensed**, with no copyleft contagion.
107
+ 2. One shared result schema for every frontend: [SCHEMA.md](SCHEMA.md).
108
+ 3. Agent-native by default: typed errors, structured `r()` / `e()`, log refs, graph refs, and suggestion seeds.
109
+ 4. One core, multiple frontends: Jupyter kernel, MCP server, and VS Code extension.
110
+
111
+ For the project's clean-room policy around AGPL/GPL Stata projects, see [LICENSE-POLICY.md](LICENSE-POLICY.md).
112
+
113
+ ---
114
+
115
+ ## Install
116
+
117
+ Requirements: **Stata 17+** (with `pystata` shipped by Stata) and **Python 3.10+**.
118
+
119
+ ```bash
120
+ # from PyPI
121
+ pip install stata-code
122
+
123
+ # with the MCP server and Jupyter kernel extras
124
+ pip install "stata-code[mcp,kernel]"
125
+
126
+ # or from source (editable install for development)
127
+ git clone https://github.com/brycewang-stanford/stata-code.git
128
+ cd stata-code
129
+ pip install -e ".[mcp,kernel]"
130
+ ```
131
+
132
+ > **Naming note.** The PyPI distribution is `stata-code` (hyphen), but
133
+ > the Python import is `stata_code` (underscore — Python identifiers
134
+ > can't contain hyphens). Same convention as `scikit-learn` →
135
+ > `import sklearn`. So: `pip install stata-code`,
136
+ > `from stata_code import run`.
137
+
138
+ Note: `pystata` is **not** on PyPI; it ships with Stata. `stata-code` auto-discovers it on macOS at `/Applications/Stata/utilities/pystata` and at equivalent Linux / Windows paths. If your install is elsewhere, add it to `PYTHONPATH` before importing.
139
+
140
+ ---
141
+
142
+ ## Quick Start
143
+
144
+ See [`examples/`](examples/) for end-to-end cookbook entries: basic regression, DiD, graphs, multi-session, and large matrices.
145
+
146
+ ### As a Python Library
147
+
148
+ ```python
149
+ from stata_code import run
150
+
151
+ r = run("sysuse auto, clear")
152
+ r = run("regress mpg weight")
153
+
154
+ if r.ok:
155
+     print(r.results.e.scalars["r2"]) # 0.6515 (native float)
156
+     print(r.results.e.macros["cmd"]) # "regress"
157
+     b = r.results.e.matrices["b"]
158
+     print(dict(zip(b.cols, b.values[0]))) # {"weight": -0.006, "_cons": 39.44}
159
+ else:
160
+     print(r.error.kind, r.error.message) # ErrorKind.VARNAME_NOT_FOUND, "..."
161
+     for s in r.error.suggestions:
162
+         print("hint:", s.action) # "Did you mean `mpg`?"
163
+ ```
164
+
165
+ ### As an MCP Server
166
+
167
+ After `pip install "stata-code[mcp]"`, the `stata-code-mcp` binary is on your `PATH`. You can wire it into Claude Code, Cursor, Claude Desktop, or any other MCP-compatible client.
168
+
169
+ #### Claude Code via `claude mcp add` (recommended)
170
+
171
+ If you have not installed Claude Code yet, see [anthropics/claude-code](https://github.com/anthropics/claude-code).
172
+
173
+ The fastest way is the `claude mcp add` CLI. Pick a scope based on how widely you want `stata-code` available:
174
+
175
+ ```bash
176
+ # user scope — install once, available in every Claude Code workspace on this machine
177
+ claude mcp add stata-code --scope user -- stata-code-mcp
178
+
179
+ # local scope — only for the current workspace (your local Claude config, not committed)
180
+ claude mcp add stata-code --scope local -- stata-code-mcp
181
+
182
+ # project scope — written into ./.mcp.json so collaborators on this repo share it
183
+ claude mcp add stata-code --scope project -- stata-code-mcp
184
+ ```
185
+
186
+ Then launch `claude` and type `/mcp` to confirm `stata-code` shows up with its 15 tools (`stata_run`, `stata_info`, `get_log`, `get_graph`, `get_matrix`, `list_sessions`, `cancel_session`, `reset_session`, `notebook_outline`, `notebook_get_cell`, `notebook_locate`, `notebook_edit_cell`, `notebook_insert_cell`, `notebook_delete_cell`, `list_runs`).
187
+
188
+ #### Error Recovery in Agent Workflows
189
+
190
+ `stata_run` does not rewrite the source `.do` file or change code on its own. It executes the submitted Stata code, so that code may still create logs, graphs, tables, or other outputs as usual. When Stata fails, `stata_run` returns typed diagnostics (`error.kind`, `error.message`, `error.line`, `error.context`) plus best-effort `suggestions`. That supports two distinct Claude Code workflows:
191
+
192
+ - For "run this do-file" or "verify this code", Claude can report the failure and suggested next steps without changing source files.
193
+ - For "fix this and rerun until it passes", Claude can use the same structured error fields to edit the `.do` file, call `stata_run` again, and iterate.
194
+
195
+ If you want the repair loop, say so explicitly. Otherwise, treat failed runs as diagnostics first, not as automatic permission to rewrite code.
196
+
197
+ #### `uvx` (no global pip install)
198
+
199
+ If you prefer not to `pip install stata-code` globally, run it ephemerally through [`uv`](https://github.com/astral-sh/uv):
200
+
201
+ ```bash
202
+ claude mcp add stata-code --scope user -- uvx --from stata-code stata-code-mcp
203
+ ```
204
+
205
+ `uvx` will resolve and cache `stata-code` on first launch. Note: `pystata` is **not** on PyPI, so it still has to be locatable on the host. The runner adds the standard Stata install path (e.g. `/Applications/Stata/utilities/pystata` on macOS) to `sys.path` automatically; if your Stata lives elsewhere, set `PYTHONPATH` in the env block.
206
+
207
+ #### Manual JSON config (Cursor / Claude Desktop / fallback)
208
+
209
+ For clients without a `mcp add` CLI, edit the config file directly (`~/.claude/mcp.json`, Cursor settings, Claude Desktop `claude_desktop_config.json`, etc.):
210
+
211
+ ```json
212
+ {
213
+   "mcpServers": {
214
+     "stata-code": {
215
+       "command": "stata-code-mcp"
216
+     }
217
+   }
218
+ }
219
+ ```
220
+
221
+ Or run it as a module if the binary is not on `PATH`:
222
+
223
+ ```bash
224
+ python -m stata_code.mcp
225
+ ```
226
+
227
+ The MCP server registers 15 tools:
228
+
229
+ | Tool | Purpose |
230
+ | --- | --- |
231
+ | `stata_run` | Execute Stata code and return a v1.0 RunResult JSON |
232
+ | `stata_info` | Report Stata edition, version, and capabilities |
233
+ | `get_log` | Fetch the full log behind a `log://` ref |
234
+ | `get_graph` | Fetch graph bytes behind a `graph://` ref (`ImageContent`) |
235
+ | `get_matrix` | Fetch matrix payloads behind a `matrix://` ref |
236
+ | `list_sessions` | Enumerate live sessions |
237
+ | `cancel_session` | Cooperatively cancel the next `stata_run` for a session |
238
+ | `reset_session` | Drop a session's data |
239
+ | `notebook_outline` | Compact per-cell index of a `.ipynb` (cell_id, type, preview) |
240
+ | `notebook_get_cell` | One cell's full source plus a token-economic outputs summary |
241
+ | `notebook_locate` | Find cells by snippet / regex / pasted error text |
242
+ | `notebook_edit_cell` | Atomically replace one cell's source (preserves id, clears outputs) |
243
+ | `notebook_insert_cell` | Insert a new cell with a fresh nbformat 4.5+ UUID |
244
+ | `notebook_delete_cell` | Remove a cell by id |
245
+ | `list_runs` | Query run-bundle manifests (filter by notebook / cell_id / session / since / ok) |
246
+
247
+ For modern MCP clients, these tools now return structured results through
248
+ `structuredContent` with `outputSchema` metadata, while still keeping the
249
+ serialized JSON text block for older clients. The server also exposes MCP
250
+ resources:
251
+
252
+ | Resource | Purpose |
253
+ | --- | --- |
254
+ | `stata://schema/run-result` | JSON Schema for `stata_run` structured output |
255
+ | `stata://server/capabilities` | Server instructions, tools, and resource templates |
256
+ | `stata://sessions` | Current subprocess-backed Stata sessions |
257
+ | `log://...` | Full log text from a truncated `stata_run` result |
258
+ | `graph://...` | Captured graph image bytes |
259
+ | `matrix://...` | Deferred large matrix payloads |
260
+
261
+ MCP prompts are available for common agent workflows:
262
+ `run_do_file_and_report`, `debug_stata_error`,
263
+ `fix_and_rerun_until_passes`, `replication_audit`, and
264
+ `summarize_estimation_results`.
265
+
266
+ ### As a Jupyter Kernel
267
+
268
+ `stata-code` ships a Jupyter kernel as part of the Python package — there is no separate "Jupyter plugin" in the JupyterLab extension marketplace. Installation is two steps: `pip install` the package with the `kernel` extra, then register the kernelspec with Jupyter.
269
+
270
+ **Prerequisites**: Stata 17+ installed locally with a valid license (the kernel calls Stata via `pystata`), and Python 3.10+ with `jupyter`/`jupyterlab` already on the same environment.
271
+
272
+ ```bash
273
+ # 1. Install stata-code with the kernel extra (pulls in ipykernel)
274
+ pip install "stata-code[kernel]"
275
+
276
+ # 2. Register the kernelspec into Jupyter's user data dir
277
+ stata-code-kernel install --user
278
+ # Or, equivalently:
279
+ # python -m stata_code.kernel install --user
280
+ ```
281
+
282
+ Verify the kernel is registered:
283
+
284
+ ```bash
285
+ jupyter kernelspec list
286
+ # should include an entry named `stata`
287
+ ```
288
+
289
+ Then open Jupyter Notebook / JupyterLab (or a `.ipynb` in VS Code), pick **Stata** in the kernel selector, and run Stata commands in cells. Logs, graphs, and warnings render inline.
290
+
291
+ > JupyterLab's Extension Manager only installs front-end JS extensions, so it cannot install a kernel — `pip install` plus the `install --user` step above is the only supported path.
292
+
293
+ ### As a VS Code Extension
294
+
295
+ The companion extension is on the Marketplace as [`brycewang-stanford.stata-code-vscode`](https://marketplace.visualstudio.com/items?itemName=brycewang-stanford.stata-code-vscode). It spawns `stata-code-mcp` as a child process and adds syntax highlighting, an Outline view for `**#` sections and `program define` blocks, code-lens "Run cell" and "Run section" actions on `.do` files, a sidebar (sessions / last result / run history / logs / graphs), status-bar indicators, completions, help lookup, conservative variable rename, and inline diagnostics from the v1.0 typed errors.
296
+
297
+ ```bash
+ # from the VS Code CLI
+ code --install-extension brycewang-stanford.stata-code-vscode
+ ```
301
+
302
+ Or open the **Extensions** sidebar in VS Code and search `stata-code`.
303
+
304
+ The extension still requires the MCP extra on your system Python (`pip install "stata-code[mcp]"`), so that `stata-code-mcp` resolves on `PATH` and can import the MCP SDK. Stata 17+ and a valid Stata license are required as for any other frontend.
305
+
306
+ ---
307
+
308
+ ## Token-Economy Defaults
309
+
310
+ A typical `stata_run` response is about **10x smaller** than the equivalent response from servers that dump logs and images directly. Three design choices drive this:
311
+
312
+ 1. **Logs return `head` + `tail` + `ref`** by default. Full logs are fetched on demand via `get_log(ref)`. A Stata regression log can be about 6,000 tokens; `stata-code` returns about 600 by default.
313
+ 2. **Graphs return refs, not inline base64**. A 30 KB PNG can become about 50,000 base64 tokens; returning a ref avoids that unless the agent actually needs the bytes.
314
+ 3. **Errors are typed**. Agents can check `err.kind == "varname_not_found"` instead of regex-parsing English logs.
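The first two choices share one pattern: keep the full payload server-side and hand back a short handle. The following is a minimal sketch of that pattern, not the package's actual implementation. `RefStore`, `summarize_log`, the `truncated` field, and the `log-…` ref format are hypothetical names chosen for illustration.

```python
from collections import OrderedDict
from uuid import uuid4


class RefStore:
    """Tiny LRU store (hypothetical): keeps at most `capacity` payloads."""

    def __init__(self, capacity: int = 64) -> None:
        self.capacity = capacity
        self._items: OrderedDict = OrderedDict()

    def put(self, payload: str) -> str:
        ref = f"log-{uuid4().hex[:8]}"
        self._items[ref] = payload
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
        return ref

    def get(self, ref: str) -> str:
        self._items.move_to_end(ref)  # mark as recently used; KeyError if evicted
        return self._items[ref]


def summarize_log(log: str, store: RefStore, head: int = 5, tail: int = 5) -> dict:
    """Return the first and last lines of a log plus a ref to the full text."""
    lines = log.splitlines()
    ref = store.put(log)
    if len(lines) <= head + tail:
        return {"head": lines, "tail": [], "ref": ref, "truncated": False}
    return {"head": lines[:head], "tail": lines[-tail:], "ref": ref, "truncated": True}


store = RefStore()
full_log = "\n".join(f"line {i}" for i in range(1, 101))
summary = summarize_log(full_log, store)
print(summary["head"][0], "...", summary["tail"][-1])  # prints: line 1 ... line 100
```

The caller pays for ten lines up front and fetches the remaining ninety through the ref only when it needs them.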
315
+
316
+ For example, a misspelled variable returns a structured error:
317
+
318
+ ```json
+ {
+   "ok": false,
+   "rc": 111,
+   "error": {
+     "kind": "varname_not_found",
+     "varname": "mpgg",
+     "line": 3,
+     "context": {
+       "before": ["use auto"],
+       "failing": "summarize mpgg",
+       "after": []
+     },
+     "suggestions": [
+       {"action": "Did you mean `mpg`?", "command": "describe"}
+     ]
+   }
+ }
+ ```
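An agent can branch on this payload without reading the log. The sketch below parses the error shown above and picks a follow-up. The `next_step` function and its policy of running the first suggested command are assumptions for illustration, not part of `stata-code`.

```python
import json

# The structured error shown above, as an agent might receive it.
response = json.loads("""
{
  "ok": false,
  "rc": 111,
  "error": {
    "kind": "varname_not_found",
    "varname": "mpgg",
    "line": 3,
    "context": {"before": ["use auto"], "failing": "summarize mpgg", "after": []},
    "suggestions": [{"action": "Did you mean `mpg`?", "command": "describe"}]
  }
}
""")


def next_step(result: dict) -> str:
    """Pick a follow-up from the typed error instead of regex-parsing log text."""
    if result["ok"]:
        return "continue"
    err = result["error"]
    if err["kind"] == "varname_not_found":
        suggestions = err.get("suggestions", [])
        # Hypothetical policy: run the first suggested command to inspect the data.
        if suggestions:
            return f"run: {suggestions[0]['command']}"
        return f"ask user about variable {err['varname']!r}"
    return f"report unhandled error kind {err['kind']!r} (rc={result['rc']})"


print(next_step(response))  # prints: run: describe
```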
337
+
338
+ The full schema is in [SCHEMA.md](SCHEMA.md).
339
+
340
+ ---
341
+
342
+ ## Architecture
343
+
344
+ ```text
+ stata_code/
+ ├── core/
+ │   ├── _runtime.py   # process-singleton pystata wrapper
+ │   ├── _refs.py      # LRU ref store for log/graph/matrix payloads
+ │   ├── schema.py     # Pydantic v2 models for the v1.0 result schema
+ │   ├── errors.py     # rc → ErrorKind mapping + suggestion seeds
+ │   └── runner.py     # the one execute(); collects everything via sfi
+ ├── mcp/
+ │   └── server.py     # MCP server (15 tools)
+ └── kernel/
+     └── kernel.py     # Jupyter kernel
+ ```
357
+
358
+ `runner.py` is the only place that touches Stata. The Jupyter kernel and MCP server both import from it and only translate results into their own transports.
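The layering can be pictured as one execution function with two thin adapters. This is a rough sketch under stated assumptions: `RunResult` here is a three-field stand-in rather than the real schema, `execute` fakes its result so the example runs without Stata, and both adapter names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical stand-in for the shared result object; not the real schema."""
    ok: bool
    rc: int
    log_head: list = field(default_factory=list)


def execute(code: str) -> RunResult:
    """Stand-in for the single runner entry point; the real one would call Stata."""
    # Fake behaviour so the sketch runs without Stata installed.
    failed = "mpgg" in code
    return RunResult(ok=not failed, rc=111 if failed else 0, log_head=[f". {code}"])


def to_mcp_payload(result: RunResult) -> dict:
    """MCP-style adapter (hypothetical): a JSON-serializable dict."""
    return {"ok": result.ok, "rc": result.rc, "log": {"head": result.log_head}}


def to_kernel_reply(result: RunResult) -> dict:
    """Kernel-style adapter (hypothetical): display text plus an execution status."""
    return {"status": "ok" if result.ok else "error", "text": "\n".join(result.log_head)}


result = execute("summarize mpg")
print(to_mcp_payload(result))
print(to_kernel_reply(result))
```

Because neither adapter runs code itself, a fix in the runner reaches both frontends at once.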
359
+
360
+ ---
361
+
362
+ ## Comparison
363
+
364
+ | | stata-code | SepineTam/stata-mcp | hanlulong/stata-mcp | nbstata |
+ | --- | --- | --- | --- | --- |
+ | License | **MIT** | AGPL-3.0 | MIT | GPL-3.0 |
+ | Standalone MCP | ✓ | ✓ | bundled with VS Code | — |
+ | Jupyter kernel | ✓ | — | — | ✓ |
+ | Unified result schema | ✓ ([SCHEMA.md](SCHEMA.md)) | per-tool | per-tool | per-tool |
+ | Token-economy defaults | ✓ (log refs, graph refs) | — | — | — |
+ | Typed errors + suggestions | ✓ (32 kinds) | — | — | — |
+ | Multi-session | ✓ (Stata frames) | partial | — | — |
+ | Mature ecosystem | early | ✓ (statamcp.com, cookbook) | ✓ (11k installs) | ✓ |
374
+
375
+ `stata-code` is a younger, MIT-licensed, agent-native alternative in this problem space. SepineTam's AGPL-licensed `stata-mcp` is currently more mature; `stata-code` is aimed at cases where copyleft contagion is unacceptable and agents need structured results.
376
+
377
+ ---
378
+
379
+ ## Roadmap
380
+
381
+ ### Done (through v0.6 — May 2026)
382
+
383
+ - v1.0 result schema ([SCHEMA.md](SCHEMA.md))
384
+ - `pystata`-based runner with native-typed `r()`, `e()`, and matrices
385
+ - Multi-session via Stata frames
386
+ - Per-line error attribution: line number, context, commands_executed
387
+ - Graph capture: `png` / `svg` / `pdf` with ref store
388
+ - Log truncation with ref store
389
+ - Warning extraction: 5 categories + generic notes
390
+ - 32-kind error taxonomy with canonical suggestions
391
+ - MCP server: 15 tools, including notebook navigation / search / atomic edits and the run-bundle index (`list_runs`)
392
+ - Jupyter kernel: rewired to the v1.0 pipeline, kernel logos bundled
393
+ - Matrix size cap + `get_matrix(ref)` for large matrices (>10k cells)
394
+ - Cooperative cancellation: `cancel(session_id)` / MCP `cancel_session`
395
+ - Per-cell repair loop on `.ipynb` via `notebook_outline` / `notebook_get_cell` / `notebook_edit_cell` with optimistic-concurrency `expected_source` guards and `origin_cell_id` echo on `RunResult`
396
+ - Persistent run bundles + `list_runs` query over `manifest.json` files (filter by cell / origin / session / since / ok)
397
+ - JSON Schema artifact auto-generated from `schema.py`: [`schema/run_result.schema.json`](schema/run_result.schema.json)
398
+ - VS Code extension published to the Marketplace as [`brycewang-stanford.stata-code-vscode`](https://marketplace.visualstudio.com/items?itemName=brycewang-stanford.stata-code-vscode): syntax highlighting, section outline/navigation, code-lens cell and section runners, sidebar (sessions / last result / run history / logs / graphs), status bar, completions, conservative variable rename, diagnostics, MCP child-process spawn
399
+ - Clean-room license policy ([LICENSE-POLICY.md](LICENSE-POLICY.md))
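The `expected_source` guard mentioned in the repair-loop item is a form of optimistic concurrency: an edit applies only if the cell still holds the text the caller last read. Here is a minimal sketch of the idea, assuming a notebook modelled as a plain dict of cell id to source. `edit_cell` and `StaleCellError` are hypothetical names, not the signature of `notebook_edit_cell`.

```python
class StaleCellError(Exception):
    """Raised when a cell changed since the caller last read it."""


def edit_cell(cells: dict, cell_id: str, expected_source: str, new_source: str) -> None:
    """Replace a cell's source only if it still matches what the caller saw."""
    current = cells[cell_id]
    if current != expected_source:
        raise StaleCellError(f"cell {cell_id!r} changed; re-read before editing")
    cells[cell_id] = new_source


notebook = {"c1": "summarize mpgg"}
seen = notebook["c1"]  # the agent reads the cell

# Succeeds: nothing changed between the read and the edit.
edit_cell(notebook, "c1", seen, "summarize mpg")

try:
    # Fails: `seen` no longer matches the cell's current source.
    edit_cell(notebook, "c1", seen, "summarize price")
except StaleCellError as exc:
    print("rejected:", exc)
```

Rejecting the stale edit forces the agent to re-read the cell instead of silently overwriting a change made by the user or another tool.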
400
+
401
+ ### Next Up
402
+
403
+ - Console fallback for Stata 11–16, re-implemented against the v1.0 schema
404
+ - Hard timeout / mid-Stata interrupt; design and tradeoffs in [`docs/design/hard_timeout.md`](docs/design/hard_timeout.md)
405
+ - Extra VS Code polish (esbuild bundle, lighter VSIX, command palette UX)
406
+ - **v1.0** — Stable schema, broader Stata edition coverage
407
+
408
+ See [SCHEMA.md §7](SCHEMA.md) for explicitly out-of-scope items.
409
+
410
+ ---
411
+
412
+ ## Testing
413
+
414
+ ```bash
+ pip install -e ".[dev,mcp,kernel]"
+ pytest                            # full suite (310 tests)
+ pytest -m "not stata_required"    # CI subset; no Stata needed
+ pytest -m "stata_required" -v     # Stata-only integration tests
+ ```
420
+
421
+ The `stata_required` marker tags the real-Stata integration tests. CI runs `pytest -m "not stata_required"`, which deselects them. Locally without Stata, those tests skip cleanly with the `"pystata / Stata 17+ not available"` message.
422
+
423
+ ---
424
+
425
+ ## Contributing
426
+
427
+ - Read [LICENSE-POLICY.md](LICENSE-POLICY.md) before opening a PR.
428
+ - Add a one-line acknowledgement to your first PR description; the template is in the policy file.
429
+ - Tests are required for any new schema field or runner behavior.
430
+
431
+ ---
432
+
433
+ ## License
434
+
435
+ The code is licensed under [MIT](./LICENSE). [LICENSE-POLICY.md](LICENSE-POLICY.md) explains how this project relates to other Stata projects.
436
+
437
+ ## Trademark Notice
438
+
439
+ Stata is a registered trademark of StataCorp LLC. This project is independent and not affiliated with or endorsed by StataCorp.
440
+
441
+ ## Acknowledgements
442
+
443
+ The Stata tooling landscape that this project builds on and learns from is surveyed in [References-tools.md](References-tools.md). All listed projects retain their own licenses and authorship; please consult each repository before reuse.