chunkhound-index-compactor 0.2.0__tar.gz

@@ -0,0 +1 @@
1
+ tests/fixtures/shopware-cli-chunks.duckdb filter=lfs diff=lfs merge=lfs -text
@@ -0,0 +1,53 @@
1
+ # Python bytecode
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+
6
+ # Distribution / packaging
7
+ *.egg-info/
8
+ *.egg
9
+ build/
10
+ dist/
11
+ .eggs/
12
+
13
+ # Virtual environments
14
+ .venv/
15
+ venv/
16
+
17
+ # Node (prettier toolchain)
18
+ node_modules/
19
+
20
+ # Tooling caches
21
+ .pytest_cache/
22
+ .ruff_cache/
23
+ .mypy_cache/
24
+
25
+ # Coverage
26
+ .coverage
27
+ .coverage.*
28
+ htmlcov/
29
+ coverage.xml
30
+
31
+ # IDE
32
+ .idea/
33
+ .vscode/
34
+ *.swp
35
+ *.swo
36
+
37
+ # OS
38
+ .DS_Store
39
+ Thumbs.db
40
+
41
+ # Compactor-produced artifacts (any location)
42
+ *.compacted.duckdb
43
+ *.bak
44
+
45
+ # Superpowers plans and specs stay untracked
46
+ docs/superpowers/plans/
47
+ docs/superpowers/specs/
48
+
49
+ # .claude is excluded except for the shared skills and project settings
50
+ .claude/*
51
+ !.claude/hook-contexts/
52
+ !.claude/skills/
53
+ !.claude/settings.json
@@ -0,0 +1,38 @@
1
+ # Mirrors the local checks documented in CONTRIBUTING.md §Local checks.
2
+ # Install once per clone: `pre-commit install`.
3
+ # Run against the whole tree on demand: `pre-commit run --all-files`.
4
+
5
+ repos:
6
+ - repo: https://github.com/astral-sh/ruff-pre-commit
7
+ rev: v0.15.13
8
+ hooks:
9
+ - id: ruff
10
+ # Default invocation is `ruff check` without --fix.
11
+ # Auto-fix is off by project policy; run `uv run ruff check --fix` manually.
12
+
13
+ - repo: https://github.com/crate-ci/typos
14
+ rev: v1.46.2
15
+ hooks:
16
+ - id: typos
17
+
18
+ - repo: https://github.com/rbubley/mirrors-prettier
19
+ rev: v3.8.3
20
+ hooks:
21
+ - id: prettier
22
+ args: [--check]
23
+
24
+ - repo: local
25
+ hooks:
26
+ - id: mypy
27
+ name: mypy
28
+ entry: uv run mypy src/
29
+ language: system
30
+ files: ^(src/.*\.py|pyproject\.toml)$
31
+ pass_filenames: false
32
+
33
+ - id: pytest
34
+ name: pytest
35
+ entry: uv run pytest
36
+ language: system
37
+ always_run: true
38
+ pass_filenames: false
@@ -0,0 +1,17 @@
1
+ node_modules/
2
+ dist/
3
+ .venv/
4
+ .pytest_cache/
5
+ .ruff_cache/
6
+ .mypy_cache/
7
+
8
+ src/
9
+ tests/
10
+
11
+ # Markdown is hand-formatted. Prettier mangles identifiers containing
12
+ # underscores (treats them as emphasis) and pads tables past printWidth.
13
+ **/*.md
14
+
15
+ uv.lock
16
+ package-lock.json
17
+ LICENSE
@@ -0,0 +1,7 @@
1
+ {
2
+ "printWidth": 100,
3
+ "proseWrap": "preserve",
4
+ "tabWidth": 2,
5
+ "trailingComma": "all",
6
+ "endOfLine": "lf"
7
+ }
@@ -0,0 +1,13 @@
1
+ [files]
2
+ extend-exclude = [
3
+ "tests/fixtures/",
4
+ "dist/",
5
+ "node_modules/",
6
+ "uv.lock",
7
+ "package-lock.json",
8
+ "CHANGELOG.md",
9
+ ]
10
+
11
+ [default.extend-words]
12
+ # "unparseable" is a valid English variant; we use it in test names.
13
+ unparseable = "unparseable"
@@ -0,0 +1,80 @@
1
+ # AGENTS.md
2
+
3
+ ## Layout
4
+
5
+ ```
6
+ chunkhound-index-compactor/
7
+ ├── pyproject.toml
8
+ ├── package.json # prettier dev dep (Node)
9
+ ├── .prettierrc.json
10
+ ├── .prettierignore
11
+ ├── .typos.toml
12
+ ├── .github/workflows/ # ci.yml, rolling.yml, release.yml
13
+ ├── README.md
14
+ ├── AGENTS.md
15
+ ├── CLAUDE.md # @AGENTS.md
16
+ ├── CONTRIBUTING.md # dev tooling, CI, release process
17
+ ├── CHANGELOG.md
18
+ ├── LICENSE
19
+ ├── docs/
20
+ │ ├── architecture.md # pipeline, RAM asymmetry, recipe table, vss bundling, ChunkHound compat, refused inputs
21
+ │ ├── benchmarks.md # empirical baseline (1.25 TB ChunkHound index + fixture cross-check)
22
+ │ └── out-of-scope.md # refused shapes, dropped metadata, latent edges, rejected approaches + fix shapes
23
+ ├── src/chunkhound_index_compactor/
24
+ │ ├── __init__.py # public API re-exports
25
+ │ ├── __main__.py # python -m entry
26
+ │ ├── cli.py # Typer app
27
+ │ └── core.py # compaction logic
28
+ └── tests/
29
+ ├── conftest.py # fixtures: populated_db, bloated_db, hnsw_db, cosine_hnsw_db, shopware_cli_index
30
+ ├── fixtures/ # committed real-world DB artifacts (provenance in conftest.py)
31
+ ├── test_core.py
32
+ ├── test_cli.py
33
+ ├── test_extensions.py
34
+ ├── test_rebuild.py
35
+ └── test_human_size.py
36
+ ```
37
+
38
+ ## Module → symbols
39
+
40
+ | Module | Public | Private |
41
+ |---|---|---|
42
+ | `core.py` | `compact_database`, `restore_indexes`, `replace_with_compacted`, `human_size`, `CompactionResult`, `RestoreResult` | `_topological_order`, `_referenced_tables`, `_reject_unsupported_objects`, `_capture_hnsw_recipes`, `_write_recipe_table`, `_load_bundled_extension`, `_bundled_extension_path`, `_escape_sql_literal`, `_quote_identifier`, `RECIPE_TABLE` constant, regexes `_HNSW_RE`, `_HNSW_COLUMN_RE`, `_FK_REFERENCES_RE`, `_GENERATED_COLUMN_RE`, `_BARE_IDENTIFIER_RE` |
43
+ | `cli.py` | `app` (Typer), `compact`, `restore` commands; `DefaultCommandGroup` routes bare args to `compact` | (none) |
44
+ | `__main__.py` | `app()` invocation | (none) |
45
+ | `__init__.py` | re-exports from `core` | (none) |
46
+
47
+ ## When to modify
48
+
49
+ | Task | File / symbol |
50
+ |---|---|
51
+ | Rebuild SQL sequence | `core.py` → `compact_database()` |
52
+ | FK ordering | `core.py` → `_topological_order()` / `_referenced_tables()` |
53
+ | Front-gate refusal of unsupported source shapes | `core.py` → `_reject_unsupported_objects()` (schemas, views, user-defined types, generated columns, self-ref FKs) and `_capture_hnsw_recipes()` (expression HNSW columns) |
54
+ | Cross-filesystem replace fallback | `core.py` → `replace_with_compacted()` (`shutil.move` on EXDEV) |
55
+ | DuckDB spill location | `core.py` → `compact_database()` (architecture.md §Compaction pipeline) |
56
+ | HNSW metric recovery / recipe table schema | `core.py` → `_capture_hnsw_recipes()` / `_write_recipe_table()` / `RECIPE_TABLE` |
57
+ | Index restore | `core.py` → `restore_indexes()` |
58
+ | Atomic replace / backup suffix | `core.py` → `replace_with_compacted()` |
59
+ | CLI args / commands / output strings | `cli.py` (`DefaultCommandGroup`, `compact`, `restore`) |
60
+ | Byte formatting | `core.py` → `human_size()` |
61
+ | New public export | `core.py` + `__init__.py` `__all__` |
62
+ | Pipeline narrative, design rationale, refused-input reasoning | `docs/architecture.md` (not here) |
63
+ | Empirical baseline / scale numbers | `docs/benchmarks.md` (not here) |
64
+ | Refused shapes, dropped metadata, latent edges, rejected approaches, and the fix shape per item | `docs/out-of-scope.md` (not here) |
65
+
66
+ ## Invariants enforced by code
67
+
68
+ - HNSW metric must survive rebuild. Catalog DDL strips `WITH (...)`, so the metric is read from `pragma_hnsw_index_info()` in `_capture_hnsw_recipes`. (architecture.md §ChunkHound compatibility)
69
+ - SQL DDL is built by string interpolation (no parameter binding); escape literals via `_escape_sql_literal`, wrap table and index names via `_quote_identifier`. (architecture.md §Compaction pipeline)
70
+ - Public-API exceptions (`ValueError`, `FileNotFoundError`, `FileExistsError`, `RuntimeError`, `OSError`) enumerated at README §Library Usage; refused inputs reasoned at architecture.md §Not supported (and why).
71
+ - Front-gate refusals run before `ATTACH dst`; on any failure after `ATTACH dst`, the partial target and its `.wal` are unlinked. (architecture.md §Compaction pipeline)
72
+ - Reading the source never loads its HNSW into RAM; building the destination HNSW dominates peak RAM. `--skip-hnsw` is the small-RAM unlock; `restore` is a separate-machine step. (architecture.md §RAM cost asymmetry)
73
+
74
+ ## Build / verify
75
+
76
+ - Setup, local-check commands, tooling configs, CI workflow details, and release process at `CONTRIBUTING.md` §Setup, §Local checks, §CI workflows, §Release process.
77
+
78
+ ## Runtime deps
79
+
80
+ - Authoritative constraints at `pyproject.toml`. Load-bearing context: `duckdb` range matches `chunkhound` to stay file-format-compatible; `duckdb-extension-vss>=1.5.2` pins `duckdb==1.5.2` transitively. Python `>=3.10,<3.14`.
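The string-interpolation invariant in AGENTS.md (escape literals, quote identifiers) can be illustrated with a minimal sketch. The source of `_quote_identifier` and `_escape_sql_literal` is not part of this diff, so the two functions below are hypothetical stand-ins built on standard SQL quoting rules; in particular, whether the real literal helper also adds the surrounding single quotes is an assumption left open here.

```python
def quote_identifier(name: str) -> str:
    """Wrap a table or index name in double quotes, doubling any embedded '"'.

    Hypothetical stand-in for the private `_quote_identifier` helper.
    """
    return '"' + name.replace('"', '""') + '"'


def escape_sql_literal(value: str) -> str:
    """Double embedded single quotes so the value can sit inside '...'.

    Hypothetical stand-in for the private `_escape_sql_literal` helper;
    the caller adds the surrounding single quotes.
    """
    return value.replace("'", "''")


def build_hnsw_ddl(index: str, table: str, column: str, metric: str) -> str:
    """Assemble a CREATE INDEX statement by interpolation (illustrative only)."""
    return (
        f"CREATE INDEX {quote_identifier(index)} ON {quote_identifier(table)} "
        f"USING HNSW ({quote_identifier(column)}) "
        f"WITH (metric = '{escape_sql_literal(metric)}')"
    )
```

Because DDL statements cannot take bound parameters for identifiers, routing every name through one quoting helper keeps the escaping rule in a single place.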
@@ -0,0 +1,54 @@
1
+ # Changelog
2
+
3
+ ## [0.2.0] - 2026-05-21
4
+
5
+ ### Fail-hard
6
+
7
+ - `compact_database` now refuses sources with user-defined types, generated columns, self-referential foreign keys, or HNSW indexes on non-bare-column expressions, raising `ValueError`. See architecture.md §Not supported (and why).
8
+ - `replace_with_compacted` documents `OSError` as a raised exception (the move from `compacted` to `source` failing even via the `shutil.move` fallback).
9
+ - `compact_database` and `restore_indexes` document `RuntimeError` for the case where the bundled `vss` extension binary cannot be located.
10
+
11
+ ### Fixed
12
+
13
+ - The CLI surfaces `RuntimeError` from missing bundled `vss` and `OSError` from `replace_with_compacted` as clean `error:` lines instead of unhandled stack traces.
14
+ - The `--skip-hnsw` note now prints after the `--replace` step and points at the artifact's final path once the CLI run finishes (`result.source` with `--replace`, `result.target` without). Previously it pointed at the `.compacted` path, which `--replace` had already renamed away.
15
+ - `replace_with_compacted` falls back to `shutil.move` when the second rename fails with a cross-device error (EXDEV), so a cross-filesystem `--replace` (user-supplied target on another mount) now completes instead of crashing.
16
+ - The failure-cleanup block in `compact_database` now also unlinks `<target>.wal` if a CHECKPOINT step left one behind.
17
+ - The compact CLI surfaces spill-directory creation failures (`OSError`) as a clean `error:` line, and `compact_database` removes the spill directory when DuckDB never spills into it.
18
+ - DDL identifier interpolation routes through a new `_quote_identifier()` helper that doubles any embedded `"`. Previously, the call sites wrapped identifiers in double quotes without escaping embedded quotes.
19
+
20
+ ### Added
21
+
22
+ - `--replace` with `--skip-hnsw` now prints a `warning:` line. The in-place file has no vector index until `restore` runs against it, so the loss of index acceleration needs to be loud.
23
+
24
+ ### Changed
25
+
26
+ - `compact_database` co-locates DuckDB spill with the target's filesystem (`temp_directory` beside the target). See architecture.md §Compaction pipeline.
27
+ - README narrowed the "fully generic / works on any single-schema DuckDB file" claim to ChunkHound-shaped inputs; other shapes are refused at the front gate rather than rebuilt with silent loss. The published package description (`pyproject.toml`) dropped its "(ChunkHound index or otherwise)" parenthetical to match.
28
+ - `docs/architecture.md` corrects the imprecise "drops HNSW on each write batch" claim to ChunkHound's actual `insert_embeddings_batch` 50-row threshold, removes the wrong COMMENT ON claim (see out-of-scope.md §Table and column comments for the actual drop behavior), and cites [duckdb/duckdb#16785](https://github.com/duckdb/duckdb/issues/16785) for the `COPY FROM DATABASE` FK race.
29
+ - `docs/architecture.md` §Not supported collapsed to one-line pointers at `out-of-scope.md`, so per-case refusal reasoning lives on a single surface.
30
+ - `docs/out-of-scope.md` promoted to the single per-topic catalog covering refused source shapes, silently-dropped metadata (HNSW tuning beyond `metric`, table and column comments), latent code edges (quoted referenced tables in `_FK_REFERENCES_RE`), and rejected alternative approaches. Each section owns both the why-not and the fix shape.
31
+
32
+ ## [0.1.1] - 2026-05-20
33
+
34
+ ### Packaging
35
+ - README, `[project.urls]` (Homepage / Repository / Issues / Changelog), `authors`, `keywords`, and trove classifiers now ship in the published package metadata. The PyPI project page renders the README and links back to the GitHub repository; 0.1.0 shipped without any of these.
36
+
37
+ ## [0.1.0] - 2026-05-20
38
+
39
+ ### Added
40
+ - Initial release of `chunkhound-index-compactor`.
41
+ - `chunkhound-index-compactor` CLI (Typer-based) with a `compact` default command and a `restore` subcommand, routed via `DefaultCommandGroup` so `chunkhound-index-compactor SOURCE [TARGET]` still works.
42
+ - `compact_database(source, target, *, skip_hnsw=False)`: rebuild a DuckDB database into a fresh file via a foreign-key-ordered streaming rebuild. Captures the source schema, recreates sequences/tables/indexes in a freshly-allocated file, computes a foreign-key-topological table order, and inserts one table at a time parent-before-child. This sidesteps the FK race that breaks `ATTACH` + `COPY FROM DATABASE` on large FK-bearing databases (e.g. ChunkHound indexes at scale) while still dropping orphaned blocks.
43
+ - HNSW indexes are recreated with the metric recovered from `pragma_hnsw_index_info()`. The catalog DDL strips the `WITH (...)` clause, so a verbatim rebuild would silently reset a `cosine` index to the `l2sq` default and leave it dead (queries fall back to brute force).
44
+ - `--skip-hnsw` flag / `skip_hnsw=True` parameter: rebuild without vector indexes (RAM-flat, smallest output) and record what was stripped in a `_compactor_hnsw_recipe` table inside the output.
45
+ - `restore` CLI command / `restore_indexes()` function: rebuild the stripped HNSW indexes in place from the recipe table, idempotently, on a RAM-capable machine.
46
+ - `replace_with_compacted()`: atomic swap with `.bak` backup.
47
+ - `human_size()`: binary-prefix byte formatting.
48
+ - `CompactionResult` and `RestoreResult` dataclasses.
49
+ - `--replace` flag for in-place compaction with backup.
50
+ - Bundled `vss.duckdb_extension` binary from `duckdb-extension-vss` is `LOAD`ed directly from disk when an HNSW index is present, so compaction of ChunkHound and other vector-search DuckDBs works offline out of the box.
51
+
52
+ ### Fail-hard
53
+ - Sources with non-`main` schemas, views, or foreign-key cycles raise `ValueError` rather than silently dropping objects.
54
+ - On any failure after the target file is created, the partial target is unlinked. A half-written multi-GB file is worse than nothing.
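The 0.1.0 entry above describes a foreign-key-topological table order (parents inserted before children) and a `ValueError` on foreign-key cycles. The following is a minimal sketch of that idea using Kahn's algorithm. It is not the package's `_topological_order`, whose source is not shown in this diff; the function name, the `references` mapping shape, and the table names in the usage note are all assumptions.

```python
from collections import deque


def topological_order(tables: list[str], references: dict[str, set[str]]) -> list[str]:
    """Return tables ordered parent-before-child.

    `references` maps a child table to the parent tables its foreign keys
    point at. Raises ValueError when the graph contains a cycle.
    """
    known = set(tables)
    # Parents still waiting to be emitted, per table (unknown tables ignored).
    pending = {t: set(references.get(t, ())) & known for t in tables}

    # Reverse edges: parent -> children that depend on it.
    children: dict[str, list[str]] = {t: [] for t in tables}
    for child, parents in pending.items():
        for parent in parents:
            children[parent].append(child)

    ready = deque(t for t in tables if not pending[t])
    ordered: list[str] = []
    while ready:
        table = ready.popleft()
        ordered.append(table)
        for child in children[table]:
            pending[child].discard(table)
            if not pending[child]:
                ready.append(child)

    if len(ordered) != len(tables):
        stuck = sorted(known - set(ordered))
        raise ValueError(f"foreign-key cycle among tables: {stuck}")
    return ordered
```

With hypothetical tables `files`, `chunks` (referencing `files`), and `embeddings` (referencing `chunks`), the function returns `files` first, so each `INSERT` finds its parent rows already present.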
@@ -0,0 +1 @@
1
+ @AGENTS.md
@@ -0,0 +1,85 @@
1
+ # Contributing
2
+
3
+ ## Setup
4
+
5
+ ```bash
6
+ uv sync --extra dev # Python toolchain (mypy, pytest, ruff, typos, pre-commit)
7
+ npm install # Node toolchain (prettier; one-time per clone)
8
+ uv run pre-commit install # wires the git pre-commit hook (one-time per clone)
9
+ ```
10
+
11
+ Python 3.10 through 3.13 are supported. macOS and Linux are tested.
12
+
13
+ The pre-commit hook runs ruff, typos, prettier, mypy, and pytest on every commit. Bypass with `git commit --no-verify` when you need to ship a WIP commit; CI still runs the same checks.
14
+
15
+ ## Local checks
16
+
17
+ ```bash
18
+ uv run pytest
19
+ uv run ruff check src/ tests/
20
+ uv run ruff format --check src/ tests/
21
+ uv run mypy src/
22
+ uv run typos
23
+ npx prettier --check .
24
+ ```
25
+
26
+ Apply prettier fixes with `npm run format:fix`. CI runs the same commands.
27
+
28
+ ### Tooling notes
29
+
30
+ - **Ruff**: `E W F I B C4 UP ARG SIM PTH`, line length 100, `E501` ignored.
31
+ - **MyPy**: strict; `tests.*` relaxed; `duckdb`, `duckdb_extension_vss` missing-imports ignored.
32
+ - **Pytest**: discovers `tests/`. Fixtures bundle real ChunkHound DB samples (see `tests/conftest.py`).
33
+ - **Typos**: config at `.typos.toml`.
34
+ - **Prettier**: config at `.prettierrc.json` and `.prettierignore`. Markdown is excluded because prettier mangles identifiers containing underscores and bloats markdown tables past the 100-column line limit.
35
+
36
+ ## CI workflows
37
+
38
+ Three workflows under `.github/workflows/`. All third-party actions are SHA-pinned with `# vX.Y.Z` comments so Renovate can update them later.
39
+
40
+ ### ci.yml
41
+
42
+ Runs on every push to `main` and every PR targeting `main`. Six parallel jobs:
43
+
44
+ | Job | What |
45
+ |-------------|---------------------------------------------------------------------------------------|
46
+ | `lint` | `ruff check` and `ruff format --check` |
47
+ | `typecheck` | `mypy src/` |
48
+ | `test` | `pytest` with coverage on a Python 3.10 / 3.11 / 3.12 / 3.13 matrix (`ubuntu-latest`) |
49
+ | `typos` | `typos` against the repo |
50
+ | `prettier` | `prettier --check .` |
51
+ | `build` | `uv build`; uploads wheel + sdist as a `dist` artifact (14-day retention) |
52
+
53
+ Coverage is reported but not gated. The `build` job runs independently of the others, so PR reviewers can download a wheel even when other jobs fail.
54
+
55
+ ### rolling.yml
56
+
57
+ After `ci.yml` succeeds on `main`, deletes the existing `rolling` tag plus release and recreates them at the new HEAD SHA. Marked prerelease and not-latest so it never shadows a tagged release.
58
+
59
+ Install the current main-branch build from the rolling asset:
60
+
61
+ ```bash
62
+ uv tool install https://github.com/it-bens/chunkhound-index-compactor/releases/download/rolling/chunkhound_index_compactor-<version>-py3-none-any.whl
63
+ ```
64
+
65
+ ### release.yml
66
+
67
+ Triggers on a `v*.*.*` tag push or manual `workflow_dispatch` against a tag ref.
68
+
69
+ 1. Verifies the ref is a tag.
70
+ 2. Verifies the tag (minus the `v` prefix) matches `pyproject.toml`'s `version`.
71
+ 3. Builds wheel + sdist.
72
+ 4. Publishes to PyPI via Trusted Publisher (OIDC; no PyPI token in the repo).
73
+ 5. Creates a GitHub Release with the wheel + sdist attached. Release notes are not auto-generated; edit them on GitHub afterward.
74
+
75
+ ## Release process
76
+
77
+ To cut a release:
78
+
79
+ 1. Bump `version` in `pyproject.toml` (for example, `0.1.0` to `0.2.0`) and update `CHANGELOG.md`.
80
+ 2. Commit, push, wait for CI to pass on `main`.
81
+ 3. Tag and push: `git tag v0.2.0 && git push origin v0.2.0`.
82
+ 4. The release workflow publishes to PyPI and creates a GitHub Release.
83
+ 5. Open the Release on GitHub and write the release notes.
84
+
85
+ To re-trigger from a tag manually (for example, after a transient PyPI failure): GitHub Actions, then Release, then Run workflow, then pick the tag from the ref dropdown.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Martin Bens
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,147 @@
1
+ Metadata-Version: 2.4
2
+ Name: chunkhound-index-compactor
3
+ Version: 0.2.0
4
+ Summary: Compact a bloated ChunkHound DuckDB index by rebuilding it into a fresh file
5
+ Project-URL: Homepage, https://github.com/it-bens/chunkhound-index-compactor
6
+ Project-URL: Repository, https://github.com/it-bens/chunkhound-index-compactor
7
+ Project-URL: Issues, https://github.com/it-bens/chunkhound-index-compactor/issues
8
+ Project-URL: Changelog, https://github.com/it-bens/chunkhound-index-compactor/blob/main/CHANGELOG.md
9
+ Author-email: Martin Bens <martin.bens@it-bens.de>
10
+ License-Expression: MIT
11
+ License-File: LICENSE
12
+ Keywords: chunkhound,compact,database,duckdb,hnsw,vector-index
13
+ Classifier: Development Status :: 4 - Beta
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: Operating System :: OS Independent
16
+ Classifier: Programming Language :: Python :: 3
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Programming Language :: Python :: 3.13
21
+ Classifier: Topic :: Database
22
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
23
+ Requires-Python: <3.14,>=3.10
24
+ Requires-Dist: duckdb-extension-vss>=1.5.2
25
+ Requires-Dist: duckdb<1.5.3.dev0,>=1.4.0
26
+ Requires-Dist: typer>=0.25
27
+ Provides-Extra: dev
28
+ Requires-Dist: mypy>=2.1; extra == 'dev'
29
+ Requires-Dist: pre-commit>=4.5; extra == 'dev'
30
+ Requires-Dist: pytest-cov>=7.0; extra == 'dev'
31
+ Requires-Dist: pytest>=9.0; extra == 'dev'
32
+ Requires-Dist: ruff>=0.15; extra == 'dev'
33
+ Requires-Dist: typos>=1.46; extra == 'dev'
34
+ Description-Content-Type: text/markdown
35
+
36
+ # ChunkHound Index Compactor
37
+
38
+ Compact a [DuckDB](https://duckdb.org) database by rebuilding it into a fresh file. The motivating and supported use case is shrinking a bloated [ChunkHound](https://github.com/chunkhound/chunkhound) index, whose drop-and-recreate HNSW churn (above its 50-row write-batch threshold) leaves large numbers of orphaned-but-counted blocks. The rebuild pipeline is structurally generic and works on other single-schema DuckDB files, but only ChunkHound-shaped inputs are promised: any shape outside that scope is refused at the front gate (see the Not Supported section below) rather than silently dropped or rebuilt with loss.
39
+
40
+ That bloat comes back as ChunkHound keeps indexing, so compaction is periodic maintenance rather than a one-time cleanup.
41
+
42
+ ## ⚡ Quick Start
43
+
44
+ ```bash
45
+ uvx chunkhound-index-compactor path/to/db.duckdb
46
+ # writes path/to/db.duckdb.compacted
47
+
48
+ uvx chunkhound-index-compactor path/to/db.duckdb --replace
49
+ # swaps in the compacted copy and keeps the original at path/to/db.duckdb.bak
50
+
51
+ uvx chunkhound-index-compactor path/to/db.duckdb --skip-hnsw
52
+ # skips rebuilding vector indexes (RAM-flat, smallest output); restore them later
53
+ uvx chunkhound-index-compactor restore path/to/db.duckdb.compacted
54
+ ```
55
+
56
+ The source is opened read-only, but DuckDB cannot open a file while another process holds it open for writing. Close any process writing to the database before running.
57
+
58
+ ## 🖥️ CLI Usage
59
+
60
+ ```
61
+ $ chunkhound-index-compactor --help
62
+ Usage: chunkhound-index-compactor [OPTIONS] COMMAND [ARGS]...
63
+
64
+ Commands:
65
+ compact Compact a DuckDB database by rebuilding it into a fresh file. (default)
66
+ restore Rebuild HNSW vector indexes in a --skip-hnsw artifact, in place.
67
+ ```
68
+
69
+ A bare invocation routes to `compact`, so `chunkhound-index-compactor SOURCE` still works:
70
+
71
+ ```
72
+ chunkhound-index-compactor SOURCE [TARGET] [--replace] [--skip-hnsw]
73
+ chunkhound-index-compactor restore DATABASE
74
+ ```
75
+
76
+ | Argument / Option | Meaning |
77
+ |-------------------|-----------------------------------------------------------------------------------|
78
+ | `SOURCE` | Path to the existing DuckDB file (required) |
79
+ | `TARGET` | Path for the compacted output [default: `<source>.compacted`] |
80
+ | `--replace` | After success, replace source with the compacted file (original → `<source>.bak`) |
81
+ | `--skip-hnsw` | Do not rebuild vector indexes; write a recipe table for later `restore` |
82
+
83
+ With `--skip-hnsw`, the output has no vector index and falls back to a brute-force scan (correct, just unaccelerated) until you run `restore`. Rebuilding the HNSW is the memory-dominant step, so `--skip-hnsw` lets you compact on a small machine and `restore` on a RAM-capable one. See [docs/benchmarks.md](docs/benchmarks.md) for peak-RAM numbers and [docs/architecture.md §RAM cost asymmetry](docs/architecture.md#ram-cost-asymmetry) for why.
84
+
85
+ ## 🐍 Library Usage
86
+
87
+ ```python
88
+ from pathlib import Path
89
+ from chunkhound_index_compactor import compact_database, restore_indexes, replace_with_compacted
90
+
91
+ result = compact_database(Path("big.duckdb"), Path("small.duckdb"))
92
+ print(f"{result.source_size} -> {result.target_size} ({result.delta_pct:+.1f}%)")
93
+
94
+ # Small-RAM path: skip the vector index, restore it later on a bigger machine.
95
+ compact_database(Path("big.duckdb"), Path("small-no-hnsw.duckdb"), skip_hnsw=True)
96
+ restored = restore_indexes(Path("small-no-hnsw.duckdb"))
97
+ print(f"restored: {restored.restored}")
98
+
99
+ # Optional: swap in place with .bak backup
100
+ backup = replace_with_compacted(result.source, result.target)
101
+ ```
102
+
103
+ `compact_database()` raises:
104
+ - `ValueError`: `target` resolves to the same path as `source`, the FK graph has a cycle, or the source has a shape refused at the front gate (see the Not Supported section below).
105
+ - `FileNotFoundError`: `source` does not exist.
106
+ - `FileExistsError`: `target` already exists.
107
+ - `RuntimeError`: the bundled `vss` extension binary cannot be located (only reachable if the source contains an HNSW index).
108
+
109
+ `restore_indexes()` raises:
110
+ - `FileNotFoundError`: `database` does not exist.
111
+ - `ValueError`: `database` has no `_compactor_hnsw_recipe` table (not a `--skip-hnsw` artifact).
112
+ - `RuntimeError`: the bundled `vss` extension binary cannot be located.
113
+
114
+ `replace_with_compacted()` raises:
115
+ - `FileNotFoundError`: `source` or `compacted` is missing.
116
+ - `FileExistsError`: `<source>.bak` already exists (it refuses to overwrite an existing backup).
117
+ - `OSError`: the move from `compacted` to `source` fails even via the cross-filesystem fallback (`shutil.move`).
118
+
119
+ ## 🚫 Not Supported
120
+
121
+ The tool fails hard on source shapes it cannot reproduce. The last two items below are the exceptions: that metadata is dropped rather than refused.
122
+
123
+ - Non-`main` schemas and views (raise `ValueError`).
124
+ - User-defined types, generated columns, self-referential foreign keys, and HNSW indexes on non-bare-column expressions (raise `ValueError`).
125
+ - Foreign-key cycles among tables (raise `ValueError`).
126
+ - HNSW tuning parameters other than `metric` (`M`, `M0`, `ef_construction`, `ef_search`); they are not recoverable from a built index and are rebuilt at the `vss` defaults.
127
+ - Table and column comments are not carried across the rebuild.
128
+
129
+ See [docs/architecture.md](docs/architecture.md#not-supported-and-why) for the reasoning, and [docs/out-of-scope.md](docs/out-of-scope.md) for approaches considered and not pursued.
130
+
131
+ ## 🏗️ Development
132
+
133
+ Setup, local checks, CI, and release process: [CONTRIBUTING.md](CONTRIBUTING.md).
134
+
135
+ ## ⚖️ License
136
+
137
+ MIT
138
+
139
+ ---
140
+
141
+ > [!NOTE]
142
+ > Yes, an AI wrote this README. And the code, the docs, the tests, and
143
+ > the `.claude/skills` it now uses to write the next round. Yes, a human
144
+ > told it to keep the emojis. The human has ADHD, which, as it turns
145
+ > out, means his brain was already doing attention re-routing and
146
+ > context-window thrashing before LLMs made it cool. They call him ...
147
+ > LLMartin. The emojis are a feature.
@@ -0,0 +1,112 @@
1
+ # ChunkHound Index Compactor
2
+
3
+ Compact a [DuckDB](https://duckdb.org) database by rebuilding it into a fresh file. The motivating and supported use case is shrinking a bloated [ChunkHound](https://github.com/chunkhound/chunkhound) index, whose drop-and-recreate HNSW churn (above its 50-row write-batch threshold) leaves large numbers of orphaned-but-counted blocks. The rebuild pipeline is structurally generic and works on other single-schema DuckDB files, but only ChunkHound-shaped inputs are promised: any shape outside that scope is refused at the front gate (see the Not Supported section below) rather than silently dropped or rebuilt with loss.
4
+
5
+ That bloat comes back as ChunkHound keeps indexing, so compaction is periodic maintenance rather than a one-time cleanup.
6
+
7
+ ## ⚡ Quick Start
8
+
9
+ ```bash
10
+ uvx chunkhound-index-compactor path/to/db.duckdb
11
+ # writes path/to/db.duckdb.compacted
12
+
13
+ uvx chunkhound-index-compactor path/to/db.duckdb --replace
14
+ # swaps in the compacted copy and keeps the original at path/to/db.duckdb.bak
15
+
16
+ uvx chunkhound-index-compactor path/to/db.duckdb --skip-hnsw
17
+ # skips rebuilding vector indexes (RAM-flat, smallest output); restore them later
18
+ uvx chunkhound-index-compactor restore path/to/db.duckdb.compacted
19
+ ```
20
+
21
+ The source is opened read-only, but DuckDB cannot open a file while another process holds it open for writing. Close any process writing to the database before running.
22
+
23
+ ## 🖥️ CLI Usage
24
+
25
+ ```
26
+ $ chunkhound-index-compactor --help
27
+ Usage: chunkhound-index-compactor [OPTIONS] COMMAND [ARGS]...
28
+
29
+ Commands:
30
+ compact Compact a DuckDB database by rebuilding it into a fresh file. (default)
31
+ restore Rebuild HNSW vector indexes in a --skip-hnsw artifact, in place.
32
+ ```
33
+
34
+ A bare invocation routes to `compact`, so `chunkhound-index-compactor SOURCE` still works:
35
+
36
+ ```
37
+ chunkhound-index-compactor SOURCE [TARGET] [--replace] [--skip-hnsw]
38
+ chunkhound-index-compactor restore DATABASE
39
+ ```
40
+
41
+ | Argument / Option | Meaning |
42
+ |-------------------|-----------------------------------------------------------------------------------|
43
+ | `SOURCE` | Path to the existing DuckDB file (required) |
44
+ | `TARGET` | Path for the compacted output [default: `<source>.compacted`] |
45
+ | `--replace` | After success, replace source with the compacted file (original → `<source>.bak`) |
46
+ | `--skip-hnsw` | Do not rebuild vector indexes; write a recipe table for later `restore` |
47
+
48
+ With `--skip-hnsw`, the output has no vector index and falls back to a brute-force scan (correct, just unaccelerated) until you run `restore`. Rebuilding the HNSW is the memory-dominant step, so `--skip-hnsw` lets you compact on a small machine and `restore` on a RAM-capable one. See [docs/benchmarks.md](docs/benchmarks.md) for peak-RAM numbers and [docs/architecture.md §RAM cost asymmetry](docs/architecture.md#ram-cost-asymmetry) for why.
49
+
50
+ ## 🐍 Library Usage
51
+
52
+ ```python
53
+ from pathlib import Path
54
+ from chunkhound_index_compactor import compact_database, restore_indexes, replace_with_compacted
55
+
56
+ result = compact_database(Path("big.duckdb"), Path("small.duckdb"))
57
+ print(f"{result.source_size} -> {result.target_size} ({result.delta_pct:+.1f}%)")
58
+
59
+ # Small-RAM path: skip the vector index, restore it later on a bigger machine.
60
+ compact_database(Path("big.duckdb"), Path("small-no-hnsw.duckdb"), skip_hnsw=True)
61
+ restored = restore_indexes(Path("small-no-hnsw.duckdb"))
62
+ print(f"restored: {restored.restored}")
63
+
64
+ # Optional: swap in place with .bak backup
65
+ backup = replace_with_compacted(result.source, result.target)
66
+ ```
67
+
68
+ `compact_database()` raises:
69
+ - `ValueError`: `target` resolves to the same path as `source`, the FK graph has a cycle, or the source has a shape refused at the front gate (see the Not Supported section below).
70
+ - `FileNotFoundError`: `source` does not exist.
71
+ - `FileExistsError`: `target` already exists.
72
+ - `RuntimeError`: the bundled `vss` extension binary cannot be located (only reachable if the source contains an HNSW index).
73
+
74
+ `restore_indexes()` raises:
75
+ - `FileNotFoundError`: `database` does not exist.
76
+ - `ValueError`: `database` has no `_compactor_hnsw_recipe` table (not a `--skip-hnsw` artifact).
77
+ - `RuntimeError`: the bundled `vss` extension binary cannot be located.
78
+
79
+ `replace_with_compacted()` raises:
80
+ - `FileNotFoundError`: `source` or `compacted` is missing.
81
+ - `FileExistsError`: `<source>.bak` already exists (it refuses to overwrite an existing backup).
82
+ - `OSError`: the move from `compacted` to `source` fails even via the cross-filesystem fallback (`shutil.move`).
83
+
84
+ ## 🚫 Not Supported
85
+
86
+ The tool fails hard on source shapes it cannot reproduce. The last two items below are the exceptions: that metadata is dropped rather than refused.
87
+
88
+ - Non-`main` schemas and views (raise `ValueError`).
89
+ - User-defined types, generated columns, self-referential foreign keys, and HNSW indexes on non-bare-column expressions (raise `ValueError`).
90
+ - Foreign-key cycles among tables (raise `ValueError`).
91
+ - HNSW tuning parameters other than `metric` (`M`, `M0`, `ef_construction`, `ef_search`); they are not recoverable from a built index and are rebuilt at the `vss` defaults.
92
+ - Table and column comments are not carried across the rebuild.
93
+
94
+ See [docs/architecture.md](docs/architecture.md#not-supported-and-why) for the reasoning, and [docs/out-of-scope.md](docs/out-of-scope.md) for approaches considered and not pursued.
95
+
96
+ ## 🏗️ Development
97
+
98
+ Setup, local checks, CI, and release process: [CONTRIBUTING.md](CONTRIBUTING.md).
99
+
100
+ ## ⚖️ License
101
+
102
+ MIT
103
+
104
+ ---
105
+
106
+ > [!NOTE]
107
+ > Yes, an AI wrote this README. And the code, the docs, the tests, and
108
+ > the `.claude/skills` it now uses to write the next round. Yes, a human
109
+ > told it to keep the emojis. The human has ADHD, which, as it turns
110
+ > out, means his brain was already doing attention re-routing and
111
+ > context-window thrashing before LLMs made it cool. They call him ...
112
+ > LLMartin. The emojis are a feature.
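The exceptions the README lists for `replace_with_compacted()` (missing inputs, an existing `.bak`, and the `shutil.move` fallback on a cross-device rename) suggest a swap along the following lines. This is a sketch under those stated behaviors, not the package's code: the function name `replace_with_backup` is hypothetical, and how the real function recovers when the second step fails for a reason other than EXDEV is not documented in this diff.

```python
import errno
import shutil
from pathlib import Path


def replace_with_backup(source: Path, compacted: Path) -> Path:
    """Move `compacted` into `source`'s place, keeping the original at <source>.bak."""
    if not source.exists():
        raise FileNotFoundError(source)
    if not compacted.exists():
        raise FileNotFoundError(compacted)

    backup = source.with_name(source.name + ".bak")
    if backup.exists():
        # Refuse to overwrite an existing backup.
        raise FileExistsError(backup)

    # The backup sits beside the source, so this rename stays on one filesystem.
    source.rename(backup)
    try:
        compacted.rename(source)
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        # `compacted` is on another mount: fall back to copy-then-delete.
        shutil.move(str(compacted), str(source))
    return backup
```

Renaming the original to `.bak` first means a crash between the two steps leaves the old data intact under the backup name rather than half-overwritten.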