chunkhound-index-compactor 0.2.0__tar.gz
- chunkhound_index_compactor-0.2.0/.gitattributes +1 -0
- chunkhound_index_compactor-0.2.0/.gitignore +53 -0
- chunkhound_index_compactor-0.2.0/.pre-commit-config.yaml +38 -0
- chunkhound_index_compactor-0.2.0/.prettierignore +17 -0
- chunkhound_index_compactor-0.2.0/.prettierrc.json +7 -0
- chunkhound_index_compactor-0.2.0/.typos.toml +13 -0
- chunkhound_index_compactor-0.2.0/AGENTS.md +80 -0
- chunkhound_index_compactor-0.2.0/CHANGELOG.md +54 -0
- chunkhound_index_compactor-0.2.0/CLAUDE.md +1 -0
- chunkhound_index_compactor-0.2.0/CONTRIBUTING.md +85 -0
- chunkhound_index_compactor-0.2.0/LICENSE +21 -0
- chunkhound_index_compactor-0.2.0/PKG-INFO +147 -0
- chunkhound_index_compactor-0.2.0/README.md +112 -0
- chunkhound_index_compactor-0.2.0/package-lock.json +29 -0
- chunkhound_index_compactor-0.2.0/package.json +12 -0
- chunkhound_index_compactor-0.2.0/pyproject.toml +89 -0
- chunkhound_index_compactor-0.2.0/src/chunkhound_index_compactor/__init__.py +19 -0
- chunkhound_index_compactor-0.2.0/src/chunkhound_index_compactor/__main__.py +5 -0
- chunkhound_index_compactor-0.2.0/src/chunkhound_index_compactor/cli.py +131 -0
- chunkhound_index_compactor-0.2.0/src/chunkhound_index_compactor/core.py +470 -0
- chunkhound_index_compactor-0.2.0/uv.lock +720 -0
|
@@ -0,0 +1 @@
|
|
|
1
|
+
tests/fixtures/shopware-cli-chunks.duckdb filter=lfs diff=lfs merge=lfs -text
|
|
@@ -0,0 +1,53 @@
|
|
|
1
|
+
# Python bytecode
|
|
2
|
+
__pycache__/
|
|
3
|
+
*.py[cod]
|
|
4
|
+
*$py.class
|
|
5
|
+
|
|
6
|
+
# Distribution / packaging
|
|
7
|
+
*.egg-info/
|
|
8
|
+
*.egg
|
|
9
|
+
build/
|
|
10
|
+
dist/
|
|
11
|
+
.eggs/
|
|
12
|
+
|
|
13
|
+
# Virtual environments
|
|
14
|
+
.venv/
|
|
15
|
+
venv/
|
|
16
|
+
|
|
17
|
+
# Node (prettier toolchain)
|
|
18
|
+
node_modules/
|
|
19
|
+
|
|
20
|
+
# Tooling caches
|
|
21
|
+
.pytest_cache/
|
|
22
|
+
.ruff_cache/
|
|
23
|
+
.mypy_cache/
|
|
24
|
+
|
|
25
|
+
# Coverage
|
|
26
|
+
.coverage
|
|
27
|
+
.coverage.*
|
|
28
|
+
htmlcov/
|
|
29
|
+
coverage.xml
|
|
30
|
+
|
|
31
|
+
# IDE
|
|
32
|
+
.idea/
|
|
33
|
+
.vscode/
|
|
34
|
+
*.swp
|
|
35
|
+
*.swo
|
|
36
|
+
|
|
37
|
+
# OS
|
|
38
|
+
.DS_Store
|
|
39
|
+
Thumbs.db
|
|
40
|
+
|
|
41
|
+
# Compactor-produced artifacts (any location)
|
|
42
|
+
*.compacted.duckdb
|
|
43
|
+
*.bak
|
|
44
|
+
|
|
45
|
+
# Superpowers plans and specs stay untracked
|
|
46
|
+
docs/superpowers/plans/
|
|
47
|
+
docs/superpowers/specs/
|
|
48
|
+
|
|
49
|
+
# .claude is excluded except for the shared skills and project settings
|
|
50
|
+
.claude/*
|
|
51
|
+
!.claude/hook-contexts/
|
|
52
|
+
!.claude/skills/
|
|
53
|
+
!.claude/settings.json
|
|
@@ -0,0 +1,38 @@
|
|
|
1
|
+
# Mirrors the local checks documented in CONTRIBUTING.md §Local checks.
|
|
2
|
+
# Install once per clone: `pre-commit install`.
|
|
3
|
+
# Run against the whole tree on demand: `pre-commit run --all-files`.
|
|
4
|
+
|
|
5
|
+
repos:
|
|
6
|
+
- repo: https://github.com/astral-sh/ruff-pre-commit
|
|
7
|
+
rev: v0.15.13
|
|
8
|
+
hooks:
|
|
9
|
+
- id: ruff
|
|
10
|
+
# Default invocation is `ruff check` without --fix.
|
|
11
|
+
# Auto-fix is off by project policy; run `uv run ruff check --fix` manually.
|
|
12
|
+
|
|
13
|
+
- repo: https://github.com/crate-ci/typos
|
|
14
|
+
rev: v1.46.2
|
|
15
|
+
hooks:
|
|
16
|
+
- id: typos
|
|
17
|
+
|
|
18
|
+
- repo: https://github.com/rbubley/mirrors-prettier
|
|
19
|
+
rev: v3.8.3
|
|
20
|
+
hooks:
|
|
21
|
+
- id: prettier
|
|
22
|
+
args: [--check]
|
|
23
|
+
|
|
24
|
+
- repo: local
|
|
25
|
+
hooks:
|
|
26
|
+
- id: mypy
|
|
27
|
+
name: mypy
|
|
28
|
+
entry: uv run mypy src/
|
|
29
|
+
language: system
|
|
30
|
+
files: ^(src/.*\.py|pyproject\.toml)$
|
|
31
|
+
pass_filenames: false
|
|
32
|
+
|
|
33
|
+
- id: pytest
|
|
34
|
+
name: pytest
|
|
35
|
+
entry: uv run pytest
|
|
36
|
+
language: system
|
|
37
|
+
always_run: true
|
|
38
|
+
pass_filenames: false
|
|
@@ -0,0 +1,17 @@
|
|
|
1
|
+
node_modules/
|
|
2
|
+
dist/
|
|
3
|
+
.venv/
|
|
4
|
+
.pytest_cache/
|
|
5
|
+
.ruff_cache/
|
|
6
|
+
.mypy_cache/
|
|
7
|
+
|
|
8
|
+
src/
|
|
9
|
+
tests/
|
|
10
|
+
|
|
11
|
+
# Markdown is hand-formatted. Prettier mangles identifiers containing
|
|
12
|
+
# underscores (treats them as emphasis) and pads tables past printWidth.
|
|
13
|
+
**/*.md
|
|
14
|
+
|
|
15
|
+
uv.lock
|
|
16
|
+
package-lock.json
|
|
17
|
+
LICENSE
|
|
@@ -0,0 +1,13 @@
|
|
|
1
|
+
[files]
|
|
2
|
+
extend-exclude = [
|
|
3
|
+
"tests/fixtures/",
|
|
4
|
+
"dist/",
|
|
5
|
+
"node_modules/",
|
|
6
|
+
"uv.lock",
|
|
7
|
+
"package-lock.json",
|
|
8
|
+
"CHANGELOG.md",
|
|
9
|
+
]
|
|
10
|
+
|
|
11
|
+
[default.extend-words]
|
|
12
|
+
# "unparseable" is a valid English variant; we use it in test names.
|
|
13
|
+
unparseable = "unparseable"
|
|
@@ -0,0 +1,80 @@
|
|
|
1
|
+
# AGENTS.md
|
|
2
|
+
|
|
3
|
+
## Layout
|
|
4
|
+
|
|
5
|
+
```
|
|
6
|
+
chunkhound-index-compactor/
|
|
7
|
+
├── pyproject.toml
|
|
8
|
+
├── package.json # prettier dev dep (Node)
|
|
9
|
+
├── .prettierrc.json
|
|
10
|
+
├── .prettierignore
|
|
11
|
+
├── .typos.toml
|
|
12
|
+
├── .github/workflows/ # ci.yml, rolling.yml, release.yml
|
|
13
|
+
├── README.md
|
|
14
|
+
├── AGENTS.md
|
|
15
|
+
├── CLAUDE.md # @AGENTS.md
|
|
16
|
+
├── CONTRIBUTING.md # dev tooling, CI, release process
|
|
17
|
+
├── CHANGELOG.md
|
|
18
|
+
├── LICENSE
|
|
19
|
+
├── docs/
|
|
20
|
+
│ ├── architecture.md # pipeline, RAM asymmetry, recipe table, vss bundling, ChunkHound compat, refused inputs
|
|
21
|
+
│ ├── benchmarks.md # empirical baseline (1.25 TB ChunkHound index + fixture cross-check)
|
|
22
|
+
│ └── out-of-scope.md # refused shapes, dropped metadata, latent edges, rejected approaches + fix shapes
|
|
23
|
+
├── src/chunkhound_index_compactor/
|
|
24
|
+
│ ├── __init__.py # public API re-exports
|
|
25
|
+
│ ├── __main__.py # python -m entry
|
|
26
|
+
│ ├── cli.py # Typer app
|
|
27
|
+
│ └── core.py # compaction logic
|
|
28
|
+
└── tests/
|
|
29
|
+
├── conftest.py # fixtures: populated_db, bloated_db, hnsw_db, cosine_hnsw_db, shopware_cli_index
|
|
30
|
+
├── fixtures/ # committed real-world DB artifacts (provenance in conftest.py)
|
|
31
|
+
├── test_core.py
|
|
32
|
+
├── test_cli.py
|
|
33
|
+
├── test_extensions.py
|
|
34
|
+
├── test_rebuild.py
|
|
35
|
+
└── test_human_size.py
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
## Module → symbols
|
|
39
|
+
|
|
40
|
+
| Module | Public | Private |
|
|
41
|
+
|---|---|---|
|
|
42
|
+
| `core.py` | `compact_database`, `restore_indexes`, `replace_with_compacted`, `human_size`, `CompactionResult`, `RestoreResult` | `_topological_order`, `_referenced_tables`, `_reject_unsupported_objects`, `_capture_hnsw_recipes`, `_write_recipe_table`, `_load_bundled_extension`, `_bundled_extension_path`, `_escape_sql_literal`, `_quote_identifier`, `RECIPE_TABLE` constant, regexes `_HNSW_RE`, `_HNSW_COLUMN_RE`, `_FK_REFERENCES_RE`, `_GENERATED_COLUMN_RE`, `_BARE_IDENTIFIER_RE` |
|
|
43
|
+
| `cli.py` | `app` (Typer), `compact`, `restore` commands; `DefaultCommandGroup` routes bare args to `compact` | (none) |
|
|
44
|
+
| `__main__.py` | `app()` invocation | (none) |
|
|
45
|
+
| `__init__.py` | re-exports from `core` | (none) |
|
|
46
|
+
|
|
47
|
+
## When to modify
|
|
48
|
+
|
|
49
|
+
| Task | File / symbol |
|
|
50
|
+
|---|---|
|
|
51
|
+
| Rebuild SQL sequence | `core.py` → `compact_database()` |
|
|
52
|
+
| FK ordering | `core.py` → `_topological_order()` / `_referenced_tables()` |
|
|
53
|
+
| Front-gate refusal of unsupported source shapes | `core.py` → `_reject_unsupported_objects()` (schemas, views, user-defined types, generated columns, self-ref FKs) and `_capture_hnsw_recipes()` (expression HNSW columns) |
|
|
54
|
+
| Cross-filesystem replace fallback | `core.py` → `replace_with_compacted()` (`shutil.move` on EXDEV) |
|
|
55
|
+
| DuckDB spill location | `core.py` → `compact_database()` (architecture.md §Compaction pipeline) |
|
|
56
|
+
| HNSW metric recovery / recipe table schema | `core.py` → `_capture_hnsw_recipes()` / `_write_recipe_table()` / `RECIPE_TABLE` |
|
|
57
|
+
| Index restore | `core.py` → `restore_indexes()` |
|
|
58
|
+
| Atomic replace / backup suffix | `core.py` → `replace_with_compacted()` |
|
|
59
|
+
| CLI args / commands / output strings | `cli.py` (`DefaultCommandGroup`, `compact`, `restore`) |
|
|
60
|
+
| Byte formatting | `core.py` → `human_size()` |
|
|
61
|
+
| New public export | `core.py` + `__init__.py` `__all__` |
|
|
62
|
+
| Pipeline narrative, design rationale, refused-input reasoning | `docs/architecture.md` (not here) |
|
|
63
|
+
| Empirical baseline / scale numbers | `docs/benchmarks.md` (not here) |
|
|
64
|
+
| Refused shapes, dropped metadata, latent edges, rejected approaches, and the fix shape per item | `docs/out-of-scope.md` (not here) |
|
|
65
|
+
|
|
66
|
+
## Invariants enforced by code
|
|
67
|
+
|
|
68
|
+
- HNSW metric must survive rebuild. Catalog DDL strips `WITH (...)`, so the metric is read from `pragma_hnsw_index_info()` in `_capture_hnsw_recipes`. (architecture.md §ChunkHound compatibility)
|
|
69
|
+
- SQL DDL is built by string interpolation (no parameter binding); escape literals via `_escape_sql_literal`, wrap table and index names via `_quote_identifier`. (architecture.md §Compaction pipeline)
|
|
70
|
+
- Public-API exceptions (`ValueError`, `FileNotFoundError`, `FileExistsError`, `RuntimeError`, `OSError`) enumerated at README §Library Usage; refused inputs reasoned at architecture.md §Not supported (and why).
|
|
71
|
+
- Front-gate refusals run before `ATTACH dst`; on any failure after `ATTACH dst`, the partial target and its `.wal` are unlinked. (architecture.md §Compaction pipeline)
|
|
72
|
+
- Reading the source never loads its HNSW into RAM; building the destination HNSW dominates peak RAM. `--skip-hnsw` is the small-RAM unlock; `restore` is a separate-machine step. (architecture.md §RAM cost asymmetry)
|
|
73
|
+
|
|
74
|
+
## Build / verify
|
|
75
|
+
|
|
76
|
+
- Setup, local-check commands, tooling configs, CI workflow details, and release process at `CONTRIBUTING.md` §Setup, §Local checks, §CI workflows, §Release process.
|
|
77
|
+
|
|
78
|
+
## Runtime deps
|
|
79
|
+
|
|
80
|
+
- Authoritative constraints at `pyproject.toml`. Load-bearing context: `duckdb` range matches `chunkhound` to stay file-format-compatible; `duckdb-extension-vss>=1.5.2` pins `duckdb==1.5.2` transitively. Python `>=3.10,<3.14`.
|
|
@@ -0,0 +1,54 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
## [0.2.0] - 2026-05-21
|
|
4
|
+
|
|
5
|
+
### Fail-hard
|
|
6
|
+
|
|
7
|
+
- `compact_database` now refuses sources with user-defined types, generated columns, self-referential foreign keys, or HNSW indexes on non-bare-column expressions, raising `ValueError`. See architecture.md §Not supported (and why).
|
|
8
|
+
- `replace_with_compacted` documents `OSError` as a raised exception (the move from `compacted` to `source` failing even via the `shutil.move` fallback).
|
|
9
|
+
- `compact_database` and `restore_indexes` document `RuntimeError` for the case where the bundled `vss` extension binary cannot be located.
|
|
10
|
+
|
|
11
|
+
### Fixed
|
|
12
|
+
|
|
13
|
+
- The CLI surfaces `RuntimeError` from missing bundled `vss` and `OSError` from `replace_with_compacted` as clean `error:` lines instead of unhandled stack traces.
|
|
14
|
+
- The `--skip-hnsw` note now prints after the `--replace` step and points at the path the artifact lives at after the whole CLI run (`result.source` with `--replace`, `result.target` without). Previously it pointed at the `.compacted` path that `--replace` renamed away.
|
|
15
|
+
- `replace_with_compacted` falls back to `shutil.move` when the second rename fails with a cross-device error (EXDEV), so a cross-filesystem `--replace` (user-supplied target on another mount) now completes instead of crashing.
|
|
16
|
+
- The failure-cleanup block in `compact_database` now also unlinks `<target>.wal` if a CHECKPOINT step left one behind.
|
|
17
|
+
- The compact CLI surfaces spill-directory creation failures (`OSError`) as a clean `error:` line, and `compact_database` removes the spill directory when DuckDB never spills into it.
|
|
18
|
+
- DDL identifier interpolation routes through a new `_quote_identifier()` helper that doubles any embedded `"`. The previous sites double-quoted identifiers without escaping.
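The quoting rule in the last item can be sketched as follows. This is a minimal illustration, not the package's private helpers: the names `quote_identifier`, `escape_sql_literal`, and `build_hnsw_ddl` are hypothetical, and the `WITH (metric = ...)` clause assumes the `vss` extension's `CREATE INDEX ... USING HNSW` syntax.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


def escape_sql_literal(value: str) -> str:
    """Double embedded single quotes so the value is safe inside '...'."""
    return value.replace("'", "''")


def build_hnsw_ddl(index: str, table: str, column: str, metric: str) -> str:
    """Assemble a CREATE INDEX statement by interpolation (no parameter binding)."""
    return (
        f"CREATE INDEX {quote_identifier(index)} ON {quote_identifier(table)} "
        f"USING HNSW ({quote_identifier(column)}) "
        f"WITH (metric = '{escape_sql_literal(metric)}')"
    )
```

Doubling the quote character keeps a name such as `we"ird` inside a single quoted token instead of terminating the identifier early.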
|
|
19
|
+
|
|
20
|
+
### Added
|
|
21
|
+
|
|
22
|
+
- `--replace` with `--skip-hnsw` now prints a `warning:` line. The in-place file has no vector index until `restore` runs against it; the regression needs to be loud.
|
|
23
|
+
|
|
24
|
+
### Changed
|
|
25
|
+
|
|
26
|
+
- `compact_database` co-locates DuckDB spill with the target's filesystem (`temp_directory` beside the target). See architecture.md §Compaction pipeline.
|
|
27
|
+
- README narrowed the "fully generic / works on any single-schema DuckDB file" claim to ChunkHound-shaped inputs; other shapes are refused at the front gate rather than rebuilt with silent loss. The published package description (`pyproject.toml`) dropped its "(ChunkHound index or otherwise)" parenthetical to match.
|
|
28
|
+
- `docs/architecture.md` corrects the imprecise "drops HNSW on each write batch" claim to ChunkHound's actual `insert_embeddings_batch` 50-row threshold, removes the wrong COMMENT ON claim (see out-of-scope.md §Table and column comments for the actual drop behavior), and cites [duckdb/duckdb#16785](https://github.com/duckdb/duckdb/issues/16785) for the `COPY FROM DATABASE` FK race.
|
|
29
|
+
- `docs/architecture.md` §Not supported collapsed to one-line pointers at `out-of-scope.md`, so per-case refusal reasoning lives on a single surface.
|
|
30
|
+
- `docs/out-of-scope.md` promoted to the single per-topic catalog covering refused source shapes, silently-dropped metadata (HNSW tuning beyond `metric`, table and column comments), latent code edges (quoted referenced tables in `_FK_REFERENCES_RE`), and rejected alternative approaches. Each section owns both the why-not and the fix shape.
|
|
31
|
+
|
|
32
|
+
## [0.1.1] - 2026-05-20
|
|
33
|
+
|
|
34
|
+
### Packaging
|
|
35
|
+
- README, `[project.urls]` (Homepage / Repository / Issues / Changelog), `authors`, `keywords`, and trove classifiers now ship in the published package metadata. The PyPI project page renders the README and links back to the GitHub repository; 0.1.0 shipped without any of these.
|
|
36
|
+
|
|
37
|
+
## [0.1.0] - 2026-05-20
|
|
38
|
+
|
|
39
|
+
### Added
|
|
40
|
+
- Initial release of `chunkhound-index-compactor`.
|
|
41
|
+
- `chunkhound-index-compactor` CLI (Typer-based) with a `compact` default command and a `restore` subcommand, routed via `DefaultCommandGroup` so `chunkhound-index-compactor SOURCE [TARGET]` still works.
|
|
42
|
+
- `compact_database(source, target, *, skip_hnsw=False)`: rebuild a DuckDB database into a fresh file via a foreign-key-ordered streaming rebuild. Captures the source schema, recreates sequences/tables/indexes in a freshly-allocated file, computes a foreign-key-topological table order, and inserts one table at a time parent-before-child. This sidesteps the FK race that breaks `ATTACH` + `COPY FROM DATABASE` on large FK-bearing databases (e.g. ChunkHound indexes at scale) while still dropping orphaned blocks.
|
|
43
|
+
- HNSW indexes are recreated with the metric recovered from `pragma_hnsw_index_info()`. The catalog DDL strips the `WITH (...)` clause, so a verbatim rebuild would silently reset a `cosine` index to the `l2sq` default and leave it dead (queries fall back to brute force).
|
|
44
|
+
- `--skip-hnsw` flag / `skip_hnsw=True` parameter: rebuild without vector indexes (RAM-flat, smallest output) and record what was stripped in a `_compactor_hnsw_recipe` table inside the output.
|
|
45
|
+
- `restore` CLI command / `restore_indexes()` function: rebuild the stripped HNSW indexes in place from the recipe table, idempotently, on a RAM-capable machine.
|
|
46
|
+
- `replace_with_compacted()`: atomic swap with `.bak` backup.
|
|
47
|
+
- `human_size()`: binary-prefix byte formatting.
|
|
48
|
+
- `CompactionResult` and `RestoreResult` dataclasses.
|
|
49
|
+
- `--replace` flag for in-place compaction with backup.
|
|
50
|
+
- Bundled `vss.duckdb_extension` binary from `duckdb-extension-vss` is `LOAD`ed directly from disk when an HNSW index is present, so compaction of ChunkHound and other vector-search DuckDBs works offline out of the box.
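For the binary-prefix formatting mentioned for `human_size()` above, a rough sketch might look like this. The exact output format of `human_size()` is not shown in this changelog, so the function name `format_binary_size`, the unit labels, and the one-decimal rounding are all assumptions made for illustration.

```python
def format_binary_size(num_bytes: int) -> str:
    """Format a byte count with binary (1024-based) unit prefixes."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if size < 1024 or unit == units[-1]:
            if unit == "B":
                return f"{int(size)} B"
            return f"{size:.1f} {unit}"
        size /= 1024
    raise AssertionError("unreachable")
```

Each step divides by 1024 rather than 1000, which is what distinguishes binary prefixes (KiB, MiB) from decimal ones (kB, MB).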
|
|
51
|
+
|
|
52
|
+
### Fail-hard
|
|
53
|
+
- Sources with non-`main` schemas, views, or foreign-key cycles raise `ValueError` rather than silently dropping objects.
|
|
54
|
+
- On any failure after the target file is created, the partial target is unlinked. A half-written multi-GB file is worse than nothing.
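The parent-before-child ordering and the cycle refusal described in this release can be sketched with the standard library's `graphlib`. This is an illustration under assumptions, not the package's `_topological_order()`: the function name `parent_first_order` and the shape of the `references` mapping are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def parent_first_order(references: dict[str, set[str]]) -> list[str]:
    """Return table names so every referenced (parent) table precedes its children.

    `references` maps each table to the set of tables its foreign keys point at.
    """
    sorter = TopologicalSorter(references)
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the detected cycle.
        raise ValueError(f"foreign-key cycle: {' -> '.join(exc.args[1])}") from exc
```

Inserting rows in this order means a child row never arrives before the parent row its foreign key refers to, which is the property the streaming rebuild relies on.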
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
@AGENTS.md
|
|
@@ -0,0 +1,85 @@
|
|
|
1
|
+
# Contributing
|
|
2
|
+
|
|
3
|
+
## Setup
|
|
4
|
+
|
|
5
|
+
```bash
|
|
6
|
+
uv sync --extra dev # Python toolchain (mypy, pytest, ruff, typos, pre-commit)
|
|
7
|
+
npm install # Node toolchain (prettier; one-time per clone)
|
|
8
|
+
uv run pre-commit install # wires the git pre-commit hook (one-time per clone)
|
|
9
|
+
```
|
|
10
|
+
|
|
11
|
+
Python 3.10 through 3.13 are supported. macOS and Linux are tested.
|
|
12
|
+
|
|
13
|
+
The pre-commit hook runs ruff, typos, prettier, mypy, and pytest on every commit. Bypass with `git commit --no-verify` when you need to ship a WIP commit; CI still runs the same checks.
|
|
14
|
+
|
|
15
|
+
## Local checks
|
|
16
|
+
|
|
17
|
+
```bash
|
|
18
|
+
uv run pytest
|
|
19
|
+
uv run ruff check src/ tests/
|
|
20
|
+
uv run ruff format --check src/ tests/
|
|
21
|
+
uv run mypy src/
|
|
22
|
+
uv run typos
|
|
23
|
+
npx prettier --check .
|
|
24
|
+
```
|
|
25
|
+
|
|
26
|
+
Apply prettier fixes with `npm run format:fix`. CI runs the same check commands listed above.
|
|
27
|
+
|
|
28
|
+
### Tooling notes
|
|
29
|
+
|
|
30
|
+
- **Ruff**: `E W F I B C4 UP ARG SIM PTH`, line length 100, `E501` ignored.
|
|
31
|
+
- **MyPy**: strict; `tests.*` relaxed; `duckdb`, `duckdb_extension_vss` missing-imports ignored.
|
|
32
|
+
- **Pytest**: discovers `tests/`. Fixtures bundle real ChunkHound DB samples (see `tests/conftest.py`).
|
|
33
|
+
- **Typos**: config at `.typos.toml`.
|
|
34
|
+
- **Prettier**: config at `.prettierrc.json` and `.prettierignore`. Markdown is excluded because prettier mangles identifiers containing underscores and bloats markdown tables past the 100-column line limit.
|
|
35
|
+
|
|
36
|
+
## CI workflows
|
|
37
|
+
|
|
38
|
+
Three workflows live under `.github/workflows/`. All third-party actions are SHA-pinned with `# vX.Y.Z` comments so Renovate can update them later.
|
|
39
|
+
|
|
40
|
+
### ci.yml
|
|
41
|
+
|
|
42
|
+
Runs on every push to `main` and every PR targeting `main`. Six parallel jobs:
|
|
43
|
+
|
|
44
|
+
| Job | What |
|
|
45
|
+
|-------------|---------------------------------------------------------------------------------------|
|
|
46
|
+
| `lint` | `ruff check` and `ruff format --check` |
|
|
47
|
+
| `typecheck` | `mypy src/` |
|
|
48
|
+
| `test` | `pytest` with coverage on a Python 3.10 / 3.11 / 3.12 / 3.13 matrix (`ubuntu-latest`) |
|
|
49
|
+
| `typos` | `typos` against the repo |
|
|
50
|
+
| `prettier` | `prettier --check .` |
|
|
51
|
+
| `build` | `uv build`; uploads wheel + sdist as a `dist` artifact (14-day retention) |
|
|
52
|
+
|
|
53
|
+
Coverage is reported but not gated. The `build` job runs independently of the others, so PR reviewers can download a wheel even when other jobs fail.
|
|
54
|
+
|
|
55
|
+
### rolling.yml
|
|
56
|
+
|
|
57
|
+
After `ci.yml` succeeds on `main`, deletes the existing `rolling` tag plus release and recreates them at the new HEAD SHA. Marked prerelease and not-latest so it never shadows a tagged release.
|
|
58
|
+
|
|
59
|
+
Install the current main-branch build from the rolling asset:
|
|
60
|
+
|
|
61
|
+
```bash
|
|
62
|
+
uv tool install https://github.com/it-bens/chunkhound-index-compactor/releases/download/rolling/chunkhound_index_compactor-<version>-py3-none-any.whl
|
|
63
|
+
```
|
|
64
|
+
|
|
65
|
+
### release.yml
|
|
66
|
+
|
|
67
|
+
Triggers on a `v*.*.*` tag push or manual `workflow_dispatch` against a tag ref.
|
|
68
|
+
|
|
69
|
+
1. Verifies the ref is a tag.
|
|
70
|
+
2. Verifies the tag (minus the `v` prefix) matches `pyproject.toml`'s `version`.
|
|
71
|
+
3. Builds wheel + sdist.
|
|
72
|
+
4. Publishes to PyPI via Trusted Publisher (OIDC; no PyPI token in the repo).
|
|
73
|
+
5. Creates a GitHub Release with the wheel + sdist attached. Release notes are not auto-generated; edit them on GitHub afterward.
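Step 2 above (the tag must match the `pyproject.toml` version) can be sketched as below. The workflow's real implementation is not shown here, so this is only an illustration: the function name `tag_matches_version` is hypothetical, and the regex assumes the first `version = "..."` line at the start of a line is the `[project]` version. A TOML parser such as `tomllib` (Python 3.11+) would be more robust.

```python
import re


def tag_matches_version(tag: str, pyproject_text: str) -> bool:
    """True when `tag` minus its leading 'v' equals the declared version."""
    match = re.search(r'(?m)^version\s*=\s*"([^"]+)"', pyproject_text)
    if match is None:
        raise ValueError("no version field found")
    return tag.removeprefix("v") == match.group(1)
```

Failing the release early on a mismatch prevents publishing a wheel whose metadata version disagrees with the Git tag.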
|
|
74
|
+
|
|
75
|
+
## Release process
|
|
76
|
+
|
|
77
|
+
To cut a release:
|
|
78
|
+
|
|
79
|
+
1. Bump `version` in `pyproject.toml` (for example, `0.1.0` to `0.2.0`) and update `CHANGELOG.md`.
|
|
80
|
+
2. Commit, push, and wait for CI to pass on `main`.
|
|
81
|
+
3. Tag and push: `git tag v0.2.0 && git push origin v0.2.0`.
|
|
82
|
+
4. The release workflow publishes to PyPI and creates a GitHub Release.
|
|
83
|
+
5. Open the Release on GitHub and write the release notes.
|
|
84
|
+
|
|
85
|
+
To re-trigger from a tag manually (for example, after a transient PyPI failure), open GitHub Actions, select the Release workflow, choose Run workflow, and pick the tag from the ref dropdown.
|
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 Martin Bens
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
|
@@ -0,0 +1,147 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: chunkhound-index-compactor
|
|
3
|
+
Version: 0.2.0
|
|
4
|
+
Summary: Compact a bloated ChunkHound DuckDB index by rebuilding it into a fresh file
|
|
5
|
+
Project-URL: Homepage, https://github.com/it-bens/chunkhound-index-compactor
|
|
6
|
+
Project-URL: Repository, https://github.com/it-bens/chunkhound-index-compactor
|
|
7
|
+
Project-URL: Issues, https://github.com/it-bens/chunkhound-index-compactor/issues
|
|
8
|
+
Project-URL: Changelog, https://github.com/it-bens/chunkhound-index-compactor/blob/main/CHANGELOG.md
|
|
9
|
+
Author-email: Martin Bens <martin.bens@it-bens.de>
|
|
10
|
+
License-Expression: MIT
|
|
11
|
+
License-File: LICENSE
|
|
12
|
+
Keywords: chunkhound,compact,database,duckdb,hnsw,vector-index
|
|
13
|
+
Classifier: Development Status :: 4 - Beta
|
|
14
|
+
Classifier: Intended Audience :: Developers
|
|
15
|
+
Classifier: Operating System :: OS Independent
|
|
16
|
+
Classifier: Programming Language :: Python :: 3
|
|
17
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
18
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
19
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
20
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
21
|
+
Classifier: Topic :: Database
|
|
22
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
23
|
+
Requires-Python: <3.14,>=3.10
|
|
24
|
+
Requires-Dist: duckdb-extension-vss>=1.5.2
|
|
25
|
+
Requires-Dist: duckdb<1.5.3.dev0,>=1.4.0
|
|
26
|
+
Requires-Dist: typer>=0.25
|
|
27
|
+
Provides-Extra: dev
|
|
28
|
+
Requires-Dist: mypy>=2.1; extra == 'dev'
|
|
29
|
+
Requires-Dist: pre-commit>=4.5; extra == 'dev'
|
|
30
|
+
Requires-Dist: pytest-cov>=7.0; extra == 'dev'
|
|
31
|
+
Requires-Dist: pytest>=9.0; extra == 'dev'
|
|
32
|
+
Requires-Dist: ruff>=0.15; extra == 'dev'
|
|
33
|
+
Requires-Dist: typos>=1.46; extra == 'dev'
|
|
34
|
+
Description-Content-Type: text/markdown
|
|
35
|
+
|
|
36
|
+
# ChunkHound Index Compactor
|
|
37
|
+
|
|
38
|
+
Compact a [DuckDB](https://duckdb.org) database by rebuilding it into a fresh file. The motivating and supported use case is shrinking a bloated [ChunkHound](https://github.com/chunkhound/chunkhound) index, whose drop-and-recreate HNSW churn (above its 50-row write-batch threshold) leaves large amounts of orphaned-but-counted blocks. The rebuild pipeline is structurally generic and works on other single-schema DuckDB files, but only ChunkHound-shaped inputs are promised: any shape outside that scope is refused at the front gate (see the Not Supported section below) rather than silently dropped or rebuilt with loss.
|
|
39
|
+
|
|
40
|
+
That bloat comes back as ChunkHound keeps indexing, so compaction is periodic maintenance rather than a one-time cleanup.
|
|
41
|
+
|
|
42
|
+
## ⚡ Quick Start
|
|
43
|
+
|
|
44
|
+
```bash
|
|
45
|
+
uvx chunkhound-index-compactor path/to/db.duckdb
|
|
46
|
+
# writes path/to/db.duckdb.compacted
|
|
47
|
+
|
|
48
|
+
uvx chunkhound-index-compactor path/to/db.duckdb --replace
|
|
49
|
+
# swaps in the compacted copy and keeps the original at path/to/db.duckdb.bak
|
|
50
|
+
|
|
51
|
+
uvx chunkhound-index-compactor path/to/db.duckdb --skip-hnsw
|
|
52
|
+
# skips rebuilding vector indexes (RAM-flat, smallest output); restore them later
|
|
53
|
+
uvx chunkhound-index-compactor restore path/to/db.duckdb.compacted
|
|
54
|
+
```
|
|
55
|
+
|
|
56
|
+
The source is opened read-only, but an active writer holds the file lock. Close any process writing to the database before running.
|
|
57
|
+
|
|
58
|
+
## 🖥️ CLI Usage
|
|
59
|
+
|
|
60
|
+
```
|
|
61
|
+
$ chunkhound-index-compactor --help
|
|
62
|
+
Usage: chunkhound-index-compactor [OPTIONS] COMMAND [ARGS]...
|
|
63
|
+
|
|
64
|
+
Commands:
|
|
65
|
+
compact Compact a DuckDB database by rebuilding it into a fresh file. (default)
|
|
66
|
+
restore Rebuild HNSW vector indexes in a --skip-hnsw artifact, in place.
|
|
67
|
+
```
|
|
68
|
+
|
|
69
|
+
A bare invocation routes to `compact`, so `chunkhound-index-compactor SOURCE` still works:
|
|
70
|
+
|
|
71
|
+
```
|
|
72
|
+
chunkhound-index-compactor SOURCE [TARGET] [--replace] [--skip-hnsw]
|
|
73
|
+
chunkhound-index-compactor restore DATABASE
|
|
74
|
+
```
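The bare-invocation routing can be pictured with a small plain-Python sketch. The actual `DefaultCommandGroup` is a Typer/Click group whose code is not shown here, so the function name `route_to_default` and its handling of `--help` are assumptions made for illustration.

```python
def route_to_default(argv: list[str], commands: set[str], default: str) -> list[str]:
    """Prepend `default` unless argv already names a command or asks for help."""
    if not argv or argv[0] in commands or argv[0] in {"-h", "--help"}:
        return list(argv)
    return [default, *argv]
```

With `commands={"compact", "restore"}` and `default="compact"`, the argument list `["db.duckdb", "--replace"]` is dispatched as if `compact` had been typed first, while `["restore", "db.duckdb.compacted"]` passes through unchanged.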
|
|
75
|
+
|
|
76
|
+
| Argument / Option | Meaning |
|
|
77
|
+
|-------------------|-----------------------------------------------------------------------------------|
|
|
78
|
+
| `SOURCE` | Path to the existing DuckDB file (required) |
|
|
79
|
+
| `TARGET` | Path for the compacted output [default: `<source>.compacted`] |
|
|
80
|
+
| `--replace` | After success, replace source with the compacted file (original → `<source>.bak`) |
|
|
81
|
+
| `--skip-hnsw` | Do not rebuild vector indexes; write a recipe table for later `restore` |
|
|
82
|
+
|
|
83
|
+
With `--skip-hnsw`, the output has no vector index and falls back to a brute-force scan (correct, just unaccelerated) until you run `restore`. Rebuilding the HNSW is the memory-dominant step, so `--skip-hnsw` lets you compact on a small machine and `restore` on a RAM-capable one. See [docs/benchmarks.md](docs/benchmarks.md) for peak-RAM numbers and [docs/architecture.md §RAM cost asymmetry](docs/architecture.md#ram-cost-asymmetry) for why.
|
|
84
|
+
|
|
85
|
+
## 🐍 Library Usage
|
|
86
|
+
|
|
87
|
+
```python
|
|
88
|
+
from pathlib import Path
|
|
89
|
+
from chunkhound_index_compactor import compact_database, restore_indexes, replace_with_compacted
|
|
90
|
+
|
|
91
|
+
result = compact_database(Path("big.duckdb"), Path("small.duckdb"))
|
|
92
|
+
print(f"{result.source_size} -> {result.target_size} ({result.delta_pct:+.1f}%)")
|
|
93
|
+
|
|
94
|
+
# Small-RAM path: skip the vector index, restore it later on a bigger machine.
|
|
95
|
+
compact_database(Path("big.duckdb"), Path("small.duckdb"), skip_hnsw=True)
|
|
96
|
+
restored = restore_indexes(Path("small.duckdb"))
|
|
97
|
+
print(f"restored: {restored.restored}")
|
|
98
|
+
|
|
99
|
+
# Optional: swap in place with .bak backup
|
|
100
|
+
backup = replace_with_compacted(result.source, result.target)
|
|
101
|
+
```
|
|
102
|
+
|
|
103
|
+
`compact_database()` raises:
|
|
104
|
+
- `ValueError`: `target` resolves to the same path as `source`, the FK graph has a cycle, or the source has a shape refused at the front gate (see the Not Supported section below).
|
|
105
|
+
- `FileNotFoundError`: `source` does not exist.
|
|
106
|
+
- `FileExistsError`: `target` already exists.
|
|
107
|
+
- `RuntimeError`: the bundled `vss` extension binary cannot be located (only reachable if the source contains an HNSW index).
|
|
108
|
+
|
|
109
|
+
`restore_indexes()` raises:
|
|
110
|
+
- `FileNotFoundError`: `database` does not exist.
|
|
111
|
+
- `ValueError`: `database` has no `_compactor_hnsw_recipe` table (not a `--skip-hnsw` artifact).
|
|
112
|
+
- `RuntimeError`: the bundled `vss` extension binary cannot be located.
|
|
113
|
+
|
|
114
|
+
`replace_with_compacted()` raises:
|
|
115
|
+
- `FileNotFoundError`: `source` or `compacted` is missing.
|
|
116
|
+
- `FileExistsError`: `<source>.bak` already exists (it refuses to overwrite an existing backup).
|
|
117
|
+
- `OSError`: the move from `compacted` to `source` fails even via the cross-filesystem fallback (`shutil.move`).
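The swap-with-backup behavior and its cross-filesystem fallback can be sketched as follows. This is an approximation under assumptions rather than the package's `replace_with_compacted()`: the name `swap_with_backup` and the `suffix` parameter are hypothetical, and whether the real function restores the backup after a failed move is not shown here.

```python
import errno
import os
import shutil
from pathlib import Path


def swap_with_backup(source: Path, compacted: Path, suffix: str = ".bak") -> Path:
    """Move `source` aside to a backup, then move `compacted` into its place."""
    backup = source.with_name(source.name + suffix)
    for path in (source, compacted):
        if not path.exists():
            raise FileNotFoundError(path)
    if backup.exists():
        raise FileExistsError(backup)  # never overwrite an existing backup

    os.rename(source, backup)
    try:
        os.rename(compacted, source)
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        # Cross-filesystem rename is not possible: fall back to copy-then-delete.
        shutil.move(str(compacted), str(source))
    return backup
```

A plain rename is atomic only within one filesystem, which is why the EXDEV case needs a separate, non-atomic fallback.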
|
|
118
|
+
|
|
119
|
+
## 🚫 Not Supported
|
|
120
|
+
|
|
121
|
+
The tool fails hard rather than silently dropping anything it cannot reproduce.
|
|
122
|
+
|
|
123
|
+
- Non-`main` schemas and views (raise `ValueError`).
|
|
124
|
+
- User-defined types, generated columns, self-referential foreign keys, and HNSW indexes on non-bare-column expressions (raise `ValueError`).
|
|
125
|
+
- Foreign-key cycles among tables (raise `ValueError`).
|
|
126
|
+
- HNSW tuning parameters other than `metric` (`M`, `M0`, `ef_construction`, `ef_search`); they are not recoverable from a built index and are rebuilt at the `vss` defaults.
|
|
127
|
+
- Table and column comments are not carried across the rebuild.
|
|
128
|
+
|
|
129
|
+
See [docs/architecture.md](docs/architecture.md#not-supported-and-why) for the reasoning, and [docs/out-of-scope.md](docs/out-of-scope.md) for approaches considered and not pursued.
|
|
130
|
+
|
|
131
|
+
## 🏗️ Development
|
|
132
|
+
|
|
133
|
+
Setup, local checks, CI, and release process: [CONTRIBUTING.md](CONTRIBUTING.md).
|
|
134
|
+
|
|
135
|
+
## ⚖️ License
|
|
136
|
+
|
|
137
|
+
MIT
|
|
138
|
+
|
|
139
|
+
---
|
|
140
|
+
|
|
141
|
+
> [!NOTE]
|
|
142
|
+
> Yes, an AI wrote this README. And the code, the docs, the tests, and
|
|
143
|
+
> the `.claude/skills` it now uses to write the next round. Yes, a human
|
|
144
|
+
> told it to keep the emojis. The human has ADHD, which, as it turns
|
|
145
|
+
> out, means his brain was already doing attention re-routing and
|
|
146
|
+
> context-window thrashing before LLMs made it cool. They call him ...
|
|
147
|
+
> LLMartin. The emojis are a feature.
|
|
@@ -0,0 +1,112 @@
|
|
|
1
|
+
# ChunkHound Index Compactor
|
|
2
|
+
|
|
3
|
+
Compact a [DuckDB](https://duckdb.org) database by rebuilding it into a fresh file. The motivating and supported use case is shrinking a bloated [ChunkHound](https://github.com/chunkhound/chunkhound) index, whose drop-and-recreate HNSW churn (above its 50-row write-batch threshold) leaves large numbers of orphaned-but-counted blocks. The rebuild pipeline is structurally generic and works on other single-schema DuckDB files, but only ChunkHound-shaped inputs are promised. Any shape outside that scope is refused at the front gate (see the Not Supported section below) rather than silently dropped or rebuilt with loss.
|
|
4
|
+
|
|
5
|
+
That bloat comes back as ChunkHound keeps indexing, so compaction is periodic maintenance rather than a one-time cleanup.
|
|
6
|
+
|
|
7
|
+
## ⚡ Quick Start
|
|
8
|
+
|
|
9
|
+
```bash
|
|
10
|
+
uvx chunkhound-index-compactor path/to/db.duckdb
|
|
11
|
+
# writes path/to/db.duckdb.compacted
|
|
12
|
+
|
|
13
|
+
uvx chunkhound-index-compactor path/to/db.duckdb --replace
|
|
14
|
+
# swaps in the compacted copy and keeps the original at path/to/db.duckdb.bak
|
|
15
|
+
|
|
16
|
+
uvx chunkhound-index-compactor path/to/db.duckdb --skip-hnsw
|
|
17
|
+
# skips rebuilding vector indexes (RAM-flat, smallest output); restore them later
|
|
18
|
+
uvx chunkhound-index-compactor restore path/to/db.duckdb.compacted
|
|
19
|
+
```
|
|
20
|
+
|
|
21
|
+
The source is opened read-only, but an active writer holds the file lock. Close any process writing to the database before running.
|
|
22
|
+
|
|
23
|
+
## 🖥️ CLI Usage
|
|
24
|
+
|
|
25
|
+
```
|
|
26
|
+
$ chunkhound-index-compactor --help
|
|
27
|
+
Usage: chunkhound-index-compactor [OPTIONS] COMMAND [ARGS]...
|
|
28
|
+
|
|
29
|
+
Commands:
|
|
30
|
+
compact Compact a DuckDB database by rebuilding it into a fresh file. (default)
|
|
31
|
+
restore Rebuild HNSW vector indexes in a --skip-hnsw artifact, in place.
|
|
32
|
+
```
|
|
33
|
+
|
|
34
|
+
A bare invocation routes to `compact`, so `chunkhound-index-compactor SOURCE` still works:
|
|
35
|
+
|
|
36
|
+
```
|
|
37
|
+
chunkhound-index-compactor SOURCE [TARGET] [--replace] [--skip-hnsw]
|
|
38
|
+
chunkhound-index-compactor restore DATABASE
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
| Argument / Option | Meaning |
|
|
42
|
+
|-------------------|-----------------------------------------------------------------------------------|
|
|
43
|
+
| `SOURCE` | Path to the existing DuckDB file (required) |
|
|
44
|
+
| `TARGET` | Path for the compacted output [default: `<source>.compacted`] |
|
|
45
|
+
| `--replace` | After success, replace source with the compacted file (original → `<source>.bak`) |
|
|
46
|
+
| `--skip-hnsw` | Do not rebuild vector indexes; write a recipe table for later `restore` |
|
|
47
|
+
|
|
48
|
+
With `--skip-hnsw`, the output has no vector index and falls back to a brute-force scan (correct, just unaccelerated) until you run `restore`. Rebuilding the HNSW is the memory-dominant step, so `--skip-hnsw` lets you compact on a small machine and `restore` on a RAM-capable one. See [docs/benchmarks.md](docs/benchmarks.md) for peak-RAM numbers and [docs/architecture.md §RAM cost asymmetry](docs/architecture.md#ram-cost-asymmetry) for why.
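For intuition about the fallback described above: without a vector index, a nearest-neighbour query is answered by scoring every stored vector, which gives exact results at a cost that grows linearly with row count. The following is a pure-Python sketch of that idea only. It is not DuckDB's or `vss`'s implementation, and the function name `brute_force_top_k`, the input shape, and the choice of Euclidean distance are assumptions made for illustration.

```python
import heapq
import math


def brute_force_top_k(query, rows, k=3):
    """Return the ids of the k rows nearest to `query`, scanning every row.

    `rows` maps a row id to its vector. Hypothetical helper, for intuition only.
    """
    # Score every stored vector: this full scan is what an HNSW index avoids.
    scored = ((math.dist(query, vec), row_id) for row_id, vec in rows.items())
    # Keep only the k smallest distances, nearest first.
    return [row_id for _, row_id in heapq.nsmallest(k, scored)]


rows = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
print(brute_force_top_k([0.9, 0.0], rows, k=2))
```

The answer is the same one an index-backed query is meant to approximate; only the amount of work per query differs.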
|
|
49
|
+
|
|
50
|
+
## 🐍 Library Usage
|
|
51
|
+
|
|
52
|
+
```python
|
|
53
|
+
from pathlib import Path
|
|
54
|
+
from chunkhound_index_compactor import compact_database, restore_indexes, replace_with_compacted
|
|
55
|
+
|
|
56
|
+
result = compact_database(Path("big.duckdb"), Path("small.duckdb"))
|
|
57
|
+
print(f"{result.source_size} -> {result.target_size} ({result.delta_pct:+.1f}%)")
|
|
58
|
+
|
|
59
|
+
# Small-RAM path: skip the vector index, restore it later on a bigger machine.
|
|
60
|
+
compact_database(Path("big.duckdb"), Path("small.duckdb"), skip_hnsw=True)
|
|
61
|
+
restored = restore_indexes(Path("small.duckdb"))
|
|
62
|
+
print(f"restored: {restored.restored}")
|
|
63
|
+
|
|
64
|
+
# Optional: swap in place with .bak backup
|
|
65
|
+
backup = replace_with_compacted(result.source, result.target)
|
|
66
|
+
```
|
|
67
|
+
|
|
68
|
+
`compact_database()` raises:
|
|
69
|
+
- `ValueError`: `target` resolves to the same path as `source`, the FK graph has a cycle, or the source has a shape refused at the front gate (see the Not Supported section below).
|
|
70
|
+
- `FileNotFoundError`: `source` does not exist.
|
|
71
|
+
- `FileExistsError`: `target` already exists.
|
|
72
|
+
- `RuntimeError`: the bundled `vss` extension binary cannot be located (only reachable if the source contains an HNSW index).
|
|
73
|
+
|
|
74
|
+
`restore_indexes()` raises:
|
|
75
|
+
- `FileNotFoundError`: `database` does not exist.
|
|
76
|
+
- `ValueError`: `database` has no `_compactor_hnsw_recipe` table (not a `--skip-hnsw` artifact).
|
|
77
|
+
- `RuntimeError`: the bundled `vss` extension binary cannot be located.
|
|
78
|
+
|
|
79
|
+
`replace_with_compacted()` raises:
|
|
80
|
+
- `FileNotFoundError`: `source` or `compacted` is missing.
|
|
81
|
+
- `FileExistsError`: `<source>.bak` already exists (it refuses to overwrite an existing backup).
|
|
82
|
+
- `OSError`: the move from `compacted` to `source` fails even via the cross-filesystem fallback (`shutil.move`).
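The error contract listed above for `replace_with_compacted()` can be pictured with a small standard-library sketch. This is an illustration of the documented behaviour, not the package's actual code: the name `swap_with_backup` is hypothetical, and the order of the checks is an assumption.

```python
import shutil
from pathlib import Path


def swap_with_backup(source: Path, compacted: Path) -> Path:
    """Move `source` to `<source>.bak`, then move `compacted` into its place.

    Hypothetical helper mirroring the documented error contract.
    """
    for path in (source, compacted):
        if not path.exists():
            raise FileNotFoundError(path)
    backup = source.with_name(source.name + ".bak")
    if backup.exists():
        # Refuse to overwrite an existing backup.
        raise FileExistsError(backup)
    source.rename(backup)
    # shutil.move falls back to copy-and-delete across filesystems;
    # any remaining failure surfaces as OSError.
    shutil.move(str(compacted), str(source))
    return backup
```

Checking for an existing `.bak` before touching anything means a failed precondition leaves both files exactly where they were.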
|
|
83
|
+
|
|
84
|
+
## 🚫 Not Supported
|
|
85
|
+
|
|
86
|
+
The tool fails hard rather than silently dropping anything it cannot reproduce.
|
|
87
|
+
|
|
88
|
+
- Non-`main` schemas and views (raise `ValueError`).
|
|
89
|
+
- User-defined types, generated columns, self-referential foreign keys, and HNSW indexes on non-bare-column expressions (raise `ValueError`).
|
|
90
|
+
- Foreign-key cycles among tables (raise `ValueError`).
|
|
91
|
+
- HNSW tuning parameters other than `metric` (`M`, `M0`, `ef_construction`, `ef_search`): these are not recoverable from a built index, so rebuilt indexes use the `vss` defaults.
|
|
92
|
+
- Table and column comments are not carried across the rebuild.
|
|
93
|
+
|
|
94
|
+
See [docs/architecture.md](docs/architecture.md#not-supported-and-why) for the reasoning, and [docs/out-of-scope.md](docs/out-of-scope.md) for approaches considered and not pursued.
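As a rough illustration of why the foreign-key restrictions in the list above matter: a rebuild has to create each referenced table before the tables that point at it, and no such order exists when the FK graph contains a cycle. The sketch below uses the standard library's `graphlib` to show one way such a check could work. The function name `creation_order` and the input shape (table name mapped to the set of tables it references) are assumptions for illustration, not the package's API.

```python
from graphlib import CycleError, TopologicalSorter


def creation_order(fk_refs: dict[str, set[str]]) -> list[str]:
    """Order tables so that every referenced table is created first.

    Hypothetical helper: `fk_refs` maps a table to the tables it references.
    """
    for table, parents in fk_refs.items():
        if table in parents:
            raise ValueError(f"self-referential foreign key on {table!r}")
    try:
        # static_order() yields each node only after all of its predecessors.
        return list(TopologicalSorter(fk_refs).static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the detected cycle.
        cycle = " -> ".join(exc.args[1])
        raise ValueError(f"foreign-key cycle: {cycle}") from exc


print(creation_order({"files": set(), "chunks": {"files"}, "embeddings": {"chunks"}}))
```

Raising `ValueError` up front, before any data is copied, matches the fail-hard stance described in this section.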
|
|
95
|
+
|
|
96
|
+
## 🏗️ Development
|
|
97
|
+
|
|
98
|
+
Setup, local checks, CI, and release process: [CONTRIBUTING.md](CONTRIBUTING.md).
|
|
99
|
+
|
|
100
|
+
## ⚖️ License
|
|
101
|
+
|
|
102
|
+
MIT
|
|
103
|
+
|
|
104
|
+
---
|
|
105
|
+
|
|
106
|
+
> [!NOTE]
|
|
107
|
+
> Yes, an AI wrote this README. And the code, the docs, the tests, and
|
|
108
|
+
> the `.claude/skills` it now uses to write the next round. Yes, a human
|
|
109
|
+
> told it to keep the emojis. The human has ADHD, which, as it turns
|
|
110
|
+
> out, means his brain was already doing attention re-routing and
|
|
111
|
+
> context-window thrashing before LLMs made it cool. They call him ...
|
|
112
|
+
> LLMartin. The emojis are a feature.
|