donotreadagain 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (56)
  1. donotreadagain-0.1.0/.github/ISSUE_TEMPLATE/bug_report.md +27 -0
  2. donotreadagain-0.1.0/.github/ISSUE_TEMPLATE/feature_request.md +19 -0
  3. donotreadagain-0.1.0/.github/PULL_REQUEST_TEMPLATE.md +16 -0
  4. donotreadagain-0.1.0/.github/workflows/ci.yml +30 -0
  5. donotreadagain-0.1.0/.gitignore +32 -0
  6. donotreadagain-0.1.0/CHANGELOG.md +34 -0
  7. donotreadagain-0.1.0/CODE_OF_CONDUCT.md +33 -0
  8. donotreadagain-0.1.0/CONTRIBUTING.md +71 -0
  9. donotreadagain-0.1.0/LICENSE +21 -0
  10. donotreadagain-0.1.0/MILESTONES.md +153 -0
  11. donotreadagain-0.1.0/PKG-INFO +151 -0
  12. donotreadagain-0.1.0/README.md +126 -0
  13. donotreadagain-0.1.0/SECURITY.md +54 -0
  14. donotreadagain-0.1.0/SKILL.md +70 -0
  15. donotreadagain-0.1.0/pyproject.toml +40 -0
  16. donotreadagain-0.1.0/qna.md +35 -0
  17. donotreadagain-0.1.0/spec/dnr-0.1.md +124 -0
  18. donotreadagain-0.1.0/spec/dnr.schema.json +140 -0
  19. donotreadagain-0.1.0/spec/vectors/sample.txt +3 -0
  20. donotreadagain-0.1.0/spec/vectors/sample.wav +0 -0
  21. donotreadagain-0.1.0/spec/vectors/vectors.json +10 -0
  22. donotreadagain-0.1.0/src/dnr/__init__.py +11 -0
  23. donotreadagain-0.1.0/src/dnr/bootstrap.py +22 -0
  24. donotreadagain-0.1.0/src/dnr/cli.py +469 -0
  25. donotreadagain-0.1.0/src/dnr/embed.py +397 -0
  26. donotreadagain-0.1.0/src/dnr/formats.py +53 -0
  27. donotreadagain-0.1.0/src/dnr/guide.py +44 -0
  28. donotreadagain-0.1.0/src/dnr/hashing.py +156 -0
  29. donotreadagain-0.1.0/src/dnr/index.py +494 -0
  30. donotreadagain-0.1.0/src/dnr/ingest.py +216 -0
  31. donotreadagain-0.1.0/src/dnr/keyring.py +45 -0
  32. donotreadagain-0.1.0/src/dnr/record.py +42 -0
  33. donotreadagain-0.1.0/src/dnr/schema.py +79 -0
  34. donotreadagain-0.1.0/src/dnr/signing.py +71 -0
  35. donotreadagain-0.1.0/src/dnr/skill.py +94 -0
  36. donotreadagain-0.1.0/src/dnr/transcribe.py +133 -0
  37. donotreadagain-0.1.0/tests/conftest.py +76 -0
  38. donotreadagain-0.1.0/tests/test_coverage.py +81 -0
  39. donotreadagain-0.1.0/tests/test_embed.py +59 -0
  40. donotreadagain-0.1.0/tests/test_fixes.py +103 -0
  41. donotreadagain-0.1.0/tests/test_guide.py +32 -0
  42. donotreadagain-0.1.0/tests/test_hashing.py +38 -0
  43. donotreadagain-0.1.0/tests/test_index.py +168 -0
  44. donotreadagain-0.1.0/tests/test_ingest.py +66 -0
  45. donotreadagain-0.1.0/tests/test_init.py +49 -0
  46. donotreadagain-0.1.0/tests/test_query_features.py +81 -0
  47. donotreadagain-0.1.0/tests/test_query_filters.py +116 -0
  48. donotreadagain-0.1.0/tests/test_query_memory.py +71 -0
  49. donotreadagain-0.1.0/tests/test_record.py +27 -0
  50. donotreadagain-0.1.0/tests/test_robustness.py +58 -0
  51. donotreadagain-0.1.0/tests/test_schema.py +36 -0
  52. donotreadagain-0.1.0/tests/test_signing.py +31 -0
  53. donotreadagain-0.1.0/tests/test_strip.py +31 -0
  54. donotreadagain-0.1.0/tests/test_text.py +47 -0
  55. donotreadagain-0.1.0/tests/test_vectors.py +16 -0
  56. donotreadagain-0.1.0/vision.md +383 -0
@@ -0,0 +1,27 @@
1
+ ---
2
+ name: Bug report
3
+ about: Something didn't work as expected
4
+ labels: bug
5
+ ---
6
+
7
+ **What happened**
8
+ A clear description of the bug.
9
+
10
+ **To reproduce**
11
+ Exact `dnr` command(s) and, if possible, a minimal file/folder that triggers it.
12
+
13
+ ```
14
+ $ dnr ...
15
+ ```
16
+
17
+ **Expected**
18
+ What you expected instead.
19
+
20
+ **Environment**
21
+ - dnr version (`dnr --version`):
22
+ - Python version (`python --version`):
23
+ - OS:
24
+ - File type involved (PDF / PNG / JPEG / MP3 / docx / …):
25
+
26
+ **Notes**
27
+ Anything else — e.g. is the transcript flagged low-quality by `dnr status`? Is it in-file or db-only?
@@ -0,0 +1,19 @@
1
+ ---
2
+ name: Feature request
3
+ about: Suggest an idea or improvement
4
+ labels: enhancement
5
+ ---
6
+
7
+ **The problem / use case**
8
+ What are you trying to do that's hard or impossible today?
9
+
10
+ **Proposed solution**
11
+ What you'd like to see (a flag, a command, a carrier, a behavior).
12
+
13
+ **Fit with dnr's principles**
14
+ dnr is a *deterministic substrate* — it doesn't infer metadata or do semantic/fuzzy work (that's the
15
+ agent's job), owns no model, and keeps the file as the source of truth. Does your idea fit, or is it a
16
+ conscious exception? (See [qna.md](../qna.md) for decisions already settled.)
17
+
18
+ **Alternatives considered**
19
+ Anything you tried or ruled out.
@@ -0,0 +1,16 @@
1
+ ## What & why
2
+
3
+ Briefly: what does this change, and why?
4
+
5
+ ## Checklist
6
+
7
+ - [ ] Tests added/updated and `pytest` is green locally
8
+ - [ ] No new hard model dependency in the core (dnr owns no model)
9
+ - [ ] If a new carrier: `content_hash` is invariant under embed + re-embed is byte-stable (round-trip test added)
10
+ - [ ] If agent-facing behavior changed: regenerated `SKILL.md` (`dnr skill > SKILL.md`)
11
+ - [ ] Doesn't make dnr *infer* metadata (dates/parties/topics) or do fuzzy/semantic search — that stays the agent's job
12
+ - [ ] Docs updated if needed (README / spec / qna)
13
+
14
+ ## Notes
15
+
16
+ Link any relevant `qna.md` decision or spec section. Call out anything that rubs against a design principle.
@@ -0,0 +1,30 @@
1
+ name: ci
2
+
3
+ on:
4
+ push:
5
+ branches: [main]
6
+ pull_request:
7
+
8
+ jobs:
9
+ test:
10
+ runs-on: ubuntu-latest
11
+ strategy:
12
+ fail-fast: false
13
+ matrix:
14
+ python-version: ["3.10", "3.11", "3.12", "3.13"]
15
+ steps:
16
+ - uses: actions/checkout@v4
17
+ - uses: actions/setup-python@v5
18
+ with:
19
+ python-version: ${{ matrix.python-version }}
20
+ - name: Install
21
+ run: |
22
+ python -m pip install --upgrade pip
23
+ pip install -e ".[dev]"
24
+ - name: Test
25
+ run: pytest -q
26
+ - name: Spec ⇄ schema in sync
27
+ run: |
28
+ python -c "import json,sys; from dnr import schema; \
29
+ cur=json.load(open('spec/dnr.schema.json')); \
30
+ sys.exit(0 if cur==schema.SCHEMA else 'spec/dnr.schema.json is stale — regenerate from dnr.schema.SCHEMA')"
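The final CI step above packs the spec-versus-schema comparison into a `python -c` one-liner. A standalone sketch of the same check might look like the following. The helper name `schema_in_sync` and the demo schema are hypothetical; in the real workflow the in-code value would come from `dnr.schema.SCHEMA`.

```python
import json
import tempfile
from pathlib import Path


def schema_in_sync(spec_path, in_code_schema):
    """Return True when the JSON file on disk equals the in-code schema dict."""
    with open(spec_path, encoding="utf-8") as fh:
        on_disk = json.load(fh)
    # Dict equality ignores key order and whitespace, so only content drift fails.
    return on_disk == in_code_schema


# Demo with a stand-in schema (hypothetical, not the real dnr schema).
demo_schema = {"type": "object", "required": ["content_hash"]}
with tempfile.TemporaryDirectory() as tmp:
    spec_file = Path(tmp) / "dnr.schema.json"
    spec_file.write_text(json.dumps(demo_schema, indent=2), encoding="utf-8")
    fresh = schema_in_sync(spec_file, demo_schema)
    stale = schema_in_sync(spec_file, {**demo_schema, "required": []})

print(fresh, stale)
```

Comparing parsed objects rather than raw file bytes means a reformatted but equivalent `spec/dnr.schema.json` would still pass.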
@@ -0,0 +1,32 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.egg-info/
5
+ .eggs/
6
+ dist/
7
+ build/
8
+ .venv/
9
+ venv/
10
+ .pytest_cache/
11
+ .ruff_cache/
12
+ .mypy_cache/
13
+
14
+ # uv
15
+ uv.lock.*
16
+ .uv/
17
+
18
+ # dnr index — regenerable cache (safe to ignore; can travel as a warm cache if desired)
19
+ .dnr.db
20
+ .dnr.db-wal
21
+ .dnr.db-shm
22
+ *.dnr.json
23
+
24
+ # editor / OS
25
+ .DS_Store
26
+ .idea/
27
+ .vscode/
28
+
29
+ # scratch / experiments (local only, never committed)
30
+ *.tmp
31
+ /scratch/
32
+ /experiments/
@@ -0,0 +1,34 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project are documented here. Format loosely follows
4
+ [Keep a Changelog](https://keepachangelog.com/); this project uses [SemVer](https://semver.org/).
5
+
6
+ ## [Unreleased] — 0.1.0
7
+
8
+ First public cut. Pre-release.
9
+
10
+ ### Added
11
+ - **Read-once cache.** `dnr ingest` / `dnr record` transcribe a file once, sign the record (Ed25519),
12
+ and embed it in the file's own metadata; `dnr read` returns the verified transcript instead of re-parsing.
13
+ - **Per-format `content_hash`** over *decoded* content (PDF streams, audio frames, decoded pixels, OOXML
14
+ manifest, NFC text), invariant under embedding — the identity + re-transcribe trigger.
15
+ - **In-file carriers:** PDF (XMP), MP3 (ID3), PNG (iTXt), JPEG (APP segment). Pixels/content untouched.
16
+ - **db-only records** in a per-folder `.dnr.db` for slotless formats (docx, …) and `--no-embed`
17
+ (evidentiary originals, byte-identical). Already-readable text gets no record. **No sidecar files.**
18
+ - **Per-folder index** (SQLite + FTS5 trigram, Korean/CJK ok): `dnr index` + `dnr query` with composed
19
+ filters (`--match` ∩ `--tag` ∩ `--since/--until` ∩ `--where`), `--any` OR sweeps, `--dedup`,
20
+ `--min-chars`, `--context` KWIC, `--format json`.
21
+ - **Query memory:** saved queries (`--save`/`dnr queries`/`--use`, re-run live), `dnr tag` and `dnr date`
22
+ for explicit (never-inferred) metadata.
23
+ - **Self-describing distribution:** each record carries an `_about` pointer, the `.dnr.db` self-describes,
24
+ and agents fetch the skill once (`dnr skill` / `SKILL.md`). `dnr init` just ensures a key.
25
+ - **Trust + quality:** signature + content_hash gate on `read`/`index`/`verify`; a low-quality
26
+ (empty/mojibake) transcript heuristic flagged by `dnr status` (`trusted ≠ faithful`).
27
+ - CLI: `keygen, ingest, record, read, verify, guide, types, status, date, index, query, queries, tag,
28
+ init, skill, strip, validate, schema`.
29
+ - Spec (`spec/dnr-0.1.md`) + JSON Schema + golden vectors; threat model (`SECURITY.md`).
30
+
31
+ ### Known limits
32
+ - Not yet on PyPI; a standalone binary (Python-less environments) is future work.
33
+ - Adoption (agents *knowing* dnr) is the real lever, not the tool alone.
34
+ - More in-file carriers (OOXML, audio containers, video) and pre-query auto-scan are planned.
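The changelog above mentions a low-quality (empty/mojibake) transcript heuristic surfaced by `dnr status`. The package's actual rule is not shown in this part of the diff, so the following is only a rough sketch of what such a check could look like. The function name `looks_low_quality` and the 5% threshold are assumptions.

```python
import unicodedata


def looks_low_quality(transcript, max_bad_ratio=0.05):
    """Flag transcripts that are empty or dominated by decoding debris.

    Hypothetical heuristic: counts U+FFFD replacement characters and
    control characters (other than newline, carriage return, tab).
    """
    text = transcript.strip()
    if not text:
        return True
    bad = sum(
        1
        for ch in text
        if ch == "\ufffd"
        or (unicodedata.category(ch) == "Cc" and ch not in "\n\r\t")
    )
    return bad / len(text) > max_bad_ratio


print(looks_low_quality(""))                   # empty transcript
print(looks_low_quality("\ufffd\ufffd\ufffdabc"))  # mostly replacement chars
print(looks_low_quality("계약서 제1조"))          # ordinary Korean text
```

A check like this says nothing about provenance, which is consistent with the `trusted ≠ faithful` distinction: a signed record can still carry a poor transcript.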
@@ -0,0 +1,33 @@
1
+ # Code of Conduct
2
+
3
+ ## Our pledge
4
+
5
+ We want dnr to be a welcoming, harassment-free project for everyone, regardless of
6
+ experience, identity, or background.
7
+
8
+ ## Our standards
9
+
10
+ Examples of behavior that helps:
11
+
12
+ - Being respectful and constructive in issues, PRs, and reviews.
13
+ - Welcoming newcomers and questions.
14
+ - Accepting feedback gracefully and assuming good faith.
15
+
16
+ Examples of unacceptable behavior:
17
+
18
+ - Harassment, insults, or derogatory comments.
19
+ - Personal or political attacks.
20
+ - Publishing others' private information without permission.
21
+
22
+ ## Scope
23
+
24
+ This applies in all project spaces (issues, pull requests, discussions) and when an
25
+ individual is representing the project in public spaces.
26
+
27
+ ## Enforcement
28
+
29
+ Report unacceptable behavior to the maintainer at **melodysdreamj@gmail.com**. Reports
30
+ will be reviewed and handled confidentially. Maintainers may remove, edit, or reject
31
+ contributions and comments that violate this Code of Conduct.
32
+
33
+ This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/), v2.1.
@@ -0,0 +1,71 @@
1
+ # Contributing to dnr
2
+
3
+ Thanks for your interest! dnr is a small, principled codebase — a quick read of the principles below will save you (and a reviewer) time.
4
+
5
+ ## Getting started
6
+
7
+ ```bash
8
+ git clone https://github.com/melodysdreamj/donotreadagain
9
+ cd donotreadagain
10
+ python -m venv .venv && . .venv/bin/activate
11
+ pip install -e ".[dev]"
12
+ pytest # should be green in ~1s
13
+ ```
14
+
15
+ Requires Python 3.10+. No API keys, no network, no services — the whole suite runs locally.
16
+
17
+ ## Project layout
18
+
19
+ ```
20
+ src/dnr/
21
+ hashing.py content_hash (per-format, over DECODED content) + whole_hash
22
+ record.py build the record + RFC 8785 (JCS) canonicalization
23
+ signing.py Ed25519 sign / verify; keyring.py manages the local key
24
+ embed.py carriers — read/write the record into a file's slot (PDF/MP3/PNG/JPEG)
25
+ ingest.py transcribe → record → sign → store; read_cached (the skip-reparse gate)
26
+ transcribe.py local providers (pypdf, Whisper, python-docx) + quality/lang heuristics
27
+ index.py per-folder .dnr.db (SQLite + FTS5): scan/harvest, query, query memory
28
+ guide.py the verbatim transcription contract; skill.py / bootstrap.py: distribution
29
+ cli.py the `dnr` command-line surface
30
+ spec/ the normative spec + JSON Schema + golden vectors
31
+ tests/ pytest; fast, hermetic
32
+ SKILL.md the agent skill (generated from skill.py via `dnr skill`)
33
+ ```
34
+
35
+ ## Design principles (please don't break these)
36
+
37
+ 1. **dnr is a deterministic substrate; the agent is the intelligence.** dnr does verifiable, repeatable primitives (hash, sign, full-text/structured query). It must **never *infer* metadata** (dates, case numbers, parties, topics) or do fuzzy/semantic search — that's the calling agent's job. Metadata is set *explicitly* (`dnr tag`, `dnr date`).
38
+ 2. **dnr owns no model.** The transcript is an *input*, produced by the agent (vision), a local model (Whisper, text-extract), or an API. Don't add a hard model dependency to the core.
39
+ 3. **File = canonical truth; the index is a regenerable cache.** Anything in `.dnr.db` (except authoritative db-only records) must be reconstructable from the files.
40
+ 4. **Determinism is load-bearing.** `content_hash` must stay invariant when the record is embedded, and re-embedding identical content must be byte-stable. New carriers must preserve this (see below). No timestamps in records/embeds.
41
+ 5. **`trusted ≠ faithful`.** Signing proves provenance + file-match, not transcription accuracy. Don't conflate them; surface quality, don't fake it.
42
+ 6. **No sidecar files, no per-folder notes.** Records live in-file or db-only; discovery rides on the artifacts' self-description.
43
+
44
+ If a change rubs against one of these, say so in the PR — sometimes the principle should evolve, but it should be a conscious decision (see [qna.md](qna.md) for ones already settled).
45
+
46
+ ## Adding a format carrier (common contribution)
47
+
48
+ To make a new format embed *in-file* (instead of db-only):
49
+
50
+ 1. Add `embed_<fmt>` / `extract_<fmt>` / `strip_<fmt>` in `embed.py`, register them in `_EMBED`/`_EXTRACT`/`_STRIP`.
51
+ 2. **Critical:** embedding must NOT change the file's *decoded content* — `hashing.content_hash` must be invariant before/after embed (e.g. for JPEG, insert a metadata segment without re-encoding the pixels). Re-embedding identical input must be byte-stable.
52
+ 3. Update `formats.py` (`SUPPORTED`) and add a test asserting: round-trip, `content_hash` invariance, idempotent re-embed, and `strip`.
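The carrier rules above (hash invariance, byte-stable re-embed, strip) can be illustrated with a toy PNG carrier built from the standard library only. This is a sketch under stated assumptions, not the code in `embed.py`: the `dnr` keyword, the function names, and hashing the inflated IDAT stream as a stand-in for "decoded pixels" are all hypothetical simplifications.

```python
import hashlib
import struct
import zlib

PNG_SIG = b"\x89PNG\r\n\x1a\n"
KEYWORD = b"dnr"  # hypothetical iTXt keyword


def chunk(ctype, data):
    """Serialize one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)


def iter_chunks(png):
    """Yield (type, data) for every chunk after the 8-byte signature."""
    assert png[:8] == PNG_SIG
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length


def content_hash(png):
    """Hash the inflated image data only; metadata chunks are ignored."""
    idat = b"".join(d for t, d in iter_chunks(png) if t == b"IDAT")
    return hashlib.sha256(zlib.decompress(idat)).hexdigest()


def strip(png):
    """Drop our iTXt chunk, leaving every other chunk byte-identical."""
    out = PNG_SIG
    for ctype, data in iter_chunks(png):
        if ctype == b"iTXt" and data.startswith(KEYWORD + b"\x00"):
            continue
        out += chunk(ctype, data)
    return out


def embed(png, record_json):
    """Insert the record as an uncompressed iTXt chunk just before IEND."""
    # keyword, NUL, compression flag 0, method 0, empty language tag, NUL,
    # empty translated keyword, NUL, then the UTF-8 text.
    payload = KEYWORD + b"\x00" + b"\x00\x00" + b"\x00" + b"\x00"
    payload += record_json.encode("utf-8")
    out = PNG_SIG
    for ctype, data in iter_chunks(strip(png)):  # strip first: idempotent
        if ctype == b"IEND":
            out += chunk(b"iTXt", payload)
        out += chunk(ctype, data)
    return out


def tiny_png():
    """Build a valid 1x1 red RGB PNG for the demo."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0)
    raw = b"\x00" + b"\xff\x00\x00"  # filter byte + one RGB pixel
    return (PNG_SIG + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", zlib.compress(raw)) + chunk(b"IEND", b""))


original = tiny_png()
embedded = embed(original, '{"transcript":"hello"}')
print(content_hash(original) == content_hash(embedded))
```

The four properties a carrier test is asked to assert map directly onto this sketch: the record round-trips through the iTXt chunk, `content_hash` is unchanged, a second `embed` with the same input returns identical bytes, and `strip` restores the original file.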
53
+
54
+ ## Tests
55
+
56
+ - Every change needs a test. The suite is the contract — keep it green and fast (no network, no large fixtures).
57
+ - Use `tmp_path` and set `DNR_HOME` to an isolated dir (see the fixtures in `tests/`).
58
+ - Run `pytest` before opening a PR.
59
+
60
+ ## Pull requests
61
+
62
+ - Branch from `main`; keep PRs focused.
63
+ - Conventional-ish commit subjects: `feat(dnr): …`, `fix(dnr): …`, `test: …`, `docs: …`.
64
+ - If you changed agent-facing behavior, regenerate `SKILL.md` (`dnr skill > SKILL.md`) and include the regenerated file in the PR.
65
+ - Describe *what* and *why*; link any relevant `qna.md` / spec section.
66
+
67
+ ## Reporting bugs / ideas
68
+
69
+ Open an issue (templates provided). For anything security-sensitive, see [SECURITY.md](SECURITY.md) — don't file a public issue for vulnerabilities.
70
+
71
+ By contributing you agree your work is licensed under the project's [MIT License](LICENSE).
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 june lee
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,153 @@
1
+ # dnr — Milestones
2
+
3
+ Build roadmap. Full design → [vision.md](vision.md). &nbsp; Status: ✅ done · 🔜 in progress / next · ⬜ todo
4
+
5
+ **v0.1 goal —** a working `dnr` that ingests PDF + audio (transcribe → canonical-hash → deterministic embed → sign), builds a per-folder queryable index (Korean/CJK search included), and lets an agent read/query with **no install**. Fundamentals-first: the `content_hash` and signing primitives are *proven* before the rest is layered on.
6
+
7
+ **Critical path:** M1 → M2 → (M3 ∥ M4) → M5 → M6 → M7 → M8 → M9. &nbsp; **v0.1 cut** = M1–M8 (build) + **M9 (dogfood — the real release-readiness gate)**. &nbsp; **M10–M14** = operability, security, the standard, scale, release.
8
+
9
+ **Progress (2026-06-20):** working `dnr` package + CLI — `hashing`/`record`(JCS)/`embed`(PDF·mp3·sidecar; gates 1·2·4)/`signing`(Ed25519+keyring); `transcribe` (transcriber-agnostic: local text-extract + agent path + Whisper provider) + `guide` (verbatim contract `dnr-verbatim-1`); `ingest`/`read_cached` (skip-reparse, idempotent); `index` (`.dnr.db` fixed table + FTS5 **trigram for CJK** + incremental scan + move resilience + tombstone). CLI: **keygen·ingest·record·read·verify·guide·types·index·query**. End-to-end (ingest→index→query→read) works with **zero API keys**; `dnr init` installs the agent skill (one-phrase bootstrap); **57 tests green.** **M1–M12 landed.** Then a **broad multi-user dogfood** (11 personas, each in an isolated folder) found 2 ship-blockers the targeted run missed — **both now fixed**: (1) the **index/query trusted UNSIGNED records** (security bypass — `read`/`verify` refused forged records but `query` surfaced them); `scan` now verifies signature + content_hash before harvesting. (2) **duplicate-content PRIMARY-KEY collision** silently dropped a file; rows are now keyed by **path**. Also added `query --list`, "no results"/CJK-short-term hints, and stripped-record removal from the index. **Then the 3 high dogfood items were fixed too:** non-PDF `ingest` (text → `method:none` sidecar, searchable; images/unknown → clean "use `dnr record`" errors, no pypdf crash); **CJK <3-char search** (LIKE substring fallback → 2-char Korean terms 계약/이혼/특허 now match); **spec `content_hash`** now documents the `<CS>`/`<IM>` framing + ships **golden vectors** (`spec/vectors/`, text + audio). **Then format coverage expanded:** **docx** (local python-docx text-extract), **images** (JPEG/PNG/TIFF/WEBP — pixel content_hash + agent `dnr record` + sidecar), and OOXML content_hash; cross-format search verified (docx + image in one folder). 
**Then a real-corpus dogfood** (the founder's `law-example` — 12 real Korean legal docs) found + fixed 4 more bugs: **macOS NFD paths** (now NFC-normalized in the index), **`start_date` as a real column** (`--where` now consistent with `--sort`), **language auto-detect** (`lang='ko'` now works for local ingest), and **filename-searchable FTS** (terms in the filename now match). Query surface also gained `--tag`, `--sort/--desc`, `--match --context N` (KWIC snippets), `--list`, and `record --tags`. (`#5` Korean-PDF word-spacing is inherent to CJK PDF text layers — not fixable from text; use the vision/`dnr record` path.) **76 tests green**, all 4 fixes verified on the real corpus. Then **sidecars were removed entirely** (`.dnr.json` gone): **images now embed in-file** (PNG iTXt / JPEG APP segment — pixels untouched, content_hash invariant, multi-segment chunking for >64KB), text/docx/etc. store a **db-only** record in `.dnr.db` (authoritative; preserved across re-scans), and `--no-embed` forces db-only for evidentiary originals (file byte-identical). Distribution also moved from a per-folder note to a **fetch-once `SKILL.md`** + each record's `_about` self-pointer. Then a **query-memory** layer landed: **composed queries** (`--match` ∩ `--tag a,b` ∩ `--since/--until` ∩ `--where`, one shot), **saved queries** (`--save`/`dnr queries`/`--use` — stores the *query*, re-runs live so it never goes stale), and **`dnr tag <file> <tag>…`** so an agent accumulates tags as it works (carrier files re-indexed immediately). Remaining debt: golden vectors / cross-tool, proper `dnr:` XMP namespace, more in-file carriers (OOXML for docx, audio containers, video), pre-query auto-scan, ingest lock. 
**Distribution decided: assume Python 3.10+ (pip/pipx/`uvx --from donotreadagain dnr`) — Python's stdlib `sqlite3` covers the read path, so one dependency does both; a standalone per-platform binary for Python-less environments is deferred post-1.0.** Verified: a clean venv `pip install .` yields a working standalone `dnr` (no source tree), sqlite included.
10
+
11
+ ---
12
+
13
+ ## ✅ M0 — Foundation validated
14
+ > Prove the load-bearing assumption before building on it.
15
+ - [x] Design doc (`vision.md`) — architecture, schema, hashing, trust, distribution
16
+ - [x] Repo init (git, MIT, README, .gitignore)
17
+ - [x] **make-or-break experiment** — `content_hash` invariant under embed + every re-save mode (PDF/WAV)
18
+ - [x] Deterministic embed recipe found → conformance **gate 4**
19
+
20
+ ## 🔜 M1 — Canonicalization core + conformance harness
21
+ > The single primitive everything rests on — *and* the test infra that makes "any tool agrees" real.
22
+ - [x] `content_hash(pdf)` — decompressed content streams + image XObjects, page order
23
+ - [~] `content_hash(audio)` done (mp3 frames + wav data chunk); remaining: `content_hash(image)` (decoded pixels), `content_hash(ooxml)` (member manifest)
24
+ - [~] Canonical record serialization — SHA-256 + RFC 8785 JCS done; NFC text normalization remaining
25
+ - [ ] **Conformance harness** — golden test vectors per format + a runnable suite, wired into **CI** (gates run every commit)
26
+ - [ ] **Cross-tool / cross-version determinism** — same `content_hash` across pikepdf/qpdf versions (and a 2nd library), not just self-consistency
27
+ - [ ] Follow-up validations: real **scanned** PDF (image-only), multi-MB payload, real **mp3**
28
+ - **Done when:** two independent tools/versions agree on `content_hash` for a real corpus, with published vectors.
29
+
30
+ ## 🔜 M2 — Embed / extract engine (carriers)
31
+ > Write & read the record in each format's native slot, safely.
32
+ - [~] Write: **PDF (XMP) · mp3 (ID3 TXXX) · sidecar `.dnr.json`** done; remaining: proper `dnr:` namespace, JPEG/PNG/TIFF/MP4, Vorbis, OOXML
33
+ - [x] **Deterministic embed** (`deterministic_id`, no auto-timestamps) — gate 4
34
+ - [x] **Atomic write** (temp + fsync + rename) — never mutate the original in place
35
+ - [x] Preserve native tags (gate 2) · read-back + verify `content_hash` (gate 1)
36
+ - [ ] Sidecar fallback rules: no slot / over size limit / read-only / sensitive
37
+ - **Done when:** all 4 conformance gates pass per carrier in CI.
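The "atomic write (temp + fsync + rename)" item in M2 describes a common pattern. A minimal standard-library sketch of it follows; the function name `atomic_write` and the temp-file prefix are hypothetical, and the package's own implementation may differ.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write bytes to a temp file beside `path`, fsync, then rename over it."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".dnr-tmp-")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap; the original is never half-written
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "sample.bin")
    atomic_write(target, b"v1")
    atomic_write(target, b"v2")
    with open(target, "rb") as fh:
        final_bytes = fh.read()
    leftovers = sorted(os.listdir(tmp))

print(final_bytes, leftovers)
```

A reader that opens the target at any moment sees either the old bytes or the new bytes, never a partial file, which is the property the milestone asks for.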
38
+
39
+ ## 🔜 M3 — Signing & trust
40
+ > Make a record trustworthy enough to justify skipping a re-read.
41
+ - [x] `record_hash = sha256(JCS(record − sig))`, Ed25519 sign / verify
42
+ - [~] Keygen + trust list done; persistent local keyring remaining
43
+ - [ ] Verify → trust tiers: signed + trusted + hash-match → **skip-reparse**; else **search-only + fallback**
44
+ - [ ] `transcript` always handled as untrusted data, never as instructions
45
+ - **Done when:** forged / altered / untrusted-key records are correctly refused for skip-reparse.
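The formula `record_hash = sha256(JCS(record − sig))` in M3 can be sketched with the standard library. Note the assumptions: `json.dumps` with sorted keys and compact separators only approximates RFC 8785 (it agrees for simple string and integer records, but not for every number format or key-ordering edge case), and the Ed25519 step is omitted because it needs a third-party library. Apart from `sig`, the field names in the demo are illustrative.

```python
import hashlib
import json


def canonical_bytes(obj):
    """Approximate JCS: sorted keys, no insignificant whitespace, UTF-8."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def record_hash(record):
    """Hash the record with its signature field removed."""
    unsigned = {k: v for k, v in record.items() if k != "sig"}
    return hashlib.sha256(canonical_bytes(unsigned)).hexdigest()


a = {"transcript": "계약서", "content_hash": "abc", "sig": "first"}
b = {"sig": "second", "content_hash": "abc", "transcript": "계약서"}
tampered = dict(a, transcript="계약서 (수정)")

print(record_hash(a) == record_hash(b))         # key order and sig do not matter
print(record_hash(a) == record_hash(tampered))  # any content change does
```

Excluding `sig` from the hashed bytes is what lets the signature be stored inside the same record it covers.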
46
+
47
+ ## 🔜 M4 — Transcription (ingest)
48
+ > Turn a raw file into a faithful record, once. **dnr owns no model** — the transcript is supplied by the agent or a local provider.
49
+ - [x] **Transcriber-agnostic ingest pipeline** — content_hash → transcribe → record → sign → embed
50
+ - [x] **Local `text-extract`** (pypdf, born-digital PDF, NFC) · **agent-supplied** path (`dnr record`) · **text files** (.txt/.md/.json/… → `method:none` sidecar) · clean errors for images/unknown
51
+ - [ ] Local models: **Whisper** (audio) · local OCR/vision (scans); optional hosted API
52
+ - [ ] Method hierarchy enforced: `text-extract` → `vision` → (`ocr` demoted)
53
+ - [ ] **Verbatim** transcription contract (prompt) shipped in the skill: complete, no summary, mark uncertainty
54
+ - [ ] provenance: version, instruction_id, prompt_hash, confidence; per-segment language tagging (feeds M6)
55
+ - [ ] Cost control: query-driven lazy ingest · ask-the-user · `dnr ingest --glob --budget`
56
+ - **Done when:** a PDF / audio ingests into a verbatim, signed, embedded record with provenance — agent-supplied or local, no API key required.
57
+
58
+ ## 🔜 M5 — Index (query layer)
59
+ > A folder becomes a queryable table — cheaply, incrementally.
60
+ - [x] `.dnr.db`: fixed base table + `dnr_fts` (FTS5) + `_dnr_readme`
61
+ - [x] Incremental scan: stat → record → harvest; tombstone deletes
62
+ - [~] **index ≠ ingest** done; cold-folder media → currently *skipped* (pending-rows TODO)
63
+ - [ ] Pre-query incremental scan (`--no-scan` to skip)
64
+ - [x] Move resilience: `content_hash` match → update `path` only
65
+ - [~] Concurrency: SQLite **WAL** on; `content_hash` ingest lock TODO
66
+ - **Done when:** second scan is fast (stat-skips ✅), queries are fresh, moves don't re-transcribe ✅
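The scan behavior listed under M5 (stat-skip, move resilience, tombstones) can be sketched with an in-memory dict standing in for `.dnr.db`. Everything here is illustrative: the function names are hypothetical, and a whole-file SHA-256 stands in for the per-format `content_hash`.

```python
import hashlib
import os
import tempfile


def file_hash(path):
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def scan(folder, index):
    """Update `index` (path -> size/mtime_ns/hash) and return what happened."""
    stats = {"skipped": 0, "hashed": 0, "moved": 0, "tombstoned": 0}
    by_hash = {row["hash"]: path for path, row in index.items()}
    seen = set()
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        seen.add(path)
        st = os.stat(path)
        row = index.get(path)
        if row and (row["size"], row["mtime_ns"]) == (st.st_size, st.st_mtime_ns):
            stats["skipped"] += 1  # stat unchanged: no read, no hash
            continue
        digest = file_hash(path)
        stats["hashed"] += 1
        old_path = by_hash.get(digest)
        if old_path and old_path != path and not os.path.exists(old_path):
            index.pop(old_path)  # same content at a new path: treat as a move
            stats["moved"] += 1
        index[path] = {"size": st.st_size, "mtime_ns": st.st_mtime_ns, "hash": digest}
    for path in [p for p in index if p not in seen]:
        del index[path]  # file is gone: tombstone the row
        stats["tombstoned"] += 1
    return stats


index = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, body in (("a.txt", b"alpha"), ("b.txt", b"beta")):
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(body)
    first = scan(tmp, index)
    second = scan(tmp, index)
    os.rename(os.path.join(tmp, "a.txt"), os.path.join(tmp, "c.txt"))
    third = scan(tmp, index)
    os.remove(os.path.join(tmp, "b.txt"))
    fourth = scan(tmp, index)

print(first, second, third, fourth, sep="\n")
```

In a real index the move branch is where an existing transcript would be carried over to the new path instead of being regenerated, which is the expensive step the milestone is trying to avoid.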
67
+
68
+ ## 🔜 M6 — i18n & search quality
69
+ > Make non-English — especially Korean/CJK — actually searchable (the founder's own corpus).
70
+ - [x] **trigram FTS5** + **LIKE substring fallback for <3-char terms** → 2-char Korean (계약/이혼/특허) matches; **filename also searchable** (FTS over name)
71
+ - [x] **NFC normalization end-to-end** — index stores NFC paths/names (fixes macOS NFD); text NFC-normalized
72
+ - [x] **language auto-detect** (script heuristic) → `lang` set on local ingest; `--where lang='ko'` works
73
+ - [~] Multilingual `fields` consistency / RTL / bidi remaining
74
+ - **Done when:** ✅ Korean legal-doc search (incl. 2-char terms + filenames) returns correct hits — verified on the real `law-example` corpus
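The M6 items above describe routing short terms away from the trigram index. A sketch of that decision follows, assuming a hypothetical helper `build_match_clause` and a hypothetical `text` column; only the `dnr_fts` table name comes from the milestones. The demo runs the LIKE path against a plain SQLite table, so it does not require an FTS5-enabled build.

```python
import sqlite3
import unicodedata


def build_match_clause(term):
    """Return (sql_fragment, params): FTS for 3+ chars, LIKE for shorter terms."""
    # Normalize first: in NFD a 2-syllable Hangul term is 5 code points
    # and would be misrouted to the trigram path.
    term = unicodedata.normalize("NFC", term.strip())
    if len(term) >= 3:
        quoted = '"' + term.replace('"', '""') + '"'  # FTS5 string literal
        return "dnr_fts MATCH ?", [quoted]
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return "text LIKE ? ESCAPE '\\'", ["%" + escaped + "%"]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (name TEXT, text TEXT)")  # stand-in table
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("a.txt", "임대차 계약 해지 통보"), ("b.txt", "특허 출원서")],
)

clause, params = build_match_clause("계약")
hits = [row[0] for row in conn.execute("SELECT name FROM docs WHERE " + clause, params)]
print(clause, hits)
```

A trigram tokenizer indexes three-character windows, so a two-character query has nothing to match against; the substring fallback trades index speed for correctness on those short terms.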
75
+
76
+ ## 🔜 M7 — CLI & distribution
77
+ > One tool that ties it together, runnable anywhere.
78
+ - [~] `dnr init·ingest·record·read·verify·keygen·guide·types·index·query` done; `seal·strip` TODO
79
+ - [x] Protocol enforced in code (`dnr read/index/query` are real commands, not prose)
80
+ - [ ] `uvx` package **+ single static binary** (per-platform releases) — dependency-free drop-in
81
+ - **Done when:** `uvx dnr index <folder>` and `dnr query` work on a fresh machine, offline (minus transcription API).
82
+
83
+ ## 🔜 M8 — Agent integration (consumer)
84
+ > Zero-install consumption by AI agents.
85
+ - [x] **agent skill (`SKILL.md`)**: fixed schema + example queries + consumer contract + verbatim guide; a fetchable skill (frontmatter name/description), **not** a per-folder note
86
+ - [x] **skill encodes the full decision flow** (A: one file → self-validating `dnr read` / B: folder → status→transcribe→index→query) and was **adversarially tested** — fresh agents given only the skill text + 6 scenarios; judged vs the canonical flow + a doc critic over 4 rounds (3→4→5 correct, **0 wrong throughout**), fixing real gaps (read=self-validating; `--sidecar` mutation; transcribe-as-a-step; bulk-only ask-gate) so the wording matches actual CLI behavior
87
+ - [~] Consumer path documented (`dnr read`/`query`; raw `sqlite3` via `_dnr_readme`)
88
+ - [x] **transcribe-first ask-flow** — `dnr status <folder>` reports coverage by cost (model = image/audio/video / parse = PDF·Office / cheap = text); the skill tells the agent to run it on the first folder-wide question and **offer to transcribe-first** when expensive files are un-transcribed (one-time pass → every later view is a cache hit). Verified on the real corpus: `status 자료/` → "0/441 transcribed, 92 model + 202 parse pending → transcribe-first recommended".
89
+ - [x] **No per-folder note — self-describing + fetch-once skill** — every record carries an `_about` pointer (and the `.dnr.db` readme points to the skill), so an agent that meets a dnr artifact fetches `SKILL.md` **once** (committed at the repo root / `dnr skill`) and then knows dnr in every folder; nothing is written into the user's folders. `dnr init` now only ensures a signing key. Run with no install via `uvx --from donotreadagain dnr`.
90
+ - **Done when:** an agent given only the skill queries a dnr folder and skips re-parsing correctly; `dnr init` bootstraps from a single user phrase.
91
+
92
+ ## 🔜 M9 — Agent scenario testing & dogfooding
93
+ > Drive the whole thing with real agents across many scenarios — the bugs that specs & unit tests miss surface here, and feed M10–M12. This is the real release-readiness gate.
94
+ - [x] **Scenario matrix**, run by agents: cache-hit, cross-file query, cold folder, move, freshness, incremental, CJK, + adversarial
95
+ - [~] **Multi-harness**: exercised via the real `dnr` CLI by agents; actual Claude Code / Codex / Cursor runs are a broader TODO
96
+ - [x] **Adversarial / edge**: forged-unsigned (refused), tampered-signed (verify fails), freshness (no stale leak), corrupt/garbage file
97
+ - [x] **Measure**: 8/10 pass, security held, failure taxonomy + prioritized backlog produced
98
+ - [x] Run as a **multi-agent workflow** (10 scenarios in parallel + synthesis)
99
+ - **Done when:** ✅ matrix complete with a failure list; fixes fed back (corrupt-file robustness → done). Remaining low-pri: CJK <3-char FTS, file-embedded-cache note.
100
+
101
+ ## 🔜 M10 — Reversibility & corpus operability
102
+ > Make it safe to undo, and runnable at corpus scale.
103
+ - [x] **Robustness** (from M9 dogfooding): corrupt/missing files no longer crash — clean errors, one bad file never aborts a scan; `dnr read` falls back gracefully
104
+ - [x] `dnr strip` (un-embed, in-file + sidecar; content unchanged) · **bulk rollback** TODO
105
+ - [ ] **Resumable / idempotent** ingest after crash · `--dry-run`
106
+ - [ ] Rebuild a corrupted/lost `.dnr.db` without re-incurring transcription
107
+ - [ ] **Model-upgrade policy**: re-transcribe only lossy methods (asr/ocr/vision), skip `text-extract`; partial/lazy migration; mixed-version coherence
108
+ - [ ] Backup/dedup awareness (embedding changes whole_hash → re-backup churn)
109
+ - **Done when:** a bad bulk ingest is fully revertible and a crashed run resumes cleanly.
110
+
111
+ ## 🔜 M11 — Security & privacy
112
+ > Treat every embedded record as untrusted input; don't leak on share.
113
+ - [x] **Threat-model document** — `SECURITY.md` (injection, forgery, TOFU, exfiltration, custody) + dogfood evidence
114
+ - [~] `transcript` as untrusted data (skill + contract); injection covered by forged/tampered dogfood; dedicated injection corpus TODO
115
+ - [~] `dnr strip` before sharing done; sensitivity-flag refuse-embed TODO
116
+ - [~] Poisoning surface documented; sanitization helpers TODO
117
+ - **Done when:** ✅ dogfooding showed a malicious/forged/tampered file cannot pass as trusted or steer the agent.
118
+
119
+ ## 🔜 M12 — Spec formalization (the standard) ← the goal
120
+ > Make it implementable by others, and able to evolve.
121
+ - [x] `spec/dnr-0.1.md` (normative) + `spec/dnr.schema.json` (JSON Schema) + `dnr validate` / `dnr schema`
122
+ - [x] Carrier mapping table · per-format canonicalization (incl. documented PDF `<CS>`/`<IM>` framing) · conformance gates (in spec)
123
+ - [x] **Golden conformance vectors** — `spec/vectors/` (text + audio) + a test the impl must reproduce; PDF/image/OOXML vectors pending
124
+ - [~] Versioning / forward-compat in spec; profile registry, governance = TODO
125
+ - **Done when:** a second, independent implementation passes the published vectors (text/audio shipped; PDF vector + a real 2nd impl remain).
126
+
127
+ ## 🔜 M13 — Format expansion & scale hardening
128
+ - [~] **docx** (local extract) + **images** JPEG/PNG/TIFF/WEBP (pixel content_hash + sidecar + agent record) done; remaining: FLAC/OGG/M4A/MP4·MOV, xlsx/pptx, in-file image XMP/PNG-iTXt, local Whisper wired
129
+ - [ ] Large-corpus performance · multi-agent stress · recovery primitives at scale
130
+
131
+ ## ⬜ M14 — Release, governance & adoption
132
+ > Ship, then earn adoption with **proof** — not cold asks. (See "Adoption strategy" below.)
133
+ - [ ] v0.1 public release on GitHub
134
+ - [ ] **Benchmark** (the key adoption asset): measured token / latency savings on re-reads + agent protocol-compliance rate (built on M9's numbers)
135
+ - [ ] 2-minute demo · one-command try (`uvx dnr …`)
136
+ - [ ] **Launch posts** (GeekNews / Show HN) — lead with the demo + benchmark; CTA = the one-phrase bootstrap (`uvx dnr init`). A spike, not a strategy — only after v0.1 + the try-path are frictionless.
137
+ - [ ] **Opt-in surfaces first**: MCP server + skill/`AGENTS.md` snippet (users adopt without any maintainer PR)
138
+ - [ ] **OKF-compatible sidecar emit** (ride existing rails, don't fight them)
139
+ - [ ] **Targeted integrations**: PRs only to projects with a clean plugin/tool extension point — benefit-first, after the benchmark exists
140
+ - [ ] Governance: contribution process + spec change control
141
+ - **Done when:** ≥1 external project/user adopts via the opt-in surface, with the benchmark as evidence.
142
+
143
+ ---
144
+
145
+ ### Adoption strategy (M14) — why "proof-then-pitch", not cold PRs
146
+ 1. **Prove first.** Maintainers adopt things that already work and come with a measured number, not specs. Ship → benchmark (token/latency savings) → a few real users → *then* integrate.
147
+ 2. **Opt-in beats PR.** Consumption is ambient `sqlite3`, so the integration is tiny — and an **MCP server / skill snippet** lets users turn it on with zero maintainer change, sidestepping PR rejection. Reserve real PRs for projects with a plugin/tool registry.
148
+ 3. **Benefit-first messaging.** Not "adopt my standard" — "your agent re-parses PDFs every turn; drop this in for an N% saving." Show, don't tell.
149
+ 4. **Target narrowly.** Document-heavy RAG / coding / research agents where re-parsing is a felt pain. 50 drive-by PRs read as spam and cost reputation; a few working, wanted integrations win.
150
+
151
+ ---
152
+
153
+ *All docs in English (public repo).*
@@ -0,0 +1,151 @@
1
+ Metadata-Version: 2.4
2
+ Name: donotreadagain
3
+ Version: 0.1.0
4
+ Summary: Read once, never again — self-describing files so AI agents stop re-parsing.
5
+ Project-URL: Homepage, https://github.com/melodysdreamj/donotreadagain
6
+ Project-URL: Repository, https://github.com/melodysdreamj/donotreadagain
7
+ Project-URL: Issues, https://github.com/melodysdreamj/donotreadagain/issues
8
+ Author: june lee
9
+ License: MIT
10
+ License-File: LICENSE
11
+ Keywords: agents,ai,cache,metadata,pdf,self-describing,transcript,xmp
12
+ Requires-Python: >=3.10
13
+ Requires-Dist: cryptography
14
+ Requires-Dist: jsonschema
15
+ Requires-Dist: mutagen
16
+ Requires-Dist: pikepdf>=8
17
+ Requires-Dist: pillow
18
+ Requires-Dist: pypdf
19
+ Requires-Dist: python-docx
20
+ Requires-Dist: rfc8785
21
+ Provides-Extra: dev
22
+ Requires-Dist: fpdf2; extra == 'dev'
23
+ Requires-Dist: pytest; extra == 'dev'
24
+ Description-Content-Type: text/markdown
25
+
26
+ # donotreadagain (`dnr`)
27
+
28
+ > **Read once, never again.** Embed a faithful, signed AI transcript into each expensive-to-parse file's own metadata, so AI agents stop re-OCR/re-parsing the same PDF, image, scan, or audio every time.
29
+
30
+ [![tests](https://img.shields.io/badge/tests-passing-brightgreen)](#development) [![license](https://img.shields.io/badge/license-MIT-blue)](LICENSE) [![python](https://img.shields.io/badge/python-3.10%2B-blue)](pyproject.toml) · **status:** v0.1, pre-release
31
+
32
+ ---
33
+
34
+ ## The problem
35
+
36
+ AI agents re-parse the same file *every time they touch it* — re-OCR a scan, re-run vision on a screenshot, re-transcribe an audio clip, re-extract a PDF. It's slow, it burns tokens and model calls, and it's non-deterministic. In **repeat-access corpora** (legal, research, compliance) the same documents get read dozens of times; with **multi-agent** setups every agent re-parses independently. The waste compounds exactly where it hurts.
37
+
38
+ ## The idea
39
+
40
+ dnr reads a file **once**, then writes a verbatim transcript + structured metadata **into the file's own native metadata slot** as a *signed* JSON record — the file becomes **self-describing**. Any agent that opens it later reads the cached transcript instead of re-parsing. A per-folder SQLite + FTS5 index makes a whole folder searchable without opening anything.
41
+
42
+ The second view is the win:
43
+
44
+ | | first view (re-parse) | second view (cached) |
45
+ |---|---|---|
46
+ | born-digital PDF | ~1.4 s (pypdf) | ~60 ms — **~22× faster** |
47
+ | image / scan / audio | a vision / Whisper model call | a few ms of text — **no model at all** |
48
+
49
+ …and the cache is **trustworthy**: a record is used only if it's signed by a trusted key *and* its `content_hash` still matches the file, so "fast" never means "stale or forged."
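The two-part rule above (trusted signer *and* matching `content_hash`) can be sketched with the standard library alone. This is an illustration under loud assumptions: HMAC-SHA256 stands in for dnr's real signature scheme, sorted-key JSON stands in for its canonical form, and the record keys `sig`, `key_id`, and the helper names are hypothetical. Only `content_hash`, `key_id` as a displayed field, and `transcript` appear in the source.

```python
import hashlib
import hmac
import json


def content_hash(content: bytes) -> str:
    """Hash only the content bytes (the metadata slot is assumed to be excluded)."""
    return hashlib.sha256(content).hexdigest()


def _payload(record: dict) -> bytes:
    # Stand-in canonical form: sorted-key JSON of everything except the signature.
    body = {k: v for k, v in record.items() if k != "sig"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(record: dict, key: bytes) -> dict:
    # ASSUMPTION: HMAC is a stand-in; the real scheme is not described in this section.
    sig = hmac.new(key, _payload(record), hashlib.sha256).hexdigest()
    return {**record, "sig": sig}


def cached_transcript(record: dict, content: bytes, trusted_keys: dict):
    """Return the transcript only if the record is trusted and still matches."""
    key = trusted_keys.get(record.get("key_id"))
    if key is None:
        return None  # signer is not in the trusted set
    expected = hmac.new(key, _payload(record), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, record.get("sig", "")):
        return None  # record was forged or tampered with
    if record.get("content_hash") != content_hash(content):
        return None  # file content changed since the record was made (stale)
    return record["transcript"]
```

Returning `None` on every failure path mirrors the "silently misses" behaviour described for `dnr read`: the caller falls back to parsing the file rather than acting on an unverified transcript.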
50
+
51
+ ## Demo
52
+
53
+ ```console
54
+ $ dnr ingest contract.pdf # transcribe once → sign → embed in the file
55
+ ingested contract.pdf [in-file]
56
+ method=text-extract transcriber=pypdf
57
+ signed key_id=ce6d170a497238f7
58
+
59
+ $ dnr read contract.pdf # later (or from any agent): verified cache hit — no re-parsing
60
+ LOAN AGREEMENT
61
+ Lender: Acme Capital LLC
62
+ Borrower: Jordan Smith
63
+ Principal: USD 1,200,000
64
+ ...
65
+
66
+ $ dnr index ./contracts
67
+ $ dnr query ./contracts --match damages --context 40 # search a whole folder, no files opened
68
+ contract.pdf
69
+ … Principal: USD 1,200,000 Maturity: 2026-12-31 Damages clause: section 7.
70
+ ```
71
+
72
+ The transcript lives *inside* `contract.pdf` — move it, email it, hand it to another agent, and the cached transcript travels with it.
73
+
74
+ ## Quickstart
75
+
76
+ Requires **Python 3.10+**. Its standard library includes the `sqlite3` module used to read the index, so Python alone covers both running dnr and querying `.dnr.db`.
77
+
78
+ ```bash
79
+ # run with no persistent install:
80
+ uvx --from donotreadagain dnr <cmd>
81
+ # or install:
82
+ pipx install donotreadagain # or: pip install donotreadagain
83
+ ```
84
+
85
+ ```bash
86
+ dnr ingest report.pdf # transcribe once (local) → sign → embed in the file
87
+ dnr read report.pdf # print the cached transcript (verified), or fall back
88
+ dnr index ./case-folder # build .dnr.db
89
+ dnr query ./case-folder --match "손해배상" --tag 가압류 --since 2025-01-01
90
+ ```
91
+
92
+ For a scan / image / anything you must *look* at, the agent transcribes it and records the result:
93
+
94
+ ```bash
95
+ dnr record scan.png --transcript-file t.md --method vision --transcriber <your-model>
96
+ ```
97
+
98
+ ## How it fits together
99
+
100
+ ```
101
+ File = canonical truth Index .dnr.db = derived, regenerable
102
+ ┌────────────────────────────┐ harvest ┌────────────────────────────┐
103
+ │ signed dnr record │ ───────▶ │ fixed table + FTS5 search │
104
+ │ content_hash · transcript │ │ path · tags · transcript … │
105
+ │ provenance · fields · sig │ └────────────────────────────┘
106
+ └────────────────────────────┘ ▲ query via sqlite3 — no dnr install needed
107
+ ▲ transcribe · sign · embed once (expensive)
108
+ ```
109
+
110
+ **Where the record lives (no sidecar files):**
111
+ - **In-file** for formats with a metadata slot: PDF→XMP, MP3→ID3, PNG→iTXt, JPEG→APP segment. The content itself (pixels, content bytes) is untouched, so `content_hash` stays invariant and the transcript **travels with the file** (move it, email it, and it's still there).
112
+ - **db-only** in the folder's `.dnr.db` for formats with no slot yet (docx, …), or via `--no-embed` for evidentiary originals you must not modify (file left byte-identical).
113
+ - **Nothing** for already-readable text (`.txt`/`.md`/`.csv`) — an agent just reads it.
114
+
115
+ ## Using it
116
+
117
+ - **Read (consumer):** `dnr read <file>` returns the cached transcript only if it's present, trusted, and still matches (self-validating — a changed file silently misses). No dnr tool? An agent can read `.dnr.db` directly with ambient `sqlite3` (the db's `_dnr_readme` table self-describes).
118
+ - **Transcribe (producer):** `dnr ingest` (local: pypdf / Whisper / python-docx) or `dnr record` (agent supplies a vision transcript). dnr **owns no model** — the transcript is an input from whoever's best placed.
119
+ - **Query a folder:** `dnr query <folder>` combines `--match` (FTS, Korean/CJK ok) ∩ `--tag a,b` ∩ `--since/--until` ∩ `--where`; plus `--any` (OR sweep), `--dedup`, `--context` (KWIC), `--format json`. Save composed queries with `--save`/`--use`; accumulate labels with `dnr tag`.
120
+ - **Agents onboard once:** point an agent at a dnr folder and it fetches **[SKILL.md](SKILL.md)** once — then it knows dnr everywhere. `dnr init` just ensures a signing key; nothing is written into your folders.
121
+
122
+ ## Design principles
123
+
124
+ - **dnr is the deterministic substrate; the agent is the intelligence.** dnr does verifiable primitives (hash, sign, full-text/structured query); it never *infers* metadata (dates, parties, topics) or does fuzzy semantic search — that's the agent's job. Set metadata explicitly with `dnr tag` / `dnr date`.
125
+ - **File = truth, index = regenerable cache.** Delete `.dnr.db` and rebuild it from the files anytime.
126
+ - **Transcriber-agnostic.** dnr ships a *contract* (the verbatim guide) + a *trust layer*, not a model. Fidelity is the transcriber's; provenance is recorded so a consumer can apply its own quality policy (`trusted ≠ faithful`).
127
+
128
+ ## Status & honest limits
129
+
130
+ v0.1, pre-release. Works today for repeat-access corpora; validated by real-corpus dogfooding. Known limits we're explicit about:
131
+ - **Adoption is the real lever.** The value compounds when agents *know* dnr (a skill, eventually native support) — not from the tool alone.
132
+ - **`trusted ≠ faithful`.** A signature proves *who made it + that it matches the file*, not that the transcription is accurate. Low-quality/garbled transcripts are flagged (`dnr status`), not silently trusted.
133
+ - **Not yet published** to PyPI; a standalone binary for Python-less environments is future work.
134
+
135
+ See **[vision.md](vision.md)** (design) · **[spec/dnr-0.1.md](spec/dnr-0.1.md)** (spec) · **[SECURITY.md](SECURITY.md)** (threat model) · **[qna.md](qna.md)** (settled design decisions) · **[MILESTONES.md](MILESTONES.md)** (roadmap).
136
+
137
+ ## Development
138
+
139
+ ```bash
140
+ git clone https://github.com/melodysdreamj/donotreadagain
141
+ cd donotreadagain
142
+ python -m venv .venv && . .venv/bin/activate
143
+ pip install -e ".[dev]"
144
+ pytest # the suite is green and fast
145
+ ```
146
+
147
+ Contributions welcome — see **[CONTRIBUTING.md](CONTRIBUTING.md)**.
148
+
149
+ ## License
150
+
151
+ [MIT](LICENSE) © 2026 june lee