donotreadagain 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- donotreadagain-0.1.0/.github/ISSUE_TEMPLATE/bug_report.md +27 -0
- donotreadagain-0.1.0/.github/ISSUE_TEMPLATE/feature_request.md +19 -0
- donotreadagain-0.1.0/.github/PULL_REQUEST_TEMPLATE.md +16 -0
- donotreadagain-0.1.0/.github/workflows/ci.yml +30 -0
- donotreadagain-0.1.0/.gitignore +32 -0
- donotreadagain-0.1.0/CHANGELOG.md +34 -0
- donotreadagain-0.1.0/CODE_OF_CONDUCT.md +33 -0
- donotreadagain-0.1.0/CONTRIBUTING.md +71 -0
- donotreadagain-0.1.0/LICENSE +21 -0
- donotreadagain-0.1.0/MILESTONES.md +153 -0
- donotreadagain-0.1.0/PKG-INFO +151 -0
- donotreadagain-0.1.0/README.md +126 -0
- donotreadagain-0.1.0/SECURITY.md +54 -0
- donotreadagain-0.1.0/SKILL.md +70 -0
- donotreadagain-0.1.0/pyproject.toml +40 -0
- donotreadagain-0.1.0/qna.md +35 -0
- donotreadagain-0.1.0/spec/dnr-0.1.md +124 -0
- donotreadagain-0.1.0/spec/dnr.schema.json +140 -0
- donotreadagain-0.1.0/spec/vectors/sample.txt +3 -0
- donotreadagain-0.1.0/spec/vectors/sample.wav +0 -0
- donotreadagain-0.1.0/spec/vectors/vectors.json +10 -0
- donotreadagain-0.1.0/src/dnr/__init__.py +11 -0
- donotreadagain-0.1.0/src/dnr/bootstrap.py +22 -0
- donotreadagain-0.1.0/src/dnr/cli.py +469 -0
- donotreadagain-0.1.0/src/dnr/embed.py +397 -0
- donotreadagain-0.1.0/src/dnr/formats.py +53 -0
- donotreadagain-0.1.0/src/dnr/guide.py +44 -0
- donotreadagain-0.1.0/src/dnr/hashing.py +156 -0
- donotreadagain-0.1.0/src/dnr/index.py +494 -0
- donotreadagain-0.1.0/src/dnr/ingest.py +216 -0
- donotreadagain-0.1.0/src/dnr/keyring.py +45 -0
- donotreadagain-0.1.0/src/dnr/record.py +42 -0
- donotreadagain-0.1.0/src/dnr/schema.py +79 -0
- donotreadagain-0.1.0/src/dnr/signing.py +71 -0
- donotreadagain-0.1.0/src/dnr/skill.py +94 -0
- donotreadagain-0.1.0/src/dnr/transcribe.py +133 -0
- donotreadagain-0.1.0/tests/conftest.py +76 -0
- donotreadagain-0.1.0/tests/test_coverage.py +81 -0
- donotreadagain-0.1.0/tests/test_embed.py +59 -0
- donotreadagain-0.1.0/tests/test_fixes.py +103 -0
- donotreadagain-0.1.0/tests/test_guide.py +32 -0
- donotreadagain-0.1.0/tests/test_hashing.py +38 -0
- donotreadagain-0.1.0/tests/test_index.py +168 -0
- donotreadagain-0.1.0/tests/test_ingest.py +66 -0
- donotreadagain-0.1.0/tests/test_init.py +49 -0
- donotreadagain-0.1.0/tests/test_query_features.py +81 -0
- donotreadagain-0.1.0/tests/test_query_filters.py +116 -0
- donotreadagain-0.1.0/tests/test_query_memory.py +71 -0
- donotreadagain-0.1.0/tests/test_record.py +27 -0
- donotreadagain-0.1.0/tests/test_robustness.py +58 -0
- donotreadagain-0.1.0/tests/test_schema.py +36 -0
- donotreadagain-0.1.0/tests/test_signing.py +31 -0
- donotreadagain-0.1.0/tests/test_strip.py +31 -0
- donotreadagain-0.1.0/tests/test_text.py +47 -0
- donotreadagain-0.1.0/tests/test_vectors.py +16 -0
- donotreadagain-0.1.0/vision.md +383 -0
@@ -0,0 +1,27 @@
+---
+name: Bug report
+about: Something didn't work as expected
+labels: bug
+---
+
+**What happened**
+A clear description of the bug.
+
+**To reproduce**
+Exact `dnr` command(s) and, if possible, a minimal file/folder that triggers it.
+
+```
+$ dnr ...
+```
+
+**Expected**
+What you expected instead.
+
+**Environment**
+- dnr version (`dnr --version`):
+- Python version (`python --version`):
+- OS:
+- File type involved (PDF / PNG / JPEG / MP3 / docx / …):
+
+**Notes**
+Anything else — e.g. is the transcript flagged low-quality by `dnr status`? Is it in-file or db-only?
@@ -0,0 +1,19 @@
+---
+name: Feature request
+about: Suggest an idea or improvement
+labels: enhancement
+---
+
+**The problem / use case**
+What are you trying to do that's hard or impossible today?
+
+**Proposed solution**
+What you'd like to see (a flag, a command, a carrier, a behavior).
+
+**Fit with dnr's principles**
+dnr is a *deterministic substrate* — it doesn't infer metadata or do semantic/fuzzy work (that's the
+agent's job), owns no model, and keeps the file as the source of truth. Does your idea fit, or is it a
+conscious exception? (See [qna.md](../qna.md) for decisions already settled.)
+
+**Alternatives considered**
+Anything you tried or ruled out.
@@ -0,0 +1,16 @@
+## What & why
+
+Briefly: what does this change, and why?
+
+## Checklist
+
+- [ ] Tests added/updated and `pytest` is green locally
+- [ ] No new hard model dependency in the core (dnr owns no model)
+- [ ] If a new carrier: `content_hash` is invariant under embed + re-embed is byte-stable (round-trip test added)
+- [ ] If agent-facing behavior changed: regenerated `SKILL.md` (`dnr skill > SKILL.md`)
+- [ ] Doesn't make dnr *infer* metadata (dates/parties/topics) or do fuzzy/semantic search — that stays the agent's job
+- [ ] Docs updated if needed (README / spec / qna)
+
+## Notes
+
+Link any relevant `qna.md` decision or spec section. Call out anything that rubs against a design principle.
@@ -0,0 +1,30 @@
+name: ci
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10", "3.11", "3.12", "3.13"]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+      - name: Test
+        run: pytest -q
+      - name: Spec ⇄ schema in sync
+        run: |
+          python -c "import json,sys; from dnr import schema; \
+          cur=json.load(open('spec/dnr.schema.json')); \
+          sys.exit(0 if cur==schema.SCHEMA else 'spec/dnr.schema.json is stale — regenerate from dnr.schema.SCHEMA')"
@@ -0,0 +1,32 @@
+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+.eggs/
+dist/
+build/
+.venv/
+venv/
+.pytest_cache/
+.ruff_cache/
+.mypy_cache/
+
+# uv
+uv.lock.*
+.uv/
+
+# dnr index — regenerable cache (safe to ignore; can travel as a warm cache if desired)
+.dnr.db
+.dnr.db-wal
+.dnr.db-shm
+*.dnr.json
+
+# editor / OS
+.DS_Store
+.idea/
+.vscode/
+
+# scratch / experiments (local only, never committed)
+*.tmp
+/scratch/
+/experiments/
@@ -0,0 +1,34 @@
+# Changelog
+
+All notable changes to this project are documented here. Format loosely follows
+[Keep a Changelog](https://keepachangelog.com/); this project uses [SemVer](https://semver.org/).
+
+## [Unreleased] — 0.1.0
+
+First public cut. Pre-release.
+
+### Added
+- **Read-once cache.** `dnr ingest` / `dnr record` transcribe a file once, sign the record (Ed25519),
+and embed it in the file's own metadata; `dnr read` returns the verified transcript instead of re-parsing.
+- **Per-format `content_hash`** over *decoded* content (PDF streams, audio frames, decoded pixels, OOXML
+manifest, NFC text), invariant under embedding — the identity + re-transcribe trigger.
+- **In-file carriers:** PDF (XMP), MP3 (ID3), PNG (iTXt), JPEG (APP segment). Pixels/content untouched.
+- **db-only records** in a per-folder `.dnr.db` for slotless formats (docx, …) and `--no-embed`
+(evidentiary originals, byte-identical). Already-readable text gets no record. **No sidecar files.**
+- **Per-folder index** (SQLite + FTS5 trigram, Korean/CJK ok): `dnr index` + `dnr query` with composed
+filters (`--match` ∩ `--tag` ∩ `--since/--until` ∩ `--where`), `--any` OR sweeps, `--dedup`,
+`--min-chars`, `--context` KWIC, `--format json`.
+- **Query memory:** saved queries (`--save`/`dnr queries`/`--use`, re-run live), `dnr tag` and `dnr date`
+for explicit (never-inferred) metadata.
+- **Self-describing distribution:** each record carries an `_about` pointer, the `.dnr.db` self-describes,
+and agents fetch the skill once (`dnr skill` / `SKILL.md`). `dnr init` just ensures a key.
+- **Trust + quality:** signature + content_hash gate on `read`/`index`/`verify`; a low-quality
+(empty/mojibake) transcript heuristic flagged by `dnr status` (`trusted ≠ faithful`).
+- CLI: `keygen, ingest, record, read, verify, guide, types, status, date, index, query, queries, tag,
+init, skill, strip, validate, schema`.
+- Spec (`spec/dnr-0.1.md`) + JSON Schema + golden vectors; threat model (`SECURITY.md`).
+
+### Known limits
+- Not yet on PyPI; a standalone binary (Python-less environments) is future work.
+- Adoption (agents *knowing* dnr) is the real lever, not the tool alone.
+- More in-file carriers (OOXML, audio containers, video) and pre-query auto-scan are planned.
@@ -0,0 +1,33 @@
+# Code of Conduct
+
+## Our pledge
+
+We want dnr to be a welcoming, harassment-free project for everyone, regardless of
+experience, identity, or background.
+
+## Our standards
+
+Examples of behavior that helps:
+
+- Being respectful and constructive in issues, PRs, and reviews.
+- Welcoming newcomers and questions.
+- Accepting feedback gracefully and assuming good faith.
+
+Examples of unacceptable behavior:
+
+- Harassment, insults, or derogatory comments.
+- Personal or political attacks.
+- Publishing others' private information without permission.
+
+## Scope
+
+This applies in all project spaces (issues, pull requests, discussions) and when an
+individual is representing the project in public spaces.
+
+## Enforcement
+
+Report unacceptable behavior to the maintainer at **melodysdreamj@gmail.com**. Reports
+will be reviewed and handled confidentially. Maintainers may remove, edit, or reject
+contributions and comments that violate this Code of Conduct.
+
+This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/), v2.1.
@@ -0,0 +1,71 @@
+# Contributing to dnr
+
+Thanks for your interest! dnr is a small, principled codebase — a quick read of the principles below will save you (and a reviewer) time.
+
+## Getting started
+
+```bash
+git clone https://github.com/melodysdreamj/donotreadagain
+cd donotreadagain
+python -m venv .venv && . .venv/bin/activate
+pip install -e ".[dev]"
+pytest # should be green in ~1s
+```
+
+Requires Python 3.10+. No API keys, no network, no services — the whole suite runs locally.
+
+## Project layout
+
+```
+src/dnr/
+  hashing.py     content_hash (per-format, over DECODED content) + whole_hash
+  record.py      build the record + RFC 8785 (JCS) canonicalization
+  signing.py     Ed25519 sign / verify; keyring.py manages the local key
+  embed.py       carriers — read/write the record into a file's slot (PDF/MP3/PNG/JPEG)
+  ingest.py      transcribe → record → sign → store; read_cached (the skip-reparse gate)
+  transcribe.py  local providers (pypdf, Whisper, python-docx) + quality/lang heuristics
+  index.py       per-folder .dnr.db (SQLite + FTS5): scan/harvest, query, query memory
+  guide.py       the verbatim transcription contract; skill.py / bootstrap.py: distribution
+  cli.py         the `dnr` command-line surface
+spec/            the normative spec + JSON Schema + golden vectors
+tests/           pytest; fast, hermetic
+SKILL.md         the agent skill (generated from skill.py via `dnr skill`)
+```
+
+## Design principles (please don't break these)
+
+1. **dnr is a deterministic substrate; the agent is the intelligence.** dnr does verifiable, repeatable primitives (hash, sign, full-text/structured query). It must **never *infer* metadata** (dates, case numbers, parties, topics) or do fuzzy/semantic search — that's the calling agent's job. Metadata is set *explicitly* (`dnr tag`, `dnr date`).
+2. **dnr owns no model.** The transcript is an *input*, produced by the agent (vision), a local model (Whisper, text-extract), or an API. Don't add a hard model dependency to the core.
+3. **File = canonical truth; the index is a regenerable cache.** Anything in `.dnr.db` (except authoritative db-only records) must be reconstructable from the files.
+4. **Determinism is load-bearing.** `content_hash` must stay invariant when the record is embedded, and re-embedding identical content must be byte-stable. New carriers must preserve this (see below). No timestamps in records/embeds.
+5. **`trusted ≠ faithful`.** Signing proves provenance + file-match, not transcription accuracy. Don't conflate them; surface quality, don't fake it.
+6. **No sidecar files, no per-folder notes.** Records live in-file or db-only; discovery rides on the artifacts' self-description.
+
+If a change rubs against one of these, say so in the PR — sometimes the principle should evolve, but it should be a conscious decision (see [qna.md](qna.md) for ones already settled).
+
+## Adding a format carrier (common contribution)
+
+To make a new format embed *in-file* (instead of db-only):
+
+1. Add `embed_<fmt>` / `extract_<fmt>` / `strip_<fmt>` in `embed.py`, register them in `_EMBED`/`_EXTRACT`/`_STRIP`.
+2. **Critical:** embedding must NOT change the file's *decoded content* — `hashing.content_hash` must be invariant before/after embed (e.g. for JPEG, insert a metadata segment without re-encoding the pixels). Re-embedding identical input must be byte-stable.
+3. Update `formats.py` (`SUPPORTED`) and add a test asserting: round-trip, `content_hash` invariance, idempotent re-embed, and `strip`.
+
+## Tests
+
+- Every change needs a test. The suite is the contract — keep it green and fast (no network, no large fixtures).
+- Use `tmp_path` and set `DNR_HOME` to an isolated dir (see the fixtures in `tests/`).
+- Run `pytest` before opening a PR.
+
+## Pull requests
+
+- Branch from `main`; keep PRs focused.
+- Conventional-ish commit subjects: `feat(dnr): …`, `fix(dnr): …`, `test: …`, `docs: …`.
+- If you changed agent-facing behavior, regenerate `SKILL.md` (`dnr skill > SKILL.md`) and update it.
+- Describe *what* and *why*; link any relevant `qna.md` / spec section.
+
+## Reporting bugs / ideas
+
+Open an issue (templates provided). For anything security-sensitive, see [SECURITY.md](SECURITY.md) — don't file a public issue for vulnerabilities.
+
+By contributing you agree your work is licensed under the project's [MIT License](LICENSE).
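Reviewer sketch for the carrier rules in CONTRIBUTING.md ("Adding a format carrier", steps 1 to 3). This is not the code in `embed.py` or `hashing.py`, which this chunk does not show. It is a toy PNG example under stated assumptions: `toy_content_hash`, `toy_embed`, `tiny_png`, and the `dnr-demo` keyword are hypothetical names, and hashing the decompressed `IDAT` stream is a simplification of "decoded pixels". It only illustrates the two properties the checklist asks for: the content hash does not move when a record is embedded, and re-embedding the same record is byte-stable.

```python
import hashlib
import struct
import zlib

PNG_SIG = b"\x89PNG\r\n\x1a\n"
KEYWORD = b"dnr-demo"  # hypothetical iTXt keyword, not dnr's real one


def chunk(ctype: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type+data."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data) for every chunk after the signature."""
    assert png.startswith(PNG_SIG)
    pos = len(PNG_SIG)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length


def toy_content_hash(png: bytes) -> str:
    """Hash only the decompressed image data, ignoring metadata chunks."""
    idat = b"".join(d for t, d in iter_chunks(png) if t == b"IDAT")
    return hashlib.sha256(zlib.decompress(idat)).hexdigest()


def toy_embed(png: bytes, record_json: str) -> bytes:
    """Insert the record as an uncompressed iTXt chunk right before IEND."""
    # keyword NUL, compression flag 0, method 0, empty language tag NUL,
    # empty translated keyword NUL, then the UTF-8 text.
    itxt = KEYWORD + b"\x00" + b"\x00\x00" + b"\x00" + b"\x00" + record_json.encode("utf-8")
    out = [PNG_SIG]
    for ctype, data in iter_chunks(png):
        if ctype == b"iTXt" and data.startswith(KEYWORD + b"\x00"):
            continue  # drop a previously embedded record
        if ctype == b"IEND":
            out.append(chunk(b"iTXt", itxt))
        out.append(chunk(ctype, data))
    return b"".join(out)


def tiny_png() -> bytes:
    """Build a 1x1 8-bit grayscale PNG in memory."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    raw = b"\x00\x7f"  # filter byte + one gray pixel
    return PNG_SIG + chunk(b"IHDR", ihdr) + chunk(b"IDAT", zlib.compress(raw)) + chunk(b"IEND", b"")
```

A round-trip test in the spirit of step 3 would embed once, embed again, and compare both the hash and the raw bytes. The image chunks are copied through untouched, which is why the hash cannot change.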
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 june lee
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,153 @@
+# dnr — Milestones
+
+Build roadmap. Full design → [vision.md](vision.md). Status: ✅ done · 🔜 in progress / next · ⬜ todo
+
+**v0.1 goal —** a working `dnr` that ingests PDF + audio (transcribe → canonical-hash → deterministic embed → sign), builds a per-folder queryable index (Korean/CJK search included), and lets an agent read/query with **no install**. Fundamentals-first: the `content_hash` and signing primitives are *proven* before the rest is layered on.
+
+**Critical path:** M1 → M2 → (M3 ∥ M4) → M5 → M6 → M7 → M8 → M9. **v0.1 cut** = M1–M8 (build) + **M9 (dogfood — the real release-readiness gate)**. **M10–M14** = operability, security, the standard, scale, release.
+
+**Progress (2026-06-20):** working `dnr` package + CLI — `hashing`/`record`(JCS)/`embed`(PDF·mp3·sidecar; gates 1·2·4)/`signing`(Ed25519+keyring); `transcribe` (transcriber-agnostic: local text-extract + agent path + Whisper provider) + `guide` (verbatim contract `dnr-verbatim-1`); `ingest`/`read_cached` (skip-reparse, idempotent); `index` (`.dnr.db` fixed table + FTS5 **trigram for CJK** + incremental scan + move resilience + tombstone). CLI: **keygen·ingest·record·read·verify·guide·types·index·query**. End-to-end (ingest→index→query→read) works with **zero API keys**; `dnr init` installs the agent skill (one-phrase bootstrap); **57 tests green.** **M1–M12 landed.** Then a **broad multi-user dogfood** (11 personas, each in an isolated folder) found 2 ship-blockers the targeted run missed — **both now fixed**: (1) the **index/query trusted UNSIGNED records** (security bypass — `read`/`verify` refused forged records but `query` surfaced them); `scan` now verifies signature + content_hash before harvesting. (2) **duplicate-content PRIMARY-KEY collision** silently dropped a file; rows are now keyed by **path**. Also added `query --list`, "no results"/CJK-short-term hints, and stripped-record removal from the index. **Then the 3 high dogfood items were fixed too:** non-PDF `ingest` (text → `method:none` sidecar, searchable; images/unknown → clean "use `dnr record`" errors, no pypdf crash); **CJK <3-char search** (LIKE substring fallback → 2-char Korean terms 계약/이혼/특허 now match); **spec `content_hash`** now documents the `<CS>`/`<IM>` framing + ships **golden vectors** (`spec/vectors/`, text + audio). **Then format coverage expanded:** **docx** (local python-docx text-extract), **images** (JPEG/PNG/TIFF/WEBP — pixel content_hash + agent `dnr record` + sidecar), and OOXML content_hash; cross-format search verified (docx + image in one folder). **Then a real-corpus dogfood** (the founder's `law-example` — 12 real Korean legal docs) found + fixed 4 more bugs: **macOS NFD paths** (now NFC-normalized in the index), **`start_date` as a real column** (`--where` now consistent with `--sort`), **language auto-detect** (`lang='ko'` now works for local ingest), and **filename-searchable FTS** (terms in the filename now match). Query surface also gained `--tag`, `--sort/--desc`, `--match --context N` (KWIC snippets), `--list`, and `record --tags`. (`#5` Korean-PDF word-spacing is inherent to CJK PDF text layers — not fixable from text; use the vision/`dnr record` path.) **76 tests green**, all 4 fixes verified on the real corpus. Then **sidecars were removed entirely** (`.dnr.json` gone): **images now embed in-file** (PNG iTXt / JPEG APP segment — pixels untouched, content_hash invariant, multi-segment chunking for >64KB), text/docx/etc. store a **db-only** record in `.dnr.db` (authoritative; preserved across re-scans), and `--no-embed` forces db-only for evidentiary originals (file byte-identical). Distribution also moved from a per-folder note to a **fetch-once `SKILL.md`** + each record's `_about` self-pointer. Then a **query-memory** layer landed: **composed queries** (`--match` ∩ `--tag a,b` ∩ `--since/--until` ∩ `--where`, one shot), **saved queries** (`--save`/`dnr queries`/`--use` — stores the *query*, re-runs live so it never goes stale), and **`dnr tag <file> <tag>…`** so an agent accumulates tags as it works (carrier files re-indexed immediately). Remaining debt: golden vectors / cross-tool, proper `dnr:` XMP namespace, more in-file carriers (OOXML for docx, audio containers, video), pre-query auto-scan, ingest lock. **Distribution decided: assume Python 3.10+ (pip/pipx/`uvx --from donotreadagain dnr`) — Python's stdlib `sqlite3` covers the read path, so one dependency does both; a standalone per-platform binary for Python-less environments is deferred post-1.0.** Verified: a clean venv `pip install .` yields a working standalone `dnr` (no source tree), sqlite included.
+
+---
+
+## ✅ M0 — Foundation validated
+> Prove the load-bearing assumption before building on it.
+- [x] Design doc (`vision.md`) — architecture, schema, hashing, trust, distribution
+- [x] Repo init (git, MIT, README, .gitignore)
+- [x] **make-or-break experiment** — `content_hash` invariant under embed + every re-save mode (PDF/WAV)
+- [x] Deterministic embed recipe found → conformance **gate 4**
+
+## 🔜 M1 — Canonicalization core + conformance harness
+> The single primitive everything rests on — *and* the test infra that makes "any tool agrees" real.
+- [x] `content_hash(pdf)` — decompressed content streams + image XObjects, page order
+- [~] `content_hash(audio)` done (mp3 frames + wav data chunk); remaining: `content_hash(image)` (decoded pixels), `content_hash(ooxml)` (member manifest)
+- [~] Canonical record serialization — SHA-256 + RFC 8785 JCS done; NFC text normalization remaining
+- [ ] **Conformance harness** — golden test vectors per format + a runnable suite, wired into **CI** (gates run every commit)
+- [ ] **Cross-tool / cross-version determinism** — same `content_hash` across pikepdf/qpdf versions (and a 2nd library), not just self-consistency
+- [ ] Follow-up validations: real **scanned** PDF (image-only), multi-MB payload, real **mp3**
+- **Done when:** two independent tools/versions agree on `content_hash` for a real corpus, with published vectors.
+
+## 🔜 M2 — Embed / extract engine (carriers)
+> Write & read the record in each format's native slot, safely.
+- [~] Write: **PDF (XMP) · mp3 (ID3 TXXX) · sidecar `.dnr.json`** done; remaining: proper `dnr:` namespace, JPEG/PNG/TIFF/MP4, Vorbis, OOXML
+- [x] **Deterministic embed** (`deterministic_id`, no auto-timestamps) — gate 4
+- [x] **Atomic write** (temp + fsync + rename) — never mutate the original in place
+- [x] Preserve native tags (gate 2) · read-back + verify `content_hash` (gate 1)
+- [ ] Sidecar fallback rules: no slot / over size limit / read-only / sensitive
+- **Done when:** all 4 conformance gates pass per carrier in CI.
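Reviewer sketch for the "Atomic write (temp + fsync + rename)" item in M2. The diff does not show how `embed.py` implements it, so this is only the general pattern using the standard library, with a hypothetical helper name `atomic_write`. The new bytes go to a temporary file in the same directory, are flushed to disk, and then replace the target in one rename, so a reader sees either the old file or the new one.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to path via temp file + fsync + rename (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp, path)     # atomic swap on POSIX; overwrites on Windows too
    except BaseException:
        # Leave the original untouched and clean up the partial temp file.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

If the process dies before `os.replace`, the original file is still intact and only a stray temp file is left behind.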
+
+## 🔜 M3 — Signing & trust
+> Make a record trustworthy enough to justify skipping a re-read.
+- [x] `record_hash = sha256(JCS(record − sig))`, Ed25519 sign / verify
+- [~] Keygen + trust list done; persistent local keyring remaining
+- [ ] Verify → trust tiers: signed + trusted + hash-match → **skip-reparse**; else **search-only + fallback**
+- [ ] `transcript` always handled as untrusted data, never as instructions
+- **Done when:** forged / altered / untrusted-key records are correctly refused for skip-reparse.
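Reviewer sketch for the M3 formula `record_hash = sha256(JCS(record − sig))`. This is an approximation, not the project's `record.py`: `json.dumps` with sorted keys and compact separators matches RFC 8785 only for simple records (ASCII keys, strings, small integers), and a real implementation would need a full JCS serializer. The field names other than `sig` are taken from elsewhere in this diff for illustration, and the Ed25519 step is left out because it is not in the Python standard library.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Approximate JCS: sorted keys, no insignificant whitespace, UTF-8 output."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def record_hash(record: dict) -> str:
    """Hash the record with its signature field removed."""
    unsigned = {k: v for k, v in record.items() if k != "sig"}
    return hashlib.sha256(canonical_bytes(unsigned)).hexdigest()
```

Because the signature is excluded before hashing, the same digest can be computed both when signing and when verifying, and any change to the transcript or `content_hash` changes the digest.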
+
+## 🔜 M4 — Transcription (ingest)
+> Turn a raw file into a faithful record, once. **dnr owns no model** — the transcript is supplied by the agent or a local provider.
+- [x] **Transcriber-agnostic ingest pipeline** — content_hash → transcribe → record → sign → embed
+- [x] **Local `text-extract`** (pypdf, born-digital PDF, NFC) · **agent-supplied** path (`dnr record`) · **text files** (.txt/.md/.json/… → `method:none` sidecar) · clean errors for images/unknown
+- [ ] Local models: **Whisper** (audio) · local OCR/vision (scans); optional hosted API
+- [ ] Method hierarchy enforced: `text-extract` → `vision` → (`ocr` demoted)
+- [ ] **Verbatim** transcription contract (prompt) shipped in the skill: complete, no summary, mark uncertainty
+- [ ] provenance: version, instruction_id, prompt_hash, confidence; per-segment language tagging (feeds M6)
+- [ ] Cost control: query-driven lazy ingest · ask-the-user · `dnr ingest --glob --budget`
+- **Done when:** a PDF / audio ingests into a verbatim, signed, embedded record with provenance — agent-supplied or local, no API key required.
+
+## 🔜 M5 — Index (query layer)
+> A folder becomes a queryable table — cheaply, incrementally.
+- [x] `.dnr.db`: fixed base table + `dnr_fts` (FTS5) + `_dnr_readme`
+- [x] Incremental scan: stat → record → harvest; tombstone deletes
+- [~] **index ≠ ingest** done; cold-folder media → currently *skipped* (pending-rows TODO)
+- [ ] Pre-query incremental scan (`--no-scan` to skip)
+- [x] Move resilience: `content_hash` match → update `path` only
+- [~] Concurrency: SQLite **WAL** on; `content_hash` ingest lock TODO
+- **Done when:** second scan is fast (stat-skips ✅), queries are fresh, moves don't re-transcribe ✅
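Reviewer sketch for the M5 item "Move resilience: `content_hash` match → update `path` only". The real `.dnr.db` schema is not visible in this chunk, so the `files` table, its columns, and the helper `note_file` are hypothetical. The idea shown: rows are keyed by path, and when a scan meets an unknown path whose content hash already belongs to a row whose old path has disappeared, only the path is rewritten and the stored transcript is kept.

```python
import sqlite3


def open_index() -> sqlite3.Connection:
    """In-memory stand-in for a per-folder index (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files ("
        " path TEXT PRIMARY KEY,"
        " content_hash TEXT NOT NULL,"
        " transcript TEXT)"
    )
    return conn


def note_file(conn, path: str, content_hash: str, missing_paths: set) -> str:
    """Classify a scanned file; missing_paths are indexed paths no longer on disk."""
    row = conn.execute(
        "SELECT content_hash FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row is not None:
        return "unchanged" if row[0] == content_hash else "changed"
    candidates = conn.execute(
        "SELECT path FROM files WHERE content_hash = ?", (content_hash,)
    ).fetchall()
    for (old_path,) in candidates:
        if old_path in missing_paths:
            # Same content, old location gone: treat as a move, keep the transcript.
            conn.execute(
                "UPDATE files SET path = ? WHERE path = ?", (path, old_path)
            )
            return "moved"
    return "needs-transcription"
```

Keying by path rather than by hash means two files with identical content can both have rows, which is the collision the progress note above describes as fixed.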
+
+## 🔜 M6 — i18n & search quality
+> Make non-English — especially Korean/CJK — actually searchable (the founder's own corpus).
+- [x] **trigram FTS5** + **LIKE substring fallback for <3-char terms** → 2-char Korean (계약/이혼/특허) matches; **filename also searchable** (FTS over name)
+- [x] **NFC normalization end-to-end** — index stores NFC paths/names (fixes macOS NFD); text NFC-normalized
+- [x] **language auto-detect** (script heuristic) → `lang` set on local ingest; `--where lang='ko'` works
+- [~] Multilingual `fields` consistency / RTL / bidi remaining
+- **Done when:** ✅ Korean legal-doc search (incl. 2-char terms + filenames) returns correct hits — verified on the real `law-example` corpus
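Reviewer sketch for the M6 items on NFC normalization and the short-term fallback. It is not taken from `index.py`; the table names `docs` and `docs_fts` and the helpers are hypothetical. Names and text are NFC-normalized on the way in and on the way out, terms of three or more characters go to an FTS5 trigram table, and shorter terms (which a trigram index cannot match) fall back to a `LIKE` substring scan. The trigram tokenizer needs a reasonably recent SQLite build, so the sketch degrades to `LIKE` for everything when creating the virtual table fails.

```python
import sqlite3
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def open_index():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (name TEXT, body TEXT)")
    try:
        conn.execute(
            "CREATE VIRTUAL TABLE docs_fts USING fts5(name, body, tokenize='trigram')"
        )
        has_fts = True
    except sqlite3.OperationalError:
        has_fts = False  # FTS5 or the trigram tokenizer is unavailable
    return conn, has_fts


def add(conn, has_fts: bool, name: str, body: str) -> None:
    row = (nfc(name), nfc(body))  # store NFC so macOS NFD names still match
    conn.execute("INSERT INTO docs VALUES (?, ?)", row)
    if has_fts:
        conn.execute("INSERT INTO docs_fts VALUES (?, ?)", row)


def search(conn, has_fts: bool, term: str) -> list:
    term = nfc(term)
    if has_fts and len(term) >= 3:
        phrase = '"' + term.replace('"', '""') + '"'  # quote as an FTS5 string
        rows = conn.execute(
            "SELECT name FROM docs_fts WHERE docs_fts MATCH ?", (phrase,)
        )
    else:
        escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
        like = "%" + escaped + "%"
        rows = conn.execute(
            "SELECT name FROM docs WHERE name LIKE ? ESCAPE '\\'"
            " OR body LIKE ? ESCAPE '\\'",
            (like, like),
        )
    return sorted(r[0] for r in rows)
```

With this routing, a two-character Korean term such as 계약 is answered by the `LIKE` branch, while a longer term uses the trigram index when it exists.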
|
|
75
|
+
|
|
76
|
+
## 🔜 M7 — CLI & distribution
|
|
77
|
+
> One tool that ties it together, runnable anywhere.
|
|
78
|
+
- [~] `dnr init·ingest·record·read·verify·keygen·guide·types·index·query` done; `seal·strip` TODO
|
|
79
|
+
- [x] Protocol enforced in code (`dnr read/index/query` are real commands, not prose)
|
|
80
|
+
- [ ] `uvx` package **+ single static binary** (per-platform releases) — dependency-free drop-in
|
|
81
|
+
- **Done when:** `uvx dnr index <folder>` and `dnr query` work on a fresh machine, offline (minus transcription API).
|
|
82
|
+
|
|
83
|
+
## 🔜 M8 — Agent integration (consumer)
|
|
84
|
+
> Zero-install consumption by AI agents.
|
|
85
|
+
- [x] **agent skill (`SKILL.md`)**: fixed schema + example queries + consumer contract + verbatim guide; a fetchable skill (frontmatter name/description), **not** a per-folder note
|
|
86
|
+
- [x] **skill encodes the full decision flow** (A: one file → self-validating `dnr read` / B: folder → status→transcribe→index→query) and was **adversarially tested** — fresh agents given only the skill text + 6 scenarios; judged vs the canonical flow + a doc critic over 4 rounds (3→4→5 correct, **0 wrong throughout**), fixing real gaps (read=self-validating; `--sidecar` mutation; transcribe-as-a-step; bulk-only ask-gate) so the wording matches actual CLI behavior
|
|
87
|
+
- [~] Consumer path documented (`dnr read`/`query`; raw `sqlite3` via `_dnr_readme`)
|
|
88
|
+
- [x] **transcribe-first ask-flow** — `dnr status <folder>` reports coverage by cost (model = image/audio/video / parse = PDF·Office / cheap = text); the skill tells the agent to run it on the first folder-wide question and **offer to transcribe-first** when expensive files are un-transcribed (one-time pass → every later view is a cache hit). Verified on the real corpus: `status 자료/` → "0/441 transcribed, 92 model + 202 parse pending → transcribe-first recommended".
|
|
89
|
+
- [x] **No per-folder note — self-describing + fetch-once skill** — every record carries an `_about` pointer (and the `.dnr.db` readme points to the skill), so an agent that meets a dnr artifact fetches `SKILL.md` **once** (committed at the repo root / `dnr skill`) and then knows dnr in every folder; nothing is written into the user's folders. `dnr init` now only ensures a signing key. Run with no install via `uvx --from donotreadagain dnr`.
|
|
90
|
+
- **Done when:** an agent given only the skill queries a dnr folder and skips re-parsing correctly; `dnr init` bootstraps from a single user phrase.
|
|
91
|
+
|
|
92
|
+
## 🔜 M9 — Agent scenario testing & dogfooding
|
|
93
|
+
> Drive the whole thing with real agents across many scenarios — the bugs that specs & unit tests miss surface here, and feed M10–M12. This is the real release-readiness gate.
|
|
94
|
+
- [x] **Scenario matrix**, run by agents: cache-hit, cross-file query, cold folder, move, freshness, incremental, CJK, + adversarial
|
|
95
|
+
- [~] **Multi-harness**: exercised via the real `dnr` CLI by agents; actual Claude Code / Codex / Cursor runs are a broader TODO
|
|
96
|
+
- [x] **Adversarial / edge**: forged-unsigned (refused), tampered-signed (verify fails), freshness (no stale leak), corrupt/garbage file
|
|
97
|
+
- [x] **Measure**: 8/10 pass, security held, failure taxonomy + prioritized backlog produced
|
|
98
|
+
- [x] Run as a **multi-agent workflow** (10 scenarios in parallel + synthesis)
|
|
99
|
+
- **Done when:** ✅ matrix complete with a failure list; fixes fed back (corrupt-file robustness → done). Remaining low-pri: CJK <3-char FTS, file-embedded-cache note.
|
|
100
|
+
|
|
101
|
+
## 🔜 M10 — Reversibility & corpus operability
|
|
102
|
+
> Make it safe to undo, and runnable at corpus scale.
|
|
103
|
+
- [x] **Robustness** (from M9 dogfooding): corrupt/missing files no longer crash — clean errors, one bad file never aborts a scan; `dnr read` falls back gracefully
|
|
104
|
+
- [x] `dnr strip` (un-embed, in-file + sidecar; content unchanged) · **bulk rollback** TODO
|
|
105
|
+
- [ ] **Resumable / idempotent** ingest after crash · `--dry-run`
|
|
106
|
+
- [ ] Rebuild a corrupted/lost `.dnr.db` without re-incurring transcription
|
|
107
|
+
- [ ] **Model-upgrade policy**: re-transcribe only lossy methods (asr/ocr/vision), skip `text-extract`; partial/lazy migration; mixed-version coherence
|
|
108
|
+
- [ ] Backup/dedup awareness (embedding changes whole_hash → re-backup churn)
|
|
109
|
+
- **Done when:** a bad bulk ingest is fully revertible and a crashed run resumes cleanly.
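
The model-upgrade policy item above amounts to a small predicate over a record's provenance. Here is a hedged sketch: the `method` and `transcriber` field names follow the `dnr ingest` output shown in the README, while `needs_retranscription` and the `current_transcriber` parameter are hypothetical names, and the policy itself is still an open TODO in this roadmap.

```python
# Methods whose output depends on a model and can improve with a better one.
LOSSY_METHODS = {"asr", "ocr", "vision"}


def needs_retranscription(record: dict, current_transcriber: str) -> bool:
    """Decide whether a cached record should be redone after a model upgrade.

    Deterministic extraction (e.g. method == "text-extract") is skipped;
    lossy methods are redone only if they came from a different transcriber.
    """
    if record["method"] not in LOSSY_METHODS:
        return False
    return record["transcriber"] != current_transcriber
```

Evaluating this lazily, at read time rather than in one bulk pass, would be one way to get the partial migration mentioned in the same item.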
|
|
110
|
+
|
|
111
|
+
## 🔜 M11 — Security & privacy
|
|
112
|
+
> Treat every embedded record as untrusted input; don't leak on share.
|
|
113
|
+
- [x] **Threat-model document** — `SECURITY.md` (injection, forgery, TOFU, exfiltration, custody) + dogfood evidence
|
|
114
|
+
- [~] `transcript` as untrusted data (skill + contract); injection covered by forged/tampered dogfood; dedicated injection corpus TODO
|
|
115
|
+
- [~] `dnr strip` before sharing done; sensitivity-flag refuse-embed TODO
|
|
116
|
+
- [~] Poisoning surface documented; sanitization helpers TODO
|
|
117
|
+
- **Done when:** ✅ dogfooding showed a malicious/forged/tampered file cannot pass as trusted or steer the agent.
|
|
118
|
+
|
|
119
|
+
## 🔜 M12 — Spec formalization (the standard) ← the goal
|
|
120
|
+
> Make it implementable by others, and able to evolve.
|
|
121
|
+
- [x] `spec/dnr-0.1.md` (normative) + `spec/dnr.schema.json` (JSON Schema) + `dnr validate` / `dnr schema`
|
|
122
|
+
- [x] Carrier mapping table · per-format canonicalization (incl. documented PDF `<CS>`/`<IM>` framing) · conformance gates (in spec)
|
|
123
|
+
- [x] **Golden conformance vectors** — `spec/vectors/` (text + audio) + a test the impl must reproduce; PDF/image/OOXML vectors pending
|
|
124
|
+
- [~] Versioning / forward-compat in spec; profile registry, governance = TODO
|
|
125
|
+
- **Done when:** a second, independent implementation passes the published vectors (text/audio shipped; PDF vector + a real 2nd impl remain).
|
|
126
|
+
|
|
127
|
+
## 🔜 M13 — Format expansion & scale hardening
|
|
128
|
+
- [~] **docx** (local extract) + **images** JPEG/PNG/TIFF/WEBP (pixel content_hash + sidecar + agent record) done; remaining: FLAC/OGG/M4A/MP4·MOV, xlsx/pptx, in-file image XMP/PNG-iTXt, local Whisper wired
|
|
129
|
+
- [ ] Large-corpus performance · multi-agent stress · recovery primitives at scale
|
|
130
|
+
|
|
131
|
+
## ⬜ M14 — Release, governance & adoption
|
|
132
|
+
> Ship, then earn adoption with **proof** — not cold asks. (See "Adoption strategy" below.)
|
|
133
|
+
- [ ] v0.1 public release on GitHub
|
|
134
|
+
- [ ] **Benchmark** (the key adoption asset): measured token / latency savings on re-reads + agent protocol-compliance rate (built on M9's numbers)
|
|
135
|
+
- [ ] 2-minute demo · one-command try (`uvx --from donotreadagain dnr …`)
|
|
136
|
+
- [ ] **Launch posts** (GeekNews / Show HN): lead with the demo + benchmark; CTA = the one-phrase bootstrap (`uvx --from donotreadagain dnr init`). A spike, not a strategy; post only after v0.1 ships and the try-path is frictionless.
|
|
137
|
+
- [ ] **Opt-in surfaces first**: MCP server + skill/`AGENTS.md` snippet (users adopt without any maintainer PR)
|
|
138
|
+
- [ ] **OKF-compatible sidecar emit** (ride existing rails, don't fight them)
|
|
139
|
+
- [ ] **Targeted integrations**: PRs only to projects with a clean plugin/tool extension point — benefit-first, after the benchmark exists
|
|
140
|
+
- [ ] Governance: contribution process + spec change control
|
|
141
|
+
- **Done when:** ≥1 external project/user adopts via the opt-in surface, with the benchmark as evidence.
|
|
142
|
+
|
|
143
|
+
---
|
|
144
|
+
|
|
145
|
+
### Adoption strategy (M14) — why "proof-then-pitch", not cold PRs
|
|
146
|
+
1. **Prove first.** Maintainers adopt things that already work + have a number, not specs. Ship → benchmark (token/latency savings) → a few real users → *then* integrate.
|
|
147
|
+
2. **Opt-in beats PR.** Consumption is ambient `sqlite3`, so the integration is tiny — and an **MCP server / skill snippet** lets users turn it on with zero maintainer change, sidestepping PR rejection. Reserve real PRs for projects with a plugin/tool registry.
|
|
148
|
+
3. **Benefit-first messaging.** Not "adopt my standard" — "your agent re-parses PDFs every turn; drop this in for an N% saving." Show, don't tell.
|
|
149
|
+
4. **Target narrowly.** Document-heavy RAG / coding / research agents where re-parsing is a felt pain. 50 drive-by PRs read as spam and cost reputation; a few working, wanted integrations win.
|
|
150
|
+
|
|
151
|
+
---
|
|
152
|
+
|
|
153
|
+
*All docs in English (public repo).*
|
|
@@ -0,0 +1,151 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: donotreadagain
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Read once, never again — self-describing files so AI agents stop re-parsing.
|
|
5
|
+
Project-URL: Homepage, https://github.com/melodysdreamj/donotreadagain
|
|
6
|
+
Project-URL: Repository, https://github.com/melodysdreamj/donotreadagain
|
|
7
|
+
Project-URL: Issues, https://github.com/melodysdreamj/donotreadagain/issues
|
|
8
|
+
Author: june lee
|
|
9
|
+
License: MIT
|
|
10
|
+
License-File: LICENSE
|
|
11
|
+
Keywords: agents,ai,cache,metadata,pdf,self-describing,transcript,xmp
|
|
12
|
+
Requires-Python: >=3.10
|
|
13
|
+
Requires-Dist: cryptography
|
|
14
|
+
Requires-Dist: jsonschema
|
|
15
|
+
Requires-Dist: mutagen
|
|
16
|
+
Requires-Dist: pikepdf>=8
|
|
17
|
+
Requires-Dist: pillow
|
|
18
|
+
Requires-Dist: pypdf
|
|
19
|
+
Requires-Dist: python-docx
|
|
20
|
+
Requires-Dist: rfc8785
|
|
21
|
+
Provides-Extra: dev
|
|
22
|
+
Requires-Dist: fpdf2; extra == 'dev'
|
|
23
|
+
Requires-Dist: pytest; extra == 'dev'
|
|
24
|
+
Description-Content-Type: text/markdown
|
|
25
|
+
|
|
26
|
+
# donotreadagain (`dnr`)
|
|
27
|
+
|
|
28
|
+
> **Read once, never again.** Embed a faithful, signed AI transcript into each expensive-to-parse file's own metadata, so AI agents stop re-OCRing and re-parsing the same PDF, image, scan, or audio every time.
|
|
29
|
+
|
|
30
|
+
**status:** v0.1, pre-release
|
|
31
|
+
|
|
32
|
+
---
|
|
33
|
+
|
|
34
|
+
## The problem
|
|
35
|
+
|
|
36
|
+
AI agents re-parse the same file *every time they touch it* — re-OCR a scan, re-run vision on a screenshot, re-transcribe an audio clip, re-extract a PDF. It's slow, it burns tokens and model calls, and it's non-deterministic. In **repeat-access corpora** (legal, research, compliance) the same documents get read dozens of times; with **multi-agent** setups every agent re-parses independently. The waste compounds exactly where it hurts.
|
|
37
|
+
|
|
38
|
+
## The idea
|
|
39
|
+
|
|
40
|
+
dnr reads a file **once**, then writes a verbatim transcript + structured metadata **into the file's own native metadata slot** as a *signed* JSON record — the file becomes **self-describing**. Any agent that opens it later reads the cached transcript instead of re-parsing. A per-folder SQLite + FTS5 index makes a whole folder searchable without opening anything.
|
|
41
|
+
|
|
42
|
+
The second view is the win:
|
|
43
|
+
|
|
44
|
+
| | first view (re-parse) | second view (cached) |
|
|
45
|
+
|---|---|---|
|
|
46
|
+
| born-digital PDF | ~1.4 s (pypdf) | ~60 ms — **~22× faster** |
|
|
47
|
+
| image / scan / audio | a vision / Whisper model call | a few ms of text — **no model at all** |
|
|
48
|
+
|
|
49
|
+
…and the cache is **trustworthy**: a record is used only if it's signed by a trusted key *and* its `content_hash` still matches the file, so "fast" never means "stale or forged."
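
The two-part gate described above (trusted signature *and* matching `content_hash`) can be sketched in a few lines. This is an illustration under stated assumptions, not dnr's code: it uses an HMAC from the standard library as a stand-in for the real signature scheme (the package depends on `cryptography` and `rfc8785`, so the actual scheme is likely an asymmetric signature over canonical JSON), it hashes the whole byte string rather than a per-format canonical content, and every function name here is hypothetical.

```python
import hashlib
import hmac
import json


def content_hash(data: bytes) -> str:
    """Toy content hash: SHA-256 over all bytes (dnr canonicalizes per format)."""
    return hashlib.sha256(data).hexdigest()


def sign(body: dict, key: bytes) -> str:
    """Stand-in signature: HMAC-SHA256 over a deterministic JSON encoding."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def make_record(data: bytes, transcript: str, key_id: str, key: bytes) -> dict:
    body = {"content_hash": content_hash(data), "transcript": transcript, "key_id": key_id}
    return {**body, "sig": sign(body, key)}


def cached_transcript(data: bytes, record: dict, trusted_keys: dict) -> "str | None":
    """Return the transcript only if the record is trusted AND still fresh."""
    body = {k: v for k, v in record.items() if k != "sig"}
    key = trusted_keys.get(body.get("key_id"))
    if key is None:
        return None  # unsigned, or signed by an unknown key
    if not hmac.compare_digest(sign(body, key), record.get("sig", "")):
        return None  # tampered record
    if body["content_hash"] != content_hash(data):
        return None  # file changed since transcription: stale
    return body["transcript"]
```

Any failed check is treated as a plain cache miss, so the caller falls back to re-parsing instead of trusting a doubtful record.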
|
|
50
|
+
|
|
51
|
+
## Demo
|
|
52
|
+
|
|
53
|
+
```console
|
|
54
|
+
$ dnr ingest contract.pdf # transcribe once → sign → embed in the file
|
|
55
|
+
ingested contract.pdf [in-file]
|
|
56
|
+
method=text-extract transcriber=pypdf
|
|
57
|
+
signed key_id=ce6d170a497238f7
|
|
58
|
+
|
|
59
|
+
$ dnr read contract.pdf # later (or from any agent): verified cache hit — no re-parsing
|
|
60
|
+
LOAN AGREEMENT
|
|
61
|
+
Lender: Acme Capital LLC
|
|
62
|
+
Borrower: Jordan Smith
|
|
63
|
+
Principal: USD 1,200,000
|
|
64
|
+
...
|
|
65
|
+
|
|
66
|
+
$ dnr index ./contracts
|
|
67
|
+
$ dnr query ./contracts --match damages --context 40 # search a whole folder, no files opened
|
|
68
|
+
contract.pdf
|
|
69
|
+
… Principal: USD 1,200,000 Maturity: 2026-12-31 Damages clause: section 7.
|
|
70
|
+
```
|
|
71
|
+
|
|
72
|
+
The transcript lives *inside* `contract.pdf` — move it, email it, hand it to another agent, and the cached transcript travels with it.
|
|
73
|
+
|
|
74
|
+
## Quickstart
|
|
75
|
+
|
|
76
|
+
Requires **Python 3.10+** (its stdlib includes the `sqlite3` used to read the index — one dependency covers both).
|
|
77
|
+
|
|
78
|
+
```bash
|
|
79
|
+
# run with no persistent install:
|
|
80
|
+
uvx --from donotreadagain dnr <cmd>
|
|
81
|
+
# or install:
|
|
82
|
+
pipx install donotreadagain # or: pip install donotreadagain
|
|
83
|
+
```
|
|
84
|
+
|
|
85
|
+
```bash
|
|
86
|
+
dnr ingest report.pdf # transcribe once (local) → sign → embed in the file
|
|
87
|
+
dnr read report.pdf # print the cached transcript (verified), or fall back
|
|
88
|
+
dnr index ./case-folder # build .dnr.db
|
|
89
|
+
dnr query ./case-folder --match "손해배상" --tag 가압류 --since 2025-01-01   # CJK terms: "손해배상" = damages, "가압류" = provisional attachment
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
For a scan / image / anything you must *look* at, the agent transcribes it and records the result:
|
|
93
|
+
|
|
94
|
+
```bash
|
|
95
|
+
dnr record scan.png --transcript-file t.md --method vision --transcriber <your-model>
|
|
96
|
+
```
|
|
97
|
+
|
|
98
|
+
## How it fits together
|
|
99
|
+
|
|
100
|
+
```
|
|
101
|
+
File = canonical truth Index .dnr.db = derived, regenerable
|
|
102
|
+
┌────────────────────────────┐ harvest ┌────────────────────────────┐
|
|
103
|
+
│ signed dnr record │ ───────▶ │ fixed table + FTS5 search │
|
|
104
|
+
│ content_hash · transcript │ │ path · tags · transcript … │
|
|
105
|
+
│ provenance · fields · sig │ └────────────────────────────┘
|
|
106
|
+
└────────────────────────────┘ ▲ query via sqlite3 — no dnr install needed
|
|
107
|
+
▲ transcribe · sign · embed once (expensive)
|
|
108
|
+
```
|
|
109
|
+
|
|
110
|
+
**Where the record lives (no sidecar files):**
|
|
111
|
+
- **In-file** for formats with a metadata slot — PDF→XMP, MP3→ID3, PNG→iTXt, JPEG→APP segment. Pixels/bytes-of-content untouched (`content_hash` invariant), so the transcript **travels with the file** (move it, email it — it's still there).
|
|
112
|
+
- **db-only** in the folder's `.dnr.db` for formats with no slot yet (docx, …), or via `--no-embed` for evidentiary originals you must not modify (file left byte-identical).
|
|
113
|
+
- **Nothing** for already-readable text (`.txt`/`.md`/`.csv`) — an agent just reads it.
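
To see why an in-file record can leave `content_hash` unchanged while the file's bytes do change, consider a toy WAV example. This is a sketch under assumptions: hashing the decoded audio frames is only one plausible canonicalization (the normative rules live in `spec/dnr-0.1.md`), the `note` chunk is a made-up carrier rather than the slot dnr actually uses, and all function names are hypothetical.

```python
import hashlib
import io
import struct
import wave


def make_wav(frames: bytes) -> bytes:
    """Build a tiny mono 16-bit WAV file in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(8000)
        w.writeframes(frames)
    return buf.getvalue()


def append_chunk(wav_bytes: bytes, chunk_id: bytes, payload: bytes) -> bytes:
    """Append an extra RIFF chunk after the audio data and fix the RIFF size."""
    pad = b"\x00" if len(payload) % 2 else b""  # RIFF chunks are word-aligned
    chunk = chunk_id + struct.pack("<I", len(payload)) + payload + pad
    body = wav_bytes[8:] + chunk
    return b"RIFF" + struct.pack("<I", len(body)) + body


def content_hash(wav_bytes: bytes) -> str:
    """Hash only the audio frames, ignoring any metadata chunks."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as r:
        return hashlib.sha256(r.readframes(r.getnframes())).hexdigest()


def whole_hash(wav_bytes: bytes) -> str:
    """Hash every byte of the file, metadata included."""
    return hashlib.sha256(wav_bytes).hexdigest()
```

Appending metadata with `append_chunk` changes `whole_hash` but not `content_hash`, which is also the source of the backup/dedup churn noted in the roadmap.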
|
|
114
|
+
|
|
115
|
+
## Using it
|
|
116
|
+
|
|
117
|
+
- **Read (consumer):** `dnr read <file>` returns the cached transcript only if it's present, trusted, and still matches (self-validating — a changed file silently misses). No dnr tool? An agent can read `.dnr.db` directly with ambient `sqlite3` (the db's `_dnr_readme` table self-describes).
|
|
118
|
+
- **Transcribe (producer):** `dnr ingest` (local: pypdf / Whisper / python-docx) or `dnr record` (agent supplies a vision transcript). dnr **owns no model** — the transcript is an input from whoever's best placed.
|
|
119
|
+
- **Query a folder:** `dnr query <folder>` combines `--match` (FTS, Korean/CJK ok) ∩ `--tag a,b` ∩ `--since/--until` ∩ `--where`; plus `--any` (OR sweep), `--dedup`, `--context` (KWIC), `--format json`. Save composed queries with `--save`/`--use`; accumulate labels with `dnr tag`.
|
|
120
|
+
- **Agents onboard once:** point an agent at a dnr folder and it fetches **[SKILL.md](SKILL.md)** once — then it knows dnr everywhere. `dnr init` just ensures a signing key; nothing is written into your folders.
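
The "ambient `sqlite3`" path mentioned above needs nothing beyond the Python standard library. The sketch below builds a throwaway in-memory index and runs a full-text match combined with a tag filter. The table name `docs` and its columns are assumptions for illustration only; a real `.dnr.db` describes its own schema in the `_dnr_readme` table, which an agent should read first. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3


def build_demo_index() -> sqlite3.Connection:
    """Create an in-memory FTS5 table with a hypothetical schema."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE VIRTUAL TABLE docs USING fts5(path UNINDEXED, tag UNINDEXED, transcript)"
    )
    con.executemany(
        "INSERT INTO docs (path, tag, transcript) VALUES (?, ?, ?)",
        [
            ("contract.pdf", "loan", "Principal: USD 1,200,000 Damages clause: section 7."),
            ("memo.pdf", "internal", "Lunch schedule for the week."),
            ("claim.pdf", "court", "Claim for damages filed by the borrower."),
        ],
    )
    return con


def query(con: sqlite3.Connection, match: str, tag: "str | None" = None) -> list:
    """Full-text match on transcripts, optionally intersected with a tag."""
    sql = "SELECT path FROM docs WHERE docs MATCH ?"
    params = [match]
    if tag is not None:
        sql += " AND tag = ?"
        params.append(tag)
    return sorted(row[0] for row in con.execute(sql, params))
```

Calling `query(con, "damages", tag="loan")` mirrors the intersection semantics of `--match` ∩ `--tag` without opening any of the underlying files.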
|
|
121
|
+
|
|
122
|
+
## Design principles
|
|
123
|
+
|
|
124
|
+
- **dnr is the deterministic substrate; the agent is the intelligence.** dnr does verifiable primitives (hash, sign, full-text/structured query); it never *infers* metadata (dates, parties, topics) or does fuzzy semantic search — that's the agent's job. Set metadata explicitly with `dnr tag` / `dnr date`.
|
|
125
|
+
- **File = truth, index = regenerable cache.** Delete `.dnr.db` and rebuild it from the files anytime.
|
|
126
|
+
- **Transcriber-agnostic.** dnr ships a *contract* (the verbatim guide) + a *trust layer*, not a model. Fidelity is the transcriber's; provenance is recorded so a consumer can apply its own quality policy (`trusted ≠ faithful`).
|
|
127
|
+
|
|
128
|
+
## Status & honest limits
|
|
129
|
+
|
|
130
|
+
v0.1, pre-release. Works today for repeat-access corpora; validated by real-corpus dogfooding. Known limits we're explicit about:
|
|
131
|
+
- **Adoption is the real lever.** The value compounds when agents *know* dnr (a skill, eventually native support) — not from the tool alone.
|
|
132
|
+
- **`trusted ≠ faithful`.** A signature proves *who made it + that it matches the file*, not that the transcription is accurate. Low-quality/garbled transcripts are flagged (`dnr status`), not silently trusted.
|
|
133
|
+
- **Not yet published** to PyPI; a standalone binary for Python-less environments is future work.
|
|
134
|
+
|
|
135
|
+
See **[vision.md](vision.md)** (design) · **[spec/dnr-0.1.md](spec/dnr-0.1.md)** (spec) · **[SECURITY.md](SECURITY.md)** (threat model) · **[qna.md](qna.md)** (settled design decisions) · **[MILESTONES.md](MILESTONES.md)** (roadmap).
|
|
136
|
+
|
|
137
|
+
## Development
|
|
138
|
+
|
|
139
|
+
```bash
|
|
140
|
+
git clone https://github.com/melodysdreamj/donotreadagain
|
|
141
|
+
cd donotreadagain
|
|
142
|
+
python -m venv .venv && . .venv/bin/activate
|
|
143
|
+
pip install -e ".[dev]"
|
|
144
|
+
pytest # the suite is green and fast
|
|
145
|
+
```
|
|
146
|
+
|
|
147
|
+
Contributions welcome — see **[CONTRIBUTING.md](CONTRIBUTING.md)**.
|
|
148
|
+
|
|
149
|
+
## License
|
|
150
|
+
|
|
151
|
+
[MIT](LICENSE) © 2026 june lee
|