diary-kg 0.92.5__tar.gz

@@ -0,0 +1,445 @@
1
+ Metadata-Version: 2.4
2
+ Name: diary-kg
3
+ Version: 0.92.5
4
+ Summary: A knowledge graph builder and semantic search engine for diaries and journals
5
+ License-Expression: Elastic-2.0
6
+ Keywords: knowledge-graph,diary,journal,lancedb,sqlite,semantic-search,nlp
7
+ Author: Eric G. Suchanek, PhD
8
+ Author-email: suchanek@mac.com
9
+ Requires-Python: >=3.12,<3.14
10
+ Classifier: Development Status :: 3 - Alpha
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: Intended Audience :: Science/Research
13
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
14
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
15
+ Classifier: Topic :: Text Processing :: Linguistic
16
+ Classifier: Programming Language :: Python :: 3
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Programming Language :: Python :: 3.13
19
+ Provides-Extra: all
20
+ Provides-Extra: dev
21
+ Provides-Extra: kgdeps
22
+ Provides-Extra: viz
23
+ Provides-Extra: viz3d
24
+ Requires-Dist: PyQt5 (>=5.15.0) ; extra == "all"
25
+ Requires-Dist: PyQt5 (>=5.15.0) ; extra == "viz3d"
26
+ Requires-Dist: click (>=8.1.0,<9)
27
+ Requires-Dist: detect-secrets (>=1.5.0) ; extra == "all"
28
+ Requires-Dist: detect-secrets (>=1.5.0) ; extra == "dev"
29
+ Requires-Dist: doc-kg (>=0.15.0) ; extra == "all"
30
+ Requires-Dist: doc-kg (>=0.15.0) ; extra == "kgdeps"
31
+ Requires-Dist: kgmodule-utils (>=0.2.3)
32
+ Requires-Dist: lancedb (>=0.29.0)
33
+ Requires-Dist: markdown (>=3.6) ; extra == "all"
34
+ Requires-Dist: markdown (>=3.6) ; extra == "viz3d"
35
+ Requires-Dist: mcp (>=1.0.0)
36
+ Requires-Dist: mypy (>=1.10.0) ; extra == "all"
37
+ Requires-Dist: mypy (>=1.10.0) ; extra == "dev"
38
+ Requires-Dist: numpy (>=1.24.0)
39
+ Requires-Dist: pandas (>=2.0.0)
40
+ Requires-Dist: param (>=2.0.0) ; extra == "all"
41
+ Requires-Dist: param (>=2.0.0) ; extra == "viz3d"
42
+ Requires-Dist: pdoc (>=14.0.0) ; extra == "all"
43
+ Requires-Dist: pdoc (>=14.0.0) ; extra == "dev"
44
+ Requires-Dist: plotly (>=5.14.0) ; extra == "all"
45
+ Requires-Dist: plotly (>=5.14.0) ; extra == "viz"
46
+ Requires-Dist: pre-commit (>=4.5.1) ; extra == "all"
47
+ Requires-Dist: pre-commit (>=4.5.1) ; extra == "dev"
48
+ Requires-Dist: pycode-kg (>=0.19.0) ; extra == "all"
49
+ Requires-Dist: pycode-kg (>=0.19.0) ; extra == "kgdeps"
50
+ Requires-Dist: pylint (>=4.0.5) ; extra == "all"
51
+ Requires-Dist: pylint (>=4.0.5) ; extra == "dev"
52
+ Requires-Dist: pytest (>=8.0.0) ; extra == "all"
53
+ Requires-Dist: pytest (>=8.0.0) ; extra == "dev"
54
+ Requires-Dist: pytest-cov (>=5.0.0) ; extra == "all"
55
+ Requires-Dist: pytest-cov (>=5.0.0) ; extra == "dev"
56
+ Requires-Dist: pyvis (>=0.3.2) ; extra == "all"
57
+ Requires-Dist: pyvis (>=0.3.2) ; extra == "viz"
58
+ Requires-Dist: pyvista[jupyter] (>=0.44.0) ; extra == "all"
59
+ Requires-Dist: pyvista[jupyter] (>=0.44.0) ; extra == "viz3d"
60
+ Requires-Dist: pyvistaqt (>=0.11.0) ; extra == "all"
61
+ Requires-Dist: pyvistaqt (>=0.11.0) ; extra == "viz3d"
62
+ Requires-Dist: rich (>=14.3.3,<15)
63
+ Requires-Dist: ruff (>=0.4.0) ; extra == "all"
64
+ Requires-Dist: ruff (>=0.4.0) ; extra == "dev"
65
+ Requires-Dist: sentence-transformers (>=5.4.1)
66
+ Requires-Dist: spacy (>=3.8.11,<4)
67
+ Requires-Dist: streamlit (>=1.35.0) ; extra == "all"
68
+ Requires-Dist: streamlit (>=1.35.0) ; extra == "viz"
69
+ Requires-Dist: trame-vtk (>=2.0.0) ; extra == "all"
70
+ Requires-Dist: trame-vtk (>=2.0.0) ; extra == "viz3d"
71
+ Project-URL: Homepage, https://github.com/Flux-Frontiers/diary_kg
72
+ Project-URL: Repository, https://github.com/Flux-Frontiers/diary_kg
73
+ Description-Content-Type: text/markdown
74
+
75
+ [![Python](https://img.shields.io/badge/python-3.12%20%7C%203.13-blue.svg)](https://www.python.org/)
76
+ [![License: Elastic-2.0](https://img.shields.io/badge/License-Elastic%202.0-blue.svg)](https://www.elastic.co/licensing/elastic-license)
77
+ [![Version](https://img.shields.io/badge/version-0.92.5-blue.svg)](https://github.com/Flux-Frontiers/diary_kg/releases)
78
+ [![CI](https://github.com/Flux-Frontiers/diary_kg/actions/workflows/ci.yml/badge.svg)](https://github.com/Flux-Frontiers/diary_kg/actions/workflows/ci.yml)
79
+ [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
80
+ [![DOI](https://zenodo.org/badge/1183242132.svg)](https://zenodo.org/badge/latestdoi/1183242132)
81
+
82
+ **DiaryKG** — A deterministic knowledge graph for diaries and journals with semantic indexing and source-grounded snippet packing.
83
+
84
+ *Author: Eric G. Suchanek, PhD — Flux-Frontiers, Liberty TWP, OH*
85
+
86
+ ---
87
+
88
+ ## Overview
89
+
90
+ DiaryKG ingests plain-text diary or journal files and produces a hybrid SQLite + LanceDB knowledge graph that supports natural-language querying, source-grounded snippet packs for LLM context, temporal analysis, and topic/context classification.
91
+
92
+ It was built around the **Samuel Pepys diary** (1660–1669, 7,282 entries) but is general-purpose — any structured plain-text diary or journal file is supported.
93
+
94
+ The system is organized as two cooperating Python packages:
95
+
96
+ - **`diary_transformer`** — spaCy NLP enrichment, topic classification, sentence-group chunking, diversity sampling. Turns a raw diary text file into one Markdown chunk-file per entry, with full provenance metadata.
97
+ - **`diary_kg`** — orchestrates the chunking pipeline, builds the DocKG-backed SQLite graph + LanceDB vector index over the chunked corpus, and exposes the query / pack / analyze / snapshot APIs and an MCP server.
98
+
99
+ ### Architecture
100
+
101
+ ```
102
+ Plain-text diary
103
+
104
+
105
+ DiaryTransformer spaCy NLP enrichment, topic classification,
106
+ (diary_transformer) sentence-group chunking, diversity sampling
107
+
108
+
109
+ Corpus (.md files) one file per chunk, full provenance metadata
110
+ .diarykg/corpus/
111
+
112
+ ├──▶ DocKG build SQLite graph + LanceDB vector index
113
+ │ (doc-kg) BAAI/bge-small-en-v1.5 (384-d, normalized)
114
+
115
+ └──▶ DiaryKG APIs query(), pack(), analyze(), snapshot_save()
116
+ ```
117
+
118
+ ### Storage layout
119
+
120
+ ```
121
+ .diarykg/
122
+ config.json build parameters
123
+ corpus/ one .md chunk file per diary entry
124
+ graph.sqlite SQLite knowledge graph (DocKG)
125
+ lancedb/ LanceDB vector index (384-d HNSW)
126
+ snapshots/ point-in-time metrics snapshots
127
+ ```
128
+
129
+ ---
130
+
131
+ ## Quick Start
132
+
133
+ ```bash
134
+ # Install
135
+ pip install diary-kg
136
+
137
+ # Build from a plain-text diary file (creates .diarykg/ in the current dir)
138
+ diarykg build --source path/to/diary.txt
139
+
140
+ # Query the corpus
141
+ diarykg query "office work and the navy board"
142
+
143
+ # Pack snippets for an LLM context window
144
+ diarykg pack "Pepys at the theatre" --output context.md
145
+
146
+ # Start the MCP server (stdio transport for Claude Code / Cline / etc.)
147
+ diarykg-mcp
148
+ ```
149
+
150
+ ---
151
+
152
+ ## Installation
153
+
154
+ ### From PyPI (recommended)
155
+
156
+ ```bash
157
+ # Core runtime (CLI + MCP server + graph engine)
158
+ pip install diary-kg
159
+
160
+ # With Streamlit / Plotly visualizer extras
161
+ pip install "diary-kg[viz]"
162
+
163
+ # With 3D visualization extras (PyVista, PyQt5, etc. — heavy dependencies)
164
+ pip install "diary-kg[viz3d]"
165
+
166
+ # With KG integration deps (pycode-kg, doc-kg)
167
+ pip install "diary-kg[kgdeps]"
168
+
169
+ # Everything
170
+ pip install "diary-kg[all]"
171
+ ```
172
+
173
+ ### Poetry project
174
+
175
+ ```bash
176
+ poetry add diary-kg
177
+ poetry add "diary-kg[viz]"
178
+ poetry add "diary-kg[kgdeps]"
179
+ ```
180
+
181
+ ### Local development
182
+
183
+ ```bash
184
+ git clone https://github.com/Flux-Frontiers/diary_kg.git
185
+ cd diary_kg
186
+ python -m venv .venv && source .venv/bin/activate
187
+ pip install -e ".[dev]"
188
+ pytest
189
+ ```
190
+
191
+ ---
192
+
193
+ ## CLI Reference
194
+
195
+ The `diarykg` console script is the primary entry point. The MCP server ships as a separate `diarykg-mcp` script.
196
+
197
+ | Command | Purpose |
198
+ |---|---|
199
+ | `diarykg build` | Full pipeline: ingest diary → chunk → index into SQLite + LanceDB |
200
+ | `diarykg reindex` | Rebuild the LanceDB + SQLite index from the existing corpus (skips ingest) |
201
+ | `diarykg query <QUERY>` | Hybrid semantic + graph search; returns ranked hits |
202
+ | `diarykg pack <QUERY>` | Source-grounded Markdown snippet pack for LLM context |
203
+ | `diarykg analyze` | Generate a Markdown analysis report for the corpus |
204
+ | `diarykg status` | KG health check and build metadata, without loading the full DB |
205
+ | `diarykg snapshot save` | Capture point-in-time corpus metrics |
206
+ | `diarykg snapshot list / show / diff / prune` | Inspect and prune snapshots |
207
+ | `diarykg install-hooks` | Install the DiaryKG pre-commit git hook |
208
+ | `diarykg-mcp` | Run the MCP server (stdio / SSE transport) |
209
+
210
+ Every command accepts a `ROOT` positional argument (default: current directory) pointing at the project that contains `.diarykg/`. Run `diarykg <command> --help` for the full option list.
211
+
212
+ ### Build
213
+
214
+ ```bash
215
+ # First build — --source is required
216
+ diarykg build --source pepys/pepys_enriched_full.txt
217
+
218
+ # Incremental update (preserve existing corpus + DBs)
219
+ diarykg build --source pepys/pepys_enriched_full.txt --update
220
+
221
+ # Configure chunking
222
+ diarykg build --source diary.txt --chunking semantic --chunk-size 800 --max-chunks 5
223
+
224
+ # Capture a snapshot immediately after the build
225
+ diarykg build --source diary.txt --snapshot
226
+ ```
227
+
228
+ Chunking strategies: `sentence_group` (default), `semantic`, `hybrid`. Custom topic catalogs can be supplied with `--topics-file path/to/topics.yaml`.
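
To give a feel for what a `sentence_group` style strategy might do, the sketch below greedily packs consecutive sentences into chunks under a character budget. It is not DiaryKG's chunker: the regex sentence splitter, the function name `group_sentences`, and the reading of `chunk_size` as a character budget are all assumptions made for illustration (the real pipeline relies on spaCy).

```python
import re


def group_sentences(text: str, chunk_size: int = 800) -> list[str]:
    """Greedily pack consecutive sentences into chunks of at most chunk_size characters.

    Hypothetical helper for illustration only. A single sentence longer than
    chunk_size is kept whole rather than split.
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for sentence in sentences:
        # The +1 accounts for the joining space when the chunk is non-empty.
        extra = len(sentence) + (1 if current else 0)
        if current and length + extra > chunk_size:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current, length = [], 0
            extra = len(sentence)
        current.append(sentence)
        length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


entry = "Up betimes. To the office all morning. Dined at home. Then to the theatre."
for chunk in group_sentences(entry, chunk_size=40):
    print(repr(chunk))
```

Grouping whole sentences keeps each chunk readable on its own, at the cost of uneven chunk lengths.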
229
+
230
+ ### Query and pack
231
+
232
+ ```bash
233
+ # Top-k semantic hits as a rich-formatted table
234
+ diarykg query "Navy affairs" -k 12
235
+
236
+ # Same query as JSON for downstream tooling
237
+ diarykg query "Navy affairs" --json
238
+
239
+ # Markdown snippet pack ready to paste into an LLM
240
+ diarykg pack "Pepys wife Elizabeth" --output context.md
241
+ ```
242
+
243
+ ### Snapshots
244
+
245
+ For `diarykg snapshot save`, the version is passed as an option (`-v` / `--version`), not as a positional argument; any bare positional is treated as `ROOT`.
246
+
247
+ ```bash
248
+ # Capture a snapshot at the current corpus state
249
+ diarykg snapshot save -v 0.92.2
250
+
251
+ # With a label
252
+ diarykg snapshot save -v 0.92.2 -l "after backfilling 1667 entries"
253
+
254
+ # List, inspect, compare
255
+ diarykg snapshot list
256
+ diarykg snapshot show <key>
257
+ diarykg snapshot diff <key_a> <key_b>
258
+
259
+ # Prune snapshots that carry no new metric information
260
+ diarykg snapshot prune --dry-run
261
+ ```
262
+
263
+ Snapshots are keyed by git tree hash and capture chunk/entry/node/edge counts, temporal span, topic/context distributions, and deltas vs. the previous and baseline snapshots.
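
The "deltas vs. the previous and baseline snapshots" idea can be sketched as a per-metric subtraction. The function name `metric_deltas`, the metric keys, and the numbers below are hypothetical; the actual snapshot schema may differ.

```python
def metric_deltas(current: dict[str, int], reference: dict[str, int]) -> dict[str, int]:
    """Return current minus reference for every metric present in either snapshot."""
    keys = sorted(set(current) | set(reference))
    # A metric missing from one side is treated as 0.
    return {key: current.get(key, 0) - reference.get(key, 0) for key in keys}


# Made-up numbers for illustration; not real DiaryKG output.
baseline = {"chunks": 100, "entries": 40, "nodes": 150, "edges": 300}
previous = {"chunks": 120, "entries": 48, "nodes": 180, "edges": 365}
current = {"chunks": 125, "entries": 50, "nodes": 188, "edges": 380}

print("vs previous:", metric_deltas(current, previous))
print("vs baseline:", metric_deltas(current, baseline))
```

Under this reading, a snapshot whose deltas against its predecessor are all zero carries no new metric information, which is the kind of snapshot a prune step would plausibly target.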
264
+
265
+ ### Reindex
266
+
267
+ Run `diarykg reindex` after changing the embedding model or fixing an index bug, when the corpus `.md` chunk files are already up to date.
268
+
269
+ ```bash
270
+ diarykg reindex
271
+ ```
272
+
273
+ ---
274
+
275
+ ## MCP Server
276
+
277
+ DiaryKG ships an MCP server that exposes three tools to AI agents.
278
+
279
+ | Tool | Returns | Description |
280
+ |---|---|---|
281
+ | `query_diary(q, k)` | JSON | Semantic search over the diary corpus; ranked hit list with `node_id`, `score`, `summary`, `source_file`, `timestamp`, `category`, `context`. |
282
+ | `pack_diary(q, k)` | Markdown | Top-k diary snippets formatted as Markdown sections, ready to paste into an LLM context window. |
283
+ | `diary_stats()` | JSON | Combined corpus metadata (`info()`) and KG stats (`stats()`): chunk/entry counts, temporal span, topic/context distributions, node/edge counts. |
284
+
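A client that receives `query_diary` results will usually want to post-process them. The sketch below keeps the best-scoring hit per category, using the field names listed in the table above. The payload is made up, and treating the result as a JSON array of hit objects is an assumption; the real output may be wrapped differently.

```python
import json

# Made-up payload using the field names from the table above. Assumption: the
# tool result can be parsed as a JSON array of hit objects.
payload = """
[
  {"node_id": "n1", "score": 0.82, "summary": "Business at the Navy Office",
   "source_file": "chunk_a.md", "timestamp": "1660-01-02", "category": "work", "context": "office"},
  {"node_id": "n2", "score": 0.64, "summary": "Accounts with the Navy Board",
   "source_file": "chunk_b.md", "timestamp": "1661-05-10", "category": "work", "context": "office"},
  {"node_id": "n3", "score": 0.71, "summary": "An evening at the theatre",
   "source_file": "chunk_c.md", "timestamp": "1662-09-29", "category": "leisure", "context": "theatre"}
]
"""


def best_hit_per_category(hits: list[dict]) -> dict[str, dict]:
    """Keep only the highest-scoring hit for each category (hypothetical helper)."""
    best: dict[str, dict] = {}
    for hit in hits:
        category = hit["category"]
        if category not in best or hit["score"] > best[category]["score"]:
            best[category] = hit
    return best


best = best_hit_per_category(json.loads(payload))
for category, hit in sorted(best.items()):
    print(f"{category}: {hit['node_id']} ({hit['score']:.2f}) from {hit['source_file']}")
```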
285
+ ### Run the server
286
+
287
+ ```bash
288
+ # Stdio transport (default — for Claude Code / Cline / Claude Desktop / Kilo Code)
289
+ diarykg-mcp --repo /path/to/diary_project
290
+
291
+ # SSE transport
292
+ diarykg-mcp --repo /path/to/diary_project --transport sse
293
+ ```
294
+
295
+ ### Wire it up in an MCP client
296
+
297
+ Most MCP clients use a JSON config file. Example `.mcp.json` for Claude Code or Kilo Code:
298
+
299
+ ```json
300
+ {
301
+ "mcpServers": {
302
+ "diarykg": {
303
+ "command": "diarykg-mcp",
304
+ "args": ["--repo", "/absolute/path/to/diary_project"]
305
+ }
306
+ }
307
+ }
308
+ ```
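
If you prefer to generate that file rather than write it by hand, a small script can resolve the project path to an absolute one first. This is a minimal sketch that mirrors the JSON above; the helper name `mcp_config` is hypothetical and not part of DiaryKG.

```python
import json
from pathlib import Path


def mcp_config(project_root: str) -> dict:
    """Build the config shown above, with an absolute --repo path."""
    repo = str(Path(project_root).resolve())
    return {
        "mcpServers": {
            "diarykg": {
                "command": "diarykg-mcp",
                "args": ["--repo", repo],
            }
        }
    }


config = mcp_config(".")
print(json.dumps(config, indent=2))
# To write it out (this overwrites any existing file):
# Path(".mcp.json").write_text(json.dumps(config, indent=2) + "\n")
```

If a `.mcp.json` already exists with other servers, merge the `diarykg` entry into its `mcpServers` object instead of overwriting the file.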
309
+
310
+ For per-agent setup steps, run `/setup-diarykg-mcp` in Claude Code (the slash command at [.claude/commands/setup-diarykg-mcp.md](.claude/commands/setup-diarykg-mcp.md) walks through the Claude Code, Cline, Claude Desktop, GitHub Copilot, and Kilo Code variants).
311
+
312
+ ---
313
+
314
+ ## Python API
315
+
316
+ ```python
317
+ from diary_kg import DiaryKG
318
+
319
+ # First build
320
+ kg = DiaryKG("/path/to/project", source_file="pepys_diary.txt")
321
+ kg.build()
322
+
323
+ # Subsequent runs only need the project root
324
+ kg = DiaryKG("/path/to/project")
325
+
326
+ # Hybrid semantic + graph search
327
+ hits = kg.query("what did Pepys think of the theatre?", k=12)
328
+
329
+ # Source-grounded snippet pack (list of dicts with content, metadata)
330
+ snippets = kg.pack("Navy corruption", k=8)
331
+
332
+ # Corpus metadata + KG stats
333
+ info = kg.info() # chunk_count, entry_count, temporal_span, topic/context distributions
334
+ stats = kg.stats() # node_count, edge_count
335
+
336
+ # Markdown analysis report
337
+ report = kg.analyze()
338
+
339
+ # Snapshots
340
+ kg.snapshot_save(version="0.92.2", label="release")
341
+ kg.snapshot_list()
342
+ kg.snapshot_show(key)
343
+ kg.snapshot_diff(key_a, key_b)
344
+ ```
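
To turn the snippet list into text for an LLM prompt, something like the following could be used. It is a sketch only: the `content` and `metadata` keys follow the comment above, while the `source_file` metadata key, the `render_pack` name, and the character budget are assumptions rather than DiaryKG behaviour.

```python
def render_pack(snippets: list[dict], max_chars: int = 2000) -> str:
    """Join snippets into Markdown sections, stopping before max_chars is exceeded."""
    sections: list[str] = []
    used = 0
    for snip in snippets:
        # Assumed metadata key; fall back to a placeholder when it is absent.
        source = snip.get("metadata", {}).get("source_file", "unknown source")
        section = f"### {source}\n\n{snip['content']}\n"
        # Count the newline that joins this section to the previous one.
        cost = len(section) + (1 if sections else 0)
        if used + cost > max_chars:
            break
        sections.append(section)
        used += cost
    return "\n".join(sections)


# Made-up snippets for illustration.
snippets = [
    {"content": "To the office.", "metadata": {"source_file": "chunk_0001.md"}},
    {"content": "To the theatre.", "metadata": {"source_file": "chunk_0002.md"}},
]
print(render_pack(snippets))
```

Keeping the source file in each heading preserves the link from every snippet back to its chunk file, so whatever the LLM quotes can be traced to the corpus.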
345
+
346
+ The package re-exports the primary types:
347
+
348
+ ```python
349
+ from diary_kg import DiaryKG, DEFAULT_MODEL, CrossHit, CrossSnippet, KGEntry, KGKind
350
+ ```
351
+
352
+ ---
353
+
354
+ ## Embedding Model
355
+
356
+ | Use | Model | Dims | Notes |
357
+ |---|---|---|---|
358
+ | Knowledge graph build | `BAAI/bge-small-en-v1.5` | 384 | Fast, general-text, L2-normalized |
359
+ | Multipass pipeline | `BAAI/bge-small-en-v1.5` | 384 | Same model stack-wide; loaded via `kg_utils.embedder.load_sentence_transformer()` |
360
+
361
+ Model loading is handled by `kg_utils.embedder.load_sentence_transformer()`, which enforces `local_files_only=True` when a cached copy exists — preventing spurious HuggingFace HEAD requests in offline or air-gapped environments.
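
The "L2-normalized" note in the table matters for ranking: once vectors have unit length, cosine similarity reduces to a plain dot product. The pure-Python sketch below shows the equivalence on toy 3-d vectors; it does not load the model, and the helper names are illustrative.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale vec to unit length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from un-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for 384-d embeddings.
query = [3.0, 4.0, 0.0]
chunk = [4.0, 3.0, 0.0]

# After normalization, the dot product alone gives the cosine similarity.
print(cosine(query, chunk))
print(dot(l2_normalize(query), l2_normalize(chunk)))
```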
362
+
363
+ ---
364
+
365
+ ## Project Structure
366
+
367
+ ```
368
+ diary_kg/
369
+ ├── src/
370
+ │ ├── diary_kg/ DiaryKG package
371
+ │ │ ├── kg.py DiaryKG class (build, query, pack, analyze, snapshots)
372
+ │ │ ├── cli.py Click CLI — `diarykg` console script
373
+ │ │ ├── mcp_server.py MCP server — `diarykg-mcp` console script
374
+ │ │ ├── primitives.py CrossHit, CrossSnippet, KGEntry, KGKind
375
+ │ │ ├── snapshots.py DiarySnapshotManager
376
+ │ │ └── module/ Pluggable KGModule interface
377
+ │ └── diary_transformer/ Chunking + NLP pipeline
378
+ │ ├── transformer.py DiaryTransformer orchestrator
379
+ │ ├── chunker.py sentence_group / semantic / hybrid chunkers
380
+ │ ├── classifier.py Topic + context classification
381
+ │ ├── parser.py Diary file parser
382
+ │ ├── topic_classifier.py Hybrid keyword / K-means classifier
383
+ │ └── topics.yaml Default topic catalog
384
+ ├── pepys/ Sample Pepys diary corpus
385
+ ├── docs/ Technical articles and disclosures
386
+ ├── benchmarks/ Embedding model benchmarks
387
+ ├── analysis/ Versioned analysis reports
388
+ ├── tests/ Pytest suite
389
+ └── scripts/ Wiki generator, embedder benchmarks
390
+ ```
391
+
392
+ ---
393
+
394
+ ## Dependencies
395
+
396
+ - `doc-kg ≥ 0.15.0` (hybrid semantic + structural document knowledge graph; installed via the `kgdeps` or `all` extra)
397
+ - `kgmodule-utils ≥ 0.2.3` — shared embedding, model cache, and snapshot utilities
398
+ - `spacy ≥ 3.8` with `en_core_web_sm` model
399
+ - `sentence-transformers ≥ 5.4`
400
+ - `lancedb ≥ 0.29`
401
+ - `transformers ≥ 4.57`
402
+ - `mcp ≥ 1.0` — Model Context Protocol SDK
403
+ - `rich ≥ 14.3` — terminal output and progress bars
404
+
405
+ Optional extras (`viz`, `viz3d`, `kgdeps`, `dev`, `all`) are defined in [pyproject.toml](pyproject.toml).
406
+
407
+ ---
408
+
409
+ ## Development
410
+
411
+ ```bash
412
+ # Install with dev tools
413
+ pip install -e ".[dev]"
414
+
415
+ # Run the test suite
416
+ pytest # uses pytest.ini (testpaths = tests/)
417
+ pytest -m "not slow" # skip slow tests
418
+ pytest --cov=diary_kg # with coverage
419
+
420
+ # Lint and format
421
+ ruff check src tests
422
+ ruff format src tests
423
+ mypy src/
424
+
425
+ # Pre-commit (runs ruff, mypy, pytest, detect-secrets, pylint)
426
+ pre-commit run --all-files
427
+ ```
428
+
429
+ The repo ships an optional pre-commit git hook that rebuilds PyCodeKG and DocKG indices from staged content, captures metrics snapshots keyed by git tree hash, and stages `.pycodekg/snapshots/` and `.dockg/snapshots/` atomically before the standard pre-commit framework checks run. Install it with:
430
+
431
+ ```bash
432
+ diarykg install-hooks --repo .
433
+ # Skip per-commit with: DIARYKG_SKIP_SNAPSHOT=1 git commit ...
434
+ ```
435
+
436
+ ---
437
+
438
+ ## License
439
+
440
+ Elastic License 2.0 — see [LICENSE](LICENSE) and the [Elastic License page](https://www.elastic.co/licensing/elastic-license).
441
+
442
+ ## Citation
443
+
444
+ If you use DiaryKG in academic work, please cite via the metadata in [CITATION.cff](CITATION.cff).
445
+