doc-kg 0.11.0__tar.gz

Files changed (41)
  1. doc_kg-0.11.0/LICENSE +94 -0
  2. doc_kg-0.11.0/PKG-INFO +608 -0
  3. doc_kg-0.11.0/README.md +573 -0
  4. doc_kg-0.11.0/pyproject.toml +127 -0
  5. doc_kg-0.11.0/src/doc_kg/__init__.py +7 -0
  6. doc_kg-0.11.0/src/doc_kg/__main__.py +6 -0
  7. doc_kg-0.11.0/src/doc_kg/app.py +412 -0
  8. doc_kg-0.11.0/src/doc_kg/chunker.py +540 -0
  9. doc_kg-0.11.0/src/doc_kg/cli/__init__.py +0 -0
  10. doc_kg-0.11.0/src/doc_kg/cli/cmd_analyze.py +69 -0
  11. doc_kg-0.11.0/src/doc_kg/cli/cmd_build.py +468 -0
  12. doc_kg-0.11.0/src/doc_kg/cli/cmd_hooks.py +120 -0
  13. doc_kg-0.11.0/src/doc_kg/cli/cmd_mcp.py +69 -0
  14. doc_kg-0.11.0/src/doc_kg/cli/cmd_model.py +54 -0
  15. doc_kg-0.11.0/src/doc_kg/cli/cmd_pipeline.py +342 -0
  16. doc_kg-0.11.0/src/doc_kg/cli/cmd_query.py +182 -0
  17. doc_kg-0.11.0/src/doc_kg/cli/cmd_semantic_analyze.py +69 -0
  18. doc_kg-0.11.0/src/doc_kg/cli/cmd_snapshot.py +392 -0
  19. doc_kg-0.11.0/src/doc_kg/cli/cmd_status.py +102 -0
  20. doc_kg-0.11.0/src/doc_kg/cli/cmd_viz.py +76 -0
  21. doc_kg-0.11.0/src/doc_kg/cli/group.py +22 -0
  22. doc_kg-0.11.0/src/doc_kg/cli/main.py +39 -0
  23. doc_kg-0.11.0/src/doc_kg/cli/options.py +50 -0
  24. doc_kg-0.11.0/src/doc_kg/config.py +44 -0
  25. doc_kg-0.11.0/src/doc_kg/doc_kg.code-workspace +8 -0
  26. doc_kg-0.11.0/src/doc_kg/dockg.py +615 -0
  27. doc_kg-0.11.0/src/doc_kg/dockg_semantic_analysis.py +717 -0
  28. doc_kg-0.11.0/src/doc_kg/dockg_thorough_analysis.py +405 -0
  29. doc_kg-0.11.0/src/doc_kg/embedder_worker.py +356 -0
  30. doc_kg-0.11.0/src/doc_kg/entry_chunk.py +117 -0
  31. doc_kg-0.11.0/src/doc_kg/graph.py +171 -0
  32. doc_kg-0.11.0/src/doc_kg/index.py +880 -0
  33. doc_kg-0.11.0/src/doc_kg/kg.py +835 -0
  34. doc_kg-0.11.0/src/doc_kg/manifold.py +282 -0
  35. doc_kg-0.11.0/src/doc_kg/mcp_server.py +170 -0
  36. doc_kg-0.11.0/src/doc_kg/pipeline.py +469 -0
  37. doc_kg-0.11.0/src/doc_kg/relations.py +166 -0
  38. doc_kg-0.11.0/src/doc_kg/sampler.py +335 -0
  39. doc_kg-0.11.0/src/doc_kg/snapshots.py +478 -0
  40. doc_kg-0.11.0/src/doc_kg/store.py +616 -0
  41. doc_kg-0.11.0/src/doc_kg/topics.py +328 -0
doc_kg-0.11.0/LICENSE ADDED
@@ -0,0 +1,94 @@
1
+ Elastic License 2.0
2
+
3
+ URL: https://www.elastic.co/licensing/elastic-license
4
+
5
+ ## Acceptance
6
+
7
+ By using the software, you agree to all of the terms and conditions below.
8
+
9
+ ## Copyright License
10
+
11
+ The licensor grants you a non-exclusive, royalty-free, worldwide,
12
+ non-sublicensable, non-transferable license to use, copy, distribute, make
13
+ available, and prepare derivative works of the software, in each case subject to
14
+ the limitations and conditions below.
15
+
16
+ ## Limitations
17
+
18
+ **You may not provide the software to third parties as a hosted or managed
19
+ service, where the service provides users with access to any substantial set of
20
+ the features or functionality of the software.**
21
+
22
+ You may not move, change, disable, or circumvent the license key functionality
23
+ in the software, and you may not remove or obscure any functionality in the
24
+ software that is protected by the license key.
25
+
26
+ You may not alter, remove, or obscure any licensing, copyright, or other notices
27
+ of the licensor in the software. Any use of the licensor's trademarks is subject
28
+ to applicable law.
29
+
30
+ ## Patents
31
+
32
+ The licensor grants you a license, under any patent claims the licensor can
33
+ license, or becomes able to license, to make, have made, use, sell, offer for
34
+ sale, import and have imported the software, in each case subject to the
35
+ limitations and conditions in this license. This license does not cover any
36
+ patent claims that you cause to be infringed by modifications or additions to the
37
+ software. If you or your company make any written claim that the software
38
+ infringes or contributes to infringement of any patent, your patent license for
39
+ the software granted under these terms ends immediately. If your company makes
40
+ such a claim, your patent license ends immediately for work on behalf of your
41
+ company.
42
+
43
+ ## Notices
44
+
45
+ You must ensure that anyone who gets a copy of any part of the software from you
46
+ also gets a copy of these terms or the URL for them above, as well as copies of
47
+ any plain-text lines beginning with "Required Notice:" that the licensor provided
48
+ with the software. For example:
49
+
50
+ Required Notice: Copyright (c) 2026 Eric G. Suchanek, PhD
51
+
52
+ ## No Other Rights
53
+
54
+ These terms do not imply any other licenses not expressly granted in this
55
+ license.
56
+
57
+ ## Termination
58
+
59
+ If you use the software in violation of these terms, such use is not licensed,
60
+ and your licenses will automatically terminate. If the licensor provides you with
61
+ a notice of your violation, and you cease all violation of this license no later
62
+ than 30 days after you receive that notice, your licenses will be reinstated
63
+ retroactively. However, if you violate these terms after such reinstatement, any
64
+ additional violation of these terms will cause your licenses to terminate
65
+ automatically and permanently.
66
+
67
+ ## No Liability
68
+
69
+ *As far as the law allows, the software comes as is, without any warranty or
70
+ condition, and the licensor will not be liable to you for any damages arising out
71
+ of these terms or the use or nature of the software, under any kind of legal
72
+ claim.*
73
+
74
+ ## Definitions
75
+
76
+ The **licensor** is the entity offering these terms, and the **software** is the
77
+ software the licensor makes available under these terms, including any portion of
78
+ it.
79
+
80
+ **You** refers to the individual or entity agreeing to these terms.
81
+
82
+ **Your company** is any legal entity, sole proprietorship, or other kind of
83
+ organization that you work for, plus all organizations that have control over,
84
+ are under the control of, or are under common control with that organization.
85
+ Control means ownership of substantially all the assets of an entity, or the
86
+ power to direct its management and policies by vote, contract, or otherwise.
87
+ Control can be direct or indirect.
88
+
89
+ **Your licenses** are all the licenses granted to you for the software under
90
+ these terms.
91
+
92
+ **Use** means anything you do with the software requiring one of your licenses.
93
+
94
+ **Trademark** means trademarks, service marks, and similar rights.
doc_kg-0.11.0/PKG-INFO ADDED
@@ -0,0 +1,608 @@
1
+ Metadata-Version: 2.4
2
+ Name: doc-kg
3
+ Version: 0.11.0
4
+ Summary: A tool to build a semantically searchable knowledge graph from markdown and text documents
5
+ License: Elastic-2.0
6
+ License-File: LICENSE
7
+ Keywords: knowledge-graph,document-analysis,markdown,lancedb,sqlite,semantic-search
8
+ Author: Eric G. Suchanek, PhD
9
+ Author-email: suchanek@flux-frontiers.com
10
+ Requires-Python: >=3.12,<3.14
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: License :: Other/Proprietary License
14
+ Classifier: Programming Language :: Python :: 3
15
+ Classifier: Programming Language :: Python :: 3.12
16
+ Classifier: Programming Language :: Python :: 3.13
17
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
18
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
19
+ Provides-Extra: viz
20
+ Requires-Dist: click (>=8.1.0)
21
+ Requires-Dist: einops (>=0.8.2,<0.9.0)
22
+ Requires-Dist: kg-snapshot (>=0.3.0)
23
+ Requires-Dist: lancedb (>=0.29.0)
24
+ Requires-Dist: markdown-it-py (>=3.0.0)
25
+ Requires-Dist: mcp (>=1.0.0)
26
+ Requires-Dist: numpy (>=1.24.0)
27
+ Requires-Dist: pandas (>=2.0.0)
28
+ Requires-Dist: pyyaml (>=6.0.0)
29
+ Requires-Dist: rich (>=13.0.0)
30
+ Requires-Dist: sentence-transformers (>=5.4.1,<6.0.0)
31
+ Project-URL: Homepage, https://github.com/Flux-Frontiers/doc_kg
32
+ Project-URL: Repository, https://github.com/Flux-Frontiers/doc_kg
33
+ Description-Content-Type: text/markdown
34
+
35
+ [![CI](https://github.com/Flux-Frontiers/doc_kg/actions/workflows/publish.yml/badge.svg)](https://github.com/Flux-Frontiers/doc_kg/actions/workflows/publish.yml)
36
+ [![Python](https://img.shields.io/badge/python-3.12%20%7C%203.13-blue.svg)](https://www.python.org/)
37
+ [![License: Elastic-2.0](https://img.shields.io/badge/License-Elastic%202.0-blue.svg)](https://www.elastic.co/licensing/elastic-license)
38
+ [![Version](https://img.shields.io/badge/version-0.11.0-blue.svg)](https://github.com/Flux-Frontiers/doc_kg/releases)
39
+ [![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
40
+ [![DOI](https://zenodo.org/badge/1176162360.svg)](https://zenodo.org/badge/latestdoi/1176162360)
41
+
42
+ **DocKG** — A Hybrid Knowledge Graph for Document Corpora
43
+ with Semantic Indexing and Source-Grounded Passage Packing
44
+
45
+ *Author: Eric G. Suchanek, PhD*
46
+ *Flux-Frontiers, Liberty TWP, OH*
47
+
48
+ ---
49
+
50
+ ## Overview
51
+
52
+ DocKG constructs a **deterministic, explainable knowledge graph** from a corpus of Markdown and plain-text documents. It semantically chunks text, discovers structural and semantic relationships between sections and chunks, stores them in SQLite, and augments retrieval with vector embeddings via LanceDB.
53
+
54
+ Structure is treated as **ground truth**; semantic search is strictly an acceleration layer. The result is a searchable, auditable representation of a document corpus that supports precise navigation, contextual passage extraction, and downstream reasoning — making it an ideal retrieval engine for LLMs and a practical foundation for **Knowledge-Graph RAG (KRAG)**, in contrast to embedding-only approaches.
55
+
56
+ DocKG uses the same architecture as [CodeKG](https://github.com/Flux-Frontiers/code_kg) but targets natural-language documents rather than Python source code.
57
+
58
+ ---
59
+
60
+ ## Features
61
+
62
+ - **Semantic chunking** — Splits `.md` and `.txt` files into semantically coherent chunks by heading and paragraph structure
63
+ - **Deterministic knowledge graph** — SQLite-backed canonical store with typed nodes and provenance-tracked edges
64
+ - **Relation extraction** — Topics, named entities, and keywords extracted from each chunk; co-occurrence and similarity edges built automatically
65
+ - **Hybrid query model** — Semantic seeding (LanceDB embeddings) + structural expansion (graph traversal)
66
+ - **Passage packing** — Extract context-rich text passages grounded to source documents with headings
67
+ - **Semantic coverage analysis** — Per-document metrics, hot chunks, orphan detection, and overall corpus health report
68
+ - **Temporal snapshots** — Save and diff graph metrics over time; compare coverage across corpus versions
69
+ - **MCP server** — Four tools for AI agent integration (`graph_stats`, `query_docs`, `pack_docs`, `get_node`)
70
+ - **Streamlit web app** — Interactive graph browser, hybrid query UI, and passage pack explorer
71
+ - **Configurable extraction** — Toggle topic/entity/keyword extraction per build
72
+
73
+ ---
74
+
75
+ ## Quick Start
76
+
77
+ ```bash
78
+ # Index a document corpus (SQLite + LanceDB in one step)
79
+ dockg build docs/
80
+
81
+ # Natural-language query — returns ranked document chunks
82
+ dockg query "authentication flow"
83
+
84
+ # Source-grounded passage pack — paste straight into an LLM prompt
85
+ dockg pack "configuration reference" --format md --out context.md
86
+ ```
87
+
88
+ ---
89
+
90
+ ## Usage Examples
91
+
92
+ ### Build the knowledge graph
93
+
94
+ ```bash
95
+ # Full pipeline: parse documents → SQLite graph → LanceDB semantic index
96
+ dockg build docs/
97
+
98
+ # Build only the SQLite graph (no embeddings)
99
+ dockg build-graph docs/
100
+
101
+ # Build only the LanceDB index from an existing graph
102
+ dockg build-index
103
+
104
+ # Rebuild from scratch (wipe is the default)
105
+ dockg build docs/
106
+
107
+ # Incremental update — keep existing data
108
+ dockg build docs/ --update
109
+
110
+ # Exclude specific directories
111
+ dockg build docs/ --exclude-dir dir1 --exclude-dir dir2
112
+ ```
113
+
114
+ ### Query and pack passages
115
+
116
+ ```bash
117
+ # Hybrid query — semantic seed + graph expansion
118
+ dockg query "deployment configuration"
119
+
120
+ # Increase top-K and expansion hops
121
+ dockg query "API authentication" --k 12 --hop 2
122
+
123
+ # Pack passages as Markdown for LLM context injection
124
+ dockg pack "error handling strategies" --format md --out context.md
125
+
126
+ # Pack as JSON
127
+ dockg pack "database schema" --format json
128
+ ```
129
+
130
+ ### Analyze corpus health
131
+
132
+ ```bash
133
+ # Full analysis report (Markdown + JSON snapshot)
134
+ dockg analyze docs/
135
+
136
+ # Output to a specific file
137
+ dockg analyze docs/ --output analysis/report.md
138
+
139
+ # Quiet mode for CI — exits non-zero on issues
140
+ dockg analyze docs/ --quiet
141
+ ```
142
+
143
+ ### Snapshot the knowledge graph over time
144
+
145
+ ```bash
146
+ # Save a snapshot tagged with a version
147
+ dockg snapshot save 0.1.0
148
+
149
+ # List all saved snapshots
150
+ dockg snapshot list
151
+
152
+ # Show detail for a specific snapshot
153
+ dockg snapshot show 0.1.0
154
+
155
+ # Diff two snapshots
156
+ dockg snapshot diff 0.1.0 0.2.0
157
+ ```
158
+
159
+ ### Launch the Streamlit visualizer
160
+
161
+ ```bash
162
+ # Requires [viz] extra: pip install 'doc-kg[viz]'
163
+ dockg viz
164
+
165
+ # Custom port, suppress browser launch
166
+ dockg viz --port 8510 --no-browser
167
+ ```
168
+
169
+ ### Start the MCP server
170
+
171
+ ```bash
172
+ # Serve via stdio (default — for Claude Code, Cline, Copilot)
173
+ dockg mcp --repo docs/
174
+
175
+ # Serve via SSE (for web clients)
176
+ dockg mcp --repo docs/ --transport sse
177
+ ```
178
+
179
+ ### Use via MCP in Claude Code / GitHub Copilot
180
+
181
+ Once the MCP server is running, your AI agent has four tools:
182
+
183
+ ```
184
+ graph_stats() # node/edge counts by kind
185
+ query_docs("authentication flow") # hybrid semantic + structural search
186
+ pack_docs("configuration reference") # source-grounded passages as Markdown
187
+ get_node("chunk:intro:overview") # fetch a single node by ID
188
+ ```
189
+
190
+ ---
191
+
192
+ ## Installation
193
+
194
+ **Requirements:** Python ≥ 3.12, < 3.14
195
+
196
+ ### pip
197
+
198
+ ```bash
199
+ # From PyPI (recommended)
200
+ pip install doc-kg
201
+
202
+ # With Streamlit web visualizer
203
+ pip install 'doc-kg[viz]'
204
+
205
+ # Latest from GitHub
206
+ pip install 'doc-kg @ git+https://github.com/Flux-Frontiers/doc_kg.git'
207
+ ```
208
+
209
+ ### Poetry (existing project)
210
+
211
+ ```bash
212
+ # From PyPI
213
+ poetry add doc-kg
214
+
215
+ # With Streamlit visualizer
216
+ poetry add 'doc-kg[viz]'
217
+
218
+ # From GitHub source
219
+ poetry add 'doc-kg @ git+https://github.com/Flux-Frontiers/doc_kg.git'
220
+ ```
221
+
222
+ Or declare in `pyproject.toml`:
223
+
224
+ ```toml
225
+ [tool.poetry.dependencies]
226
+ doc-kg = "^0.11.0"
227
+ # or with visualizer:
228
+ doc-kg = {version = "^0.11.0", extras = ["viz"]}
229
+ ```
230
+
231
+ > **Note for DocKG developers:** Clone the repo and use `poetry install -E viz` for a full local development environment including the Streamlit visualizer.
232
+
233
+ ### Verify the installation
234
+
235
+ ```bash
236
+ dockg --help
237
+ dockg status --repo .
238
+ ```
239
+
240
+ `dockg status` shows the knowledge graph builder metadata, node/edge counts, and DB size. It exits non-zero if no graph has been built yet — useful for CI health checks.
241
+
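The exit-code behaviour described above can be wrapped in a small CI helper. This is a minimal sketch, not part of DocKG: `graph_is_built` is a hypothetical name, and the only things taken from the README are the `dockg status --repo .` invocation and the statement that it exits non-zero when no graph exists.

```python
import subprocess


def graph_is_built(command=("dockg", "status", "--repo", ".")):
    """Return True when the status command exits with code 0.

    The default command is the one shown in the README; a missing
    executable is treated the same as "no graph built".
    """
    try:
        completed = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        return False
    return completed.returncode == 0
```

A CI step could call `graph_is_built()` and fail the job when it returns `False`.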
242
+ ### First build
243
+
244
+ ```bash
245
+ # Build a knowledge graph from a directory of .md and .txt files
246
+ dockg build --repo /path/to/docs/
247
+
248
+ # Verify the result
249
+ dockg status --repo /path/to/docs/
250
+
251
+ # Run a query
252
+ dockg query --repo /path/to/docs/ "your search topic"
253
+ ```
254
+
255
+ ### Git hooks (optional)
256
+
257
+ Install a pre-commit hook that automatically captures a graph metrics snapshot before each commit:
258
+
259
+ ```bash
260
+ # Via the CLI (recommended — uses the full quality-check pipeline)
261
+ dockg install-hooks
262
+
263
+ # Via the standalone script
264
+ bash scripts/install-hooks.sh
265
+
266
+ # Skip the hook for a specific commit
267
+ DOCKG_SKIP_SNAPSHOT=1 git commit -m "message"
268
+ ```
269
+
270
+ ### Download embedding model for offline use
271
+
272
+ The default model (`BAAI/bge-small-en-v1.5`) is fetched from HuggingFace on first use. To pre-download it for air-gapped or offline environments:
273
+
274
+ ```bash
275
+ dockg download-model
276
+ # or a specific model:
277
+ dockg download-model --model BAAI/bge-small-en-v1.5
278
+ ```
279
+
280
+ ### AI agent integration (MCP)
281
+
282
+ After installing, wire DocKG into your AI agent by adding it as an MCP server. See [docs/MCP.md](docs/MCP.md) for the full setup guide, or run the installer script to configure all providers automatically:
283
+
284
+ ```bash
285
+ # Configure Claude Code, GitHub Copilot, and Cline in one step
286
+ bash scripts/install-skill.sh
287
+
288
+ # Claude Code only
289
+ bash scripts/install-skill.sh --providers claude
290
+
291
+ # Dry-run to see what would be changed
292
+ bash scripts/install-skill.sh --dry-run
293
+ ```
294
+
295
+ ---
296
+
297
+ ## CLI Reference
298
+
299
+ All commands are available via the unified `dockg` CLI:
300
+
301
+ ```bash
302
+ dockg --help
303
+ ```
304
+
305
+ Every subcommand also ships as a dedicated `dockg-<name>` script — useful for shell scripts, `Makefile` targets, and CI pipelines with no `poetry run` required.
306
+
307
+ | Script alias | Equivalent subcommand | Description |
308
+ |---|---|---|
309
+ | `dockg-build` | `dockg build` | Full pipeline: parse → SQLite → LanceDB |
310
+ | `dockg-build-graph` | `dockg build-graph` | SQLite graph only |
311
+ | `dockg-build-index` | `dockg build-index` | LanceDB index only |
312
+ | `dockg-query` | `dockg query` | Hybrid semantic + structural query |
313
+ | `dockg-pack` | `dockg pack` | Source-grounded passage extraction |
314
+ | `dockg-analyze` | `dockg analyze` | Corpus health analysis + report |
315
+ | `dockg-snapshot` | `dockg snapshot` | Save / list / show / diff snapshots |
316
+ | `dockg-viz` | `dockg viz` | Launch Streamlit visualizer |
317
+ | `dockg-mcp` | `dockg mcp` | Start MCP server |
318
+
319
+ ### `dockg build` — Full pipeline
320
+
321
+ ```bash
322
+ dockg build CORPUS_ROOT [--db PATH] [--lancedb PATH] [--model NAME]
323
+ [--update] [--no-similar] [--exclude-dir DIR]...
324
+ ```
325
+
326
+ | Option | Default | Description |
327
+ |---|---|---|
328
+ | `CORPUS_ROOT` | required | Root directory of documents to index |
329
+ | `--db` | `.dockg/graph.sqlite` | SQLite database path |
330
+ | `--lancedb` | `.dockg/lancedb` | LanceDB index directory |
331
+ | `--model` | `BAAI/bge-small-en-v1.5` | Sentence-transformer embedding model |
332
+ | `--update` | off | Incremental update — keep existing data instead of wiping |
333
+ | `--no-similar` | off | Skip computing `SIMILAR_TO` edges |
334
+ | `--exclude-dir` | — | Exclude a directory at every depth (repeatable); merged with `[tool.dockg].exclude` |
335
+
336
+ ### `dockg build-graph` — SQLite only
337
+
338
+ ```bash
339
+ dockg build-graph CORPUS_ROOT [--db PATH] [--update] [--exclude-dir DIR]...
340
+ ```
341
+
342
+ Parses documents, extracts nodes (documents, sections, chunks, topics, entities, keywords), and writes the SQLite graph. No embedding model required.
343
+
344
+ | Option | Default | Description |
345
+ |---|---|---|
346
+ | `--exclude-dir` | — | Exclude a directory at every depth (repeatable); merged with `[tool.dockg].exclude` |
347
+
348
+ ### `dockg build-index` — LanceDB only
349
+
350
+ ```bash
351
+ dockg build-index [--db PATH] [--lancedb PATH] [--model NAME] [--no-similar]
352
+ ```
353
+
354
+ Reads an existing SQLite graph and builds (or rebuilds) the LanceDB vector index. Use after `build-graph` or when reindexing with a different model.
355
+
356
+ ### `dockg query` — Hybrid search
357
+
358
+ ```bash
359
+ dockg query QUERY [--db PATH] [--lancedb PATH] [--k N] [--hop N] [--rels TYPES]
360
+ ```
361
+
362
+ | Option | Default | Description |
363
+ |---|---|---|
364
+ | `QUERY` | required | Natural-language search string |
365
+ | `--k` | `8` | Top-K semantic seed hits |
366
+ | `--hop` | `1` | Graph expansion hops |
367
+ | `--rels` | `CONTAINS,NEXT,REFERENCES,SIMILAR_TO` | Edge types to traverse |
368
+
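To make the `--hop` and `--rels` options concrete, the following sketch shows one way structural expansion from semantic seed hits could work: a breadth-first walk limited to the allowed edge types. It is an illustration under assumptions, not DocKG's implementation. The function name `expand`, the `(src, rel, dst)` triple format, and the choice to follow edges in both directions are all hypothetical.

```python
from collections import deque

# Default edge types, as listed for --rels in the table above.
DEFAULT_RELS = ("CONTAINS", "NEXT", "REFERENCES", "SIMILAR_TO")


def expand(seeds, edges, hop=1, rels=DEFAULT_RELS):
    """Breadth-first expansion from seed node IDs over allowed edge types.

    edges is an iterable of (src, rel, dst) triples. Edges are followed in
    both directions, which is an assumption of this sketch.
    Returns {node_id: hops_from_nearest_seed}.
    """
    allowed = set(rels)
    neighbors = {}
    for src, rel, dst in edges:
        if rel in allowed:
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)

    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] >= hop:
            continue  # hop budget reached for this branch
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance
```

With `hop=1` only direct neighbours of the seeds are added; raising `hop` widens the context at the cost of more loosely related nodes.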
369
+ ### `dockg pack` — Passage extraction
370
+
371
+ ```bash
372
+ dockg pack QUERY [--db PATH] [--lancedb PATH] [--k N] [--hop N]
373
+ [--format md|json] [--out PATH] [--max-chars N] [--max-nodes N]
374
+ ```
375
+
376
+ | Option | Default | Description |
377
+ |---|---|---|
378
+ | `--k` | `8` | Top-K semantic seed hits |
379
+ | `--hop` | `1` | Graph expansion hops |
380
+ | `--format` | `md` | Output format: `md` or `json` |
381
+ | `--out` | stdout | Output file path |
382
+ | `--max-chars` | `12000` | Max total characters in pack |
383
+ | `--max-nodes` | `50` | Max nodes included |
384
+
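The `--max-chars` and `--max-nodes` limits act as two budgets on the pack. The sketch below shows one plausible greedy policy for honouring both. The README does not say whether DocKG skips or truncates passages that do not fit, so the skip behaviour here is an assumption, and `pack_passages` is a hypothetical name.

```python
def pack_passages(passages, max_chars=12000, max_nodes=50):
    """Greedily keep ranked passages while both budgets allow.

    passages is a list of (node_id, text) pairs, best first.
    A passage that would exceed max_chars is skipped (not truncated),
    so a shorter, lower-ranked passage may still be included.
    """
    kept, used = [], 0
    for node_id, text in passages:
        if len(kept) >= max_nodes:
            break
        if used + len(text) > max_chars:
            continue
        kept.append((node_id, text))
        used += len(text)
    return kept
```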
385
+ ### `dockg analyze` — Corpus health report
386
+
387
+ ```bash
388
+ dockg analyze [CORPUS_ROOT] [--db PATH] [--lancedb PATH]
389
+ [--output PATH] [--json] [--quiet]
390
+ ```
391
+
392
+ Runs the full `DocKGAnalyzer` pipeline:
393
+
394
+ 1. Baseline graph statistics (node/edge counts by kind)
395
+ 2. Per-document structure metrics (sections, chunks, depth)
396
+ 3. Semantic coverage (% of chunks with topic/entity/keyword annotations)
397
+ 4. Orphan detection (isolated nodes with no edges)
398
+ 5. Hot chunks (highest connectivity / most referenced)
399
+ 6. Actionable insights and improvement suggestions
400
+
401
+ Writes a Markdown report and optionally a JSON snapshot.
402
+
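Two of the steps listed above, semantic coverage and orphan detection, reduce to simple set operations over nodes and edges. The sketch below illustrates the idea on in-memory data. It does not use `DocKGAnalyzer` or the real SQLite schema, and the function names and `(src, rel, dst)` triple format are assumptions.

```python
# Annotation edge types, taken from the schema's edge-type table.
ANNOTATION_RELS = {"HAS_TOPIC", "MENTIONS_ENTITY", "HAS_KEYWORD"}


def semantic_coverage(chunk_ids, edges):
    """Percentage of chunks with at least one topic, entity, or keyword edge."""
    chunk_ids = set(chunk_ids)
    if not chunk_ids:
        return 0.0
    annotated = {
        src for src, rel, _dst in edges
        if rel in ANNOTATION_RELS and src in chunk_ids
    }
    return 100.0 * len(annotated) / len(chunk_ids)


def orphans(node_ids, edges):
    """Node IDs that appear in no edge, as either source or target."""
    connected = set()
    for src, _rel, dst in edges:
        connected.update((src, dst))
    return sorted(set(node_ids) - connected)
```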
403
+ ### `dockg snapshot` — Temporal snapshots
404
+
405
+ ```bash
406
+ dockg snapshot save VERSION # capture current metrics
407
+ dockg snapshot list # list all saved snapshots
408
+ dockg snapshot show VERSION # full detail + delta vs previous
409
+ dockg snapshot diff A B # side-by-side comparison
410
+ ```
411
+
412
+ Snapshots are stored in `.dockg/snapshots/`. Use them to track documentation coverage trends across iterations.
413
+
414
+ ```bash
415
+ # Save snapshots at key milestones
416
+ dockg snapshot save 0.1.0
417
+ # ... add more docs, rebuild ...
418
+ dockg snapshot save 0.2.0
419
+
420
+ # See what changed
421
+ dockg snapshot diff 0.1.0 0.2.0
422
+ ```
423
+
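Conceptually, a snapshot diff compares two sets of metrics and reports what moved. The sketch below assumes each snapshot can be reduced to a flat dictionary of numeric metrics. The real snapshot JSON layout is not described here, so both that shape and the name `diff_metrics` are hypothetical.

```python
def diff_metrics(old, new):
    """Return {metric: (old_value, new_value, delta)} for metrics that changed.

    A metric missing from one side is treated as 0.
    """
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key, 0), new.get(key, 0)
        if before != after:
            changes[key] = (before, after, after - before)
    return changes
```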
424
+ ### `dockg viz` — Streamlit visualizer
425
+
426
+ ```bash
427
+ dockg viz [--db PATH] [--port PORT] [--no-browser]
428
+ ```
429
+
430
+ Launches a Streamlit web app with three tabs:
431
+
432
+ - **Graph** — Interactive pyvis graph browser with node kind / edge type filters
433
+ - **Query** — Hybrid search UI with result ranking and provenance
434
+ - **Pack** — Passage pack explorer for LLM context injection
435
+
436
+ Requires the `[viz]` extra: `pip install 'doc-kg[viz]'`.
437
+
438
+ ### `dockg mcp` — MCP server
439
+
440
+ ```bash
441
+ dockg mcp [--repo PATH] [--db PATH] [--lancedb PATH] [--model NAME]
442
+ [--transport stdio|sse]
443
+ ```
444
+
445
+ Starts the FastMCP server. Default transport is `stdio` for AI agent integration; use `sse` for web clients.
446
+
447
+ ---
448
+
449
+ ## Knowledge Graph Schema
450
+
451
+ ### Node kinds
452
+
453
+ | Kind | Description |
454
+ |---|---|
455
+ | `document` | A source `.md` or `.txt` file |
456
+ | `section` | A heading-delimited section within a document |
457
+ | `chunk` | A semantically coherent text passage within a section |
458
+ | `topic` | A topic extracted from chunk text |
459
+ | `entity` | A named entity (person, place, organization, concept) |
460
+ | `keyword` | A keyword or key phrase from a chunk |
461
+
462
+ ### Edge types
463
+
464
+ | Type | Description |
465
+ |---|---|
466
+ | `CONTAINS` | Parent → child (document→section, section→chunk) |
467
+ | `NEXT` | Sequential ordering between same-level nodes |
468
+ | `REFERENCES` | A chunk references another document or section |
469
+ | `SIMILAR_TO` | Semantic similarity between chunks (LanceDB-derived) |
470
+ | `HAS_TOPIC` | Chunk → topic association |
471
+ | `MENTIONS_ENTITY` | Chunk → named entity association |
472
+ | `HAS_KEYWORD` | Chunk → keyword association |
473
+ | `CO_OCCURS_WITH` | Co-occurrence between topics/entities within a chunk |
474
+
475
+ ---
476
+
477
+ ## MCP Integration
478
+
479
+ See [docs/MCP.md](docs/MCP.md) for the full setup guide covering Claude Code, GitHub Copilot, Claude Desktop, and Cline.
480
+
481
+ ### Quick MCP setup
482
+
483
+ **Claude Code / Kilo Code** — add to `.mcp.json` in your repo root:
484
+
485
+ ```json
486
+ {
487
+ "mcpServers": {
488
+ "dockg": {
489
+ "command": "dockg-mcp",
490
+ "args": ["--repo", "."]
491
+ }
492
+ }
493
+ }
494
+ ```
495
+
496
+ **GitHub Copilot** — add to `.vscode/mcp.json`:
497
+
498
+ ```json
499
+ {
500
+ "servers": {
501
+ "dockg": {
502
+ "type": "stdio",
503
+ "command": "dockg-mcp",
504
+ "args": ["--repo", "."]
505
+ }
506
+ }
507
+ }
508
+ ```
509
+
510
+ ### MCP tools reference
511
+
512
+ | Tool | Description |
513
+ |---|---|
514
+ | `graph_stats()` | Node and edge counts by kind |
515
+ | `query_docs(q, k, hop, rels, max_nodes)` | Hybrid semantic + structural search |
516
+ | `pack_docs(q, k, hop, rels, max_chars, max_nodes)` | Source-grounded passages as Markdown |
517
+ | `get_node(node_id)` | Fetch a single node by ID |
518
+
519
+ ---
520
+
521
+ ## Python API
522
+
523
+ ```python
524
+ from doc_kg import DocKG
525
+
526
+ kg = DocKG(corpus_root="docs/")
527
+ kg.build(wipe=True)
528
+
529
+ # Hybrid query
530
+ result = kg.query("deployment configuration", k=8, hop=1)
531
+ for node in result.nodes:
532
+     print(node["id"], node["name"])
533
+
534
+ # Passage pack for LLM context
535
+ pack = kg.pack("authentication flow")
536
+ pack.save("context.md")
537
+ ```
538
+
539
+ ---
540
+
541
+ ## Configuration
542
+
543
+ Add to your project's `pyproject.toml` to persist common settings:
544
+
545
+ ```toml
546
+ [tool.dockg]
547
+ exclude = ["archive", "vendor", "generated"]
548
+ ```
549
+
550
+ ### Exclude priority order
551
+
552
+ Exclusions are **additive** across three levels:
553
+
554
+ 1. **Built-in** — hardcoded in `dockg.py`: `.git`, `.venv`, `__pycache__`, `.dockg`, `.codekg`, etc.
555
+ 2. **Config** — `[tool.dockg].exclude` from `pyproject.toml` (auto-loaded from corpus root)
556
+ 3. **CLI** — `--exclude-dir` flags (merged at call time)
557
+
558
+ All three are unioned—there is no override, only additive exclusion. Example:
559
+
560
+ ```bash
561
+ # pyproject.toml has: exclude = ["archive", "vendor"]
562
+ # This adds to those:
563
+ dockg build docs/ --exclude-dir node_modules --exclude-dir dist
564
+ # Result: archive + vendor + node_modules + dist are all excluded (plus built-ins)
565
+ ```
566
+
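The additive rule can be expressed as a plain set union. The sketch below is illustrative only. `BUILTIN_EXCLUDES` contains just the names the README lists (the real list in `dockg.py` ends with "etc." and may be longer), and both function names are hypothetical.

```python
from pathlib import PurePosixPath

# Only the built-ins named in the README; the actual list may contain more.
BUILTIN_EXCLUDES = frozenset({".git", ".venv", "__pycache__", ".dockg", ".codekg"})


def effective_excludes(config_excludes=(), cli_excludes=()):
    """Union the three exclusion levels; no level can remove another's entries."""
    return set(BUILTIN_EXCLUDES) | set(config_excludes) | set(cli_excludes)


def is_excluded(relative_path, excludes):
    """True when any directory component of the path is an excluded name."""
    parts = PurePosixPath(relative_path).parts
    return any(part in excludes for part in parts[:-1])
```

Checking every directory component, rather than only the top level, mirrors the "at every depth" wording used for `--exclude-dir`.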
567
+ ---
568
+
569
+ ## Storage Layout
570
+
571
+ After running `dockg build`, the following files are created:
572
+
573
+ ```
574
+ .dockg/
575
+   graph.sqlite        # SQLite knowledge graph (nodes + edges)
575
+   lancedb/            # LanceDB vector index
576
+   snapshots/          # Temporal snapshots (JSON)
577
+     manifest.json
578
+     <version>.json
580
+ ```
581
+
582
+ ---
583
+
584
+ ## Contributing
585
+
586
+ 1. Fork the repository and create a feature branch
587
+ 2. Install dev dependencies: `poetry install`
588
+ 3. Run the test suite: `pytest`
589
+ 4. Submit a pull request
590
+
591
+ ```bash
592
+ # Install with viz extras for full local development
593
+ poetry install -E viz
594
+
595
+ # Run all tests
596
+ pytest
597
+
598
+ # Lint and format
599
+ ruff check src/ tests/
600
+ ruff format src/ tests/
601
+ ```
602
+
603
+ ---
604
+
605
+ ## License
606
+
607
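+ [Elastic License 2.0](LICENSE): you may use, copy, distribute, and prepare derivative works of the software, but you may not offer it to third parties as a hosted or managed service, circumvent license-key functionality, or remove licensing notices.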
608
+