okb 1.0.0__tar.gz

okb-1.0.0/PKG-INFO ADDED
@@ -0,0 +1,397 @@
1
+ Metadata-Version: 2.3
2
+ Name: okb
3
+ Version: 1.0.0
4
+ Summary: Personal knowledge base with semantic search for LLMs
5
+ Requires-Python: >=3.11
6
+ Classifier: Programming Language :: Python :: 3
7
+ Classifier: Programming Language :: Python :: 3.11
8
+ Classifier: Programming Language :: Python :: 3.12
9
+ Classifier: Programming Language :: Python :: 3.13
10
+ Provides-Extra: all
11
+ Provides-Extra: dev
12
+ Provides-Extra: docx
13
+ Provides-Extra: llm
14
+ Provides-Extra: llm-bedrock
15
+ Provides-Extra: pdf
16
+ Provides-Extra: web
17
+ Requires-Dist: PyGithub (>=2.0.0)
18
+ Requires-Dist: anthropic (>=0.40.0) ; extra == "all"
19
+ Requires-Dist: anthropic (>=0.40.0) ; extra == "llm"
20
+ Requires-Dist: anthropic (>=0.40.0) ; extra == "llm-bedrock"
21
+ Requires-Dist: boto3 (>=1.28.0) ; extra == "llm-bedrock"
22
+ Requires-Dist: botocore (>=1.31.0) ; extra == "llm-bedrock"
23
+ Requires-Dist: click (>=8.0.0)
24
+ Requires-Dist: dropbox (>=12.0.0)
25
+ Requires-Dist: einops (>=0.7.0)
26
+ Requires-Dist: mcp (>=1.0.0)
27
+ Requires-Dist: modal (>=1.0.0)
28
+ Requires-Dist: pgvector (>=0.2.0)
29
+ Requires-Dist: psycopg[binary] (>=3.1.0)
30
+ Requires-Dist: pymupdf (>=1.23.0) ; extra == "all"
31
+ Requires-Dist: pymupdf (>=1.23.0) ; extra == "pdf"
32
+ Requires-Dist: pytest (>=7.0.0) ; extra == "dev"
33
+ Requires-Dist: python-docx (>=1.1.0) ; extra == "all"
34
+ Requires-Dist: python-docx (>=1.1.0) ; extra == "docx"
35
+ Requires-Dist: pyyaml (>=6.0)
36
+ Requires-Dist: ruff (>=0.1.0) ; extra == "dev"
37
+ Requires-Dist: sentence-transformers (>=2.2.0)
38
+ Requires-Dist: trafilatura (>=1.6.0) ; extra == "all"
39
+ Requires-Dist: trafilatura (>=1.6.0) ; extra == "web"
40
+ Requires-Dist: watchdog (>=3.0.0)
41
+ Requires-Dist: yoyo-migrations (>=8.0.0)
42
+ Description-Content-Type: text/markdown
43
+
44
+ # Owned Knowledge Base (OKB)
45
+
46
+ A local-first semantic search system for personal documents with Claude Code integration via MCP.
47
+
48
+ ## Installation
49
+
50
+ ```bash
51
+ pip install okb
52
+ ```
53
+
54
+ Or from source:
55
+ ```bash
56
+ git clone https://github.com/yourusername/okb
57
+ cd okb
58
+ pip install -e .
59
+ ```
60
+
61
+ ## Quick Start
62
+
63
+ ```bash
64
+ # 1. Start the database
65
+ okb db start
66
+
67
+ # 2. (Optional) Deploy Modal embedder for faster batch ingestion
68
+ okb modal deploy
69
+
70
+ # 3. Ingest your documents
71
+ okb ingest ~/notes ~/docs
72
+
73
+ # 4. Configure Claude Code MCP (see below)
74
+ ```
75
+
76
+ ## CLI Commands
77
+
78
+ | Command | Description |
79
+ |---------|-------------|
80
+ | `okb db start` | Start pgvector database container |
81
+ | `okb db stop` | Stop database container |
82
+ | `okb db status` | Show database status |
83
+ | `okb db destroy` | Remove container and volume (destructive) |
84
+ | `okb ingest <paths>` | Ingest documents into knowledge base |
85
+ | `okb ingest <paths> --local` | Ingest using CPU embedding (no Modal) |
86
+ | `okb serve` | Start MCP server (stdio, for Claude Code) |
87
+ | `okb serve --http` | Start HTTP MCP server with token auth |
88
+ | `okb watch <paths>` | Watch directories for changes |
89
+ | `okb config init` | Create default config file |
90
+ | `okb config show` | Show current configuration |
91
+ | `okb modal deploy` | Deploy GPU embedder to Modal |
92
+ | `okb token create` | Create API token for HTTP server |
93
+ | `okb token list` | List tokens for a database |
94
+ | `okb token revoke` | Revoke an API token |
95
+ | `okb sync list` | List available API sources (plugins) |
96
+ | `okb sync run <sources>` | Sync data from external APIs |
97
+ | `okb sync status` | Show last sync times |
98
+ | `okb rescan` | Check indexed files for changes, re-ingest stale |
99
+ | `okb rescan --dry-run` | Show what would change without executing |
100
+ | `okb rescan --delete` | Also remove documents for missing files |
101
+ | `okb llm status` | Show LLM config and connectivity |
102
+ | `okb llm deploy` | Deploy Modal LLM for open model inference |
103
+ | `okb llm clear-cache` | Clear LLM response cache |
104
+
105
+ ## Architecture
106
+
107
+ ```
108
+ ┌─────────────────────────────────────────────────────────────────────┐
109
+ │ INGESTION (Burst GPU) │
110
+ │ │
111
+ │ Local Files → Contextual Chunking → Modal (GPU T4) → pgvector │
112
+ │ │
113
+ │ ~/notes/project-x/api-design.md │
114
+ │ ↓ │
115
+ │ "Document: API Design Notes │
116
+ │ Project: project-x │
117
+ │ Section: Authentication │
118
+ │ Content: Use JWT tokens with..." │
119
+ │ ↓ │
120
+ │ [0.23, -0.41, 0.87, ...] → pgvector │
121
+ └─────────────────────────────────────────────────────────────────────┘
122
+
123
+ ┌─────────────────────────────────────────────────────────────────────┐
124
+ │ RETRIEVAL (Always-on, Local) │
125
+ │ │
126
+ │ Claude Code → MCP Server → CPU Embedding → pgvector → Results │
127
+ │ │
128
+ │ "How do I handle auth?" │
129
+ │ ↓ │
130
+ │ [0.19, -0.38, 0.91, ...] (local CPU, ~300ms) │
131
+ │ ↓ │
132
+ │ Cosine similarity search → Top 5 chunks with context │
133
+ └─────────────────────────────────────────────────────────────────────┘
134
+ ```
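The retrieval step in the diagram above ends with a cosine similarity search. In OKB that search runs inside pgvector; the pure-Python sketch below only illustrates the ranking idea. The function names and the `(text, vector)` pair layout are hypothetical and not part of the package.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunks, k=5):
    """Rank (text, vector) pairs by similarity to the query, best first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A database index avoids scoring every chunk in application code, which is presumably why the diagram places this step in pgvector.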
135
+
136
+ ## Configuration
137
+
138
+ Configuration is loaded from `~/.config/okb/config.yaml` (or `$XDG_CONFIG_HOME/okb/config.yaml`).
139
+
140
+ Create default config:
141
+ ```bash
142
+ okb config init
143
+ ```
144
+
145
+ Example config:
146
+ ```yaml
147
+ databases:
148
+ personal:
149
+ url: postgresql://knowledge:localdev@localhost:5433/personal_kb
150
+ default: true # Used when --db not specified (only one can be default)
151
+ managed: true # okb manages via Docker
152
+ work:
153
+ url: postgresql://knowledge:localdev@localhost:5433/work_kb
154
+ managed: true
155
+
156
+ docker:
157
+ port: 5433
158
+ container_name: okb-pgvector
159
+
160
+ chunking:
161
+ chunk_size: 512
162
+ chunk_overlap: 64
163
+ ```
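The `chunk_size` and `chunk_overlap` settings above suggest a sliding window. A minimal sketch of such a window follows, assuming both values count tokens; the README does not state the unit, and `chunk_tokens` is a hypothetical helper rather than OKB's chunker.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=64):
    """Split a token list into windows that share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks
```

With the defaults, each window starts 448 tokens after the previous one, so adjacent chunks share 64 tokens.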
164
+
165
+ Use `--db <name>` to target a specific database with any command.
166
+
167
+ Environment variables override config file settings:
168
+ - `KB_DATABASE_URL` - Database connection string
169
+ - `OKB_DOCKER_PORT` - Docker port mapping
170
+ - `OKB_CONTAINER_NAME` - Docker container name
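One way to picture the override order is a small function that applies the three variables on top of already-loaded settings. This is a sketch only: the flat setting keys (`database_url`, `docker_port`, `container_name`) and the `apply_env_overrides` name are assumptions made for illustration, not OKB's internal layout.

```python
import os

# Environment variable -> (hypothetical setting key, type conversion)
ENV_OVERRIDES = {
    "KB_DATABASE_URL": ("database_url", str),
    "OKB_DOCKER_PORT": ("docker_port", int),
    "OKB_CONTAINER_NAME": ("container_name", str),
}


def apply_env_overrides(settings, environ=None):
    """Return a copy of `settings` with any set environment variables applied."""
    environ = os.environ if environ is None else environ
    merged = dict(settings)
    for variable, (key, convert) in ENV_OVERRIDES.items():
        if variable in environ:
            merged[key] = convert(environ[variable])
    return merged
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.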
171
+
172
+ ### Project-Local Config
173
+
174
+ Override global config per-project with `.okbconf.yaml` (searched from CWD upward):
175
+
176
+ ```yaml
177
+ # .okbconf.yaml
178
+ default_database: work # Use 'work' db in this project
179
+
180
+ extensions:
181
+ skip_directories: # Extends global list
182
+ - test_fixtures
183
+ ```
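The upward search described above can be sketched with `pathlib`. This is an illustration, not OKB's code: `find_project_config` is a hypothetical name, and whether OKB stops at the first match or at some boundary directory is not stated in the README.

```python
from pathlib import Path


def find_project_config(start=None, filename=".okbconf.yaml"):
    """Return the nearest `filename` in `start` or any parent directory, else None."""
    current = (Path(start) if start is not None else Path.cwd()).resolve()
    for directory in (current, *current.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None
```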
184
+
185
+ Merge rules: scalars replace the global value, lists extend the global list, and dicts are deep-merged.
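A minimal sketch of those merge rules, assuming plain `dict`/`list` values as produced by a YAML loader. `merge_config` is a hypothetical name, and the `node_modules` entry in the usage note stands in for whatever the global list actually contains.

```python
def merge_config(base, override):
    """Merge `override` into `base`: scalars replace, lists extend, dicts deep-merge."""
    merged = dict(base)
    for key, value in override.items():
        existing = merged.get(key)
        if isinstance(existing, dict) and isinstance(value, dict):
            merged[key] = merge_config(existing, value)
        elif isinstance(existing, list) and isinstance(value, list):
            merged[key] = existing + value
        else:
            merged[key] = value
    return merged
```

For example, merging a project file that sets `default_database: work` and adds `test_fixtures` to `skip_directories` over a global config with `skip_directories: [node_modules]` yields `work` as the default database and both directories in the skip list.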
186
+
187
+ ### LLM Integration (Optional)
188
+
189
+ Enable LLM-based document classification and filtering:
190
+
191
+ ```yaml
192
+ llm:
193
+ provider: claude # "claude", "modal", or null (disabled)
194
+ model: claude-haiku-4-5-20251001
195
+ timeout: 30
196
+ cache_responses: true
197
+ ```
198
+
199
+ **Providers:**
200
+ | Provider | Setup | Cost |
201
+ |----------|-------|------|
202
+ | `claude` | `export ANTHROPIC_API_KEY=...` | ~$0.25/1M tokens |
203
+ | `modal` | `okb llm deploy` | ~$0.02/min GPU |
204
+
205
+ For Modal (no API key needed):
206
+ ```yaml
207
+ llm:
208
+ provider: modal
209
+ model: meta-llama/Llama-3.2-3B-Instruct
210
+ ```
211
+
212
+ **Pre-ingest filtering** - skip low-value content during sync:
213
+ ```yaml
214
+ plugins:
215
+ sources:
216
+ dropbox-paper:
217
+ llm_filter:
218
+ enabled: true
219
+ prompt: "Skip meeting notes and drafts"
220
+ action_on_skip: discard # or "archive"
221
+ ```
222
+
223
+ CLI commands:
224
+ ```bash
225
+ okb llm status # Show config and connectivity
226
+ okb llm deploy # Deploy Modal LLM (for provider: modal)
227
+ okb llm clear-cache # Clear response cache
228
+ ```
229
+
230
+ ## Claude Code MCP Config
231
+
232
+ ### stdio mode (default)
233
+
234
+ Add to your Claude Code MCP configuration:
235
+
236
+ ```json
237
+ {
238
+ "mcpServers": {
239
+ "knowledge-base": {
240
+ "command": "okb",
241
+ "args": ["serve"]
242
+ }
243
+ }
244
+ }
245
+ ```
246
+
247
+ ### HTTP mode (for remote/shared servers)
248
+
249
+ First, create a token and start the HTTP server:
250
+
251
+ ```bash
252
+ # Create a token
253
+ okb token create --db default -d "Claude Code"
254
+ # Output: okb_default_rw_a1b2c3d4e5f6g7h8
255
+
256
+ # Start HTTP server
257
+ okb serve --http --host 0.0.0.0 --port 8080
258
+ ```
259
+
260
+ Then configure Claude Code to connect via SSE:
261
+
262
+ ```json
263
+ {
264
+ "mcpServers": {
265
+ "knowledge-base": {
266
+ "type": "sse",
267
+ "url": "http://localhost:8080/sse",
268
+ "headers": {
269
+ "Authorization": "Bearer okb_default_rw_a1b2c3d4e5f6g7h8"
270
+ }
271
+ }
272
+ }
273
+ }
274
+ ```
275
+
276
+ ## MCP Tools (Available in Claude Code)
277
+
278
+ | Tool | Purpose |
279
+ |------|---------|
280
+ | `search_knowledge` | Semantic search with natural language queries |
281
+ | `keyword_search` | Exact keyword/symbol matching |
282
+ | `hybrid_search` | Combined semantic + keyword (RRF fusion) |
283
+ | `get_document` | Retrieve full document by path |
284
+ | `list_sources` | Show indexed document stats |
285
+ | `list_projects` | List known projects |
286
+ | `recent_documents` | Show recently indexed files |
287
+ | `save_knowledge` | Save knowledge from Claude for future reference |
288
+ | `delete_knowledge` | Delete a Claude-saved knowledge entry |
289
+ | `get_actionable_items` | Query tasks/events with structured filters |
290
+
291
+ ## Contextual Chunking
292
+
293
+ Documents are chunked with context for better retrieval:
294
+
295
+ ```
296
+ Document: Django Performance Notes
297
+ Project: student-app ← inferred from path or frontmatter
298
+ Section: Query Optimization ← extracted from markdown headers
299
+ Topics: django, performance ← from frontmatter tags
300
+ Content: Use `select_related()` to avoid N+1 queries...
301
+ ```
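The header format shown above can be assembled with a few lines of string formatting. The sketch below mirrors the field labels from the example; `build_contextual_chunk` and its parameters are hypothetical, and omitting empty fields is an assumption.

```python
def build_contextual_chunk(content, document, project=None, section=None, topics=()):
    """Prefix chunk content with document-level context lines."""
    lines = [f"Document: {document}"]
    if project:
        lines.append(f"Project: {project}")
    if section:
        lines.append(f"Section: {section}")
    if topics:
        lines.append("Topics: " + ", ".join(topics))
    lines.append(f"Content: {content}")
    return "\n".join(lines)
```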
302
+
303
+ ### Frontmatter Example
304
+
305
+ ```markdown
306
+ ---
307
+ tags: [django, postgresql, performance]
308
+ project: student-app
309
+ category: backend
310
+ ---
311
+
312
+ # Query Optimization
313
+
314
+ Use `select_related()` for foreign keys...
315
+ ```
316
+
317
+ ## Cost Estimate
318
+
319
+ | Component | Local | Cloud Alternative |
320
+ |-----------|-------|-------------------|
321
+ | pgvector | $0 | ~$15-30/mo (CloudSQL) |
322
+ | MCP Server | $0 | ~$5/mo (small VM) |
323
+ | Modal embedding | ~$0.50-2/mo | N/A |
324
+ | **Total** | **~$0.50-2/mo** | **~$20-35/mo** |
325
+
326
+ ## Development
327
+
328
+ ```bash
329
+ # Install dev dependencies
330
+ pip install -e ".[dev]"
331
+
332
+ # Run tests
333
+ pytest
334
+
335
+ # Lint and format
336
+ ruff check . && ruff format .
337
+ ```
338
+
339
+ ## Plugin System
340
+
341
+ OKB supports plugins for custom file parsers and API data sources (GitHub, Todoist, etc.).
342
+
343
+ ### Creating a Plugin
344
+
345
+ ```python
346
+ # File parser plugin
347
+ from okb.plugins import FileParser, Document
348
+
349
+ class EpubParser:
350
+ extensions = ['.epub']
351
+ source_type = 'epub'
352
+
353
+ def can_parse(self, path): return path.suffix.lower() == '.epub'
354
+ def parse(self, path, extra_metadata=None) -> Document: ...
355
+
356
+ # API source plugin
357
+ from okb.plugins import APISource, SyncState, Document
358
+
359
+ class GitHubSource:
360
+ name = 'github'
361
+ source_type = 'github-issue'
362
+
363
+ def configure(self, config): ...
364
+ def fetch(self, state: SyncState | None) -> tuple[list[Document], SyncState]: ...
365
+ ```
366
+
367
+ ### Registering Plugins
368
+
369
+ In your plugin's `pyproject.toml`:
370
+ ```toml
371
+ [project.entry-points."okb.parsers"]
372
+ epub = "okb_epub:EpubParser"
373
+
374
+ [project.entry-points."okb.sources"]
375
+ github = "okb_github:GitHubSource"
376
+ ```
377
+
378
+ ### Configuring API Sources
379
+
380
+ ```yaml
381
+ # ~/.config/okb/config.yaml
382
+ plugins:
383
+ sources:
384
+ github:
385
+ enabled: true
386
+ token: ${GITHUB_TOKEN} # Resolved from environment
387
+ repos: [owner/repo1, owner/repo2]
388
+ dropbox-paper:
389
+ enabled: true
390
+ token: ${DROPBOX_TOKEN}
391
+ folders: [/] # Optional: filter to specific folders
392
+ ```
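The `${GITHUB_TOKEN}` placeholders above are resolved from the environment. A minimal sketch of such a substitution over a loaded config follows. `resolve_env` is a hypothetical name, and leaving unset variables untouched is an assumption; the README does not say how OKB handles them.

```python
import os
import re

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value, environ=None):
    """Recursively replace ${VAR} placeholders in strings, dicts, and lists."""
    environ = os.environ if environ is None else environ
    if isinstance(value, str):
        # Unset variables keep their placeholder text (assumed behavior).
        return _ENV_PATTERN.sub(lambda m: environ.get(m.group(1), m.group(0)), value)
    if isinstance(value, dict):
        return {key: resolve_env(item, environ) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item, environ) for item in value]
    return value
```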
393
+
394
+ ## License
395
+
396
+ MIT
397
+