ocr-provenance-mcp 1.0.0
Potentially problematic release.
This version of ocr-provenance-mcp might be problematic.
- package/.env.example +55 -0
- package/LICENSE +78 -0
- package/README.md +1154 -0
- package/dist/bin-http.d.ts +24 -0
- package/dist/bin-http.d.ts.map +1 -0
- package/dist/bin-http.js +275 -0
- package/dist/bin-http.js.map +1 -0
- package/dist/bin-setup.d.ts +11 -0
- package/dist/bin-setup.d.ts.map +1 -0
- package/dist/bin-setup.js +610 -0
- package/dist/bin-setup.js.map +1 -0
- package/dist/bin.d.ts +16 -0
- package/dist/bin.d.ts.map +1 -0
- package/dist/bin.js +16 -0
- package/dist/bin.js.map +1 -0
- package/dist/index.d.ts +13 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +90 -0
- package/dist/index.js.map +1 -0
- package/dist/models/chunk.d.ts +136 -0
- package/dist/models/chunk.d.ts.map +1 -0
- package/dist/models/chunk.js +27 -0
- package/dist/models/chunk.js.map +1 -0
- package/dist/models/cluster.d.ts +79 -0
- package/dist/models/cluster.d.ts.map +1 -0
- package/dist/models/cluster.js +10 -0
- package/dist/models/cluster.js.map +1 -0
- package/dist/models/comparison.d.ts +62 -0
- package/dist/models/comparison.d.ts.map +1 -0
- package/dist/models/comparison.js +8 -0
- package/dist/models/comparison.js.map +1 -0
- package/dist/models/document.d.ts +104 -0
- package/dist/models/document.d.ts.map +1 -0
- package/dist/models/document.js +15 -0
- package/dist/models/document.js.map +1 -0
- package/dist/models/embedding.d.ts +87 -0
- package/dist/models/embedding.d.ts.map +1 -0
- package/dist/models/embedding.js +23 -0
- package/dist/models/embedding.js.map +1 -0
- package/dist/models/extraction.d.ts +15 -0
- package/dist/models/extraction.d.ts.map +1 -0
- package/dist/models/extraction.js +2 -0
- package/dist/models/extraction.js.map +1 -0
- package/dist/models/form-fill.d.ts +23 -0
- package/dist/models/form-fill.d.ts.map +1 -0
- package/dist/models/form-fill.js +2 -0
- package/dist/models/form-fill.js.map +1 -0
- package/dist/models/image.d.ts +177 -0
- package/dist/models/image.d.ts.map +1 -0
- package/dist/models/image.js +8 -0
- package/dist/models/image.js.map +1 -0
- package/dist/models/index.d.ts +14 -0
- package/dist/models/index.d.ts.map +1 -0
- package/dist/models/index.js +22 -0
- package/dist/models/index.js.map +1 -0
- package/dist/models/provenance.d.ts +174 -0
- package/dist/models/provenance.d.ts.map +1 -0
- package/dist/models/provenance.js +53 -0
- package/dist/models/provenance.js.map +1 -0
- package/dist/models/uploaded-file.d.ts +20 -0
- package/dist/models/uploaded-file.d.ts.map +1 -0
- package/dist/models/uploaded-file.js +2 -0
- package/dist/models/uploaded-file.js.map +1 -0
- package/dist/server/errors.d.ts +93 -0
- package/dist/server/errors.d.ts.map +1 -0
- package/dist/server/errors.js +256 -0
- package/dist/server/errors.js.map +1 -0
- package/dist/server/events.d.ts +36 -0
- package/dist/server/events.d.ts.map +1 -0
- package/dist/server/events.js +48 -0
- package/dist/server/events.js.map +1 -0
- package/dist/server/permissions.d.ts +26 -0
- package/dist/server/permissions.d.ts.map +1 -0
- package/dist/server/permissions.js +194 -0
- package/dist/server/permissions.js.map +1 -0
- package/dist/server/register-tools.d.ts +25 -0
- package/dist/server/register-tools.d.ts.map +1 -0
- package/dist/server/register-tools.js +102 -0
- package/dist/server/register-tools.js.map +1 -0
- package/dist/server/startup.d.ts +16 -0
- package/dist/server/startup.d.ts.map +1 -0
- package/dist/server/startup.js +37 -0
- package/dist/server/startup.js.map +1 -0
- package/dist/server/state.d.ts +166 -0
- package/dist/server/state.d.ts.map +1 -0
- package/dist/server/state.js +424 -0
- package/dist/server/state.js.map +1 -0
- package/dist/server/transports/http-transport.d.ts +37 -0
- package/dist/server/transports/http-transport.d.ts.map +1 -0
- package/dist/server/transports/http-transport.js +204 -0
- package/dist/server/transports/http-transport.js.map +1 -0
- package/dist/server/transports/index.d.ts +9 -0
- package/dist/server/transports/index.d.ts.map +1 -0
- package/dist/server/transports/index.js +9 -0
- package/dist/server/transports/index.js.map +1 -0
- package/dist/server/transports/session-manager.d.ts +40 -0
- package/dist/server/transports/session-manager.d.ts.map +1 -0
- package/dist/server/transports/session-manager.js +74 -0
- package/dist/server/transports/session-manager.js.map +1 -0
- package/dist/server/types.d.ts +82 -0
- package/dist/server/types.d.ts.map +1 -0
- package/dist/server/types.js +14 -0
- package/dist/server/types.js.map +1 -0
- package/dist/services/audit.d.ts +26 -0
- package/dist/services/audit.d.ts.map +1 -0
- package/dist/services/audit.js +43 -0
- package/dist/services/audit.js.map +1 -0
- package/dist/services/chunking/chunk-deduplicator.d.ts +33 -0
- package/dist/services/chunking/chunk-deduplicator.d.ts.map +1 -0
- package/dist/services/chunking/chunk-deduplicator.js +46 -0
- package/dist/services/chunking/chunk-deduplicator.js.map +1 -0
- package/dist/services/chunking/chunk-merger.d.ts +26 -0
- package/dist/services/chunking/chunk-merger.d.ts.map +1 -0
- package/dist/services/chunking/chunk-merger.js +94 -0
- package/dist/services/chunking/chunk-merger.js.map +1 -0
- package/dist/services/chunking/chunker.d.ts +62 -0
- package/dist/services/chunking/chunker.d.ts.map +1 -0
- package/dist/services/chunking/chunker.js +566 -0
- package/dist/services/chunking/chunker.js.map +1 -0
- package/dist/services/chunking/heading-normalizer.d.ts +33 -0
- package/dist/services/chunking/heading-normalizer.d.ts.map +1 -0
- package/dist/services/chunking/heading-normalizer.js +101 -0
- package/dist/services/chunking/heading-normalizer.js.map +1 -0
- package/dist/services/chunking/json-block-analyzer.d.ts +163 -0
- package/dist/services/chunking/json-block-analyzer.d.ts.map +1 -0
- package/dist/services/chunking/json-block-analyzer.js +1033 -0
- package/dist/services/chunking/json-block-analyzer.js.map +1 -0
- package/dist/services/chunking/markdown-parser.d.ts +75 -0
- package/dist/services/chunking/markdown-parser.d.ts.map +1 -0
- package/dist/services/chunking/markdown-parser.js +428 -0
- package/dist/services/chunking/markdown-parser.js.map +1 -0
- package/dist/services/chunking/text-normalizer.d.ts +20 -0
- package/dist/services/chunking/text-normalizer.d.ts.map +1 -0
- package/dist/services/chunking/text-normalizer.js +36 -0
- package/dist/services/chunking/text-normalizer.js.map +1 -0
- package/dist/services/clm/contract-schemas.d.ts +36 -0
- package/dist/services/clm/contract-schemas.d.ts.map +1 -0
- package/dist/services/clm/contract-schemas.js +92 -0
- package/dist/services/clm/contract-schemas.js.map +1 -0
- package/dist/services/clm/summarization.d.ts +46 -0
- package/dist/services/clm/summarization.d.ts.map +1 -0
- package/dist/services/clm/summarization.js +61 -0
- package/dist/services/clm/summarization.js.map +1 -0
- package/dist/services/clustering/clustering-service.d.ts +58 -0
- package/dist/services/clustering/clustering-service.d.ts.map +1 -0
- package/dist/services/clustering/clustering-service.js +467 -0
- package/dist/services/clustering/clustering-service.js.map +1 -0
- package/dist/services/comparison/diff-service.d.ts +41 -0
- package/dist/services/comparison/diff-service.d.ts.map +1 -0
- package/dist/services/comparison/diff-service.js +120 -0
- package/dist/services/comparison/diff-service.js.map +1 -0
- package/dist/services/embedding/embedder.d.ts +55 -0
- package/dist/services/embedding/embedder.d.ts.map +1 -0
- package/dist/services/embedding/embedder.js +202 -0
- package/dist/services/embedding/embedder.js.map +1 -0
- package/dist/services/embedding/nomic.d.ts +67 -0
- package/dist/services/embedding/nomic.d.ts.map +1 -0
- package/dist/services/embedding/nomic.js +280 -0
- package/dist/services/embedding/nomic.js.map +1 -0
- package/dist/services/gemini/circuit-breaker.d.ts +106 -0
- package/dist/services/gemini/circuit-breaker.d.ts.map +1 -0
- package/dist/services/gemini/circuit-breaker.js +237 -0
- package/dist/services/gemini/circuit-breaker.js.map +1 -0
- package/dist/services/gemini/client.d.ts +173 -0
- package/dist/services/gemini/client.d.ts.map +1 -0
- package/dist/services/gemini/client.js +483 -0
- package/dist/services/gemini/client.js.map +1 -0
- package/dist/services/gemini/config.d.ts +116 -0
- package/dist/services/gemini/config.d.ts.map +1 -0
- package/dist/services/gemini/config.js +118 -0
- package/dist/services/gemini/config.js.map +1 -0
- package/dist/services/gemini/index.d.ts +9 -0
- package/dist/services/gemini/index.d.ts.map +1 -0
- package/dist/services/gemini/index.js +13 -0
- package/dist/services/gemini/index.js.map +1 -0
- package/dist/services/gemini/rate-limiter.d.ts +62 -0
- package/dist/services/gemini/rate-limiter.d.ts.map +1 -0
- package/dist/services/gemini/rate-limiter.js +120 -0
- package/dist/services/gemini/rate-limiter.js.map +1 -0
- package/dist/services/images/extractor.d.ts +88 -0
- package/dist/services/images/extractor.d.ts.map +1 -0
- package/dist/services/images/extractor.js +340 -0
- package/dist/services/images/extractor.js.map +1 -0
- package/dist/services/images/optimizer.d.ts +130 -0
- package/dist/services/images/optimizer.d.ts.map +1 -0
- package/dist/services/images/optimizer.js +228 -0
- package/dist/services/images/optimizer.js.map +1 -0
- package/dist/services/ocr/datalab.d.ts +64 -0
- package/dist/services/ocr/datalab.d.ts.map +1 -0
- package/dist/services/ocr/datalab.js +425 -0
- package/dist/services/ocr/datalab.js.map +1 -0
- package/dist/services/ocr/errors.d.ts +38 -0
- package/dist/services/ocr/errors.d.ts.map +1 -0
- package/dist/services/ocr/errors.js +83 -0
- package/dist/services/ocr/errors.js.map +1 -0
- package/dist/services/ocr/file-manager.d.ts +76 -0
- package/dist/services/ocr/file-manager.d.ts.map +1 -0
- package/dist/services/ocr/file-manager.js +238 -0
- package/dist/services/ocr/file-manager.js.map +1 -0
- package/dist/services/ocr/form-fill.d.ts +48 -0
- package/dist/services/ocr/form-fill.d.ts.map +1 -0
- package/dist/services/ocr/form-fill.js +213 -0
- package/dist/services/ocr/form-fill.js.map +1 -0
- package/dist/services/ocr/processor.d.ts +95 -0
- package/dist/services/ocr/processor.d.ts.map +1 -0
- package/dist/services/ocr/processor.js +259 -0
- package/dist/services/ocr/processor.js.map +1 -0
- package/dist/services/provenance/agent-metadata.d.ts +82 -0
- package/dist/services/provenance/agent-metadata.d.ts.map +1 -0
- package/dist/services/provenance/agent-metadata.js +106 -0
- package/dist/services/provenance/agent-metadata.js.map +1 -0
- package/dist/services/provenance/chain-hash.d.ts +57 -0
- package/dist/services/provenance/chain-hash.d.ts.map +1 -0
- package/dist/services/provenance/chain-hash.js +131 -0
- package/dist/services/provenance/chain-hash.js.map +1 -0
- package/dist/services/provenance/exporter.d.ts +202 -0
- package/dist/services/provenance/exporter.d.ts.map +1 -0
- package/dist/services/provenance/exporter.js +457 -0
- package/dist/services/provenance/exporter.js.map +1 -0
- package/dist/services/provenance/index.d.ts +15 -0
- package/dist/services/provenance/index.d.ts.map +1 -0
- package/dist/services/provenance/index.js +17 -0
- package/dist/services/provenance/index.js.map +1 -0
- package/dist/services/provenance/tracker.d.ts +138 -0
- package/dist/services/provenance/tracker.d.ts.map +1 -0
- package/dist/services/provenance/tracker.js +293 -0
- package/dist/services/provenance/tracker.js.map +1 -0
- package/dist/services/provenance/verifier.d.ts +153 -0
- package/dist/services/provenance/verifier.d.ts.map +1 -0
- package/dist/services/provenance/verifier.js +536 -0
- package/dist/services/provenance/verifier.js.map +1 -0
- package/dist/services/python-pool.d.ts +70 -0
- package/dist/services/python-pool.d.ts.map +1 -0
- package/dist/services/python-pool.js +265 -0
- package/dist/services/python-pool.js.map +1 -0
- package/dist/services/search/bm25.d.ts +180 -0
- package/dist/services/search/bm25.d.ts.map +1 -0
- package/dist/services/search/bm25.js +656 -0
- package/dist/services/search/bm25.js.map +1 -0
- package/dist/services/search/fusion.d.ts +103 -0
- package/dist/services/search/fusion.d.ts.map +1 -0
- package/dist/services/search/fusion.js +122 -0
- package/dist/services/search/fusion.js.map +1 -0
- package/dist/services/search/local-reranker.d.ts +30 -0
- package/dist/services/search/local-reranker.d.ts.map +1 -0
- package/dist/services/search/local-reranker.js +123 -0
- package/dist/services/search/local-reranker.js.map +1 -0
- package/dist/services/search/quality.d.ts +11 -0
- package/dist/services/search/quality.d.ts.map +1 -0
- package/dist/services/search/quality.js +17 -0
- package/dist/services/search/quality.js.map +1 -0
- package/dist/services/search/query-classifier.d.ts +34 -0
- package/dist/services/search/query-classifier.d.ts.map +1 -0
- package/dist/services/search/query-classifier.js +114 -0
- package/dist/services/search/query-classifier.js.map +1 -0
- package/dist/services/search/query-expander.d.ts +73 -0
- package/dist/services/search/query-expander.d.ts.map +1 -0
- package/dist/services/search/query-expander.js +281 -0
- package/dist/services/search/query-expander.js.map +1 -0
- package/dist/services/search/reranker.d.ts +44 -0
- package/dist/services/search/reranker.d.ts.map +1 -0
- package/dist/services/search/reranker.js +101 -0
- package/dist/services/search/reranker.js.map +1 -0
- package/dist/services/storage/database/annotation-operations.d.ts +113 -0
- package/dist/services/storage/database/annotation-operations.d.ts.map +1 -0
- package/dist/services/storage/database/annotation-operations.js +177 -0
- package/dist/services/storage/database/annotation-operations.js.map +1 -0
- package/dist/services/storage/database/approval-operations.d.ts +132 -0
- package/dist/services/storage/database/approval-operations.d.ts.map +1 -0
- package/dist/services/storage/database/approval-operations.js +206 -0
- package/dist/services/storage/database/approval-operations.js.map +1 -0
- package/dist/services/storage/database/chunk-operations.d.ts +132 -0
- package/dist/services/storage/database/chunk-operations.d.ts.map +1 -0
- package/dist/services/storage/database/chunk-operations.js +306 -0
- package/dist/services/storage/database/chunk-operations.js.map +1 -0
- package/dist/services/storage/database/cluster-operations.d.ts +97 -0
- package/dist/services/storage/database/cluster-operations.d.ts.map +1 -0
- package/dist/services/storage/database/cluster-operations.js +258 -0
- package/dist/services/storage/database/cluster-operations.js.map +1 -0
- package/dist/services/storage/database/comparison-operations.d.ts +41 -0
- package/dist/services/storage/database/comparison-operations.d.ts.map +1 -0
- package/dist/services/storage/database/comparison-operations.js +65 -0
- package/dist/services/storage/database/comparison-operations.js.map +1 -0
- package/dist/services/storage/database/converters.d.ts +36 -0
- package/dist/services/storage/database/converters.d.ts.map +1 -0
- package/dist/services/storage/database/converters.js +244 -0
- package/dist/services/storage/database/converters.js.map +1 -0
- package/dist/services/storage/database/document-operations.d.ts +145 -0
- package/dist/services/storage/database/document-operations.d.ts.map +1 -0
- package/dist/services/storage/database/document-operations.js +498 -0
- package/dist/services/storage/database/document-operations.js.map +1 -0
- package/dist/services/storage/database/embedding-operations.d.ts +130 -0
- package/dist/services/storage/database/embedding-operations.d.ts.map +1 -0
- package/dist/services/storage/database/embedding-operations.js +315 -0
- package/dist/services/storage/database/embedding-operations.js.map +1 -0
- package/dist/services/storage/database/extraction-operations.d.ts +47 -0
- package/dist/services/storage/database/extraction-operations.d.ts.map +1 -0
- package/dist/services/storage/database/extraction-operations.js +85 -0
- package/dist/services/storage/database/extraction-operations.js.map +1 -0
- package/dist/services/storage/database/form-fill-operations.d.ts +58 -0
- package/dist/services/storage/database/form-fill-operations.d.ts.map +1 -0
- package/dist/services/storage/database/form-fill-operations.js +116 -0
- package/dist/services/storage/database/form-fill-operations.js.map +1 -0
- package/dist/services/storage/database/helpers.d.ts +29 -0
- package/dist/services/storage/database/helpers.d.ts.map +1 -0
- package/dist/services/storage/database/helpers.js +55 -0
- package/dist/services/storage/database/helpers.js.map +1 -0
- package/dist/services/storage/database/image-operations.d.ts +202 -0
- package/dist/services/storage/database/image-operations.d.ts.map +1 -0
- package/dist/services/storage/database/image-operations.js +484 -0
- package/dist/services/storage/database/image-operations.js.map +1 -0
- package/dist/services/storage/database/index.d.ts +13 -0
- package/dist/services/storage/database/index.d.ts.map +1 -0
- package/dist/services/storage/database/index.js +16 -0
- package/dist/services/storage/database/index.js.map +1 -0
- package/dist/services/storage/database/lock-operations.d.ts +59 -0
- package/dist/services/storage/database/lock-operations.d.ts.map +1 -0
- package/dist/services/storage/database/lock-operations.js +89 -0
- package/dist/services/storage/database/lock-operations.js.map +1 -0
- package/dist/services/storage/database/obligation-operations.d.ts +88 -0
- package/dist/services/storage/database/obligation-operations.d.ts.map +1 -0
- package/dist/services/storage/database/obligation-operations.js +206 -0
- package/dist/services/storage/database/obligation-operations.js.map +1 -0
- package/dist/services/storage/database/ocr-operations.d.ts +33 -0
- package/dist/services/storage/database/ocr-operations.d.ts.map +1 -0
- package/dist/services/storage/database/ocr-operations.js +70 -0
- package/dist/services/storage/database/ocr-operations.js.map +1 -0
- package/dist/services/storage/database/playbook-operations.d.ts +72 -0
- package/dist/services/storage/database/playbook-operations.d.ts.map +1 -0
- package/dist/services/storage/database/playbook-operations.js +247 -0
- package/dist/services/storage/database/playbook-operations.js.map +1 -0
- package/dist/services/storage/database/provenance-operations.d.ts +112 -0
- package/dist/services/storage/database/provenance-operations.d.ts.map +1 -0
- package/dist/services/storage/database/provenance-operations.js +251 -0
- package/dist/services/storage/database/provenance-operations.js.map +1 -0
- package/dist/services/storage/database/service.d.ts +142 -0
- package/dist/services/storage/database/service.d.ts.map +1 -0
- package/dist/services/storage/database/service.js +310 -0
- package/dist/services/storage/database/service.js.map +1 -0
- package/dist/services/storage/database/static-operations.d.ts +30 -0
- package/dist/services/storage/database/static-operations.d.ts.map +1 -0
- package/dist/services/storage/database/static-operations.js +218 -0
- package/dist/services/storage/database/static-operations.js.map +1 -0
- package/dist/services/storage/database/stats-operations.d.ts +101 -0
- package/dist/services/storage/database/stats-operations.d.ts.map +1 -0
- package/dist/services/storage/database/stats-operations.js +394 -0
- package/dist/services/storage/database/stats-operations.js.map +1 -0
- package/dist/services/storage/database/tag-operations.d.ts +76 -0
- package/dist/services/storage/database/tag-operations.d.ts.map +1 -0
- package/dist/services/storage/database/tag-operations.js +178 -0
- package/dist/services/storage/database/tag-operations.js.map +1 -0
- package/dist/services/storage/database/types.d.ts +286 -0
- package/dist/services/storage/database/types.d.ts.map +1 -0
- package/dist/services/storage/database/types.js +39 -0
- package/dist/services/storage/database/types.js.map +1 -0
- package/dist/services/storage/database/upload-operations.d.ts +71 -0
- package/dist/services/storage/database/upload-operations.d.ts.map +1 -0
- package/dist/services/storage/database/upload-operations.js +124 -0
- package/dist/services/storage/database/upload-operations.js.map +1 -0
- package/dist/services/storage/database/user-operations.d.ts +102 -0
- package/dist/services/storage/database/user-operations.d.ts.map +1 -0
- package/dist/services/storage/database/user-operations.js +151 -0
- package/dist/services/storage/database/user-operations.js.map +1 -0
- package/dist/services/storage/database/workflow-operations.d.ts +98 -0
- package/dist/services/storage/database/workflow-operations.d.ts.map +1 -0
- package/dist/services/storage/database/workflow-operations.js +157 -0
- package/dist/services/storage/database/workflow-operations.js.map +1 -0
- package/dist/services/storage/database.d.ts +16 -0
- package/dist/services/storage/database.d.ts.map +1 -0
- package/dist/services/storage/database.js +15 -0
- package/dist/services/storage/database.js.map +1 -0
- package/dist/services/storage/index.d.ts +10 -0
- package/dist/services/storage/index.d.ts.map +1 -0
- package/dist/services/storage/index.js +10 -0
- package/dist/services/storage/index.js.map +1 -0
- package/dist/services/storage/migrations/index.d.ts +16 -0
- package/dist/services/storage/migrations/index.d.ts.map +1 -0
- package/dist/services/storage/migrations/index.js +20 -0
- package/dist/services/storage/migrations/index.js.map +1 -0
- package/dist/services/storage/migrations/operations.d.ts +40 -0
- package/dist/services/storage/migrations/operations.d.ts.map +1 -0
- package/dist/services/storage/migrations/operations.js +2910 -0
- package/dist/services/storage/migrations/operations.js.map +1 -0
- package/dist/services/storage/migrations/schema-definitions.d.ts +306 -0
- package/dist/services/storage/migrations/schema-definitions.d.ts.map +1 -0
- package/dist/services/storage/migrations/schema-definitions.js +1006 -0
- package/dist/services/storage/migrations/schema-definitions.js.map +1 -0
- package/dist/services/storage/migrations/schema-helpers.d.ts +50 -0
- package/dist/services/storage/migrations/schema-helpers.d.ts.map +1 -0
- package/dist/services/storage/migrations/schema-helpers.js +176 -0
- package/dist/services/storage/migrations/schema-helpers.js.map +1 -0
- package/dist/services/storage/migrations/types.d.ts +15 -0
- package/dist/services/storage/migrations/types.d.ts.map +1 -0
- package/dist/services/storage/migrations/types.js +21 -0
- package/dist/services/storage/migrations/types.js.map +1 -0
- package/dist/services/storage/migrations/verification.d.ts +20 -0
- package/dist/services/storage/migrations/verification.d.ts.map +1 -0
- package/dist/services/storage/migrations/verification.js +78 -0
- package/dist/services/storage/migrations/verification.js.map +1 -0
- package/dist/services/storage/migrations.d.ts +16 -0
- package/dist/services/storage/migrations.d.ts.map +1 -0
- package/dist/services/storage/migrations.js +17 -0
- package/dist/services/storage/migrations.js.map +1 -0
- package/dist/services/storage/types.d.ts +12 -0
- package/dist/services/storage/types.d.ts.map +1 -0
- package/dist/services/storage/types.js +5 -0
- package/dist/services/storage/types.js.map +1 -0
- package/dist/services/storage/vector.d.ts +208 -0
- package/dist/services/storage/vector.d.ts.map +1 -0
- package/dist/services/storage/vector.js +526 -0
- package/dist/services/storage/vector.js.map +1 -0
- package/dist/services/vlm/pipeline.d.ts +194 -0
- package/dist/services/vlm/pipeline.d.ts.map +1 -0
- package/dist/services/vlm/pipeline.js +800 -0
- package/dist/services/vlm/pipeline.js.map +1 -0
- package/dist/services/vlm/prompts.d.ts +171 -0
- package/dist/services/vlm/prompts.d.ts.map +1 -0
- package/dist/services/vlm/prompts.js +229 -0
- package/dist/services/vlm/prompts.js.map +1 -0
- package/dist/services/vlm/service.d.ts +174 -0
- package/dist/services/vlm/service.d.ts.map +1 -0
- package/dist/services/vlm/service.js +256 -0
- package/dist/services/vlm/service.js.map +1 -0
- package/dist/services/webhook-delivery.d.ts +4 -0
- package/dist/services/webhook-delivery.d.ts.map +1 -0
- package/dist/services/webhook-delivery.js +140 -0
- package/dist/services/webhook-delivery.js.map +1 -0
- package/dist/tools/chunks.d.ts +19 -0
- package/dist/tools/chunks.d.ts.map +1 -0
- package/dist/tools/chunks.js +392 -0
- package/dist/tools/chunks.js.map +1 -0
- package/dist/tools/clm.d.ts +16 -0
- package/dist/tools/clm.d.ts.map +1 -0
- package/dist/tools/clm.js +668 -0
- package/dist/tools/clm.js.map +1 -0
- package/dist/tools/clustering.d.ts +13 -0
- package/dist/tools/clustering.d.ts.map +1 -0
- package/dist/tools/clustering.js +498 -0
- package/dist/tools/clustering.js.map +1 -0
- package/dist/tools/collaboration.d.ts +15 -0
- package/dist/tools/collaboration.d.ts.map +1 -0
- package/dist/tools/collaboration.js +516 -0
- package/dist/tools/collaboration.js.map +1 -0
- package/dist/tools/comparison.d.ts +13 -0
- package/dist/tools/comparison.d.ts.map +1 -0
- package/dist/tools/comparison.js +735 -0
- package/dist/tools/comparison.js.map +1 -0
- package/dist/tools/compliance.d.ts +15 -0
- package/dist/tools/compliance.d.ts.map +1 -0
- package/dist/tools/compliance.js +640 -0
- package/dist/tools/compliance.js.map +1 -0
- package/dist/tools/config.d.ts +19 -0
- package/dist/tools/config.d.ts.map +1 -0
- package/dist/tools/config.js +213 -0
- package/dist/tools/config.js.map +1 -0
- package/dist/tools/database.d.ts +62 -0
- package/dist/tools/database.d.ts.map +1 -0
- package/dist/tools/database.js +288 -0
- package/dist/tools/database.js.map +1 -0
- package/dist/tools/documents.d.ts +61 -0
- package/dist/tools/documents.d.ts.map +1 -0
- package/dist/tools/documents.js +1624 -0
- package/dist/tools/documents.js.map +1 -0
- package/dist/tools/embeddings.d.ts +14 -0
- package/dist/tools/embeddings.d.ts.map +1 -0
- package/dist/tools/embeddings.js +626 -0
- package/dist/tools/embeddings.js.map +1 -0
- package/dist/tools/evaluation.d.ts +25 -0
- package/dist/tools/evaluation.d.ts.map +1 -0
- package/dist/tools/evaluation.js +523 -0
- package/dist/tools/evaluation.js.map +1 -0
- package/dist/tools/events.d.ts +16 -0
- package/dist/tools/events.d.ts.map +1 -0
- package/dist/tools/events.js +493 -0
- package/dist/tools/events.js.map +1 -0
- package/dist/tools/extraction-structured.d.ts +13 -0
- package/dist/tools/extraction-structured.d.ts.map +1 -0
- package/dist/tools/extraction-structured.js +390 -0
- package/dist/tools/extraction-structured.js.map +1 -0
- package/dist/tools/extraction.d.ts +24 -0
- package/dist/tools/extraction.d.ts.map +1 -0
- package/dist/tools/extraction.js +424 -0
- package/dist/tools/extraction.js.map +1 -0
- package/dist/tools/file-management.d.ts +14 -0
- package/dist/tools/file-management.d.ts.map +1 -0
- package/dist/tools/file-management.js +523 -0
- package/dist/tools/file-management.js.map +1 -0
- package/dist/tools/form-fill.d.ts +13 -0
- package/dist/tools/form-fill.d.ts.map +1 -0
- package/dist/tools/form-fill.js +250 -0
- package/dist/tools/form-fill.js.map +1 -0
- package/dist/tools/health.d.ts +19 -0
- package/dist/tools/health.d.ts.map +1 -0
- package/dist/tools/health.js +229 -0
- package/dist/tools/health.js.map +1 -0
- package/dist/tools/images.d.ts +54 -0
- package/dist/tools/images.d.ts.map +1 -0
- package/dist/tools/images.js +787 -0
- package/dist/tools/images.js.map +1 -0
- package/dist/tools/ingestion.d.ts +94 -0
- package/dist/tools/ingestion.d.ts.map +1 -0
- package/dist/tools/ingestion.js +1659 -0
- package/dist/tools/ingestion.js.map +1 -0
- package/dist/tools/intelligence.d.ts +18 -0
- package/dist/tools/intelligence.d.ts.map +1 -0
- package/dist/tools/intelligence.js +1039 -0
- package/dist/tools/intelligence.js.map +1 -0
- package/dist/tools/provenance.d.ts +51 -0
- package/dist/tools/provenance.d.ts.map +1 -0
- package/dist/tools/provenance.js +691 -0
- package/dist/tools/provenance.js.map +1 -0
- package/dist/tools/reports.d.ts +41 -0
- package/dist/tools/reports.d.ts.map +1 -0
- package/dist/tools/reports.js +1394 -0
- package/dist/tools/reports.js.map +1 -0
- package/dist/tools/search.d.ts +35 -0
- package/dist/tools/search.d.ts.map +1 -0
- package/dist/tools/search.js +2528 -0
- package/dist/tools/search.js.map +1 -0
- package/dist/tools/shared.d.ts +52 -0
- package/dist/tools/shared.d.ts.map +1 -0
- package/dist/tools/shared.js +54 -0
- package/dist/tools/shared.js.map +1 -0
- package/dist/tools/tags.d.ts +15 -0
- package/dist/tools/tags.d.ts.map +1 -0
- package/dist/tools/tags.js +287 -0
- package/dist/tools/tags.js.map +1 -0
- package/dist/tools/timeline.d.ts +15 -0
- package/dist/tools/timeline.d.ts.map +1 -0
- package/dist/tools/timeline.js +14 -0
- package/dist/tools/timeline.js.map +1 -0
- package/dist/tools/users.d.ts +14 -0
- package/dist/tools/users.d.ts.map +1 -0
- package/dist/tools/users.js +257 -0
- package/dist/tools/users.js.map +1 -0
- package/dist/tools/vlm.d.ts +40 -0
- package/dist/tools/vlm.d.ts.map +1 -0
- package/dist/tools/vlm.js +475 -0
- package/dist/tools/vlm.js.map +1 -0
- package/dist/tools/workflow.d.ts +16 -0
- package/dist/tools/workflow.d.ts.map +1 -0
- package/dist/tools/workflow.js +495 -0
- package/dist/tools/workflow.js.map +1 -0
- package/dist/utils/backoff.d.ts +53 -0
- package/dist/utils/backoff.d.ts.map +1 -0
- package/dist/utils/backoff.js +78 -0
- package/dist/utils/backoff.js.map +1 -0
- package/dist/utils/config-persistence.d.ts +33 -0
- package/dist/utils/config-persistence.d.ts.map +1 -0
- package/dist/utils/config-persistence.js +61 -0
- package/dist/utils/config-persistence.js.map +1 -0
- package/dist/utils/hash.d.ts +65 -0
- package/dist/utils/hash.d.ts.map +1 -0
- package/dist/utils/hash.js +146 -0
- package/dist/utils/hash.js.map +1 -0
- package/dist/utils/math.d.ts +21 -0
- package/dist/utils/math.d.ts.map +1 -0
- package/dist/utils/math.js +39 -0
- package/dist/utils/math.js.map +1 -0
- package/dist/utils/validation.d.ts +697 -0
- package/dist/utils/validation.d.ts.map +1 -0
- package/dist/utils/validation.js +529 -0
- package/dist/utils/validation.js.map +1 -0
- package/package.json +96 -0
- package/python/.gitkeep +0 -0
- package/python/__init__.py +104 -0
- package/python/clustering_worker.py +440 -0
- package/python/docx_image_extractor.py +524 -0
- package/python/embedding_worker.py +552 -0
- package/python/file_manager_worker.py +564 -0
- package/python/form_fill_worker.py +399 -0
- package/python/gpu_utils.py +582 -0
- package/python/image_extractor.py +317 -0
- package/python/image_optimizer.py +444 -0
- package/python/ocr_worker.py +712 -0
- package/python/pyproject.toml +76 -0
- package/python/requirements.txt +51 -0
- package/python/reranker_worker.py +87 -0
package/README.md
ADDED
@@ -0,0 +1,1154 @@

# OCR Provenance MCP Server

**Turn thousands of documents into a searchable, AI-queryable knowledge base -- with full provenance.**

Point this at a folder of PDFs, Word docs, spreadsheets, images, or presentations. Minutes later, Claude can search, analyze, compare, and answer questions across your entire document collection -- with a cryptographic audit trail proving exactly where every answer came from.

[](LICENSE)
[](https://nodejs.org/)
[](https://www.typescriptlang.org/)
[](https://modelcontextprotocol.io/)
[](#tool-reference-141-tools)
[](#development)
[](https://github.com/ChrisRoyse/OCR-Provenance/pkgs/container/ocr-provenance)
[](https://github.com/ChrisRoyse/OCR-Provenance/actions/workflows/docker-publish.yml)

---

## Why This Exists

AI assistants can't read your files natively. They can't search across 500 PDFs, compare contract versions, or find the one email buried in a discovery dump. This server bridges that gap.

It's a [Model Context Protocol](https://modelcontextprotocol.io/) server that gives Claude (or any MCP client) the ability to **ingest, OCR, search, compare, cluster, tag, version-track, and reason over** your documents -- with a cryptographic audit trail proving exactly where every answer came from.

### What Happens When You Ingest Documents

```
Your files (PDF, DOCX, XLSX, images, presentations...)
  -> OCR text extraction via Datalab API (3 accuracy modes)
  -> Hybrid section-aware chunking with markdown parsing
  -> GPU vector embeddings (nomic-embed-text-v1.5, 768-dim)
  -> Image extraction + AI vision analysis (Gemini 3 Flash)
  -> Full-text + semantic + hybrid search indexes
  -> Document clustering by similarity (HDBSCAN / agglomerative / k-means)
  -> Cross-entity tagging system
  -> Document version tracking (re-ingestion detects changes)
  -> SHA-256 provenance chain on every artifact
```
|
|
38
|
+
|
|
39
|
+
**18 supported file types:** PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, PNG, JPG, JPEG, TIFF, TIF, BMP, GIF, WEBP, TXT, CSV, MD
|
|
40
|
+
|
|
41
|
+
---
|
|
42
|
+
|
|
43
|
+
## Real-World Use Cases
|
|
44
|
+
|
|
45
|
+
### Litigation & Legal Discovery
|
|
46
|
+
|
|
47
|
+
You have 3,000 documents from a civil case -- contracts, emails, depositions, medical records, invoices, and correspondence spanning 8 years. Normally this takes a team of paralegals weeks to organize.
|
|
48
|
+
|
|
49
|
+
```
|
|
50
|
+
"Search all documents for references to the March 2024 amendment"
|
|
51
|
+
"Compare the original contract with the signed version -- what changed?"
|
|
52
|
+
"Find every document mentioning Dr. Rivera and cluster them by topic"
|
|
53
|
+
"Which invoices were submitted after the termination date?"
|
|
54
|
+
"Build me a timeline of all communications between Smith and Davis Corp"
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
The provenance chain means you can trace every search result back to its exact source page and document -- critical for legal admissibility and audit.
|
|
58
|
+
|
|
59
|
+
### Medical Records Review
|
|
60
|
+
|
|
61
|
+
An insurance adjuster needs to review 800+ pages of medical records across 15 providers for a personal injury claim.
|
|
62
|
+
|
|
63
|
+
```
|
|
64
|
+
"Find all references to lumbar spine across every provider's records"
|
|
65
|
+
"What medications were prescribed between June and December 2024?"
|
|
66
|
+
"Compare the initial ER report with the orthopedic surgeon's assessment"
|
|
67
|
+
"Extract all diagnosis codes and dates from the treatment records"
|
|
68
|
+
"Cluster these records by provider and summarize each provider's findings"
|
|
69
|
+
```
|
|
70
|
+
|
|
71
|
+
### Financial Audit & Compliance
|
|
72
|
+
|
|
73
|
+
A forensic accountant is reviewing 5 years of financial records for a fraud investigation -- bank statements, tax returns, invoices, receipts, and internal reports.
|
|
74
|
+
|
|
75
|
+
```
|
|
76
|
+
"Find all transactions over $10,000 across every bank statement"
|
|
77
|
+
"Compare this year's tax return with last year's -- what changed?"
|
|
78
|
+
"Search for any mention of offshore accounts or shell companies"
|
|
79
|
+
"Cluster all invoices by vendor and flag any with duplicate amounts"
|
|
80
|
+
"Which expense reports don't have matching receipts?"
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
### Insurance Claims Processing
|
|
84
|
+
|
|
85
|
+
An adjuster is handling a commercial property damage claim with engineering reports, contractor estimates, photographs, and policy documents.
|
|
86
|
+
|
|
87
|
+
```
|
|
88
|
+
"What is the total estimated repair cost across all contractor bids?"
|
|
89
|
+
"Compare the policyholder's damage report with the independent adjuster's assessment"
|
|
90
|
+
"Find all photos showing water damage and describe what's in each one"
|
|
91
|
+
"Does the policy cover the type of damage described in the engineering report?"
|
|
92
|
+
"Cluster all documents by damage category -- structural, electrical, plumbing"
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
### Academic Research
|
|
96
|
+
|
|
97
|
+
A PhD student is doing a literature review across 200+ papers, supplementary materials, and datasets.
|
|
98
|
+
|
|
99
|
+
```
|
|
100
|
+
"Find all papers that discuss transformer architectures for protein folding"
|
|
101
|
+
"Which papers cite the 2023 AlphaFold study?"
|
|
102
|
+
"Compare the methodology sections of these three competing approaches"
|
|
103
|
+
"Cluster these papers by research topic and list the top 5 clusters"
|
|
104
|
+
"Build me a RAG context block about attention mechanisms for my thesis"
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
### Real Estate Due Diligence
|
|
108
|
+
|
|
109
|
+
A commercial real estate firm is evaluating a property acquisition -- title reports, environmental assessments, lease agreements, zoning documents, and inspection reports.
|
|
110
|
+
|
|
111
|
+
```
|
|
112
|
+
"Are there any environmental liens or violations in the Phase I report?"
|
|
113
|
+
"Compare the rent rolls from 2023 and 2024 -- which tenants left?"
|
|
114
|
+
"Find all lease clauses related to early termination or renewal options"
|
|
115
|
+
"What does the zoning report say about permitted commercial uses?"
|
|
116
|
+
"Cluster all inspection findings by severity -- critical, major, minor"
|
|
117
|
+
```
|
|
118
|
+
|
|
119
|
+
### HR & Employment Investigations
|
|
120
|
+
|
|
121
|
+
An HR director is investigating a workplace complaint with emails, performance reviews, chat logs, and policy documents.
|
|
122
|
+
|
|
123
|
+
```
|
|
124
|
+
"Find all communications between the complainant and the respondent"
|
|
125
|
+
"When was the anti-harassment policy last updated and what does it say?"
|
|
126
|
+
"Compare the employee's performance reviews from 2023 and 2024"
|
|
127
|
+
"Search for any prior complaints or disciplinary actions in these records"
|
|
128
|
+
```
|
|
129
|
+
|
|
130
|
+
---
|
|
131
|
+
|
|
132
|
+
## Quick Start
|
|
133
|
+
|
|
134
|
+
```
|
|
135
|
+
1. Create a database -> ocr_db_create { name: "my-case" }
|
|
136
|
+
2. Select it -> ocr_db_select { database_name: "my-case" }
|
|
137
|
+
3. Ingest a folder -> ocr_ingest_directory { directory_path: "/path/to/docs" }
|
|
138
|
+
4. Process everything -> ocr_process_pending {}
|
|
139
|
+
5. Search -> ocr_search_hybrid { query: "breach of contract" }
|
|
140
|
+
6. Ask questions -> ocr_rag_context { question: "What were the settlement terms?" }
|
|
141
|
+
7. Verify provenance -> ocr_provenance_verify { item_id: "doc-id" }
|
|
142
|
+
```
|
|
143
|
+
|
|
144
|
+
Each database is fully isolated. Create one per case, project, or client.
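
For clients that talk to the server directly, the first steps above can be sketched as MCP `tools/call` requests. This is a minimal sketch: the tool names and arguments are copied from the Quick Start, while `build_tool_call` and `build_quick_start_requests` are hypothetical helpers, and a real session would first need the MCP `initialize` handshake.

```python
import json

# Tool names and arguments come from the Quick Start steps above.
QUICK_START_STEPS = [
    ("ocr_db_create", {"name": "my-case"}),
    ("ocr_db_select", {"database_name": "my-case"}),
    ("ocr_ingest_directory", {"directory_path": "/path/to/docs"}),
    ("ocr_process_pending", {}),
    ("ocr_search_hybrid", {"query": "breach of contract"}),
]


def build_tool_call(request_id, tool_name, arguments):
    """Wrap one tool invocation in a JSON-RPC 2.0 `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def build_quick_start_requests(steps=QUICK_START_STEPS):
    """Return the requests in order, numbering request ids from 1."""
    return [
        build_tool_call(request_id, name, arguments)
        for request_id, (name, arguments) in enumerate(steps, start=1)
    ]


if __name__ == "__main__":
    # One JSON document per line, as a stdio client would write them.
    for request in build_quick_start_requests():
        print(json.dumps(request))
```

Order matters here: `ocr_db_select` has to run before ingestion so the documents land in the intended database.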
|
|
145
|
+
|
|
146
|
+
---
|
|
147
|
+
|
|
148
|
+
## Architecture
|
|
149
|
+
|
|
150
|
+
```
|
|
151
|
+
┌─────────────────────────────────────────────────────────────┐
|
|
152
|
+
│ MCP Server (stdio) │
|
|
153
|
+
│ TypeScript + @modelcontextprotocol/sdk │
|
|
154
|
+
│ 141 tools across 22 tool modules │
|
|
155
|
+
├─────────────────────────────────────────────────────────────┤
|
|
156
|
+
│ │
|
|
157
|
+
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
|
|
158
|
+
│ │ Ingestion│ │ Search │ │ Analysis │ │ Reports │ │
|
|
159
|
+
│ │ 9 tools │ │ 12 tools │ │ 35 tools │ │ 9 tools │ │
|
|
160
|
+
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
|
|
161
|
+
│ │ │ │ │ │
|
|
162
|
+
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
|
|
163
|
+
│ │ VLM │ │ Images │ │ Tags │ │ Intel │ │
|
|
164
|
+
│ │ 6 tools │ │ 14 tools │ │ 6 tools │ │ 4 tools │ │
|
|
165
|
+
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
|
|
166
|
+
│ │ │ │ │ │
|
|
167
|
+
│ ┌────┴──────────────┴──────────────┴──────────────┴────┐ │
|
|
168
|
+
│ │ Service Layer (11 domains) │ │
|
|
169
|
+
│ │ OCR · Chunking · Embedding · Search · VLM │ │
|
|
170
|
+
│ │ Provenance · Comparison · Clustering · Gemini │ │
|
|
171
|
+
│ │ Images · Storage │ │
|
|
172
|
+
│ └────┬──────────────┬──────────────┬───────────────────┘ │
|
|
173
|
+
│ │ │ │ │
|
|
174
|
+
│ ┌────┴────┐ ┌────┴────┐ ┌────┴─────┐ │
|
|
175
|
+
│ │ SQLite │ │sqlite-vec│ │ FTS5 │ │
|
|
176
|
+
│ │ 18 tbls │ │ vectors │ │ indexes │ │
|
|
177
|
+
│ └─────────┘ └─────────┘ └──────────┘ │
|
|
178
|
+
│ │
|
|
179
|
+
│ ┌──────────────────────────────────────────────────────┐ │
|
|
180
|
+
│ │ Python Workers (9 processes) │ │
|
|
181
|
+
│ │ OCR · Embedding · Clustering · Image Extraction │ │
|
|
182
|
+
│ │ DOCX Extraction · Image Optimizer · Form Fill │ │
|
|
183
|
+
│ │ File Manager · Local Reranker │ │
|
|
184
|
+
│ └──────────────────────────────────────────────────────┘ │
|
|
185
|
+
│ │
|
|
186
|
+
│ ┌──────────────────────────────────────────────────────┐ │
|
|
187
|
+
│ │ External APIs │ │
|
|
188
|
+
│ │ Datalab (OCR/Forms) · Gemini 3 Flash (VLM/AI) │ │
|
|
189
|
+
│ │ Nomic embed v1.5 (local GPU, 768-dim) │ │
|
|
190
|
+
│ └──────────────────────────────────────────────────────┘ │
|
|
191
|
+
└─────────────────────────────────────────────────────────────┘
|
|
192
|
+
```
|
|
193
|
+
|
|
194
|
+
- **TypeScript MCP Server** -- 141 tools across 22 modules, Zod validation, provenance tracking
|
|
195
|
+
- **Python Workers** (9) -- OCR, GPU embedding, clustering, image extraction, DOCX extraction, image optimization, form fill, file management, local reranking
|
|
196
|
+
- **SQLite + sqlite-vec** -- 18 tables, FTS5 full-text search, vector similarity search, WAL mode
|
|
197
|
+
- **Gemini 3 Flash** -- vision analysis (image description, classification, PDF analysis)
|
|
198
|
+
- **Datalab API** -- document OCR, form filling, structured extraction, cloud storage
|
|
199
|
+
- **nomic-embed-text-v1.5** -- 768-dim local embeddings (CUDA / MPS / CPU)
|
|
200
|
+
|
|
201
|
+
### Hybrid Section-Aware Chunking
|
|
202
|
+
|
|
203
|
+
The chunking pipeline produces semantically coherent chunks that respect document structure:
|
|
204
|
+
|
|
205
|
+
```
|
|
206
|
+
OCR text (markdown)
|
|
207
|
+
│
|
|
208
|
+
├─ Text Normalization ──── Clean whitespace, normalize line breaks
|
|
209
|
+
├─ Heading Normalization ─ Fix skipped heading levels (h1→h3 becomes h1→h2)
|
|
210
|
+
├─ Markdown Parsing ────── Parse into heading/paragraph/table/code/list blocks
|
|
211
|
+
├─ JSON Block Analysis ──── Detect atomic regions (tables, figures) from OCR blocks
|
|
212
|
+
├─ Section-Aware Splitting ─ Chunk at heading boundaries, respect atomic regions
|
|
213
|
+
├─ Page Tracking ────────── Assign page numbers via Datalab page separators
|
|
214
|
+
├─ Chunk Merging ────────── Merge heading-only chunks into their content
|
|
215
|
+
├─ Chunk Deduplication ──── Remove near-duplicate chunks via fuzzy matching
|
|
216
|
+
├─ Header/Footer Tagging ── Auto-tag header/footer chunks for search exclusion
|
|
217
|
+
└─ Metadata Enrichment ──── section_path, heading_context, content_types per chunk
|
|
218
|
+
```
|
|
219
|
+
|
|
220
|
+
Each chunk carries: `section_path` (e.g., "Introduction > Background"), `heading_context`, `content_types` (table/code/text/list), and `page_number` -- all searchable as filters.
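
How a `section_path` such as "Introduction > Background" can be derived while splitting at heading boundaries is sketched below. This is an illustration only, not the server's chunker: `split_by_headings` is a hypothetical helper that ignores atomic regions, page tracking, size limits, and `#` characters inside code fences.

```python
import re

# Matches ATX-style markdown headings such as "## Background".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown):
    """Split markdown at headings and label each chunk with its section path."""
    chunks = []
    path = []  # open headings as (level, title), outermost first
    buffer = []

    def flush():
        text = "\n".join(buffer).strip()
        if text:  # heading-only sections produce no chunk of their own
            section_path = " > ".join(title for _, title in path)
            chunks.append({"section_path": section_path, "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            buffer.append(line)
    flush()
    return chunks
```

A chunk under `## Background` inside `# Introduction` comes back with `section_path` set to `Introduction > Background`, which is the shape a prefix filter like `section_path_filter` would match against.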
|
|
221
|
+
|
|
222
|
+
### How Search Works
|
|
223
|
+
|
|
224
|
+
Three search modes are available; hybrid mode fuses the BM25 and semantic rankings via Reciprocal Rank Fusion:
|
|
225
|
+
|
|
226
|
+
| Mode | Best For | How It Works |
|
|
227
|
+
|------|----------|--------------|
|
|
228
|
+
| **BM25** | Exact terms, case numbers, names | FTS5 full-text search with Porter stemming |
|
|
229
|
+
| **Semantic** | Conceptual queries, paraphrases | Vector similarity via nomic-embed-text-v1.5 |
|
|
230
|
+
| **Hybrid** (recommended) | General questions | BM25 + semantic fused, optional local re-ranking |
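
Reciprocal Rank Fusion itself is small enough to sketch. The version below is a generic illustration, not the server's code: `reciprocal_rank_fusion` is a hypothetical helper, and `k=60` is the constant commonly used in the literature rather than a value confirmed by this README.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: (-scores[item], item))


if __name__ == "__main__":
    bm25_hits = ["c1", "c2", "c3"]       # hypothetical chunk ids from BM25
    semantic_hits = ["c3", "c1", "c4"]   # hypothetical chunk ids from vector search
    print(reciprocal_rank_fusion([bm25_hits, semantic_hits]))
```

Because only ranks are used, BM25 scores and cosine similarities never need to be put on a common scale, which is the usual reason for choosing RRF in hybrid search.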
|
|
231
|
+
|
|
232
|
+
#### Search Enhancement Stack
|
|
233
|
+
|
|
234
|
+
All three search modes support a shared enhancement stack:
|
|
235
|
+
|
|
236
|
+
- **Query classification** -- heuristic analysis auto-routes queries between exact/semantic/mixed modes (`auto_route` on hybrid)
|
|
237
|
+
- **Query expansion** -- legal/medical synonym injection for broader recall (`expand_query`, default on for hybrid)
|
|
238
|
+
- **Local cross-encoder reranking** -- Python-based cross-encoder model (ms-marco-MiniLM-L-12-v2) re-scores results locally for relevance (`rerank`)
|
|
239
|
+
- **Quality-weighted ranking** -- always-on quality score multiplier (0.8x--1.0x) ranks higher-quality OCR results above lower-quality ones of similar relevance
|
|
240
|
+
- **Chunk-level filters** -- `content_type_filter`, `section_path_filter` (prefix match), `heading_filter` (LIKE), `page_range_filter`, `quality_boost`, `table_columns_contain`
|
|
241
|
+
- **Metadata filters** -- title/author/subject LIKE matching, document ID filtering, cluster filtering, quality score threshold
|
|
242
|
+
- **VLM image enrichment** -- search results from VLM descriptions include image metadata (path, dimensions, type)
|
|
243
|
+
- **Table metadata** -- search results include table column headers and row/column counts from OCR blocks
|
|
244
|
+
- **Context chunks** -- surrounding chunks automatically included with results for broader context
|
|
245
|
+
- **Group by document** -- deduplicate results by document, returning only the best match per document (`group_by_document`)
|
|
246
|
+
- **Header/footer exclusion** -- header/footer chunks auto-tagged during ingestion and excluded from search by default (`include_headers_footers`)
|
|
247
|
+
- **Document context** -- optionally attach cluster labels and comparison info to results (`include_document_context`)
|
|
248
|
+
- **Provenance inclusion** -- attach full provenance chain to each search result
|
|
249
|
+
- **Search persistence** -- save, list, retrieve, and re-execute named searches
|
|
250
|
+
- **Cross-database search** -- BM25 search across all databases simultaneously
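
As one illustration from the stack above, quality-weighted ranking can be sketched as a score multiplier. This assumes a quality score normalized to [0, 1] and a linear mapping onto the 0.8x--1.0x range. The README states only the range, so the mapping, the field names, and both helper functions are hypothetical.

```python
def quality_multiplier(quality, low=0.8, high=1.0):
    """Map a quality score in [0, 1] linearly onto [low, high] (assumed mapping)."""
    clamped = min(max(quality, 0.0), 1.0)
    return low + (high - low) * clamped


def apply_quality_weighting(results):
    """Scale each result's score by its quality multiplier and re-sort, best first."""
    weighted = [
        {**result, "score": result["score"] * quality_multiplier(result["quality"])}
        for result in results
    ]
    return sorted(weighted, key=lambda result: result["score"], reverse=True)
```

With these assumptions, a result scored 1.0 from a very poor scan (quality 0.0) drops to 0.8 and ranks below a 0.9 result from a clean scan.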
|
|
251
|
+
|
|
252
|
+
### Provenance Chain
|
|
253
|
+
|
|
254
|
+
Every artifact carries a SHA-256 hash chain back to its source document:
|
|
255
|
+
|
|
256
|
+
```
|
|
257
|
+
DOCUMENT (depth 0)
|
|
258
|
+
+-- OCR_RESULT (depth 1)
|
|
259
|
+
| +-- CHUNK (depth 2) -> EMBEDDING (depth 3)
|
|
260
|
+
| +-- IMAGE (depth 2) -> VLM_DESCRIPTION (depth 3) -> EMBEDDING (depth 4)
|
|
261
|
+
| +-- EXTRACTION (depth 2) -> EMBEDDING (depth 3)
|
|
262
|
+
+-- FORM_FILL (depth 0)
|
|
263
|
+
+-- COMPARISON (depth 2)
|
|
264
|
+
+-- CLUSTERING (depth 2)
|
|
265
|
+
```
|
|
266
|
+
|
|
267
|
+
Export in JSON, W3C PROV-JSON, or CSV for regulatory compliance. Query provenance with 12+ filters, view processing timelines, and analyze per-processor statistics.
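
The idea behind a SHA-256 hash chain can be sketched as follows. This is a simplified linear chain under stated assumptions: the record fields (`content_hash`, `parent_hash`, `chain_hash`) and both functions are hypothetical and do not reflect the server's actual schema or the branching tree shown above.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_record(kind, content: bytes, parent=None):
    """Create a provenance record whose chain hash commits to its parent."""
    content_hash = sha256_hex(content)
    parent_hash = parent["chain_hash"] if parent else ""
    return {
        "kind": kind,
        "depth": parent["depth"] + 1 if parent else 0,
        "content_hash": content_hash,
        "parent_hash": parent_hash,
        "chain_hash": sha256_hex((parent_hash + content_hash).encode("ascii")),
    }


def verify_chain(records, contents):
    """Recompute every hash from the root down; any altered content fails."""
    parent_hash = ""
    for record, content in zip(records, contents):
        content_hash = sha256_hex(content)
        expected = sha256_hex((parent_hash + content_hash).encode("ascii"))
        if record["content_hash"] != content_hash or record["chain_hash"] != expected:
            return False
        parent_hash = record["chain_hash"]
    return True
```

Because each `chain_hash` includes the parent's, changing the source document invalidates every artifact derived from it, which is what makes the trail auditable.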
|
|
268
|
+
|
|
269
|
+
### Document Version Tracking
|
|
270
|
+
|
|
271
|
+
When you re-ingest a file, the system detects changes automatically:
|
|
272
|
+
|
|
273
|
+
- **Same hash** -- skip (already processed)
|
|
274
|
+
- **Different hash** -- creates a new version linked to the previous via `previous_version_id`
|
|
275
|
+
- **Version history** -- retrieve all versions of a document ordered by creation date
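
The skip-or-version decision can be sketched with an in-memory store. This is a minimal sketch: `ingest` and the `store` layout are hypothetical, the id format is invented for illustration, and comparing only against the latest version is an assumption.

```python
import hashlib


def ingest(store, file_path, data: bytes):
    """Register a file, skipping unchanged content and versioning changed content.

    `store` maps file_path -> list of version records, oldest first.
    Returns ("skipped" | "created", record).
    """
    file_hash = hashlib.sha256(data).hexdigest()
    versions = store.setdefault(file_path, [])
    if versions and versions[-1]["file_hash"] == file_hash:
        return "skipped", versions[-1]  # same hash: already processed
    record = {
        "id": f"{file_path}#v{len(versions) + 1}",  # illustrative id format
        "file_hash": file_hash,
        "previous_version_id": versions[-1]["id"] if versions else None,
    }
    versions.append(record)
    return "created", record
```

Following `previous_version_id` links from the newest record walks the history back to the first ingestion.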
|
|
276
|
+
|
|
277
|
+
### Document Workflow
|
|
278
|
+
|
|
279
|
+
Tag-based workflow state management for document lifecycle:
|
|
280
|
+
|
|
281
|
+
- **States:** draft, review, approved, published, archived
|
|
282
|
+
- **History:** every state change is preserved (append-only)
|
|
283
|
+
- **Actions:** get current state, set new state, view full state history
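
An append-only state history can be sketched in a few lines. The state names come from the list above; the `WorkflowHistory` class is hypothetical, does not model the tag storage the server uses, and assumes any state may follow any other.

```python
from datetime import datetime, timezone

WORKFLOW_STATES = ("draft", "review", "approved", "published", "archived")


class WorkflowHistory:
    """Append-only log of workflow state changes for one document."""

    def __init__(self):
        self._events = []

    def set_state(self, state, at=None):
        if state not in WORKFLOW_STATES:
            raise ValueError(f"unknown workflow state: {state!r}")
        # Earlier events are never edited or removed; new ones are appended.
        self._events.append({"state": state, "at": at or datetime.now(timezone.utc)})

    def current_state(self):
        return self._events[-1]["state"] if self._events else None

    def history(self):
        return list(self._events)  # copy, so callers cannot mutate the log
```

The current state is simply the last entry, so "get current state" and "view full state history" read from the same log.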
|
|
284
|
+
|
|
285
|
+
---
|
|
286
|
+
|
|
287
|
+
## Requirements
|
|
288
|
+
|
|
289
|
+
| Component | Version | Notes |
|
|
290
|
+
|-----------|---------|-------|
|
|
291
|
+
| Node.js | >= 20 | MCP server runtime |
|
|
292
|
+
| Python | >= 3.10 | Worker processes |
|
|
293
|
+
| PyTorch | >= 2.0 | Embedding model inference |
|
|
294
|
+
| GPU | Optional | CUDA or Apple MPS; CPU works fine, just slower |
|
|
295
|
+
|
|
296
|
+
### API Keys
|
|
297
|
+
|
|
298
|
+
| Key | Get From | Used For |
|
|
299
|
+
|-----|----------|----------|
|
|
300
|
+
| `DATALAB_API_KEY` | [datalab.to](https://www.datalab.to) | OCR, form fill, file upload, structured extraction |
|
|
301
|
+
| `GEMINI_API_KEY` | [Google AI Studio](https://aistudio.google.com/) | VLM image description and classification |
|
|
302
|
+
|
|
303
|
+
---
|
|
304
|
+
|
|
305
|
+
## Installation
|
|
306
|
+
|
|
307
|
+
### One-Command Install
|
|
308
|
+
|
|
309
|
+
```bash
|
|
310
|
+
npm install -g ocr-provenance-mcp && ocr-provenance-mcp-setup
|
|
311
|
+
```
|
|
312
|
+
|
|
313
|
+
The setup wizard will prompt for your API keys, validate them, pull the Docker image, and register the server with your AI client. Requires [Docker Desktop](https://docker.com/products/docker-desktop).
|
|
314
|
+
|
|
315
|
+
### From Source
|
|
316
|
+
|
|
317
|
+
```bash
|
|
318
|
+
# Clone and build
|
|
319
|
+
git clone https://github.com/ChrisRoyse/OCR-Provenance.git
|
|
320
|
+
cd OCR-Provenance
|
|
321
|
+
npm install && npm run build
|
|
322
|
+
|
|
323
|
+
# Install globally (makes `ocr-provenance-mcp` available everywhere)
|
|
324
|
+
npm link
|
|
325
|
+
|
|
326
|
+
# Python dependencies
|
|
327
|
+
pip install torch transformers sentence-transformers numpy scikit-learn hdbscan pymupdf pillow python-docx requests
|
|
328
|
+
|
|
329
|
+
# Download embedding model (~270MB, one-time)
|
|
330
|
+
pip install huggingface_hub
|
|
331
|
+
huggingface-cli download nomic-ai/nomic-embed-text-v1.5 --local-dir models/nomic-embed-text-v1.5
|
|
332
|
+
|
|
333
|
+
# Configure API keys
|
|
334
|
+
cp .env.example .env
|
|
335
|
+
# Edit .env with your DATALAB_API_KEY and GEMINI_API_KEY
|
|
336
|
+
|
|
337
|
+
# Verify
|
|
338
|
+
ocr-provenance-mcp # Should print "Tools registered: 141" on stderr
|
|
339
|
+
```
|
|
340
|
+
|
|
341
|
+
> **PyTorch GPU note:** If `pip install torch` installs a CPU-only build, install the CUDA build explicitly:
|
|
342
|
+
> ```bash
|
|
343
|
+
> pip install torch --index-url https://download.pytorch.org/whl/cu124
|
|
344
|
+
> ```
|
|
345
|
+
|
|
346
|
+
<details>
|
|
347
|
+
<summary><strong>Platform-specific notes</strong></summary>
|
|
348
|
+
|
|
349
|
+
**Linux / WSL2:** Install NVIDIA drivers and CUDA toolkit. For WSL2, install the [NVIDIA CUDA on WSL driver](https://developer.nvidia.com/cuda/wsl) from the Windows side.
|
|
350
|
+
|
|
351
|
+
**macOS (Apple Silicon):** MPS acceleration works automatically. Just `pip install torch torchvision torchaudio`.
|
|
352
|
+
|
|
353
|
+
**Windows:** Use WSL2 for best compatibility. Native Windows works too -- the server auto-detects `python` vs `python3`.
|
|
354
|
+
|
|
355
|
+
</details>
|
|
356
|
+
|
|
357
|
+
<details>
|
|
358
|
+
<summary><strong>Custom embedding model location</strong></summary>
|
|
359
|
+
|
|
360
|
+
If you install globally and want the model elsewhere:
|
|
361
|
+
|
|
362
|
+
```bash
|
|
363
|
+
# In your .env file:
|
|
364
|
+
EMBEDDING_MODEL_PATH=/path/to/nomic-embed-text-v1.5
|
|
365
|
+
```
|
|
366
|
+
|
|
367
|
+
The server checks: `EMBEDDING_MODEL_PATH` env var -> `models/` in the package directory -> `~/.ocr-provenance/models/`
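
That lookup order can be sketched as below. This is an approximation, not the server's code: `resolve_model_dir` is a hypothetical helper, and both the `nomic-embed-text-v1.5` subdirectory name and the fall-through when a candidate directory is missing are assumptions.

```python
import os
from pathlib import Path

MODEL_DIR_NAME = "nomic-embed-text-v1.5"  # assumed subdirectory name


def resolve_model_dir(package_dir, env=None, home=None):
    """Return the first existing model directory in the documented order."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    candidates = []
    if env.get("EMBEDDING_MODEL_PATH"):
        candidates.append(Path(env["EMBEDDING_MODEL_PATH"]))        # 1. env var
    candidates.append(Path(package_dir) / "models" / MODEL_DIR_NAME)  # 2. package dir
    candidates.append(home / ".ocr-provenance" / "models" / MODEL_DIR_NAME)  # 3. home dir

    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None  # nothing found; the caller decides how to report it
```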
|
|
368
|
+
|
|
369
|
+
</details>
|
|
370
|
+
|
|
371
|
+
---
|
|
372
|
+
|
|
373
|
+
## Connecting to Claude
|
|
374
|
+
|
|
375
|
+
### Claude Code
|
|
376
|
+
|
|
377
|
+
```bash
|
|
378
|
+
# Register globally (available in all projects)
|
|
379
|
+
claude mcp add ocr-provenance -s user \
|
|
380
|
+
-e OCR_PROVENANCE_ENV_FILE=/path/to/OCR-Provenance/.env \
|
|
381
|
+
-e NODE_OPTIONS=--max-semi-space-size=64 \
|
|
382
|
+
-- ocr-provenance-mcp
|
|
383
|
+
```
|
|
384
|
+
|
|
385
|
+
### Claude Desktop
|
|
386
|
+
|
|
387
|
+
Add to your config (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS, `%APPDATA%\Claude\claude_desktop_config.json` on Windows):
|
|
388
|
+
|
|
389
|
+
```json
|
|
390
|
+
{
|
|
391
|
+
"mcpServers": {
|
|
392
|
+
"ocr-provenance": {
|
|
393
|
+
"command": "ocr-provenance-mcp",
|
|
394
|
+
"env": {
|
|
395
|
+
"OCR_PROVENANCE_ENV_FILE": "/absolute/path/to/OCR-Provenance/.env",
|
|
396
|
+
"NODE_OPTIONS": "--max-semi-space-size=64"
|
|
397
|
+
}
|
|
398
|
+
}
|
|
399
|
+
}
|
|
400
|
+
}
|
|
401
|
+
```
|
|
402
|
+
|
|
403
|
+
### Any MCP Client
|
|
404
|
+
|
|
405
|
+
The server uses stdio transport (JSON-RPC over stdin/stdout):
|
|
406
|
+
|
|
407
|
+
```bash
|
|
408
|
+
ocr-provenance-mcp # Global command (after npm link)
|
|
409
|
+
node /path/to/dist/index.js # Direct invocation
|
|
410
|
+
```
|
|
411
|
+
|
|
412
|
+
Environment variables can be provided via `OCR_PROVENANCE_ENV_FILE`, direct env vars, or a `.env` file in the working directory.
|
|
413
|
+
|
|
414
|
+
---
|
|
415
|
+
|
|
416
|
+
## Docker Installation (Recommended)
|
|
417
|
+
|
|
418
|
+
The fastest way to get started. No Node.js, Python, or model downloads needed -- everything is bundled in the Docker image.
|
|
419
|
+
|
|
420
|
+
### Claude Code CLI
|
|
421
|
+
|
|
422
|
+
```bash
|
|
423
|
+
claude mcp add ocr-provenance \
|
|
424
|
+
-e DATALAB_API_KEY=your_key \
|
|
425
|
+
-e GEMINI_API_KEY=your_key \
|
|
426
|
+
-- docker run -i --rm \
|
|
427
|
+
-v "$HOME":/host:ro \
|
|
428
|
+
-v ocr-data:/data \
|
|
429
|
+
ghcr.io/chrisroyse/ocr-provenance:latest
|
|
430
|
+
```
|
|
431
|
+
|
|
432
|
+
### Claude Desktop
|
|
433
|
+
|
|
434
|
+
Add to `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `%APPDATA%\Claude\claude_desktop_config.json` (Windows):
|
|
435
|
+
|
|
436
|
+
```json
|
|
437
|
+
{
|
|
438
|
+
"mcpServers": {
|
|
439
|
+
"ocr-provenance": {
|
|
440
|
+
"command": "docker",
|
|
441
|
+
"args": [
|
|
442
|
+
"run", "-i", "--rm",
|
|
443
|
+
"-e", "DATALAB_API_KEY=your_key_here",
|
|
444
|
+
"-e", "GEMINI_API_KEY=your_key_here",
|
|
445
|
+
"-v", "/Users/yourname:/host:ro",
|
|
446
|
+
"-v", "ocr-data:/data",
|
|
447
|
+
"ghcr.io/chrisroyse/ocr-provenance:latest"
|
|
448
|
+
]
|
|
449
|
+
}
|
|
450
|
+
}
|
|
451
|
+
}
|
|
452
|
+
```
|
|
453
|
+
|
|
454
|
+
### Cursor
|
|
455
|
+
|
|
456
|
+
Add to `~/.cursor/mcp.json` (global) or `.cursor/mcp.json` (project):
|
|
457
|
+
|
|
458
|
+
```json
|
|
459
|
+
{
|
|
460
|
+
"mcpServers": {
|
|
461
|
+
"ocr-provenance": {
|
|
462
|
+
"command": "docker",
|
|
463
|
+
"args": [
|
|
464
|
+
"run", "-i", "--rm",
|
|
465
|
+
"-e", "DATALAB_API_KEY=your_key_here",
|
|
466
|
+
"-e", "GEMINI_API_KEY=your_key_here",
|
|
467
|
+
"-v", "/Users/yourname:/host:ro",
|
|
468
|
+
"-v", "ocr-data:/data",
|
|
469
|
+
"ghcr.io/chrisroyse/ocr-provenance:latest"
|
|
470
|
+
]
|
|
471
|
+
}
|
|
472
|
+
}
|
|
473
|
+
}
|
|
474
|
+
```
|
|
475
|
+
|
|
476
|
+
### VS Code (GitHub Copilot)
|
|
477
|
+
|
|
478
|
+
Add to `.vscode/mcp.json`:
|
|
479
|
+
|
|
480
|
+
```json
|
|
481
|
+
{
|
|
482
|
+
"inputs": [
|
|
483
|
+
{ "id": "datalab-key", "type": "promptString", "description": "Datalab API key", "password": true },
|
|
484
|
+
{ "id": "gemini-key", "type": "promptString", "description": "Gemini API key", "password": true }
|
|
485
|
+
],
|
|
486
|
+
"servers": {
|
|
487
|
+
"ocr-provenance": {
|
|
488
|
+
"type": "stdio",
|
|
489
|
+
"command": "docker",
|
|
490
|
+
"args": ["run", "-i", "--rm", "-v", "ocr-data:/data", "-e", "DATALAB_API_KEY", "-e", "GEMINI_API_KEY", "ghcr.io/chrisroyse/ocr-provenance:latest"],
|
|
491
|
+
"env": { "DATALAB_API_KEY": "${input:datalab-key}", "GEMINI_API_KEY": "${input:gemini-key}" }
|
|
492
|
+
}
|
|
493
|
+
}
|
|
494
|
+
}
|
|
495
|
+
```
|
|
496
|
+
|
|
497
|
+
### Windsurf
|
|
498
|
+
|
|
499
|
+
Add to `~/.codeium/windsurf/mcp_config.json`:
|
|
500
|
+
|
|
501
|
+
```json
|
|
502
|
+
{
|
|
503
|
+
"mcpServers": {
|
|
504
|
+
"ocr-provenance": {
|
|
505
|
+
"command": "docker",
|
|
506
|
+
"args": [
|
|
507
|
+
"run", "-i", "--rm",
|
|
508
|
+
"-e", "DATALAB_API_KEY=your_key_here",
|
|
509
|
+
"-e", "GEMINI_API_KEY=your_key_here",
|
|
510
|
+
"-v", "/Users/yourname:/host:ro",
|
|
511
|
+
"-v", "ocr-data:/data",
|
|
512
|
+
"ghcr.io/chrisroyse/ocr-provenance:latest"
|
|
513
|
+
]
|
|
514
|
+
}
|
|
515
|
+
}
|
|
516
|
+
}
|
|
517
|
+
```
|
|
518
|
+
|
|
519
|
+
### HTTP Mode (Remote Deployment)
|
|
520
|
+
|
|
521
|
+
For remote or shared deployments, use the HTTP transport with Docker Compose:
|
|
522
|
+
|
|
523
|
+
```bash
|
|
524
|
+
# Start in HTTP mode
|
|
525
|
+
docker compose up -d
|
|
526
|
+
|
|
527
|
+
# Or with GPU support
|
|
528
|
+
docker compose -f docker-compose.gpu.yml up -d
|
|
529
|
+
```
|
|
530
|
+
|
|
531
|
+
The server listens on port 3100, with a health endpoint at `GET /health` and the MCP endpoint at `POST /mcp`.
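
A deployment script might poll the health endpoint before sending MCP traffic. The sketch below assumes only the port and path stated above; the response body format is not documented here, so it checks the HTTP status alone. `health_url` and `check_health` are hypothetical helpers.

```python
import urllib.error
import urllib.request


def health_url(host="localhost", port=3100):
    """Build the URL of the health endpoint."""
    return f"http://{host}:{port}/health"


def check_health(host="localhost", port=3100, timeout=5.0):
    """Return True if GET /health answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(health_url(host, port), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status.
        return False


if __name__ == "__main__":
    print("healthy" if check_health() else "unreachable")
```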
|
|
532
|
+
|
|
533
|
+
### Docker Environment Variables
|
|
534
|
+
|
|
535
|
+
| Variable | Default | Description |
|
|
536
|
+
|----------|---------|-------------|
|
|
537
|
+
| `DATALAB_API_KEY` | (required) | Datalab OCR API key |
|
|
538
|
+
| `GEMINI_API_KEY` | (required) | Google Gemini API key |
|
|
539
|
+
| `MCP_TRANSPORT` | `stdio` | Transport mode: `stdio` or `http` |
|
|
540
|
+
| `MCP_HTTP_PORT` | `3100` | HTTP server port |
|
|
541
|
+
| `EMBEDDING_DEVICE` | `cpu` | Embedding device: `cpu`, `cuda`, `mps` |
|
|
542
|
+
| `OCR_PROVENANCE_DATABASES_PATH` | `/data` | Database storage path |
|
|
543
|
+
| `OCR_PROVENANCE_ALLOWED_DIRS` | `/host,/data` | Allowed directories for file access |
|
|
544
|
+
|
|
545
|
+
### Backup and Restore
|
|
546
|
+
|
|
547
|
+
```bash
|
|
548
|
+
# Backup
|
|
549
|
+
docker run --rm -v ocr-data:/data:ro -v "$(pwd)/backup":/backup alpine cp -a /data/. /backup/
|
|
550
|
+
|
|
551
|
+
# Restore
|
|
552
|
+
docker run --rm -v ocr-data:/data -v "$(pwd)/backup":/backup:ro alpine cp -a /backup/. /data/
|
|
553
|
+
```
|
|
554
|
+
|
|
555
|
+
---
|
|
556
|
+
|
|
557
|
+
## Configuration
|
|
558
|
+
|
|
559
|
+
```bash
|
|
560
|
+
# .env file
|
|
561
|
+
DATALAB_API_KEY=your_key
|
|
562
|
+
GEMINI_API_KEY=your_key
|
|
563
|
+
|
|
564
|
+
# OCR settings
|
|
565
|
+
DATALAB_DEFAULT_MODE=accurate # fast | balanced | accurate
|
|
566
|
+
DATALAB_MAX_CONCURRENT=3
|
|
567
|
+
|
|
568
|
+
# Embeddings (auto-detects CUDA > MPS > CPU)
|
|
569
|
+
EMBEDDING_DEVICE=auto
|
|
570
|
+
EMBEDDING_BATCH_SIZE=512
|
|
571
|
+
|
|
572
|
+
# Chunking
|
|
573
|
+
CHUNKING_SIZE=2000
|
|
574
|
+
CHUNKING_OVERLAP_PERCENT=10
|
|
575
|
+
|
|
576
|
+
# Auto-clustering (triggers after processing when enabled)
|
|
577
|
+
AUTO_CLUSTER_ENABLED=false
|
|
578
|
+
AUTO_CLUSTER_THRESHOLD=5 # Minimum documents to trigger
|
|
579
|
+
AUTO_CLUSTER_ALGORITHM=hdbscan
|
|
580
|
+
|
|
581
|
+
# Storage
|
|
582
|
+
STORAGE_DATABASES_PATH=~/.ocr-provenance/databases/
|
|
583
|
+
```
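
To see how `CHUNKING_SIZE` and `CHUNKING_OVERLAP_PERCENT` interact, the arithmetic can be sketched with plain sliding windows. This assumes both settings are measured in characters, which the README does not state, and it ignores the section-aware splitting described earlier. `chunk_windows` is a hypothetical helper.

```python
def chunk_windows(text_length, size=2000, overlap_percent=10):
    """Return (start, end) offsets of fixed-size windows with percentage overlap."""
    if size <= 0 or not 0 <= overlap_percent < 100:
        raise ValueError("size must be positive and overlap_percent in [0, 100)")
    overlap = size * overlap_percent // 100  # 2000 * 10% = 200
    step = size - overlap                    # each window starts 1800 after the last
    windows = []
    start = 0
    while start < text_length:
        end = min(start + size, text_length)
        windows.append((start, end))
        if end == text_length:
            break
        start += step
    return windows
```

With the defaults shown in the `.env` example, consecutive windows share 200 units of text, so a sentence that straddles a boundary appears whole in at least one chunk.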
|
|
584
|
+
|
|
585
|
+
---
|
|
586
|
+
|
|
587
|
+
## Tool Reference (141 Tools)
|
|
588
|
+
|
|
589
|
+
<details>
|
|
590
|
+
<summary><strong>Database Management (5)</strong></summary>
|
|
591
|
+
|
|
592
|
+
| Tool | Description |
|
|
593
|
+
|------|-------------|
|
|
594
|
+
| `ocr_db_create` | Create a new isolated database |
|
|
595
|
+
| `ocr_db_list` | List all databases with optional stats |
|
|
596
|
+
| `ocr_db_select` | Select the active database |
|
|
597
|
+
| `ocr_db_stats` | Detailed statistics (documents, chunks, embeddings, images, clusters) |
|
|
598
|
+
| `ocr_db_delete` | Permanently delete a database |
|
|
599
|
+
|
|
600
|
+
</details>
|
|
601
|
+
|
|
602
|
+
<details>
|
|
603
|
+
<summary><strong>Ingestion & Processing (9)</strong></summary>
|
|
604
|
+
|
|
605
|
+
| Tool | Description |
|
|
606
|
+
|------|-------------|
|
|
607
|
+
| `ocr_ingest_directory` | Scan directory and register documents (18 file types, recursive) |
|
|
608
|
+
| `ocr_ingest_files` | Ingest specific files by path |
|
|
609
|
+
| `ocr_process_pending` | Full pipeline: OCR -> Chunk -> Embed -> Vector -> VLM (with auto-clustering) |
|
|
610
|
+
| `ocr_status` | Check processing status |
|
|
611
|
+
| `ocr_retry_failed` | Reset failed documents for reprocessing |
|
|
612
|
+
| `ocr_reprocess` | Reprocess with different OCR settings |
|
|
613
|
+
| `ocr_chunk_complete` | Repair documents missing chunks/embeddings |
|
|
614
|
+
| `ocr_convert_raw` | One-off OCR conversion without storing |
|
|
615
|
+
| `ocr_reembed_document` | Re-generate embeddings for a document without re-OCRing |
|
|
616
|
+
|
|
617
|
+
**Processing options:** `ocr_mode` (fast/balanced/accurate), `chunking_strategy` (hybrid section-aware), `page_range`, `max_pages`, `extras` (track_changes, chart_understanding, extract_links, table_row_bboxes, infographic, new_block_types)
|
|
618
|
+
|
|
619
|
+
**Version tracking:** Re-ingesting a file with a different hash creates a new version linked via `previous_version_id`.
|
|
620
|
+
|
|
621
|
+
</details>
|
|
622
|
+
|
|
623
|
+
<details>
|
|
624
|
+
<summary><strong>Search & Retrieval (12)</strong></summary>
|
|
625
|
+
|
|
626
|
+
| Tool | Description |
|
|
627
|
+
|------|-------------|
|
|
628
|
+
| `ocr_search` | BM25 full-text search (exact terms, codes, IDs) |
|
|
629
|
+
| `ocr_search_semantic` | Vector similarity search (conceptual queries) |
|
|
630
|
+
| `ocr_search_hybrid` | Reciprocal Rank Fusion of BM25 + semantic (recommended) |
|
|
631
|
+
| `ocr_rag_context` | Assemble hybrid search results into a markdown context block for LLMs |
|
|
632
|
+
| `ocr_search_export` | Export results to CSV or JSON |
|
|
633
|
+
| `ocr_benchmark_compare` | Compare search results across databases |
|
|
634
|
+
| `ocr_fts_manage` | Rebuild or check FTS5 index status |
|
|
635
|
+
| `ocr_search_save` | Save a search by name for later retrieval |
|
|
636
|
+
| `ocr_search_saved_list` | List all saved searches |
|
|
637
|
+
| `ocr_search_saved_get` | Retrieve a saved search and its parameters |
|
|
638
|
+
| `ocr_search_saved_execute` | Re-execute a saved search with optional parameter overrides |
|
|
639
|
+
| `ocr_search_cross_db` | BM25 search across all databases simultaneously |
|
|
640
|
+
|
|
641
|
+
**Enhancement options:** Local cross-encoder reranking (`rerank`), query expansion (`expand_query`), auto-routing (`auto_route`), quality-weighted ranking, chunk-level filters (content type, section path, heading, page range, table columns), metadata filters, cluster filtering, group by document, header/footer exclusion, context chunks, VLM image enrichment, provenance inclusion.
|
|
642
|
+
|
|
643
|
+
</details>
|
|
644
|
+
|
|
645
|
+
<details>
|
|
646
|
+
<summary><strong>Document Management (12)</strong></summary>
|
|
647
|
+
|
|
648
|
+
| Tool | Description |
|
|
649
|
+
|------|-------------|
|
|
650
|
+
| `ocr_document_list` | List documents with status filtering |
|
|
651
|
+
| `ocr_document_get` | Full document details (text, chunks, blocks, provenance) |
|
|
652
|
+
| `ocr_document_delete` | Delete document and all derived data (cascade) |
|
|
653
|
+
| `ocr_document_find_similar` | Find similar documents via embedding centroid similarity |
|
|
654
|
+
| `ocr_document_structure` | Analyze document structure (headings, tables, figures, code blocks) |
|
|
655
|
+
| `ocr_document_sections` | Get section hierarchy tree from chunk section paths |
|
|
656
|
+
| `ocr_document_update_metadata` | Batch update document metadata fields |
|
|
657
|
+
| `ocr_document_duplicates` | Detect exact (hash) and near (similarity) duplicates |
|
|
658
|
+
| `ocr_document_export` | Export document to JSON or markdown |
|
|
659
|
+
| `ocr_corpus_export` | Export entire corpus to JSON or markdown archive |
|
|
660
|
+
| `ocr_document_versions` | List all versions of a document by file path |
|
|
661
|
+
| `ocr_document_workflow` | Manage workflow states (draft/review/approved/published/archived) |
|
|
662
|
+
|
|
663
|
+
</details>

<details>
<summary><strong>Provenance (6)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_provenance_get` | Get the complete provenance chain for any item |
| `ocr_provenance_verify` | Verify integrity through SHA-256 hash chain |
| `ocr_provenance_export` | Export provenance (JSON, W3C PROV-JSON, CSV) |
| `ocr_provenance_query` | Query provenance records with 12+ filters |
| `ocr_provenance_timeline` | View document processing timeline |
| `ocr_provenance_processor_stats` | Aggregate statistics per processor type |

</details>
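
To make "verify integrity through SHA-256 hash chain" concrete, here is a minimal sketch of one way such a chain can be checked. The record layout (`contentHash`, `chainHash`, `parentId`) and the rule "chain hash = SHA-256 of the parent's chain hash plus the content hash" are assumptions made for illustration. The package's actual provenance schema and hashing rule are not described in this section and may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape, for illustration only.
interface ProvenanceRecord {
  id: string;
  parentId: string | null;
  content: string;
  contentHash: string; // sha256(content)
  chainHash: string; // sha256(parent.chainHash + contentHash)
}

const sha256 = (data: string): string =>
  createHash("sha256").update(data, "utf8").digest("hex");

/** Create a record linked to its parent (or a root record when parent is null). */
function makeRecord(id: string, content: string, parent: ProvenanceRecord | null): ProvenanceRecord {
  const contentHash = sha256(content);
  const chainHash = sha256((parent ? parent.chainHash : "") + contentHash);
  return { id, parentId: parent ? parent.id : null, content, contentHash, chainHash };
}

/** Recompute every hash; report the first record whose stored hashes do not match. */
function verifyChain(records: ProvenanceRecord[]): { valid: boolean; brokenAt: string | null } {
  const byId = new Map<string, ProvenanceRecord>();
  for (const r of records) byId.set(r.id, r);

  for (const r of records) {
    if (sha256(r.content) !== r.contentHash) return { valid: false, brokenAt: r.id };
    let parentChainHash = "";
    if (r.parentId !== null) {
      const parent = byId.get(r.parentId);
      if (!parent) return { valid: false, brokenAt: r.id }; // dangling parent reference
      parentChainHash = parent.chainHash;
    }
    if (sha256(parentChainHash + r.contentHash) !== r.chainHash) {
      return { valid: false, brokenAt: r.id };
    }
  }
  return { valid: true, brokenAt: null };
}

// DOCUMENT (depth 0) -> OCR_RESULT (depth 1) -> CHUNK (depth 2)
const doc = makeRecord("doc", "raw file bytes", null);
const ocr = makeRecord("ocr", "extracted markdown", doc);
const chunk = makeRecord("chunk", "first 2000 characters", ocr);

const intact = verifyChain([doc, ocr, chunk]);
const tampered = verifyChain([doc, { ...ocr, content: "edited markdown" }, chunk]);
console.log(intact.valid, tampered.brokenAt);
```

Because each chain hash folds in its parent's chain hash, changing any upstream record invalidates every record derived from it, which is what makes the trail tamper-evident.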

<details>
<summary><strong>Document Comparison (6)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_document_compare` | Text diff + structural metadata diff + similarity ratio |
| `ocr_comparison_list` | List comparisons with optional filtering |
| `ocr_comparison_get` | Full comparison details with diff operations |
| `ocr_comparison_discover` | Auto-discover similar document pairs for comparison |
| `ocr_comparison_batch` | Batch compare multiple document pairs |
| `ocr_comparison_matrix` | NxN pairwise cosine similarity matrix across documents |

</details>
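
A similarity ratio of the Sorensen-Dice kind can be computed in a few lines. The sketch below uses character bigrams, a common formulation of the coefficient. How `ocr_document_compare` derives its own ratio is not specified here, so treat the function name `diceCoefficient` and the bigram choice as assumptions.

```typescript
/** Count character bigrams in a string (multiset semantics). */
function bigramCounts(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i < text.length - 1; i++) {
    const gram = text.slice(i, i + 2);
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

/**
 * Sorensen-Dice coefficient over character bigrams:
 * 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|), in the range [0, 1].
 */
function diceCoefficient(a: string, b: string): number {
  if (a === b) return 1;
  if (a.length < 2 || b.length < 2) return 0; // no bigrams to compare
  const gramsA = bigramCounts(a);
  const gramsB = bigramCounts(b);
  let shared = 0;
  gramsA.forEach((countA, gram) => {
    shared += Math.min(countA, gramsB.get(gram) ?? 0);
  });
  return (2 * shared) / (a.length - 1 + (b.length - 1));
}

// "night" -> ni, ig, gh, ht; "nacht" -> na, ac, ch, ht; one shared bigram out of 4 + 4.
console.log(diceCoefficient("night", "nacht"));
```

A ratio near 1 suggests two revisions of the same text, while a low ratio suggests the documents are unrelated and a line-by-line diff would carry little information.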

<details>
<summary><strong>Document Clustering (7)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_cluster_documents` | Cluster by semantic similarity (HDBSCAN / agglomerative / k-means) |
| `ocr_cluster_list` | List clusters with filtering by run ID or tag |
| `ocr_cluster_get` | Cluster details with member documents |
| `ocr_cluster_assign` | Auto-assign a document to the nearest cluster |
| `ocr_cluster_reassign` | Move a document to a different cluster |
| `ocr_cluster_merge` | Merge two clusters into one |
| `ocr_cluster_delete` | Delete a clustering run |

</details>

<details>
<summary><strong>VLM / Vision Analysis (6)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_vlm_describe` | Describe an image using Gemini 3 Flash (supports thinking mode) |
| `ocr_vlm_classify` | Classify image type, complexity, text density |
| `ocr_vlm_process_document` | VLM-process all images in a document |
| `ocr_vlm_process_pending` | VLM-process all pending images across all documents |
| `ocr_vlm_analyze_pdf` | Analyze a PDF directly with Gemini 3 Flash (max 20MB) |
| `ocr_vlm_status` | Service status (API config, rate limits, circuit breaker) |

VLM descriptions automatically generate searchable embeddings for semantic image search.

</details>

<details>
<summary><strong>Image Operations (11)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_image_extract` | Extract images from a PDF via Datalab OCR |
| `ocr_image_list` | List images extracted from a document |
| `ocr_image_get` | Get image details |
| `ocr_image_stats` | Processing statistics |
| `ocr_image_delete` | Delete an image record |
| `ocr_image_delete_by_document` | Delete all images for a document |
| `ocr_image_reset_failed` | Reset failed images for reprocessing |
| `ocr_image_pending` | List images pending VLM processing |
| `ocr_image_search` | Search images with 7 filters (type, size, status, confidence, etc.) |
| `ocr_image_semantic_search` | Semantic search over VLM image descriptions |
| `ocr_image_reanalyze` | Re-run VLM analysis with a custom prompt |

</details>

<details>
<summary><strong>Image Extraction (3)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_extract_images` | Extract images locally (PyMuPDF for PDF, zipfile for DOCX) |
| `ocr_extract_images_batch` | Batch extract from all processed documents |
| `ocr_extraction_check` | Verify Python environment has required packages |

</details>

<details>
<summary><strong>Chunks & Pages (4)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_chunk_get` | Get a chunk by ID with full metadata |
| `ocr_chunk_list` | List chunks with filtering (content type, section path, page, heading) |
| `ocr_chunk_context` | Get a chunk with N neighboring chunks for context |
| `ocr_document_page` | Get all chunks for a specific page number (page-by-page navigation) |

</details>

<details>
<summary><strong>Embeddings (4)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_embedding_list` | List embeddings with filtering |
| `ocr_embedding_stats` | Embedding statistics (counts, models, coverage) |
| `ocr_embedding_get` | Get embedding details by ID |
| `ocr_embedding_rebuild` | Re-generate embeddings for specific targets |

</details>

<details>
<summary><strong>Structured Extraction (4)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_extract_structured` | Extract structured data from OCR'd documents using a JSON schema |
| `ocr_extraction_list` | List structured extractions for a document |
| `ocr_extraction_get` | Get a structured extraction by ID |
| `ocr_extraction_search` | Search across extraction content |

</details>

<details>
<summary><strong>Form Fill (2)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_form_fill` | Fill PDF/image forms via Datalab with field name-value mapping |
| `ocr_form_fill_status` | Form fill operation status and results |

</details>

<details>
<summary><strong>File Management (6)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_file_upload` | Upload to Datalab cloud (deduplicates by SHA-256) |
| `ocr_file_list` | List uploaded files with duplicate detection |
| `ocr_file_get` | File metadata |
| `ocr_file_download` | Get download URL |
| `ocr_file_delete` | Delete file record |
| `ocr_file_ingest_uploaded` | Bridge uploaded files into the document pipeline |

</details>
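
Deduplicating uploads by SHA-256, as `ocr_file_upload` is described as doing, comes down to hashing the file bytes and looking the digest up before uploading. The sketch below shows the idea with an in-memory map. The `UploadRegistry` class and its return shape are hypothetical, and the real tool presumably tracks hashes in the `uploaded_files` table instead.

```typescript
import { createHash } from "node:crypto";

/** Hex SHA-256 digest of raw file bytes. */
function fileHash(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Hypothetical in-memory stand-in for a persistent hash index.
class UploadRegistry {
  private readonly idByHash = new Map<string, string>();
  private nextId = 1;

  /** Register a file; identical content maps to the existing record instead of a new upload. */
  register(bytes: Uint8Array): { fileId: string; hash: string; deduplicated: boolean } {
    const hash = fileHash(bytes);
    const existing = this.idByHash.get(hash);
    if (existing !== undefined) {
      return { fileId: existing, hash, deduplicated: true };
    }
    const fileId = `file-${this.nextId++}`;
    this.idByHash.set(hash, fileId);
    return { fileId, hash, deduplicated: false };
  }
}

const registry = new UploadRegistry();
const first = registry.register(Buffer.from("invoice 2024-01"));
const copy = registry.register(Buffer.from("invoice 2024-01")); // same bytes, different upload
const other = registry.register(Buffer.from("invoice 2024-02"));
console.log(first.fileId, copy.fileId, copy.deduplicated, other.fileId);
```

Hashing the content rather than the file name means a renamed copy is still recognized as a duplicate, while two different files that share a name are kept apart.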

<details>
<summary><strong>Tags (6)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_tag_create` | Create a tag with optional color and description |
| `ocr_tag_list` | List tags with usage counts |
| `ocr_tag_apply` | Apply a tag to any entity (document, chunk, image, cluster, etc.) |
| `ocr_tag_remove` | Remove a tag from an entity |
| `ocr_tag_search` | Find entities by tag name |
| `ocr_tag_delete` | Delete a tag and all associations |

</details>

<details>
<summary><strong>Intelligence & Navigation (4)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_guide` | AI agent navigation -- inspects system state and recommends next tools/actions |
| `ocr_document_tables` | Extract and parse tables from OCR JSON blocks |
| `ocr_document_recommend` | Get related document recommendations via embedding similarity |
| `ocr_document_extras` | Access OCR extras data (charts, links, tracked changes, infographics) |

</details>

<details>
<summary><strong>Evaluation (3)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_evaluate_single` | Evaluate a single image with VLM |
| `ocr_evaluate_document` | Evaluate all images in a document |
| `ocr_evaluate_pending` | Evaluate all pending images system-wide |

</details>

<details>
<summary><strong>Reports & Analytics (9)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_evaluation_report` | Comprehensive OCR + VLM metrics report (markdown) |
| `ocr_document_report` | Single document report (images, extractions, comparisons, clusters) |
| `ocr_quality_summary` | Quality summary across all documents |
| `ocr_cost_summary` | Cost analytics by document, mode, month, or total |
| `ocr_pipeline_analytics` | Pipeline throughput, duration, per-mode/type breakdown |
| `ocr_corpus_profile` | Corpus content profile (doc sizes, content types, section frequency) |
| `ocr_error_analytics` | Error/recovery analytics and failure rates |
| `ocr_provenance_bottlenecks` | Processing bottleneck analysis by processor |
| `ocr_quality_trends` | Quality trends over time (hourly/daily/weekly/monthly) |

</details>

<details>
<summary><strong>Timeline & Analytics (2)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_timeline_analytics` | Volume metrics over time |
| `ocr_throughput_analytics` | Processing throughput per time bucket |

</details>

<details>
<summary><strong>Health & Diagnostics (1)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_health_check` | Detect data integrity gaps (missing embeddings, orphaned chunks, etc.) with optional auto-fix |

</details>

<details>
<summary><strong>Configuration (2)</strong></summary>

| Tool | Description |
|------|-------------|
| `ocr_config_get` | Get current system configuration |
| `ocr_config_set` | Update configuration at runtime |

</details>

---

## Processing Pipeline

```
File on disk
    │
    ├─ 1. REGISTER ──► documents table (status: pending)
    │      ├─ file_hash computed (SHA-256)
    │      ├─ version detection (new vs re-ingested)
    │      └─ provenance record (type: DOCUMENT, depth: 0)
    │
    ├─ 2. OCR ──────► ocr_results table
    │      ├─ Datalab API call (fast/balanced/accurate)
    │      ├─ extracted_text (markdown)
    │      ├─ json_blocks (structural hierarchy)
    │      ├─ extras_json (charts, links, track changes)
    │      ├─ page_offsets (page boundaries)
    │      └─ provenance record (type: OCR_RESULT, depth: 1)
    │
    ├─ 3. CHUNK ────► chunks table
    │      ├─ Hybrid section-aware chunking
    │      │    ├─ Text + heading normalization
    │      │    ├─ Markdown structure parsing
    │      │    ├─ Atomic region detection (tables, figures)
    │      │    ├─ Heading-only chunk merging
    │      │    ├─ Near-duplicate deduplication
    │      │    └─ Header/footer auto-tagging
    │      ├─ 2000 chars with 10% overlap
    │      ├─ section_path, heading_context, content_types
    │      ├─ page_number assignment via page separators
    │      └─ provenance records (type: CHUNK, depth: 2)
    │
    ├─ 4. EMBED ────► embeddings + vec_embeddings tables
    │      ├─ Nomic embed v1.5 (768-dim, local GPU)
    │      ├─ "search_document: " prefix
    │      └─ provenance records (type: EMBEDDING, depth: 3)
    │
    ├─ 5. FTS ──────► fts_index (FTS5 virtual table)
    │      └─ External content index on chunk text
    │
    ├─ 6. IMAGES ───► images table
    │      │    ├─ PyMuPDF extraction (PDF) / zip extraction (DOCX)
    │      │    ├─ Image optimization (resize, format)
    │      │    └─ provenance records (type: IMAGE, depth: 2)
    │      │
    │      └─ 7. VLM ──► images updated + embeddings table
    │             ├─ Gemini 3 Flash multimodal analysis
    │             ├─ Description, structured data, confidence
    │             ├─ VLM description embedding generated (searchable)
    │             └─ provenance records (type: VLM_DESCRIPTION, depth: 3→4)
    │
    ├─ 8. AUTO-CLUSTER ──► clusters table (when configured)
    │      └─ Triggers when threshold met and >1hr since last run
    │
    └─ documents.status = 'complete'
```
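
The "2000 chars with 10% overlap" step in the CHUNK stage can be illustrated with a plain sliding window. This sketch deliberately ignores everything that makes the real chunker section-aware (heading normalization, atomic tables and figures, merging, deduplication). The function name `chunkWithOverlap` and the `TextChunk` shape are hypothetical. With the defaults, each window starts 1800 characters after the previous one, so consecutive chunks share 200 characters and a sentence cut at one boundary still appears whole in a neighboring chunk.

```typescript
interface TextChunk {
  index: number;
  start: number; // inclusive character offset
  end: number; // exclusive character offset
  text: string;
}

/** Fixed-size sliding window: each chunk repeats the tail of the previous one. */
function chunkWithOverlap(text: string, chunkSize = 2000, overlapRatio = 0.1): TextChunk[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new Error("chunkSize must be a positive integer");
  }
  if (overlapRatio < 0 || overlapRatio >= 1) {
    throw new Error("overlapRatio must be in [0, 1)");
  }
  const overlap = Math.floor(chunkSize * overlapRatio); // 200 for the defaults
  const step = chunkSize - overlap; // 1800 for the defaults, always >= 1
  const chunks: TextChunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({ index: chunks.length, start, end, text: text.slice(start, end) });
    if (end === text.length) break; // last window reached the end of the text
  }
  return chunks;
}

// A 5000-character document yields windows [0, 2000), [1800, 3800), [3600, 5000).
const sample = "x".repeat(5000);
const windows = chunkWithOverlap(sample);
console.log(windows.map((c) => `${c.start}-${c.end}`).join(" "));
```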

---

## Data Architecture (Schema v31)

18 core tables + FTS5 virtual tables + vec_embeddings:

| Table | Purpose | Key Fields |
|-------|---------|------------|
| `documents` | Source files | file_hash, status, page_count, metadata |
| `ocr_results` | Extracted text | extracted_text, json_blocks, quality_score, cost |
| `chunks` | Text segments | text (2000 chars), section_path, heading_context, content_types |
| `embeddings` | 768-dim vectors | original_text, model_name, source metadata |
| `images` | Extracted images | extracted_path, bbox, VLM description, confidence |
| `extractions` | Structured data | schema_json, extraction_json |
| `form_fills` | Form filling results | field mapping, output path |
| `comparisons` | Document pair diffs | similarity_ratio, diff_operations |
| `clusters` | Document groupings | label, classification_tag, coherence_score |
| `document_clusters` | Cluster membership | document_id, cluster_id |
| `provenance` | Full audit trail | type, processor, chain_depth, content_hash |
| `tags` | Cross-entity labels | name, color, description |
| `entity_tags` | Tag associations | tag_id, entity_type, entity_id |
| `saved_searches` | Search persistence | name, search_type, parameters |
| `uploaded_files` | Cloud file tracking | datalab_id, file_hash, upload status |
| `database_metadata` | DB-level settings | key-value pairs |
| `schema_version` | Migration tracking | version, applied_at |
| `fts_index_metadata` | FTS index state | last_rebuild, chunk count |

---

## AI/ML Capabilities

| Capability | Technology | Tool(s) |
|-----------|-----------|---------|
| Document OCR | Datalab API (3 modes) | `ocr_process_pending`, `ocr_convert_raw` |
| Text Embeddings | Nomic embed v1.5 (local GPU) | Auto during ingestion, `ocr_reembed_document` |
| Image Description | Gemini 3 Flash | `ocr_vlm_describe`, `ocr_vlm_process_*` |
| Image Classification | Gemini 3 Flash | `ocr_vlm_classify` |
| Search Reranking | Python cross-encoder | `rerank` parameter on all search tools (local, no API) |
| Query Expansion | Heuristic synonyms | `expand_query` parameter |
| Query Classification | Heuristic patterns | `auto_route` parameter (hybrid search) |
| Document Clustering | scikit-learn | `ocr_cluster_documents` (HDBSCAN/agglomerative/k-means) |
| Auto-Clustering | scikit-learn | Configurable auto-trigger after `ocr_process_pending` |
| Similarity Detection | Embedding centroids | `ocr_document_find_similar`, `ocr_document_recommend` |
| Duplicate Detection | File hash + embedding similarity | `ocr_document_duplicates` |
| Comparison Discovery | Embedding similarity | `ocr_comparison_discover` |
| Comparison Matrix | Pairwise cosine similarity | `ocr_comparison_matrix` |
| Text Comparison | npm diff (Sorensen-Dice) | `ocr_document_compare` |
| RAG Context Assembly | Hybrid search + markdown | `ocr_rag_context` |
| Semantic Image Search | VLM description embeddings | `ocr_image_semantic_search` |
| PDF Direct Analysis | Gemini 3 Flash multimodal | `ocr_vlm_analyze_pdf` |
| Table Extraction | OCR JSON block parsing | `ocr_document_tables` |
| Cross-DB Search | BM25 across all databases | `ocr_search_cross_db` |
| Chunk Deduplication | Fuzzy text matching | Automatic during chunking pipeline |
| AI Agent Navigation | System state analysis | `ocr_guide` |
| Health Diagnostics | Data integrity analysis | `ocr_health_check` |

---

## Development

```bash
npm run build             # Build TypeScript
npm test                  # All tests (2,639 across 115 test suites)
npm run test:unit         # Unit tests only
npm run test:integration  # Integration tests only
npm run lint:all          # TypeScript + Python linting
npm run check             # typecheck + lint + test
```

### Project Structure

```
src/
  index.ts                    # MCP server entry point (tool registration, lifecycle)
  bin.ts                      # CLI entry point
  tools/                      # 22 tool files + shared.ts
    database.ts               # Database CRUD (5 tools)
    ingestion.ts              # Ingest + process pipeline (9 tools)
    search.ts                 # BM25, semantic, hybrid, RAG, cross-DB (12 tools)
    documents.ts              # Document ops, versions, workflow (12 tools)
    provenance.ts             # Audit trail, verification (6 tools)
    comparison.ts             # Diff, batch compare, matrix (6 tools)
    clustering.ts             # Cluster, reassign, merge (7 tools)
    vlm.ts                    # Gemini vision analysis (6 tools)
    images.ts                 # Image ops, semantic search (11 tools)
    reports.ts                # Analytics + quality reports (9 tools)
    tags.ts                   # Cross-entity tagging (6 tools)
    intelligence.ts           # AI guide, tables, recommendations, extras (4 tools)
    embeddings.ts             # Embedding management (4 tools)
    extraction-structured.ts  # JSON schema extraction (4 tools)
    extraction.ts             # Local image extraction (3 tools)
    file-management.ts        # Cloud file ops (6 tools)
    chunks.ts                 # Chunk inspection + page navigation (4 tools)
    timeline.ts               # Time-series analytics (2 tools)
    form-fill.ts              # PDF form filling (2 tools)
    evaluation.ts             # VLM evaluation (3 tools)
    config.ts                 # Runtime config (2 tools)
    health.ts                 # Data integrity check (1 tool)
    shared.ts                 # Shared utilities (formatResponse, handleError, etc.)
  services/                   # Core services (11 domains, 64 files)
    chunking/                 # Hybrid section-aware chunking pipeline
      chunker.ts              # Main chunking orchestrator
      markdown-parser.ts
      heading-normalizer.ts
      text-normalizer.ts
      chunk-merger.ts
      chunk-deduplicator.ts
      json-block-analyzer.ts
    search/                   # BM25, semantic, hybrid, fusion, reranker (AI + local), query expansion/classification, quality weighting
    gemini/                   # Gemini client with caching, circuit breaker, rate limiting
    storage/                  # SQLite database + migrations (19 operation files)
    ...                       # OCR, embedding, VLM, provenance, comparison, clustering, images
  models/                     # Zod schemas and TypeScript types
  utils/                      # Hash, validation, path sanitization
  server/                     # Server state, types, errors (14 custom error classes)
python/                       # 9 Python workers + GPU utils
tests/
  unit/                       # Unit tests
  integration/                # Integration tests
  e2e/                        # End-to-end pipeline tests
  manual/                     # Verification tests
  benchmark/                  # Chunking benchmark
  fixtures/                   # Test fixtures and sample documents
docs/                         # System documentation and reports
```

### Key Metrics

| Metric | Value |
|--------|-------|
| MCP tools | 141 |
| Tool modules | 22 |
| Database tables | 18 core + FTS + vec |
| Schema version | v32 (32 migrations) |
| Database operation files | 19 |
| Service domains | 11 |
| Test suites | 115 |
| Tests passing | 2,639 |
| TypeScript source | ~46,000 lines |
| Python source | ~4,700 lines |
| Test code | ~65,000 lines |
| Production deps | 9 packages |
| Python workers | 9 |
| External APIs | 3 (Datalab, Gemini, Nomic local) |
| Custom error classes | 14 |
| File types supported | 18 |

---

## Troubleshooting

<details>
<summary><strong>sqlite-vec loading errors</strong></summary>

Run `npm install` -- sqlite-vec uses a prebuilt binary that must match your platform and Node.js version.
</details>

<details>
<summary><strong>Python not found (Windows)</strong></summary>

The server auto-detects `python` vs `python3`. Ensure Python is on your PATH: `python --version`.
</details>

<details>
<summary><strong>GPU not detected</strong></summary>

```bash
python -c "import torch; print('CUDA:', torch.cuda.is_available()); print('MPS:', hasattr(torch.backends, 'mps') and torch.backends.mps.is_available())"
```
If both print `False` and you have an NVIDIA GPU, install the CUDA build of PyTorch: `pip install torch --index-url https://download.pytorch.org/whl/cu124`
</details>

<details>
<summary><strong>Embedding model not found</strong></summary>

Download the model (see [Installation](#installation)). Verify that `config.json`, `model.safetensors`, and `tokenizer.json` are present in the model directory.
</details>

<details>
<summary><strong>API key warnings at startup</strong></summary>

Copy `.env.example` to `.env` and fill in your `DATALAB_API_KEY` and `GEMINI_API_KEY`.
</details>

<details>
<summary><strong>Data integrity issues</strong></summary>

Run `ocr_health_check { fix: true }` to detect and auto-fix common issues such as chunks missing embeddings or orphaned records.
</details>
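
For readers scripting against the server rather than going through an AI client, the `ocr_health_check { fix: true }` call above corresponds to an MCP `tools/call` request over JSON-RPC 2.0. The sketch below only builds and prints that message. The helper name `buildToolCall` is hypothetical, transport and session setup are omitted, and most MCP client libraries construct this envelope for you.

```typescript
// Shape of an MCP tool invocation (JSON-RPC 2.0 request).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: string;
    arguments: Record<string, unknown>;
  };
}

/** Build the request message for a tool call (hypothetical helper, no I/O). */
function buildToolCall(id: number, name: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// `fix: true` is the argument shown in the troubleshooting entry above.
const healthCheckRequest = buildToolCall(1, "ocr_health_check", { fix: true });
console.log(JSON.stringify(healthCheckRequest, null, 2));
```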

---

## License

This project uses a **dual-license** model:

- **Free for non-commercial use** -- personal projects, academic research, education, non-profits, evaluation, and contributions to this project are all permitted at no cost.
- **Commercial license required for revenue-generating use** -- if you use this software to make money (paid services, SaaS, internal tools at for-profit companies, etc.), you must obtain a commercial license from the copyright holder. Terms are negotiated case-by-case and may include revenue sharing or flat-rate arrangements.

See [LICENSE](LICENSE) for full details. For commercial licensing inquiries, contact Chris Royse at [chrisroyseai@gmail.com](mailto:chrisroyseai@gmail.com) or via [GitHub](https://github.com/ChrisRoyse).