ocr-provenance-mcp 1.0.0

Potentially problematic release.


This version of ocr-provenance-mcp might be problematic.

Files changed (578)
  1. package/.env.example +55 -0
  2. package/LICENSE +78 -0
  3. package/README.md +1154 -0
  4. package/dist/bin-http.d.ts +24 -0
  5. package/dist/bin-http.d.ts.map +1 -0
  6. package/dist/bin-http.js +275 -0
  7. package/dist/bin-http.js.map +1 -0
  8. package/dist/bin-setup.d.ts +11 -0
  9. package/dist/bin-setup.d.ts.map +1 -0
  10. package/dist/bin-setup.js +610 -0
  11. package/dist/bin-setup.js.map +1 -0
  12. package/dist/bin.d.ts +16 -0
  13. package/dist/bin.d.ts.map +1 -0
  14. package/dist/bin.js +16 -0
  15. package/dist/bin.js.map +1 -0
  16. package/dist/index.d.ts +13 -0
  17. package/dist/index.d.ts.map +1 -0
  18. package/dist/index.js +90 -0
  19. package/dist/index.js.map +1 -0
  20. package/dist/models/chunk.d.ts +136 -0
  21. package/dist/models/chunk.d.ts.map +1 -0
  22. package/dist/models/chunk.js +27 -0
  23. package/dist/models/chunk.js.map +1 -0
  24. package/dist/models/cluster.d.ts +79 -0
  25. package/dist/models/cluster.d.ts.map +1 -0
  26. package/dist/models/cluster.js +10 -0
  27. package/dist/models/cluster.js.map +1 -0
  28. package/dist/models/comparison.d.ts +62 -0
  29. package/dist/models/comparison.d.ts.map +1 -0
  30. package/dist/models/comparison.js +8 -0
  31. package/dist/models/comparison.js.map +1 -0
  32. package/dist/models/document.d.ts +104 -0
  33. package/dist/models/document.d.ts.map +1 -0
  34. package/dist/models/document.js +15 -0
  35. package/dist/models/document.js.map +1 -0
  36. package/dist/models/embedding.d.ts +87 -0
  37. package/dist/models/embedding.d.ts.map +1 -0
  38. package/dist/models/embedding.js +23 -0
  39. package/dist/models/embedding.js.map +1 -0
  40. package/dist/models/extraction.d.ts +15 -0
  41. package/dist/models/extraction.d.ts.map +1 -0
  42. package/dist/models/extraction.js +2 -0
  43. package/dist/models/extraction.js.map +1 -0
  44. package/dist/models/form-fill.d.ts +23 -0
  45. package/dist/models/form-fill.d.ts.map +1 -0
  46. package/dist/models/form-fill.js +2 -0
  47. package/dist/models/form-fill.js.map +1 -0
  48. package/dist/models/image.d.ts +177 -0
  49. package/dist/models/image.d.ts.map +1 -0
  50. package/dist/models/image.js +8 -0
  51. package/dist/models/image.js.map +1 -0
  52. package/dist/models/index.d.ts +14 -0
  53. package/dist/models/index.d.ts.map +1 -0
  54. package/dist/models/index.js +22 -0
  55. package/dist/models/index.js.map +1 -0
  56. package/dist/models/provenance.d.ts +174 -0
  57. package/dist/models/provenance.d.ts.map +1 -0
  58. package/dist/models/provenance.js +53 -0
  59. package/dist/models/provenance.js.map +1 -0
  60. package/dist/models/uploaded-file.d.ts +20 -0
  61. package/dist/models/uploaded-file.d.ts.map +1 -0
  62. package/dist/models/uploaded-file.js +2 -0
  63. package/dist/models/uploaded-file.js.map +1 -0
  64. package/dist/server/errors.d.ts +93 -0
  65. package/dist/server/errors.d.ts.map +1 -0
  66. package/dist/server/errors.js +256 -0
  67. package/dist/server/errors.js.map +1 -0
  68. package/dist/server/events.d.ts +36 -0
  69. package/dist/server/events.d.ts.map +1 -0
  70. package/dist/server/events.js +48 -0
  71. package/dist/server/events.js.map +1 -0
  72. package/dist/server/permissions.d.ts +26 -0
  73. package/dist/server/permissions.d.ts.map +1 -0
  74. package/dist/server/permissions.js +194 -0
  75. package/dist/server/permissions.js.map +1 -0
  76. package/dist/server/register-tools.d.ts +25 -0
  77. package/dist/server/register-tools.d.ts.map +1 -0
  78. package/dist/server/register-tools.js +102 -0
  79. package/dist/server/register-tools.js.map +1 -0
  80. package/dist/server/startup.d.ts +16 -0
  81. package/dist/server/startup.d.ts.map +1 -0
  82. package/dist/server/startup.js +37 -0
  83. package/dist/server/startup.js.map +1 -0
  84. package/dist/server/state.d.ts +166 -0
  85. package/dist/server/state.d.ts.map +1 -0
  86. package/dist/server/state.js +424 -0
  87. package/dist/server/state.js.map +1 -0
  88. package/dist/server/transports/http-transport.d.ts +37 -0
  89. package/dist/server/transports/http-transport.d.ts.map +1 -0
  90. package/dist/server/transports/http-transport.js +204 -0
  91. package/dist/server/transports/http-transport.js.map +1 -0
  92. package/dist/server/transports/index.d.ts +9 -0
  93. package/dist/server/transports/index.d.ts.map +1 -0
  94. package/dist/server/transports/index.js +9 -0
  95. package/dist/server/transports/index.js.map +1 -0
  96. package/dist/server/transports/session-manager.d.ts +40 -0
  97. package/dist/server/transports/session-manager.d.ts.map +1 -0
  98. package/dist/server/transports/session-manager.js +74 -0
  99. package/dist/server/transports/session-manager.js.map +1 -0
  100. package/dist/server/types.d.ts +82 -0
  101. package/dist/server/types.d.ts.map +1 -0
  102. package/dist/server/types.js +14 -0
  103. package/dist/server/types.js.map +1 -0
  104. package/dist/services/audit.d.ts +26 -0
  105. package/dist/services/audit.d.ts.map +1 -0
  106. package/dist/services/audit.js +43 -0
  107. package/dist/services/audit.js.map +1 -0
  108. package/dist/services/chunking/chunk-deduplicator.d.ts +33 -0
  109. package/dist/services/chunking/chunk-deduplicator.d.ts.map +1 -0
  110. package/dist/services/chunking/chunk-deduplicator.js +46 -0
  111. package/dist/services/chunking/chunk-deduplicator.js.map +1 -0
  112. package/dist/services/chunking/chunk-merger.d.ts +26 -0
  113. package/dist/services/chunking/chunk-merger.d.ts.map +1 -0
  114. package/dist/services/chunking/chunk-merger.js +94 -0
  115. package/dist/services/chunking/chunk-merger.js.map +1 -0
  116. package/dist/services/chunking/chunker.d.ts +62 -0
  117. package/dist/services/chunking/chunker.d.ts.map +1 -0
  118. package/dist/services/chunking/chunker.js +566 -0
  119. package/dist/services/chunking/chunker.js.map +1 -0
  120. package/dist/services/chunking/heading-normalizer.d.ts +33 -0
  121. package/dist/services/chunking/heading-normalizer.d.ts.map +1 -0
  122. package/dist/services/chunking/heading-normalizer.js +101 -0
  123. package/dist/services/chunking/heading-normalizer.js.map +1 -0
  124. package/dist/services/chunking/json-block-analyzer.d.ts +163 -0
  125. package/dist/services/chunking/json-block-analyzer.d.ts.map +1 -0
  126. package/dist/services/chunking/json-block-analyzer.js +1033 -0
  127. package/dist/services/chunking/json-block-analyzer.js.map +1 -0
  128. package/dist/services/chunking/markdown-parser.d.ts +75 -0
  129. package/dist/services/chunking/markdown-parser.d.ts.map +1 -0
  130. package/dist/services/chunking/markdown-parser.js +428 -0
  131. package/dist/services/chunking/markdown-parser.js.map +1 -0
  132. package/dist/services/chunking/text-normalizer.d.ts +20 -0
  133. package/dist/services/chunking/text-normalizer.d.ts.map +1 -0
  134. package/dist/services/chunking/text-normalizer.js +36 -0
  135. package/dist/services/chunking/text-normalizer.js.map +1 -0
  136. package/dist/services/clm/contract-schemas.d.ts +36 -0
  137. package/dist/services/clm/contract-schemas.d.ts.map +1 -0
  138. package/dist/services/clm/contract-schemas.js +92 -0
  139. package/dist/services/clm/contract-schemas.js.map +1 -0
  140. package/dist/services/clm/summarization.d.ts +46 -0
  141. package/dist/services/clm/summarization.d.ts.map +1 -0
  142. package/dist/services/clm/summarization.js +61 -0
  143. package/dist/services/clm/summarization.js.map +1 -0
  144. package/dist/services/clustering/clustering-service.d.ts +58 -0
  145. package/dist/services/clustering/clustering-service.d.ts.map +1 -0
  146. package/dist/services/clustering/clustering-service.js +467 -0
  147. package/dist/services/clustering/clustering-service.js.map +1 -0
  148. package/dist/services/comparison/diff-service.d.ts +41 -0
  149. package/dist/services/comparison/diff-service.d.ts.map +1 -0
  150. package/dist/services/comparison/diff-service.js +120 -0
  151. package/dist/services/comparison/diff-service.js.map +1 -0
  152. package/dist/services/embedding/embedder.d.ts +55 -0
  153. package/dist/services/embedding/embedder.d.ts.map +1 -0
  154. package/dist/services/embedding/embedder.js +202 -0
  155. package/dist/services/embedding/embedder.js.map +1 -0
  156. package/dist/services/embedding/nomic.d.ts +67 -0
  157. package/dist/services/embedding/nomic.d.ts.map +1 -0
  158. package/dist/services/embedding/nomic.js +280 -0
  159. package/dist/services/embedding/nomic.js.map +1 -0
  160. package/dist/services/gemini/circuit-breaker.d.ts +106 -0
  161. package/dist/services/gemini/circuit-breaker.d.ts.map +1 -0
  162. package/dist/services/gemini/circuit-breaker.js +237 -0
  163. package/dist/services/gemini/circuit-breaker.js.map +1 -0
  164. package/dist/services/gemini/client.d.ts +173 -0
  165. package/dist/services/gemini/client.d.ts.map +1 -0
  166. package/dist/services/gemini/client.js +483 -0
  167. package/dist/services/gemini/client.js.map +1 -0
  168. package/dist/services/gemini/config.d.ts +116 -0
  169. package/dist/services/gemini/config.d.ts.map +1 -0
  170. package/dist/services/gemini/config.js +118 -0
  171. package/dist/services/gemini/config.js.map +1 -0
  172. package/dist/services/gemini/index.d.ts +9 -0
  173. package/dist/services/gemini/index.d.ts.map +1 -0
  174. package/dist/services/gemini/index.js +13 -0
  175. package/dist/services/gemini/index.js.map +1 -0
  176. package/dist/services/gemini/rate-limiter.d.ts +62 -0
  177. package/dist/services/gemini/rate-limiter.d.ts.map +1 -0
  178. package/dist/services/gemini/rate-limiter.js +120 -0
  179. package/dist/services/gemini/rate-limiter.js.map +1 -0
  180. package/dist/services/images/extractor.d.ts +88 -0
  181. package/dist/services/images/extractor.d.ts.map +1 -0
  182. package/dist/services/images/extractor.js +340 -0
  183. package/dist/services/images/extractor.js.map +1 -0
  184. package/dist/services/images/optimizer.d.ts +130 -0
  185. package/dist/services/images/optimizer.d.ts.map +1 -0
  186. package/dist/services/images/optimizer.js +228 -0
  187. package/dist/services/images/optimizer.js.map +1 -0
  188. package/dist/services/ocr/datalab.d.ts +64 -0
  189. package/dist/services/ocr/datalab.d.ts.map +1 -0
  190. package/dist/services/ocr/datalab.js +425 -0
  191. package/dist/services/ocr/datalab.js.map +1 -0
  192. package/dist/services/ocr/errors.d.ts +38 -0
  193. package/dist/services/ocr/errors.d.ts.map +1 -0
  194. package/dist/services/ocr/errors.js +83 -0
  195. package/dist/services/ocr/errors.js.map +1 -0
  196. package/dist/services/ocr/file-manager.d.ts +76 -0
  197. package/dist/services/ocr/file-manager.d.ts.map +1 -0
  198. package/dist/services/ocr/file-manager.js +238 -0
  199. package/dist/services/ocr/file-manager.js.map +1 -0
  200. package/dist/services/ocr/form-fill.d.ts +48 -0
  201. package/dist/services/ocr/form-fill.d.ts.map +1 -0
  202. package/dist/services/ocr/form-fill.js +213 -0
  203. package/dist/services/ocr/form-fill.js.map +1 -0
  204. package/dist/services/ocr/processor.d.ts +95 -0
  205. package/dist/services/ocr/processor.d.ts.map +1 -0
  206. package/dist/services/ocr/processor.js +259 -0
  207. package/dist/services/ocr/processor.js.map +1 -0
  208. package/dist/services/provenance/agent-metadata.d.ts +82 -0
  209. package/dist/services/provenance/agent-metadata.d.ts.map +1 -0
  210. package/dist/services/provenance/agent-metadata.js +106 -0
  211. package/dist/services/provenance/agent-metadata.js.map +1 -0
  212. package/dist/services/provenance/chain-hash.d.ts +57 -0
  213. package/dist/services/provenance/chain-hash.d.ts.map +1 -0
  214. package/dist/services/provenance/chain-hash.js +131 -0
  215. package/dist/services/provenance/chain-hash.js.map +1 -0
  216. package/dist/services/provenance/exporter.d.ts +202 -0
  217. package/dist/services/provenance/exporter.d.ts.map +1 -0
  218. package/dist/services/provenance/exporter.js +457 -0
  219. package/dist/services/provenance/exporter.js.map +1 -0
  220. package/dist/services/provenance/index.d.ts +15 -0
  221. package/dist/services/provenance/index.d.ts.map +1 -0
  222. package/dist/services/provenance/index.js +17 -0
  223. package/dist/services/provenance/index.js.map +1 -0
  224. package/dist/services/provenance/tracker.d.ts +138 -0
  225. package/dist/services/provenance/tracker.d.ts.map +1 -0
  226. package/dist/services/provenance/tracker.js +293 -0
  227. package/dist/services/provenance/tracker.js.map +1 -0
  228. package/dist/services/provenance/verifier.d.ts +153 -0
  229. package/dist/services/provenance/verifier.d.ts.map +1 -0
  230. package/dist/services/provenance/verifier.js +536 -0
  231. package/dist/services/provenance/verifier.js.map +1 -0
  232. package/dist/services/python-pool.d.ts +70 -0
  233. package/dist/services/python-pool.d.ts.map +1 -0
  234. package/dist/services/python-pool.js +265 -0
  235. package/dist/services/python-pool.js.map +1 -0
  236. package/dist/services/search/bm25.d.ts +180 -0
  237. package/dist/services/search/bm25.d.ts.map +1 -0
  238. package/dist/services/search/bm25.js +656 -0
  239. package/dist/services/search/bm25.js.map +1 -0
  240. package/dist/services/search/fusion.d.ts +103 -0
  241. package/dist/services/search/fusion.d.ts.map +1 -0
  242. package/dist/services/search/fusion.js +122 -0
  243. package/dist/services/search/fusion.js.map +1 -0
  244. package/dist/services/search/local-reranker.d.ts +30 -0
  245. package/dist/services/search/local-reranker.d.ts.map +1 -0
  246. package/dist/services/search/local-reranker.js +123 -0
  247. package/dist/services/search/local-reranker.js.map +1 -0
  248. package/dist/services/search/quality.d.ts +11 -0
  249. package/dist/services/search/quality.d.ts.map +1 -0
  250. package/dist/services/search/quality.js +17 -0
  251. package/dist/services/search/quality.js.map +1 -0
  252. package/dist/services/search/query-classifier.d.ts +34 -0
  253. package/dist/services/search/query-classifier.d.ts.map +1 -0
  254. package/dist/services/search/query-classifier.js +114 -0
  255. package/dist/services/search/query-classifier.js.map +1 -0
  256. package/dist/services/search/query-expander.d.ts +73 -0
  257. package/dist/services/search/query-expander.d.ts.map +1 -0
  258. package/dist/services/search/query-expander.js +281 -0
  259. package/dist/services/search/query-expander.js.map +1 -0
  260. package/dist/services/search/reranker.d.ts +44 -0
  261. package/dist/services/search/reranker.d.ts.map +1 -0
  262. package/dist/services/search/reranker.js +101 -0
  263. package/dist/services/search/reranker.js.map +1 -0
  264. package/dist/services/storage/database/annotation-operations.d.ts +113 -0
  265. package/dist/services/storage/database/annotation-operations.d.ts.map +1 -0
  266. package/dist/services/storage/database/annotation-operations.js +177 -0
  267. package/dist/services/storage/database/annotation-operations.js.map +1 -0
  268. package/dist/services/storage/database/approval-operations.d.ts +132 -0
  269. package/dist/services/storage/database/approval-operations.d.ts.map +1 -0
  270. package/dist/services/storage/database/approval-operations.js +206 -0
  271. package/dist/services/storage/database/approval-operations.js.map +1 -0
  272. package/dist/services/storage/database/chunk-operations.d.ts +132 -0
  273. package/dist/services/storage/database/chunk-operations.d.ts.map +1 -0
  274. package/dist/services/storage/database/chunk-operations.js +306 -0
  275. package/dist/services/storage/database/chunk-operations.js.map +1 -0
  276. package/dist/services/storage/database/cluster-operations.d.ts +97 -0
  277. package/dist/services/storage/database/cluster-operations.d.ts.map +1 -0
  278. package/dist/services/storage/database/cluster-operations.js +258 -0
  279. package/dist/services/storage/database/cluster-operations.js.map +1 -0
  280. package/dist/services/storage/database/comparison-operations.d.ts +41 -0
  281. package/dist/services/storage/database/comparison-operations.d.ts.map +1 -0
  282. package/dist/services/storage/database/comparison-operations.js +65 -0
  283. package/dist/services/storage/database/comparison-operations.js.map +1 -0
  284. package/dist/services/storage/database/converters.d.ts +36 -0
  285. package/dist/services/storage/database/converters.d.ts.map +1 -0
  286. package/dist/services/storage/database/converters.js +244 -0
  287. package/dist/services/storage/database/converters.js.map +1 -0
  288. package/dist/services/storage/database/document-operations.d.ts +145 -0
  289. package/dist/services/storage/database/document-operations.d.ts.map +1 -0
  290. package/dist/services/storage/database/document-operations.js +498 -0
  291. package/dist/services/storage/database/document-operations.js.map +1 -0
  292. package/dist/services/storage/database/embedding-operations.d.ts +130 -0
  293. package/dist/services/storage/database/embedding-operations.d.ts.map +1 -0
  294. package/dist/services/storage/database/embedding-operations.js +315 -0
  295. package/dist/services/storage/database/embedding-operations.js.map +1 -0
  296. package/dist/services/storage/database/extraction-operations.d.ts +47 -0
  297. package/dist/services/storage/database/extraction-operations.d.ts.map +1 -0
  298. package/dist/services/storage/database/extraction-operations.js +85 -0
  299. package/dist/services/storage/database/extraction-operations.js.map +1 -0
  300. package/dist/services/storage/database/form-fill-operations.d.ts +58 -0
  301. package/dist/services/storage/database/form-fill-operations.d.ts.map +1 -0
  302. package/dist/services/storage/database/form-fill-operations.js +116 -0
  303. package/dist/services/storage/database/form-fill-operations.js.map +1 -0
  304. package/dist/services/storage/database/helpers.d.ts +29 -0
  305. package/dist/services/storage/database/helpers.d.ts.map +1 -0
  306. package/dist/services/storage/database/helpers.js +55 -0
  307. package/dist/services/storage/database/helpers.js.map +1 -0
  308. package/dist/services/storage/database/image-operations.d.ts +202 -0
  309. package/dist/services/storage/database/image-operations.d.ts.map +1 -0
  310. package/dist/services/storage/database/image-operations.js +484 -0
  311. package/dist/services/storage/database/image-operations.js.map +1 -0
  312. package/dist/services/storage/database/index.d.ts +13 -0
  313. package/dist/services/storage/database/index.d.ts.map +1 -0
  314. package/dist/services/storage/database/index.js +16 -0
  315. package/dist/services/storage/database/index.js.map +1 -0
  316. package/dist/services/storage/database/lock-operations.d.ts +59 -0
  317. package/dist/services/storage/database/lock-operations.d.ts.map +1 -0
  318. package/dist/services/storage/database/lock-operations.js +89 -0
  319. package/dist/services/storage/database/lock-operations.js.map +1 -0
  320. package/dist/services/storage/database/obligation-operations.d.ts +88 -0
  321. package/dist/services/storage/database/obligation-operations.d.ts.map +1 -0
  322. package/dist/services/storage/database/obligation-operations.js +206 -0
  323. package/dist/services/storage/database/obligation-operations.js.map +1 -0
  324. package/dist/services/storage/database/ocr-operations.d.ts +33 -0
  325. package/dist/services/storage/database/ocr-operations.d.ts.map +1 -0
  326. package/dist/services/storage/database/ocr-operations.js +70 -0
  327. package/dist/services/storage/database/ocr-operations.js.map +1 -0
  328. package/dist/services/storage/database/playbook-operations.d.ts +72 -0
  329. package/dist/services/storage/database/playbook-operations.d.ts.map +1 -0
  330. package/dist/services/storage/database/playbook-operations.js +247 -0
  331. package/dist/services/storage/database/playbook-operations.js.map +1 -0
  332. package/dist/services/storage/database/provenance-operations.d.ts +112 -0
  333. package/dist/services/storage/database/provenance-operations.d.ts.map +1 -0
  334. package/dist/services/storage/database/provenance-operations.js +251 -0
  335. package/dist/services/storage/database/provenance-operations.js.map +1 -0
  336. package/dist/services/storage/database/service.d.ts +142 -0
  337. package/dist/services/storage/database/service.d.ts.map +1 -0
  338. package/dist/services/storage/database/service.js +310 -0
  339. package/dist/services/storage/database/service.js.map +1 -0
  340. package/dist/services/storage/database/static-operations.d.ts +30 -0
  341. package/dist/services/storage/database/static-operations.d.ts.map +1 -0
  342. package/dist/services/storage/database/static-operations.js +218 -0
  343. package/dist/services/storage/database/static-operations.js.map +1 -0
  344. package/dist/services/storage/database/stats-operations.d.ts +101 -0
  345. package/dist/services/storage/database/stats-operations.d.ts.map +1 -0
  346. package/dist/services/storage/database/stats-operations.js +394 -0
  347. package/dist/services/storage/database/stats-operations.js.map +1 -0
  348. package/dist/services/storage/database/tag-operations.d.ts +76 -0
  349. package/dist/services/storage/database/tag-operations.d.ts.map +1 -0
  350. package/dist/services/storage/database/tag-operations.js +178 -0
  351. package/dist/services/storage/database/tag-operations.js.map +1 -0
  352. package/dist/services/storage/database/types.d.ts +286 -0
  353. package/dist/services/storage/database/types.d.ts.map +1 -0
  354. package/dist/services/storage/database/types.js +39 -0
  355. package/dist/services/storage/database/types.js.map +1 -0
  356. package/dist/services/storage/database/upload-operations.d.ts +71 -0
  357. package/dist/services/storage/database/upload-operations.d.ts.map +1 -0
  358. package/dist/services/storage/database/upload-operations.js +124 -0
  359. package/dist/services/storage/database/upload-operations.js.map +1 -0
  360. package/dist/services/storage/database/user-operations.d.ts +102 -0
  361. package/dist/services/storage/database/user-operations.d.ts.map +1 -0
  362. package/dist/services/storage/database/user-operations.js +151 -0
  363. package/dist/services/storage/database/user-operations.js.map +1 -0
  364. package/dist/services/storage/database/workflow-operations.d.ts +98 -0
  365. package/dist/services/storage/database/workflow-operations.d.ts.map +1 -0
  366. package/dist/services/storage/database/workflow-operations.js +157 -0
  367. package/dist/services/storage/database/workflow-operations.js.map +1 -0
  368. package/dist/services/storage/database.d.ts +16 -0
  369. package/dist/services/storage/database.d.ts.map +1 -0
  370. package/dist/services/storage/database.js +15 -0
  371. package/dist/services/storage/database.js.map +1 -0
  372. package/dist/services/storage/index.d.ts +10 -0
  373. package/dist/services/storage/index.d.ts.map +1 -0
  374. package/dist/services/storage/index.js +10 -0
  375. package/dist/services/storage/index.js.map +1 -0
  376. package/dist/services/storage/migrations/index.d.ts +16 -0
  377. package/dist/services/storage/migrations/index.d.ts.map +1 -0
  378. package/dist/services/storage/migrations/index.js +20 -0
  379. package/dist/services/storage/migrations/index.js.map +1 -0
  380. package/dist/services/storage/migrations/operations.d.ts +40 -0
  381. package/dist/services/storage/migrations/operations.d.ts.map +1 -0
  382. package/dist/services/storage/migrations/operations.js +2910 -0
  383. package/dist/services/storage/migrations/operations.js.map +1 -0
  384. package/dist/services/storage/migrations/schema-definitions.d.ts +306 -0
  385. package/dist/services/storage/migrations/schema-definitions.d.ts.map +1 -0
  386. package/dist/services/storage/migrations/schema-definitions.js +1006 -0
  387. package/dist/services/storage/migrations/schema-definitions.js.map +1 -0
  388. package/dist/services/storage/migrations/schema-helpers.d.ts +50 -0
  389. package/dist/services/storage/migrations/schema-helpers.d.ts.map +1 -0
  390. package/dist/services/storage/migrations/schema-helpers.js +176 -0
  391. package/dist/services/storage/migrations/schema-helpers.js.map +1 -0
  392. package/dist/services/storage/migrations/types.d.ts +15 -0
  393. package/dist/services/storage/migrations/types.d.ts.map +1 -0
  394. package/dist/services/storage/migrations/types.js +21 -0
  395. package/dist/services/storage/migrations/types.js.map +1 -0
  396. package/dist/services/storage/migrations/verification.d.ts +20 -0
  397. package/dist/services/storage/migrations/verification.d.ts.map +1 -0
  398. package/dist/services/storage/migrations/verification.js +78 -0
  399. package/dist/services/storage/migrations/verification.js.map +1 -0
  400. package/dist/services/storage/migrations.d.ts +16 -0
  401. package/dist/services/storage/migrations.d.ts.map +1 -0
  402. package/dist/services/storage/migrations.js +17 -0
  403. package/dist/services/storage/migrations.js.map +1 -0
  404. package/dist/services/storage/types.d.ts +12 -0
  405. package/dist/services/storage/types.d.ts.map +1 -0
  406. package/dist/services/storage/types.js +5 -0
  407. package/dist/services/storage/types.js.map +1 -0
  408. package/dist/services/storage/vector.d.ts +208 -0
  409. package/dist/services/storage/vector.d.ts.map +1 -0
  410. package/dist/services/storage/vector.js +526 -0
  411. package/dist/services/storage/vector.js.map +1 -0
  412. package/dist/services/vlm/pipeline.d.ts +194 -0
  413. package/dist/services/vlm/pipeline.d.ts.map +1 -0
  414. package/dist/services/vlm/pipeline.js +800 -0
  415. package/dist/services/vlm/pipeline.js.map +1 -0
  416. package/dist/services/vlm/prompts.d.ts +171 -0
  417. package/dist/services/vlm/prompts.d.ts.map +1 -0
  418. package/dist/services/vlm/prompts.js +229 -0
  419. package/dist/services/vlm/prompts.js.map +1 -0
  420. package/dist/services/vlm/service.d.ts +174 -0
  421. package/dist/services/vlm/service.d.ts.map +1 -0
  422. package/dist/services/vlm/service.js +256 -0
  423. package/dist/services/vlm/service.js.map +1 -0
  424. package/dist/services/webhook-delivery.d.ts +4 -0
  425. package/dist/services/webhook-delivery.d.ts.map +1 -0
  426. package/dist/services/webhook-delivery.js +140 -0
  427. package/dist/services/webhook-delivery.js.map +1 -0
  428. package/dist/tools/chunks.d.ts +19 -0
  429. package/dist/tools/chunks.d.ts.map +1 -0
  430. package/dist/tools/chunks.js +392 -0
  431. package/dist/tools/chunks.js.map +1 -0
  432. package/dist/tools/clm.d.ts +16 -0
  433. package/dist/tools/clm.d.ts.map +1 -0
  434. package/dist/tools/clm.js +668 -0
  435. package/dist/tools/clm.js.map +1 -0
  436. package/dist/tools/clustering.d.ts +13 -0
  437. package/dist/tools/clustering.d.ts.map +1 -0
  438. package/dist/tools/clustering.js +498 -0
  439. package/dist/tools/clustering.js.map +1 -0
  440. package/dist/tools/collaboration.d.ts +15 -0
  441. package/dist/tools/collaboration.d.ts.map +1 -0
  442. package/dist/tools/collaboration.js +516 -0
  443. package/dist/tools/collaboration.js.map +1 -0
  444. package/dist/tools/comparison.d.ts +13 -0
  445. package/dist/tools/comparison.d.ts.map +1 -0
  446. package/dist/tools/comparison.js +735 -0
  447. package/dist/tools/comparison.js.map +1 -0
  448. package/dist/tools/compliance.d.ts +15 -0
  449. package/dist/tools/compliance.d.ts.map +1 -0
  450. package/dist/tools/compliance.js +640 -0
  451. package/dist/tools/compliance.js.map +1 -0
  452. package/dist/tools/config.d.ts +19 -0
  453. package/dist/tools/config.d.ts.map +1 -0
  454. package/dist/tools/config.js +213 -0
  455. package/dist/tools/config.js.map +1 -0
  456. package/dist/tools/database.d.ts +62 -0
  457. package/dist/tools/database.d.ts.map +1 -0
  458. package/dist/tools/database.js +288 -0
  459. package/dist/tools/database.js.map +1 -0
  460. package/dist/tools/documents.d.ts +61 -0
  461. package/dist/tools/documents.d.ts.map +1 -0
  462. package/dist/tools/documents.js +1624 -0
  463. package/dist/tools/documents.js.map +1 -0
  464. package/dist/tools/embeddings.d.ts +14 -0
  465. package/dist/tools/embeddings.d.ts.map +1 -0
  466. package/dist/tools/embeddings.js +626 -0
  467. package/dist/tools/embeddings.js.map +1 -0
  468. package/dist/tools/evaluation.d.ts +25 -0
  469. package/dist/tools/evaluation.d.ts.map +1 -0
  470. package/dist/tools/evaluation.js +523 -0
  471. package/dist/tools/evaluation.js.map +1 -0
  472. package/dist/tools/events.d.ts +16 -0
  473. package/dist/tools/events.d.ts.map +1 -0
  474. package/dist/tools/events.js +493 -0
  475. package/dist/tools/events.js.map +1 -0
  476. package/dist/tools/extraction-structured.d.ts +13 -0
  477. package/dist/tools/extraction-structured.d.ts.map +1 -0
  478. package/dist/tools/extraction-structured.js +390 -0
  479. package/dist/tools/extraction-structured.js.map +1 -0
  480. package/dist/tools/extraction.d.ts +24 -0
  481. package/dist/tools/extraction.d.ts.map +1 -0
  482. package/dist/tools/extraction.js +424 -0
  483. package/dist/tools/extraction.js.map +1 -0
  484. package/dist/tools/file-management.d.ts +14 -0
  485. package/dist/tools/file-management.d.ts.map +1 -0
  486. package/dist/tools/file-management.js +523 -0
  487. package/dist/tools/file-management.js.map +1 -0
  488. package/dist/tools/form-fill.d.ts +13 -0
  489. package/dist/tools/form-fill.d.ts.map +1 -0
  490. package/dist/tools/form-fill.js +250 -0
  491. package/dist/tools/form-fill.js.map +1 -0
  492. package/dist/tools/health.d.ts +19 -0
  493. package/dist/tools/health.d.ts.map +1 -0
  494. package/dist/tools/health.js +229 -0
  495. package/dist/tools/health.js.map +1 -0
  496. package/dist/tools/images.d.ts +54 -0
  497. package/dist/tools/images.d.ts.map +1 -0
  498. package/dist/tools/images.js +787 -0
  499. package/dist/tools/images.js.map +1 -0
  500. package/dist/tools/ingestion.d.ts +94 -0
  501. package/dist/tools/ingestion.d.ts.map +1 -0
  502. package/dist/tools/ingestion.js +1659 -0
  503. package/dist/tools/ingestion.js.map +1 -0
  504. package/dist/tools/intelligence.d.ts +18 -0
  505. package/dist/tools/intelligence.d.ts.map +1 -0
  506. package/dist/tools/intelligence.js +1039 -0
  507. package/dist/tools/intelligence.js.map +1 -0
  508. package/dist/tools/provenance.d.ts +51 -0
  509. package/dist/tools/provenance.d.ts.map +1 -0
  510. package/dist/tools/provenance.js +691 -0
  511. package/dist/tools/provenance.js.map +1 -0
  512. package/dist/tools/reports.d.ts +41 -0
  513. package/dist/tools/reports.d.ts.map +1 -0
  514. package/dist/tools/reports.js +1394 -0
  515. package/dist/tools/reports.js.map +1 -0
  516. package/dist/tools/search.d.ts +35 -0
  517. package/dist/tools/search.d.ts.map +1 -0
  518. package/dist/tools/search.js +2528 -0
  519. package/dist/tools/search.js.map +1 -0
  520. package/dist/tools/shared.d.ts +52 -0
  521. package/dist/tools/shared.d.ts.map +1 -0
  522. package/dist/tools/shared.js +54 -0
  523. package/dist/tools/shared.js.map +1 -0
  524. package/dist/tools/tags.d.ts +15 -0
  525. package/dist/tools/tags.d.ts.map +1 -0
  526. package/dist/tools/tags.js +287 -0
  527. package/dist/tools/tags.js.map +1 -0
  528. package/dist/tools/timeline.d.ts +15 -0
  529. package/dist/tools/timeline.d.ts.map +1 -0
  530. package/dist/tools/timeline.js +14 -0
  531. package/dist/tools/timeline.js.map +1 -0
  532. package/dist/tools/users.d.ts +14 -0
  533. package/dist/tools/users.d.ts.map +1 -0
  534. package/dist/tools/users.js +257 -0
  535. package/dist/tools/users.js.map +1 -0
  536. package/dist/tools/vlm.d.ts +40 -0
  537. package/dist/tools/vlm.d.ts.map +1 -0
  538. package/dist/tools/vlm.js +475 -0
  539. package/dist/tools/vlm.js.map +1 -0
  540. package/dist/tools/workflow.d.ts +16 -0
  541. package/dist/tools/workflow.d.ts.map +1 -0
  542. package/dist/tools/workflow.js +495 -0
  543. package/dist/tools/workflow.js.map +1 -0
  544. package/dist/utils/backoff.d.ts +53 -0
  545. package/dist/utils/backoff.d.ts.map +1 -0
  546. package/dist/utils/backoff.js +78 -0
  547. package/dist/utils/backoff.js.map +1 -0
  548. package/dist/utils/config-persistence.d.ts +33 -0
  549. package/dist/utils/config-persistence.d.ts.map +1 -0
  550. package/dist/utils/config-persistence.js +61 -0
  551. package/dist/utils/config-persistence.js.map +1 -0
  552. package/dist/utils/hash.d.ts +65 -0
  553. package/dist/utils/hash.d.ts.map +1 -0
  554. package/dist/utils/hash.js +146 -0
  555. package/dist/utils/hash.js.map +1 -0
  556. package/dist/utils/math.d.ts +21 -0
  557. package/dist/utils/math.d.ts.map +1 -0
  558. package/dist/utils/math.js +39 -0
  559. package/dist/utils/math.js.map +1 -0
  560. package/dist/utils/validation.d.ts +697 -0
  561. package/dist/utils/validation.d.ts.map +1 -0
  562. package/dist/utils/validation.js +529 -0
  563. package/dist/utils/validation.js.map +1 -0
  564. package/package.json +96 -0
  565. package/python/.gitkeep +0 -0
  566. package/python/__init__.py +104 -0
  567. package/python/clustering_worker.py +440 -0
  568. package/python/docx_image_extractor.py +524 -0
  569. package/python/embedding_worker.py +552 -0
  570. package/python/file_manager_worker.py +564 -0
  571. package/python/form_fill_worker.py +399 -0
  572. package/python/gpu_utils.py +582 -0
  573. package/python/image_extractor.py +317 -0
  574. package/python/image_optimizer.py +444 -0
  575. package/python/ocr_worker.py +712 -0
  576. package/python/pyproject.toml +76 -0
  577. package/python/requirements.txt +51 -0
  578. package/python/reranker_worker.py +87 -0
package/README.md ADDED
@@ -0,0 +1,1154 @@
1
+ # OCR Provenance MCP Server
2
+
3
+ **Turn thousands of documents into a searchable, AI-queryable knowledge base -- with full provenance.**
4
+
5
+ Point this at a folder of PDFs, Word docs, spreadsheets, images, or presentations. Minutes later, Claude can search, analyze, compare, and answer questions across your entire document collection -- with a cryptographic audit trail proving exactly where every answer came from.
6
+
7
+ [![License: Dual](https://img.shields.io/badge/License-Free_Non--Commercial-green.svg)](LICENSE)
8
+ [![Node.js](https://img.shields.io/badge/Node.js-%3E%3D20-green)](https://nodejs.org/)
9
+ [![TypeScript](https://img.shields.io/badge/TypeScript-5.5+-blue)](https://www.typescriptlang.org/)
10
+ [![MCP](https://img.shields.io/badge/MCP-1.0-purple)](https://modelcontextprotocol.io/)
11
+ [![Tools](https://img.shields.io/badge/MCP_Tools-141-orange)](#tool-reference-141-tools)
12
+ [![Tests](https://img.shields.io/badge/Tests-2%2C639_passing-brightgreen)](#development)
13
+ [![Docker](https://img.shields.io/badge/Docker-ghcr.io-blue)](https://github.com/ChrisRoyse/OCR-Provenance/pkgs/container/ocr-provenance)
14
+ [![Docker Build](https://img.shields.io/github/actions/workflow/status/ChrisRoyse/OCR-Provenance/docker-publish.yml?branch=main&label=Docker%20Build)](https://github.com/ChrisRoyse/OCR-Provenance/actions/workflows/docker-publish.yml)
15
+
16
+ ---
17
+
18
+ ## Why This Exists
19
+
20
+ AI assistants can't read your files natively. They can't search across 500 PDFs, compare contract versions, or find the one email buried in a discovery dump. This server bridges that gap.
21
+
22
+ It's a [Model Context Protocol](https://modelcontextprotocol.io/) server that gives Claude (or any MCP client) the ability to **ingest, OCR, search, compare, cluster, tag, version-track, and reason over** your documents -- with a cryptographic audit trail proving exactly where every answer came from.
23
+
24
+ ### What Happens When You Ingest Documents
25
+
26
+ ```
27
+ Your files (PDF, DOCX, XLSX, images, presentations...)
28
+ -> OCR text extraction via Datalab API (3 accuracy modes)
29
+ -> Hybrid section-aware chunking with markdown parsing
30
+ -> GPU vector embeddings (nomic-embed-text-v1.5, 768-dim)
31
+ -> Image extraction + AI vision analysis (Gemini 3 Flash)
32
+ -> Full-text + semantic + hybrid search indexes
33
+ -> Document clustering by similarity (HDBSCAN / agglomerative / k-means)
34
+ -> Cross-entity tagging system
35
+ -> Document version tracking (re-ingestion detects changes)
36
+ -> SHA-256 provenance chain on every artifact
37
+ ```
38
+
39
+ **18 supported file types:** PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, PNG, JPG, JPEG, TIFF, TIF, BMP, GIF, WEBP, TXT, CSV, MD
40
+
41
+ ---
42
+
43
+ ## Real-World Use Cases
44
+
45
+ ### Litigation & Legal Discovery
46
+
47
+ You have 3,000 documents from a civil case -- contracts, emails, depositions, medical records, invoices, and correspondence spanning 8 years. Normally this takes a team of paralegals weeks to organize.
48
+
49
+ ```
50
+ "Search all documents for references to the March 2024 amendment"
51
+ "Compare the original contract with the signed version -- what changed?"
52
+ "Find every document mentioning Dr. Rivera and cluster them by topic"
53
+ "Which invoices were submitted after the termination date?"
54
+ "Build me a timeline of all communications between Smith and Davis Corp"
55
+ ```
56
+
57
+ The provenance chain means you can trace every search result back to its exact source page and document -- critical for legal admissibility and audit.
58
+
59
+ ### Medical Records Review
60
+
61
+ An insurance adjuster needs to review 800+ pages of medical records across 15 providers for a personal injury claim.
62
+
63
+ ```
64
+ "Find all references to lumbar spine across every provider's records"
65
+ "What medications were prescribed between June and December 2024?"
66
+ "Compare the initial ER report with the orthopedic surgeon's assessment"
67
+ "Extract all diagnosis codes and dates from the treatment records"
68
+ "Cluster these records by provider and summarize each provider's findings"
69
+ ```
70
+
71
+ ### Financial Audit & Compliance
72
+
73
+ A forensic accountant is reviewing 5 years of financial records for a fraud investigation -- bank statements, tax returns, invoices, receipts, and internal reports.
74
+
75
+ ```
76
+ "Find all transactions over $10,000 across every bank statement"
77
+ "Compare this year's tax return with last year's -- what changed?"
78
+ "Search for any mention of offshore accounts or shell companies"
79
+ "Cluster all invoices by vendor and flag any with duplicate amounts"
80
+ "Which expense reports don't have matching receipts?"
81
+ ```
82
+
83
+ ### Insurance Claims Processing
84
+
85
+ An adjuster is handling a commercial property damage claim with engineering reports, contractor estimates, photographs, and policy documents.
86
+
87
+ ```
88
+ "What is the total estimated repair cost across all contractor bids?"
89
+ "Compare the policyholder's damage report with the independent adjuster's assessment"
90
+ "Find all photos showing water damage and describe what's in each one"
91
+ "Does the policy cover the type of damage described in the engineering report?"
92
+ "Cluster all documents by damage category -- structural, electrical, plumbing"
93
+ ```
94
+
95
+ ### Academic Research
96
+
97
+ A PhD student is doing a literature review across 200+ papers, supplementary materials, and datasets.
98
+
99
+ ```
100
+ "Find all papers that discuss transformer architectures for protein folding"
101
+ "Which papers cite the 2023 AlphaFold study?"
102
+ "Compare the methodology sections of these three competing approaches"
103
+ "Cluster these papers by research topic and list the top 5 clusters"
104
+ "Build me a RAG context block about attention mechanisms for my thesis"
105
+ ```
106
+
107
+ ### Real Estate Due Diligence
108
+
109
+ A commercial real estate firm is evaluating a property acquisition -- title reports, environmental assessments, lease agreements, zoning documents, and inspection reports.
110
+
111
+ ```
112
+ "Are there any environmental liens or violations in the Phase I report?"
113
+ "Compare the rent rolls from 2023 and 2024 -- which tenants left?"
114
+ "Find all lease clauses related to early termination or renewal options"
115
+ "What does the zoning report say about permitted commercial uses?"
116
+ "Cluster all inspection findings by severity -- critical, major, minor"
117
+ ```
118
+
119
+ ### HR & Employment Investigations
120
+
121
+ An HR director is investigating a workplace complaint with emails, performance reviews, chat logs, and policy documents.
122
+
123
+ ```
124
+ "Find all communications between the complainant and the respondent"
125
+ "When was the anti-harassment policy last updated and what does it say?"
126
+ "Compare the employee's performance reviews from 2023 and 2024"
127
+ "Search for any prior complaints or disciplinary actions in these records"
128
+ ```
129
+
130
+ ---
131
+
132
+ ## Quick Start
133
+
134
+ ```
135
+ 1. Create a database -> ocr_db_create { name: "my-case" }
136
+ 2. Select it -> ocr_db_select { database_name: "my-case" }
137
+ 3. Ingest a folder -> ocr_ingest_directory { directory_path: "/path/to/docs" }
138
+ 4. Process everything -> ocr_process_pending {}
139
+ 5. Search -> ocr_search_hybrid { query: "breach of contract" }
140
+ 6. Ask questions -> ocr_rag_context { question: "What were the settlement terms?" }
141
+ 7. Verify provenance -> ocr_provenance_verify { item_id: "doc-id" }
142
+ ```
143
+
144
+ Each database is fully isolated. Create one per case, project, or client.
145
+
146
+ ---
147
+
148
+ ## Architecture
149
+
150
+ ```
151
+ ┌─────────────────────────────────────────────────────────────┐
152
+ │ MCP Server (stdio) │
153
+ │ TypeScript + @modelcontextprotocol/sdk │
154
+ │ 141 tools across 22 tool modules │
155
+ ├─────────────────────────────────────────────────────────────┤
156
+ │ │
157
+ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
158
+ │ │ Ingestion│ │ Search │ │ Analysis │ │ Reports │ │
159
+ │ │ 9 tools │ │ 12 tools │ │ 35 tools │ │ 9 tools │ │
160
+ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
161
+ │ │ │ │ │ │
162
+ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
163
+ │ │ VLM │ │ Images │ │ Tags │ │ Intel │ │
164
+ │ │ 6 tools │ │ 14 tools │ │ 6 tools │ │ 4 tools │ │
165
+ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
166
+ │ │ │ │ │ │
167
+ │ ┌────┴──────────────┴──────────────┴──────────────┴────┐ │
168
+ │ │ Service Layer (11 domains) │ │
169
+ │ │ OCR · Chunking · Embedding · Search · VLM │ │
170
+ │ │ Provenance · Comparison · Clustering · Gemini │ │
171
+ │ │ Images · Storage │ │
172
+ │ └────┬──────────────┬──────────────┬───────────────────┘ │
173
+ │ │ │ │ │
174
+ │ ┌────┴────┐ ┌────┴────┐ ┌────┴─────┐ │
175
+ │ │ SQLite │ │sqlite-vec│ │ FTS5 │ │
176
+ │ │ 18 tbls │ │ vectors │ │ indexes │ │
177
+ │ └─────────┘ └─────────┘ └──────────┘ │
178
+ │ │
179
+ │ ┌──────────────────────────────────────────────────────┐ │
180
+ │ │ Python Workers (9 processes) │ │
181
+ │ │ OCR · Embedding · Clustering · Image Extraction │ │
182
+ │ │ DOCX Extraction · Image Optimizer · Form Fill │ │
183
+ │ │ File Manager · Local Reranker │ │
184
+ │ └──────────────────────────────────────────────────────┘ │
185
+ │ │
186
+ │ ┌──────────────────────────────────────────────────────┐ │
187
+ │ │ External APIs │ │
188
+ │ │ Datalab (OCR/Forms) · Gemini 3 Flash (VLM/AI) │ │
189
+ │ │ Nomic embed v1.5 (local GPU, 768-dim) │ │
190
+ │ └──────────────────────────────────────────────────────┘ │
191
+ └─────────────────────────────────────────────────────────────┘
192
+ ```
193
+
194
+ - **TypeScript MCP Server** -- 141 tools across 22 modules, Zod validation, provenance tracking
195
+ - **Python Workers** (9) -- OCR, GPU embedding, image extraction, clustering, form fill, file management, local reranking
196
+ - **SQLite + sqlite-vec** -- 18 tables, FTS5 full-text search, vector similarity search, WAL mode
197
+ - **Gemini 3 Flash** -- vision analysis (image description, classification, PDF analysis)
198
+ - **Datalab API** -- document OCR, form filling, structured extraction, cloud storage
199
+ - **nomic-embed-text-v1.5** -- 768-dim local embeddings (CUDA / MPS / CPU)
200
+
201
+ ### Hybrid Section-Aware Chunking
202
+
203
+ The chunking pipeline produces semantically coherent chunks that respect document structure:
204
+
205
+ ```
206
+ OCR text (markdown)
207
+
208
+ ├─ Text Normalization ──── Clean whitespace, normalize line breaks
209
+ ├─ Heading Normalization ─ Fix skipped heading levels (h1→h3 becomes h1→h2)
210
+ ├─ Markdown Parsing ────── Parse into heading/paragraph/table/code/list blocks
211
+ ├─ JSON Block Analysis ──── Detect atomic regions (tables, figures) from OCR blocks
212
+ ├─ Section-Aware Splitting ─ Chunk at heading boundaries, respect atomic regions
213
+ ├─ Page Tracking ────────── Assign page numbers via Datalab page separators
214
+ ├─ Chunk Merging ────────── Merge heading-only chunks into their content
215
+ ├─ Chunk Deduplication ──── Remove near-duplicate chunks via fuzzy matching
216
+ ├─ Header/Footer Tagging ── Auto-tag header/footer chunks for search exclusion
217
+ └─ Metadata Enrichment ──── section_path, heading_context, content_types per chunk
218
+ ```
219
+
220
+ Each chunk carries: `section_path` (e.g., "Introduction > Background"), `heading_context`, `content_types` (table/code/text/list), and `page_number` -- all searchable as filters.
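The section-aware splitting step can be pictured with a small sketch. This is not the package's implementation: `Chunk` and `split_by_headings` are hypothetical names, and the sketch covers only heading boundaries and the `section_path` trail, leaving out atomic regions, page tracking, merging, and deduplication.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    section_path: str  # e.g. "Introduction > Background"
    text: str


def split_by_headings(markdown: str) -> list[Chunk]:
    """Split markdown at heading boundaries, tracking the heading trail."""
    trail: list[tuple[int, str]] = []  # (level, title) of currently open headings
    chunks: list[Chunk] = []
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered body text under the trail that was open while it was read.
        text = "\n".join(buffer).strip()
        if text:
            path = " > ".join(title for _, title in trail)
            chunks.append(Chunk(section_path=path, text=text))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks


doc = "# Introduction\nOverview text.\n## Background\nPrior work.\n# Methods\nSteps."
chunks = split_by_headings(doc)
```

Here `chunks` carries the section paths `Introduction`, `Introduction > Background`, and `Methods`, which is the shape of value a `section_path_filter` prefix match would operate on.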
221
+
222
+ ### How Search Works
223
+
224
+ Three search modes, combinable via Reciprocal Rank Fusion:
225
+
226
+ | Mode | Best For | How It Works |
227
+ |------|----------|--------------|
228
+ | **BM25** | Exact terms, case numbers, names | FTS5 full-text with porter stemming |
229
+ | **Semantic** | Conceptual queries, paraphrases | Vector similarity via nomic-embed-text-v1.5 |
230
+ | **Hybrid** (recommended) | General questions | BM25 + semantic fused, optional local re-ranking |
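As a rough illustration of how Reciprocal Rank Fusion combines two ranked lists, here is a minimal sketch. The function name, the chunk IDs, and the smoothing constant `k=60` (a commonly used default) are assumptions for illustration; the README does not state which constant or tie-breaking rule the server uses.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


bm25_ranking = ["c1", "c2", "c3"]       # hypothetical BM25 result order
semantic_ranking = ["c3", "c1", "c4"]   # hypothetical vector-search result order
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
```

With these inputs `fused` is `["c1", "c3", "c2", "c4"]`: chunks that appear in both lists outrank chunks that appear in only one, regardless of the raw BM25 or cosine scores.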
231
+
232
+ #### Search Enhancement Stack
233
+
234
+ All three search modes support a shared enhancement stack:
235
+
236
+ - **Query classification** -- heuristic analysis auto-routes queries between exact/semantic/mixed modes (`auto_route` on hybrid)
237
+ - **Query expansion** -- legal/medical synonym injection for broader recall (`expand_query`, default on for hybrid)
238
+ - **Local cross-encoder reranking** -- Python-based cross-encoder model (ms-marco-MiniLM-L-12-v2) re-scores results locally for relevance (`rerank`)
239
+ - **Quality-weighted ranking** -- always-on quality score multiplier (0.8x--1.0x) boosts higher-quality OCR results
240
+ - **Chunk-level filters** -- `content_type_filter`, `section_path_filter` (prefix match), `heading_filter` (LIKE), `page_range_filter`, `quality_boost`, `table_columns_contain`
241
+ - **Metadata filters** -- title/author/subject LIKE matching, document ID filtering, cluster filtering, quality score threshold
242
+ - **VLM image enrichment** -- search results from VLM descriptions include image metadata (path, dimensions, type)
243
+ - **Table metadata** -- search results include table column headers and row/column counts from OCR blocks
244
+ - **Context chunks** -- surrounding chunks automatically included with results for broader context
245
+ - **Group by document** -- deduplicate results by document, returning only the best match per document (`group_by_document`)
246
+ - **Header/footer exclusion** -- header/footer chunks auto-tagged during ingestion and excluded from search by default (`include_headers_footers`)
247
+ - **Document context** -- optionally attach cluster labels and comparison info to results (`include_document_context`)
248
+ - **Provenance inclusion** -- attach full provenance chain to each search result
249
+ - **Search persistence** -- save, list, retrieve, and re-execute named searches
250
+ - **Cross-database search** -- BM25 search across all databases simultaneously
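The quality-weighted ranking item can be sketched as a score multiplier. The README only gives the multiplier range (0.8x to 1.0x); the linear mapping below and the assumption that quality is normalized to the range 0 to 1 are illustrative guesses, and both function names are hypothetical.

```python
def quality_multiplier(quality: float, floor: float = 0.8) -> float:
    """Map a quality score (assumed to be in [0, 1]) onto the range [floor, 1.0]."""
    clamped = min(max(quality, 0.0), 1.0)
    return floor + (1.0 - floor) * clamped


def apply_quality_weight(results):
    """Re-rank (id, relevance, quality) tuples by relevance * quality multiplier."""
    weighted = [(item_id, score * quality_multiplier(q)) for item_id, score, q in results]
    return sorted(weighted, key=lambda item: item[1], reverse=True)


# A slightly less relevant hit from a clean scan can overtake a hit from a poor scan.
ranked = apply_quality_weight([("poor-scan", 0.90, 0.0), ("clean-scan", 0.80, 1.0)])
```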
251
+
252
+ ### Provenance Chain
253
+
254
+ Every artifact carries a SHA-256 hash chain back to its source document:
255
+
256
+ ```
257
+ DOCUMENT (depth 0)
258
+ +-- OCR_RESULT (depth 1)
259
+ | +-- CHUNK (depth 2) -> EMBEDDING (depth 3)
260
+ | +-- IMAGE (depth 2) -> VLM_DESCRIPTION (depth 3) -> EMBEDDING (depth 4)
261
+ | +-- EXTRACTION (depth 2) -> EMBEDDING (depth 3)
262
+ +-- FORM_FILL (depth 0)
263
+ +-- COMPARISON (depth 2)
264
+ +-- CLUSTERING (depth 2)
265
+ ```
266
+
267
+ Export in JSON, W3C PROV-JSON, or CSV for regulatory compliance. Query provenance with 12+ filters, view processing timelines, and analyze per-processor statistics.
268
+
269
+ ### Document Version Tracking
270
+
271
+ When you re-ingest a file, the system detects changes automatically:
272
+
273
+ - **Same hash** -- skip (already processed)
274
+ - **Different hash** -- creates a new version linked to the previous via `previous_version_id`
275
+ - **Version history** -- retrieve all versions of a document ordered by creation date
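The re-ingestion rules above can be sketched as follows. `DocVersion` and `Registry` are hypothetical names, and comparing against the most recent version of the same path (rather than any earlier version) is an assumption; only the `previous_version_id` field name comes from the README.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocVersion:
    id: int
    file_path: str
    file_hash: str
    previous_version_id: Optional[int]


class Registry:
    """In-memory stand-in for the documents table."""

    def __init__(self) -> None:
        self.versions: list[DocVersion] = []

    def latest(self, file_path: str) -> Optional[DocVersion]:
        matches = [v for v in self.versions if v.file_path == file_path]
        return matches[-1] if matches else None

    def ingest(self, file_path: str, content: bytes) -> DocVersion:
        file_hash = hashlib.sha256(content).hexdigest()
        latest = self.latest(file_path)
        if latest is not None and latest.file_hash == file_hash:
            return latest  # same hash: already processed, nothing new is created
        version = DocVersion(
            id=len(self.versions) + 1,
            file_path=file_path,
            file_hash=file_hash,
            previous_version_id=latest.id if latest else None,
        )
        self.versions.append(version)
        return version


registry = Registry()
v1 = registry.ingest("contract.pdf", b"original text")
unchanged = registry.ingest("contract.pdf", b"original text")
v2 = registry.ingest("contract.pdf", b"amended text")
```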
276
+
277
+ ### Document Workflow
278
+
279
+ Tag-based workflow state management for document lifecycle:
280
+
281
+ - **States:** draft, review, approved, published, archived
282
+ - **History:** every state change is preserved (append-only)
283
+ - **Actions:** get current state, set new state, view full state history
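An append-only state history of this kind might look like the sketch below. The class and method names are hypothetical; only the five state names come from the list above, and the sketch does not enforce any transition order because the README does not describe one.

```python
from datetime import datetime, timezone

STATES = ("draft", "review", "approved", "published", "archived")


class WorkflowLog:
    """Append-only log: state changes are added, never edited or removed."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, str, datetime]] = []

    def set_state(self, document_id: str, state: str) -> None:
        if state not in STATES:
            raise ValueError(f"unknown workflow state: {state}")
        self._entries.append((document_id, state, datetime.now(timezone.utc)))

    def current_state(self, document_id: str):
        # The current state is simply the most recent entry for the document.
        for doc, state, _ in reversed(self._entries):
            if doc == document_id:
                return state
        return None

    def history(self, document_id: str) -> list[str]:
        return [state for doc, state, _ in self._entries if doc == document_id]


log = WorkflowLog()
log.set_state("doc-1", "draft")
log.set_state("doc-1", "review")
log.set_state("doc-1", "approved")
```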
284
+
285
+ ---
286
+
287
+ ## Requirements
288
+
289
+ | Component | Version | Notes |
290
+ |-----------|---------|-------|
291
+ | Node.js | >= 20 | MCP server runtime |
292
+ | Python | >= 3.10 | Worker processes |
293
+ | PyTorch | >= 2.0 | Embedding model inference |
294
+ | GPU | Optional | CUDA or Apple MPS; CPU works fine, just slower |
295
+
296
+ ### API Keys
297
+
298
+ | Key | Get From | Used For |
299
+ |-----|----------|----------|
300
+ | `DATALAB_API_KEY` | [datalab.to](https://www.datalab.to) | OCR, form fill, file upload, structured extraction |
301
+ | `GEMINI_API_KEY` | [Google AI Studio](https://aistudio.google.com/) | VLM image description and classification |
302
+
303
+ ---
304
+
305
+ ## Installation
306
+
307
+ ### One-Command Install
308
+
309
+ ```bash
310
+ npm install -g ocr-provenance-mcp && ocr-provenance-mcp-setup
311
+ ```
312
+
313
+ The setup wizard will prompt for your API keys, validate them, pull the Docker image, and register the server with your AI client. Requires [Docker Desktop](https://docker.com/products/docker-desktop).
314
+
315
+ ### From Source
316
+
317
+ ```bash
318
+ # Clone and build
319
+ git clone https://github.com/ChrisRoyse/OCR-Provenance.git
320
+ cd OCR-Provenance
321
+ npm install && npm run build
322
+
323
+ # Install globally (makes `ocr-provenance-mcp` available everywhere)
324
+ npm link
325
+
326
+ # Python dependencies
327
+ pip install torch transformers sentence-transformers numpy scikit-learn hdbscan pymupdf pillow python-docx requests
328
+
329
+ # Download embedding model (~270MB, one-time)
330
+ pip install huggingface_hub
331
+ huggingface-cli download nomic-ai/nomic-embed-text-v1.5 --local-dir models/nomic-embed-text-v1.5
332
+
333
+ # Configure API keys
334
+ cp .env.example .env
335
+ # Edit .env with your DATALAB_API_KEY and GEMINI_API_KEY
336
+
337
+ # Verify
338
+ ocr-provenance-mcp # Should print "Tools registered: 141" on stderr
339
+ ```
340
+
341
+ > **PyTorch GPU note:** If `pip install torch` installs a CPU-only build, install a CUDA build explicitly (this example targets CUDA 12.4):
342
+ > ```bash
343
+ > pip install torch --index-url https://download.pytorch.org/whl/cu124
344
+ > ```
345
+
346
+ <details>
347
+ <summary><strong>Platform-specific notes</strong></summary>
348
+
349
+ **Linux / WSL2:** Install NVIDIA drivers and CUDA toolkit. For WSL2, install the [NVIDIA CUDA on WSL driver](https://developer.nvidia.com/cuda/wsl) from the Windows side.
350
+
351
+ **macOS (Apple Silicon):** MPS acceleration works automatically. Just `pip install torch torchvision torchaudio`.
352
+
353
+ **Windows:** Use WSL2 for best compatibility. Native Windows works too -- the server auto-detects `python` vs `python3`.
354
+
355
+ </details>
356
+
357
+ <details>
358
+ <summary><strong>Custom embedding model location</strong></summary>
359
+
360
+ If you install globally and want the model elsewhere:
361
+
362
+ ```bash
363
+ # In your .env file:
364
+ EMBEDDING_MODEL_PATH=/path/to/nomic-embed-text-v1.5
365
+ ```
366
+
367
+ The server checks: `EMBEDDING_MODEL_PATH` env var -> `models/` in the package directory -> `~/.ocr-provenance/models/`
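The lookup order can be sketched as a first-match search over candidate directories. `resolve_model_path` is a hypothetical helper, and two details are assumptions: that the model sits in a `nomic-embed-text-v1.5` subdirectory under each `models/` location (mirroring the download command), and that an `EMBEDDING_MODEL_PATH` pointing at a missing directory falls through to the next candidate.

```python
import os
import tempfile
from pathlib import Path

MODEL_DIR_NAME = "nomic-embed-text-v1.5"  # assumed subdirectory name


def resolve_model_path(package_dir, env=None, home=None):
    """Return the first existing model directory, or None if none is found."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    candidates = []
    if env.get("EMBEDDING_MODEL_PATH"):
        candidates.append(Path(env["EMBEDDING_MODEL_PATH"]))         # 1. env var
    candidates.append(Path(package_dir) / "models" / MODEL_DIR_NAME)  # 2. package dir
    candidates.append(home / ".ocr-provenance" / "models" / MODEL_DIR_NAME)  # 3. home
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None


# Demonstration in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / "package"
    bundled = package_dir / "models" / MODEL_DIR_NAME
    bundled.mkdir(parents=True)
    custom = Path(tmp) / "custom-model"
    custom.mkdir()
    fake_home = Path(tmp) / "home"

    found_bundled = resolve_model_path(package_dir, env={}, home=fake_home) == bundled
    found_custom = resolve_model_path(
        package_dir, env={"EMBEDDING_MODEL_PATH": str(custom)}, home=fake_home
    ) == custom
```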
368
+
369
+ </details>
370
+
371
+ ---
372
+
373
+ ## Connecting to Claude
374
+
375
+ ### Claude Code
376
+
377
+ ```bash
378
+ # Register globally (available in all projects)
379
+ claude mcp add ocr-provenance -s user \
380
+ -e OCR_PROVENANCE_ENV_FILE=/path/to/OCR-Provenance/.env \
381
+ -e NODE_OPTIONS=--max-semi-space-size=64 \
382
+ -- ocr-provenance-mcp
383
+ ```
384
+
385
+ ### Claude Desktop
386
+
387
+ Add to your config (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS, `%APPDATA%\Claude\claude_desktop_config.json` on Windows):
388
+
389
+ ```json
390
+ {
391
+ "mcpServers": {
392
+ "ocr-provenance": {
393
+ "command": "ocr-provenance-mcp",
394
+ "env": {
395
+ "OCR_PROVENANCE_ENV_FILE": "/absolute/path/to/OCR-Provenance/.env",
396
+ "NODE_OPTIONS": "--max-semi-space-size=64"
397
+ }
398
+ }
399
+ }
400
+ }
401
+ ```
402
+
403
+ ### Any MCP Client
404
+
405
+ The server uses stdio transport (JSON-RPC over stdin/stdout):
406
+
407
+ ```bash
408
+ ocr-provenance-mcp # Global command (after npm link)
409
+ node /path/to/dist/index.js # Direct invocation
410
+ ```
411
+
412
+ Environment variables can be provided via `OCR_PROVENANCE_ENV_FILE`, direct env vars, or a `.env` file in the working directory.
413
+
414
+ ---
415
+
416
+ ## Docker Installation (Recommended)
417
+
418
+ The fastest way to get started. No Node.js, Python, or model downloads needed -- everything is bundled in the Docker image.
419
+
420
+ ### Claude Code CLI
421
+
422
+ ```bash
423
+ claude mcp add ocr-provenance \
424
+ -e DATALAB_API_KEY=your_key \
425
+ -e GEMINI_API_KEY=your_key \
426
+ -- docker run -i --rm \
427
+ -v "$HOME":/host:ro \
428
+ -v ocr-data:/data \
429
+ ghcr.io/chrisroyse/ocr-provenance:latest
430
+ ```
431
+
432
+ ### Claude Desktop
433
+
434
+ Add to `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `%APPDATA%\Claude\claude_desktop_config.json` (Windows):
435
+
436
+ ```json
437
+ {
438
+ "mcpServers": {
439
+ "ocr-provenance": {
440
+ "command": "docker",
441
+ "args": [
442
+ "run", "-i", "--rm",
443
+ "-e", "DATALAB_API_KEY=your_key_here",
444
+ "-e", "GEMINI_API_KEY=your_key_here",
445
+ "-v", "/Users/yourname:/host:ro",
446
+ "-v", "ocr-data:/data",
447
+ "ghcr.io/chrisroyse/ocr-provenance:latest"
448
+ ]
449
+ }
450
+ }
451
+ }
452
+ ```
453
+
454
+ ### Cursor
455
+
456
+ Add to `~/.cursor/mcp.json` (global) or `.cursor/mcp.json` (project):
457
+
458
+ ```json
459
+ {
460
+ "mcpServers": {
461
+ "ocr-provenance": {
462
+ "command": "docker",
463
+ "args": [
464
+ "run", "-i", "--rm",
465
+ "-e", "DATALAB_API_KEY=your_key_here",
466
+ "-e", "GEMINI_API_KEY=your_key_here",
467
+ "-v", "/Users/yourname:/host:ro",
468
+ "-v", "ocr-data:/data",
469
+ "ghcr.io/chrisroyse/ocr-provenance:latest"
470
+ ]
471
+ }
472
+ }
473
+ }
474
+ ```
475
+
476
+ ### VS Code (GitHub Copilot)
477
+
478
+ Add to `.vscode/mcp.json`:
479
+
480
+ ```json
481
+ {
482
+ "inputs": [
483
+ { "id": "datalab-key", "type": "promptString", "description": "Datalab API key", "password": true },
484
+ { "id": "gemini-key", "type": "promptString", "description": "Gemini API key", "password": true }
485
+ ],
486
+ "servers": {
487
+ "ocr-provenance": {
488
+ "type": "stdio",
489
+ "command": "docker",
490
+ "args": ["run", "-i", "--rm", "-v", "ocr-data:/data", "-e", "DATALAB_API_KEY", "-e", "GEMINI_API_KEY", "ghcr.io/chrisroyse/ocr-provenance:latest"],
491
+ "env": { "DATALAB_API_KEY": "${input:datalab-key}", "GEMINI_API_KEY": "${input:gemini-key}" }
492
+ }
493
+ }
494
+ }
495
+ ```
496
+
497
+ ### Windsurf
498
+
499
+ Add to `~/.codeium/windsurf/mcp_config.json`:
500
+
501
+ ```json
502
+ {
503
+ "mcpServers": {
504
+ "ocr-provenance": {
505
+ "command": "docker",
506
+ "args": [
507
+ "run", "-i", "--rm",
508
+ "-e", "DATALAB_API_KEY=your_key_here",
509
+ "-e", "GEMINI_API_KEY=your_key_here",
510
+ "-v", "/Users/yourname:/host:ro",
511
+ "-v", "ocr-data:/data",
512
+ "ghcr.io/chrisroyse/ocr-provenance:latest"
513
+ ]
514
+ }
515
+ }
516
+ }
517
+ ```
518
+
519
+ ### HTTP Mode (Remote Deployment)
520
+
521
+ For remote/shared deployments, use HTTP transport with docker-compose:
522
+
523
+ ```bash
524
+ # Start in HTTP mode
525
+ docker compose up -d
526
+
527
+ # Or with GPU support
528
+ docker compose -f docker-compose.gpu.yml up -d
529
+ ```
530
+
531
+ The server listens on port 3100, with a health endpoint at `GET /health` and the MCP endpoint at `POST /mcp`.
532
+
533
+ ### Docker Environment Variables
534
+
535
+ | Variable | Default | Description |
536
+ |----------|---------|-------------|
537
+ | `DATALAB_API_KEY` | (required) | Datalab OCR API key |
538
+ | `GEMINI_API_KEY` | (required) | Google Gemini API key |
539
+ | `MCP_TRANSPORT` | `stdio` | Transport mode: `stdio` or `http` |
540
+ | `MCP_HTTP_PORT` | `3100` | HTTP server port |
541
+ | `EMBEDDING_DEVICE` | `cpu` | Embedding device: `cpu`, `cuda`, `mps` |
542
+ | `OCR_PROVENANCE_DATABASES_PATH` | `/data` | Database storage path |
543
+ | `OCR_PROVENANCE_ALLOWED_DIRS` | `/host,/data` | Allowed directories for file access |
544
+
545
+ ### Backup and Restore
546
+
547
+ ```bash
548
+ # Backup
549
+ docker run --rm -v ocr-data:/data:ro -v "$(pwd)/backup:/backup" alpine cp -a /data/. /backup/
550
+
551
+ # Restore
552
+ docker run --rm -v ocr-data:/data -v "$(pwd)/backup:/backup:ro" alpine cp -a /backup/. /data/
553
+ ```
554
+
555
+ ---
556
+
557
+ ## Configuration
558
+
559
+ ```bash
560
+ # .env file
561
+ DATALAB_API_KEY=your_key
562
+ GEMINI_API_KEY=your_key
563
+
564
+ # OCR settings
565
+ DATALAB_DEFAULT_MODE=accurate # fast | balanced | accurate
566
+ DATALAB_MAX_CONCURRENT=3
567
+
568
+ # Embeddings (auto-detects CUDA > MPS > CPU)
569
+ EMBEDDING_DEVICE=auto
570
+ EMBEDDING_BATCH_SIZE=512
571
+
572
+ # Chunking
573
+ CHUNKING_SIZE=2000
574
+ CHUNKING_OVERLAP_PERCENT=10
575
+
576
+ # Auto-clustering (triggers after processing when enabled)
577
+ AUTO_CLUSTER_ENABLED=false
578
+ AUTO_CLUSTER_THRESHOLD=5 # Minimum documents to trigger
579
+ AUTO_CLUSTER_ALGORITHM=hdbscan
580
+
581
+ # Storage
582
+ STORAGE_DATABASES_PATH=~/.ocr-provenance/databases/
583
+ ```
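To show how `CHUNKING_SIZE` and `CHUNKING_OVERLAP_PERCENT` interact, here is a plain sliding-window sketch. It assumes the size is measured in characters, which the README does not state, and it ignores the section-aware logic the real pipeline applies; `sliding_windows` is a hypothetical name.

```python
def sliding_windows(text: str, size: int = 2000, overlap_percent: int = 10) -> list[str]:
    """Cut text into windows of `size` characters that overlap by a percentage."""
    overlap = size * overlap_percent // 100  # 2000 * 10% = 200 characters
    step = size - overlap                    # each window starts 1800 characters later
    if step <= 0:
        raise ValueError("overlap_percent must be below 100")
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return windows


text = "".join(chr(ord("a") + i % 26) for i in range(5000))
windows = sliding_windows(text)
```

For a 5,000-character input with the defaults, this yields three windows of 2,000, 2,000, and 1,400 characters, where the last 200 characters of one window repeat at the start of the next.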
584
+
585
+ ---
586
+
587
+ ## Tool Reference (141 Tools)
588
+
589
+ <details>
590
+ <summary><strong>Database Management (5)</strong></summary>
591
+
592
+ | Tool | Description |
593
+ |------|-------------|
594
+ | `ocr_db_create` | Create a new isolated database |
595
+ | `ocr_db_list` | List all databases with optional stats |
596
+ | `ocr_db_select` | Select the active database |
597
+ | `ocr_db_stats` | Detailed statistics (documents, chunks, embeddings, images, clusters) |
598
+ | `ocr_db_delete` | Permanently delete a database |
599
+
600
+ </details>
601
+
602
+ <details>
603
+ <summary><strong>Ingestion & Processing (9)</strong></summary>
604
+
605
+ | Tool | Description |
606
+ |------|-------------|
607
+ | `ocr_ingest_directory` | Scan directory and register documents (18 file types, recursive) |
608
+ | `ocr_ingest_files` | Ingest specific files by path |
609
+ | `ocr_process_pending` | Full pipeline: OCR -> Chunk -> Embed -> Vector -> VLM (with auto-clustering) |
610
+ | `ocr_status` | Check processing status |
611
+ | `ocr_retry_failed` | Reset failed documents for reprocessing |
612
+ | `ocr_reprocess` | Reprocess with different OCR settings |
613
+ | `ocr_chunk_complete` | Repair documents missing chunks/embeddings |
614
+ | `ocr_convert_raw` | One-off OCR conversion without storing |
615
+ | `ocr_reembed_document` | Re-generate embeddings for a document without re-OCRing |
616
+
617
+ **Processing options:** `ocr_mode` (fast/balanced/accurate), `chunking_strategy` (hybrid section-aware), `page_range`, `max_pages`, `extras` (track_changes, chart_understanding, extract_links, table_row_bboxes, infographic, new_block_types)
618
+
619
+ **Version tracking:** Re-ingesting a file with a different hash creates a new version linked via `previous_version_id`.
620
+
621
+ </details>
622
+
623
+ <details>
624
+ <summary><strong>Search & Retrieval (12)</strong></summary>
625
+
626
+ | Tool | Description |
627
+ |------|-------------|
628
+ | `ocr_search` | BM25 full-text search (exact terms, codes, IDs) |
629
+ | `ocr_search_semantic` | Vector similarity search (conceptual queries) |
630
+ | `ocr_search_hybrid` | Reciprocal Rank Fusion of BM25 + semantic (recommended) |
631
+ | `ocr_rag_context` | Assemble hybrid search results into a markdown context block for LLMs |
632
+ | `ocr_search_export` | Export results to CSV or JSON |
633
+ | `ocr_benchmark_compare` | Compare search results across databases |
634
+ | `ocr_fts_manage` | Rebuild or check FTS5 index status |
635
+ | `ocr_search_save` | Save a search by name for later retrieval |
636
+ | `ocr_search_saved_list` | List all saved searches |
637
+ | `ocr_search_saved_get` | Retrieve a saved search and its parameters |
638
+ | `ocr_search_saved_execute` | Re-execute a saved search with optional parameter overrides |
639
+ | `ocr_search_cross_db` | BM25 search across all databases simultaneously |
640
+
641
+ **Enhancement options:** Local cross-encoder reranking (`rerank`), query expansion (`expand_query`), auto-routing (`auto_route`), quality-weighted ranking, chunk-level filters (content type, section path, heading, page range, table columns), metadata filters, cluster filtering, group by document, header/footer exclusion, context chunks, VLM image enrichment, provenance inclusion.
642
+
643
+ </details>
644
+
645
+ <details>
646
+ <summary><strong>Document Management (12)</strong></summary>
647
+
648
+ | Tool | Description |
649
+ |------|-------------|
650
+ | `ocr_document_list` | List documents with status filtering |
651
+ | `ocr_document_get` | Full document details (text, chunks, blocks, provenance) |
652
+ | `ocr_document_delete` | Delete document and all derived data (cascade) |
653
+ | `ocr_document_find_similar` | Find similar documents via embedding centroid similarity |
654
+ | `ocr_document_structure` | Analyze document structure (headings, tables, figures, code blocks) |
655
+ | `ocr_document_sections` | Get section hierarchy tree from chunk section paths |
656
+ | `ocr_document_update_metadata` | Batch update document metadata fields |
657
+ | `ocr_document_duplicates` | Detect exact (hash) and near (similarity) duplicates |
658
+ | `ocr_document_export` | Export document to JSON or markdown |
659
+ | `ocr_corpus_export` | Export entire corpus to JSON or markdown archive |
660
+ | `ocr_document_versions` | List all versions of a document by file path |
661
+ | `ocr_document_workflow` | Manage workflow states (draft/review/approved/published/archived) |
662
+
663
+ </details>
664
+
665
+ <details>
666
+ <summary><strong>Provenance (6)</strong></summary>
667
+
668
+ | Tool | Description |
669
+ |------|-------------|
670
+ | `ocr_provenance_get` | Get the complete provenance chain for any item |
671
+ | `ocr_provenance_verify` | Verify integrity through SHA-256 hash chain |
672
+ | `ocr_provenance_export` | Export provenance (JSON, W3C PROV-JSON, CSV) |
673
+ | `ocr_provenance_query` | Query provenance records with 12+ filters |
674
+ | `ocr_provenance_timeline` | View document processing timeline |
675
+ | `ocr_provenance_processor_stats` | Aggregate statistics per processor type |
676
+
677
+ </details>
678
+
679
+ <details>
680
+ <summary><strong>Document Comparison (6)</strong></summary>
681
+
682
+ | Tool | Description |
683
+ |------|-------------|
684
+ | `ocr_document_compare` | Text diff + structural metadata diff + similarity ratio |
685
+ | `ocr_comparison_list` | List comparisons with optional filtering |
686
+ | `ocr_comparison_get` | Full comparison details with diff operations |
687
+ | `ocr_comparison_discover` | Auto-discover similar document pairs for comparison |
688
+ | `ocr_comparison_batch` | Batch compare multiple document pairs |
689
+ | `ocr_comparison_matrix` | NxN pairwise cosine similarity matrix across documents |
690
+
691
+ </details>
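The NxN pairwise cosine similarity matrix behind `ocr_comparison_matrix` can be sketched in a few lines. This is only an illustration: it assumes each document is already represented by a single vector (for example an embedding centroid), uses toy 3-dimensional vectors instead of 768-dimensional ones, and the function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors: list[list[float]]) -> list[list[float]]:
    """Build a symmetric NxN matrix, computing each pair only once."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = cosine_similarity(vectors[i], vectors[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix


# Toy per-document vectors (real embeddings would be 768-dimensional).
doc_vectors = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
matrix = similarity_matrix(doc_vectors)
for row in matrix:
    print([round(value, 3) for value in row])
```

Filling only the upper triangle and mirroring it halves the number of similarity computations, which matters as the document count grows.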
692
+
693
+ <details>
694
+ <summary><strong>Document Clustering (7)</strong></summary>
695
+
696
+ | Tool | Description |
697
+ |------|-------------|
698
+ | `ocr_cluster_documents` | Cluster by semantic similarity (HDBSCAN / agglomerative / k-means) |
699
+ | `ocr_cluster_list` | List clusters with filtering by run ID or tag |
700
+ | `ocr_cluster_get` | Cluster details with member documents |
701
+ | `ocr_cluster_assign` | Auto-assign a document to the nearest cluster |
702
+ | `ocr_cluster_reassign` | Move a document to a different cluster |
703
+ | `ocr_cluster_merge` | Merge two clusters into one |
704
+ | `ocr_cluster_delete` | Delete a clustering run |
705
+
706
+ </details>
707
+
708
+ <details>
709
+ <summary><strong>VLM / Vision Analysis (6)</strong></summary>
710
+
711
+ | Tool | Description |
712
+ |------|-------------|
713
+ | `ocr_vlm_describe` | Describe an image using Gemini 3 Flash (supports thinking mode) |
714
+ | `ocr_vlm_classify` | Classify image type, complexity, text density |
715
+ | `ocr_vlm_process_document` | VLM-process all images in a document |
716
+ | `ocr_vlm_process_pending` | VLM-process all pending images across all documents |
717
+ | `ocr_vlm_analyze_pdf` | Analyze a PDF directly with Gemini 3 Flash (max 20MB) |
718
+ | `ocr_vlm_status` | Service status (API config, rate limits, circuit breaker) |
719
+
720
+ VLM descriptions automatically generate searchable embeddings for semantic image search.
721
+
722
+ </details>
723
+
724
+ <details>
725
+ <summary><strong>Image Operations (11)</strong></summary>
726
+
727
+ | Tool | Description |
728
+ |------|-------------|
729
+ | `ocr_image_extract` | Extract images from a PDF via Datalab OCR |
730
+ | `ocr_image_list` | List images extracted from a document |
731
+ | `ocr_image_get` | Get image details |
732
+ | `ocr_image_stats` | Processing statistics |
733
+ | `ocr_image_delete` | Delete an image record |
734
+ | `ocr_image_delete_by_document` | Delete all images for a document |
735
+ | `ocr_image_reset_failed` | Reset failed images for reprocessing |
736
+ | `ocr_image_pending` | List images pending VLM processing |
737
+ | `ocr_image_search` | Search images with 7 filters (type, size, status, confidence, etc.) |
738
+ | `ocr_image_semantic_search` | Semantic search over VLM image descriptions |
739
+ | `ocr_image_reanalyze` | Re-run VLM analysis with a custom prompt |
740
+
741
+ </details>
742
+
743
+ <details>
744
+ <summary><strong>Image Extraction (3)</strong></summary>
745
+
746
+ | Tool | Description |
747
+ |------|-------------|
748
+ | `ocr_extract_images` | Extract images locally (PyMuPDF for PDF, zipfile for DOCX) |
749
+ | `ocr_extract_images_batch` | Batch extract from all processed documents |
750
+ | `ocr_extraction_check` | Verify Python environment has required packages |
751
+
752
+ </details>
753
+
754
+ <details>
755
+ <summary><strong>Chunks & Pages (4)</strong></summary>
756
+
757
+ | Tool | Description |
758
+ |------|-------------|
759
+ | `ocr_chunk_get` | Get a chunk by ID with full metadata |
760
+ | `ocr_chunk_list` | List chunks with filtering (content type, section path, page, heading) |
761
+ | `ocr_chunk_context` | Get a chunk with N neighboring chunks for context |
762
+ | `ocr_document_page` | Get all chunks for a specific page number (page-by-page navigation) |
763
+
764
+ </details>
765
+
766
+ <details>
767
+ <summary><strong>Embeddings (4)</strong></summary>
768
+
769
+ | Tool | Description |
770
+ |------|-------------|
771
+ | `ocr_embedding_list` | List embeddings with filtering |
772
+ | `ocr_embedding_stats` | Embedding statistics (counts, models, coverage) |
773
+ | `ocr_embedding_get` | Get embedding details by ID |
774
+ | `ocr_embedding_rebuild` | Re-generate embeddings for specific targets |
775
+
776
+ </details>
777
+
778
+ <details>
779
+ <summary><strong>Structured Extraction (4)</strong></summary>
780
+
781
+ | Tool | Description |
782
+ |------|-------------|
783
+ | `ocr_extract_structured` | Extract structured data from OCR'd documents using a JSON schema |
784
+ | `ocr_extraction_list` | List structured extractions for a document |
785
+ | `ocr_extraction_get` | Get a structured extraction by ID |
786
+ | `ocr_extraction_search` | Search across extraction content |
787
+
788
+ </details>
789
+
790
+ <details>
791
+ <summary><strong>Form Fill (2)</strong></summary>
792
+
793
+ | Tool | Description |
794
+ |------|-------------|
795
+ | `ocr_form_fill` | Fill PDF/image forms via Datalab with field name-value mapping |
796
+ | `ocr_form_fill_status` | Form fill operation status and results |
797
+
798
+ </details>
799
+
800
+ <details>
801
+ <summary><strong>File Management (6)</strong></summary>
802
+
803
+ | Tool | Description |
804
+ |------|-------------|
805
+ | `ocr_file_upload` | Upload to Datalab cloud (deduplicates by SHA-256) |
806
+ | `ocr_file_list` | List uploaded files with duplicate detection |
807
+ | `ocr_file_get` | File metadata |
808
+ | `ocr_file_download` | Get download URL |
809
+ | `ocr_file_delete` | Delete file record |
810
+ | `ocr_file_ingest_uploaded` | Bridge uploaded files into the document pipeline |
811
+
812
+ </details>
813
+
814
+ <details>
815
+ <summary><strong>Tags (6)</strong></summary>
816
+
817
+ | Tool | Description |
818
+ |------|-------------|
819
+ | `ocr_tag_create` | Create a tag with optional color and description |
820
+ | `ocr_tag_list` | List tags with usage counts |
821
+ | `ocr_tag_apply` | Apply a tag to any entity (document, chunk, image, cluster, etc.) |
822
+ | `ocr_tag_remove` | Remove a tag from an entity |
823
+ | `ocr_tag_search` | Find entities by tag name |
824
+ | `ocr_tag_delete` | Delete a tag and all associations |
825
+
826
+ </details>
827
+
828
+ <details>
829
+ <summary><strong>Intelligence & Navigation (4)</strong></summary>
830
+
831
+ | Tool | Description |
832
+ |------|-------------|
833
+ | `ocr_guide` | AI agent navigation -- inspects system state and recommends next tools/actions |
834
+ | `ocr_document_tables` | Extract and parse tables from OCR JSON blocks |
835
+ | `ocr_document_recommend` | Get related document recommendations via embedding similarity |
836
+ | `ocr_document_extras` | Access OCR extras data (charts, links, tracked changes, infographics) |
837
+
838
+ </details>
839
+
840
+ <details>
841
+ <summary><strong>Evaluation (3)</strong></summary>
842
+
843
+ | Tool | Description |
844
+ |------|-------------|
845
+ | `ocr_evaluate_single` | Evaluate a single image with VLM |
846
+ | `ocr_evaluate_document` | Evaluate all images in a document |
847
+ | `ocr_evaluate_pending` | Evaluate all pending images system-wide |
848
+
849
+ </details>
850
+
851
+ <details>
852
+ <summary><strong>Reports & Analytics (9)</strong></summary>
853
+
854
+ | Tool | Description |
855
+ |------|-------------|
856
+ | `ocr_evaluation_report` | Comprehensive OCR + VLM metrics report (markdown) |
857
+ | `ocr_document_report` | Single document report (images, extractions, comparisons, clusters) |
858
+ | `ocr_quality_summary` | Quality summary across all documents |
859
+ | `ocr_cost_summary` | Cost analytics by document, mode, month, or total |
860
+ | `ocr_pipeline_analytics` | Pipeline throughput, duration, per-mode/type breakdown |
861
+ | `ocr_corpus_profile` | Corpus content profile (doc sizes, content types, section frequency) |
862
+ | `ocr_error_analytics` | Error/recovery analytics and failure rates |
863
+ | `ocr_provenance_bottlenecks` | Processing bottleneck analysis by processor |
864
+ | `ocr_quality_trends` | Quality trends over time (hourly/daily/weekly/monthly) |
865
+
866
+ </details>
867
+
868
+ <details>
869
+ <summary><strong>Timeline & Analytics (2)</strong></summary>
870
+
871
+ | Tool | Description |
872
+ |------|-------------|
873
+ | `ocr_timeline_analytics` | Volume metrics over time |
874
+ | `ocr_throughput_analytics` | Processing throughput per time bucket |
875
+
876
+ </details>
877
+
878
+ <details>
879
+ <summary><strong>Health & Diagnostics (1)</strong></summary>
880
+
881
+ | Tool | Description |
882
+ |------|-------------|
883
+ | `ocr_health_check` | Detect data integrity gaps (missing embeddings, orphaned chunks, etc.) with optional auto-fix |
884
+
885
+ </details>
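The kinds of gaps `ocr_health_check` looks for (missing embeddings, orphaned records) can be expressed as plain SQL joins. The sketch below uses an in-memory SQLite database with a deliberately simplified schema; the column names `id` and `chunk_id` are assumptions and may not match the package's actual tables.

```python
import sqlite3

# Simplified, hypothetical schema: only the columns needed for the checks.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE chunks (id TEXT PRIMARY KEY, text TEXT NOT NULL);
    CREATE TABLE embeddings (id TEXT PRIMARY KEY, chunk_id TEXT);
    INSERT INTO chunks VALUES ('c1', 'first'), ('c2', 'second'), ('c3', 'third');
    INSERT INTO embeddings VALUES ('e1', 'c1'), ('e2', 'c3'), ('e3', 'c9');
    """
)

# Chunks that have no embedding row pointing at them.
chunks_missing_embeddings = [
    row[0]
    for row in conn.execute(
        """
        SELECT c.id FROM chunks c
        LEFT JOIN embeddings e ON e.chunk_id = c.id
        WHERE e.id IS NULL
        ORDER BY c.id
        """
    )
]

# Embeddings whose chunk no longer exists.
orphaned_embeddings = [
    row[0]
    for row in conn.execute(
        """
        SELECT e.id FROM embeddings e
        LEFT JOIN chunks c ON c.id = e.chunk_id
        WHERE c.id IS NULL
        ORDER BY e.id
        """
    )
]

print("chunks missing embeddings:", chunks_missing_embeddings)
print("orphaned embeddings:", orphaned_embeddings)
```

An auto-fix pass could then re-embed the first list and delete the second, though how the real tool repairs each case is not described here.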
886
+
887
+ <details>
888
+ <summary><strong>Configuration (2)</strong></summary>
889
+
890
+ | Tool | Description |
891
+ |------|-------------|
892
+ | `ocr_config_get` | Get current system configuration |
893
+ | `ocr_config_set` | Update configuration at runtime |
894
+
895
+ </details>
896
+
897
+ ---
898
+
899
+ ## Processing Pipeline
900
+
901
+ ```
902
+ File on disk
903
+
904
+ ├─ 1. REGISTER ──► documents table (status: pending)
905
+ │ ├─ file_hash computed (SHA-256)
906
+ │ ├─ version detection (new vs re-ingested)
907
+ │ └─ provenance record (type: DOCUMENT, depth: 0)
908
+
909
+ ├─ 2. OCR ──────► ocr_results table
910
+ │ ├─ Datalab API call (fast/balanced/accurate)
911
+ │ ├─ extracted_text (markdown)
912
+ │ ├─ json_blocks (structural hierarchy)
913
+ │ ├─ extras_json (charts, links, tracked changes)
914
+ │ ├─ page_offsets (page boundaries)
915
+ │ └─ provenance record (type: OCR_RESULT, depth: 1)
916
+
917
+ ├─ 3. CHUNK ────► chunks table
918
+ │ ├─ Hybrid section-aware chunking
919
+ │ │ ├─ Text + heading normalization
920
+ │ │ ├─ Markdown structure parsing
921
+ │ │ ├─ Atomic region detection (tables, figures)
922
+ │ │ ├─ Heading-only chunk merging
923
+ │ │ ├─ Near-duplicate deduplication
924
+ │ │ └─ Header/footer auto-tagging
925
+ │ ├─ 2000 chars with 10% overlap
926
+ │ ├─ section_path, heading_context, content_types
927
+ │ ├─ page_number assignment via page separators
928
+ │ └─ provenance records (type: CHUNK, depth: 2)
929
+
930
+ ├─ 4. EMBED ────► embeddings + vec_embeddings tables
931
+ │ ├─ Nomic embed v1.5 (768-dim, local GPU)
932
+ │ ├─ "search_document: " prefix
933
+ │ └─ provenance records (type: EMBEDDING, depth: 3)
934
+
935
+ ├─ 5. FTS ──────► fts_index (FTS5 virtual table)
936
+ │ └─ External content index on chunk text
937
+
938
+ ├─ 6. IMAGES ───► images table
939
+ │ │ ├─ PyMuPDF extraction (PDF) / zip extraction (DOCX)
940
+ │ │ ├─ Image optimization (resize, format)
941
+ │ │ └─ provenance records (type: IMAGE, depth: 2)
942
+ │ │
943
+ │ └─ 7. VLM ──► images updated + embeddings table
944
+ │ ├─ Gemini 3 Flash multimodal analysis
945
+ │ ├─ Description, structured data, confidence
946
+ │ ├─ VLM description embedding generated (searchable)
947
+ │ └─ provenance records (type: VLM_DESCRIPTION, depth: 3→4)
948
+
949
+ ├─ 8. AUTO-CLUSTER ──► clusters table (when configured)
950
+ │ └─ Triggers when threshold met and >1hr since last run
951
+
952
+ └─ documents.status = 'complete'
953
+ ```
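The CHUNK step above lists a 2000-character size with 10% overlap. The following sketch shows only that sliding-window arithmetic; it is not the hybrid section-aware chunker (no heading parsing, atomic regions, merging, or deduplication), and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap_ratio: float = 0.10) -> list[str]:
    """Split text into fixed-size windows where each window repeats the
    tail of the previous one (overlap = chunk_size * overlap_ratio)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)   # 200 chars for the defaults
    step = max(1, chunk_size - overlap)         # always advance by at least 1

    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# 5000 characters of sample text.
sample = "".join(str(i % 10) for i in range(5000))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

With the defaults, consecutive chunks share 200 characters, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.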
954
+
955
+ ---
956
+
957
+ ## Data Architecture (Schema v31)
958
+
959
+ 18 core tables + FTS5 virtual tables + vec_embeddings:
960
+
961
+ | Table | Purpose | Key Fields |
962
+ |-------|---------|------------|
963
+ | `documents` | Source files | file_hash, status, page_count, metadata |
964
+ | `ocr_results` | Extracted text | extracted_text, json_blocks, quality_score, cost |
965
+ | `chunks` | Text segments | text (2000 chars), section_path, heading_context, content_types |
966
+ | `embeddings` | 768-dim vectors | original_text, model_name, source metadata |
967
+ | `images` | Extracted images | extracted_path, bbox, VLM description, confidence |
968
+ | `extractions` | Structured data | schema_json, extraction_json |
969
+ | `form_fills` | Form filling results | field mapping, output path |
970
+ | `comparisons` | Document pair diffs | similarity_ratio, diff_operations |
971
+ | `clusters` | Document groupings | label, classification_tag, coherence_score |
972
+ | `document_clusters` | Cluster membership | document_id, cluster_id |
973
+ | `provenance` | Full audit trail | type, processor, chain_depth, content_hash |
974
+ | `tags` | Cross-entity labels | name, color, description |
975
+ | `entity_tags` | Tag associations | tag_id, entity_type, entity_id |
976
+ | `saved_searches` | Search persistence | name, search_type, parameters |
977
+ | `uploaded_files` | Cloud file tracking | datalab_id, file_hash, upload status |
978
+ | `database_metadata` | DB-level settings | key-value pairs |
979
+ | `schema_version` | Migration tracking | version, applied_at |
980
+ | `fts_index_metadata` | FTS index state | last_rebuild, chunk count |
981
+
982
+ ---
983
+
984
+ ## AI/ML Capabilities
985
+
986
+ | Capability | Technology | Tool(s) |
987
+ |-----------|-----------|---------|
988
+ | Document OCR | Datalab API (3 modes) | `ocr_process_pending`, `ocr_convert_raw` |
989
+ | Text Embeddings | Nomic embed v1.5 (local GPU) | Auto during ingestion, `ocr_reembed_document` |
990
+ | Image Description | Gemini 3 Flash | `ocr_vlm_describe`, `ocr_vlm_process_*` |
991
+ | Image Classification | Gemini 3 Flash | `ocr_vlm_classify` |
992
+ | Search Reranking | Python cross-encoder | `rerank` parameter on all search tools (local, no API) |
993
+ | Query Expansion | Heuristic synonyms | `expand_query` parameter |
994
+ | Query Classification | Heuristic patterns | `auto_route` parameter (hybrid search) |
995
+ | Document Clustering | scikit-learn | `ocr_cluster_documents` (HDBSCAN/agglomerative/k-means) |
996
+ | Auto-Clustering | scikit-learn | Configurable auto-trigger after `ocr_process_pending` |
997
+ | Similarity Detection | Embedding centroids | `ocr_document_find_similar`, `ocr_document_recommend` |
998
+ | Duplicate Detection | File hash + embedding similarity | `ocr_document_duplicates` |
999
+ | Comparison Discovery | Embedding similarity | `ocr_comparison_discover` |
1000
+ | Comparison Matrix | Pairwise cosine similarity | `ocr_comparison_matrix` |
1001
+ | Text Comparison | npm diff (Sorensen-Dice) | `ocr_document_compare` |
1002
+ | RAG Context Assembly | Hybrid search + markdown | `ocr_rag_context` |
1003
+ | Semantic Image Search | VLM description embeddings | `ocr_image_semantic_search` |
1004
+ | PDF Direct Analysis | Gemini 3 Flash multimodal | `ocr_vlm_analyze_pdf` |
1005
+ | Table Extraction | OCR JSON block parsing | `ocr_document_tables` |
1006
+ | Cross-DB Search | BM25 across all databases | `ocr_search_cross_db` |
1007
+ | Chunk Deduplication | Fuzzy text matching | Automatic during chunking pipeline |
1008
+ | AI Agent Navigation | System state analysis | `ocr_guide` |
1009
+ | Health Diagnostics | Data integrity analysis | `ocr_health_check` |
1010
+
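The table above names Sorensen-Dice for text comparison. As a rough illustration of that coefficient, the sketch below computes it over character bigrams; whether the package applies it to characters, words, or diff output is not stated in this README, so treat the tokenization and the function names as assumptions.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Multiset of character bigrams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 2] for i in range(len(normalized) - 1))


def dice_coefficient(a: str, b: str) -> float:
    """Sorensen-Dice: 2 * |A ∩ B| / (|A| + |B|) over bigram multisets."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Neither string has a bigram (empty or single character).
        return 1.0 if a == b else 0.0
    overlap = sum((grams_a & grams_b).values())
    return 2 * overlap / total


print(dice_coefficient("night", "nacht"))
print(dice_coefficient("Invoice 2024", "invoice  2024"))
```

The score ranges from 0.0 (no shared bigrams) to 1.0 (identical bigram multisets), which makes it convenient as a similarity ratio alongside a textual diff.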
1011
+ ---
1012
+
1013
+ ## Development
1014
+
1015
+ ```bash
1016
+ npm run build # Build TypeScript
1017
+ npm test # All tests (2,639 across 115 test suites)
1018
+ npm run test:unit # Unit tests only
1019
+ npm run test:integration # Integration tests only
1020
+ npm run lint:all # TypeScript + Python linting
1021
+ npm run check # typecheck + lint + test
1022
+ ```
1023
+
1024
+ ### Project Structure
1025
+
1026
+ ```
1027
+ src/
1028
+ index.ts # MCP server entry point (tool registration, lifecycle)
1029
+ bin.ts # CLI entry point
1030
+ tools/ # 22 tool files + shared.ts
1031
+ database.ts # Database CRUD (5 tools)
1032
+ ingestion.ts # Ingest + process pipeline (9 tools)
1033
+ search.ts # BM25, semantic, hybrid, RAG, cross-DB (12 tools)
1034
+ documents.ts # Document ops, versions, workflow (12 tools)
1035
+ provenance.ts # Audit trail, verification (6 tools)
1036
+ comparison.ts # Diff, batch compare, matrix (6 tools)
1037
+ clustering.ts # Cluster, reassign, merge (7 tools)
1038
+ vlm.ts # Gemini vision analysis (6 tools)
1039
+ images.ts # Image ops, semantic search (11 tools)
1040
+ reports.ts # Analytics + quality reports (9 tools)
1041
+ tags.ts # Cross-entity tagging (6 tools)
1042
+ intelligence.ts # AI guide, tables, recommendations, extras (4 tools)
1043
+ embeddings.ts # Embedding management (4 tools)
1044
+ extraction-structured.ts # JSON schema extraction (4 tools)
1045
+ extraction.ts # Local image extraction (3 tools)
1046
+ file-management.ts # Cloud file ops (6 tools)
1047
+ chunks.ts # Chunk inspection + page navigation (4 tools)
1048
+ timeline.ts # Time-series analytics (2 tools)
1049
+ form-fill.ts # PDF form filling (2 tools)
1050
+ evaluation.ts # VLM evaluation (3 tools)
1051
+ config.ts # Runtime config (2 tools)
1052
+ health.ts # Data integrity check (1 tool)
1053
+ shared.ts # Shared utilities (formatResponse, handleError, etc.)
1054
+ services/ # Core services (11 domains, 64 files)
1055
+ chunking/ # Hybrid section-aware chunking pipeline
1056
+ chunker.ts # Main chunking orchestrator
1057
+ markdown-parser.ts
1058
+ heading-normalizer.ts
1059
+ text-normalizer.ts
1060
+ chunk-merger.ts
1061
+ chunk-deduplicator.ts
1062
+ json-block-analyzer.ts
1063
+ search/ # BM25, semantic, hybrid, fusion, reranker (AI + local), query expansion/classification, quality weighting
1064
+ gemini/ # Gemini client with caching, circuit breaker, rate limiting
1065
+ storage/ # SQLite database + migrations (19 operation files)
1066
+ ... # OCR, embedding, VLM, provenance, comparison, clustering, images
1067
+ models/ # Zod schemas and TypeScript types
1068
+ utils/ # Hash, validation, path sanitization
1069
+ server/ # Server state, types, errors (14 custom error classes)
1070
+ python/ # 9 Python workers + GPU utils
1071
+ tests/
1072
+ unit/ # Unit tests
1073
+ integration/ # Integration tests
1074
+ e2e/ # End-to-end pipeline tests
1075
+ manual/ # Verification tests
1076
+ benchmark/ # Chunking benchmark
1077
+ fixtures/ # Test fixtures and sample documents
1078
+ docs/ # System documentation and reports
1079
+ ```
1080
+
1081
+ ### Key Metrics
1082
+
1083
+ | Metric | Value |
1084
+ |--------|-------|
1085
+ | MCP tools | 141 |
1086
+ | Tool modules | 22 |
1087
+ | Database tables | 18 core + FTS + vec |
1088
+ | Schema version | v32 (32 migrations) |
1089
+ | Database operation files | 19 |
1090
+ | Service domains | 11 |
1091
+ | Test suites | 115 |
1092
+ | Tests passing | 2,639 |
1093
+ | TypeScript source | ~46,000 lines |
1094
+ | Python source | ~4,700 lines |
1095
+ | Test code | ~65,000 lines |
1096
+ | Production deps | 9 packages |
1097
+ | Python workers | 9 |
1098
+ | External APIs | 3 (Datalab, Gemini, Nomic local) |
1099
+ | Custom error classes | 14 |
1100
+ | File types supported | 18 |
1101
+
1102
+ ---
1103
+
1104
+ ## Troubleshooting
1105
+
1106
+ <details>
1107
+ <summary><strong>sqlite-vec loading errors</strong></summary>
1108
+
1109
+ Run `npm install` -- sqlite-vec uses a prebuilt binary that must match your platform and Node.js version.
1110
+ </details>
1111
+
1112
+ <details>
1113
+ <summary><strong>Python not found (Windows)</strong></summary>
1114
+
1115
+ The server auto-detects `python` vs `python3`. Ensure Python is on your PATH: `python --version`.
1116
+ </details>
1117
+
1118
+ <details>
1119
+ <summary><strong>GPU not detected</strong></summary>
1120
+
1121
+ ```bash
1122
+ python -c "import torch; print('CUDA:', torch.cuda.is_available()); print('MPS:', hasattr(torch.backends, 'mps') and torch.backends.mps.is_available())"
1123
+ ```
1124
+ If both print `False` on a machine with an NVIDIA GPU, install a CUDA build of PyTorch, for example the CUDA 12.4 wheels: `pip install torch --index-url https://download.pytorch.org/whl/cu124`
1125
+ </details>
1126
+
1127
+ <details>
1128
+ <summary><strong>Embedding model not found</strong></summary>
1129
+
1130
+ Download the model (see [Installation](#installation)). Verify `config.json`, `model.safetensors`, and `tokenizer.json` are present in the model directory.
1131
+ </details>
1132
+
1133
+ <details>
1134
+ <summary><strong>API key warnings at startup</strong></summary>
1135
+
1136
+ Copy `.env.example` to `.env` and fill in your `DATALAB_API_KEY` and `GEMINI_API_KEY`.
1137
+ </details>
1138
+
1139
+ <details>
1140
+ <summary><strong>Data integrity issues</strong></summary>
1141
+
1142
+ Run `ocr_health_check { fix: true }` to detect and auto-fix common issues like chunks missing embeddings or orphaned records.
1143
+ </details>
1144
+
1145
+ ---
1146
+
1147
+ ## License
1148
+
1149
+ This project uses a **dual-license** model:
1150
+
1151
+ - **Free for non-commercial use** -- personal projects, academic research, education, non-profits, evaluation, and contributions to this project are all permitted at no cost.
1152
+ - **Commercial license required for revenue-generating use** -- if you use this software to make money (paid services, SaaS, internal tools at for-profit companies, etc.), you must obtain a commercial license from the copyright holder. Terms are negotiated case-by-case and may include revenue sharing or flat-rate arrangements.
1153
+
1154
+ See [LICENSE](LICENSE) for full details. For commercial licensing inquiries, contact Chris Royse at [chrisroyseai@gmail.com](mailto:chrisroyseai@gmail.com) or via [GitHub](https://github.com/ChrisRoyse).