biblicus 0.7.0__py3-none-any.whl → 0.9.0__py3-none-any.whl

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: biblicus
- Version: 0.7.0
+ Version: 0.9.0
  Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
  License: MIT
  Requires-Python: >=3.9
@@ -25,8 +25,23 @@ Requires-Dist: unstructured>=0.12.0; extra == "unstructured"
  Requires-Dist: python-docx>=1.1.0; extra == "unstructured"
  Provides-Extra: ocr
  Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr"
+ Provides-Extra: paddleocr
+ Requires-Dist: paddleocr>=2.7.0; extra == "paddleocr"
+ Requires-Dist: paddlepaddle>=2.5.0; extra == "paddleocr"
+ Requires-Dist: huggingface_hub>=0.20.0; extra == "paddleocr"
+ Requires-Dist: requests>=2.28.0; extra == "paddleocr"
  Provides-Extra: markitdown
  Requires-Dist: markitdown[all]>=0.1.0; python_version >= "3.10" and extra == "markitdown"
+ Provides-Extra: deepgram
+ Requires-Dist: deepgram-sdk>=3.0; extra == "deepgram"
+ Provides-Extra: docling
+ Requires-Dist: docling[vlm]>=2.0.0; extra == "docling"
+ Provides-Extra: docling-mlx
+ Requires-Dist: docling[mlx-vlm]>=2.0.0; extra == "docling-mlx"
+ Provides-Extra: topic-modeling
+ Requires-Dist: bertopic>=0.15.0; extra == "topic-modeling"
+ Provides-Extra: datasets
+ Requires-Dist: datasets>=2.18.0; extra == "datasets"
  Dynamic: license-file
 
  # Biblicus
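
The hunk above adds several `Provides-Extra` entries and the `Requires-Dist` lines gated on them. As a rough illustration of how those headers relate, the sketch below parses an excerpt of the 0.9.0 METADATA with the standard library's RFC 822 parser and groups requirements by extra. The function name `extras_to_requirements` is hypothetical, and the substring match on `extra == "..."` is a simplification that assumes markers shaped like the ones in this diff; it is not a full environment-marker parser.

```python
from email.parser import Parser

# Excerpt of the 0.9.0 METADATA headers shown in the hunk above.
METADATA_EXCERPT = """\
Metadata-Version: 2.4
Name: biblicus
Version: 0.9.0
Provides-Extra: paddleocr
Requires-Dist: paddleocr>=2.7.0; extra == "paddleocr"
Requires-Dist: paddlepaddle>=2.5.0; extra == "paddleocr"
Provides-Extra: deepgram
Requires-Dist: deepgram-sdk>=3.0; extra == "deepgram"
Provides-Extra: topic-modeling
Requires-Dist: bertopic>=0.15.0; extra == "topic-modeling"
"""


def extras_to_requirements(metadata_text):
    """Map each declared extra to the requirement strings gated on it.

    Hypothetical helper: it only recognizes markers that contain the literal
    text extra == "<name>", which is how the markers appear in this diff.
    """
    message = Parser().parsestr(metadata_text)
    mapping = {name: [] for name in message.get_all("Provides-Extra", [])}
    for line in message.get_all("Requires-Dist", []):
        requirement, _, marker = line.partition(";")
        for name in mapping:
            if f'extra == "{name}"' in marker:
                mapping[name].append(requirement.strip())
    return mapping


extras = extras_to_requirements(METADATA_EXCERPT)
for name, requirements in extras.items():
    print(name, requirements)
```

For an installed distribution, `importlib.metadata.metadata("biblicus")` returns a message object that supports the same `get_all` calls, so the grouping logic would apply unchanged.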
@@ -160,9 +175,14 @@ python3 -m pip install biblicus
  Some extractors are optional so the base install stays small.
 
  - Optical character recognition for images: `python3 -m pip install "biblicus[ocr]"`
- - Speech to text transcription: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+ - Advanced optical character recognition with PaddleOCR: `python3 -m pip install "biblicus[paddleocr]"`
+ - Document understanding with Docling VLM: `python3 -m pip install "biblicus[docling]"`
+ - Document understanding with Docling VLM and MLX acceleration: `python3 -m pip install "biblicus[docling-mlx]"`
+ - Speech to text transcription with OpenAI: `python3 -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+ - Speech to text transcription with Deepgram: `python3 -m pip install "biblicus[deepgram]"` (requires a Deepgram API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
  - Broad document parsing fallback: `python3 -m pip install "biblicus[unstructured]"`
  - MarkItDown document conversion (requires Python 3.10 or higher): `python3 -m pip install "biblicus[markitdown]"`
+ - Topic modeling analysis with BERTopic: `python3 -m pip install "biblicus[topic-modeling]"`
 
  ## Quick start
 
@@ -420,6 +440,7 @@ The documents below follow the pipeline from raw items to model context:
 
  - [Corpus][corpus]
  - [Text extraction][text-extraction]
+ - [Speech to text][speech-to-text]
  - [Knowledge base][knowledge-base]
  - [Backends][backends]
  - [Context packs][context-packs]
@@ -468,27 +489,107 @@ corpus/
  Two backends are included.
 
  - `scan` is a minimal baseline that scans raw items directly.
- - `sqlite-full-text-search` is a practical baseline that builds a full text search index in Sqlite.
+ - `sqlite-full-text-search` is a practical baseline that builds a full text search index in SQLite.
+
+ For detailed documentation including configuration options, performance characteristics, and usage examples, see the [Backend Reference][backend-reference].
 
  ## Extraction backends
 
- These extractors are built in. Optional ones require extra dependencies.
+ These extractors are built in. Optional ones require extra dependencies. See [text extraction documentation][text-extraction] for details.
 
- - `pass-through-text` reads text items and strips Markdown front matter.
- - `metadata-text` turns catalog metadata into a small text artifact.
- - `pdf-text` extracts text from Portable Document Format items with `pypdf`.
- - `select-text` chooses one prior extraction result in a pipeline.
- - `select-longest-text` chooses the longest prior extraction result.
- - `ocr-rapidocr` does optical character recognition on images (optional).
- - `stt-openai` performs speech to text on audio (optional).
- - `unstructured` provides broad document parsing (optional).
- - `markitdown` converts many formats into Markdown-like text (optional).
+ ### Text and document extraction
 
- ## Integration corpus and evaluation dataset
+ - [`pass-through-text`](docs/extractors/text-document/pass-through.md) reads text items and strips Markdown front matter.
+ - [`metadata-text`](docs/extractors/text-document/metadata.md) turns catalog metadata into a small text artifact.
+ - [`pdf-text`](docs/extractors/text-document/pdf.md) extracts text from Portable Document Format items with `pypdf`.
+ - [`unstructured`](docs/extractors/text-document/unstructured.md) provides broad document parsing (optional).
+ - [`markitdown`](docs/extractors/text-document/markitdown.md) converts many formats into Markdown-like text (optional).
+
+ ### Optical character recognition
+
+ - [`ocr-rapidocr`](docs/extractors/ocr/rapidocr.md) does optical character recognition on images (optional).
+ - [`ocr-paddleocr-vl`](docs/extractors/ocr/paddleocr-vl.md) does advanced optical character recognition with PaddleOCR vision-language model (optional).
+
+ ### Vision-language models
+
+ - [`docling-smol`](docs/extractors/vlm-document/docling-smol.md) uses the SmolDocling-256M vision-language model for fast document understanding (optional).
+ - [`docling-granite`](docs/extractors/vlm-document/docling-granite.md) uses the Granite Docling-258M vision-language model for high-accuracy extraction (optional).
+
+ ### Speech to text
+
+ - [`stt-openai`](docs/extractors/speech-to-text/openai.md) performs speech to text on audio using OpenAI (optional).
+ - [`stt-deepgram`](docs/extractors/speech-to-text/deepgram.md) performs speech to text on audio using Deepgram (optional).
 
- Use `scripts/download_wikipedia.py` to download a small integration corpus from Wikipedia when running tests or demos. The repository does not include that content.
+ ### Pipeline utilities
+
+ - [`select-text`](docs/extractors/pipeline-utilities/select-text.md) chooses one prior extraction result in a pipeline.
+ - [`select-longest-text`](docs/extractors/pipeline-utilities/select-longest.md) chooses the longest prior extraction result.
+ - [`select-override`](docs/extractors/pipeline-utilities/select-override.md) chooses the last extraction result for matching media types in a pipeline.
+ - [`select-smart-override`](docs/extractors/pipeline-utilities/select-smart-override.md) intelligently chooses between extraction results based on confidence and content quality.
+
+ For detailed documentation on all extractors, see the [Extractor Reference][extractor-reference].
+
+ ## Topic modeling analysis
+
+ Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Topic modeling is the first
+ analysis backend. It reads an extraction run, optionally applies an LLM-driven extraction pass, applies lexical
+ processing, runs BERTopic, and optionally applies an LLM fine-tuning pass to label topics. The output is structured
+ JavaScript Object Notation.
+
+ See `docs/ANALYSIS.md` for the analysis pipeline overview and `docs/TOPIC_MODELING.md` for topic modeling details.
+
+ Run a topic analysis using a recipe file:
+
+ ```
+ biblicus analyze topics --corpus corpora/example --recipe recipes/topic-modeling.yml --extraction-run pipeline:<run_id>
+ ```
+
+ If `--extraction-run` is omitted, Biblicus uses the most recent extraction run and emits a warning about
+ reproducibility. The analysis output is stored under:
+
+ ```
+ .biblicus/runs/analysis/topic-modeling/<run_id>/output.json
+ ```
+
+ Minimal recipe example:
+
+ ```yaml
+ schema_version: 1
+ text_source:
+   sample_size: 200
+ llm_extraction:
+   enabled: false
+ lexical_processing:
+   enabled: true
+   lowercase: true
+   strip_punctuation: false
+   collapse_whitespace: true
+ bertopic_analysis:
+   parameters:
+     min_topic_size: 8
+     nr_topics: 10
+   vectorizer:
+     ngram_range: [1, 2]
+     stop_words: english
+ llm_fine_tuning:
+   enabled: false
+ ```
+
+ LLM extraction and fine-tuning require `biblicus[openai]` and a configured OpenAI API key.
+ Recipe files are validated strictly against the topic modeling schema, so type mismatches or unknown fields are errors.
+ AG News integration runs require `biblicus[datasets]` in addition to `biblicus[topic-modeling]`.
+
+ For a repeatable, real-world integration run that downloads AG News and executes topic modeling, use:
+
+ ```
+ python3 scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
+ ```
+
+ See `docs/TOPIC_MODELING.md` for parameter examples and per-topic output behavior.
+
+ ## Integration corpus and evaluation dataset
 
- The dataset file `datasets/wikipedia_mini.json` provides a small evaluation set that matches the integration corpus.
+ Use `scripts/download_ag_news.py` to download the AG News dataset when running topic modeling demos. The repository does not include that content.
 
  Use `scripts/download_pdf_samples.py` to download a small Portable Document Format integration corpus when running tests or demos. The repository does not include that content.
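
The added README text says recipes are validated strictly (unknown fields and type mismatches are errors) and that a lexical processing step runs before BERTopic. The sketch below illustrates both ideas for the `lexical_processing` section only. It is not the Biblicus implementation: the function names, the error types, and the choice to treat omitted options as `false` are assumptions made for illustration; only the four option names come from the recipe in the diff.

```python
import re
import string

# Option names taken from the recipe's lexical_processing section;
# everything else in this block is a hypothetical illustration.
LEXICAL_OPTION_TYPES = {
    "enabled": bool,
    "lowercase": bool,
    "strip_punctuation": bool,
    "collapse_whitespace": bool,
}


def validate_lexical_options(options):
    """Reject unknown fields and wrongly typed values (strict validation)."""
    unknown = sorted(set(options) - set(LEXICAL_OPTION_TYPES))
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    for key, value in options.items():
        expected = LEXICAL_OPTION_TYPES[key]
        if not isinstance(value, expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    return options


def apply_lexical_processing(text, options):
    """Apply the enabled steps in a fixed order; omitted options count as False."""
    options = validate_lexical_options(options)
    if not options.get("enabled", False):
        return text
    if options.get("lowercase", False):
        text = text.lower()
    if options.get("strip_punctuation", False):
        text = text.translate(str.maketrans("", "", string.punctuation))
    if options.get("collapse_whitespace", False):
        text = re.sub(r"\s+", " ", text).strip()
    return text


recipe_section = {
    "enabled": True,
    "lowercase": True,
    "strip_punctuation": False,
    "collapse_whitespace": True,
}
processed = apply_lexical_processing("Stocks  rally,\n\tTech leads!", recipe_section)
print(processed)  # stocks rally, tech leads!
```

Passing a misspelled key such as `lower_case` raises `ValueError` in this sketch, which mirrors the "unknown fields are errors" behavior the README describes without claiming to match its actual error messages.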
 
@@ -539,6 +640,9 @@ License terms are in `LICENSE`.
  [corpus]: docs/CORPUS.md
  [knowledge-base]: docs/KNOWLEDGE_BASE.md
  [text-extraction]: docs/EXTRACTION.md
+ [extractor-reference]: docs/extractors/index.md
+ [backend-reference]: docs/backends/index.md
+ [speech-to-text]: docs/STT.md
  [user-configuration]: docs/USER_CONFIGURATION.md
  [backends]: docs/BACKENDS.md
  [context-packs]: docs/CONTEXT_PACK.md
@@ -1,49 +1,62 @@
- biblicus/__init__.py,sha256=zpBSDOPXCoqBcc2QNjRWf_4dD4FKnBgUDl3j_ZG2_cA,495
+ biblicus/__init__.py,sha256=x14R9a_6nu3qTg2F-sUOaS_ZepXNBPpa3nsEgp4PZhg,495
  biblicus/__main__.py,sha256=ipfkUoTlocVnrQDM69C7TeBqQxmHVeiWMRaT3G9rtnk,117
- biblicus/cli.py,sha256=hBau464XNdSGdWeOCE2Q7dm0P8I4sR0W-NgVT0wPmh4,27724
- biblicus/constants.py,sha256=R6fZDoLVMCwgKvTaxEx7G0CstwHGaUTlW9MsmNLDZ44,269
+ biblicus/cli.py,sha256=GVmZlCSZPUMBbq69yjN16f4xNw71edlFbGPHX3300oI,32643
+ biblicus/constants.py,sha256=-JaHI3Dngte2drawx93cGWxFVobbgIuaVhmjUJpf4GI,333
  biblicus/context.py,sha256=qnT9CH7_ldoPcg-rxnUOtRhheOmpDAbF8uqhf8OdjC4,5832
- biblicus/corpus.py,sha256=gF1RNl6fdz7wplzpHEIkEBkhYxHgKTKguBR_kD9IgUw,54109
+ biblicus/corpus.py,sha256=Pq2OvXom7giwD1tuWoM3RhFnak5YFx5bCh6JTd6JYtI,55554
  biblicus/crawl.py,sha256=n8rXBMnziBK9vtKQQCXYOpBzqsPCswj2PzVJUb370KY,6250
  biblicus/errors.py,sha256=uMajd5DvgnJ_-jq5sbeom1GV8DPUc-kojBaECFi6CsY,467
  biblicus/evaluation.py,sha256=5xWpb-8f49Osh9aHzo1ab3AXOmls3Imc5rdnEC0pN-8,8143
  biblicus/evidence_processing.py,sha256=EMv1AkV_Eufk-poBz9nRR1dZgC-QewvI-NrULBUGVGA,6074
- biblicus/extraction.py,sha256=VEjBjIpaBboftGgEcpDj7z7um41e5uDZpP_7acQg7fw,19448
+ biblicus/extraction.py,sha256=20lRxz6Te6IcA4d-rfT4qjJtgRG_c4YvrqfXNA7EYfs,19738
  biblicus/frontmatter.py,sha256=JOGjIDzbbOkebQw2RzA-3WDVMAMtJta2INjS4e7-LMg,2463
  biblicus/hook_logging.py,sha256=IMvde-JhVWrx9tNz3eDJ1CY_rr5Sj7DZ2YNomYCZbz0,5366
  biblicus/hook_manager.py,sha256=ZCAkE5wLvn4lnQz8jho_o0HGEC9KdQd9qitkAEUQRcw,6997
  biblicus/hooks.py,sha256=OHQOmOi7rUcQqYWVeod4oPe8nVLepD7F_SlN7O_-BsE,7863
  biblicus/ignore.py,sha256=fyjt34E6tWNNrm1FseOhgH2MgryyVBQVzxhKL5s4aio,1800
+ biblicus/inference.py,sha256=_k00AIPoXD2lruiTB-JUagtY4f_WKcdzA3axwiq1tck,3512
  biblicus/knowledge_base.py,sha256=JmlJw8WD_fgstuq1PyWVzU9kzvVzyv7_xOvhS70xwUw,6654
- biblicus/models.py,sha256=6SWQ2Czg9O3zjuam8a4m8V3LlEgcGLbEctYDB6F1rRs,15317
+ biblicus/models.py,sha256=vlvPP7AOZGtnHSq47-s9YW-fqLwjgYR6NBcSfeC8YKk,15665
  biblicus/retrieval.py,sha256=A1SI4WK5cX-WbtN6FJ0QQxqlEOtQhddLrL0LZIuoTC4,4180
  biblicus/sources.py,sha256=EFy8-rQNLsyzz-98mH-z8gEHMYbqigcNFKLaR92KfDE,7241
  biblicus/time.py,sha256=3BSKOSo7R10K-0Dzrbdtl3fh5_yShTYqfdlKvvdkx7M,485
  biblicus/uris.py,sha256=xXD77lqsT9NxbyzI1spX9Y5a3-U6sLYMnpeSAV7g-nM,2013
- biblicus/user_config.py,sha256=DqO08yLn82DhTiFpmIyyLj_J0nMbrtE8xieTj2Cgd6A,4287
+ biblicus/user_config.py,sha256=okK57CRmT0W_yrc45tMPRl_abT7-D96IOrCBZtKtumM,6507
  biblicus/_vendor/dotyaml/__init__.py,sha256=e4zbejeJRwlD4I0q3YvotMypO19lXqmT8iyU1q6SvhY,376
  biblicus/_vendor/dotyaml/interpolation.py,sha256=PfUAEEOTFobv7Ox0E6nAxht6BqhHIDe4hP32fZn5TOs,1992
  biblicus/_vendor/dotyaml/loader.py,sha256=KePkjyhKZSvQZphmlmlzTYZJBQsqL5qhtGV1y7G6wzM,5624
  biblicus/_vendor/dotyaml/transformer.py,sha256=2AKPS8DMOPuYtzmM-dlwIqVbARfbBH5jYV1m5qpR49E,3725
+ biblicus/analysis/__init__.py,sha256=TrKsE2GmdZDr3OARo2poa9H0powo0bjiEEWVx0tZmEg,1192
+ biblicus/analysis/base.py,sha256=gB4ilvyMpiWU1m_ydy2dIHGP96ZFIFvVUL9iVDZKPJM,1265
+ biblicus/analysis/llm.py,sha256=VjkZDKauHCDfj-TP-bTbI6a9WAXEIDe8bEiwErPx9xc,3309
+ biblicus/analysis/models.py,sha256=4N8abx2kSMYYfckbq_QHl5YUnups3FFx5atepYR9cu4,19705
+ biblicus/analysis/schema.py,sha256=MCiAQJmijVk8iM8rOUYbzyaDwsMR-Oo86iZU5NCbDMM,435
+ biblicus/analysis/topic_modeling.py,sha256=9jSZrlpPK44H4UMfig7YNs3pPc0pNAqu-i4OlXzHET8,19454
  biblicus/backends/__init__.py,sha256=wLXIumV51l6ZIKzjoKKeU7AgIxGOryG7T7ls3a_Fv98,1212
  biblicus/backends/base.py,sha256=Erfj9dXg0nkRKnEcNjHR9_0Ddb2B1NvbmRksVm_g1dU,1776
  biblicus/backends/scan.py,sha256=hdNnQWqi5IH6j95w30BZHxLJ0W9PTaOkqfWJuxCCEMI,12478
  biblicus/backends/sqlite_full_text_search.py,sha256=KgmwOiKvkA0pv7vD0V7bcOdDx_nZIOfuIN6Z4Ij7I68,16516
- biblicus/extractors/__init__.py,sha256=ctf6TkGViOpxr1s1TGMs40emcXImQZ71p0uOEBvLy9s,1890
+ biblicus/extractors/__init__.py,sha256=ci3oldbdQZ8meAfHccM48CqQtZsPSRg3HkPrBSZF15M,2673
  biblicus/extractors/base.py,sha256=ka-nz_1zHPr4TS9sU4JfOoY-PJh7lbHPBOEBrbQFGSc,2171
+ biblicus/extractors/deepgram_stt.py,sha256=VI71i4lbE-EFHcvpNcCPRpT8z7A5IuaSrT1UaPyZ8UY,6323
+ biblicus/extractors/docling_granite_text.py,sha256=aFNx-HubvaMmVJHbNqk3CR_ilSwN96-phkaENT6E2B0,6879
+ biblicus/extractors/docling_smol_text.py,sha256=cSbQcT4O47MMcM6_pmQCvqgC5ferLvaxJnm3v9EQd0A,6811
  biblicus/extractors/markitdown_text.py,sha256=-7N8ebi3pYfNPnplccyy3qvsKi6uImC1xyo_dSDiD10,4546
  biblicus/extractors/metadata_text.py,sha256=7FbEPp0K1mXc7FH1_c0KhPhPexF9U6eLd3TVY1vTp1s,3537
  biblicus/extractors/openai_stt.py,sha256=fggErIu6YN6tXbleNTuROhfYi7zDgMd2vD_ecXZ7eXs,7162
+ biblicus/extractors/paddleocr_vl_text.py,sha256=augbxZ-kx22yHvFR1b6CUAS2I6ktXFsJx8nLWRfvdOA,11722
  biblicus/extractors/pass_through_text.py,sha256=DNxkCwpH2bbXjPGPEQwsx8kfqXi6rIxXNY_n3TU2-WI,2777
  biblicus/extractors/pdf_text.py,sha256=YtUphgLVxyWJXew6ZsJ8wBRh67Y5ri4ZTRlMmq3g1Bk,3255
  biblicus/extractors/pipeline.py,sha256=LY6eM3ypw50MDB2cPEQqZrjxkhVvIc6sv4UEhHdNDrE,3208
- biblicus/extractors/rapidocr_text.py,sha256=OMAuZealLSSTFVVmBalT-AFJy2pEpHyyvpuWxlnY-GU,4531
+ biblicus/extractors/rapidocr_text.py,sha256=StvizEha5BkEG7i5KJmnOUtji89p5pghF4w8iQ-WwFk,4776
  biblicus/extractors/select_longest_text.py,sha256=wRveXAfYLdj7CpGuo4RoD7zE6SIfylRCbv40z2azO0k,3702
+ biblicus/extractors/select_override.py,sha256=gSpffFmn1ux9pGtFvHD5Uu_LO8TmmJC4L_mvjehiSec,4014
+ biblicus/extractors/select_smart_override.py,sha256=-sLMnNoeXbCB3dO9zflQq324eHuLbd6hpveSwduXP-U,6763
  biblicus/extractors/select_text.py,sha256=w0ATmDy3tWWbOObzW87jGZuHbgXllUhotX5XyySLs-o,3395
  biblicus/extractors/unstructured_text.py,sha256=l2S_wD_htu7ZHoJQNQtP-kGlEgOeKV_w2IzAC93lePE,3564
- biblicus-0.7.0.dist-info/licenses/LICENSE,sha256=lw44GXFG_Q0fS8m5VoEvv_xtdBXK26pBcbSPUCXee_Q,1078
- biblicus-0.7.0.dist-info/METADATA,sha256=tt46S2yJOUMhhAQFvLayZmEPJ5q7hNSP4CnUGBS2eT0,22315
- biblicus-0.7.0.dist-info/WHEEL,sha256=wUyA8OaulRlbfwMtmQsvNngGrxQHAvkKcvRmdizlJi0,92
- biblicus-0.7.0.dist-info/entry_points.txt,sha256=BZmO4H8Uz00fyi1RAFryOCGfZgX7eHWkY2NE-G54U5A,47
- biblicus-0.7.0.dist-info/top_level.txt,sha256=sUD_XVZwDxZ29-FBv1MknTGh4mgDXznGuP28KJY_WKc,9
- biblicus-0.7.0.dist-info/RECORD,,
+ biblicus-0.9.0.dist-info/licenses/LICENSE,sha256=lw44GXFG_Q0fS8m5VoEvv_xtdBXK26pBcbSPUCXee_Q,1078
+ biblicus-0.9.0.dist-info/METADATA,sha256=7NBBKWloUkQ2mx_CuPqAQzQJWHEwM7aJT7XQHGL2VwU,27325
+ biblicus-0.9.0.dist-info/WHEEL,sha256=wUyA8OaulRlbfwMtmQsvNngGrxQHAvkKcvRmdizlJi0,92
+ biblicus-0.9.0.dist-info/entry_points.txt,sha256=BZmO4H8Uz00fyi1RAFryOCGfZgX7eHWkY2NE-G54U5A,47
+ biblicus-0.9.0.dist-info/top_level.txt,sha256=sUD_XVZwDxZ29-FBv1MknTGh4mgDXznGuP28KJY_WKc,9
+ biblicus-0.9.0.dist-info/RECORD,,
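
Each entry in the hunk above has the shape `path,sha256=<digest>,<size>`, where the digest is the URL-safe Base64 encoding of the file's SHA-256 hash with trailing `=` padding removed, as the wheel RECORD format specifies. The sketch below shows how such a line can be produced and checked for a given byte payload. The helper names and the `example/hello.py` path are hypothetical, and the payload is made up; no hash from the diff is recomputed here.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """URL-safe Base64 of the SHA-256 digest, without '=' padding."""
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def record_line(path: str, data: bytes) -> str:
    """Build a RECORD-style entry: path, hash, and size in bytes."""
    return f"{path},sha256={record_digest(data)},{len(data)}"


def matches_record(line: str, data: bytes) -> bool:
    """Check a RECORD-style entry against the given file contents."""
    _path, hash_field, size_field = line.rsplit(",", 2)
    if not hash_field and not size_field:
        # The RECORD file lists itself with empty hash and size fields.
        return True
    algorithm, _, expected = hash_field.partition("=")
    return (
        algorithm == "sha256"
        and expected == record_digest(data)
        and int(size_field) == len(data)
    )


payload = b"print('hello')\n"  # made-up file contents
line = record_line("example/hello.py", payload)
print(line)
print(matches_record(line, payload))         # True
print(matches_record(line, payload + b"#"))  # False
```

Splitting from the right with `rsplit(",", 2)` keeps the check working for paths that contain commas, which is why the sketch does not use a plain `split`.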