biblicus 0.14.0__tar.gz → 0.15.0__tar.gz
- {biblicus-0.14.0/src/biblicus.egg-info → biblicus-0.15.0}/PKG-INFO +88 -25
- {biblicus-0.14.0 → biblicus-0.15.0}/README.md +80 -24
- biblicus-0.15.0/docs/ANALYSIS.md +143 -0
- biblicus-0.15.0/docs/ARCHITECTURE.md +46 -0
- biblicus-0.15.0/docs/ARCHITECTURE_DETAIL.md +267 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/BACKENDS.md +24 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/CONTEXT_PACK.md +58 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/CORPUS.md +49 -10
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/CORPUS_DESIGN.md +18 -5
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/DEMOS.md +75 -49
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/EXTRACTION.md +46 -11
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/EXTRACTION_EVALUATION.md +33 -3
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/FEATURE_INDEX.md +145 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/KNOWLEDGE_BASE.md +19 -0
- biblicus-0.15.0/docs/MARKOV_ANALYSIS.md +262 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/PROFILING.md +65 -1
- biblicus-0.15.0/docs/PR_FAQ_TEXT_ANNOTATE.md +118 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/RETRIEVAL.md +33 -7
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/RETRIEVAL_EVALUATION.md +44 -7
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/RETRIEVAL_QUALITY.md +9 -3
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/ROADMAP.md +13 -15
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/STT.md +4 -4
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/TESTING.md +15 -4
- biblicus-0.15.0/docs/TEXT_ANNOTATE.md +119 -0
- biblicus-0.15.0/docs/TEXT_EXTRACT.md +671 -0
- biblicus-0.15.0/docs/TEXT_LINK.md +124 -0
- biblicus-0.15.0/docs/TEXT_REDACT.md +170 -0
- biblicus-0.15.0/docs/TEXT_SLICE.md +319 -0
- biblicus-0.15.0/docs/TEXT_UTILITIES.md +137 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/TOPIC_MODELING.md +78 -5
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/USER_CONFIGURATION.md +11 -0
- biblicus-0.15.0/docs/USE_CASES.md +37 -0
- biblicus-0.15.0/docs/UTILITIES.md +23 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/backends/index.md +25 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/backends/vector.md +2 -2
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/index.md +12 -1
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/ocr/index.md +8 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/pipeline-utilities/index.md +11 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/speech-to-text/index.md +8 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/text-document/index.md +11 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/vlm-document/index.md +8 -0
- biblicus-0.15.0/docs/index.rst +213 -0
- biblicus-0.15.0/docs/use_cases/notes_to_context_pack.md +48 -0
- biblicus-0.15.0/docs/use_cases/sequence_markov.md +82 -0
- biblicus-0.15.0/docs/use_cases/text_folder_search.md +39 -0
- biblicus-0.15.0/docs/use_cases/text_redact.md +50 -0
- biblicus-0.15.0/features/ai_llm.feature +25 -0
- biblicus-0.15.0/features/ai_models.feature +74 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/analysis_schema.feature +1 -1
- biblicus-0.15.0/features/embeddings.feature +39 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/environment.py +61 -0
- biblicus-0.15.0/features/integration_text_annotate.feature +22 -0
- biblicus-0.15.0/features/integration_text_extract.feature +69 -0
- biblicus-0.15.0/features/integration_text_link.feature +25 -0
- biblicus-0.15.0/features/integration_text_redact.feature +31 -0
- biblicus-0.15.0/features/integration_text_slice.feature +27 -0
- biblicus-0.15.0/features/integration_use_cases.feature +10 -0
- biblicus-0.15.0/features/integration_use_cases_sequence_markov.feature +15 -0
- biblicus-0.15.0/features/markov_analysis.feature +36 -0
- biblicus-0.15.0/features/markov_analysis_categorical.feature +42 -0
- biblicus-0.15.0/features/markov_analysis_llm.feature +65 -0
- biblicus-0.15.0/features/markov_analysis_topic_modeling.feature +40 -0
- biblicus-0.15.0/features/markov_analysis_variants.feature +559 -0
- biblicus-0.15.0/features/markov_internal_branches.feature +297 -0
- biblicus-0.15.0/features/markov_schema.feature +161 -0
- biblicus-0.15.0/features/markov_start_end_labels.feature +10 -0
- biblicus-0.15.0/features/profiling_config_overrides.feature +16 -0
- biblicus-0.15.0/features/recipe_cascading.feature +63 -0
- biblicus-0.15.0/features/recipe_utilities.feature +77 -0
- biblicus-0.15.0/features/steps/ai_llm_steps.py +44 -0
- biblicus-0.15.0/features/steps/ai_models_steps.py +181 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/analysis_steps.py +8 -6
- biblicus-0.15.0/features/steps/embeddings_steps.py +122 -0
- biblicus-0.15.0/features/steps/markov_internal_steps.py +1933 -0
- biblicus-0.15.0/features/steps/markov_schema_steps.py +729 -0
- biblicus-0.15.0/features/steps/markov_start_end_steps.py +38 -0
- biblicus-0.15.0/features/steps/markov_steps.py +451 -0
- biblicus-0.15.0/features/steps/openai_steps.py +715 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/profiling_steps.py +74 -0
- biblicus-0.15.0/features/steps/recipe_steps.py +96 -0
- biblicus-0.15.0/features/steps/text_annotate_steps.py +477 -0
- biblicus-0.15.0/features/steps/text_extract_steps.py +480 -0
- biblicus-0.15.0/features/steps/text_internal_steps.py +64 -0
- biblicus-0.15.0/features/steps/text_link_internal_steps.py +379 -0
- biblicus-0.15.0/features/steps/text_link_steps.py +494 -0
- biblicus-0.15.0/features/steps/text_mock_steps.py +199 -0
- biblicus-0.15.0/features/steps/text_redact_steps.py +509 -0
- biblicus-0.15.0/features/steps/text_slice_steps.py +433 -0
- biblicus-0.15.0/features/steps/text_tool_loop_steps.py +36 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/topic_modeling_steps.py +45 -0
- biblicus-0.15.0/features/steps/use_cases_steps.py +118 -0
- biblicus-0.15.0/features/text_annotate.feature +227 -0
- biblicus-0.15.0/features/text_extract.feature +226 -0
- biblicus-0.15.0/features/text_internal_branches.feature +52 -0
- biblicus-0.15.0/features/text_link.feature +146 -0
- biblicus-0.15.0/features/text_link_internal_branches.feature +106 -0
- biblicus-0.15.0/features/text_mock.feature +86 -0
- biblicus-0.15.0/features/text_redact.feature +135 -0
- biblicus-0.15.0/features/text_slice.feature +135 -0
- biblicus-0.15.0/features/text_utilities.feature +51 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/topic_modeling.feature +3 -3
- biblicus-0.15.0/features/use_cases.feature +21 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/pyproject.toml +11 -1
- biblicus-0.15.0/scripts/markov_analysis_demo.py +279 -0
- biblicus-0.15.0/scripts/markov_cached_segments_demo.py +603 -0
- biblicus-0.15.0/scripts/markov_run_report.py +243 -0
- biblicus-0.15.0/scripts/use_cases/notes_to_context_pack_demo.py +119 -0
- biblicus-0.15.0/scripts/use_cases/sequence_markov_demo.py +189 -0
- biblicus-0.15.0/scripts/use_cases/text_folder_search_demo.py +132 -0
- biblicus-0.15.0/scripts/use_cases/text_redact_demo.py +116 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/__init__.py +1 -1
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/_vendor/dotyaml/__init__.py +2 -2
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/_vendor/dotyaml/loader.py +40 -1
- biblicus-0.15.0/src/biblicus/ai/__init__.py +39 -0
- biblicus-0.15.0/src/biblicus/ai/embeddings.py +114 -0
- biblicus-0.15.0/src/biblicus/ai/llm.py +138 -0
- biblicus-0.15.0/src/biblicus/ai/models.py +226 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/analysis/__init__.py +5 -2
- biblicus-0.15.0/src/biblicus/analysis/markov.py +1624 -0
- biblicus-0.15.0/src/biblicus/analysis/models.py +1530 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/analysis/topic_modeling.py +98 -19
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/backends/sqlite_full_text_search.py +4 -2
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/cli.py +118 -23
- biblicus-0.15.0/src/biblicus/recipes.py +136 -0
- biblicus-0.15.0/src/biblicus/text/__init__.py +43 -0
- biblicus-0.15.0/src/biblicus/text/annotate.py +222 -0
- biblicus-0.15.0/src/biblicus/text/extract.py +210 -0
- biblicus-0.15.0/src/biblicus/text/link.py +519 -0
- biblicus-0.15.0/src/biblicus/text/markup.py +200 -0
- biblicus-0.15.0/src/biblicus/text/models.py +319 -0
- biblicus-0.15.0/src/biblicus/text/prompts.py +113 -0
- biblicus-0.15.0/src/biblicus/text/redact.py +229 -0
- biblicus-0.15.0/src/biblicus/text/slice.py +155 -0
- biblicus-0.15.0/src/biblicus/text/tool_loop.py +334 -0
- {biblicus-0.14.0 → biblicus-0.15.0/src/biblicus.egg-info}/PKG-INFO +88 -25
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus.egg-info/SOURCES.txt +88 -2
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus.egg-info/requires.txt +9 -0
- biblicus-0.14.0/docs/ANALYSIS.md +0 -47
- biblicus-0.14.0/docs/ARCHITECTURE.md +0 -180
- biblicus-0.14.0/docs/index.rst +0 -33
- biblicus-0.14.0/features/steps/openai_steps.py +0 -314
- biblicus-0.14.0/src/biblicus/analysis/llm.py +0 -106
- biblicus-0.14.0/src/biblicus/analysis/models.py +0 -777
- {biblicus-0.14.0 → biblicus-0.15.0}/LICENSE +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/MANIFEST.in +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/THIRD_PARTY_NOTICES.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/datasets/extraction_lab/labels.json +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/datasets/retrieval_lab/labels.json +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/datasets/wikipedia_mini.json +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/api.rst +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/backends/scan.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/backends/sqlite-full-text-search.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/conf.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/ocr/paddleocr-vl.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/ocr/rapidocr.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/pipeline-utilities/pipeline.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/pipeline-utilities/select-longest.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/pipeline-utilities/select-override.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/pipeline-utilities/select-smart-override.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/pipeline-utilities/select-text.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/speech-to-text/deepgram.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/speech-to-text/openai.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/text-document/markitdown.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/text-document/metadata.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/text-document/pass-through.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/text-document/pdf.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/text-document/unstructured.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/vlm-document/docling-granite.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/docs/extractors/vlm-document/docling-smol.md +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/backend_validation.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/biblicus_corpus.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/cli_entrypoint.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/cli_parsing.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/cli_step_spec_parsing.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/content_sniffing.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/context_pack.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/context_pack_cli.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/context_pack_policies.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/corpus_edge_cases.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/corpus_identity.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/corpus_purge.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/crawl.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/docling_granite_extractor.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/docling_smol_extractor.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/error_cases.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/evaluation.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/evidence_processing.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/extraction_error_handling.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/extraction_evaluation.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/extraction_evaluation_lab.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/extraction_run_lifecycle.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/extraction_selection.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/extraction_selection_longest.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/extractor_pipeline.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/extractor_validation.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/frontmatter.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/hook_config_validation.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/hook_error_handling.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/import_tree.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/inference_backend.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/ingest_sources.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/integration_audio_samples.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/integration_image_samples.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/integration_mixed_corpus.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/integration_mixed_extraction.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/integration_ocr_image_extraction.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/integration_pdf_retrieval.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/integration_pdf_samples.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/integration_unstructured_extraction.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/integration_wikipedia.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/knowledge_base.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/lifecycle_hooks.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/markitdown_extractor.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/model_validation.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/ocr_extractor.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/paddleocr_vl_extractor.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/paddleocr_vl_parse_api_response.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/pdf_text_extraction.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/profiling.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/python_api.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/python_hook_logging.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/query_processing.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/recipe_file_extraction.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/retrieval_budget.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/retrieval_evaluation_lab.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/retrieval_quality.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/retrieval_scan.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/retrieval_sqlite_full_text_search.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/retrieval_uses_extraction_run.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/retrieval_utilities.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/select_override.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/smart_override_selection.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/source_loading.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/backend_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/cli_parsing_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/cli_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/context_pack_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/crawl_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/deepgram_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/docling_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/evidence_processing_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/extraction_evaluation_lab_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/extraction_evaluation_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/extraction_run_lifecycle_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/extraction_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/extractor_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/frontmatter_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/inference_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/knowledge_base_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/markitdown_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/model_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/paddleocr_mock_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/paddleocr_vl_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/paddleocr_vl_unit_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/pdf_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/python_api_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/rapidocr_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/requests_mock_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/retrieval_evaluation_lab_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/retrieval_quality_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/retrieval_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/stt_deepgram_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/stt_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/unstructured_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/steps/user_config_steps.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/streaming_ingest.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/stt_deepgram_extractor.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/stt_extractor.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/text_extraction_runs.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/token_budget.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/unstructured_extractor.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/features/user_config.feature +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/download_ag_news.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/download_audio_samples.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/download_image_samples.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/download_mixed_samples.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/download_pdf_samples.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/download_wikipedia.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/extraction_evaluation_demo.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/extraction_evaluation_lab.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/profiling_demo.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/readme_end_to_end_demo.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/retrieval_evaluation_lab.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/test.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/topic_modeling_integration.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/scripts/wikipedia_rag_demo.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/setup.cfg +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/__main__.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/_vendor/dotyaml/interpolation.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/_vendor/dotyaml/transformer.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/analysis/base.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/analysis/profiling.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/analysis/schema.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/backends/__init__.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/backends/base.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/backends/hybrid.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/backends/scan.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/backends/vector.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/constants.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/context.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/corpus.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/crawl.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/errors.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/evaluation.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/evidence_processing.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extraction.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extraction_evaluation.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/__init__.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/base.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/deepgram_stt.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/docling_granite_text.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/docling_smol_text.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/markitdown_text.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/metadata_text.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/openai_stt.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/paddleocr_vl_text.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/pass_through_text.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/pdf_text.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/pipeline.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/rapidocr_text.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/select_longest_text.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/select_override.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/select_smart_override.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/select_text.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/extractors/unstructured_text.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/frontmatter.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/hook_logging.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/hook_manager.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/hooks.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/ignore.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/inference.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/knowledge_base.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/models.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/retrieval.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/sources.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/time.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/uris.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus/user_config.py +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus.egg-info/dependency_links.txt +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus.egg-info/entry_points.txt +0 -0
- {biblicus-0.14.0 → biblicus-0.15.0}/src/biblicus.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: biblicus
-Version: 0.
+Version: 0.15.0
 Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
 License: MIT
 Requires-Python: >=3.9
@@ -9,6 +9,8 @@ License-File: LICENSE
 Requires-Dist: pydantic>=2.0
 Requires-Dist: PyYAML>=6.0
 Requires-Dist: pypdf>=4.0
+Requires-Dist: Jinja2>=3.1
+Requires-Dist: dotyaml>=0.1.3
 Provides-Extra: dev
 Requires-Dist: behave>=1.2.6; extra == "dev"
 Requires-Dist: coverage[toml]>=7.0; extra == "dev"
@@ -18,6 +20,9 @@ Requires-Dist: sphinx_rtd_theme>=2.0; extra == "dev"
 Requires-Dist: ruff>=0.4.0; extra == "dev"
 Requires-Dist: black>=24.0; extra == "dev"
 Requires-Dist: python-semantic-release>=9.0.0; extra == "dev"
+Provides-Extra: dspy
+Requires-Dist: dspy>=2.5; extra == "dspy"
+Requires-Dist: litellm>=1.0; extra == "dspy"
 Provides-Extra: openai
 Requires-Dist: openai>=1.0; extra == "openai"
 Provides-Extra: unstructured
@@ -40,6 +45,8 @@ Provides-Extra: docling-mlx
 Requires-Dist: docling[mlx-vlm]>=2.0.0; extra == "docling-mlx"
 Provides-Extra: topic-modeling
 Requires-Dist: bertopic>=0.15.0; extra == "topic-modeling"
+Provides-Extra: markov-analysis
+Requires-Dist: hmmlearn>=0.3.0; extra == "markov-analysis"
 Provides-Extra: datasets
 Requires-Dist: datasets>=2.18.0; extra == "datasets"
 Dynamic: license-file
@@ -56,12 +63,20 @@ If you are building an assistant in Python, you probably have material you want
 
 The first practical problem is not retrieval. It is collection and care. You need a stable place to put raw items, you need a small amount of metadata so you can find them again, and you need a way to evolve your retrieval approach over time without rewriting ingestion.
 
-
+Biblicus gives you a normal folder on disk to manage. In Biblicus documentation, that managed folder is called a *corpus* (plural: *corpora*). It stores each ingested item as a file, with optional metadata stored next to it. You can open and inspect the raw files directly. Any derived catalog or index can be rebuilt from the raw files.
 
 It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or your own setup. Use it from Python or from the command line interface.
 
 See [retrieval augmented generation overview] for a short introduction to the idea.
 
+## Analysis highlights
+
+- `biblicus analyze markov` learns a directed, weighted state transition graph over segmented text.
+- YAML recipes support cascading composition plus dotted `--config key=value` overrides.
+- Text extract splits long texts with an LLM by inserting XML tags in-place for structured spans.
+- See `docs/MARKOV_ANALYSIS.md` for Markov analysis details and runnable demos.
+- See `docs/TEXT_EXTRACT.md` for the text extract utility and examples.
+
 ## Start with a knowledge base
 
 If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
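The "Analysis highlights" lines added in the hunk above mention cascading recipe composition plus dotted `--config key=value` overrides. As a rough illustration of that idea only, the sketch below merges two recipe-like dictionaries and then applies one dotted assignment. It is not taken from `src/biblicus/recipes.py`: the function names `merge_recipes` and `apply_dotted_override` and the keys `model.n_states` and `model.seed` are hypothetical, and the real tool may parse values and resolve conflicts differently.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_recipes(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge overlay into a copy of base; overlay values win."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_recipes(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def apply_dotted_override(config: Dict[str, Any], assignment: str) -> Dict[str, Any]:
    """Apply one 'a.b.c=value' assignment, creating nested dicts as needed."""
    dotted_key, _, raw_value = assignment.partition("=")
    result = deepcopy(config)
    node = result
    parts = dotted_key.split(".")
    for part in parts[:-1]:
        if not isinstance(node.get(part), dict):
            node[part] = {}
        node = node[part]
    # The value is kept as a string here; a real tool may coerce types.
    node[parts[-1]] = raw_value
    return result


# Hypothetical recipe fragments, for illustration only.
base = {"model": {"n_states": 4, "seed": 1}}
overlay = {"model": {"n_states": 8}}
config = apply_dotted_override(merge_recipes(base, overlay), "model.seed=42")
print(config)
```

Later layers win over earlier ones, and the command-line override is applied last, which is the usual ordering for this kind of layered configuration.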
@@ -106,7 +121,7 @@ Think in three stages.
 
 If you learn a few project words, the rest of the system becomes predictable.
 
-- Corpus is the folder that holds raw items and their metadata.
+- Corpus is the managed folder that holds raw items and their metadata.
 - Item is the raw bytes plus optional metadata and source information.
 - Catalog is the rebuildable index of the corpus.
 - Extraction run is a recorded extraction build that produces text artifacts.
@@ -161,28 +176,28 @@ sequenceDiagram
 This repository is a working Python package. Install it into a virtual environment from the repository root.
 
 ```
-
+python -m pip install -e .
 ```
 
 After the first release, you can install it from Python Package Index.
 
 ```
-
+python -m pip install biblicus
 ```
 
 ### Optional extras
 
 Some extractors are optional so the base install stays small.
 
-- Optical character recognition for images: `
-- Advanced optical character recognition with PaddleOCR: `
-- Document understanding with Docling VLM: `
-- Document understanding with Docling VLM and MLX acceleration: `
-- Speech to text transcription with OpenAI: `
-- Speech to text transcription with Deepgram: `
-- Broad document parsing fallback: `
-- MarkItDown document conversion (requires Python 3.10 or higher): `
-- Topic modeling analysis with BERTopic: `
+- Optical character recognition for images: `python -m pip install "biblicus[ocr]"`
+- Advanced optical character recognition with PaddleOCR: `python -m pip install "biblicus[paddleocr]"`
+- Document understanding with Docling VLM: `python -m pip install "biblicus[docling]"`
+- Document understanding with Docling VLM and MLX acceleration: `python -m pip install "biblicus[docling-mlx]"`
+- Speech to text transcription with OpenAI: `python -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+- Speech to text transcription with Deepgram: `python -m pip install "biblicus[deepgram]"` (requires a Deepgram API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
+- Broad document parsing fallback: `python -m pip install "biblicus[unstructured]"`
+- MarkItDown document conversion (requires Python 3.10 or higher): `python -m pip install "biblicus[markitdown]"`
+- Topic modeling analysis with BERTopic: `python -m pip install "biblicus[topic-modeling]"`
 
 ## Quick start
 
@@ -200,16 +215,49 @@ biblicus build --corpus corpora/example --backend scan
 biblicus query --corpus corpora/example --query "note"
 ```
 
-
+## Web Ingestion
+
+Biblicus supports ingesting content directly from the web using two approaches.
+
+### Ingest from URLs
+
+Ingest individual documents or web pages from URLs. The `ingest` command automatically detects content types including PDF, HTML, Markdown, images, and audio:
 
+```bash
+# Ingest a document from a URL
+biblicus ingest https://example.com/document.pdf --tags "research"
+
+# Ingest a web page
+biblicus ingest https://example.com/article.html --tags "article"
+
+# Ingest with a corpus path specified
+biblicus ingest --corpus corpora/example https://docs.example.com/guide.md --tags "documentation"
 ```
-
-
-
-
-
+
+### Crawl Websites
+
+Crawl entire website sections with automatic link discovery. The crawler follows links within the allowed prefix and stores discovered content:
+
+```bash
+# Crawl a documentation site
+biblicus crawl \
+--corpus corpora/example \
+--root-url https://docs.example.com/ \
+--allowed-prefix https://docs.example.com/ \
+--max-items 100 \
+--tags "documentation"
+
+# Crawl a specific blog category
+biblicus crawl \
+--corpus corpora/example \
+--root-url https://blog.example.com/category/tutorials/ \
+--allowed-prefix https://blog.example.com/category/tutorials/ \
+--max-items 50 \
+--tags "tutorials,blog"
 ```
 
+The `--allowed-prefix` parameter restricts the crawler to only follow links that start with the specified URL prefix, preventing it from crawling outside the intended scope. The crawler respects `.biblicusignore` rules and stores items under `raw/imports/crawl/` in your corpus.
+
 ## End-to-end example: lower-level control
 
 The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
@@ -540,6 +588,21 @@ For detailed documentation on all extractors, see the [Extractor Reference][extr
|
|
|
540
588
|
For extraction evaluation workflows, dataset formats, and report interpretation, see
|
|
541
589
|
`docs/EXTRACTION_EVALUATION.md`.
|
|
542
590
|
|
|
591
|
+
## Text extract utility
|
|
592
|
+
|
|
593
|
+
Text extract is a reusable analysis utility that lets a model insert XML tags into a long text without re-emitting the
|
|
594
|
+
entire document. It returns structured spans and the marked-up text, and it is used as a segmentation option in Markov
|
|
595
|
+
analysis.
|
|
596
|
+
|
|
597
|
+
See `docs/TEXT_EXTRACT.md` for the utility API and examples, and `docs/MARKOV_ANALYSIS.md` for the Markov integration.
|
|
598
|
+
|
|
599
|
+
## Text slice utility
|
|
600
|
+
|
|
601
|
+
Text slice is a reusable analysis utility that lets a model insert `<slice/>` markers into a long text without
|
|
602
|
+
re-emitting the entire document. It returns ordered slices and the marked-up text for auditing and reuse.
|
|
603
|
+
|
|
604
|
+
See `docs/TEXT_SLICE.md` for the utility API and examples.
|
|
605
|
+
|
|
543
606
|
## Topic modeling analysis
|
|
544
607
|
|
|
545
608
|
Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Profiling and topic modeling
|
|
@@ -594,7 +657,7 @@ AG News integration runs require `biblicus[datasets]` in addition to `biblicus[t
|
|
|
594
657
|
For a repeatable, real-world integration run that downloads AG News and executes topic modeling, use:
|
|
595
658
|
|
|
596
659
|
```
|
|
597
|
-
|
|
660
|
+
python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
|
|
598
661
|
```
|
|
599
662
|
|
|
600
663
|
See `docs/TOPIC_MODELING.md` for parameter examples and per-topic output behavior.
|
|
@@ -608,13 +671,13 @@ Use `scripts/download_pdf_samples.py` to download a small Portable Document Form
|
|
|
608
671
|
## Tests and coverage
|
|
609
672
|
|
|
610
673
|
```
|
|
611
|
-
|
|
674
|
+
python scripts/test.py
|
|
612
675
|
```
|
|
613
676
|
|
|
614
677
|
To include integration scenarios that download public test data at runtime, run this command.
|
|
615
678
|
|
|
616
679
|
```
|
|
617
|
-
|
|
680
|
+
python scripts/test.py --integration
|
|
618
681
|
```
|
|
619
682
|
|
|
620
683
|
## Releases
|
|
@@ -632,13 +695,13 @@ Reference documentation is generated from Sphinx style docstrings.
|
|
|
632
695
|
Install development dependencies:
|
|
633
696
|
|
|
634
697
|
```
|
|
635
|
-
|
|
698
|
+
python -m pip install -e ".[dev]"
|
|
636
699
|
```
|
|
637
700
|
|
|
638
701
|
Build the documentation:
|
|
639
702
|
|
|
640
703
|
```
|
|
641
|
-
|
|
704
|
+
python -m sphinx -b html docs docs/_build/html
|
|
642
705
|
```
|
|
643
706
|
|
|
644
707
|
## License
|
|
@@ -10,12 +10,20 @@ If you are building an assistant in Python, you probably have material you want
|
|
|
10
10
|
|
|
11
11
|
The first practical problem is not retrieval. It is collection and care. You need a stable place to put raw items, you need a small amount of metadata so you can find them again, and you need a way to evolve your retrieval approach over time without rewriting ingestion.
|
|
12
12
|
|
|
13
|
-
|
|
13
|
+
Biblicus gives you a normal folder on disk to manage. In Biblicus documentation, that managed folder is called a *corpus* (plural: *corpora*). It stores each ingested item as a file, with optional metadata stored next to it. You can open and inspect the raw files directly. Any derived catalog or index can be rebuilt from the raw files.
|
|
14
14
|
|
|
15
15
|
It can be used alongside LangGraph, Tactus, Pydantic AI, any agent framework, or your own setup. Use it from Python or from the command line interface.
|
|
16
16
|
|
|
17
17
|
See [retrieval augmented generation overview] for a short introduction to the idea.
|
|
18
18
|
|
|
19
|
+
## Analysis highlights
|
|
20
|
+
|
|
21
|
+
- `biblicus analyze markov` learns a directed, weighted state transition graph over segmented text.
|
|
22
|
+
- YAML recipes support cascading composition plus dotted `--config key=value` overrides.
|
|
23
|
+
- Text extract uses an LLM to insert XML tags into long texts in place and returns the tagged regions as structured spans.
|
|
24
|
+
- See `docs/MARKOV_ANALYSIS.md` for Markov analysis details and runnable demos.
|
|
25
|
+
- See `docs/TEXT_EXTRACT.md` for the text extract utility and examples.
|
|
26
|
+
|
|
19
27
|
## Start with a knowledge base
|
|
20
28
|
|
|
21
29
|
If you just want to hand a folder to your assistant and move on, use the high-level knowledge base interface. The folder can be nothing more than a handful of plain text files. You are not choosing a retrieval strategy yet. You are just collecting.
|
|
@@ -60,7 +68,7 @@ Think in three stages.
|
|
|
60
68
|
|
|
61
69
|
If you learn a few project words, the rest of the system becomes predictable.
|
|
62
70
|
|
|
63
|
-
- Corpus is the folder that holds raw items and their metadata.
|
|
71
|
+
- Corpus is the managed folder that holds raw items and their metadata.
|
|
64
72
|
- Item is the raw bytes plus optional metadata and source information.
|
|
65
73
|
- Catalog is the rebuildable index of the corpus.
|
|
66
74
|
- Extraction run is a recorded extraction build that produces text artifacts.
|
|
@@ -115,28 +123,28 @@ sequenceDiagram
|
|
|
115
123
|
This repository is a working Python package. Install it into a virtual environment from the repository root.
|
|
116
124
|
|
|
117
125
|
```
|
|
118
|
-
|
|
126
|
+
python -m pip install -e .
|
|
119
127
|
```
|
|
120
128
|
|
|
121
129
|
After the first release, you can install it from the Python Package Index.
|
|
122
130
|
|
|
123
131
|
```
|
|
124
|
-
|
|
132
|
+
python -m pip install biblicus
|
|
125
133
|
```
|
|
126
134
|
|
|
127
135
|
### Optional extras
|
|
128
136
|
|
|
129
137
|
Some extractors are optional so the base install stays small.
|
|
130
138
|
|
|
131
|
-
- Optical character recognition for images: `
|
|
132
|
-
- Advanced optical character recognition with PaddleOCR: `
|
|
133
|
-
- Document understanding with Docling VLM: `
|
|
134
|
-
- Document understanding with Docling VLM and MLX acceleration: `
|
|
135
|
-
- Speech to text transcription with OpenAI: `
|
|
136
|
-
- Speech to text transcription with Deepgram: `
|
|
137
|
-
- Broad document parsing fallback: `
|
|
138
|
-
- MarkItDown document conversion (requires Python 3.10 or higher): `
|
|
139
|
-
- Topic modeling analysis with BERTopic: `
|
|
139
|
+
- Optical character recognition for images: `python -m pip install "biblicus[ocr]"`
|
|
140
|
+
- Advanced optical character recognition with PaddleOCR: `python -m pip install "biblicus[paddleocr]"`
|
|
141
|
+
- Document understanding with Docling VLM: `python -m pip install "biblicus[docling]"`
|
|
142
|
+
- Document understanding with Docling VLM and MLX acceleration: `python -m pip install "biblicus[docling-mlx]"`
|
|
143
|
+
- Speech to text transcription with OpenAI: `python -m pip install "biblicus[openai]"` (requires an OpenAI API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
|
|
144
|
+
- Speech to text transcription with Deepgram: `python -m pip install "biblicus[deepgram]"` (requires a Deepgram API key in `~/.biblicus/config.yml` or `./.biblicus/config.yml`)
|
|
145
|
+
- Broad document parsing fallback: `python -m pip install "biblicus[unstructured]"`
|
|
146
|
+
- MarkItDown document conversion (requires Python 3.10 or higher): `python -m pip install "biblicus[markitdown]"`
|
|
147
|
+
- Topic modeling analysis with BERTopic: `python -m pip install "biblicus[topic-modeling]"`
|
|
140
148
|
|
|
141
149
|
## Quick start
|
|
142
150
|
|
|
@@ -154,16 +162,49 @@ biblicus build --corpus corpora/example --backend scan
|
|
|
154
162
|
biblicus query --corpus corpora/example --query "note"
|
|
155
163
|
```
|
|
156
164
|
|
|
157
|
-
|
|
165
|
+
## Web Ingestion
|
|
166
|
+
|
|
167
|
+
Biblicus supports ingesting content directly from the web using two approaches.
|
|
168
|
+
|
|
169
|
+
### Ingest from URLs
|
|
170
|
+
|
|
171
|
+
Ingest individual documents or web pages from URLs. The `ingest` command automatically detects content types including PDF, HTML, Markdown, images, and audio:
|
|
158
172
|
|
|
173
|
+
```bash
|
|
174
|
+
# Ingest a document from a URL
|
|
175
|
+
biblicus ingest https://example.com/document.pdf --tags "research"
|
|
176
|
+
|
|
177
|
+
# Ingest a web page
|
|
178
|
+
biblicus ingest https://example.com/article.html --tags "article"
|
|
179
|
+
|
|
180
|
+
# Ingest with a corpus path specified
|
|
181
|
+
biblicus ingest --corpus corpora/example https://docs.example.com/guide.md --tags "documentation"
|
|
159
182
|
```
|
|
160
|
-
|
|
161
|
-
|
|
162
|
-
|
|
163
|
-
|
|
164
|
-
|
|
183
|
+
|
|
184
|
+
### Crawl Websites
|
|
185
|
+
|
|
186
|
+
Crawl entire website sections with automatic link discovery. The crawler follows links within the allowed prefix and stores discovered content:
|
|
187
|
+
|
|
188
|
+
```bash
|
|
189
|
+
# Crawl a documentation site
|
|
190
|
+
biblicus crawl \
|
|
191
|
+
--corpus corpora/example \
|
|
192
|
+
--root-url https://docs.example.com/ \
|
|
193
|
+
--allowed-prefix https://docs.example.com/ \
|
|
194
|
+
--max-items 100 \
|
|
195
|
+
--tags "documentation"
|
|
196
|
+
|
|
197
|
+
# Crawl a specific blog category
|
|
198
|
+
biblicus crawl \
|
|
199
|
+
--corpus corpora/example \
|
|
200
|
+
--root-url https://blog.example.com/category/tutorials/ \
|
|
201
|
+
--allowed-prefix https://blog.example.com/category/tutorials/ \
|
|
202
|
+
--max-items 50 \
|
|
203
|
+
--tags "tutorials,blog"
|
|
165
204
|
```
|
|
166
205
|
|
|
206
|
+
The `--allowed-prefix` parameter restricts the crawler to only follow links that start with the specified URL prefix, preventing it from crawling outside the intended scope. The crawler respects `.biblicusignore` rules and stores items under `raw/imports/crawl/` in your corpus.
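The prefix rule can be pictured with a small Python sketch. This is not the Biblicus crawler: `filter_links` and its arguments are hypothetical names used only for illustration, and the real crawler may normalize URLs differently. The idea shown is that each discovered link is resolved against the page it came from and kept only if the resulting absolute URL starts with the allowed prefix.

```python
from urllib.parse import urldefrag, urljoin


def filter_links(page_url: str, hrefs: list[str], allowed_prefix: str) -> list[str]:
    """Return absolute links from ``hrefs`` that stay within ``allowed_prefix``.

    Hypothetical helper for illustration only; not part of the Biblicus API.
    """
    kept: list[str] = []
    for href in hrefs:
        # Resolve relative links against the page they were found on.
        absolute = urljoin(page_url, href)
        # Drop the fragment so "#section" variants count as the same page.
        absolute, _fragment = urldefrag(absolute)
        # Keep only in-scope links, without duplicates, in discovery order.
        if absolute.startswith(allowed_prefix) and absolute not in kept:
            kept.append(absolute)
    return kept
```

With a prefix of `https://docs.example.com/`, a link to another host is dropped, while relative links on the same site are resolved and kept.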
|
|
207
|
+
|
|
167
208
|
## End-to-end example: lower-level control
|
|
168
209
|
|
|
169
210
|
The command-line interface returns JavaScript Object Notation by default. This makes it easy to use Biblicus in scripts and to treat retrieval as a deterministic, testable step.
|
|
@@ -494,6 +535,21 @@ For detailed documentation on all extractors, see the [Extractor Reference][extr
|
|
|
494
535
|
For extraction evaluation workflows, dataset formats, and report interpretation, see
|
|
495
536
|
`docs/EXTRACTION_EVALUATION.md`.
|
|
496
537
|
|
|
538
|
+
## Text extract utility
|
|
539
|
+
|
|
540
|
+
Text extract is a reusable analysis utility that lets a model insert XML tags into a long text without re-emitting the
|
|
541
|
+
entire document. It returns structured spans and the marked-up text, and it is used as a segmentation option in Markov
|
|
542
|
+
analysis.
|
|
543
|
+
|
|
544
|
+
See `docs/TEXT_EXTRACT.md` for the utility API and examples, and `docs/MARKOV_ANALYSIS.md` for the Markov integration.
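To make the "structured spans" idea concrete, here is a minimal Python sketch that pulls spans out of text that has already been marked up. It does not call a model and is not the Biblicus API: the function name `extract_spans`, the default tag name `span`, and the returned dictionary shape are all assumptions, and the sketch ignores nested tags and tag attributes.

```python
import re


def extract_spans(marked_up_text: str, tag: str = "span") -> list[dict]:
    """Collect the text between ``<tag>`` and ``</tag>`` pairs, in order.

    Hypothetical helper: the tag name and result shape are assumptions.
    """
    escaped = re.escape(tag)
    # Non-greedy match so adjacent spans are not merged; DOTALL lets a span
    # cover several lines.
    pattern = re.compile(rf"<{escaped}>(.*?)</{escaped}>", re.DOTALL)
    return [
        {"index": index, "text": match.group(1)}
        for index, match in enumerate(pattern.finditer(marked_up_text), start=1)
    ]
```

The model only has to place tags. Recovering the span text is then a deterministic step over the marked-up document.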
|
|
545
|
+
|
|
546
|
+
## Text slice utility
|
|
547
|
+
|
|
548
|
+
Text slice is a reusable analysis utility that lets a model insert `<slice/>` markers into a long text without
|
|
549
|
+
re-emitting the entire document. It returns ordered slices and the marked-up text for auditing and reuse.
|
|
550
|
+
|
|
551
|
+
See `docs/TEXT_SLICE.md` for the utility API and examples.
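As a rough illustration of how ordered slices follow from the markers, the sketch below splits already marked-up text on `<slice/>`. It does not call a model and is not the Biblicus API; `split_on_slice_markers` is a hypothetical name, and dropping empty pieces is a choice made here for readability.

```python
def split_on_slice_markers(marked_up_text: str, marker: str = "<slice/>") -> list[str]:
    """Split marked-up text into ordered slices, dropping empty pieces.

    Hypothetical helper for illustration only.
    """
    pieces = marked_up_text.split(marker)
    # Trim surrounding whitespace and skip pieces that end up empty,
    # for example when two markers are adjacent.
    return [piece.strip() for piece in pieces if piece.strip()]
```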
|
|
552
|
+
|
|
497
553
|
## Topic modeling analysis
|
|
498
554
|
|
|
499
555
|
Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Profiling and topic modeling
|
|
@@ -548,7 +604,7 @@ AG News integration runs require `biblicus[datasets]` in addition to `biblicus[t
|
|
|
548
604
|
For a repeatable, real-world integration run that downloads AG News and executes topic modeling, use:
|
|
549
605
|
|
|
550
606
|
```
|
|
551
|
-
|
|
607
|
+
python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
|
|
552
608
|
```
|
|
553
609
|
|
|
554
610
|
See `docs/TOPIC_MODELING.md` for parameter examples and per-topic output behavior.
|
|
@@ -562,13 +618,13 @@ Use `scripts/download_pdf_samples.py` to download a small Portable Document Form
|
|
|
562
618
|
## Tests and coverage
|
|
563
619
|
|
|
564
620
|
```
|
|
565
|
-
|
|
621
|
+
python scripts/test.py
|
|
566
622
|
```
|
|
567
623
|
|
|
568
624
|
To include integration scenarios that download public test data at runtime, run this command.
|
|
569
625
|
|
|
570
626
|
```
|
|
571
|
-
|
|
627
|
+
python scripts/test.py --integration
|
|
572
628
|
```
|
|
573
629
|
|
|
574
630
|
## Releases
|
|
@@ -586,13 +642,13 @@ Reference documentation is generated from Sphinx style docstrings.
|
|
|
586
642
|
Install development dependencies:
|
|
587
643
|
|
|
588
644
|
```
|
|
589
|
-
|
|
645
|
+
python -m pip install -e ".[dev]"
|
|
590
646
|
```
|
|
591
647
|
|
|
592
648
|
Build the documentation:
|
|
593
649
|
|
|
594
650
|
```
|
|
595
|
-
|
|
651
|
+
python -m sphinx -b html docs docs/_build/html
|
|
596
652
|
```
|
|
597
653
|
|
|
598
654
|
## License
|
|
@@ -0,0 +1,143 @@
|
|
|
1
|
+
# Corpus analysis
|
|
2
|
+
|
|
3
|
+
Biblicus supports analysis backends that run on extracted text artifacts without changing the raw corpus. Analysis is a
|
|
4
|
+
pluggable phase that reads an extraction run, produces structured output, and stores artifacts under the corpus runs
|
|
5
|
+
folder. Each analysis backend declares its own configuration schema and output contract, and all schemas are validated
|
|
6
|
+
strictly.
|
|
7
|
+
|
|
8
|
+
## How analysis runs work
|
|
9
|
+
|
|
10
|
+
- Analysis runs are tied to a corpus state via the extraction run reference.
|
|
11
|
+
- The analysis output is written under `.biblicus/runs/analysis/<analysis-id>/<run_id>/`.
|
|
12
|
+
- Analysis is reproducible when you supply the same extraction run and corpus catalog state.
|
|
13
|
+
- Analysis configuration is stored as a recipe manifest in the run metadata.
|
|
14
|
+
|
|
15
|
+
If you omit the extraction run, Biblicus uses the most recent extraction run and emits a reproducibility warning. For
|
|
16
|
+
repeatable analysis runs, always pass the extraction run reference explicitly.
|
|
17
|
+
|
|
18
|
+
## Analysis run artifacts
|
|
19
|
+
|
|
20
|
+
Every analysis run records a manifest alongside the output:
|
|
21
|
+
|
|
22
|
+
```
|
|
23
|
+
.biblicus/runs/analysis/<analysis-id>/<run_id>/
|
|
24
|
+
  manifest.json
|
|
25
|
+
  output.json
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
The manifest captures the recipe, extraction run reference, and catalog timestamp so results can be reproduced and
|
|
29
|
+
compared later.
|
|
30
|
+
|
|
31
|
+
## Inspecting output
|
|
32
|
+
|
|
33
|
+
Analysis outputs are JSON documents. You can view them directly:
|
|
34
|
+
|
|
35
|
+
```
|
|
36
|
+
cat corpora/example/.biblicus/runs/analysis/profiling/RUN_ID/output.json
|
|
37
|
+
```
|
|
38
|
+
|
|
39
|
+
Each analysis backend defines its own `report` payload. The run metadata is consistent across backends.
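For scripted inspection, a short Python sketch can load the same file. The directory layout follows the path documented above. The function names are hypothetical, and treating `report` as a top-level key of `output.json` is an assumption rather than a documented contract.

```python
import json
from pathlib import Path


def load_analysis_output(corpus_root: str, analysis_id: str, run_id: str) -> dict:
    """Load output.json for one analysis run using the documented layout."""
    run_dir = Path(corpus_root) / ".biblicus" / "runs" / "analysis" / analysis_id / run_id
    with (run_dir / "output.json").open(encoding="utf-8") as handle:
        return json.load(handle)


def report_keys(output: dict) -> list[str]:
    """List keys of the backend-specific report (assumed to live under "report")."""
    report = output.get("report", {})
    return sorted(report) if isinstance(report, dict) else []
```

Listing the report keys first is a quick way to see what a given backend produced before reading the full document.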
|
|
40
|
+
|
|
41
|
+
## Comparing analysis runs
|
|
42
|
+
|
|
43
|
+
When you compare analysis results, record:
|
|
44
|
+
|
|
45
|
+
- Corpus path and catalog timestamp.
|
|
46
|
+
- Extraction run reference.
|
|
47
|
+
- Analysis recipe name and configuration.
|
|
48
|
+
- Analysis run identifier and output path.
|
|
49
|
+
|
|
50
|
+
These make it possible to rerun the analysis and explain differences.
|
|
51
|
+
|
|
52
|
+
## Pluggable analysis backends
|
|
53
|
+
|
|
54
|
+
Analysis backends implement the `CorpusAnalysisBackend` interface and are registered under `biblicus.analysis`.
|
|
55
|
+
A backend receives the corpus, a recipe name, a configuration mapping, and an extraction run reference. It returns a
|
|
56
|
+
Pydantic model that is serialized to JavaScript Object Notation for storage.
|
|
57
|
+
|
|
58
|
+
## Choosing an analysis backend
|
|
59
|
+
|
|
60
|
+
Start with profiling when you need fast, deterministic baselines. Use topic modeling when you want thematic clustering
|
|
61
|
+
and exploratory labels. Use Markov analysis when you want state-transition structure over sequences of segments.
|
|
62
|
+
Combine multiple backends for a clear view of corpus composition, themes, and state dynamics.
|
|
63
|
+
|
|
64
|
+
## Recipe files
|
|
65
|
+
|
|
66
|
+
Analysis recipes are optional JavaScript Object Notation or YAML files that capture configuration in a repeatable way.
|
|
67
|
+
They are useful for sharing experiments and keeping runs reproducible.
|
|
68
|
+
|
|
69
|
+
Recipes support cascading composition. When a command accepts `--recipe`, you can pass multiple recipe files. Biblicus
|
|
70
|
+
merges them in order, where later recipes override earlier recipes via a deep merge. You can then apply `--config`
|
|
71
|
+
overrides on top of the composed view.
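The composition order can be sketched in Python. This is an illustration of a deep merge followed by overrides, not the Biblicus implementation: the function names are hypothetical, and the dotted-key form of the overrides (for example `model.n_states`) is an assumption based on the README's description of `--config key=value`.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new mapping where ``override`` wins, merging nested mappings."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def apply_dotted_override(config: dict[str, Any], dotted_key: str, value: Any) -> dict[str, Any]:
    """Set ``a.b.c`` style keys on a copy of ``config``, creating parents as needed."""
    result = deepcopy(config)
    node = result
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        child = node.get(part)
        if not isinstance(child, dict):
            child = {}
            node[part] = child
        node = child
    node[leaf] = value
    return result


def compose_recipes(recipes: list[dict[str, Any]], overrides: dict[str, Any]) -> dict[str, Any]:
    """Merge recipes in order (later wins), then apply overrides on top."""
    composed: dict[str, Any] = {}
    for recipe in recipes:
        composed = deep_merge(composed, recipe)
    for dotted_key, value in overrides.items():
        composed = apply_dotted_override(composed, dotted_key, value)
    return composed
```

A later recipe that sets only `model.n_states` leaves `model.family` from the earlier recipe untouched, which is the practical difference between a deep merge and a plain top-level replacement.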
|
|
72
|
+
|
|
73
|
+
Minimal profiling recipe:
|
|
74
|
+
|
|
75
|
+
```
|
|
76
|
+
schema_version: 1
|
|
77
|
+
```
|
|
78
|
+
|
|
79
|
+
Minimal topic modeling recipe:
|
|
80
|
+
|
|
81
|
+
```
|
|
82
|
+
schema_version: 1
|
|
83
|
+
text_source:
|
|
84
|
+
  sample_size: 500
|
|
85
|
+
bertopic_analysis:
|
|
86
|
+
  parameters:
|
|
87
|
+
    nr_topics: 8
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
Minimal Markov analysis recipe:
|
|
91
|
+
|
|
92
|
+
```
|
|
93
|
+
schema_version: 1
|
|
94
|
+
model:
|
|
95
|
+
  family: gaussian
|
|
96
|
+
  n_states: 8
|
|
97
|
+
segmentation:
|
|
98
|
+
  method: sentence
|
|
99
|
+
observations:
|
|
100
|
+
  encoder: tfidf
|
|
101
|
+
```
|
|
102
|
+
|
|
103
|
+
## Topic modeling
|
|
104
|
+
|
|
105
|
+
Topic modeling is the first analysis backend. It uses BERTopic to cluster extracted text, produces per-topic evidence,
|
|
106
|
+
and optionally labels topics using an LLM. See `docs/TOPIC_MODELING.md` for detailed configuration and examples.
|
|
107
|
+
|
|
108
|
+
The integration demo script is a working reference you can use as a starting point:
|
|
109
|
+
|
|
110
|
+
```
|
|
111
|
+
python scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
The command prints the analysis run identifier and the output path. Open the resulting `output.json` to inspect per-topic
|
|
115
|
+
labels, keywords, and document examples.
|
|
116
|
+
|
|
117
|
+
## Markov analysis
|
|
118
|
+
|
|
119
|
+
Markov analysis learns a directed, weighted state transition graph over sequences of text segments. The output includes
|
|
120
|
+
per-state exemplars, per-item decoded paths, and optional GraphViz exports. See `docs/MARKOV_ANALYSIS.md` for detailed
|
|
121
|
+
configuration and examples.
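To show what a directed, weighted transition graph looks like as data, the sketch below counts transitions in per-item state paths and normalizes them into edge weights. It is a simplification under stated assumptions: Biblicus fits a model to learn states (the minimal recipe above names a `gaussian` family), whereas this code only summarizes paths that have already been decoded. The name `transition_weights` and the integer state labels are hypothetical.

```python
from collections import Counter, defaultdict


def transition_weights(paths: list[list[int]]) -> dict[int, dict[int, float]]:
    """Turn decoded state paths into normalized outgoing edge weights.

    Hypothetical helper: weights[a][b] is the share of observed transitions
    out of state ``a`` that go to state ``b``.
    """
    counts: dict[int, Counter] = defaultdict(Counter)
    for path in paths:
        # Pair each state with its successor within the same item.
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1

    weights: dict[int, dict[int, float]] = {}
    for state, outgoing in counts.items():
        total = sum(outgoing.values())
        weights[state] = {nxt: n / total for nxt, n in outgoing.items()}
    return weights
```

States that never appear as a source, such as a terminal state, have no outgoing edges and are absent from the result.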
|
|
122
|
+
|
|
123
|
+
Text extract is available as a segmentation strategy for long texts. It inserts XML tags in-place using a virtual file
|
|
124
|
+
editing loop, then extracts spans without requiring the model to re-emit the full transcript.
|
|
125
|
+
|
|
126
|
+
## Profiling analysis
|
|
127
|
+
|
|
128
|
+
Profiling is the baseline analysis backend. It summarizes corpus composition and extraction coverage using
|
|
129
|
+
deterministic counts and distribution metrics. See `docs/PROFILING.md` for the full reference and working demo.
|
|
130
|
+
|
|
131
|
+
### Minimal profiling run
|
|
132
|
+
|
|
133
|
+
```
|
|
134
|
+
python -m biblicus analyze profile --corpus corpora/example --extraction-run pipeline:RUN_ID
|
|
135
|
+
```
|
|
136
|
+
|
|
137
|
+
The command writes an analysis run directory and prints the run identifier.
|
|
138
|
+
|
|
139
|
+
Equivalently, run profiling through the `biblicus` command:
|
|
140
|
+
|
|
141
|
+
```
|
|
142
|
+
biblicus analyze profile --corpus corpora/example --extraction-run pipeline:RUN_ID
|
|
143
|
+
```
|
|
@@ -0,0 +1,46 @@
|
|
|
1
|
+
# Biblicus Architecture
|
|
2
|
+
|
|
3
|
+
Biblicus sits between raw, unstructured data and the moment you need reliable answers from it.
|
|
4
|
+
It is built for teams who receive large, messy corpora and must extract usable signals without
|
|
5
|
+
losing provenance or reproducibility. Retrieval-augmented generation is one use case, but the
|
|
6
|
+
system is broader than chatbots: it supports any pipeline that needs structured insight from
|
|
7
|
+
unstructured data.
|
|
8
|
+
|
|
9
|
+
At a high level the system does five things:
|
|
10
|
+
|
|
11
|
+
1. **Ingests** raw content into a corpus with minimal friction.
|
|
12
|
+
2. **Extracts** text from diverse media (documents, images, audio).
|
|
13
|
+
3. **Transforms** and annotates text with reusable LLM utilities.
|
|
14
|
+
4. **Retrieves** evidence through explicit, reproducible stages.
|
|
15
|
+
5. **Evaluates** results so improvements are measurable, not anecdotal.
|
|
16
|
+
|
|
17
|
+
The guiding idea is that every retrieval produces **evidence**: structured outputs with scores
|
|
18
|
+
and provenance that can be inspected, audited, and reused. Context packs, summaries, and downstream
|
|
19
|
+
generation are all derived from that evidence.
|
|
20
|
+
|
|
21
|
+
## Why it exists
|
|
22
|
+
|
|
23
|
+
Real-world AI work often starts with a folder full of files, not a clean database. Biblicus is the
|
|
24
|
+
toolkit that turns those files into a manageable, testable system. It supports workflows like:
|
|
25
|
+
|
|
26
|
+
- Indexing large collections of emails and making them searchable while protecting sensitive data.
|
|
27
|
+
- Processing discovery dumps of scanned PDFs with OCR and extracting evidence for analysis.
|
|
28
|
+
- Turning policy or rules documents into a controlled knowledge base for assistants.
|
|
29
|
+
|
|
30
|
+
## How it fits into AI systems
|
|
31
|
+
|
|
32
|
+
Biblicus integrates with agent frameworks through explicit tool interfaces. It does not hide
|
|
33
|
+
retrieval inside the model. Instead, it provides repeatable pipelines that expose *what* was
|
|
34
|
+
retrieved and *why*, so models can use evidence directly and safely.
|
|
35
|
+
|
|
36
|
+
## Where to go next
|
|
37
|
+
|
|
38
|
+
- Start with **CORPUS** and **EXTRACTION** to understand how raw content is ingested.
|
|
39
|
+
- Move to **RETRIEVAL** and **RETRIEVAL_EVALUATION** to see how evidence is produced and tested.
|
|
40
|
+
- Explore **TOPIC_MODELING** and **MARKOV_ANALYSIS** if you need higher-level analysis tools.
|
|
41
|
+
- See **TEXT_UTILITIES** for reusable, AI-assisted text transformations.
|
|
42
|
+
|
|
43
|
+
## Detailed architecture and policies
|
|
44
|
+
|
|
45
|
+
For a deep, internal reference (including design policies and architectural constraints), see
|
|
46
|
+
`ARCHITECTURE_DETAIL.md`.
|