biblicus 0.10.0.tar.gz → 0.11.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {biblicus-0.10.0/src/biblicus.egg-info → biblicus-0.11.0}/PKG-INFO +7 -1
- {biblicus-0.10.0 → biblicus-0.11.0}/README.md +6 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/ARCHITECTURE.md +4 -4
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/CORPUS_DESIGN.md +2 -2
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/DEMOS.md +3 -3
- biblicus-0.11.0/docs/RETRIEVAL.md +47 -0
- biblicus-0.11.0/docs/RETRIEVAL_EVALUATION.md +74 -0
- biblicus-0.11.0/docs/RETRIEVAL_QUALITY.md +42 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/ROADMAP.md +15 -1
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/text-document/pass-through.md +3 -3
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/text-document/unstructured.md +1 -1
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/index.rst +3 -0
- biblicus-0.11.0/features/retrieval_quality.feature +253 -0
- biblicus-0.11.0/features/steps/retrieval_quality_steps.py +186 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/retrieval_steps.py +10 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/pyproject.toml +1 -1
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/__init__.py +1 -1
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/analysis/profiling.py +1 -1
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/backends/__init__.py +4 -0
- biblicus-0.11.0/src/biblicus/backends/hybrid.py +284 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/backends/sqlite_full_text_search.py +264 -18
- biblicus-0.11.0/src/biblicus/backends/vector.py +460 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/models.py +3 -0
- {biblicus-0.10.0 → biblicus-0.11.0/src/biblicus.egg-info}/PKG-INFO +7 -1
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus.egg-info/SOURCES.txt +7 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/LICENSE +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/MANIFEST.in +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/THIRD_PARTY_NOTICES.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/datasets/wikipedia_mini.json +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/ANALYSIS.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/BACKENDS.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/CONTEXT_PACK.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/CORPUS.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/EXTRACTION.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/FEATURE_INDEX.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/KNOWLEDGE_BASE.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/PROFILING.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/STT.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/TESTING.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/TOPIC_MODELING.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/USER_CONFIGURATION.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/api.rst +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/backends/index.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/backends/scan.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/backends/sqlite-full-text-search.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/conf.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/index.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/ocr/index.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/ocr/paddleocr-vl.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/ocr/rapidocr.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/pipeline-utilities/index.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/pipeline-utilities/pipeline.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/pipeline-utilities/select-longest.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/pipeline-utilities/select-override.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/pipeline-utilities/select-smart-override.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/pipeline-utilities/select-text.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/speech-to-text/deepgram.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/speech-to-text/index.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/speech-to-text/openai.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/text-document/index.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/text-document/markitdown.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/text-document/metadata.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/text-document/pdf.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/vlm-document/docling-granite.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/vlm-document/docling-smol.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/docs/extractors/vlm-document/index.md +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/analysis_schema.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/backend_validation.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/biblicus_corpus.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/cli_entrypoint.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/cli_parsing.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/cli_step_spec_parsing.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/content_sniffing.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/context_pack.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/context_pack_cli.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/corpus_edge_cases.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/corpus_identity.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/corpus_purge.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/crawl.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/docling_granite_extractor.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/docling_smol_extractor.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/environment.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/error_cases.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/evaluation.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/evidence_processing.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/extraction_error_handling.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/extraction_run_lifecycle.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/extraction_selection.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/extraction_selection_longest.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/extractor_pipeline.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/extractor_validation.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/frontmatter.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/hook_config_validation.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/hook_error_handling.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/import_tree.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/inference_backend.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/ingest_sources.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/integration_audio_samples.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/integration_image_samples.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/integration_mixed_corpus.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/integration_mixed_extraction.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/integration_ocr_image_extraction.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/integration_pdf_retrieval.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/integration_pdf_samples.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/integration_unstructured_extraction.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/integration_wikipedia.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/knowledge_base.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/lifecycle_hooks.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/markitdown_extractor.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/model_validation.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/ocr_extractor.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/paddleocr_vl_extractor.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/paddleocr_vl_parse_api_response.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/pdf_text_extraction.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/profiling.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/python_api.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/python_hook_logging.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/query_processing.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/recipe_file_extraction.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/retrieval_budget.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/retrieval_scan.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/retrieval_sqlite_full_text_search.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/retrieval_uses_extraction_run.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/retrieval_utilities.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/select_override.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/smart_override_selection.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/source_loading.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/analysis_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/backend_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/cli_parsing_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/cli_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/context_pack_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/crawl_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/deepgram_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/docling_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/evidence_processing_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/extraction_run_lifecycle_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/extraction_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/extractor_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/frontmatter_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/inference_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/knowledge_base_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/markitdown_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/model_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/openai_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/paddleocr_mock_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/paddleocr_vl_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/paddleocr_vl_unit_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/pdf_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/profiling_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/python_api_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/rapidocr_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/requests_mock_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/stt_deepgram_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/stt_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/topic_modeling_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/unstructured_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/steps/user_config_steps.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/streaming_ingest.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/stt_deepgram_extractor.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/stt_extractor.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/text_extraction_runs.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/token_budget.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/topic_modeling.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/unstructured_extractor.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/features/user_config.feature +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/scripts/download_ag_news.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/scripts/download_audio_samples.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/scripts/download_image_samples.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/scripts/download_mixed_samples.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/scripts/download_pdf_samples.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/scripts/download_wikipedia.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/scripts/profiling_demo.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/scripts/readme_end_to_end_demo.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/scripts/test.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/scripts/topic_modeling_integration.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/scripts/wikipedia_rag_demo.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/setup.cfg +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/__main__.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/_vendor/dotyaml/__init__.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/_vendor/dotyaml/interpolation.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/_vendor/dotyaml/loader.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/_vendor/dotyaml/transformer.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/analysis/__init__.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/analysis/base.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/analysis/llm.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/analysis/models.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/analysis/schema.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/analysis/topic_modeling.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/backends/base.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/backends/scan.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/cli.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/constants.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/context.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/corpus.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/crawl.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/errors.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/evaluation.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/evidence_processing.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extraction.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/__init__.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/base.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/deepgram_stt.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/docling_granite_text.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/docling_smol_text.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/markitdown_text.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/metadata_text.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/openai_stt.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/paddleocr_vl_text.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/pass_through_text.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/pdf_text.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/pipeline.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/rapidocr_text.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/select_longest_text.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/select_override.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/select_smart_override.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/select_text.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/extractors/unstructured_text.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/frontmatter.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/hook_logging.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/hook_manager.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/hooks.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/ignore.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/inference.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/knowledge_base.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/retrieval.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/sources.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/time.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/uris.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus/user_config.py +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus.egg-info/dependency_links.txt +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus.egg-info/entry_points.txt +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus.egg-info/requires.txt +0 -0
- {biblicus-0.10.0 → biblicus-0.11.0}/src/biblicus.egg-info/top_level.txt +0 -0
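Version 0.11.0 adds new retrieval backends (`src/biblicus/backends/hybrid.py` and `src/biblicus/backends/vector.py`). The diff does not show their internals; as orientation only, a hybrid backend typically fuses a lexical score map and a vector-similarity score map per item. A minimal sketch of weighted score fusion under that assumption (all names are hypothetical, not the actual module API):

```python
def fuse_scores(lexical, vector, lexical_weight=0.5):
    """Hypothetical sketch of weighted-sum fusion, not biblicus's implementation.

    Each signal is min-max normalized so the weights are comparable, then
    combined; an item missing from one signal contributes 0 for that signal.
    """
    def normalize(scores):
        if not scores:
            return {}
        low, high = min(scores.values()), max(scores.values())
        span = high - low
        if span == 0:
            return {item: 1.0 for item in scores}
        return {item: (score - low) / span for item, score in scores.items()}

    lex = normalize(lexical)
    vec = normalize(vector)
    fused = {
        item: lexical_weight * lex.get(item, 0.0)
        + (1.0 - lexical_weight) * vec.get(item, 0.0)
        for item in set(lex) | set(vec)
    }
    # Highest fused score first.
    return dict(sorted(fused.items(), key=lambda pair: pair[1], reverse=True))
```

Exposing `lexical_weight` explicitly matches the stated design goal in `docs/RETRIEVAL_QUALITY.md` of surfacing fusion weights rather than hiding them.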
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: biblicus
-Version: 0.10.0
+Version: 0.11.0
 Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
 License: MIT
 Requires-Python: >=3.9
@@ -493,6 +493,12 @@ Two backends are included.
 
 For detailed documentation including configuration options, performance characteristics, and usage examples, see the [Backend Reference][backend-reference].
 
+## Retrieval documentation
+
+For the retrieval pipeline overview and run artifacts, see `docs/RETRIEVAL.md`. For retrieval quality upgrades
+(tuned lexical baseline, reranking, hybrid retrieval), see `docs/RETRIEVAL_QUALITY.md`. For evaluation workflows
+and dataset formats, see `docs/RETRIEVAL_EVALUATION.md`.
+
 ## Extraction backends
 
 These extractors are built in. Optional ones require extra dependencies. See [text extraction documentation][text-extraction] for details.
@@ -447,6 +447,12 @@ Two backends are included.
 
 For detailed documentation including configuration options, performance characteristics, and usage examples, see the [Backend Reference][backend-reference].
 
+## Retrieval documentation
+
+For the retrieval pipeline overview and run artifacts, see `docs/RETRIEVAL.md`. For retrieval quality upgrades
+(tuned lexical baseline, reranking, hybrid retrieval), see `docs/RETRIEVAL_QUALITY.md`. For evaluation workflows
+and dataset formats, see `docs/RETRIEVAL_EVALUATION.md`.
+
 ## Extraction backends
 
 These extractors are built in. Optional ones require extra dependencies. See [text extraction documentation][text-extraction] for details.
@@ -88,11 +88,11 @@ Evidence is the canonical output of retrieval. Required fields:
 ### Integration boundary
 
 - Biblicus can integrate with Tactus as a **Model Context Protocol toolset**, for example with tool names such as `knowledge_base_ingest`, `knowledge_base_query`, and `knowledge_base_stats`.
-- We
+- We do **not** add a knowledge base or retrieval augmented generation language primitive in version zero. Revisit only if we need semantics that tools cannot express cleanly, such as enforceable policy boundaries, runtime managed durability, caching hooks, or guaranteed instrumentation.
 
 ### Interface packaging
 
-- The knowledge base interface is a **small protocol and reference implementation**, including tool schemas and a reference Model Context Protocol server. We
+- The knowledge base interface is a **small protocol and reference implementation**, including tool schemas and a reference Model Context Protocol server. We do not build a full managed service in version zero.
 
 ### Corpus identity and layout
 
@@ -143,7 +143,7 @@ The interface stays the same; topology is configuration.
 - When a backend produces persisted materializations, Biblicus treats them as **versioned build runs** identified by `run_id` (rather than overwriting in place by default).
 - Manifests exist even for just-in-time backends (materializations may be empty).
 - Full directed acyclic graph lineage is not included in version zero; revisit only if needed.
--
+- Optional: define **shared materialization formats** (canonical chunk and embedding stores) so multiple backends can reuse intermediates when it makes sense; keep it opt-in.
 
 ### Evaluation
 
@@ -156,7 +156,7 @@ The interface stays the same; topology is configuration.
 - The corpus catalog is **file-based** (committable, portable, backend-agnostic) so any backend/tool can consume it without requiring a database engine.
 - Canonical version zero format is a single JavaScript Object Notation file at `.biblicus/catalog.json`, written atomically (temporary file and rename) on updates.
 - The catalog includes `latest_run_id` and run manifests are stored at `.biblicus/runs/<run_id>.json`.
-- If this
+- If this becomes a bottleneck at very large scales, we **change the specification** (bump `schema_version`) rather than introduce multiple “supported” catalog storage modes.
 
 ## Near-term deliverables
 
@@ -216,7 +216,7 @@ Version zero locked this as policy. A prune workflow was not implemented yet.
 
 Goal: retain derived artifacts from multiple implementations side by side so a user can compare results and switch between implementations without losing work.
 
-This decision applies to extraction plugins and retrieval backends, and to any
+This decision applies to extraction plugins and retrieval backends, and to any plugin type that produces derived artifacts.
 
 Option A: store artifacts under the corpus, partitioned by plugin type
 
@@ -369,7 +369,7 @@ Version zero implemented option A by writing structured log entries for hook executions
 
 ## Outcomes and remaining questions
 
-The hook protocol and hook logging policy above were implemented in version zero. This section records what was implemented
+The hook protocol and hook logging policy above were implemented in version zero. This section records what was implemented and the open questions tracked for later iterations.
 
 ### Hook contexts implemented in version zero
 
@@ -6,7 +6,7 @@ For the ordered plan of what to build next, see `docs/ROADMAP.md`.
 
 ## Diagram of the current system and the next layers
 
-Blue boxes are implemented now. Purple boxes are
+Blue boxes are implemented now. Purple boxes are layers not implemented yet that we can build and compare.
 
 ```mermaid
 %%{init: {"flowchart": {"useMaxWidth": true, "nodeSpacing": 18, "rankSpacing": 22}}}%%
@@ -233,7 +233,7 @@ python3 -m biblicus extract build --corpus corpora/demo \
   --step select-text
 ```
 
-Copy the `run_id` from the JavaScript Object Notation output.
+Copy the `run_id` from the JavaScript Object Notation output. Use it as `EXTRACTION_RUN_ID` in the next command.
 
 ```
 python3 -m biblicus build --corpus corpora/demo --backend sqlite-full-text-search \
@@ -251,7 +251,7 @@ python3 scripts/download_pdf_samples.py --corpus corpora/pdf_samples --force
 python3 -m biblicus extract build --corpus corpora/pdf_samples --step pdf-text
 ```
 
-Copy the `run_id` from the JavaScript Object Notation output.
+Copy the `run_id` from the JavaScript Object Notation output. Use it as `PDF_EXTRACTION_RUN_ID` in the next command.
 
 ```
 python3 -m biblicus build --corpus corpora/pdf_samples --backend sqlite-full-text-search --config extraction_run=pipeline:PDF_EXTRACTION_RUN_ID --config chunk_size=200 --config chunk_overlap=50 --config snippet_characters=120
@@ -0,0 +1,47 @@
+# Retrieval
+
+Biblicus treats retrieval as a reproducible, explicit pipeline stage that transforms a corpus into structured evidence.
+Retrieval is separated from extraction and context shaping so each can be evaluated independently and swapped without
+rewriting ingestion.
+
+## Retrieval concepts
+
+- **Backend**: a pluggable retrieval implementation that can build and query runs.
+- **Run**: a recorded retrieval build for a corpus and extraction run.
+- **Evidence**: structured output containing identifiers, provenance, and scores.
+- **Stage**: explicit steps such as retrieve, rerank, and filter.
+
+## How retrieval runs work
+
+1) Ingest raw items into a corpus.
+2) Build an extraction run to produce text artifacts.
+3) Build a retrieval run with a backend, referencing the extraction run.
+4) Query the run to return evidence.
+
+Retrieval runs are stored under:
+
+```
+.biblicus/runs/retrieval/<backend_id>/<run_id>/
+```
+
+## Backends
+
+See `docs/backends/index.md` for backend selection and configuration.
+
+## Evaluation
+
+Retrieval runs are evaluated against datasets with explicit budgets. See `docs/RETRIEVAL_EVALUATION.md` for the
+dataset format and workflow, `docs/FEATURE_INDEX.md` for the behavior specifications, and `docs/CONTEXT_PACK.md` for
+how evidence feeds into context packs.
+
+## Why the separation matters
+
+Keeping extraction and retrieval distinct makes it possible to:
+
+- Reuse the same extracted artifacts across many retrieval backends.
+- Compare backends against the same corpus and dataset inputs.
+- Record and audit retrieval decisions without mixing in prompting or context formatting.
+
+## Retrieval quality
+
+For retrieval quality upgrades, see `docs/RETRIEVAL_QUALITY.md`.
@@ -0,0 +1,74 @@
+# Retrieval evaluation
+
+Biblicus evaluates retrieval runs against deterministic datasets so quality comparisons are repeatable across backends
+and corpora. Evaluations keep the evidence-first model intact by reporting per-query evidence alongside summary
+metrics.
+
+## Dataset format
+
+Retrieval datasets are stored as JavaScript Object Notation files with a strict schema:
+
+```json
+{
+  "schema_version": 1,
+  "name": "example-dataset",
+  "description": "Small hand-labeled dataset for smoke tests.",
+  "queries": [
+    {
+      "query_id": "q-001",
+      "query_text": "alpha",
+      "expected_item_id": "item-id-123",
+      "kind": "gold"
+    }
+  ]
+}
+```
+
+Each query includes either an `expected_item_id` or an `expected_source_uri`. The `kind` field records whether the
+query is hand-labeled (`gold`) or synthetic.
+
+## Running an evaluation
+
+Use the command-line interface to evaluate a retrieval run against a dataset:
+
+```bash
+biblicus eval --corpus corpora/example --run <run_id> --dataset datasets/retrieval.json \
+  --max-total-items 5 --max-total-characters 2000 --max-items-per-source 5
+```
+
+If `--run` is omitted, the latest retrieval run is used. Evaluations are deterministic for the same corpus, run, and
+budget.
+
+## Output
+
+The evaluation output includes:
+
+- Dataset metadata (name, description, query count).
+- Run metadata (backend ID, run ID, evaluation timestamp).
+- Metrics (hit rate, precision-at-k, mean reciprocal rank).
+- System diagnostics (latency percentiles and index size).
+
+The output is JavaScript Object Notation suitable for downstream reporting.
+
+## Python usage
+
+```python
+from pathlib import Path
+
+from biblicus.corpus import Corpus
+from biblicus.evaluation import evaluate_run, load_dataset
+from biblicus.models import QueryBudget
+
+corpus = Corpus.open("corpora/example")
+run = corpus.load_run("<run_id>")
+dataset = load_dataset(Path("datasets/retrieval.json"))
+budget = QueryBudget(max_total_items=5, max_total_characters=2000, max_items_per_source=5)
+result = evaluate_run(corpus=corpus, run=run, dataset=dataset, budget=budget)
+print(result.model_dump_json(indent=2))
+```
+
+## Design notes
+
+- Evaluation is reproducible by construction: the run manifest, dataset, and budget fully determine the results.
+- The evaluation workflow expects retrieval stages to remain explicit in the run artifacts.
+- Reports are portable, so comparisons across backends and corpora are straightforward.
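The summary metrics named in the `docs/RETRIEVAL_EVALUATION.md` content above (hit rate, mean reciprocal rank) have standard definitions. A small illustrative computation, not code from the biblicus source:

```python
def hit_rate(ranked_results, expected):
    """Fraction of queries whose expected item appears anywhere in its ranking."""
    hits = sum(
        1 for query_id, ranking in ranked_results.items()
        if expected[query_id] in ranking
    )
    return hits / len(expected)


def mean_reciprocal_rank(ranked_results, expected):
    """Average of 1 / (1-based rank of the expected item); 0 when absent."""
    total = 0.0
    for query_id, ranking in ranked_results.items():
        if expected[query_id] in ranking:
            total += 1.0 / (ranking.index(expected[query_id]) + 1)
    return total / len(expected)
```

Both metrics are pure functions of the rankings and the dataset, which is what makes the evaluation deterministic for a fixed corpus, run, and budget.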
@@ -0,0 +1,42 @@
+# Retrieval quality upgrades
+
+This document describes the retrieval quality upgrades available in Biblicus. It is a reference for how retrieval
+quality is expressed in runs and should be read alongside `docs/ROADMAP.md`.
+
+## Goals
+
+- Improve relevance without losing determinism or reproducibility.
+- Keep retrieval stages explicit and visible in run artifacts.
+- Preserve the evidence-first output model.
+
+## Available upgrades
+
+### 1) Tuned lexical baseline
+
+- BM25-style scoring with configurable parameters.
+- N-gram range controls.
+- Stop word strategy per backend.
+- Field weighting (for example: title, body, metadata).
+
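The configurable BM25 parameters correspond to the `bm25_k1` and `bm25_b` recipe keys: k1 controls term-frequency saturation and b controls document-length normalization. A minimal single-term sketch of the standard BM25 formula, not the backend implementation:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """One term's BM25 contribution: IDF times a saturated, length-normalized TF."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    saturation = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * saturation
```

Raising k1 lets repeated occurrences of a term keep adding score; setting b to 0 removes the length penalty entirely.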
+### 2) Reranking stage
+
+- Optional rerank step that re-scores top-N candidates.
+- Deterministic scoring keeps rerank behavior reproducible.
+
+### 3) Hybrid retrieval
+
+- Combine lexical and embedding signals.
+- Expose fusion weights in the recipe schema.
+- Emit stage-level scores and weights in evidence metadata.
+
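Fusion can be as simple as a convex combination of per-item stage scores. A sketch under the assumption that both stages emit normalized scores, using a hypothetical helper that mirrors the `lexical_weight` and `embedding_weight` recipe keys:

```python
def fuse_scores(lexical, embedding, lexical_weight=0.7, embedding_weight=0.3):
    """Combine per-item lexical and embedding scores into one fused ranking."""
    if abs(lexical_weight + embedding_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    item_ids = set(lexical) | set(embedding)
    return {
        item_id: lexical_weight * lexical.get(item_id, 0.0)
        + embedding_weight * embedding.get(item_id, 0.0)
        for item_id in item_ids
    }
```

Keeping both input scores alongside the fused value is what lets evidence metadata report stage-level scores rather than a single opaque number.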
+## Evaluation guidance
+
+- Measure accuracy-at-k before and after each upgrade on the same evaluation datasets.
+- Run artifacts capture each stage and configuration for auditability.
+- Deterministic settings remain available as the default baseline.
+
+## Non-goals
+
+- Automated hyperparameter tuning.
+- Hidden fallback stages that obscure retrieval behavior.
+- UI-driven tuning in this phase.
@@ -31,6 +31,21 @@ Acceptance checks:
 - Dataset formats are versioned when they change.
 - Reports remain deterministic for the same inputs.
 
+## Next: retrieval quality upgrades
+
+Goal: make retrieval relevance stronger while keeping deterministic baselines and clear evaluation.
+
+Deliverables:
+
+- A tuned lexical baseline (for example: BM25 configuration, n-grams, field weighting, stop word controls).
+- A reranking stage that can refine top-N results with either a cross-encoder or an LLM re-ranker.
+- A hybrid retrieval mode that combines lexical signals with embeddings and exposes weights explicitly.
+
+Acceptance checks:
+
+- Accuracy-at-k improves on the same evaluation datasets without regressions in determinism.
+- Retrieval stages are explicitly recorded (retrieve, rerank, filter) in the output artifacts.
+
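A rerank stage of the kind described above can be sketched as a second scorer applied only to the top-N candidates, leaving the rest of the ranking untouched. This is a hypothetical shape; the deliverable deliberately leaves the scorer choice (cross-encoder or LLM) open:

```python
def rerank_top_n(candidates, rerank_score, top_n=2):
    """Re-order the first top_n candidates by a second score; keep the tail as-is."""
    head = sorted(candidates[:top_n], key=rerank_score, reverse=True)
    return head + candidates[top_n:]
```

As long as `rerank_score` is a pure function of the candidate, the rerank stage stays deterministic and reproducible.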
 ## Next: context pack policy surfaces
 
 Goal: make context shaping policies easier to evaluate and swap.
@@ -67,7 +82,6 @@ Goal: provide lightweight analysis utilities that summarize corpus themes and gu
 
 Deliverables:
 
-- Basic data profiling reports (counts, media types, size distributions, tag coverage).
 - Hidden Markov modeling analysis for sequence-driven corpora.
 - A way to compare analysis outputs across corpora or corpus snapshots.
 
@@ -120,12 +120,12 @@ title: My Document
 tags: [note, draft]
 ---
 
-This is the body content that
+This is the body content that is extracted.
 ```
 
 Output text:
 ```
-This is the body content that
+This is the body content that is extracted.
 ```
 
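The input/output pair above can be reproduced with a small front matter splitter. This is an illustrative sketch, not the pass-through extractor itself:

```python
def strip_front_matter(text):
    """Return the body after a leading YAML front matter block, if one is present."""
    if text.startswith("---\n"):
        end = text.find("\n---\n", 4)  # closing delimiter on its own line
        if end != -1:
            return text[end + len("\n---\n"):].lstrip("\n")
    return text
```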
 ### Mixed Format Pipeline
@@ -185,7 +185,7 @@ Non-text items are silently skipped (returns `None`). This allows the extractor
 
 ### Encoding Errors
 
-UTF-8 decoding errors
+UTF-8 decoding errors cause per-item failures recorded in `errored_items` but do not halt the entire extraction run.
 
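The per-item failure policy amounts to catching the decode error per item and recording it instead of raising. A generic sketch with a hypothetical `extract_all` helper; `errored_items` is the field the document names:

```python
def extract_all(items, extract_one):
    """Extract each item; record per-item failures without halting the run."""
    extracted = {}
    errored_items = {}
    for item_id, payload in items.items():
        try:
            extracted[item_id] = extract_one(payload)
        except UnicodeDecodeError as error:
            errored_items[item_id] = str(error)  # failure is recorded, run continues
    return extracted, errored_items
```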
 ### Missing Files
 
@@ -78,7 +78,7 @@ class UnstructuredExtractorConfig(BaseModel):
 
 ### Configuration Options
 
-This extractor currently accepts no configuration.
+This extractor currently accepts no configuration. Optional extensions may expose Unstructured library options.
 
 ## Usage
 
@@ -0,0 +1,253 @@
+Feature: Retrieval quality upgrades
+  Retrieval quality upgrades keep multi-stage retrieval explicit while improving relevance.
+
+  Scenario: Lexical tuning parameters are recorded in the retrieval recipe
+    Given I initialized a corpus at "corpus"
+    And a text file "alpha.md" exists with contents "alpha bravo charlie"
+    When I ingest the file "alpha.md" into corpus "corpus"
+    And I build a "sqlite-full-text-search" retrieval run in corpus "corpus" with config:
+      | key | value |
+      | chunk_size | 200 |
+      | chunk_overlap | 50 |
+      | snippet_characters | 120 |
+      | bm25_k1 | 1.2 |
+      | bm25_b | 0.75 |
+      | ngram_min | 1 |
+      | ngram_max | 2 |
+      | stop_words | english |
+      | field_weight_title | 2.0 |
+      | field_weight_body | 1.0 |
+      | field_weight_tags | 0.5 |
+    Then the latest run recipe config includes:
+      | key | value |
+      | bm25_k1 | 1.2 |
+      | bm25_b | 0.75 |
+      | ngram_min | 1 |
+      | ngram_max | 2 |
+      | stop_words | english |
+      | field_weight_title | 2.0 |
+      | field_weight_body | 1.0 |
+      | field_weight_tags | 0.5 |
+
+  Scenario: Lexical tuning rejects invalid ngram ranges
+    Given I initialized a corpus at "corpus"
+    When I attempt to build a "sqlite-full-text-search" retrieval run in corpus "corpus" with config:
+      | key | value |
+      | ngram_min | 2 |
+      | ngram_max | 1 |
+    Then the command fails with exit code 2
+    And standard error includes "ngram range"
+
+  Scenario: Stop words exclude common tokens from lexical retrieval
+    Given I initialized a corpus at "corpus"
+    And a text file "alpha.md" exists with contents "the zebra"
+    When I ingest the file "alpha.md" into corpus "corpus"
+    And I build a "sqlite-full-text-search" retrieval run in corpus "corpus" with config:
+      | key | value |
+      | stop_words | english |
+    And I query with the latest run for "the" and budget:
+      | key | value |
+      | max_total_items | 5 |
+      | max_total_characters | 2000 |
+      | max_items_per_source | 5 |
+    Then the query evidence count is 0
+
+  Scenario: Reranking produces explicit stage metadata
+    Given I initialized a corpus at "corpus"
+    And a text file "alpha.md" exists with contents "alpha bravo charlie"
+    And a text file "beta.md" exists with contents "alpha beta charlie"
+    When I ingest the file "alpha.md" into corpus "corpus"
+    And I ingest the file "beta.md" into corpus "corpus"
+    And I build a "sqlite-full-text-search" retrieval run in corpus "corpus" with config:
+      | key | value |
+      | rerank_enabled | true |
+      | rerank_model | cross-encoder |
+      | rerank_top_k | 2 |
+    And I query with the latest run for "alpha" and budget:
+      | key | value |
+      | max_total_items | 5 |
+      | max_total_characters | 2000 |
+      | max_items_per_source | 5 |
+    Then the query returns evidence with stage "rerank"
+    And the query evidence includes stage score "retrieve"
+    And the query evidence includes stage score "rerank"
+    And the query stats include reranked_candidates 2
+
+  Scenario: Hybrid retrieval records lexical and embedding scores
+    Given I initialized a corpus at "corpus"
+    And a text file "alpha.md" exists with contents "alpha bravo charlie"
+    When I ingest the file "alpha.md" into corpus "corpus"
+    And I build a "hybrid" retrieval run in corpus "corpus" with config:
+      | key | value |
+      | lexical_backend | sqlite-full-text-search |
+      | embedding_backend | vector |
+      | lexical_weight | 0.7 |
+      | embedding_weight | 0.3 |
+    And I query with the latest run for "alpha" and budget:
+      | key | value |
+      | max_total_items | 5 |
+      | max_total_characters | 2000 |
+      | max_items_per_source | 5 |
+    Then the query returns evidence with stage "hybrid"
+    And the query evidence includes stage score "lexical"
+    And the query evidence includes stage score "embedding"
+    And the query stats include fusion_weights "lexical=0.7,embedding=0.3"
+
+  Scenario: Hybrid retrieval rejects invalid weights
+    Given I initialized a corpus at "corpus"
+    When I attempt to build a "hybrid" retrieval run in corpus "corpus" with config:
+      | key | value |
+      | lexical_weight | 0.9 |
+      | embedding_weight | 0.9 |
+    Then the command fails with exit code 2
+    And standard error includes "weights must sum to 1"
+
+  Scenario: SQLite stop words reject invalid strings
+    Given I initialized a corpus at "corpus"
+    When I attempt to build a "sqlite-full-text-search" retrieval run in corpus "corpus" with config:
+      | key | value |
+      | stop_words | spanish |
+    Then the command fails with exit code 2
+    And standard error includes "stop_words"
+
+  Scenario: SQLite stop words accept explicit lists
+    When I validate sqlite full-text search stop words list:
+      | value |
+      | the |
+      | and |
+    Then the sqlite stop words include "the"
+
+  Scenario: SQLite stop words reject empty lists
+    When I attempt to validate sqlite full-text search stop words list:
+      | value |
+    Then a model validation error is raised
+    And the validation error mentions "stop_words list must not be empty"
+
+  Scenario: SQLite stop words reject empty entries
+    When I attempt to validate sqlite full-text search stop words list:
+      | value |
+      | |
+    Then a model validation error is raised
+    And the validation error mentions "stop_words list must contain non-empty strings"
+
+  Scenario: Rerank requires a model identifier
+    Given I initialized a corpus at "corpus"
+    When I attempt to build a "sqlite-full-text-search" retrieval run in corpus "corpus" with config:
+      | key | value |
+      | rerank_enabled | true |
+    Then the command fails with exit code 2
+    And standard error includes "rerank_model"
+
+  Scenario: Vector retrieval returns evidence
+    Given I initialized a corpus at "corpus"
+    And a text file "alpha.md" exists with contents "alpha bravo"
+    And a text file "plain.txt" exists with contents "alpha plain"
+    And a text file "delta.md" exists with contents "delta"
+    And a text file "punct.md" exists with contents "!!!"
+    And a binary file "data.bin" exists
+    When I ingest the file "alpha.md" into corpus "corpus"
+    And I ingest the file "plain.txt" into corpus "corpus"
+    And I ingest the file "delta.md" into corpus "corpus"
+    And I ingest the file "punct.md" into corpus "corpus"
+    And I ingest the file "data.bin" into corpus "corpus"
+    And I build a "vector" retrieval run in corpus "corpus"
+    And I query with the latest run for "alpha" and budget:
+      | key | value |
+      | max_total_items | 5 |
+      | max_total_characters | 2000 |
+      | max_items_per_source | 5 |
+    Then the query returns evidence with stage "vector"
+
+  Scenario: Vector retrieval handles longer queries
+    Given I initialized a corpus at "corpus"
+    And a text file "alpha.md" exists with contents "alpha"
+    When I ingest the file "alpha.md" into corpus "corpus"
+    And I build a "vector" retrieval run in corpus "corpus"
+    And I query with the latest run for "alpha bravo charlie" and budget:
+      | key | value |
+      | max_total_items | 5 |
+      | max_total_characters | 2000 |
+      | max_items_per_source | 5 |
+    Then the query returns evidence with stage "vector"
+
+  Scenario: Vector retrieval ignores empty queries
+    Given I initialized a corpus at "corpus"
+    And a text file "alpha.md" exists with contents "alpha"
+    When I ingest the file "alpha.md" into corpus "corpus"
+    And I build a "vector" retrieval run in corpus "corpus"
+    And I query with the latest run for "!!!" and budget:
+      | key | value |
+      | max_total_items | 5 |
+      | max_total_characters | 2000 |
+      | max_items_per_source | 5 |
+    Then the query evidence count is 0
+
+  Scenario: Vector retrieval uses extracted text
+    Given I initialized a corpus at "corpus"
+    And a text file "alpha.md" exists with contents "alpha bravo"
+    And a text file "whitespace.txt" exists with contents " "
+    And a binary file "data.bin" exists
+    When I ingest the file "alpha.md" into corpus "corpus"
+    And I ingest the file "whitespace.txt" into corpus "corpus"
+    And I ingest the file "data.bin" into corpus "corpus"
+    And I build a "pass-through-text" extraction run in corpus "corpus"
+    And I build a "vector" retrieval run in corpus "corpus" using the latest extraction run and config:
+      | key | value |
+      | snippet_characters | 120 |
+    And I query with the latest run for "alpha" and budget:
+      | key | value |
+      | max_total_items | 5 |
+      | max_total_characters | 2000 |
+      | max_items_per_source | 5 |
+    Then the latest run stats include text_items 2
+
+  Scenario: Vector retrieval rejects missing extraction runs
+    Given I initialized a corpus at "corpus"
+    When I attempt to build a "vector" retrieval run in corpus "corpus" with extraction run "missing:run"
+    Then the command fails with exit code 2
+    And standard error includes "Missing extraction run"
+
+  Scenario: Vector snippet helpers handle missing spans
+    When I compute a vector match span for text "alpha" with tokens "beta"
+    Then the vector match span is None
+    And the vector snippet for text "alpha" with span "None" and max chars 5 equals "alpha"
+
+  Scenario: Vector match spans ignore empty tokens
+    When I compute a vector match span for text "alpha" with tokens "alpha,,beta"
+    Then the vector match span is "0..5"
+
+  Scenario: Vector match spans prefer earlier tokens
+    When I compute a vector match span for text "alpha beta" with tokens "beta,alpha"
+    Then the vector match span is "0..5"
+
+  Scenario: Vector match spans ignore later tokens
+    When I compute a vector match span for text "alpha beta" with tokens "alpha,beta"
+    Then the vector match span is "0..5"
+
+  Scenario: Vector snippet helpers handle empty text
+    When I compute a vector match span for text "<empty>" with tokens "alpha"
+    Then the vector match span is None
+    And the vector snippet for text "<empty>" with span "None" and max chars 5 equals "<empty>"
+
+  Scenario: Hybrid backend rejects nested lexical backends
+    Given I initialized a corpus at "corpus"
+    When I attempt to build a "hybrid" retrieval run in corpus "corpus" with config:
+      | key | value |
+      | lexical_backend | hybrid |
+    Then the command fails with exit code 2
+    And standard error includes "lexical"
+
+  Scenario: Hybrid backend rejects nested embedding backends
+    Given I initialized a corpus at "corpus"
+    When I attempt to build a "hybrid" retrieval run in corpus "corpus" with config:
+      | key | value |
+      | lexical_backend | sqlite-full-text-search |
+      | embedding_backend | hybrid |
+    Then the command fails with exit code 2
+    And standard error includes "embedding"
+
+  Scenario: Hybrid query requires component runs
+    Given I initialized a corpus at "corpus"
+    When I attempt to query a hybrid run without component runs
+    Then a model validation error is raised
+    And the validation error mentions "Hybrid run missing lexical or embedding run identifiers"
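The match-span scenarios above pin down a first-occurrence rule: scan each token, keep the earliest match in the text, and skip empty tokens. That contract can be sketched as follows (hypothetical helper, not the `vector` backend code):

```python
def match_span(text, tokens):
    """Return (start, end) of the earliest token occurrence in text, or None."""
    best = None
    for token in tokens:
        if not token:
            continue  # empty tokens are ignored
        index = text.find(token)
        if index != -1 and (best is None or index < best[0]):
            best = (index, index + len(token))
    return best
```

Note that token order does not matter: "beta,alpha" and "alpha,beta" both resolve to the span of "alpha" because it occurs earliest in the text.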