biblicus 0.8.0__tar.gz → 0.10.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {biblicus-0.8.0/src/biblicus.egg-info → biblicus-0.10.0}/PKG-INFO +17 -10
- {biblicus-0.8.0 → biblicus-0.10.0}/README.md +14 -9
- biblicus-0.10.0/docs/ANALYSIS.md +47 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/DEMOS.md +20 -31
- biblicus-0.10.0/docs/PROFILING.md +98 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/ROADMAP.md +10 -54
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/TESTING.md +1 -1
- biblicus-0.10.0/docs/TOPIC_MODELING.md +159 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/conf.py +5 -8
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/index.rst +2 -0
- biblicus-0.10.0/features/analysis_schema.feature +110 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/environment.py +29 -5
- biblicus-0.10.0/features/profiling.feature +150 -0
- biblicus-0.10.0/features/steps/analysis_steps.py +389 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/cli_steps.py +13 -7
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/crawl_steps.py +6 -2
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/deepgram_steps.py +3 -11
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/docling_steps.py +2 -6
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/evidence_processing_steps.py +0 -1
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/extraction_run_lifecycle_steps.py +6 -2
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/extraction_steps.py +25 -6
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/inference_steps.py +12 -6
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/markitdown_steps.py +1 -3
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/openai_steps.py +3 -1
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/paddleocr_mock_steps.py +0 -1
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/paddleocr_vl_steps.py +17 -19
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/paddleocr_vl_unit_steps.py +10 -9
- biblicus-0.10.0/features/steps/profiling_steps.py +205 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/requests_mock_steps.py +32 -13
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/topic_modeling_steps.py +98 -7
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/user_config_steps.py +6 -7
- {biblicus-0.8.0 → biblicus-0.10.0}/features/topic_modeling.feature +170 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/pyproject.toml +5 -1
- biblicus-0.10.0/scripts/download_ag_news.py +150 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/scripts/download_audio_samples.py +9 -5
- {biblicus-0.8.0 → biblicus-0.10.0}/scripts/download_image_samples.py +0 -5
- {biblicus-0.8.0 → biblicus-0.10.0}/scripts/download_mixed_samples.py +0 -6
- {biblicus-0.8.0 → biblicus-0.10.0}/scripts/download_pdf_samples.py +0 -5
- {biblicus-0.8.0 → biblicus-0.10.0}/scripts/download_wikipedia.py +1 -5
- biblicus-0.10.0/scripts/profiling_demo.py +212 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/scripts/readme_end_to_end_demo.py +0 -1
- {biblicus-0.8.0 → biblicus-0.10.0}/scripts/test.py +0 -4
- {biblicus-0.8.0 → biblicus-0.10.0}/scripts/topic_modeling_integration.py +76 -14
- {biblicus-0.8.0 → biblicus-0.10.0}/scripts/wikipedia_rag_demo.py +3 -8
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/__init__.py +1 -1
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/_vendor/dotyaml/__init__.py +0 -1
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/_vendor/dotyaml/interpolation.py +0 -1
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/_vendor/dotyaml/loader.py +0 -1
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/_vendor/dotyaml/transformer.py +0 -1
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/analysis/__init__.py +2 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/analysis/models.py +268 -3
- biblicus-0.10.0/src/biblicus/analysis/profiling.py +337 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/analysis/topic_modeling.py +28 -7
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/backends/sqlite_full_text_search.py +2 -4
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/cli.py +83 -4
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/corpus.py +9 -3
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/evidence_processing.py +4 -2
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extraction.py +3 -1
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/markitdown_text.py +1 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/paddleocr_vl_text.py +1 -3
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/user_config.py +2 -6
- {biblicus-0.8.0 → biblicus-0.10.0/src/biblicus.egg-info}/PKG-INFO +17 -10
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus.egg-info/SOURCES.txt +7 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus.egg-info/requires.txt +3 -0
- biblicus-0.8.0/docs/TOPIC_MODELING.md +0 -82
- biblicus-0.8.0/features/analysis_schema.feature +0 -36
- biblicus-0.8.0/features/steps/analysis_steps.py +0 -194
- {biblicus-0.8.0 → biblicus-0.10.0}/LICENSE +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/MANIFEST.in +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/THIRD_PARTY_NOTICES.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/datasets/wikipedia_mini.json +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/ARCHITECTURE.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/BACKENDS.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/CONTEXT_PACK.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/CORPUS.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/CORPUS_DESIGN.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/EXTRACTION.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/FEATURE_INDEX.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/KNOWLEDGE_BASE.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/STT.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/USER_CONFIGURATION.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/api.rst +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/backends/index.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/backends/scan.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/backends/sqlite-full-text-search.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/index.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/ocr/index.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/ocr/paddleocr-vl.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/ocr/rapidocr.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/pipeline-utilities/index.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/pipeline-utilities/pipeline.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/pipeline-utilities/select-longest.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/pipeline-utilities/select-override.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/pipeline-utilities/select-smart-override.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/pipeline-utilities/select-text.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/speech-to-text/deepgram.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/speech-to-text/index.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/speech-to-text/openai.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/text-document/index.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/text-document/markitdown.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/text-document/metadata.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/text-document/pass-through.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/text-document/pdf.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/text-document/unstructured.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/vlm-document/docling-granite.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/vlm-document/docling-smol.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/docs/extractors/vlm-document/index.md +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/backend_validation.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/biblicus_corpus.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/cli_entrypoint.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/cli_parsing.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/cli_step_spec_parsing.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/content_sniffing.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/context_pack.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/context_pack_cli.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/corpus_edge_cases.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/corpus_identity.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/corpus_purge.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/crawl.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/docling_granite_extractor.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/docling_smol_extractor.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/error_cases.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/evaluation.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/evidence_processing.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/extraction_error_handling.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/extraction_run_lifecycle.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/extraction_selection.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/extraction_selection_longest.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/extractor_pipeline.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/extractor_validation.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/frontmatter.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/hook_config_validation.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/hook_error_handling.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/import_tree.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/inference_backend.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/ingest_sources.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/integration_audio_samples.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/integration_image_samples.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/integration_mixed_corpus.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/integration_mixed_extraction.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/integration_ocr_image_extraction.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/integration_pdf_retrieval.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/integration_pdf_samples.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/integration_unstructured_extraction.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/integration_wikipedia.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/knowledge_base.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/lifecycle_hooks.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/markitdown_extractor.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/model_validation.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/ocr_extractor.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/paddleocr_vl_extractor.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/paddleocr_vl_parse_api_response.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/pdf_text_extraction.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/python_api.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/python_hook_logging.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/query_processing.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/recipe_file_extraction.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/retrieval_budget.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/retrieval_scan.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/retrieval_sqlite_full_text_search.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/retrieval_uses_extraction_run.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/retrieval_utilities.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/select_override.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/smart_override_selection.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/source_loading.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/backend_steps.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/cli_parsing_steps.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/context_pack_steps.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/extractor_steps.py +1 -1
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/frontmatter_steps.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/knowledge_base_steps.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/model_steps.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/pdf_steps.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/python_api_steps.py +1 -1
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/rapidocr_steps.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/retrieval_steps.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/stt_deepgram_steps.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/stt_steps.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/steps/unstructured_steps.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/streaming_ingest.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/stt_deepgram_extractor.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/stt_extractor.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/text_extraction_runs.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/token_budget.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/unstructured_extractor.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/features/user_config.feature +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/setup.cfg +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/__main__.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/analysis/base.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/analysis/llm.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/analysis/schema.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/backends/__init__.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/backends/base.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/backends/scan.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/constants.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/context.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/crawl.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/errors.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/evaluation.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/__init__.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/base.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/deepgram_stt.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/docling_granite_text.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/docling_smol_text.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/metadata_text.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/openai_stt.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/pass_through_text.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/pdf_text.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/pipeline.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/rapidocr_text.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/select_longest_text.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/select_override.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/select_smart_override.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/select_text.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/extractors/unstructured_text.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/frontmatter.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/hook_logging.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/hook_manager.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/hooks.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/ignore.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/inference.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/knowledge_base.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/models.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/retrieval.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/sources.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/time.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus/uris.py +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus.egg-info/dependency_links.txt +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus.egg-info/entry_points.txt +0 -0
- {biblicus-0.8.0 → biblicus-0.10.0}/src/biblicus.egg-info/top_level.txt +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: biblicus
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.10.0
|
|
4
4
|
Summary: Command line interface and Python library for corpus ingestion, retrieval, and evaluation.
|
|
5
5
|
License: MIT
|
|
6
6
|
Requires-Python: >=3.9
|
|
@@ -40,6 +40,8 @@ Provides-Extra: docling-mlx
|
|
|
40
40
|
Requires-Dist: docling[mlx-vlm]>=2.0.0; extra == "docling-mlx"
|
|
41
41
|
Provides-Extra: topic-modeling
|
|
42
42
|
Requires-Dist: bertopic>=0.15.0; extra == "topic-modeling"
|
|
43
|
+
Provides-Extra: datasets
|
|
44
|
+
Requires-Dist: datasets>=2.18.0; extra == "datasets"
|
|
43
45
|
Dynamic: license-file
|
|
44
46
|
|
|
45
47
|
# Biblicus
|
|
@@ -529,10 +531,13 @@ For detailed documentation on all extractors, see the [Extractor Reference][extr
|
|
|
529
531
|
|
|
530
532
|
## Topic modeling analysis
|
|
531
533
|
|
|
532
|
-
Biblicus can run analysis pipelines on extracted text without changing the raw corpus.
|
|
533
|
-
|
|
534
|
-
|
|
535
|
-
JavaScript Object Notation.
|
|
534
|
+
Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Profiling and topic modeling
|
|
535
|
+
are the first analysis backends. Profiling summarizes corpus composition and extraction coverage. Topic modeling reads
|
|
536
|
+
an extraction run, optionally applies an LLM-driven extraction pass, applies lexical processing, runs BERTopic, and
|
|
537
|
+
optionally applies an LLM fine-tuning pass to label topics. The output is structured JavaScript Object Notation.
|
|
538
|
+
|
|
539
|
+
See `docs/ANALYSIS.md` for the analysis pipeline overview, `docs/PROFILING.md` for profiling, and
|
|
540
|
+
`docs/TOPIC_MODELING.md` for topic modeling details.
|
|
536
541
|
|
|
537
542
|
Run a topic analysis using a recipe file:
|
|
538
543
|
|
|
@@ -564,26 +569,28 @@ bertopic_analysis:
|
|
|
564
569
|
parameters:
|
|
565
570
|
min_topic_size: 8
|
|
566
571
|
nr_topics: 10
|
|
572
|
+
vectorizer:
|
|
573
|
+
ngram_range: [1, 2]
|
|
574
|
+
stop_words: english
|
|
567
575
|
llm_fine_tuning:
|
|
568
576
|
enabled: false
|
|
569
577
|
```
|
|
570
578
|
|
|
571
579
|
LLM extraction and fine-tuning require `biblicus[openai]` and a configured OpenAI API key.
|
|
572
580
|
Recipe files are validated strictly against the topic modeling schema, so type mismatches or unknown fields are errors.
|
|
581
|
+
AG News integration runs require `biblicus[datasets]` in addition to `biblicus[topic-modeling]`.
|
|
573
582
|
|
|
574
|
-
For a repeatable, real-world integration run that downloads
|
|
583
|
+
For a repeatable, real-world integration run that downloads AG News and executes topic modeling, use:
|
|
575
584
|
|
|
576
585
|
```
|
|
577
|
-
python3 scripts/topic_modeling_integration.py --corpus corpora/
|
|
586
|
+
python3 scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
|
|
578
587
|
```
|
|
579
588
|
|
|
580
589
|
See `docs/TOPIC_MODELING.md` for parameter examples and per-topic output behavior.
|
|
581
590
|
|
|
582
591
|
## Integration corpus and evaluation dataset
|
|
583
592
|
|
|
584
|
-
Use `scripts/
|
|
585
|
-
|
|
586
|
-
The dataset file `datasets/wikipedia_mini.json` provides a small evaluation set that matches the integration corpus.
|
|
593
|
+
Use `scripts/download_ag_news.py` to download the AG News dataset when running topic modeling demos. The repository does not include that content.
|
|
587
594
|
|
|
588
595
|
Use `scripts/download_pdf_samples.py` to download a small Portable Document Format integration corpus when running tests or demos. The repository does not include that content.
|
|
589
596
|
|
|
@@ -485,10 +485,13 @@ For detailed documentation on all extractors, see the [Extractor Reference][extr
|
|
|
485
485
|
|
|
486
486
|
## Topic modeling analysis
|
|
487
487
|
|
|
488
|
-
Biblicus can run analysis pipelines on extracted text without changing the raw corpus.
|
|
489
|
-
|
|
490
|
-
|
|
491
|
-
JavaScript Object Notation.
|
|
488
|
+
Biblicus can run analysis pipelines on extracted text without changing the raw corpus. Profiling and topic modeling
|
|
489
|
+
are the first analysis backends. Profiling summarizes corpus composition and extraction coverage. Topic modeling reads
|
|
490
|
+
an extraction run, optionally applies an LLM-driven extraction pass, applies lexical processing, runs BERTopic, and
|
|
491
|
+
optionally applies an LLM fine-tuning pass to label topics. The output is structured JavaScript Object Notation.
|
|
492
|
+
|
|
493
|
+
See `docs/ANALYSIS.md` for the analysis pipeline overview, `docs/PROFILING.md` for profiling, and
|
|
494
|
+
`docs/TOPIC_MODELING.md` for topic modeling details.
|
|
492
495
|
|
|
493
496
|
Run a topic analysis using a recipe file:
|
|
494
497
|
|
|
@@ -520,26 +523,28 @@ bertopic_analysis:
|
|
|
520
523
|
parameters:
|
|
521
524
|
min_topic_size: 8
|
|
522
525
|
nr_topics: 10
|
|
526
|
+
vectorizer:
|
|
527
|
+
ngram_range: [1, 2]
|
|
528
|
+
stop_words: english
|
|
523
529
|
llm_fine_tuning:
|
|
524
530
|
enabled: false
|
|
525
531
|
```
|
|
526
532
|
|
|
527
533
|
LLM extraction and fine-tuning require `biblicus[openai]` and a configured OpenAI API key.
|
|
528
534
|
Recipe files are validated strictly against the topic modeling schema, so type mismatches or unknown fields are errors.
|
|
535
|
+
AG News integration runs require `biblicus[datasets]` in addition to `biblicus[topic-modeling]`.
|
|
529
536
|
|
|
530
|
-
For a repeatable, real-world integration run that downloads
|
|
537
|
+
For a repeatable, real-world integration run that downloads AG News and executes topic modeling, use:
|
|
531
538
|
|
|
532
539
|
```
|
|
533
|
-
python3 scripts/topic_modeling_integration.py --corpus corpora/
|
|
540
|
+
python3 scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
|
|
534
541
|
```
|
|
535
542
|
|
|
536
543
|
See `docs/TOPIC_MODELING.md` for parameter examples and per-topic output behavior.
|
|
537
544
|
|
|
538
545
|
## Integration corpus and evaluation dataset
|
|
539
546
|
|
|
540
|
-
Use `scripts/
|
|
541
|
-
|
|
542
|
-
The dataset file `datasets/wikipedia_mini.json` provides a small evaluation set that matches the integration corpus.
|
|
547
|
+
Use `scripts/download_ag_news.py` to download the AG News dataset when running topic modeling demos. The repository does not include that content.
|
|
543
548
|
|
|
544
549
|
Use `scripts/download_pdf_samples.py` to download a small Portable Document Format integration corpus when running tests or demos. The repository does not include that content.
|
|
545
550
|
|
|
@@ -0,0 +1,47 @@
|
|
|
1
|
+
# Corpus analysis
|
|
2
|
+
|
|
3
|
+
Biblicus supports analysis backends that run on extracted text artifacts without changing the raw corpus. Analysis is a
|
|
4
|
+
pluggable phase that reads an extraction run, produces structured output, and stores artifacts under the corpus runs
|
|
5
|
+
folder. Each analysis backend declares its own configuration schema and output contract, and all schemas are validated
|
|
6
|
+
strictly.
|
|
7
|
+
|
|
8
|
+
## How analysis runs work
|
|
9
|
+
|
|
10
|
+
- Analysis runs are tied to a corpus state via the extraction run reference.
|
|
11
|
+
- The analysis output is written under `.biblicus/runs/analysis/<analysis-id>/<run_id>/`.
|
|
12
|
+
- Analysis is reproducible when you supply the same extraction run and corpus catalog state.
|
|
13
|
+
- Analysis configuration is stored as a recipe manifest in the run metadata.
|
|
14
|
+
|
|
15
|
+
If you omit the extraction run, Biblicus uses the most recent extraction run and emits a reproducibility warning. For
|
|
16
|
+
repeatable analysis runs, always pass the extraction run reference explicitly.
|
|
17
|
+
|
|
18
|
+
## Pluggable analysis backends
|
|
19
|
+
|
|
20
|
+
Analysis backends implement the `CorpusAnalysisBackend` interface and are registered under `biblicus.analysis`.
|
|
21
|
+
A backend receives the corpus, a recipe name, a configuration mapping, and an extraction run reference. It returns a
|
|
22
|
+
Pydantic model that is serialized to JavaScript Object Notation for storage.
|
|
23
|
+
|
|
24
|
+
## Topic modeling
|
|
25
|
+
|
|
26
|
+
Topic modeling is the first analysis backend. It uses BERTopic to cluster extracted text, produces per-topic evidence,
|
|
27
|
+
and optionally labels topics using an LLM. See `docs/TOPIC_MODELING.md` for detailed configuration and examples.
|
|
28
|
+
|
|
29
|
+
The integration demo script is a working reference you can use as a starting point:
|
|
30
|
+
|
|
31
|
+
```
|
|
32
|
+
python3 scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
|
|
33
|
+
```
|
|
34
|
+
|
|
35
|
+
The command prints the analysis run identifier and the output path. Open the resulting `output.json` to inspect per-topic
|
|
36
|
+
labels, keywords, and document examples.
|
|
37
|
+
|
|
38
|
+
## Profiling analysis
|
|
39
|
+
|
|
40
|
+
Profiling is the baseline analysis backend. It summarizes corpus composition and extraction coverage using
|
|
41
|
+
deterministic counts and distribution metrics. See `docs/PROFILING.md` for the full reference and working demo.
|
|
42
|
+
|
|
43
|
+
Run profiling from the CLI:
|
|
44
|
+
|
|
45
|
+
```
|
|
46
|
+
biblicus analyze profile --corpus corpora/example --extraction-run pipeline:RUN_ID
|
|
47
|
+
```
|
|
@@ -187,19 +187,26 @@ The output includes a `run_id` you can reuse when building a retrieval backend.
|
|
|
187
187
|
|
|
188
188
|
### Topic modeling integration run
|
|
189
189
|
|
|
190
|
-
Use the integration script to download
|
|
190
|
+
Use the integration script to download AG News, run extraction, and run topic modeling with a single command.
|
|
191
|
+
Install optional dependencies first:
|
|
191
192
|
|
|
192
193
|
```
|
|
193
|
-
python3
|
|
194
|
+
python3 -m pip install "biblicus[datasets,topic-modeling]"
|
|
194
195
|
```
|
|
195
196
|
|
|
196
|
-
|
|
197
|
+
```
|
|
198
|
+
python3 scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
|
|
199
|
+
```
|
|
200
|
+
|
|
201
|
+
Run with a larger corpus and a higher topic count:
|
|
197
202
|
|
|
198
203
|
```
|
|
199
204
|
python3 scripts/topic_modeling_integration.py \
|
|
200
|
-
--corpus corpora/
|
|
205
|
+
--corpus corpora/ag_news_demo \
|
|
201
206
|
--force \
|
|
202
|
-
--limit
|
|
207
|
+
--limit 10000 \
|
|
208
|
+
--vectorizer-ngram-min 1 \
|
|
209
|
+
--vectorizer-ngram-max 2 \
|
|
203
210
|
--bertopic-param nr_topics=8 \
|
|
204
211
|
--bertopic-param min_topic_size=2
|
|
205
212
|
```
|
|
@@ -207,6 +214,14 @@ python3 scripts/topic_modeling_integration.py \
|
|
|
207
214
|
The command prints the analysis run identifier and the output path. Open the `output.json` file to inspect per-topic labels,
|
|
208
215
|
keywords, and document examples.
|
|
209
216
|
|
|
217
|
+
### Profiling analysis demo
|
|
218
|
+
|
|
219
|
+
The profiling demo downloads AG News, runs extraction, and produces a profiling report.
|
|
220
|
+
|
|
221
|
+
```
|
|
222
|
+
python3 scripts/profiling_demo.py --corpus corpora/profiling_demo --force
|
|
223
|
+
```
|
|
224
|
+
|
|
210
225
|
### Select extracted text within a pipeline
|
|
211
226
|
|
|
212
227
|
When you want an explicit choice among multiple extraction outputs, add a selection extractor step at the end of the pipeline.
|
|
@@ -243,15 +258,6 @@ python3 -m biblicus build --corpus corpora/pdf_samples --backend sqlite-full-tex
|
|
|
243
258
|
python3 -m biblicus query --corpus corpora/pdf_samples --query "Dummy PDF file"
|
|
244
259
|
```
|
|
245
260
|
|
|
246
|
-
### Wikipedia retrieval demo (Python)
|
|
247
|
-
|
|
248
|
-
This example downloads a few Wikipedia summaries about retrieval and knowledge bases, builds an extraction run, creates a local full text index, and returns evidence plus a context pack.
|
|
249
|
-
|
|
250
|
-
```
|
|
251
|
-
rm -rf corpora/wikipedia_rag_demo
|
|
252
|
-
python3 scripts/wikipedia_rag_demo.py --corpus corpora/wikipedia_rag_demo --force
|
|
253
|
-
```
|
|
254
|
-
|
|
255
261
|
### MarkItDown extraction demo (Python 3.10+)
|
|
256
262
|
|
|
257
263
|
MarkItDown requires Python 3.10 or higher. This example uses the `py311` conda environment to run the extractor over the mixed sample corpus.
|
|
@@ -374,23 +380,6 @@ python3 -m biblicus build --corpus corpora/demo --backend sqlite-full-text-searc
|
|
|
374
380
|
python3 -m biblicus query --corpus corpora/demo --query "tiny"
|
|
375
381
|
```
|
|
376
382
|
|
|
377
|
-
### Evaluate a run against a dataset
|
|
378
|
-
|
|
379
|
-
The repository includes a small dataset that matches the Wikipedia integration corpus.
|
|
380
|
-
|
|
381
|
-
```
|
|
382
|
-
python3 -m biblicus eval --corpus corpora/demo --dataset datasets/wikipedia_mini.json
|
|
383
|
-
```
|
|
384
|
-
|
|
385
|
-
If you want the matching corpus content, download it first into a separate corpus.
|
|
386
|
-
|
|
387
|
-
```
|
|
388
|
-
rm -rf corpora/wikipedia
|
|
389
|
-
python3 scripts/download_wikipedia.py --corpus corpora/wikipedia --limit 5 --force
|
|
390
|
-
python3 -m biblicus build --corpus corpora/wikipedia --backend sqlite-full-text-search
|
|
391
|
-
python3 -m biblicus eval --corpus corpora/wikipedia --dataset datasets/wikipedia_mini.json
|
|
392
|
-
```
|
|
393
|
-
|
|
394
383
|
### Run the test suite and view coverage
|
|
395
384
|
|
|
396
385
|
```
|
|
@@ -0,0 +1,98 @@
|
|
|
1
|
+
# Corpus profiling analysis
|
|
2
|
+
|
|
3
|
+
Biblicus provides a profiling analysis backend that summarizes corpus contents using deterministic counts and
|
|
4
|
+
coverage metrics. Profiling is intended as a fast, local baseline before heavier analysis such as topic modeling.
|
|
5
|
+
|
|
6
|
+
## What profiling does
|
|
7
|
+
|
|
8
|
+
The profiling analysis reports:
|
|
9
|
+
|
|
10
|
+
- Total item count and media type distribution
|
|
11
|
+
- Extracted text coverage (present, empty, missing)
|
|
12
|
+
- Size and length distributions with percentiles
|
|
13
|
+
- Tag coverage and top tags
|
|
14
|
+
|
|
15
|
+
The output is structured JSON that can be stored, versioned, and compared across runs.
|
|
16
|
+
|
|
17
|
+
## Run profiling from the CLI
|
|
18
|
+
|
|
19
|
+
```
|
|
20
|
+
biblicus analyze profile --corpus corpora/example --extraction-run pipeline:RUN_ID
|
|
21
|
+
```
|
|
22
|
+
|
|
23
|
+
If you omit `--extraction-run`, Biblicus uses the latest extraction run and emits a reproducibility warning.
|
|
24
|
+
|
|
25
|
+
To customize profiling metrics, pass a recipe file:
|
|
26
|
+
|
|
27
|
+
```
|
|
28
|
+
biblicus analyze profile --corpus corpora/example --recipe recipes/profiling.yml --extraction-run pipeline:RUN_ID
|
|
29
|
+
```
|
|
30
|
+
|
|
31
|
+
### Profiling recipe configuration
|
|
32
|
+
|
|
33
|
+
Profiling recipes use the analysis schema version and accept these fields:
|
|
34
|
+
|
|
35
|
+
- `schema_version`: analysis schema version, currently `1`
|
|
36
|
+
- `sample_size`: optional cap for distribution calculations
|
|
37
|
+
- `min_text_characters`: minimum extracted text length for inclusion
|
|
38
|
+
- `percentiles`: percentiles to compute for size and length distributions
|
|
39
|
+
- `top_tag_count`: maximum number of tags to list in `top_tags`
|
|
40
|
+
- `tag_filters`: optional list of tags to include in tag coverage metrics
|
|
41
|
+
|
|
42
|
+
Example recipe:
|
|
43
|
+
|
|
44
|
+
```
|
|
45
|
+
schema_version: 1
|
|
46
|
+
sample_size: 500
|
|
47
|
+
min_text_characters: 50
|
|
48
|
+
percentiles: [50, 90, 99]
|
|
49
|
+
top_tag_count: 10
|
|
50
|
+
tag_filters: ["ag_news", "label:World"]
|
|
51
|
+
```
|
|
52
|
+
|
|
53
|
+
## Run profiling from Python
|
|
54
|
+
|
|
55
|
+
```
|
|
56
|
+
from pathlib import Path
|
|
57
|
+
|
|
58
|
+
from biblicus.analysis import get_analysis_backend
|
|
59
|
+
from biblicus.corpus import Corpus
|
|
60
|
+
from biblicus.models import ExtractionRunReference
|
|
61
|
+
|
|
62
|
+
corpus = Corpus.open(Path("corpora/example"))
|
|
63
|
+
backend = get_analysis_backend("profiling")
|
|
64
|
+
output = backend.run_analysis(
|
|
65
|
+
corpus,
|
|
66
|
+
recipe_name="default",
|
|
67
|
+
config={
|
|
68
|
+
"schema_version": 1,
|
|
69
|
+
"sample_size": 500,
|
|
70
|
+
"min_text_characters": 50,
|
|
71
|
+
"percentiles": [50, 90, 99],
|
|
72
|
+
"top_tag_count": 10,
|
|
73
|
+
"tag_filters": ["ag_news"],
|
|
74
|
+
},
|
|
75
|
+
extraction_run=ExtractionRunReference(
|
|
76
|
+
extractor_id="pipeline",
|
|
77
|
+
run_id="RUN_ID",
|
|
78
|
+
),
|
|
79
|
+
)
|
|
80
|
+
print(output.model_dump())
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
## Output location
|
|
84
|
+
|
|
85
|
+
Profiling output is stored under:
|
|
86
|
+
|
|
87
|
+
```
|
|
88
|
+
.biblicus/runs/analysis/profiling/<run_id>/output.json
|
|
89
|
+
```
|
|
90
|
+
|
|
91
|
+
## Working demo
|
|
92
|
+
|
|
93
|
+
A runnable demo is provided in `scripts/profiling_demo.py`. It downloads a corpus, runs extraction, and executes the
|
|
94
|
+
profiling analysis so you can inspect the output:
|
|
95
|
+
|
|
96
|
+
```
|
|
97
|
+
python3 scripts/profiling_demo.py --corpus corpora/profiling_demo --force
|
|
98
|
+
```
|
|
@@ -46,23 +46,20 @@ Acceptance checks:
|
|
|
46
46
|
- Behavior specifications cover policy selection and budgeting behaviors.
|
|
47
47
|
- Example outputs show how context packs differ across policies.
|
|
48
48
|
|
|
49
|
-
## Next: extraction
|
|
49
|
+
## Next: extraction evaluation harness
|
|
50
50
|
|
|
51
|
-
Goal:
|
|
51
|
+
Goal: compare extraction approaches in a way that is measurable, repeatable, and useful for practical engineering decisions.
|
|
52
52
|
|
|
53
53
|
Deliverables:
|
|
54
54
|
|
|
55
|
-
-
|
|
56
|
-
-
|
|
57
|
-
- A
|
|
58
|
-
- A consistent output contract that captures text plus optional confidence and per-page metadata.
|
|
59
|
-
- A selector policy for choosing between multiple extractor outputs in a pipeline.
|
|
60
|
-
- A shared evaluation harness for extraction backends using the same corpus and dataset.
|
|
55
|
+
- Dataset authoring workflow for extraction ground truth (for example: expected transcripts and expected OCR text).
|
|
56
|
+
- Evaluation metrics for accuracy, speed, and cost, including processable fraction for a given extractor recipe.
|
|
57
|
+
- A report format that can compare multiple extraction recipes against the same corpus and dataset.
|
|
61
58
|
|
|
62
59
|
Acceptance checks:
|
|
63
60
|
|
|
64
|
-
-
|
|
65
|
-
-
|
|
61
|
+
- Evaluation results are stable and reproducible for the same corpus and dataset inputs.
|
|
62
|
+
- Reports make it clear when an extractor fails to process an item versus producing empty output.
|
|
66
63
|
|
|
67
64
|
## Next: corpus analysis tools
|
|
68
65
|
|
|
@@ -70,41 +67,15 @@ Goal: provide lightweight analysis utilities that summarize corpus themes and gu
|
|
|
70
67
|
|
|
71
68
|
Deliverables:
|
|
72
69
|
|
|
73
|
-
-
|
|
74
|
-
-
|
|
75
|
-
- A way to compare
|
|
70
|
+
- Basic data profiling reports (counts, media types, size distributions, tag coverage).
|
|
71
|
+
- Hidden Markov modeling analysis for sequence-driven corpora.
|
|
72
|
+
- A way to compare analysis outputs across corpora or corpus snapshots.
|
|
76
73
|
|
|
77
74
|
Acceptance checks:
|
|
78
75
|
|
|
79
76
|
- Analysis is reproducible for the same corpus state.
|
|
80
77
|
- Reports are exportable and readable without custom tooling.
|
|
81
78
|
|
|
82
|
-
### Candidate backend ecosystem (for planning and evaluation)
|
|
83
|
-
|
|
84
|
-
Document understanding and OCR blur together at the interface level in Biblicus, so the roadmap treats them as extractor candidates with the same input/output contract.
|
|
85
|
-
|
|
86
|
-
Docling family candidates:
|
|
87
|
-
|
|
88
|
-
- Docling (document understanding with structured outputs)
|
|
89
|
-
- docling-ocr (OCR component in the Docling ecosystem)
|
|
90
|
-
|
|
91
|
-
General-purpose extraction candidates:
|
|
92
|
-
|
|
93
|
-
- Unstructured (element-oriented extraction for many formats)
|
|
94
|
-
- MarkItDown (lightweight conversion to Markdown)
|
|
95
|
-
- Kreuzberg (speed-focused extraction for bulk workflows)
|
|
96
|
-
- ExtractThinker (schema-driven extraction using Pydantic contracts)
|
|
97
|
-
|
|
98
|
-
Ecosystem adapters:
|
|
99
|
-
|
|
100
|
-
- LangChain document loaders (uniform loader interface across many sources)
|
|
101
|
-
|
|
102
|
-
### Guidance for choosing early targets
|
|
103
|
-
|
|
104
|
-
- If you need layout and table understanding, prioritize Docling and docling-ocr.
|
|
105
|
-
- If you need speed and simplicity, prioritize MarkItDown or Kreuzberg.
|
|
106
|
-
- If you need schema-first extraction, prioritize ExtractThinker layered on an OCR or document extractor.
|
|
107
|
-
|
|
108
79
|
## Later: alternate backends and hosting modes
|
|
109
80
|
|
|
110
81
|
Goal: broaden the backend surface while keeping the core predictable.
|
|
@@ -138,18 +109,3 @@ Acceptance checks:
|
|
|
138
109
|
|
|
139
110
|
- Behavior specifications cover ingestion, listing, and reindexing in memory.
|
|
140
111
|
- Retrieval and extraction can operate on the in-memory corpus without special casing.
|
|
141
|
-
|
|
142
|
-
### Extractor datasets and evaluation harness
|
|
143
|
-
|
|
144
|
-
Goal: compare extraction approaches in a way that is measurable, repeatable, and useful for practical engineering decisions.
|
|
145
|
-
|
|
146
|
-
Deliverables:
|
|
147
|
-
|
|
148
|
-
- Dataset authoring workflow for extraction ground truth (for example: expected transcripts and expected optical character recognition text).
|
|
149
|
-
- Evaluation metrics for accuracy, speed, and cost, including “processable fraction” for a given extractor recipe.
|
|
150
|
-
- A report format that can compare multiple extraction recipes against the same corpus and dataset.
|
|
151
|
-
|
|
152
|
-
Acceptance checks:
|
|
153
|
-
|
|
154
|
-
- Evaluation results are stable and reproducible for the same corpus and dataset inputs.
|
|
155
|
-
- Reports make it clear when an extractor fails to process an item versus producing empty output.
|
|
@@ -36,7 +36,7 @@ Integration scenarios are tagged `@integration`.
|
|
|
36
36
|
|
|
37
37
|
The repository does not include downloaded content. Integration scripts download content into a corpus path you choose and then ingest it for a test run.
|
|
38
38
|
|
|
39
|
-
-
|
|
39
|
+
- AG News dataset: `scripts/download_ag_news.py`
|
|
40
40
|
- Portable Document Format samples: `scripts/download_pdf_samples.py`
|
|
41
41
|
- Image samples: `scripts/download_image_samples.py`
|
|
42
42
|
- Mixed modality samples: `scripts/download_mixed_samples.py`
|
|
@@ -0,0 +1,159 @@
|
|
|
1
|
+
# Topic modeling
|
|
2
|
+
|
|
3
|
+
Biblicus provides a topic modeling analysis backend that reads extracted text artifacts, optionally applies an LLM
|
|
4
|
+
extraction pass, applies lexical processing, runs BERTopic, and optionally applies an LLM fine-tuning pass for
|
|
5
|
+
labels. The output is structured JavaScript Object Notation with explicit per-topic evidence.
|
|
6
|
+
|
|
7
|
+
## What topic modeling does
|
|
8
|
+
|
|
9
|
+
Topic modeling groups documents into clusters based on shared terms or phrases, then surfaces representative
|
|
10
|
+
keywords for each cluster. It is a fast way to summarize large corpora, identify dominant themes, and spot outliers
|
|
11
|
+
without manual labeling. Topic modeling is not a classifier; it is an exploratory tool that produces evidence that can
|
|
12
|
+
be inspected or reviewed by humans.
|
|
13
|
+
|
|
14
|
+
## About BERTopic
|
|
15
|
+
|
|
16
|
+
BERTopic combines document embeddings with clustering and a class-based term frequency approach to extract topic
|
|
17
|
+
keywords. Biblicus supports BERTopic as an optional dependency and forwards its configuration parameters directly to
|
|
18
|
+
the BERTopic constructor. This allows you to tune clustering behavior while keeping the output in a consistent
|
|
19
|
+
schema.
|
|
20
|
+
|
|
21
|
+
## Pipeline stages
|
|
22
|
+
|
|
23
|
+
- Text collection reads extracted text artifacts from an extraction run.
|
|
24
|
+
- LLM extraction optionally transforms each document into one or more analysis documents.
|
|
25
|
+
- Lexical processing optionally normalizes text before BERTopic.
|
|
26
|
+
- BERTopic produces topic assignments and keyword weights.
|
|
27
|
+
- LLM fine-tuning optionally replaces topic labels based on sampled documents.
|
|
28
|
+
|
|
29
|
+
## Output structure
|
|
30
|
+
|
|
31
|
+
Topic modeling writes a single `output.json` file under the analysis run directory. The output contains:
|
|
32
|
+
|
|
33
|
+
- `run.run_id` and `run.stats` for reproducible tracking.
|
|
34
|
+
- `report.topics` with the modeled topics.
|
|
35
|
+
- `report.text_collection`, `report.llm_extraction`, `report.lexical_processing`, `report.bertopic_analysis`,
|
|
36
|
+
and `report.llm_fine_tuning` describing each pipeline stage.
|
|
37
|
+
|
|
38
|
+
Each topic record includes:
|
|
39
|
+
|
|
40
|
+
- `topic_id`: The BERTopic topic identifier. The outlier topic uses `-1`.
|
|
41
|
+
- `label`: The human-readable label.
|
|
42
|
+
- `label_source`: `bertopic` or `llm` depending on the stage that set the label.
|
|
43
|
+
- `keywords`: Keyword list with weights.
|
|
44
|
+
- `document_count`: Number of documents assigned to the topic.
|
|
45
|
+
- `document_ids`: Item identifiers for the assigned documents.
|
|
46
|
+
- `document_examples`: Sampled document text used for inspection.
|
|
47
|
+
|
|
48
|
+
Per-topic behavior is determined by the BERTopic assignments and the optional fine-tuning stage. The lexical
|
|
49
|
+
processing flags can substantially change tokenization and therefore the resulting topic labels. The outlier
|
|
50
|
+
`topic_id` `-1` indicates documents that BERTopic could not confidently assign to a cluster.
|
|
51
|
+
|
|
52
|
+
## Configuration reference
|
|
53
|
+
|
|
54
|
+
Topic modeling recipes use a strict schema. Unknown fields or type mismatches are errors.
|
|
55
|
+
|
|
56
|
+
### Text source
|
|
57
|
+
|
|
58
|
+
- `text_source.sample_size`: Limit the number of documents used for analysis.
|
|
59
|
+
- `text_source.min_text_characters`: Drop documents shorter than this count.
|
|
60
|
+
|
|
61
|
+
### LLM extraction
|
|
62
|
+
|
|
63
|
+
- `llm_extraction.enabled`: Enable the LLM extraction stage.
|
|
64
|
+
- `llm_extraction.method`: `single` or `itemize` to control whether an input maps to one or many documents.
|
|
65
|
+
- `llm_extraction.client`: LLM client configuration (requires `biblicus[openai]`).
|
|
66
|
+
- `llm_extraction.prompt_template`: Prompt template for the extraction stage.
|
|
67
|
+
- `llm_extraction.system_prompt`: Optional system prompt.
|
|
68
|
+
|
|
69
|
+
### Lexical processing
|
|
70
|
+
|
|
71
|
+
- `lexical_processing.enabled`: Enable normalization.
|
|
72
|
+
- `lexical_processing.lowercase`: Lowercase text before tokenization.
|
|
73
|
+
- `lexical_processing.strip_punctuation`: Remove punctuation before tokenization.
|
|
74
|
+
- `lexical_processing.collapse_whitespace`: Normalize repeated whitespace.
|
|
75
|
+
|
|
76
|
+
### BERTopic configuration
|
|
77
|
+
|
|
78
|
+
- `bertopic_analysis.parameters`: Mapping of BERTopic constructor parameters.
|
|
79
|
+
- `bertopic_analysis.vectorizer.ngram_range`: Inclusive n-gram range (for example `[1, 2]`).
|
|
80
|
+
- `bertopic_analysis.vectorizer.stop_words`: `english` or a list of stop words. Set to `null` to disable.
|
|
81
|
+
|
|
82
|
+
### LLM fine-tuning
|
|
83
|
+
|
|
84
|
+
- `llm_fine_tuning.enabled`: Enable LLM topic labeling.
|
|
85
|
+
- `llm_fine_tuning.client`: LLM client configuration.
|
|
86
|
+
- `llm_fine_tuning.prompt_template`: Prompt template containing `{keywords}` and `{documents}`.
|
|
87
|
+
- `llm_fine_tuning.system_prompt`: Optional system prompt.
|
|
88
|
+
- `llm_fine_tuning.max_keywords`: Maximum keywords included per prompt.
|
|
89
|
+
- `llm_fine_tuning.max_documents`: Maximum documents included per prompt.
|
|
90
|
+
|
|
91
|
+
## Vectorizer configuration
|
|
92
|
+
|
|
93
|
+
Biblicus forwards BERTopic configuration through `bertopic_analysis.parameters` and exposes vectorizer settings
|
|
94
|
+
through `bertopic_analysis.vectorizer`. To include bigrams, set `ngram_range` to `[1, 2]`. To remove stop words,
|
|
95
|
+
set `stop_words` to `english` or a list.
|
|
96
|
+
|
|
97
|
+
```yaml
|
|
98
|
+
bertopic_analysis:
|
|
99
|
+
parameters:
|
|
100
|
+
min_topic_size: 10
|
|
101
|
+
nr_topics: 12
|
|
102
|
+
vectorizer:
|
|
103
|
+
ngram_range: [1, 2]
|
|
104
|
+
stop_words: english
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
## Repeatable integration script
|
|
108
|
+
|
|
109
|
+
The integration script downloads AG News, runs extraction, and then runs topic modeling with the selected
|
|
110
|
+
parameters. It prints a summary with the analysis run identifier and the output path.
|
|
111
|
+
|
|
112
|
+
```
|
|
113
|
+
python3 scripts/topic_modeling_integration.py --corpus corpora/ag_news_demo --force
|
|
114
|
+
```
|
|
115
|
+
|
|
116
|
+
### Example: raise topic count
|
|
117
|
+
|
|
118
|
+
```
|
|
119
|
+
python3 scripts/topic_modeling_integration.py \
|
|
120
|
+
--corpus corpora/ag_news_demo \
|
|
121
|
+
--force \
|
|
122
|
+
--limit 10000 \
|
|
123
|
+
--vectorizer-ngram-min 1 \
|
|
124
|
+
--vectorizer-ngram-max 2 \
|
|
125
|
+
--bertopic-param nr_topics=8 \
|
|
126
|
+
--bertopic-param min_topic_size=2
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
### Example: disable lexical processing and restrict inputs
|
|
130
|
+
|
|
131
|
+
```
|
|
132
|
+
python3 scripts/topic_modeling_integration.py \
|
|
133
|
+
--corpus corpora/ag_news_demo \
|
|
134
|
+
--force \
|
|
135
|
+
--sample-size 200 \
|
|
136
|
+
--min-text-characters 200 \
|
|
137
|
+
--no-lexical-enabled
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
### Example: keep lexical processing but preserve punctuation
|
|
141
|
+
|
|
142
|
+
```
|
|
143
|
+
python3 scripts/topic_modeling_integration.py \
|
|
144
|
+
--corpus corpora/ag_news_demo \
|
|
145
|
+
--force \
|
|
146
|
+
--no-lexical-strip-punctuation
|
|
147
|
+
```
|
|
148
|
+
|
|
149
|
+
BERTopic parameters are passed directly to the constructor. Use repeated `--bertopic-param key=value` pairs for
|
|
150
|
+
multiple parameters. Values that look like JSON objects or arrays are parsed as JSON.
|
|
151
|
+
|
|
152
|
+
The integration script requires at least 16 documents to avoid BERTopic default UMAP errors. Increase `--limit` or
|
|
153
|
+
use a larger corpus if you receive a small-corpus error.
|
|
154
|
+
|
|
155
|
+
AG News downloads require the `datasets` dependency. Install with:
|
|
156
|
+
|
|
157
|
+
```
|
|
158
|
+
python3 -m pip install "biblicus[datasets,topic-modeling]"
|
|
159
|
+
```
|
|
@@ -4,8 +4,13 @@ Sphinx configuration for Biblicus documentation.
|
|
|
4
4
|
|
|
5
5
|
from __future__ import annotations
|
|
6
6
|
|
|
7
|
+
import os
|
|
8
|
+
import sys
|
|
7
9
|
from pathlib import Path
|
|
8
10
|
|
|
11
|
+
from pygments.lexers.special import TextLexer
|
|
12
|
+
from sphinx.highlighting import lexers
|
|
13
|
+
|
|
9
14
|
PROJECT_ROOT = Path(__file__).resolve().parent.parent
|
|
10
15
|
SOURCE_ROOT = PROJECT_ROOT / "src"
|
|
11
16
|
|
|
@@ -31,8 +36,6 @@ html_theme_options = {
|
|
|
31
36
|
}
|
|
32
37
|
|
|
33
38
|
# ReadTheDocs integration - canonical URL for SEO
|
|
34
|
-
import os
|
|
35
|
-
|
|
36
39
|
if os.environ.get("READTHEDOCS"):
|
|
37
40
|
rtd_version = os.environ.get("READTHEDOCS_VERSION", "latest")
|
|
38
41
|
rtd_project = os.environ.get("READTHEDOCS_PROJECT", "biblicus")
|
|
@@ -44,12 +47,6 @@ source_suffix = {
|
|
|
44
47
|
}
|
|
45
48
|
|
|
46
49
|
suppress_warnings = ["misc.highlighting_failure"]
|
|
47
|
-
|
|
48
|
-
import sys
|
|
49
|
-
|
|
50
50
|
sys.path.insert(0, str(SOURCE_ROOT))
|
|
51
51
|
|
|
52
|
-
from pygments.lexers.special import TextLexer
|
|
53
|
-
from sphinx.highlighting import lexers
|
|
54
|
-
|
|
55
52
|
lexers["mermaid"] = TextLexer()
|