PyPI - pycorpdiff - Versions diffs - 0.1.0a0__tar.gz - Mend

pycorpdiff 0.1.0a0__tar.gz


Files changed (128)

pycorpdiff-0.1.0a0/.gitignore +60 -0
pycorpdiff-0.1.0a0/CHANGELOG.md +44 -0
pycorpdiff-0.1.0a0/CITATION.cff +49 -0
pycorpdiff-0.1.0a0/LICENSE +21 -0
pycorpdiff-0.1.0a0/PKG-INFO +230 -0
pycorpdiff-0.1.0a0/README.md +135 -0
pycorpdiff-0.1.0a0/pyproject.toml +200 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/__init__.py +126 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/_backends/__init__.py +3 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/_backends/pandas.py +3 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/_backends/polars.py +3 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/collocation/__init__.py +19 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/collocation/cooccurrence.py +65 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/collocation/measures.py +102 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/collocation/network.py +233 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/collocation/shift.py +146 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/compare.py +345 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/corpus.py +411 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/datasets/__init__.py +27 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/datasets/_data/hansard_sample.parquet +0 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/datasets/_generate_hansard.py +221 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/datasets/hansard.py +235 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/datasets/histwords.py +221 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/explain.py +177 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/io/__init__.py +16 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/io/duckdb.py +92 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/io/huggingface.py +142 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/io/readers.py +138 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/keyness/__init__.py +26 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/keyness/bayes.py +50 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/keyness/chi_squared.py +94 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/keyness/correction.py +34 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/keyness/dispersion.py +89 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/keyness/effect_sizes.py +65 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/keyness/loglikelihood.py +92 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/keyness/multicorpus.py +143 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/keyness/permutation.py +154 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/py.typed +0 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/results.py +635 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/semantic/__init__.py +18 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/semantic/alignment.py +53 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/semantic/embed.py +84 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/semantic/shift.py +224 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/semantic/trajectory.py +166 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/stats.py +69 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/temporal/__init__.py +15 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/temporal/bocpd.py +233 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/temporal/causal_impact.py +293 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/temporal/changepoint.py +92 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/temporal/forecast.py +405 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/temporal/its.py +123 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/temporal/slicing.py +174 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/tokenize.py +110 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/viz/__init__.py +37 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/viz/bocpd.py +173 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/viz/causal_impact.py +142 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/viz/collocation.py +48 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/viz/dispersion.py +117 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/viz/forecast.py +129 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/viz/keyness.py +96 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/viz/network.py +186 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/viz/scattertext.py +160 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/viz/semantic_forecast.py +114 -0
pycorpdiff-0.1.0a0/src/pycorpdiff/viz/trajectory.py +48 -0
pycorpdiff-0.1.0a0/tests/__init__.py +0 -0
pycorpdiff-0.1.0a0/tests/conftest.py +29 -0
pycorpdiff-0.1.0a0/tests/fixtures/__init__.py +0 -0
pycorpdiff-0.1.0a0/tests/integration/__init__.py +0 -0
pycorpdiff-0.1.0a0/tests/integration/test_collocation_integration.py +85 -0
pycorpdiff-0.1.0a0/tests/integration/test_crossval_histwords.py +174 -0
pycorpdiff-0.1.0a0/tests/integration/test_crossval_nltk.py +157 -0
pycorpdiff-0.1.0a0/tests/integration/test_crossval_quanteda.py +129 -0
pycorpdiff-0.1.0a0/tests/integration/test_crossval_rayson.py +171 -0
pycorpdiff-0.1.0a0/tests/integration/test_crossval_scattertext.py +110 -0
pycorpdiff-0.1.0a0/tests/integration/test_explain_integration.py +94 -0
pycorpdiff-0.1.0a0/tests/integration/test_keyness_integration.py +145 -0
pycorpdiff-0.1.0a0/tests/integration/test_sbert_slow.py +121 -0
pycorpdiff-0.1.0a0/tests/integration/test_semantic_integration.py +65 -0
pycorpdiff-0.1.0a0/tests/integration/test_stop_words.py +118 -0
pycorpdiff-0.1.0a0/tests/integration/test_temporal_stats.py +80 -0
pycorpdiff-0.1.0a0/tests/integration/test_viz.py +143 -0
pycorpdiff-0.1.0a0/tests/property/__init__.py +0 -0
pycorpdiff-0.1.0a0/tests/property/test_collocation_properties.py +106 -0
pycorpdiff-0.1.0a0/tests/property/test_keyness_properties.py +123 -0
pycorpdiff-0.1.0a0/tests/property/test_temporal_properties.py +101 -0
pycorpdiff-0.1.0a0/tests/unit/__init__.py +0 -0
pycorpdiff-0.1.0a0/tests/unit/test_bayes_factor.py +51 -0
pycorpdiff-0.1.0a0/tests/unit/test_bocpd.py +238 -0
pycorpdiff-0.1.0a0/tests/unit/test_causal_impact.py +283 -0
pycorpdiff-0.1.0a0/tests/unit/test_changepoint.py +63 -0
pycorpdiff-0.1.0a0/tests/unit/test_chi_squared.py +95 -0
pycorpdiff-0.1.0a0/tests/unit/test_collocation_cooccurrence.py +78 -0
pycorpdiff-0.1.0a0/tests/unit/test_collocation_measures.py +121 -0
pycorpdiff-0.1.0a0/tests/unit/test_collocation_shift.py +117 -0
pycorpdiff-0.1.0a0/tests/unit/test_comparison_concordance.py +200 -0
pycorpdiff-0.1.0a0/tests/unit/test_cooccurrence_network.py +218 -0
pycorpdiff-0.1.0a0/tests/unit/test_corpus_hash.py +82 -0
pycorpdiff-0.1.0a0/tests/unit/test_corpus_vocab.py +51 -0
pycorpdiff-0.1.0a0/tests/unit/test_correction.py +48 -0
pycorpdiff-0.1.0a0/tests/unit/test_datasets_hansard.py +80 -0
pycorpdiff-0.1.0a0/tests/unit/test_dispersion.py +74 -0
pycorpdiff-0.1.0a0/tests/unit/test_dispersion_plot.py +97 -0
pycorpdiff-0.1.0a0/tests/unit/test_doc_term_counts_sparse.py +135 -0
pycorpdiff-0.1.0a0/tests/unit/test_effect_sizes.py +80 -0
pycorpdiff-0.1.0a0/tests/unit/test_embedders.py +78 -0
pycorpdiff-0.1.0a0/tests/unit/test_explain.py +135 -0
pycorpdiff-0.1.0a0/tests/unit/test_forecast.py +296 -0
pycorpdiff-0.1.0a0/tests/unit/test_forecast_semantic_drift.py +206 -0
pycorpdiff-0.1.0a0/tests/unit/test_from_huggingface.py +153 -0
pycorpdiff-0.1.0a0/tests/unit/test_hansard_fetcher.py +222 -0
pycorpdiff-0.1.0a0/tests/unit/test_histwords_loader.py +188 -0
pycorpdiff-0.1.0a0/tests/unit/test_its.py +80 -0
pycorpdiff-0.1.0a0/tests/unit/test_keyness_multi.py +183 -0
pycorpdiff-0.1.0a0/tests/unit/test_loglikelihood.py +136 -0
pycorpdiff-0.1.0a0/tests/unit/test_ngram_tokenizer.py +167 -0
pycorpdiff-0.1.0a0/tests/unit/test_permutation_keyness.py +156 -0
pycorpdiff-0.1.0a0/tests/unit/test_polars_interop.py +136 -0
pycorpdiff-0.1.0a0/tests/unit/test_procrustes.py +58 -0
pycorpdiff-0.1.0a0/tests/unit/test_read_duckdb.py +148 -0
pycorpdiff-0.1.0a0/tests/unit/test_read_txt_line_mode.py +62 -0
pycorpdiff-0.1.0a0/tests/unit/test_result_exports.py +111 -0
pycorpdiff-0.1.0a0/tests/unit/test_scattertext_plot.py +217 -0
pycorpdiff-0.1.0a0/tests/unit/test_semantic_neighbours.py +93 -0
pycorpdiff-0.1.0a0/tests/unit/test_semantic_shift.py +123 -0
pycorpdiff-0.1.0a0/tests/unit/test_semantic_trajectory.py +173 -0
pycorpdiff-0.1.0a0/tests/unit/test_smoke.py +144 -0
pycorpdiff-0.1.0a0/tests/unit/test_temporal.py +147 -0
pycorpdiff-0.1.0a0/tests/unit/test_wilson_ci.py +71 -0

pycorpdiff-0.1.0a0/.gitignore ADDED

@@ -0,0 +1,60 @@
+# Python build artefacts
+__pycache__/
+*.py[cod]
+*.egg-info/
+dist/
+build/
+.eggs/
+# Virtual environments
+.venv/
+venv/
+env/
+.env
+# Test / type-check / lint caches
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+.coverage
+.coverage.*
+htmlcov/
+.tox/
+# Editor / OS cruft
+.DS_Store
+Thumbs.db
+.idea/
+.vscode/
+*.swp
+*.swo
+*~
+# AI workflow artefacts (kept local, never published)
+.claude/
+# Hypothesis example database (auto-managed)
+.hypothesis/
+# Jupyter checkpoints
+.ipynb_checkpoints/
+# Notebook outputs that aren't reviewed-as-source; the canonical notebooks
+# are executed in CI, not hand-edited with stale outputs.
+examples/*_executed.ipynb
+# Temp notebooks produced by scripts/render_notebooks_to_html.py
+examples/*.patched.ipynb
+# pyenv local override
+.python-version
+# Misc temp files
+*.tmp
+*.bak
+# Stray uv lockfiles created outside the repo root
+**/uv.lock.tmp
+# Mkdocs build output (legacy; mkdocs.yml itself is gone)
+site/

pycorpdiff-0.1.0a0/CHANGELOG.md ADDED

@@ -0,0 +1,44 @@
+# Changelog
+All notable changes to `pycorpdiff` are documented in this file. The format
+follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this
+project adheres to [Semantic Versioning](https://semver.org/).
+## [0.1.0a0] — initial release
+The initial public release of `pycorpdiff` — comparative corpus analysis
+for modern Python workflows. Three public verbs (`compare`, `track`,
+`compare.before_after`), nine `Result` dataclasses with a uniform
+six-method contract (`.to_df / .plot / .explain / .summary / .to_html /
+.to_json`), two `typing.Protocol` extension points (`Tokenizer`,
+`Embedder`), and opt-in extras for visualisation, semantic embedding,
+temporal modelling, polars interop, DuckDB ingestion, and 🤗 Datasets.
+### Analytical surface
+- **Keyness**: signed Dunning G², Pearson χ², Hardie LogRatio,
+  Gabrielatos %DIFF, BIC-Bayes factor, Juilland D / Gries DP dispersion
+  flagging, Benjamini–Hochberg correction, stop-word filtering,
+  empirical permutation *p*-values, N-way contingency G² via
+  `keyness_multi`.
+- **Collocations**: logDice, PMI, t-score, MI³ with Laplace smoothing;
+  cross-corpus `collocation_shift`; co-occurrence networks via
+  `cooccurrence_network`.
+- **Semantic shift**: averaged contextual embeddings, Procrustes
+  alignment, multi-period `semantic_trajectory`, `neighborhood_drift`.
+- **Temporal**: Wilson-CI trajectories, offline PELT changepoints,
+  online Bayesian changepoint detection, segmented-OLS interrupted
+  time series, Bayesian structural time-series causal impact,
+  state-space exponential-smoothing forecasting.
+### Cross-validated
+Numerically agrees with Rayson's LL Wizard (15 reference triples),
+NLTK's `BigramAssocMeasures` (≤ 1e-12 on PMI / t-score / MI³),
+Scattertext on the 2012 US conventions, `quanteda` via `rpy2`, and
+the HistWords COHA replication.
+### Infrastructure
+519 tests, `ruff` + `mypy --strict` clean across 55 source files,
+matrix CI on three Python versions × two operating systems.
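
Reviewer note on the CHANGELOG above: the keyness entry names signed Dunning G² and cross-validation against Rayson's LL Wizard. As a rough illustration of what that statistic computes, here is a minimal standard-library sketch of the two-cell log-likelihood form associated with Rayson's wizard, with a sign attached by comparing relative frequencies. The function name `signed_g2` and its argument names are assumptions made for this note; they are not taken from `keyness/loglikelihood.py`, whose body is not shown in this diff, and the package may well use a different cell layout or sign convention.

```python
import math


def signed_g2(a: int, b: int, n1: int, n2: int) -> float:
    """Signed log-likelihood (G²) for a single term (illustrative only).

    a, b   -- frequency of the term in the target / reference corpus
    n1, n2 -- total token counts of the target / reference corpus
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("corpus sizes must be positive")

    # Expected frequencies if the term had the same relative
    # frequency in both corpora (the null hypothesis).
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)

    # A zero observed count contributes nothing, since x*ln(x) -> 0 as x -> 0.
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    g2 *= 2.0

    # Assumed sign convention: positive means over-represented in the target.
    return g2 if a / n1 >= b / n2 else -g2


if __name__ == "__main__":
    # Term seen 20 times in a 1,000-token target and 10 times in a
    # 1,000-token reference corpus.
    print(signed_g2(20, 10, 1000, 1000))
```

Swapping the two corpora flips the sign and leaves the magnitude unchanged, which is the property that lets one ranked list show terms characteristic of either side.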

pycorpdiff-0.1.0a0/CITATION.cff ADDED

@@ -0,0 +1,49 @@
+cff-version: 1.2.0
+message: >
+  If you use pycorpdiff in academic work, please cite both the
+  software (this entry) and the accompanying Journal of Statistical
+  Software paper once it appears. The JSS manuscript is in
+  preparation; the draft will live in this repository as paper/paper.tex.
+title: "pycorpdiff: Comparative Corpus Analysis for Modern Python Workflows"
+version: 0.1.0a0
+date-released: 2026-05-22
+authors:
+  - family-names: Turner
+    given-names: Jason
+    email: jason.s.turner@gmail.com
+license: MIT
+repository-code: "https://github.com/jturner-uofl/pycorpdiff"
+keywords:
+  - corpus linguistics
+  - comparative corpus analysis
+  - keyness
+  - collocation
+  - semantic change
+  - diachronic nlp
+  - digital humanities
+  - computational social science
+  - reproducible research
+  - python
+abstract: >
+  pycorpdiff is a Python package for comparative and temporal corpus
+  analysis. It provides a coherent comparative layer over the
+  existing PyData and NLP stacks, unifying classical corpus
+  linguistics methods (keyness, collocations, dispersion) with
+  embedding-based semantic-shift analysis under a single, composable
+  API. The package targets corpus linguistics, digital humanities,
+  computational social science, and discourse analysis research,
+  emphasising interpretability, explainability, statistical rigour,
+  and reproducibility.
+preferred-citation:
+  type: article
+  authors:
+    - family-names: Turner
+      given-names: Jason
+  title: "pycorpdiff: Comparative Corpus Analysis for Modern Python Workflows"
+  journal: "Journal of Statistical Software"
+  year: 2026
+  status: in-preparation
+identifiers:
+  - type: url
+    value: "https://github.com/jturner-uofl/pycorpdiff"
+    description: Project repository

pycorpdiff-0.1.0a0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Jason Turner
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

pycorpdiff-0.1.0a0/PKG-INFO ADDED

@@ -0,0 +1,230 @@
+Metadata-Version: 2.4
+Name: pycorpdiff
+Version: 0.1.0a0
+Summary: Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference.
+Project-URL: Homepage, https://github.com/jturner-uofl/pycorpdiff
+Project-URL: Documentation, https://github.com/jturner-uofl/pycorpdiff
+Project-URL: Repository, https://github.com/jturner-uofl/pycorpdiff
+Project-URL: Issues, https://github.com/jturner-uofl/pycorpdiff/issues
+Author-email: Jason Turner <jason.s.turner@gmail.com>
+License: MIT License
+        Copyright (c) 2026 Jason Turner
+        Permission is hereby granted, free of charge, to any person obtaining a copy
+        of this software and associated documentation files (the "Software"), to deal
+        in the Software without restriction, including without limitation the rights
+        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+        copies of the Software, and to permit persons to whom the Software is
+        furnished to do so, subject to the following conditions:
+        The above copyright notice and this permission notice shall be included in all
+        copies or substantial portions of the Software.
+        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+        SOFTWARE.
+License-File: LICENSE
+Keywords: collocation,comparative corpus analysis,computational social science,corpus linguistics,diachronic nlp,digital humanities,discourse analysis,keyness,semantic change,temporal text analysis
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering :: Information Analysis
+Classifier: Topic :: Text Processing :: Linguistic
+Requires-Python: >=3.11
+Requires-Dist: numpy>=1.24
+Requires-Dist: pandas<3,>=2.0
+Requires-Dist: pyarrow>=14
+Requires-Dist: scipy>=1.11
+Provides-Extra: all
+Requires-Dist: altair>=5; extra == 'all'
+Requires-Dist: datasets>=2.14; extra == 'all'
+Requires-Dist: duckdb>=0.10; extra == 'all'
+Requires-Dist: matplotlib>=3.8; extra == 'all'
+Requires-Dist: networkx>=3.1; extra == 'all'
+Requires-Dist: polars>=1.0; extra == 'all'
+Requires-Dist: pyarrow>=15; extra == 'all'
+Requires-Dist: pysofra>=0.1.0a2; extra == 'all'
+Requires-Dist: ruptures>=1.1; extra == 'all'
+Requires-Dist: scikit-learn>=1.3; extra == 'all'
+Requires-Dist: sentence-transformers>=2.2; extra == 'all'
+Requires-Dist: spacy>=3.7; extra == 'all'
+Requires-Dist: statsmodels>=0.14; extra == 'all'
+Requires-Dist: vl-convert-python>=1.5; extra == 'all'
+Provides-Extra: dev
+Requires-Dist: hypothesis>=6.100; extra == 'dev'
+Requires-Dist: mypy>=1.8; extra == 'dev'
+Requires-Dist: pandas-stubs>=2.2; extra == 'dev'
+Requires-Dist: pre-commit>=3.6; extra == 'dev'
+Requires-Dist: pytest-cov>=4.1; extra == 'dev'
+Requires-Dist: pytest>=8; extra == 'dev'
+Requires-Dist: ruff>=0.4; extra == 'dev'
+Provides-Extra: duckdb
+Requires-Dist: duckdb>=0.10; extra == 'duckdb'
+Provides-Extra: huggingface
+Requires-Dist: datasets>=2.14; extra == 'huggingface'
+Provides-Extra: nlp
+Requires-Dist: spacy>=3.7; extra == 'nlp'
+Provides-Extra: notebooks
+Requires-Dist: jupyter>=1.0; extra == 'notebooks'
+Requires-Dist: pysofra>=0.1.0a2; extra == 'notebooks'
+Requires-Dist: vl-convert-python>=1.5; extra == 'notebooks'
+Provides-Extra: polars
+Requires-Dist: polars>=1.0; extra == 'polars'
+Requires-Dist: pyarrow>=15; extra == 'polars'
+Provides-Extra: semantic
+Requires-Dist: scikit-learn>=1.3; extra == 'semantic'
+Requires-Dist: sentence-transformers>=2.2; extra == 'semantic'
+Provides-Extra: temporal
+Requires-Dist: ruptures>=1.1; extra == 'temporal'
+Requires-Dist: statsmodels>=0.14; extra == 'temporal'
+Provides-Extra: viz
+Requires-Dist: altair>=5; extra == 'viz'
+Requires-Dist: matplotlib>=3.8; extra == 'viz'
+Requires-Dist: networkx>=3.1; extra == 'viz'
+Description-Content-Type: text/markdown
+# pycorpdiff
+<!--
+TODO post-publish (Phase 5 — once GitHub repo public + PyPI published + Zenodo DOI minted):
+[![PyPI](https://img.shields.io/pypi/v/pycorpdiff.svg)](https://pypi.org/project/pycorpdiff/)
+[![Python versions](https://img.shields.io/pypi/pyversions/pycorpdiff.svg)](https://pypi.org/project/pycorpdiff/)
+[![CI](https://github.com/jturner-uofl/pycorpdiff/actions/workflows/ci.yml/badge.svg)](https://github.com/jturner-uofl/pycorpdiff/actions/workflows/ci.yml)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.<RECORD>.svg)](https://doi.org/10.5281/zenodo.<RECORD>)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+-->
+**Comparative corpus analysis for modern Python workflows.**
+`pycorpdiff` is the **missing comparative layer** between R's
+[`quanteda`](https://quanteda.io/), the closed-source SketchEngine
+platform, and the fragmented Python NLP stack
+(`nltk`/`spaCy`/`gensim`/`sentence-transformers`). Three public verbs
+— `compare(a, b)`, `track(c, term)`, `compare.before_after(c, event)` —
+consolidate keyness, collocations, dispersion, temporal trajectories,
+changepoint detection, interrupted time series, causal-impact analysis,
+forecasting, online changepoint detection, and embedding-based semantic
+shift under a single notebook-native API. Every result carries its own
+KWIC evidence: `.explain(term)` returns the source-text concordances
+behind any ranked term.
+The package answers the questions corpus linguistics, digital humanities,
+and computational social science routinely have:
+- *How does corpus A differ from corpus B?* — `compare(a, b).keyness()`
+- *How has discourse around X evolved over time?* — `track(c, "x").over_time()`
+- *What did "migrant" mean in 2005 vs 2023?* — `compare(...).semantic_shift("migrant", embedder=...)`
+- *Did this event actually shift the conversation?* — `track(...).causal_impact(event_date=...)`
+- *Where is the discourse heading?* — `track(...).forecast(horizon=4)`
+`pycorpdiff` is positioned as **orchestration**, not reinvention.
+Tokenizers (`spaCy`, `Stanza`, `jieba`, `fugashi`) and embedders (any
+`SBERT`-compatible model) plug in via two `typing.Protocol` extension
+points — one-line adapters, no plugin registry. The base install pulls
+only `numpy`, `pandas`, `scipy`, and `pyarrow`; everything else is opt-in
+via extras.
+> **Status: pre-release alpha (0.1.0a0).** Public API is stable for the
+> features described below; PyPI publication is the next milestone.
+## The three-layer architecture
+| Layer | Purpose | Key surface |
+|---|---|---|
+| **1 — Ingestion + `Corpus`** | get text in, slice it, hash it | `from_dataframe`, `read_csv`, `read_parquet`, `read_txt`, `read_duckdb`, `from_huggingface`, `fetch_hansard`, `Corpus.slice/by_time/__hash__/doc_term_counts(_sparse)/to_polars` |
+| **2 — Pure math** | statistics with no I/O | `keyness.{log_likelihood,chi_squared,log_ratio,percent_diff,bayes_factor,permutation_pvalues,keyness_multi,juilland_d,benjamini_hochberg}`; `collocation.{logdice,pmi,t_score,mi_three,collocation_shift,cooccurrence_network}`; `semantic.{HashEmbedder,SBERTEmbedder,semantic_trajectory,neighborhood_drift}`; `temporal.{changepoints,interrupted_time_series,forecast,causal_impact,bocpd}` |
+| **3 — Verbs + Results** | public API | `compare`, `track`, `compare.before_after`, `keyness_multi`, plus 9 frozen-dataclass Result types each with `.to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json()` |
+## Quick start
+```python
+import pycorpdiff as pcd
+news = pcd.from_dataframe(df, text_col="body", meta_cols=("outlet", "date"))
+# Compare — three verbs
+k = pcd.compare(news.slice(outlet="Guardian"), news.slice(outlet="Mail")).keyness()
+c = pcd.compare(a, b).collocation_shift("migrant")
+s = pcd.compare(a, b).semantic_shift("migrant", embedder=pcd.SBERTEmbedder())
+# Track over time
+tr = pcd.track(news, "migrant").over_time(freq="Y")
+tr.changepoints()                                     # offline PELT
+tr.changepoints_online(hazard=1/24)                   # Bayesian online (Adams & MacKay 2007)
+tr.interrupted_time_series(event_date="2016-06-23")   # segmented OLS
+tr.causal_impact(event_date="2016-06-23")             # Bayesian counterfactual (Brodersen 2015)
+tr.forecast(horizon=4)                                # state-space ETS
+# Before / after a known event
+pcd.compare.before_after(news, event_date="2016-06-23").keyness()
+# N-way (≥ 2 corpora)
+pcd.keyness_multi([gu, ma, te, mi], labels=["Guardian", "Mail", "Telegraph", "Mirror"])
+# The discourse as a graph
+pcd.cooccurrence_network(news, top_n=50).plot()
+# Every Result: .to_df() · .plot() · .explain() · .summary() · .to_html() · .to_json()
+```
+See [`examples/pycorpdiff_showcase.ipynb`](examples/pycorpdiff_showcase.ipynb)
+([rendered HTML](docs/rendered/pycorpdiff_showcase.html)) for a
+walkthrough on a synthetic UK Hansard corpus exercising every analytical
+surface.
+## Installation
+<!-- TODO post-publish: replace this block with the PyPI install commands once published. -->
+Currently a pre-release alpha. From a local clone:
+```bash
+git clone https://github.com/jturner-uofl/pycorpdiff
+cd pycorpdiff
+pip install -e ".[dev]"
+pytest -q                          # 519 default tests, ~7s
+```
+Optional extras: `[viz]` (altair + matplotlib + networkx), `[semantic]`
+(sentence-transformers + scikit-learn), `[temporal]` (ruptures +
+statsmodels), `[polars]`, `[duckdb]`, `[huggingface]`, `[nlp]` (spaCy),
+`[notebooks]` (jupyter + vl-convert + pysofra, for the showcase),
+or `[all]`.
+## Cross-validation receipts
+The math agrees with the standard tools — by automated test:
+- **Rayson's LL Wizard** — 15 hand-derived contingency-table reference triples
+- **NLTK** `BigramAssocMeasures` — PMI + t-score to ≤ 1e-12 on every adjacent bigram
+- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012 US Conventions corpus
+- **quanteda (R)** via `rpy2` — byte-for-byte G² agreement (slow tier)
+- **HistWords (Hamilton et al. 2016)** — diachronic cosine displacements on COHA (slow tier)
+## Citation
+If you use `pycorpdiff` in academic work, please cite the software via
+the `CITATION.cff` file in this repository — GitHub renders a "Cite this
+repository" widget directly from it.
+## License
+MIT — see [LICENSE](LICENSE).
+## Further reading
+- [`docs/design.md`](docs/design.md) — three-layer architecture
+- [`docs/statistical-methods.md`](docs/statistical-methods.md) — every metric's formula + citation
+- [`examples/pycorpdiff_showcase.ipynb`](examples/pycorpdiff_showcase.ipynb) — full feature tour as a notebook
+- [`docs/rendered/`](docs/rendered/) — self-contained HTML renders of the example notebooks
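
Reviewer note on the embedded README above: it says tokenizers and embedders plug in through two `typing.Protocol` extension points with no plugin registry. The sketch below shows how structural typing makes that possible in general. The protocol shape (a callable from `str` to `list[str]`), the class `WhitespaceTokenizer`, and the helper `count_tokens` are all hypothetical; the real `Tokenizer` protocol lives in `src/pycorpdiff/tokenize.py`, whose contents are not shown in this diff and may differ.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Tokenizer(Protocol):
    """Hypothetical protocol: any callable mapping text to a list of tokens."""

    def __call__(self, text: str) -> list[str]: ...


class WhitespaceTokenizer:
    """Satisfies Tokenizer structurally: no inheritance, no registration."""

    def __call__(self, text: str) -> list[str]:
        return text.lower().split()


def count_tokens(text: str, tokenizer: Tokenizer) -> dict[str, int]:
    """Count tokens using whatever tokenizer the caller supplies."""
    counts: dict[str, int] = {}
    for token in tokenizer(text):
        counts[token] = counts.get(token, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_tokens("The cat saw the dog", WhitespaceTokenizer()))
```

Because the check is structural, wrapping a third-party tokenizer would need only a small adapter object exposing the same call signature, which is presumably what the README means by "one-line adapters".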

pycorpdiff-0.1.0a0/README.md ADDED

@@ -0,0 +1,135 @@
+# pycorpdiff
+<!--
+TODO post-publish (Phase 5 — once GitHub repo public + PyPI published + Zenodo DOI minted):
+[![PyPI](https://img.shields.io/pypi/v/pycorpdiff.svg)](https://pypi.org/project/pycorpdiff/)
+[![Python versions](https://img.shields.io/pypi/pyversions/pycorpdiff.svg)](https://pypi.org/project/pycorpdiff/)
+[![CI](https://github.com/jturner-uofl/pycorpdiff/actions/workflows/ci.yml/badge.svg)](https://github.com/jturner-uofl/pycorpdiff/actions/workflows/ci.yml)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.<RECORD>.svg)](https://doi.org/10.5281/zenodo.<RECORD>)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+-->
+**Comparative corpus analysis for modern Python workflows.**
+`pycorpdiff` is the **missing comparative layer** between R's
+[`quanteda`](https://quanteda.io/), the closed-source SketchEngine
+platform, and the fragmented Python NLP stack
+(`nltk`/`spaCy`/`gensim`/`sentence-transformers`). Three public verbs
+— `compare(a, b)`, `track(c, term)`, `compare.before_after(c, event)` —
+consolidate keyness, collocations, dispersion, temporal trajectories,
+changepoint detection, interrupted time series, causal-impact analysis,
+forecasting, online changepoint detection, and embedding-based semantic
+shift under a single notebook-native API. Every result carries its own
+KWIC evidence: `.explain(term)` returns the source-text concordances
+behind any ranked term.
+The package answers the questions corpus linguistics, digital humanities,
+and computational social science routinely have:
+- *How does corpus A differ from corpus B?* — `compare(a, b).keyness()`
+- *How has discourse around X evolved over time?* — `track(c, "x").over_time()`
+- *What did "migrant" mean in 2005 vs 2023?* — `compare(...).semantic_shift("migrant", embedder=...)`
+- *Did this event actually shift the conversation?* — `track(...).causal_impact(event_date=...)`
+- *Where is the discourse heading?* — `track(...).forecast(horizon=4)`
+`pycorpdiff` is positioned as **orchestration**, not reinvention.
+Tokenizers (`spaCy`, `Stanza`, `jieba`, `fugashi`) and embedders (any
+`SBERT`-compatible model) plug in via two `typing.Protocol` extension
+points — one-line adapters, no plugin registry. The base install pulls
+only `numpy`, `pandas`, `scipy`, and `pyarrow`; everything else is opt-in
+via extras.
+> **Status: pre-release alpha (0.1.0a0).** Public API is stable for the
+> features described below; PyPI publication is the next milestone.
+## The three-layer architecture
+| Layer | Purpose | Key surface |
+|---|---|---|
+| **1 — Ingestion + `Corpus`** | get text in, slice it, hash it | `from_dataframe`, `read_csv`, `read_parquet`, `read_txt`, `read_duckdb`, `from_huggingface`, `fetch_hansard`, `Corpus.slice/by_time/__hash__/doc_term_counts(_sparse)/to_polars` |
+| **2 — Pure math** | statistics with no I/O | `keyness.{log_likelihood,chi_squared,log_ratio,percent_diff,bayes_factor,permutation_pvalues,keyness_multi,juilland_d,benjamini_hochberg}`; `collocation.{logdice,pmi,t_score,mi_three,collocation_shift,cooccurrence_network}`; `semantic.{HashEmbedder,SBERTEmbedder,semantic_trajectory,neighborhood_drift}`; `temporal.{changepoints,interrupted_time_series,forecast,causal_impact,bocpd}` |
+| **3 — Verbs + Results** | public API | `compare`, `track`, `compare.before_after`, `keyness_multi`, plus 9 frozen-dataclass Result types each with `.to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json()` |
+## Quick start
+```python
+import pycorpdiff as pcd
+news = pcd.from_dataframe(df, text_col="body", meta_cols=("outlet", "date"))
+# Compare — three verbs
+k = pcd.compare(news.slice(outlet="Guardian"), news.slice(outlet="Mail")).keyness()
+c = pcd.compare(a, b).collocation_shift("migrant")
+s = pcd.compare(a, b).semantic_shift("migrant", embedder=pcd.SBERTEmbedder())
+# Track over time
+tr = pcd.track(news, "migrant").over_time(freq="Y")
+tr.changepoints()                                     # offline PELT
+tr.changepoints_online(hazard=1/24)                   # Bayesian online (Adams & MacKay 2007)
+tr.interrupted_time_series(event_date="2016-06-23")   # segmented OLS
+tr.causal_impact(event_date="2016-06-23")             # Bayesian counterfactual (Brodersen 2015)
+tr.forecast(horizon=4)                                # state-space ETS
+# Before / after a known event
+pcd.compare.before_after(news, event_date="2016-06-23").keyness()
+# N-way (≥ 2 corpora)
+pcd.keyness_multi([gu, ma, te, mi], labels=["Guardian", "Mail", "Telegraph", "Mirror"])
+# The discourse as a graph
+pcd.cooccurrence_network(news, top_n=50).plot()
+# Every Result: .to_df() · .plot() · .explain() · .summary() · .to_html() · .to_json()
+```
+See [`examples/pycorpdiff_showcase.ipynb`](examples/pycorpdiff_showcase.ipynb)
+([rendered HTML](docs/rendered/pycorpdiff_showcase.html)) for a
+walkthrough on a synthetic UK Hansard corpus exercising every analytical
+surface.
+## Installation
+<!-- TODO post-publish: replace this block with the PyPI install commands once published. -->
+Currently a pre-release alpha. From a local clone:
+```bash
+git clone https://github.com/jturner-uofl/pycorpdiff
+cd pycorpdiff
+pip install -e ".[dev]"
+pytest -q                          # 519 default tests, ~7s
+```
+Optional extras: `[viz]` (altair + matplotlib + networkx), `[semantic]`
+(sentence-transformers + scikit-learn), `[temporal]` (ruptures +
+statsmodels), `[polars]`, `[duckdb]`, `[huggingface]`, `[nlp]` (spaCy),
+`[notebooks]` (jupyter + vl-convert + pysofra, for the showcase),
+or `[all]`.
+## Cross-validation receipts
+The math agrees with the standard tools — by automated test:
+- **Rayson's LL Wizard** — 15 hand-derived contingency-table reference triples
+- **NLTK** `BigramAssocMeasures` — PMI + t-score to ≤ 1e-12 on every adjacent bigram
+- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012 US Conventions corpus
+- **quanteda (R)** via `rpy2` — byte-for-byte G² agreement (slow tier)
+- **HistWords (Hamilton et al. 2016)** — diachronic cosine displacements on COHA (slow tier)
+## Citation
+If you use `pycorpdiff` in academic work, please cite the software via
+the `CITATION.cff` file in this repository — GitHub renders a "Cite this
+repository" widget directly from it.
+## License
+MIT — see [LICENSE](LICENSE).
+## Further reading
+- [`docs/design.md`](docs/design.md) — three-layer architecture
+- [`docs/statistical-methods.md`](docs/statistical-methods.md) — every metric's formula + citation
+- [`examples/pycorpdiff_showcase.ipynb`](examples/pycorpdiff_showcase.ipynb) — full feature tour as a notebook
+- [`docs/rendered/`](docs/rendered/) — self-contained HTML renders of the example notebooks
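
Reviewer note on the README above: the architecture table lists `logdice`, `pmi`, and `t_score` among the collocation measures. For orientation, here is a minimal sketch of the textbook definitions of those three measures from raw counts, using only the standard library. The function and parameter names are chosen for this note and are not taken from `collocation/measures.py`; the CHANGELOG mentions Laplace smoothing, which this sketch deliberately omits.

```python
import math


def pmi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Pointwise mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(f_xy * n / (f_x * f_y))


def logdice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice: 14 + log2 of the Dice coefficient; does not depend on corpus size."""
    return 14.0 + math.log2(2.0 * f_xy / (f_x + f_y))


def t_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """t-score: (observed - expected) scaled by the square root of observed."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


if __name__ == "__main__":
    # Node word seen 100 times, collocate seen 100 times, together 10 times,
    # in a corpus of 10,000 tokens (all counts invented for illustration).
    print(pmi(10, 100, 100, 10_000))
    print(logdice(10, 100, 100))
    print(t_score(10, 100, 100, 10_000))
```

All three functions assume a non-zero co-occurrence count; an unsmoothed implementation has to guard against `f_xy == 0`, which is one reason a smoothed variant is attractive.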