google-ngrams 0.1.0__tar.gz → 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {google_ngrams-0.1.0 → google_ngrams-0.2.0}/LICENSE +21 -60
- {google_ngrams-0.1.0/google_ngrams.egg-info → google_ngrams-0.2.0}/PKG-INFO +38 -20
- {google_ngrams-0.1.0 → google_ngrams-0.2.0}/README.rst +18 -10
- google_ngrams-0.2.0/google_ngrams/__init__.py +19 -0
- {google_ngrams-0.1.0 → google_ngrams-0.2.0}/google_ngrams/data/__init__.py +1 -1
- google_ngrams-0.2.0/google_ngrams/ngrams.py +341 -0
- google_ngrams-0.2.0/google_ngrams/scatter_helpers.py +187 -0
- google_ngrams-0.2.0/google_ngrams/vnc.py +518 -0
- google_ngrams-0.2.0/google_ngrams/vnc_helpers.py +809 -0
- {google_ngrams-0.1.0 → google_ngrams-0.2.0/google_ngrams.egg-info}/PKG-INFO +38 -20
- {google_ngrams-0.1.0 → google_ngrams-0.2.0}/google_ngrams.egg-info/SOURCES.txt +7 -6
- google_ngrams-0.2.0/google_ngrams.egg-info/requires.txt +13 -0
- {google_ngrams-0.1.0 → google_ngrams-0.2.0}/pyproject.toml +25 -14
- google_ngrams-0.2.0/tests/test_ngrams.py +56 -0
- google_ngrams-0.2.0/tests/test_ngrams_realdata.py +62 -0
- google_ngrams-0.2.0/tests/test_vnc.py +85 -0
- google_ngrams-0.2.0/tests/test_vnc_compat.py +178 -0
- google_ngrams-0.1.0/.github/workflows/ci.yml +0 -98
- google_ngrams-0.1.0/.gitignore +0 -174
- google_ngrams-0.1.0/_quarto/_quarto.yml +0 -104
- google_ngrams-0.1.0/docs/.gitkeep +0 -1
- google_ngrams-0.1.0/docs/google_ngrams.ipynb +0 -913
- google_ngrams-0.1.0/google_ngrams/__init__.py +0 -16
- google_ngrams-0.1.0/google_ngrams/ngrams.py +0 -209
- google_ngrams-0.1.0/google_ngrams/vnc.py +0 -1123
- google_ngrams-0.1.0/google_ngrams.egg-info/requires.txt +0 -4
- {google_ngrams-0.1.0 → google_ngrams-0.2.0}/google_ngrams/data/googlebooks_eng_all_totalcounts_20120701.parquet +0 -0
- {google_ngrams-0.1.0 → google_ngrams-0.2.0}/google_ngrams/data/googlebooks_eng_gb_all_totalcounts_20120701.parquet +0 -0
- {google_ngrams-0.1.0 → google_ngrams-0.2.0}/google_ngrams/data/googlebooks_eng_us_all_totalcounts_20120701.parquet +0 -0
- {google_ngrams-0.1.0 → google_ngrams-0.2.0}/google_ngrams.egg-info/dependency_links.txt +0 -0
- {google_ngrams-0.1.0 → google_ngrams-0.2.0}/google_ngrams.egg-info/top_level.txt +0 -0
- {google_ngrams-0.1.0 → google_ngrams-0.2.0}/setup.cfg +0 -0
{google_ngrams-0.1.0 → google_ngrams-0.2.0}/LICENSE
@@ -1,63 +1,24 @@
-                                 Apache License
-                           Version 2.0, January 2004
-                        http://www.apache.org/licenses/
-
-   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
-
-   1. Definitions.
-
-      "License" shall mean the terms and conditions for use, reproduction,
-      and distribution as defined by Sections 1 through 9 of this document.
-
-      "Licensor" shall mean the copyright owner or entity authorized by
-      the copyright owner that is granting the License.
-
-      "Legal Entity" shall mean the union of the acting entity and all
-      other entities that control, are controlled by, or are under common
-      control with that entity. For the purposes of this definition,
-      "control" means (i) the power, direct or indirect, to cause the
-      direction or management of such entity, whether by contract or
-      otherwise, or (ii) ownership of fifty percent (50%) or more of the
-      outstanding shares, or (iii) beneficial ownership of such entity.
-
-      "You" (or "Your") shall mean an individual or Legal Entity
-      exercising permissions granted by this License.
-
-      "Source" form shall mean the preferred form for making modifications,
-      including but not limited to software source code, documentation
-      source, and configuration files.
-
-      "Object" form shall mean any form resulting from mechanical
-      transformation or translation of a Source form, including but
-      not limited to compiled object code, generated documentation,
-      and conversions to other media types.
-
-      "Work" shall mean the work of authorship, whether in Source or
-      Object form, made available under the License, as indicated by a
-      copyright notice that is included in or attached to the work
-      (an example is provided in the Appendix below).
-
-      "Derivative Works" shall mean any work, whether in Source or Object
-      form, that is based on (or derived from) the Work and for which the
-      editorial revisions, annotations, elaborations, or other modifications
-      represent, as a whole, an original work of authorship. For the purposes
-      of this License, Derivative Works shall not include works that remain
-      separable from, or merely link (or bind by name) to the interfaces of,
-      the Work and Derivative Works thereof.
-
-      "Contribution" shall mean any work of authorship, including
-      the original version of the Work and any modifications or additions
-      to that Work or Derivative Works thereof, that is intentionally
-      submitted to Licensor for inclusion in the Work by the copyright owner
-      or by an individual or Legal Entity authorized to submit on behalf of
-      the copyright owner. For the purposes of this definition, "submitted"
-      means any form of electronic, verbal, or written communication sent
-      to the Licensor or its representatives, including but not limited to
-      communication on electronic mailing lists, source code control systems,
-      and issue tracking systems that are managed by, or on behalf of, the
-      Licensor for the purpose of discussing and improving the Work, but
-      excluding communication that is conspicuously marked or otherwise
-      designated in writing by the copyright owner as "Not a Contribution."
+MIT License
+
+Copyright (c) 2025 David Brown
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
 
       "Contributor" shall mean Licensor and any individual or Legal Entity
       on behalf of whom a Contribution has been received by Licensor and
{google_ngrams-0.1.0/google_ngrams.egg-info → google_ngrams-0.2.0}/PKG-INFO
@@ -1,30 +1,40 @@
-Metadata-Version: 2.
+Metadata-Version: 2.4
 Name: google_ngrams
-Version: 0.
+Version: 0.2.0
 Summary: Fetch and analyze Google Ngram data for specified word forms.
 Author-email: David Brown <dwb2@andrew.cmu.edu>
 Maintainer-email: David Brown <dwb2@andrew.cmu.edu>
+License-Expression: MIT
 Project-URL: Documentation, https://browndw.github.io/google_ngrams
-Project-URL: Homepage, https://github.com/browndw/
+Project-URL: Homepage, https://github.com/browndw/google_ngrams
 Keywords: nlp,language
 Classifier: Programming Language :: Python :: 3 :: Only
-Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
-Classifier:
-
+Classifier: Intended Audience :: Science/Research
+Classifier: Topic :: Text Processing :: Linguistic
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.10
 Description-Content-Type: text/x-rst
 License-File: LICENSE
-Requires-Dist:
+Requires-Dist: numpy>=1.22
 Requires-Dist: matplotlib>=3.5
 Requires-Dist: polars>=1.17
-
-
-
-
+Provides-Extra: test
+Requires-Dist: pytest>=7.0; extra == "test"
+Requires-Dist: pytest-cov>=4.0; extra == "test"
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0; extra == "dev"
+Requires-Dist: build>=1.0.0; extra == "dev"
+Requires-Dist: twine>=5.0.0; extra == "dev"
+Dynamic: license-file
+
+
+google_ngrams
 =======================================================================================================
-|pypi| |pypi_downloads|
+|pypi| |pypi_downloads| |tests|
 
 This package has functions for processing `Google’s Ngram repositories <http://storage.googleapis.com/books/ngrams/books/datasetsv2.html>`_ without having to download them locally. These repositories vary in their size, but the larger ones (like th one for the letter *s* or common bigrams) contain multiple gigabytes.
 
@@ -33,15 +43,20 @@ The main function uses `scan_csv from the polars <https://docs.pola.rs/api/pytho
 vnc
 ---
 
-To analyze the returned data, the package
+To analyze the returned data, the package also contains functions based on the work of Gries and Hilpert (2012) for `Variability-Based Neighbor Clustering <https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199922765.001.0001/oxfordhb-9780199922765-e-14>`_.
 
-The idea is to use hierarchical clustering to aid "bottom up
+The idea is to use hierarchical clustering to aid "bottom up" periodization of language change. The python functions are built on `their original R code <http://global.oup.com/us/companion.websites/fdscontent/uscompanion/us/static/companion.websites/nevalainen/Gries-Hilpert_web_final/vnc.individual.html>`_.
 
 Distances, therefore, are calculated in sums of standard deviations and coefficients of variation, according to their stated method.
 
-Dendrograms are plotted using matplotlib,
+Dendrograms are plotted using matplotlib, with custom implementations for hierarchical clustering that maintain the plotting order of the leaves according to the requirements of the method.
+
+The package also has a custom implementation of dendrogram truncation that consolidates leaves under a specified number of time periods (or clusters) while also maintaining the leaf order to facilitate the reading and interpretation of large dendrograms.
+
+Lightweight Implementation
+--------------------------
 
-
+Starting with version 0.2.0, google_ngrams uses lightweight, custom implementations for statistical computations instead of heavy dependencies like scipy and statsmodels. This design choice reduces installation overhead while maintaining full functionality for the core VNC methodology and smoothing operations.
 
 
 Installation
@@ -51,7 +66,7 @@ You can install the released version of google_ngrams from `PyPI <https://pypi.o
 
 .. code-block:: install-google_ngrams
 
-    pip install
+    pip install google-ngrams
 
 
 Usage
@@ -113,14 +128,17 @@ For additional information, consult the `documentation <https://browndw.github.i
 License
 -------
 
-Code licensed under `
-See `LICENSE <https://github.com/browndw/
+Code licensed under `MIT License <https://opensource.org/licenses/MIT>`_.
+See `LICENSE <https://github.com/browndw/google_ngrams/blob/main/LICENSE>`_ file.
 
 .. |pypi| image:: https://badge.fury.io/py/google_ngrams.svg
-   :target: https://badge.fury.io/py/
+   :target: https://badge.fury.io/py/google_ngrams
    :alt: PyPI Version
 
 .. |pypi_downloads| image:: https://img.shields.io/pypi/dm/google_ngrams
    :target: https://pypi.org/project/google_ngrams/
   :alt: Downloads from PyPI
 
+.. |tests| image:: https://github.com/browndw/google_ngrams/actions/workflows/test.yml/badge.svg
+   :target: https://github.com/browndw/google_ngrams/actions/workflows/test.yml
+   :alt: Test Status
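
The metadata changes above (the new License-Expression, the Requires-Python floor of 3.10, the explicit numpy dependency, and the test/dev extras) can be checked after installation with only the standard library. A minimal sketch, assuming the 0.2.0 release is installed:

    from importlib.metadata import metadata, version

    meta = metadata("google_ngrams")
    print(version("google_ngrams"))       # expected: 0.2.0
    print(meta["License-Expression"])     # expected: MIT
    print(meta["Requires-Python"])        # expected: >=3.10
    # Core deps plus the new test/dev extras markers
    print(meta.get_all("Requires-Dist"))

The extras themselves would be pulled in with `pip install "google-ngrams[test]"` or `"google-ngrams[dev]"`.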
{google_ngrams-0.1.0 → google_ngrams-0.2.0}/README.rst
@@ -1,7 +1,7 @@
 
-google_ngrams
+google_ngrams
 =======================================================================================================
-|pypi| |pypi_downloads|
+|pypi| |pypi_downloads| |tests|
 
 This package has functions for processing `Google’s Ngram repositories <http://storage.googleapis.com/books/ngrams/books/datasetsv2.html>`_ without having to download them locally. These repositories vary in their size, but the larger ones (like th one for the letter *s* or common bigrams) contain multiple gigabytes.
 
@@ -10,15 +10,20 @@ The main function uses `scan_csv from the polars <https://docs.pola.rs/api/pytho
 vnc
 ---
 
-To analyze the returned data, the package
+To analyze the returned data, the package also contains functions based on the work of Gries and Hilpert (2012) for `Variability-Based Neighbor Clustering <https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199922765.001.0001/oxfordhb-9780199922765-e-14>`_.
 
-The idea is to use hierarchical clustering to aid "bottom up
+The idea is to use hierarchical clustering to aid "bottom up" periodization of language change. The python functions are built on `their original R code <http://global.oup.com/us/companion.websites/fdscontent/uscompanion/us/static/companion.websites/nevalainen/Gries-Hilpert_web_final/vnc.individual.html>`_.
 
 Distances, therefore, are calculated in sums of standard deviations and coefficients of variation, according to their stated method.
 
-Dendrograms are plotted using matplotlib,
+Dendrograms are plotted using matplotlib, with custom implementations for hierarchical clustering that maintain the plotting order of the leaves according to the requirements of the method.
 
-The package also has
+The package also has a custom implementation of dendrogram truncation that consolidates leaves under a specified number of time periods (or clusters) while also maintaining the leaf order to facilitate the reading and interpretation of large dendrograms.
+
+Lightweight Implementation
+--------------------------
+
+Starting with version 0.2.0, google_ngrams uses lightweight, custom implementations for statistical computations instead of heavy dependencies like scipy and statsmodels. This design choice reduces installation overhead while maintaining full functionality for the core VNC methodology and smoothing operations.
 
 
 Installation
@@ -28,7 +33,7 @@ You can install the released version of google_ngrams from `PyPI <https://pypi.o
 
 .. code-block:: install-google_ngrams
 
-    pip install
+    pip install google-ngrams
 
 
 Usage
@@ -90,14 +95,17 @@ For additional information, consult the `documentation <https://browndw.github.i
 License
 -------
 
-Code licensed under `
-See `LICENSE <https://github.com/browndw/
+Code licensed under `MIT License <https://opensource.org/licenses/MIT>`_.
+See `LICENSE <https://github.com/browndw/google_ngrams/blob/main/LICENSE>`_ file.
 
 .. |pypi| image:: https://badge.fury.io/py/google_ngrams.svg
-   :target: https://badge.fury.io/py/
+   :target: https://badge.fury.io/py/google_ngrams
   :alt: PyPI Version
 
 .. |pypi_downloads| image:: https://img.shields.io/pypi/dm/google_ngrams
    :target: https://pypi.org/project/google_ngrams/
   :alt: Downloads from PyPI
 
+.. |tests| image:: https://github.com/browndw/google_ngrams/actions/workflows/test.yml/badge.svg
+   :target: https://github.com/browndw/google_ngrams/actions/workflows/test.yml
+   :alt: Test Status
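
The Variability-Based Neighbor Clustering the new README text describes differs from ordinary agglomerative clustering in one key respect: only temporally adjacent periods may merge, with the merge cost taken from the dispersion of the pooled values. A minimal sketch of that idea, illustrating the Gries and Hilpert (2012) method rather than the package's own `TimeSeries` implementation (the helper name `vnc_merge_order` is hypothetical):

    import statistics

    def vnc_merge_order(years, values):
        """Greedily merge adjacent periods; return the merge history."""
        clusters = [([y], [v]) for y, v in zip(years, values)]
        history = []
        while len(clusters) > 1:
            # Cost of merging each adjacent pair = sd of the pooled values
            costs = [
                statistics.stdev(clusters[i][1] + clusters[i + 1][1])
                for i in range(len(clusters) - 1)
            ]
            i = costs.index(min(costs))
            ys = clusters[i][0] + clusters[i + 1][0]
            vs = clusters[i][1] + clusters[i + 1][1]
            history.append((ys[0], ys[-1], min(costs)))
            clusters[i:i + 2] = [(ys, vs)]
        return history

    # A level shift after 1900 keeps the two regimes separate until the end
    for start, end, cost in vnc_merge_order(
        [1860, 1870, 1880, 1890, 1900, 1910],
        [1.0, 1.1, 0.9, 1.0, 2.0, 2.1],
    ):
        print(f"merge {start}-{end}: sd={cost:.3f}")

Because merges are restricted to neighbors, the leaf order of the resulting dendrogram is the chronological order, which is what the package's custom plotting and truncation code preserves.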
google_ngrams-0.2.0/google_ngrams/__init__.py
@@ -0,0 +1,19 @@
+# flake8: noqa
+
+# Set version ----
+from importlib.metadata import version as _v, PackageNotFoundError as _PNF
+
+try:
+    __version__ = _v("google_ngrams")
+except _PNF:  # Fallback when running from source without installed metadata
+    __version__ = "0.0.0"
+
+del _v
+
+# Imports ----
+
+from .ngrams import google_ngram
+
+from .vnc import TimeSeries
+
+__all__ = ['google_ngram', 'TimeSeries']
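
A quick check of the public surface the new `__init__.py` defines (assuming the package is installed; the version string depends on the installed metadata):

    import google_ngrams

    print(google_ngrams.__version__)  # "0.2.0" when installed from PyPI
    print(google_ngrams.__all__)      # ['google_ngram', 'TimeSeries']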
google_ngrams-0.2.0/google_ngrams/ngrams.py
@@ -0,0 +1,341 @@
+import os
+import re
+import polars as pl
+import warnings
+import logging
+from textwrap import dedent
+from typing import List
+from .data import sources
+
+
+def google_ngram(
+        word_forms: List[str],
+        variety="eng",
+        by="decade"
+) -> pl.DataFrame:
+    """
+    Fetches Google Ngram data for specified word forms.
+
+    This function retrieves ngram data from the Google Books Ngram Viewer
+    for the given word forms. It supports different varieties of English
+    (e.g., British, American) and allows aggregation by year or decade.
+
+    Parameters
+    ----------
+    word_forms : List
+        List of word forms to search for.
+    variety : str
+        Variety of English ('eng', 'gb', 'us').
+    by : str
+        Aggregation level ('year' or 'decade').
+
+    Returns
+    -------
+    pl.DataFrame
+        DataFrame containing the ngram data.
+    """
+    variety_types = ["eng", "gb", "us"]
+    if variety not in variety_types:
+        raise ValueError("""variety_types
+                         Invalid variety type. Expected one of: %s
+                         """ % variety_types)
+    by_types = ["year", "decade"]
+    if by not in by_types:
+        raise ValueError("""variety_types
+                         Invalid by type. Expected one of: %s
+                         """ % by_types)
+    word_forms = [re.sub(r'([a-zA-Z0-9])-([a-zA-Z0-9])',
+                         r'\1 - \2', wf) for wf in word_forms]
+    word_forms = [wf.strip() for wf in word_forms]
+    n = [len(re.findall(r'\S+', wf)) for wf in word_forms]
+    n = list(set(n))
+
+    if len(n) > 1:
+        raise ValueError("""Check spelling.
+                         Word forms should be lemmas of the same word
+                         (e.g. 'teenager' and 'teenagers'
+                         or 'walk', 'walks' and 'walked'
+                         """)
+    if n[0] > 5:
+        raise ValueError("""Ngrams can be a maximum of 5 tokens.
+                         Hyphenated words are split and include the hyphen,
+                         so 'x-ray' would count as 3 tokens.
+                         """)
+
+    gram = [wf[:2] if n[0] > 1 else wf[:1] for wf in word_forms]
+    gram = list(set([g.lower() for g in gram]))
+
+    if len(gram) > 1:
+        raise ValueError("""Check spelling.
+                         Word forms should be lemmas of the same word
+                         (e.g. 'teenager' and 'teenagers'
+                         or 'walk', 'walks' and 'walked'
+                         """)
+
+    if re.match(r'^[a-z][^a-z]', gram[0]):
+        gram[0] = re.sub(r'[^a-z]', '_', gram[0])
+    if re.match(r'^[0-9]', gram[0]):
+        gram[0] = gram[0][:1]
+    if re.match(r'^[\W]', gram[0]):
+        gram[0] = "punctuation"
+
+    if any(re.match(r'^[ßæðøłœıƒþȥəħŋªºɣđijɔȝⅰʊʌʔɛȡɋⅱʃɇɑⅲ]', g) for g in gram):
+        gram[0] = "other"
+
+    gram[0] = gram[0].encode('latin-1', 'replace').decode('latin-1')
+
+    # Use HTTPS for integrity (Google Storage supports it) instead of HTTP
+    if variety == "eng":
+        repo = f"https://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-{n[0]}gram-20120701-{gram[0]}.gz"  # noqa: E501
+    else:
+        repo = f"https://storage.googleapis.com/books/ngrams/books/googlebooks-eng-{variety}-all-{n[0]}gram-20120701-{gram[0]}.gz"  # noqa: E501
+
+    logger = logging.getLogger(__name__)
+    logger.info(dedent(
+        """
+        Accessing repository. For larger ones
+        (e.g., ngrams containing 2 or more words).
+        This may take a few minutes...
+        """
+    ))
+
+    # Preserve exact tokens for equality filtering in non-regex fallbacks
+    tokens_exact = list(word_forms)
+    word_forms = [re.sub(
+        r'(\.|\?|\$|\^|\)|\(|\}|\{|\]|\[|\*|\+|\|)',
+        r'\\\1', wf
+    ) for wf in word_forms]
+
+    grep_words = "|".join([f"^{wf}$" for wf in word_forms])
+
+    # Read the data from the google repository and format
+    schema = {"column_1": pl.String,
+              "column_2": pl.Int64,
+              "column_3": pl.Int64,
+              "column_4": pl.Int64}
+    try:
+        df = pl.scan_csv(
+            repo,
+            separator='\t',
+            has_header=False,
+            schema=schema,
+            truncate_ragged_lines=True,
+            low_memory=True,
+            quote_char=None,
+            ignore_errors=True,
+        )
+    except TypeError:
+        # Fallback for environments/tests that monkeypatch scan_csv with a
+        # limited signature. Use minimal, widely-supported args.
+        df = pl.scan_csv(repo, separator='\t', has_header=False, schema=schema)
+    # Push down filter and projection before collection to minimize memory
+    filtered_df = (
+        df
+        .filter(pl.col("column_1").str.contains(r"(?i)" + grep_words))
+        .select([
+            pl.col("column_1").alias("Token"),
+            pl.col("column_2").alias("Year"),
+            pl.col("column_3").alias("AF"),
+        ])
+    )
+
+    # Optional: allow tuning streaming batch size via env
+    try:
+        chunk_sz = os.environ.get("POLARS_STREAMING_CHUNK_SIZE")
+        if chunk_sz:
+            pl.Config.set_streaming_chunk_size(int(chunk_sz))
+    except Exception:
+        pass
+
+    # Collect with streaming fallback for stability across polars versions
+    try:
+        logger.debug("Collecting with engine='streaming'.")
+        all_grams = filtered_df.collect(engine="streaming")
+    except Exception:
+        try:
+            # Older streaming path (deprecated in newer Polars)
+            logger.debug("Collecting with deprecated streaming=True path.")
+            with warnings.catch_warnings():
+                warnings.filterwarnings(
+                    "ignore",
+                    category=DeprecationWarning,
+                    message=r"the `streaming` parameter was deprecated.*",
+                )
+                all_grams = filtered_df.collect(  # type: ignore[arg-type]
+                    streaming=True
+                )
+        except Exception:
+            try:
+                # Plain in-memory collect
+                logger.debug(
+                    "Collecting with in-memory engine (no streaming)."
+                )
+                all_grams = filtered_df.collect()
+            except Exception:
+                # Final memory-safe fallback: batched CSV reader with
+                # per-batch filter
+                logger.debug(
+                    "Falling back to batched CSV reader + per-batch filter."
+                )
+                batch_sz = int(
+                    os.environ.get("POLARS_CSV_BATCH_SIZE", "200000")
+                )
+                try:
+                    reader = pl.read_csv_batched(
+                        repo,
+                        separator='\t',
+                        has_header=False,
+                        ignore_errors=True,
+                        low_memory=True,
+                        batch_size=batch_sz,
+                    )
+                    filtered_batches = []
+                    # Prefer equality match for speed and stability
+                    try:
+                        for batch in reader:  # type: ignore[assignment]
+                            fb = (
+                                batch
+                                .filter(pl.col("column_1").is_in(tokens_exact))
+                                .select([
+                                    pl.col("column_1").alias("Token"),
+                                    pl.col("column_2").alias("Year"),
+                                    pl.col("column_3").alias("AF"),
+                                ])
+                            )
+                            if fb.height:
+                                filtered_batches.append(fb)
+                    except TypeError:
+                        # Fallback for alternate reader APIs
+                        while True:
+                            try:
+                                batches = reader.next_batches(1)
+                            except AttributeError:
+                                break
+                            if not batches:
+                                break
+                            batch = batches[0]
+                            fb = (
+                                batch
+                                .filter(pl.col("column_1").is_in(tokens_exact))
+                                .select([
+                                    pl.col("column_1").alias("Token"),
+                                    pl.col("column_2").alias("Year"),
+                                    pl.col("column_3").alias("AF"),
+                                ])
+                            )
+                            if fb.height:
+                                filtered_batches.append(fb)
+
+                    if filtered_batches:
+                        all_grams = pl.concat(filtered_batches)
+                    else:
+                        all_grams = pl.DataFrame({
+                            "Token": pl.Series([], dtype=pl.String),
+                            "Year": pl.Series([], dtype=pl.Int64),
+                            "AF": pl.Series([], dtype=pl.Int64),
+                        })
+                except Exception as e:
+                    # If batched reader is unavailable, re-raise with guidance
+                    raise RuntimeError(
+                        "Polars batched CSV reader fallback failed; consider "
+                        "upgrading Polars or disabling this code path via "
+                        "environment if necessary."
+                    ) from e
+
+    # read totals
+    if variety == "eng":
+        f_path = sources.get("eng_all")
+    elif variety == "gb":
+        f_path = sources.get("gb_all")
+    elif variety == "us":
+        f_path = sources.get("us_all")
+
+    total_counts = pl.read_parquet(f_path)
+    # format totals, fill missing data, and sum
+    total_counts = total_counts.cast({
+        "Year": pl.UInt32,
+        "Total": pl.UInt64,
+        "Pages": pl.UInt64,
+        "Volumes": pl.UInt64,
+    })
+
+    total_counts = (
+        total_counts
+        .with_columns(
+            pl.col("Year")
+            .cast(pl.String).str.to_datetime("%Y")
+        )
+        .sort("Year")
+        .upsample(time_column="Year", every="1y")
+        .with_columns(
+            pl.col(["Total", "Pages", "Volumes"])
+            .fill_null(strategy="zero")
+        )
+    )
+    total_counts = (
+        total_counts
+        .group_by_dynamic(
+            "Year", every="1y"
+        ).agg(pl.col("Total").sum())
+    )
+
+    # sum token totals, convert to datetime and fill in missing years
+    sum_tokens = (
+        all_grams
+        .group_by("Year", maintain_order=True)
+        .agg(pl.col("AF").sum())
+    )
+    sum_tokens = (
+        sum_tokens
+        .with_columns(
+            pl.col("Year")
+            .cast(pl.String).str.to_datetime("%Y")
+        )
+        .sort("Year")
+        .upsample(time_column="Year", every="1y")
+        .with_columns(
+            pl.col("AF")
+            .fill_null(strategy="zero")
+        )
+    )
+    # join with totals
+    sum_tokens = sum_tokens.join(total_counts, on="Year", how="right")
+    # Fill any missing AF created by the join (years with no token hits)
+    sum_tokens = sum_tokens.with_columns(
+        pl.col("AF").fill_null(strategy="zero")
+    )
+
+    if by == "decade":
+        sum_tokens = (
+            sum_tokens
+            .group_by_dynamic("Year", every="10y")
+            .agg(pl.col(["AF", "Total"]).sum())
+        )
+    # normalize RF per million tokens
+    sum_tokens = (
+        sum_tokens
+        .with_columns(
+            RF=pl.col("AF").truediv("Total").mul(1000000)
+        )
+        .with_columns(
+            pl.col("RF").fill_nan(0)
+        )
+    )
+    sum_tokens.insert_column(1, (pl.lit(word_forms)).alias("Token"))
+    sum_tokens = (
+        sum_tokens
+        .with_columns(
+            pl.col("Year").dt.year().alias("Year")
+        )
+        .drop("Total")
+    )
+
+    if by == "decade":
+        # Avoid .rename to prevent potential segfaults
+        sum_tokens = (
+            sum_tokens
+            .with_columns(pl.col("Year").alias("Decade"))
+            .drop("Year")
+        )
+
+    return sum_tokens
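
A usage sketch of the function above, based on its signature and the columns it constructs; the call streams from the public Google Books bucket, so it can take minutes on larger repositories, and the printed numbers here are illustrative, not real output:

    from google_ngrams import google_ngram

    # One lemma's word forms, aggregated by decade for American English
    df = google_ngram(["teenager", "teenagers"], variety="us", by="decade")

    # Per the code above, the result carries the summed absolute frequency
    # (AF) and a rate per million tokens (RF) for each period.
    print(df.select(["Decade", "Token", "AF", "RF"]).head())

The cascade of collect strategies (streaming engine, deprecated streaming flag, in-memory collect, batched reader) trades a little code bulk for compatibility across Polars versions while keeping peak memory bounded on the multi-gigabyte source files.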