chunkey-bert 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2024 Yaniv Shulman
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
|
@@ -0,0 +1,52 @@
|
|
|
1
|
+
Metadata-Version: 2.1
|
|
2
|
+
Name: chunkey-bert
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Modification of the KeyBERT method to extract keywords and keyphrases using chunks. This provides better results, especially when handling long documents.
|
|
5
|
+
Home-page: https://github.com/yaniv-shulman/chunkey-bert
|
|
6
|
+
Keywords: machine learning
|
|
7
|
+
Author: Yaniv Shulman
|
|
8
|
+
Author-email: yaniv@shulman.info
|
|
9
|
+
Requires-Python: >=3.9,<4.0
|
|
10
|
+
Classifier: Intended Audience :: Developers
|
|
11
|
+
Classifier: Intended Audience :: Science/Research
|
|
12
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
13
|
+
Classifier: Programming Language :: Python :: 3
|
|
14
|
+
Classifier: Programming Language :: Python :: 3.9
|
|
15
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
16
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
17
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
18
|
+
Classifier: Topic :: Scientific/Engineering
|
|
19
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
20
|
+
Requires-Dist: keybert (>=0.8.4,<0.9.0)
|
|
21
|
+
Project-URL: Repository, https://github.com/yaniv-shulman/chunkey-bert
|
|
22
|
+
Description-Content-Type: text/markdown
|
|
23
|
+
|
|
24
|
+

|
|
25
|
+
[](https://www.phorm.ai/query?projectId=f7ddaf97-2b90-4515-a364-855258454655)
|
|
26
|
+
|
|
27
|
+
# ChunkeyBERT #
|
|
28
|
+
## Overview ##
|
|
29
|
+
ChunkeyBert is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings for unsupervised
|
|
30
|
+
keyphrase extraction from text documents. ChunkeyBert is a modification of the
|
|
31
|
+
[KeyBERT method](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea) to handle documents with
|
|
32
|
+
arbitrary length with better results. ChunkeyBERT works by chunking the documents and uses KeyBERT to extract candidate
|
|
33
|
+
keywords/keyphrases from all chunks followed by a similarity based selection stage to produce the final keywords for the
|
|
34
|
+
entire document. ChunkeyBert can use any document chunking method as long as it can be wrapped in a simple function,
|
|
35
|
+
however it can also work without a chunker and process the entire document as a single chunk. ChunkeyBert works with any
|
|
36
|
+
configuration of KeyBERT and can handle batches of documents.
|
|
37
|
+
|
|
38
|
+
## Installation ##
|
|
39
|
+
Install from [PyPI](https://pypi.org/project/chunkey-bert/) using pip (preferred method):
|
|
40
|
+
```bash
|
|
41
|
+
pip install chunkey-bert
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
## Experimental results ##
|
|
45
|
+
Very limited experimental results and demonstration of the library on a small number of documents is available at
|
|
46
|
+
https://nbviewer.org/github/yaniv-shulman/chunkey-bert/tree/main/src/experiments/.
|
|
47
|
+
|
|
48
|
+
|
|
49
|
+
## Contribution and feedback ##
|
|
50
|
+
Contributions and feedback are most welcome. Please see
|
|
51
|
+
[CONTRIBUTING.md](https://github.com/yaniv-shulman/chunkey-bert/tree/main/CONTRIBUTING.md) for further details.
|
|
52
|
+
|
|
@@ -0,0 +1,28 @@
|
|
|
1
|
+

|
|
2
|
+
[](https://www.phorm.ai/query?projectId=f7ddaf97-2b90-4515-a364-855258454655)
|
|
3
|
+
|
|
4
|
+
# ChunkeyBERT #
|
|
5
|
+
## Overview ##
|
|
6
|
+
ChunkeyBert is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings for unsupervised
|
|
7
|
+
keyphrase extraction from text documents. ChunkeyBert is a modification of the
|
|
8
|
+
[KeyBERT method](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea) to handle documents with
|
|
9
|
+
arbitrary length with better results. ChunkeyBERT works by chunking the documents and uses KeyBERT to extract candidate
|
|
10
|
+
keywords/keyphrases from all chunks followed by a similarity based selection stage to produce the final keywords for the
|
|
11
|
+
entire document. ChunkeyBert can use any document chunking method as long as it can be wrapped in a simple function,
|
|
12
|
+
however it can also work without a chunker and process the entire document as a single chunk. ChunkeyBert works with any
|
|
13
|
+
configuration of KeyBERT and can handle batches of documents.
|
|
14
|
+
|
|
15
|
+
## Installation ##
|
|
16
|
+
Install from [PyPI](https://pypi.org/project/chunkey-bert/) using pip (preferred method):
|
|
17
|
+
```bash
|
|
18
|
+
pip install chunkey-bert
|
|
19
|
+
```
|
|
20
|
+
|
|
21
|
+
## Experimental results ##
|
|
22
|
+
Very limited experimental results and demonstration of the library on a small number of documents is available at
|
|
23
|
+
https://nbviewer.org/github/yaniv-shulman/chunkey-bert/tree/main/src/experiments/.
|
|
24
|
+
|
|
25
|
+
|
|
26
|
+
## Contribution and feedback ##
|
|
27
|
+
Contributions and feedback are most welcome. Please see
|
|
28
|
+
[CONTRIBUTING.md](https://github.com/yaniv-shulman/chunkey-bert/tree/main/CONTRIBUTING.md) for further details.
|
|
@@ -0,0 +1,140 @@
|
|
|
1
|
+
[tool.poetry]
|
|
2
|
+
authors = ["Yaniv Shulman <yaniv@shulman.info>"]
|
|
3
|
+
classifiers = [
|
|
4
|
+
"Intended Audience :: Developers",
|
|
5
|
+
"Intended Audience :: Science/Research",
|
|
6
|
+
"License :: OSI Approved :: MIT License",
|
|
7
|
+
"Programming Language :: Python :: 3.9",
|
|
8
|
+
"Programming Language :: Python :: 3.10",
|
|
9
|
+
"Programming Language :: Python :: 3.11",
|
|
10
|
+
"Topic :: Scientific/Engineering",
|
|
11
|
+
"Topic :: Scientific/Engineering :: Artificial Intelligence"
|
|
12
|
+
]
|
|
13
|
+
description = "Modification of the KeyBERT method to extract keywords and keyphrases using chunks. This provides better results, especially when handling long documents."
|
|
14
|
+
homepage = "https://github.com/yaniv-shulman/chunkey-bert"
|
|
15
|
+
keywords = [
|
|
16
|
+
"machine learning",
|
|
17
|
+
]
|
|
18
|
+
name = "chunkey-bert"
|
|
19
|
+
packages = [
|
|
20
|
+
{ include = "chunkey_bert", from = "src" }
|
|
21
|
+
]
|
|
22
|
+
readme = "README.md"
|
|
23
|
+
repository = "https://github.com/yaniv-shulman/chunkey-bert"
|
|
24
|
+
version = "0.1.0"
|
|
25
|
+
|
|
26
|
+
[tool.poetry.group.experiments]
|
|
27
|
+
optional = true
|
|
28
|
+
|
|
29
|
+
[tool.poetry.group.dev]
|
|
30
|
+
optional = true
|
|
31
|
+
|
|
32
|
+
[tool.poetry.dependencies]
|
|
33
|
+
python = ">=3.9,<4.0"
|
|
34
|
+
keybert = "^0.8.4"
|
|
35
|
+
|
|
36
|
+
|
|
37
|
+
[tool.poetry.group.experiments.dependencies]
|
|
38
|
+
notebook = "^7.1.3"
|
|
39
|
+
ipywidgets = "^8.1.2"
|
|
40
|
+
spacy = "^3.7.4"
|
|
41
|
+
cupy-cuda12x = "^13.1.0"
|
|
42
|
+
keyphrase-vectorizers = "^0.0.13"
|
|
43
|
+
sentence-transformers = "^2.7.0"
|
|
44
|
+
en-core-web-trf = {url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3.tar.gz"}
|
|
45
|
+
datasets = "^2.19.1"
|
|
46
|
+
|
|
47
|
+
|
|
48
|
+
[tool.poetry.group.dev.dependencies]
|
|
49
|
+
black = {extras = ["jupyter"], version = "^24.4.2"}
|
|
50
|
+
mypy = "^1.10.0"
|
|
51
|
+
flake8 = "^7.0.0"
|
|
52
|
+
ruff = "^0.4.3"
|
|
53
|
+
pytest = "^8.2.0"
|
|
54
|
+
pytest-mock = "^3.14.0"
|
|
55
|
+
coverage = {extras = ["toml"], version = "^7.5.1"}
|
|
56
|
+
pytest-xdist = "^3.6.1"
|
|
57
|
+
pytest-cov = "^5.0.0"
|
|
58
|
+
|
|
59
|
+
|
|
60
|
+
[build-system]
|
|
61
|
+
requires = ["poetry-core"]
|
|
62
|
+
build-backend = "poetry.core.masonry.api"
|
|
63
|
+
|
|
64
|
+
|
|
65
|
+
[tool.black]
|
|
66
|
+
line-length = 120
|
|
67
|
+
target-version = ["py39"]
|
|
68
|
+
|
|
69
|
+
|
|
70
|
+
[tool.ruff]
|
|
71
|
+
# Exclude a variety of commonly ignored directories.
|
|
72
|
+
exclude = [
|
|
73
|
+
".bzr",
|
|
74
|
+
".direnv",
|
|
75
|
+
".eggs",
|
|
76
|
+
".git",
|
|
77
|
+
".git-rewrite",
|
|
78
|
+
".hg",
|
|
79
|
+
".idea",
|
|
80
|
+
".ipynb_checkpoints",
|
|
81
|
+
".mypy_cache",
|
|
82
|
+
".nox",
|
|
83
|
+
".pants.d",
|
|
84
|
+
".pytype",
|
|
85
|
+
".ruff_cache",
|
|
86
|
+
".svn",
|
|
87
|
+
".tox",
|
|
88
|
+
".venv",
|
|
89
|
+
"__pycache__",
|
|
90
|
+
"__pypackages__",
|
|
91
|
+
"_build",
|
|
92
|
+
"buck-out",
|
|
93
|
+
"build",
|
|
94
|
+
"dist",
|
|
95
|
+
"node_modules",
|
|
96
|
+
"paper",
|
|
97
|
+
"venv",
|
|
98
|
+
]
|
|
99
|
+
|
|
100
|
+
# Same as Black.
|
|
101
|
+
line-length = 120
|
|
102
|
+
indent-width = 4
|
|
103
|
+
target-version = "py39"
|
|
104
|
+
|
|
105
|
+
|
|
106
|
+
[tool.ruff.lint]
|
|
107
|
+
# Enable Pyflakes (`F`) and a subset of the pycodestyle (`E`) codes by default.
|
|
108
|
+
# Unlike Flake8, Ruff doesn't enable pycodestyle warnings (`W`) or
|
|
109
|
+
# McCabe complexity (`C901`) by default.
|
|
110
|
+
select = ["E4", "E7", "E9", "F"]
|
|
111
|
+
ignore = []
|
|
112
|
+
|
|
113
|
+
# Allow fix for all enabled rules (when `--fix`) is provided.
|
|
114
|
+
fixable = ["ALL"]
|
|
115
|
+
unfixable = []
|
|
116
|
+
|
|
117
|
+
# Allow unused variables when underscore-prefixed.
|
|
118
|
+
dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"
|
|
119
|
+
|
|
120
|
+
|
|
121
|
+
[tool.mypy]
|
|
122
|
+
python_version = "3.9"
|
|
123
|
+
warn_return_any = true
|
|
124
|
+
warn_unused_configs = true
|
|
125
|
+
ignore_missing_imports = true
|
|
126
|
+
|
|
127
|
+
|
|
128
|
+
[tool.pytest.ini_options]
|
|
129
|
+
addopts = "-ra -q"
|
|
130
|
+
minversion = "6.0"
|
|
131
|
+
testpaths = ["tests"]
|
|
132
|
+
|
|
133
|
+
|
|
134
|
+
[tool.coverage.run]
|
|
135
|
+
branch = true
|
|
136
|
+
omit = ["tests/*", "src/experiments/*"]
|
|
137
|
+
|
|
138
|
+
|
|
139
|
+
[tool.coverage.report]
|
|
140
|
+
show_missing=true
|
|
File without changes
|
|
@@ -0,0 +1,306 @@
|
|
|
1
|
+
import warnings
|
|
2
|
+
from typing import Tuple, List, Optional, Union, Callable
|
|
3
|
+
|
|
4
|
+
import numpy as np
|
|
5
|
+
from keybert import KeyBERT
|
|
6
|
+
from keybert.backend import BaseEmbedder
|
|
7
|
+
from sklearn.feature_extraction.text import CountVectorizer
|
|
8
|
+
|
|
9
|
+
|
|
10
|
+
def _calculate_top_similar_keywords_for_doc(
|
|
11
|
+
embeddings_doc: np.ndarray,
|
|
12
|
+
counts_doc: Optional[np.ndarray],
|
|
13
|
+
top_k: Optional[int] = None,
|
|
14
|
+
) -> Tuple[np.ndarray, np.ndarray]:
|
|
15
|
+
"""
|
|
16
|
+
Given embeddings to all keywords extracted from all chunks comprising a document, this method determines the top k
|
|
17
|
+
most similar keywords across all chunks. The method assumes the embeddings are normalized and uses dot product
|
|
18
|
+
similarity (cosine similarity). The score is then normalized to the range [0,1].
|
|
19
|
+
|
|
20
|
+
Args:
|
|
21
|
+
embeddings_doc: embeddings of all keywords extracted from a single document.
|
|
22
|
+
counts_doc: If provided, the multiplicity of keywords extracted from the document to use in calculating
|
|
23
|
+
similarity weighting.
|
|
24
|
+
top_k: the number of top keywords to return. If unspecified, returns all keywords sorted by decreasing score.
|
|
25
|
+
|
|
26
|
+
Returns:
|
|
27
|
+
The top k most similar keywords across all chunks and their score in [0,1].
|
|
28
|
+
"""
|
|
29
|
+
if top_k is not None and top_k < 0:
|
|
30
|
+
raise ValueError("top_k must be greater than or equal to 0, or None.")
|
|
31
|
+
|
|
32
|
+
sim: np.ndarray = embeddings_doc @ embeddings_doc.T
|
|
33
|
+
np.fill_diagonal(a=sim, val=np.nan)
|
|
34
|
+
sim = np.nanmean(sim, axis=1)
|
|
35
|
+
sim = np.clip(sim, a_min=-1, a_max=1) + 1.0
|
|
36
|
+
|
|
37
|
+
if counts_doc is not None:
|
|
38
|
+
weights: np.ndarray = np.log2(counts_doc + 1)
|
|
39
|
+
weights /= np.max(weights)
|
|
40
|
+
sim *= weights
|
|
41
|
+
|
|
42
|
+
sim /= 2
|
|
43
|
+
top_idx: np.ndarray = np.argsort(sim)
|
|
44
|
+
|
|
45
|
+
if top_k is None:
|
|
46
|
+
top_idx = np.flip(top_idx)
|
|
47
|
+
else:
|
|
48
|
+
top_idx = top_idx[-1 : -top_k - 1 : -1]
|
|
49
|
+
|
|
50
|
+
return top_idx, sim[top_idx]
|
|
51
|
+
|
|
52
|
+
|
|
53
|
+
def _get_unique_keywords_by_doc_idx(
|
|
54
|
+
all_keywords_chunks: List[List[Tuple[str, float]]],
|
|
55
|
+
docs_idx_list: List[int],
|
|
56
|
+
doc_idx: int,
|
|
57
|
+
use_count_weights: bool,
|
|
58
|
+
) -> Tuple[np.ndarray, Optional[np.ndarray]]:
|
|
59
|
+
"""
|
|
60
|
+
Extract all unique keywords (case sensitive) for a single doc from the list of keywords for all chunks.
|
|
61
|
+
|
|
62
|
+
Args:
|
|
63
|
+
all_keywords_chunks: Keywords for all chunks of documents.
|
|
64
|
+
docs_idx_list: Mapping chunks keywords indices to documents.
|
|
65
|
+
doc_idx: The document index to extract keywords for.
|
|
66
|
+
use_count_weights: If True, the number of times a keyword is repeated across chunks in the same document is
|
|
67
|
+
returned. If False, None is returned as counts.
|
|
68
|
+
|
|
69
|
+
Returns:
|
|
70
|
+
All unique keywords for a single doc from the list of keywords and optionally their counts.
|
|
71
|
+
"""
|
|
72
|
+
if doc_idx < 0 or doc_idx >= len(docs_idx_list):
|
|
73
|
+
raise ValueError("doc_idx out of range")
|
|
74
|
+
|
|
75
|
+
chunk_i_keywords: List[List[Tuple[str, float]]] = all_keywords_chunks[
|
|
76
|
+
docs_idx_list[doc_idx] : docs_idx_list[doc_idx + 1]
|
|
77
|
+
]
|
|
78
|
+
|
|
79
|
+
keywords_doc_unique: Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
|
|
80
|
+
|
|
81
|
+
keywords_doc_unique = np.unique(
|
|
82
|
+
ar=[t[0] for lk in chunk_i_keywords for t in lk if (len(t[0]) > 0 and not t[0].isspace())], # type: ignore[call-overload]
|
|
83
|
+
return_counts=use_count_weights,
|
|
84
|
+
)
|
|
85
|
+
|
|
86
|
+
return (
|
|
87
|
+
(keywords_doc_unique[0].astype(str), keywords_doc_unique[1]) # type: ignore[return-value]
|
|
88
|
+
if use_count_weights
|
|
89
|
+
else (keywords_doc_unique, None)
|
|
90
|
+
)
|
|
91
|
+
|
|
92
|
+
|
|
93
|
+
def _extract_chunks_from_docs(
|
|
94
|
+
docs: Union[str, List[str]],
|
|
95
|
+
chunker: Optional[Callable[[str], List[str]]],
|
|
96
|
+
) -> Tuple[List[str], List[int]]:
|
|
97
|
+
"""
|
|
98
|
+
Applies the chunker to the docs to create a flat list of contiguous doc chunks with a list of indices that index the
|
|
99
|
+
beginning and end of each doc's chunks. The chunker is applied to each document independently. These chunks
|
|
100
|
+
represent the document for subsequent processing and each is later provided to KeyBERT to extract keywords. There is
|
|
101
|
+
no need for the chunker to return all the text in the document, and it can apply filtering and sampling to reduce
|
|
102
|
+
downstream processing complexity. If a chunker is not provided then the chunks returned are the input documents.
|
|
103
|
+
|
|
104
|
+
Args:
|
|
105
|
+
docs: The documents to chunk.
|
|
106
|
+
chunker: A callable that takes a string and returns a list of strings. This is applied to each document
|
|
107
|
+
independently
|
|
108
|
+
|
|
109
|
+
Returns:
|
|
110
|
+
A flat list of contiguous doc chunks with a list of indices that index the beginning and end of each doc's
|
|
111
|
+
chunks.
|
|
112
|
+
"""
|
|
113
|
+
if len(docs) == 0:
|
|
114
|
+
return [], []
|
|
115
|
+
|
|
116
|
+
if isinstance(docs, str):
|
|
117
|
+
docs = [docs]
|
|
118
|
+
|
|
119
|
+
chunks: List[str]
|
|
120
|
+
idx: List[int]
|
|
121
|
+
|
|
122
|
+
if chunker is not None:
|
|
123
|
+
chunks = []
|
|
124
|
+
idx = [0] * (len(docs) + 1)
|
|
125
|
+
i: int
|
|
126
|
+
doc: str
|
|
127
|
+
|
|
128
|
+
for i, doc in enumerate(docs):
|
|
129
|
+
chunks_doc: List[str] = chunker(doc)
|
|
130
|
+
chunks.extend(chunks_doc)
|
|
131
|
+
idx[i + 1] = len(chunks)
|
|
132
|
+
else:
|
|
133
|
+
chunks = docs
|
|
134
|
+
idx = list(range(len(docs) + 1))
|
|
135
|
+
|
|
136
|
+
return chunks, idx
|
|
137
|
+
|
|
138
|
+
|
|
139
|
+
class ChunkeyBert:
    """
    Keyword/keyphrase extraction for long documents built on top of KeyBERT.

    Documents are optionally split into chunks, KeyBERT extracts candidate keywords from every
    chunk, and the candidates are merged per document and re-ranked by mutual embedding
    similarity to produce the final keywords.
    """

    def __init__(
        self,
        keybert: KeyBERT,
    ) -> None:
        """
        Args:
            keybert: A configured KeyBERT instance. Its embedding backend (keybert.model) is
                reused to embed candidate keywords during the re-ranking stage.

        Raises:
            ValueError: If the embedder of the provided KeyBERT instance fails a test embedding
                call.
        """
        self._keybert: KeyBERT = keybert
        self._embedder: BaseEmbedder = keybert.model
        try:
            # Probe the embedder once so a misconfigured model fails fast at construction time.
            self._embedding_dim: int = self._embedder.embed(documents=["Determining model embedding dim"]).shape[-1]
        except Exception as exc:
            raise ValueError(f"The provided embedder model is not working as expected. Original exception {exc}")

    def extract_keywords(
        self,
        docs: Union[str, List[str]],
        num_keywords: int,
        chunker: Optional[Callable[[str], List[str]]] = None,
        return_keywords_embeddings: bool = False,
        use_count_weights: bool = True,
        candidates: Optional[List[str]] = None,
        keyphrase_ngram_range: Tuple[int, int] = (1, 1),
        stop_words: Union[str, List[str]] = "english",
        top_n: int = 3,
        min_df: int = 1,
        use_maxsum: bool = False,
        use_mmr: bool = False,
        diversity: float = 0.5,
        nr_candidates: int = 20,
        vectorizer: Optional[CountVectorizer] = None,
        highlight: bool = False,
        seed_keywords: Optional[Union[List[str], List[List[str]]]] = None,
        doc_embeddings: Optional[np.ndarray] = None,
        word_embeddings: Optional[np.ndarray] = None,
        threshold: Optional[float] = None,
    ) -> List[Optional[Union[List[Tuple[str, np.float32]], List[Tuple[str, np.float32, np.ndarray]]]]]:
        """
        Extract the unique keywords/keyphrases for the provided documents. The method uses the chunker if provided to
        chunk the document and then uses the KeyBERT model to extract keywords from each chunk. Finally, it merges all
        the results and finds the most similar keywords across all chunks. If a chunker is not provided the documents
        are provided as inputs to KeyBERT in their entirety and the similarity is calculated for the keywords extracted
        of each complete document.

        Args:
            docs: The documents to extract keywords for.
            num_keywords: The maximum number of keywords to extract.
            chunker: Chunks the documents. The chunker can be any callable that takes a string and returns a list of
                strings. There are no constraints on the chunks, their length or order, e.g. chunks may be disjoint or
                overlap and can be filtered or even sampled from the document.
            return_keywords_embeddings: True to include the keywords embeddings in the returned list.
            use_count_weights: If True, the number of times a keyword is repeated across chunks in the same document is
                considered in scoring. If False it has no impact. Seems to work best when a small KeyBERT top_n value is
                specified.
            candidates: A KeyBert.extract_keywords parameter. Candidate keywords/keyphrases to use instead of extracting
                them from the document(s). NOTE: This is not used if you passed a `vectorizer`.
            keyphrase_ngram_range: A KeyBert.extract_keywords parameter. Length, in words, of the extracted
                keywords/keyphrases. NOTE: This is not used if you passed a `vectorizer`.
            stop_words: A KeyBert.extract_keywords parameter. Stopwords to remove from the document. NOTE: This is not
                used if you passed a `vectorizer`.
            top_n: A KeyBert.extract_keywords parameter. Return the top n keywords/keyphrases.
            min_df: A KeyBert.extract_keywords parameter. Minimum document frequency of a word across all documents if
                keywords for multiple documents need to be extracted. NOTE: This is not used if you passed a
                `vectorizer`.
            use_maxsum: A KeyBert.extract_keywords parameter. Whether to use Max Sum Distance for the selection of
                keywords/keyphrases.
            use_mmr: A KeyBert.extract_keywords parameter. Whether to use Maximal Marginal Relevance (MMR) for the
                selection of keywords/keyphrases.
            diversity: A KeyBert.extract_keywords parameter. The diversity of the results between 0 and 1 if `use_mmr`
                is set to True.
            nr_candidates: A KeyBert.extract_keywords parameter. The number of candidates to consider if `use_maxsum` is
                set to True.
            vectorizer: A KeyBert.extract_keywords parameter. Pass in your own `CountVectorizer` from
                `sklearn.feature_extraction.text.CountVectorizer`
            highlight: A KeyBert.extract_keywords parameter. Whether to print the document and highlight its
                keywords/keyphrases. NOTE: This does not work if multiple documents are passed.
            seed_keywords: A KeyBert.extract_keywords parameter. Seed keywords that may guide the extraction of keywords
                by steering the similarities towards the seeded keywords. NOTE: when multiple documents are passed,
                `seed_keywords` functions in either of the two ways:
                - globally: when a flat list of str is passed, keywords are shared by all documents,
                - locally: when a nested list of str is passed, keywords differs among documents.
            doc_embeddings: A KeyBert.extract_keywords parameter. The embeddings of each document.
            word_embeddings: A KeyBert.extract_keywords parameter. The embeddings of each potential keyword/keyphrase
                across the vocabulary of the set of input documents. NOTE: The `word_embeddings` should be generated
                through `.extract_embeddings` as the order of these embeddings depend on the vectorizer that was used to
                generate its vocabulary.
            threshold: Used by KeyBERT but is undocumented. Seems to be given to community_detection in
                sentence_transformers.utils to determine clusters.

        Returns:
            The top keywords/keyphrases for each corresponding document and their score or None if no keywords are
            available for a document. Optionally the embeddings for each keyword/keyphrase are returned if specified.
        """
        # Without a chunker each document yields at most top_n candidates, capping the output size.
        if chunker is None and num_keywords > top_n:
            warnings.warn(
                message="Setting num_keywords higher than top_n without a chunker will result in at most top_n "
                "keywords/keyphrases returned."
            )

        if isinstance(docs, str):
            docs = [docs]

        chunks: List[str]
        docs_idx_list: List[int]

        # Flatten all documents into one chunk list; docs_idx_list maps documents to chunk ranges.
        chunks, docs_idx_list = _extract_chunks_from_docs(docs=docs, chunker=chunker)

        if len(chunks) == 0:
            return []

        # Sanity check: the final boundary index must equal the total number of chunks.
        if len(chunks) != docs_idx_list[-1]:
            raise RuntimeError("Indices mapping segments don't map to documents, this is a likely issue.")

        # A single batched KeyBERT call over all chunks of all documents; parameters below are
        # passed through unchanged.
        all_keywords_chunks: List[List[Tuple[str, float]]] = self._keybert.extract_keywords(
            docs=chunks,
            candidates=candidates,
            keyphrase_ngram_range=keyphrase_ngram_range,
            stop_words=stop_words,
            top_n=top_n,
            min_df=min_df,
            use_maxsum=use_maxsum,
            use_mmr=use_mmr,
            diversity=diversity,
            nr_candidates=nr_candidates,
            vectorizer=vectorizer,
            highlight=highlight,
            seed_keywords=seed_keywords,
            doc_embeddings=doc_embeddings,
            word_embeddings=word_embeddings,
            threshold=threshold,
        )

        # KeyBERT returns a flat list of tuples for a single input doc; a list of lists is expected
        # here since chunks is always a list.
        if len(all_keywords_chunks) > 0 and not isinstance(all_keywords_chunks[0], List):
            raise ValueError("Unexpected type returned by keybert.extract_keywords().")

        keywords: List[Optional[Union[List[Tuple[str, np.float32]], List[Tuple[str, np.float32, np.ndarray]]]]] = []
        doc_idx: int

        for doc_idx in range(len(docs_idx_list) - 1):
            keywords_doc: np.ndarray
            counts_doc: Optional[np.ndarray]

            # Deduplicate this document's chunk keywords, optionally keeping multiplicities.
            keywords_doc, counts_doc = _get_unique_keywords_by_doc_idx(
                all_keywords_chunks=all_keywords_chunks,
                docs_idx_list=docs_idx_list,
                doc_idx=doc_idx,
                use_count_weights=use_count_weights,
            )

            if keywords_doc.shape[0] > 0:
                # Embed the candidate keywords themselves, then keep the num_keywords candidates
                # most similar on average to the rest.
                embeddings_doc: np.ndarray = self._embedder.embed(documents=keywords_doc)

                top_idx: np.ndarray
                score: np.ndarray

                top_idx, score = _calculate_top_similar_keywords_for_doc(
                    embeddings_doc=embeddings_doc,
                    counts_doc=counts_doc,
                    top_k=num_keywords,
                )

                if not return_keywords_embeddings:
                    keywords.append(list(zip(keywords_doc[top_idx], score)))
                else:
                    keywords.append(list(zip(keywords_doc[top_idx], score, embeddings_doc[top_idx])))

            else:
                # No usable keywords were extracted for this document.
                keywords.append(None)

        return keywords
|