PyPI - semchunk - Versions diffs - 2.2.0__tar.gz → 2.2.2__tar.gz - Mend

semchunk 2.2.0__tar.gz → 2.2.2__tar.gz


Files changed (13)

{semchunk-2.2.0 → semchunk-2.2.2}/CHANGELOG.md RENAMED

@@ -1,6 +1,14 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [2.2.2] - 2024-12-18
+### Fixed
+- Ensured `hatch` does not include irrelevant files in the distribution.
+## [2.2.1] - 2024-12-17
+### Changed
+- Started benchmarking [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) in parallel to ensure a fair comparison, courtesy of [@benbrandt](https://github.com/benbrandt) ([#17](https://github.com/umarbutler/semchunk/pull/12)).
 ## [2.2.0] - 2024-07-12
 ### Changed
 - Switched from having `chunkerify()` output a function to having it return an instance of the new `Chunker()` class which should not alter functionality in any way but will allow for the preservation of type hints, fixing [#7](https://github.com/umarbutler/semchunk/pull/7).
@@ -79,6 +87,9 @@ All notable changes to `semchunk` will be documented here. This project adheres
 ### Added
 - Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.
+[2.2.2]: https://github.com/umarbutler/semchunk/compare/v2.2.1...v2.2.2
+[2.2.1]: https://github.com/umarbutler/semchunk/compare/v2.2.0...v2.2.1
+[2.2.0]: https://github.com/umarbutler/semchunk/compare/v2.1.0...v2.2.0
 [2.1.0]: https://github.com/umarbutler/semchunk/compare/v2.0.0...v2.1.0
 [2.0.0]: https://github.com/umarbutler/semchunk/compare/v1.0.1...v2.0.0
 [1.0.1]: https://github.com/umarbutler/semchunk/compare/v1.0.0...v1.0.1

{semchunk-2.2.0 → semchunk-2.2.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
-Metadata-Version: 2.3
+Metadata-Version: 2.4
 Name: semchunk
-Version: 2.2.0
+Version: 2.2.2
 Summary: A fast and lightweight Python library for splitting text into semantically meaningful chunks.
 Project-URL: Homepage, https://github.com/umarbutler/semchunk
 Project-URL: Documentation, https://github.com/umarbutler/semchunk/blob/main/README.md
@@ -37,7 +37,7 @@ Description-Content-Type: text/markdown
 `semchunk` is a fast and lightweight Python library for splitting text into semantically meaningful chunks.
-Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over 90% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
+Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over 80% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
 ## Installation 📦
 `semchunk` may be installed with `pip`:
@@ -148,7 +148,7 @@ To ensure that chunks are as semantically meaningful as possible, `semchunk` use
 1. All other characters.
 ## Benchmarks 📊
-On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.11.9, it takes `semchunk` 6.69 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 116.48 seconds to chunk the same texts into 512-token-long chunks — a difference of 94.26%.
+On a desktop with a Ryzen 9 7900X, 96 GB of DDR5 5600MHz CL40 RAM, Windows 11 and Python 3.12.4, it takes `semchunk` 2.87 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (with multiprocessing) 25.03 seconds to chunk the same texts into 512-token-long chunks — a difference of 88.53%.
 The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](https://github.com/umarbutler/semchunk/blob/main/tests/bench.py).

{semchunk-2.2.0 → semchunk-2.2.2}/README.md RENAMED

@@ -3,7 +3,7 @@
 `semchunk` is a fast and lightweight Python library for splitting text into semantically meaningful chunks.
-Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over 90% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
+Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over 80% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
 ## Installation 📦
 `semchunk` may be installed with `pip`:
@@ -114,7 +114,7 @@ To ensure that chunks are as semantically meaningful as possible, `semchunk` use
 1. All other characters.
 ## Benchmarks 📊
-On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.11.9, it takes `semchunk` 6.69 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 116.48 seconds to chunk the same texts into 512-token-long chunks — a difference of 94.26%.
+On a desktop with a Ryzen 9 7900X, 96 GB of DDR5 5600MHz CL40 RAM, Windows 11 and Python 3.12.4, it takes `semchunk` 2.87 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (with multiprocessing) 25.03 seconds to chunk the same texts into 512-token-long chunks — a difference of 88.53%.
 The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](https://github.com/umarbutler/semchunk/blob/main/tests/bench.py).
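
The "difference" percentages quoted in the Benchmarks paragraph appear to be the relative reduction in wall-clock time, that is, (slower − faster) / slower. A quick check under that assumption, using the timings from the old and new paragraphs (the helper name `relative_reduction` is hypothetical and not part of `semchunk`):

```python
def relative_reduction(fast_seconds: float, slow_seconds: float) -> float:
    """Return how much shorter the faster run is, as a percentage of the slower run."""
    return (slow_seconds - fast_seconds) / slow_seconds * 100


# Timings quoted for 2.2.2 (semantic-text-splitter run with multiprocessing).
new_figure = relative_reduction(2.87, 25.03)
# Timings quoted for 2.2.0.
old_figure = relative_reduction(6.69, 116.48)

print(f'{new_figure:.2f}%')  # 88.53%
print(f'{old_figure:.2f}%')  # 94.26%
```

Both results match the figures in the removed and added lines, which supports reading "a difference of 88.53%" as a reduction in run time rather than a ratio of speeds.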

{semchunk-2.2.0 → semchunk-2.2.2}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "semchunk"
-version = "2.2.0"
+version = "2.2.2"
 authors = [
   {name="Umar Butler", email="umar@umar.au"},
 ]
@@ -53,3 +53,6 @@ Homepage = "https://github.com/umarbutler/semchunk"
 Documentation = "https://github.com/umarbutler/semchunk/blob/main/README.md"
 Issues = "https://github.com/umarbutler/semchunk/issues"
 Source = "https://github.com/umarbutler/semchunk"
+[tool.hatch.build.targets.sdist]
+only-include = ['src/semchunk/__init__.py', 'src/semchunk/py.typed', 'src/semchunk/semchunk.py', 'pyproject.toml', 'README.md', 'LICENCE', 'CHANGELOG.md', 'tests/bench.py', 'tests/test_semchunk.py', '.github/workflows/ci.yml']
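
One way to confirm that an allow-list like the `only-include` setting above has the intended effect is to list the files in a built sdist and flag anything unexpected. The sketch below is only an illustration: the helper names are hypothetical, the archive is an in-memory stand-in rather than the real release, and the extra `PKG-INFO` and `.gitignore` entries are an assumption based on the file list in this diff, where both still appear in 2.2.2.

```python
import io
import tarfile

# Allow-list copied from the `only-include` setting in the hunk above.
ONLY_INCLUDE = {
    'src/semchunk/__init__.py', 'src/semchunk/py.typed', 'src/semchunk/semchunk.py',
    'pyproject.toml', 'README.md', 'LICENCE', 'CHANGELOG.md',
    'tests/bench.py', 'tests/test_semchunk.py', '.github/workflows/ci.yml',
}
# Assumption: files that show up in the 2.2.2 archive without being listed above.
BACKEND_ADDED = {'PKG-INFO', '.gitignore'}


def unexpected_members(sdist: tarfile.TarFile, allowed: set[str]) -> list[str]:
    """List regular files in the sdist whose relative path is not on the allow-list."""
    extras = []
    for member in sdist.getmembers():
        if not member.isfile():
            continue
        # Drop the leading `semchunk-<version>/` directory from each path.
        _, _, relative = member.name.partition('/')
        if relative not in allowed:
            extras.append(relative)
    return extras


def build_demo_sdist(paths: list[str]) -> tarfile.TarFile:
    """Build an in-memory tarball of empty files to stand in for a real sdist."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:gz') as archive:
        for path in paths:
            archive.addfile(tarfile.TarInfo(name=f'semchunk-2.2.2/{path}'), io.BytesIO(b''))
    buffer.seek(0)
    return tarfile.open(fileobj=buffer, mode='r:gz')


allowed = ONLY_INCLUDE | BACKEND_ADDED
demo = build_demo_sdist(sorted(allowed) + ['htmlcov/.gitignore'])
extras = unexpected_members(demo, allowed)
print(extras)  # ['htmlcov/.gitignore']
```

Pointing `tarfile.open()` at a real `.tar.gz` file instead of the stand-in would run the same check against an actual release.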

{semchunk-2.2.0 → semchunk-2.2.2}/src/semchunk/semchunk.py RENAMED

@@ -13,10 +13,7 @@ import mpire
 from tqdm import tqdm
-if TYPE_CHECKING:
-    import tiktoken
-    import tokenizers
-    import transformers
+if TYPE_CHECKING: import tiktoken, tokenizers, transformers
 _memoized_token_counters = {}
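
The hunk above only collapses the guarded imports onto one line; the behaviour of the `TYPE_CHECKING` guard is unchanged. As a minimal sketch of that pattern, with the standard-library `fractions` module standing in for the optional tokenizer packages (the function `halve` is hypothetical):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by static type checkers only; skipped when the module actually runs.
    import fractions


def halve(value: 'fractions.Fraction') -> 'fractions.Fraction':
    # The quoted annotations are never evaluated at run time,
    # so the module does not need to be imported for this to work.
    return value / 2


print(TYPE_CHECKING)  # False
print(halve(3.0))     # 1.5
```

Guarding imports this way keeps packages that are needed only for type hints from being imported at run time.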

{semchunk-2.2.0 → semchunk-2.2.2}/tests/bench.py RENAMED

@@ -11,7 +11,7 @@ from semantic_text_splitter import TextSplitter
 CHUNK_SIZE = 512
 # END CONFIG #
-def bench() -> dict[str, float]:
+def bench() -> dict[str, float]:
     # Initialise the chunkers.
     semchunk_chunker = semchunk.chunkerify(tiktoken.encoding_for_model('gpt-4'), CHUNK_SIZE)
     sts_chunker = TextSplitter.from_tiktoken_model('gpt-4', CHUNK_SIZE)
@@ -21,7 +21,7 @@ def bench() -> dict[str, float]:
         semchunk_chunker(texts)
     def bench_sts(texts: list[str]) -> None:
-        [sts_chunker.chunks(text) for text in texts]
+        sts_chunker.chunk_all(texts)
     libraries = {
         'semchunk': bench_semchunk,
@@ -31,22 +31,24 @@ def bench() -> dict[str, float]:
     # Download the Gutenberg corpus.
     try:
         gutenberg = nltk.corpus.gutenberg
     except Exception:
         nltk.download('gutenberg')
         gutenberg = nltk.corpus.gutenberg
     # Benchmark the libraries.
-    benchmarks = dict.fromkeys(libraries.keys(), 0)
+    benchmarks = dict.fromkeys(libraries.keys(), 0.0)
     texts = [gutenberg.raw(fileid) for fileid in gutenberg.fileids()]
     for library, function in libraries.items():
         start = time.time()
         function(texts)
         benchmarks[library] = time.time() - start
     return benchmarks
 if __name__ == '__main__':
+    nltk.download('gutenberg')
     for library, time_taken in bench().items():
-        print(f'{library}: {time_taken:.2f}s')
+        print(f'{library}: {time_taken:.2f}s')
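
The change from `dict.fromkeys(libraries.keys(), 0)` to `dict.fromkeys(libraries.keys(), 0.0)` makes the placeholder values floats, in line with the declared `dict[str, float]` return type. A minimal sketch of the same timing loop follows, with two trivial stand-in workloads in place of the real chunkers and with `time.perf_counter()` swapped in for `time.time()` (a substitution made here for illustration, not what `bench.py` does); all function names are hypothetical.

```python
import time
from typing import Callable


def time_each(functions: dict[str, Callable[[list[str]], None]], texts: list[str]) -> dict[str, float]:
    """Run every function once over `texts` and record the elapsed seconds."""
    # Float placeholders keep the dictionary consistent with `dict[str, float]`.
    timings = dict.fromkeys(functions, 0.0)
    for name, function in functions.items():
        start = time.perf_counter()
        function(texts)
        timings[name] = time.perf_counter() - start
    return timings


def split_on_whitespace(texts: list[str]) -> None:
    # Stand-in workload, not a real chunker.
    for text in texts:
        text.split()


def split_on_lines(texts: list[str]) -> None:
    # Second stand-in workload.
    for text in texts:
        text.splitlines()


results = time_each(
    {'whitespace': split_on_whitespace, 'lines': split_on_lines},
    ['a b c\nd e f'] * 1000,
)
for name, seconds in results.items():
    print(f'{name}: {seconds:.4f}s')
```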

semchunk-2.2.0/htmlcov/.gitignore DELETED

@@ -1,2 +0,0 @@
-# Created by coverage.py
-*

{semchunk-2.2.0 → semchunk-2.2.2}/.github/workflows/ci.yml RENAMED

File without changes

{semchunk-2.2.0 → semchunk-2.2.2}/.gitignore RENAMED

File without changes

{semchunk-2.2.0 → semchunk-2.2.2}/LICENCE RENAMED

File without changes

{semchunk-2.2.0 → semchunk-2.2.2}/src/semchunk/__init__.py RENAMED

File without changes

{semchunk-2.2.0 → semchunk-2.2.2}/src/semchunk/py.typed RENAMED

File without changes

{semchunk-2.2.0 → semchunk-2.2.2}/tests/test_semchunk.py RENAMED

File without changes
