semchunk 2.2.0__tar.gz → 2.2.2__tar.gz

@@ -1,6 +1,14 @@
1
1
  ## Changelog 🔄
2
2
  All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
3
3
 
4
+ ## [2.2.2] - 2024-12-18
5
+ ### Fixed
6
+ - Ensured `hatch` does not include irrelevant files in the distribution.
7
+
8
+ ## [2.2.1] - 2024-12-17
9
+ ### Changed
10
+ - Started benchmarking [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) in parallel to ensure a fair comparison, courtesy of [@benbrandt](https://github.com/benbrandt) ([#17](https://github.com/umarbutler/semchunk/pull/12)).
11
+
4
12
  ## [2.2.0] - 2024-07-12
5
13
  ### Changed
6
14
  - Switched from having `chunkerify()` output a function to having it return an instance of the new `Chunker()` class which should not alter functionality in any way but will allow for the preservation of type hints, fixing [#7](https://github.com/umarbutler/semchunk/pull/7).
@@ -79,6 +87,9 @@ All notable changes to `semchunk` will be documented here. This project adheres
79
87
  ### Added
80
88
  - Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.
81
89
 
90
+ [2.2.2]: https://github.com/umarbutler/semchunk/compare/v2.2.1...v2.2.2
91
+ [2.2.1]: https://github.com/umarbutler/semchunk/compare/v2.2.0...v2.2.1
92
+ [2.2.0]: https://github.com/umarbutler/semchunk/compare/v2.1.0...v2.2.0
82
93
  [2.1.0]: https://github.com/umarbutler/semchunk/compare/v2.0.0...v2.1.0
83
94
  [2.0.0]: https://github.com/umarbutler/semchunk/compare/v1.0.1...v2.0.0
84
95
  [1.0.1]: https://github.com/umarbutler/semchunk/compare/v1.0.0...v1.0.1
@@ -1,6 +1,6 @@
1
- Metadata-Version: 2.3
1
+ Metadata-Version: 2.4
2
2
  Name: semchunk
3
- Version: 2.2.0
3
+ Version: 2.2.2
4
4
  Summary: A fast and lightweight Python library for splitting text into semantically meaningful chunks.
5
5
  Project-URL: Homepage, https://github.com/umarbutler/semchunk
6
6
  Project-URL: Documentation, https://github.com/umarbutler/semchunk/blob/main/README.md
@@ -37,7 +37,7 @@ Description-Content-Type: text/markdown
37
37
 
38
38
  `semchunk` is a fast and lightweight Python library for splitting text into semantically meaningful chunks.
39
39
 
40
- Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over 90% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
40
+ Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over 80% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
41
41
 
42
42
  ## Installation 📦
43
43
  `semchunk` may be installed with `pip`:
@@ -148,7 +148,7 @@ To ensure that chunks are as semantically meaningful as possible, `semchunk` use
148
148
  1. All other characters.
149
149
 
150
150
  ## Benchmarks 📊
151
- On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.11.9, it takes `semchunk` 6.69 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 116.48 seconds to chunk the same texts into 512-token-long chunks — a difference of 94.26%.
151
+ On a desktop with a Ryzen 9 7900X, 96 GB of DDR5 5600MHz CL40 RAM, Windows 11 and Python 3.12.4, it takes `semchunk` 2.87 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (with multiprocessing) 25.03 seconds to chunk the same texts into 512-token-long chunks — a difference of 88.53%.
152
152
 
153
153
  The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](https://github.com/umarbutler/semchunk/blob/main/tests/bench.py).
154
154
 
@@ -3,7 +3,7 @@
3
3
 
4
4
  `semchunk` is a fast and lightweight Python library for splitting text into semantically meaningful chunks.
5
5
 
6
- Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over 90% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
6
+ Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over 80% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
7
7
 
8
8
  ## Installation 📦
9
9
  `semchunk` may be installed with `pip`:
@@ -114,7 +114,7 @@ To ensure that chunks are as semantically meaningful as possible, `semchunk` use
114
114
  1. All other characters.
115
115
 
116
116
  ## Benchmarks 📊
117
- On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.11.9, it takes `semchunk` 6.69 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 116.48 seconds to chunk the same texts into 512-token-long chunks — a difference of 94.26%.
117
+ On a desktop with a Ryzen 9 7900X, 96 GB of DDR5 5600MHz CL40 RAM, Windows 11 and Python 3.12.4, it takes `semchunk` 2.87 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (with multiprocessing) 25.03 seconds to chunk the same texts into 512-token-long chunks — a difference of 88.53%.
118
118
 
119
119
  The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](https://github.com/umarbutler/semchunk/blob/main/tests/bench.py).
120
120
 
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
4
4
 
5
5
  [project]
6
6
  name = "semchunk"
7
- version = "2.2.0"
7
+ version = "2.2.2"
8
8
  authors = [
9
9
  {name="Umar Butler", email="umar@umar.au"},
10
10
  ]
@@ -53,3 +53,6 @@ Homepage = "https://github.com/umarbutler/semchunk"
53
53
  Documentation = "https://github.com/umarbutler/semchunk/blob/main/README.md"
54
54
  Issues = "https://github.com/umarbutler/semchunk/issues"
55
55
  Source = "https://github.com/umarbutler/semchunk"
56
+
57
+ [tool.hatch.build.targets.sdist]
58
+ only-include = ['src/semchunk/__init__.py', 'src/semchunk/py.typed', 'src/semchunk/semchunk.py', 'pyproject.toml', 'README.md', 'LICENCE', 'CHANGELOG.md', 'tests/bench.py', 'tests/test_semchunk.py', '.github/workflows/ci.yml']
@@ -13,10 +13,7 @@ import mpire
13
13
 
14
14
  from tqdm import tqdm
15
15
 
16
- if TYPE_CHECKING:
17
- import tiktoken
18
- import tokenizers
19
- import transformers
16
+ if TYPE_CHECKING: import tiktoken, tokenizers, transformers
20
17
 
21
18
 
22
19
  _memoized_token_counters = {}
@@ -11,7 +11,7 @@ from semantic_text_splitter import TextSplitter
11
11
  CHUNK_SIZE = 512
12
12
  # END CONFIG #
13
13
 
14
- def bench() -> dict[str, float]:
14
+ def bench() -> dict[str, float]:
15
15
  # Initialise the chunkers.
16
16
  semchunk_chunker = semchunk.chunkerify(tiktoken.encoding_for_model('gpt-4'), CHUNK_SIZE)
17
17
  sts_chunker = TextSplitter.from_tiktoken_model('gpt-4', CHUNK_SIZE)
@@ -21,7 +21,7 @@ def bench() -> dict[str, float]:
21
21
  semchunk_chunker(texts)
22
22
 
23
23
  def bench_sts(texts: list[str]) -> None:
24
- [sts_chunker.chunks(text) for text in texts]
24
+ sts_chunker.chunk_all(texts)
25
25
 
26
26
  libraries = {
27
27
  'semchunk': bench_semchunk,
@@ -31,22 +31,24 @@ def bench() -> dict[str, float]:
31
31
  # Download the Gutenberg corpus.
32
32
  try:
33
33
  gutenberg = nltk.corpus.gutenberg
34
-
34
+
35
35
  except Exception:
36
36
  nltk.download('gutenberg')
37
37
  gutenberg = nltk.corpus.gutenberg
38
-
38
+
39
39
  # Benchmark the libraries.
40
- benchmarks = dict.fromkeys(libraries.keys(), 0)
40
+ benchmarks = dict.fromkeys(libraries.keys(), 0.0)
41
41
  texts = [gutenberg.raw(fileid) for fileid in gutenberg.fileids()]
42
-
42
+
43
43
  for library, function in libraries.items():
44
44
  start = time.time()
45
45
  function(texts)
46
46
  benchmarks[library] = time.time() - start
47
-
47
+
48
48
  return benchmarks
49
49
 
50
50
  if __name__ == '__main__':
51
+ nltk.download('gutenberg')
52
+
51
53
  for library, time_taken in bench().items():
52
- print(f'{library}: {time_taken:.2f}s')
54
+ print(f'{library}: {time_taken:.2f}s')
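Note on the `dict.fromkeys(libraries.keys(), 0.0)` change above: seeding the results with `0.0` rather than `0` keeps every value a `float`, which matches the `dict[str, float]` return annotation of `bench()`. The following is a minimal sketch of the same timing loop that runs without `nltk`, `tiktoken` or either chunking library. The `fake_fast` and `fake_slow` functions are hypothetical stand-ins, not part of the package, and `time.perf_counter()` is swapped in for `time.time()` as an assumption about what suits short measurements better.

```python
import time
from typing import Callable


def bench(
    libraries: dict[str, Callable[[list[str]], None]],
    texts: list[str],
) -> dict[str, float]:
    # Seed with 0.0 (not 0) so the values are floats from the start,
    # consistent with the declared return type.
    benchmarks = dict.fromkeys(libraries, 0.0)

    for library, function in libraries.items():
        start = time.perf_counter()
        function(texts)
        benchmarks[library] = time.perf_counter() - start

    return benchmarks


# Hypothetical stand-ins for the real chunkers.
def fake_fast(texts: list[str]) -> None:
    for text in texts:
        text.split()


def fake_slow(texts: list[str]) -> None:
    for text in texts:
        [character for character in text]


results = bench(
    {'fast': fake_fast, 'slow': fake_slow},
    ['some sample text ' * 1000] * 10,
)

for library, time_taken in results.items():
    print(f'{library}: {time_taken:.4f}s')
```

Passing the libraries and texts in as arguments is a choice made for this sketch only; the `bench()` in the diff builds both internally.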
@@ -1,2 +0,0 @@
1
- # Created by coverage.py
2
- *
File without changes
File without changes
File without changes
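Note on the Benchmarks figures in this diff: the quoted "difference" percentages (94.26% before, 88.53% after) are consistent with the relative reduction in wall-clock time, (slow - fast) / slow. The source does not state the formula, so treat it as an assumption; the helper name `relative_time_saving` is hypothetical. A quick check under that assumption:

```python
def relative_time_saving(fast_seconds: float, slow_seconds: float) -> float:
    """Percentage of the slower library's runtime saved by the faster one."""
    return (slow_seconds - fast_seconds) / slow_seconds * 100


# Timings as reported in the old and new Benchmarks paragraphs.
old_saving = relative_time_saving(fast_seconds=6.69, slow_seconds=116.48)
new_saving = relative_time_saving(fast_seconds=2.87, slow_seconds=25.03)

print(f'2.2.0 README: {old_saving:.2f}%')  # 94.26%
print(f'2.2.2 README: {new_saving:.2f}%')  # 88.53%
```

Both results match the percentages in the respective README versions, and the new figure supports the revised "over 80% faster" wording in the introduction.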