semchunk 2.2.0__tar.gz → 2.2.2__tar.gz
- {semchunk-2.2.0 → semchunk-2.2.2}/CHANGELOG.md +11 -0
- {semchunk-2.2.0 → semchunk-2.2.2}/PKG-INFO +4 -4
- {semchunk-2.2.0 → semchunk-2.2.2}/README.md +2 -2
- {semchunk-2.2.0 → semchunk-2.2.2}/pyproject.toml +4 -1
- {semchunk-2.2.0 → semchunk-2.2.2}/src/semchunk/semchunk.py +1 -4
- {semchunk-2.2.0 → semchunk-2.2.2}/tests/bench.py +10 -8
- semchunk-2.2.0/htmlcov/.gitignore +0 -2
- {semchunk-2.2.0 → semchunk-2.2.2}/.github/workflows/ci.yml +0 -0
- {semchunk-2.2.0 → semchunk-2.2.2}/.gitignore +0 -0
- {semchunk-2.2.0 → semchunk-2.2.2}/LICENCE +0 -0
- {semchunk-2.2.0 → semchunk-2.2.2}/src/semchunk/__init__.py +0 -0
- {semchunk-2.2.0 → semchunk-2.2.2}/src/semchunk/py.typed +0 -0
- {semchunk-2.2.0 → semchunk-2.2.2}/tests/test_semchunk.py +0 -0
--- semchunk-2.2.0/CHANGELOG.md
+++ semchunk-2.2.2/CHANGELOG.md
@@ -1,6 +1,14 @@
 ## Changelog 🔄
 All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [2.2.2] - 2024-12-18
+### Fixed
+- Ensured `hatch` does not include irrelevant files in the distribution.
+
+## [2.2.1] - 2024-12-17
+### Changed
+- Started benchmarking [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) in parallel to ensure a fair comparison, courtesy of [@benbrandt](https://github.com/benbrandt) ([#17](https://github.com/umarbutler/semchunk/pull/12)).
+
 ## [2.2.0] - 2024-07-12
 ### Changed
 - Switched from having `chunkerify()` output a function to having it return an instance of the new `Chunker()` class which should not alter functionality in any way but will allow for the preservation of type hints, fixing [#7](https://github.com/umarbutler/semchunk/pull/7).
@@ -79,6 +87,9 @@ All notable changes to `semchunk` will be documented here. This project adheres
 ### Added
 - Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.
 
+[2.2.2]: https://github.com/umarbutler/semchunk/compare/v2.2.1...v2.2.2
+[2.2.1]: https://github.com/umarbutler/semchunk/compare/v2.2.0...v2.2.1
+[2.2.0]: https://github.com/umarbutler/semchunk/compare/v2.1.0...v2.2.0
 [2.1.0]: https://github.com/umarbutler/semchunk/compare/v2.0.0...v2.1.0
 [2.0.0]: https://github.com/umarbutler/semchunk/compare/v1.0.1...v2.0.0
 [1.0.1]: https://github.com/umarbutler/semchunk/compare/v1.0.0...v1.0.1
--- semchunk-2.2.0/PKG-INFO
+++ semchunk-2.2.2/PKG-INFO
@@ -1,6 +1,6 @@
-Metadata-Version: 2.
+Metadata-Version: 2.4
 Name: semchunk
-Version: 2.2.
+Version: 2.2.2
 Summary: A fast and lightweight Python library for splitting text into semantically meaningful chunks.
 Project-URL: Homepage, https://github.com/umarbutler/semchunk
 Project-URL: Documentation, https://github.com/umarbutler/semchunk/blob/main/README.md
@@ -37,7 +37,7 @@ Description-Content-Type: text/markdown
 
 `semchunk` is a fast and lightweight Python library for splitting text into semantically meaningful chunks.
 
-Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over
+Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over 80% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
 
 ## Installation 📦
 `semchunk` may be installed with `pip`:
@@ -148,7 +148,7 @@ To ensure that chunks are as semantically meaningful as possible, `semchunk` use
 1. All other characters.
 
 ## Benchmarks 📊
-On a desktop with a Ryzen
+On a desktop with a Ryzen 9 7900X, 96 GB of DDR5 5600MHz CL40 RAM, Windows 11 and Python 3.12.4, it takes `semchunk` 2.87 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (with multiprocessing) 25.03 seconds to chunk the same texts into 512-token-long chunks — a difference of 88.53%.
 
 The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](https://github.com/umarbutler/semchunk/blob/main/tests/bench.py).
 
--- semchunk-2.2.0/README.md
+++ semchunk-2.2.2/README.md
@@ -3,7 +3,7 @@
 
 `semchunk` is a fast and lightweight Python library for splitting text into semantically meaningful chunks.
 
-Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over
+Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.2/docs/how_to/recursive_text_splitter/#splitting-text-from-languages-without-word-boundaries) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and is also over 80% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
 
 ## Installation 📦
 `semchunk` may be installed with `pip`:
@@ -114,7 +114,7 @@ To ensure that chunks are as semantically meaningful as possible, `semchunk` use
 1. All other characters.
 
 ## Benchmarks 📊
-On a desktop with a Ryzen
+On a desktop with a Ryzen 9 7900X, 96 GB of DDR5 5600MHz CL40 RAM, Windows 11 and Python 3.12.4, it takes `semchunk` 2.87 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (with multiprocessing) 25.03 seconds to chunk the same texts into 512-token-long chunks — a difference of 88.53%.
 
 The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](https://github.com/umarbutler/semchunk/blob/main/tests/bench.py).
 
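Note on the benchmark figure in the README hunk above: the README does not say how the 88.53% "difference" is computed. It is consistent with the reduction in wall-clock time relative to the slower library. The sketch below assumes that definition and uses only the two timings quoted in the paragraph.

```python
# Timings quoted in the README's Benchmarks paragraph, in seconds.
semchunk_seconds = 2.87
sts_seconds = 25.03

# Assumed definition (not stated in the README): time saved,
# expressed as a share of the slower library's running time.
reduction = (sts_seconds - semchunk_seconds) / sts_seconds * 100

print(f'{reduction:.2f}%')  # prints 88.53%
```

Under that reading, the "over 80% faster" wording in the introduction refers to time saved rather than to a throughput ratio; on the same numbers, `semchunk` finishes roughly 8.7 times sooner.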
--- semchunk-2.2.0/pyproject.toml
+++ semchunk-2.2.2/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "semchunk"
-version = "2.2.
+version = "2.2.2"
 authors = [
     {name="Umar Butler", email="umar@umar.au"},
 ]
@@ -53,3 +53,6 @@ Homepage = "https://github.com/umarbutler/semchunk"
 Documentation = "https://github.com/umarbutler/semchunk/blob/main/README.md"
 Issues = "https://github.com/umarbutler/semchunk/issues"
 Source = "https://github.com/umarbutler/semchunk"
+
+[tool.hatch.build.targets.sdist]
+only-include = ['src/semchunk/__init__.py', 'src/semchunk/py.typed', 'src/semchunk/semchunk.py', 'pyproject.toml', 'README.md', 'LICENCE', 'CHANGELOG.md', 'tests/bench.py', 'tests/test_semchunk.py', '.github/workflows/ci.yml']
--- semchunk-2.2.0/tests/bench.py
+++ semchunk-2.2.2/tests/bench.py
@@ -11,7 +11,7 @@ from semantic_text_splitter import TextSplitter
 CHUNK_SIZE = 512
 # END CONFIG #
 
-def bench() -> dict[str, float]:
+def bench() -> dict[str, float]:
     # Initialise the chunkers.
     semchunk_chunker = semchunk.chunkerify(tiktoken.encoding_for_model('gpt-4'), CHUNK_SIZE)
     sts_chunker = TextSplitter.from_tiktoken_model('gpt-4', CHUNK_SIZE)
@@ -21,7 +21,7 @@ def bench() -> dict[str, float]:
         semchunk_chunker(texts)
 
     def bench_sts(texts: list[str]) -> None:
-
+        sts_chunker.chunk_all(texts)
 
     libraries = {
         'semchunk': bench_semchunk,
@@ -31,22 +31,24 @@ def bench() -> dict[str, float]:
     # Download the Gutenberg corpus.
     try:
         gutenberg = nltk.corpus.gutenberg
-
+
     except Exception:
         nltk.download('gutenberg')
         gutenberg = nltk.corpus.gutenberg
-
+
     # Benchmark the libraries.
-    benchmarks = dict.fromkeys(libraries.keys(), 0)
+    benchmarks = dict.fromkeys(libraries.keys(), 0.0)
     texts = [gutenberg.raw(fileid) for fileid in gutenberg.fileids()]
-
+
     for library, function in libraries.items():
         start = time.time()
         function(texts)
         benchmarks[library] = time.time() - start
-
+
     return benchmarks
 
 if __name__ == '__main__':
+    nltk.download('gutenberg')
+
     for library, time_taken in bench().items():
-        print(f'{library}: {time_taken:.2f}s')
+        print(f'{library}: {time_taken:.2f}s')
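Note on the `dict.fromkeys(libraries.keys(), 0.0)` change in the hunk above: seeding the results with `0.0` instead of `0` keeps every value a `float`, which matches the `dict[str, float]` return annotation on `bench()`. The sketch below reproduces the timing loop in isolation so it runs without `nltk`, `tiktoken` or either chunking library. The `split_on_whitespace` function and the sample texts are hypothetical stand-ins, and `bench()` here takes its inputs as parameters, unlike the version in `tests/bench.py`.

```python
import time
from typing import Callable


def bench(
    libraries: dict[str, Callable[[list[str]], None]],
    texts: list[str],
) -> dict[str, float]:
    # Seed with 0.0 (not 0) so the values are floats from the start.
    benchmarks = dict.fromkeys(libraries, 0.0)

    for library, function in libraries.items():
        start = time.time()
        function(texts)
        benchmarks[library] = time.time() - start

    return benchmarks


def split_on_whitespace(texts: list[str]) -> None:
    # Hypothetical stand-in for a real chunker; it only splits on whitespace.
    for text in texts:
        text.split()


results = bench({'whitespace': split_on_whitespace}, ['a b c', 'd e f'])

for library, time_taken in results.items():
    print(f'{library}: {time_taken:.2f}s')
```

Iterating over a dictionary yields its keys, so `dict.fromkeys(libraries, 0.0)` in this sketch builds the same mapping as the `libraries.keys()` form used in the diff.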