yasbd-lib 0.1.0__tar.gz

@@ -0,0 +1,356 @@
1
+ Mozilla Public License Version 2.0
2
+ ==================================
3
+
4
+ ### 1. Definitions
5
+
6
+ **1.1. “Contributor”**
7
+ means each individual or legal entity that creates, contributes to
8
+ the creation of, or owns Covered Software.
9
+
10
+ **1.2. “Contributor Version”**
11
+ means the combination of the Contributions of others (if any) used
12
+ by a Contributor and that particular Contributor's Contribution.
13
+
14
+ **1.3. “Contribution”**
15
+ means Covered Software of a particular Contributor.
16
+
17
+ **1.4. “Covered Software”**
18
+ means Source Code Form to which the initial Contributor has attached
19
+ the notice in Exhibit A, the Executable Form of such Source Code
20
+ Form, and Modifications of such Source Code Form, in each case
21
+ including portions thereof.
22
+
23
+ **1.5. “Incompatible With Secondary Licenses”**
24
+ means
25
+
26
+ * **(a)** that the initial Contributor has attached the notice described
27
+ in Exhibit B to the Covered Software; or
28
+ * **(b)** that the Covered Software was made available under the terms of
29
+ version 1.1 or earlier of the License, but not also under the
30
+ terms of a Secondary License.
31
+
32
+ **1.6. “Executable Form”**
33
+ means any form of the work other than Source Code Form.
34
+
35
+ **1.7. “Larger Work”**
36
+ means a work that combines Covered Software with other material, in
37
+ a separate file or files, that is not Covered Software.
38
+
39
+ **1.8. “License”**
40
+ means this document.
41
+
42
+ **1.9. “Licensable”**
43
+ means having the right to grant, to the maximum extent possible,
44
+ whether at the time of the initial grant or subsequently, any and
45
+ all of the rights conveyed by this License.
46
+
47
+ **1.10. “Modifications”**
48
+ means any of the following:
49
+
50
+ * **(a)** any file in Source Code Form that results from an addition to,
51
+ deletion from, or modification of the contents of Covered
52
+ Software; or
53
+ * **(b)** any new file in Source Code Form that contains any Covered
54
+ Software.
55
+
56
+ **1.11. “Patent Claims” of a Contributor**
57
+ means any patent claim(s), including without limitation, method,
58
+ process, and apparatus claims, in any patent Licensable by such
59
+ Contributor that would be infringed, but for the grant of the
60
+ License, by the making, using, selling, offering for sale, having
61
+ made, import, or transfer of either its Contributions or its
62
+ Contributor Version.
63
+
64
+ **1.12. “Secondary License”**
65
+ means either the GNU General Public License, Version 2.0, the GNU
66
+ Lesser General Public License, Version 2.1, the GNU Affero General
67
+ Public License, Version 3.0, or any later versions of those
68
+ licenses.
69
+
70
+ **1.13. “Source Code Form”**
71
+ means the form of the work preferred for making modifications.
72
+
73
+ **1.14. “You” (or “Your”)**
74
+ means an individual or a legal entity exercising rights under this
75
+ License. For legal entities, “You” includes any entity that
76
+ controls, is controlled by, or is under common control with You. For
77
+ purposes of this definition, “control” means **(a)** the power, direct
78
+ or indirect, to cause the direction or management of such entity,
79
+ whether by contract or otherwise, or **(b)** ownership of more than
80
+ fifty percent (50%) of the outstanding shares or beneficial
81
+ ownership of such entity.
82
+
83
+
84
+ ### 2. License Grants and Conditions
85
+
86
+ #### 2.1. Grants
87
+
88
+ Each Contributor hereby grants You a world-wide, royalty-free,
89
+ non-exclusive license:
90
+
91
+ * **(a)** under intellectual property rights (other than patent or trademark)
92
+ Licensable by such Contributor to use, reproduce, make available,
93
+ modify, display, perform, distribute, and otherwise exploit its
94
+ Contributions, either on an unmodified basis, with Modifications, or
95
+ as part of a Larger Work; and
96
+ * **(b)** under Patent Claims of such Contributor to make, use, sell, offer
97
+ for sale, have made, import, and otherwise transfer either its
98
+ Contributions or its Contributor Version.
99
+
100
+ #### 2.2. Effective Date
101
+
102
+ The licenses granted in Section 2.1 with respect to any Contribution
103
+ become effective for each Contribution on the date the Contributor first
104
+ distributes such Contribution.
105
+
106
+ #### 2.3. Limitations on Grant Scope
107
+
108
+ The licenses granted in this Section 2 are the only rights granted under
109
+ this License. No additional rights or licenses will be implied from the
110
+ distribution or licensing of Covered Software under this License.
111
+ Notwithstanding Section 2.1(b) above, no patent license is granted by a
112
+ Contributor:
113
+
114
+ * **(a)** for any code that a Contributor has removed from Covered Software;
115
+ or
116
+ * **(b)** for infringements caused by: **(i)** Your and any other third party's
117
+ modifications of Covered Software, or **(ii)** the combination of its
118
+ Contributions with other software (except as part of its Contributor
119
+ Version); or
120
+ * **(c)** under Patent Claims infringed by Covered Software in the absence of
121
+ its Contributions.
122
+
123
+ This License does not grant any rights in the trademarks, service marks,
124
+ or logos of any Contributor (except as may be necessary to comply with
125
+ the notice requirements in Section 3.4).
126
+
127
+ #### 2.4. Subsequent Licenses
128
+
129
+ No Contributor makes additional grants as a result of Your choice to
130
+ distribute the Covered Software under a subsequent version of this
131
+ License (see Section 10.2) or under the terms of a Secondary License (if
132
+ permitted under the terms of Section 3.3).
133
+
134
+ #### 2.5. Representation
135
+
136
+ Each Contributor represents that the Contributor believes its
137
+ Contributions are its original creation(s) or it has sufficient rights
138
+ to grant the rights to its Contributions conveyed by this License.
139
+
140
+ #### 2.6. Fair Use
141
+
142
+ This License is not intended to limit any rights You have under
143
+ applicable copyright doctrines of fair use, fair dealing, or other
144
+ equivalents.
145
+
146
+ #### 2.7. Conditions
147
+
148
+ Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted
149
+ in Section 2.1.
150
+
151
+
152
+ ### 3. Responsibilities
153
+
154
+ #### 3.1. Distribution of Source Form
155
+
156
+ All distribution of Covered Software in Source Code Form, including any
157
+ Modifications that You create or to which You contribute, must be under
158
+ the terms of this License. You must inform recipients that the Source
159
+ Code Form of the Covered Software is governed by the terms of this
160
+ License, and how they can obtain a copy of this License. You may not
161
+ attempt to alter or restrict the recipients' rights in the Source Code
162
+ Form.
163
+
164
+ #### 3.2. Distribution of Executable Form
165
+
166
+ If You distribute Covered Software in Executable Form then:
167
+
168
+ * **(a)** such Covered Software must also be made available in Source Code
169
+ Form, as described in Section 3.1, and You must inform recipients of
170
+ the Executable Form how they can obtain a copy of such Source Code
171
+ Form by reasonable means in a timely manner, at a charge no more
172
+ than the cost of distribution to the recipient; and
173
+
174
+ * **(b)** You may distribute such Executable Form under the terms of this
175
+ License, or sublicense it under different terms, provided that the
176
+ license for the Executable Form does not attempt to limit or alter
177
+ the recipients' rights in the Source Code Form under this License.
178
+
179
+ #### 3.3. Distribution of a Larger Work
180
+
181
+ You may create and distribute a Larger Work under terms of Your choice,
182
+ provided that You also comply with the requirements of this License for
183
+ the Covered Software. If the Larger Work is a combination of Covered
184
+ Software with a work governed by one or more Secondary Licenses, and the
185
+ Covered Software is not Incompatible With Secondary Licenses, this
186
+ License permits You to additionally distribute such Covered Software
187
+ under the terms of such Secondary License(s), so that the recipient of
188
+ the Larger Work may, at their option, further distribute the Covered
189
+ Software under the terms of either this License or such Secondary
190
+ License(s).
191
+
192
+ #### 3.4. Notices
193
+
194
+ You may not remove or alter the substance of any license notices
195
+ (including copyright notices, patent notices, disclaimers of warranty,
196
+ or limitations of liability) contained within the Source Code Form of
197
+ the Covered Software, except that You may alter any license notices to
198
+ the extent required to remedy known factual inaccuracies.
199
+
200
+ #### 3.5. Application of Additional Terms
201
+
202
+ You may choose to offer, and to charge a fee for, warranty, support,
203
+ indemnity or liability obligations to one or more recipients of Covered
204
+ Software. However, You may do so only on Your own behalf, and not on
205
+ behalf of any Contributor. You must make it absolutely clear that any
206
+ such warranty, support, indemnity, or liability obligation is offered by
207
+ You alone, and You hereby agree to indemnify every Contributor for any
208
+ liability incurred by such Contributor as a result of warranty, support,
209
+ indemnity or liability terms You offer. You may include additional
210
+ disclaimers of warranty and limitations of liability specific to any
211
+ jurisdiction.
212
+
213
+
214
+ ### 4. Inability to Comply Due to Statute or Regulation
215
+
216
+ If it is impossible for You to comply with any of the terms of this
217
+ License with respect to some or all of the Covered Software due to
218
+ statute, judicial order, or regulation then You must: **(a)** comply with
219
+ the terms of this License to the maximum extent possible; and **(b)**
220
+ describe the limitations and the code they affect. Such description must
221
+ be placed in a text file included with all distributions of the Covered
222
+ Software under this License. Except to the extent prohibited by statute
223
+ or regulation, such description must be sufficiently detailed for a
224
+ recipient of ordinary skill to be able to understand it.
225
+
226
+
227
+ ### 5. Termination
228
+
229
+ **5.1.** The rights granted under this License will terminate automatically
230
+ if You fail to comply with any of its terms. However, if You become
231
+ compliant, then the rights granted under this License from a particular
232
+ Contributor are reinstated **(a)** provisionally, unless and until such
233
+ Contributor explicitly and finally terminates Your grants, and **(b)** on an
234
+ ongoing basis, if such Contributor fails to notify You of the
235
+ non-compliance by some reasonable means prior to 60 days after You have
236
+ come back into compliance. Moreover, Your grants from a particular
237
+ Contributor are reinstated on an ongoing basis if such Contributor
238
+ notifies You of the non-compliance by some reasonable means, this is the
239
+ first time You have received notice of non-compliance with this License
240
+ from such Contributor, and You become compliant prior to 30 days after
241
+ Your receipt of the notice.
242
+
243
+ **5.2.** If You initiate litigation against any entity by asserting a patent
244
+ infringement claim (excluding declaratory judgment actions,
245
+ counter-claims, and cross-claims) alleging that a Contributor Version
246
+ directly or indirectly infringes any patent, then the rights granted to
247
+ You by any and all Contributors for the Covered Software under Section
248
+ 2.1 of this License shall terminate.
249
+
250
+ **5.3.** In the event of termination under Sections 5.1 or 5.2 above, all
251
+ end user license agreements (excluding distributors and resellers) which
252
+ have been validly granted by You or Your distributors under this License
253
+ prior to termination shall survive termination.
254
+
255
+
256
+ ### 6. Disclaimer of Warranty
257
+
258
+ > Covered Software is provided under this License on an “as is”
259
+ > basis, without warranty of any kind, either expressed, implied, or
260
+ > statutory, including, without limitation, warranties that the
261
+ > Covered Software is free of defects, merchantable, fit for a
262
+ > particular purpose or non-infringing. The entire risk as to the
263
+ > quality and performance of the Covered Software is with You.
264
+ > Should any Covered Software prove defective in any respect, You
265
+ > (not any Contributor) assume the cost of any necessary servicing,
266
+ > repair, or correction. This disclaimer of warranty constitutes an
267
+ > essential part of this License. No use of any Covered Software is
268
+ > authorized under this License except under this disclaimer.
269
+
270
+ ### 7. Limitation of Liability
271
+
272
+ > Under no circumstances and under no legal theory, whether tort
273
+ > (including negligence), contract, or otherwise, shall any
274
+ > Contributor, or anyone who distributes Covered Software as
275
+ > permitted above, be liable to You for any direct, indirect,
276
+ > special, incidental, or consequential damages of any character
277
+ > including, without limitation, damages for lost profits, loss of
278
+ > goodwill, work stoppage, computer failure or malfunction, or any
279
+ > and all other commercial damages or losses, even if such party
280
+ > shall have been informed of the possibility of such damages. This
281
+ > limitation of liability shall not apply to liability for death or
282
+ > personal injury resulting from such party's negligence to the
283
+ > extent applicable law prohibits such limitation. Some
284
+ > jurisdictions do not allow the exclusion or limitation of
285
+ > incidental or consequential damages, so this exclusion and
286
+ > limitation may not apply to You.
287
+
288
+
289
+ ### 8. Litigation
290
+
291
+ Any litigation relating to this License may be brought only in the
292
+ courts of a jurisdiction where the defendant maintains its principal
293
+ place of business and such litigation shall be governed by laws of that
294
+ jurisdiction, without reference to its conflict-of-law provisions.
295
+ Nothing in this Section shall prevent a party's ability to bring
296
+ cross-claims or counter-claims.
297
+
298
+
299
+ ### 9. Miscellaneous
300
+
301
+ This License represents the complete agreement concerning the subject
302
+ matter hereof. If any provision of this License is held to be
303
+ unenforceable, such provision shall be reformed only to the extent
304
+ necessary to make it enforceable. Any law or regulation which provides
305
+ that the language of a contract shall be construed against the drafter
306
+ shall not be used to construe this License against a Contributor.
307
+
308
+
309
+ ### 10. Versions of the License
310
+
311
+ #### 10.1. New Versions
312
+
313
+ Mozilla Foundation is the license steward. Except as provided in Section
314
+ 10.3, no one other than the license steward has the right to modify or
315
+ publish new versions of this License. Each version will be given a
316
+ distinguishing version number.
317
+
318
+ #### 10.2. Effect of New Versions
319
+
320
+ You may distribute the Covered Software under the terms of the version
321
+ of the License under which You originally received the Covered Software,
322
+ or under the terms of any subsequent version published by the license
323
+ steward.
324
+
325
+ #### 10.3. Modified Versions
326
+
327
+ If you create software not governed by this License, and you want to
328
+ create a new license for such software, you may create and use a
329
+ modified version of this License if you rename the license and remove
330
+ any references to the name of the license steward (except to note that
331
+ such modified license differs from this License).
332
+
333
+ #### 10.4. Distributing Source Code Form that is Incompatible With Secondary Licenses
334
+
335
+ If You choose to distribute Source Code Form that is Incompatible With
336
+ Secondary Licenses under the terms of this version of the License, the
337
+ notice described in Exhibit B of this License must be attached.
338
+
339
+ ## Exhibit A - Source Code Form License Notice
340
+
341
+ This Source Code Form is subject to the terms of the Mozilla Public
342
+ License, v. 2.0. If a copy of the MPL was not distributed with this
343
+ file, You can obtain one at http://mozilla.org/MPL/2.0/.
344
+
345
+ If it is not possible or desirable to put the notice in a particular
346
+ file, then You may include the notice in a location (such as a LICENSE
347
+ file in a relevant directory) where a recipient would be likely to look
348
+ for such a notice.
349
+
350
+ You may add additional accurate notices of copyright ownership.
351
+
352
+ ## Exhibit B - “Incompatible With Secondary Licenses” Notice
353
+
354
+ This Source Code Form is "Incompatible With Secondary Licenses", as
355
+ defined by the Mozilla Public License, v. 2.0.
356
+
@@ -0,0 +1,311 @@
1
+ Metadata-Version: 2.4
2
+ Name: yasbd-lib
3
+ Version: 0.1.0
4
+ Summary: A high-accuracy, from-scratch Sentence Boundary Detector (SBD) for production pipelines. Features a drop-in adapter for pysbd to fix edge cases without heavy refactoring.
5
+ Author-email: speedyk_005 <speedy40115719@gmail.com>
6
+ License-Expression: MPL-2.0
7
+ Project-URL: Homepage, https://github.com/speedyk-005/yasbd-lib
8
+ Project-URL: Repository, https://github.com/speedyk-005/yasbd-lib
9
+ Project-URL: Documentation, https://speedyk-005.github.io/yasbd/
10
+ Project-URL: Issues, https://github.com/speedyk-005/yasbd-lib/issues
11
+ Project-URL: Changelog, https://github.com/speedyk-005/yasbd-lib/blob/main/CHANGELOG.md
12
+ Keywords: sentence-segmentation,sentence-boundary-detection,sbd,sentence-splitting,sentence-splitter,multilingual,text processing,text-splitting,natural language processing,nlp
13
+ Classifier: Development Status :: 3 - Alpha
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: Intended Audience :: Science/Research
16
+ Classifier: Operating System :: OS Independent
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Programming Language :: Python :: 3.13
20
+ Classifier: Programming Language :: Python :: 3.14
21
+ Classifier: Topic :: Text Processing
22
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
23
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
24
+ Requires-Python: >=3.11
25
+ Description-Content-Type: text/markdown
26
+ License-File: LICENSE
27
+ Requires-Dist: loguru<1.0,>=0.7.3
28
+ Requires-Dist: regex>=2025.7.29
29
+ Requires-Dist: ftfy<7.0,>=6.2.0
30
+ Requires-Dist: pydantic<3.0,>=2.12.2
31
+ Provides-Extra: bench
32
+ Requires-Dist: pysbd; extra == "bench"
33
+ Requires-Dist: sentencex; extra == "bench"
34
+ Requires-Dist: sentsplit; extra == "bench"
35
+ Requires-Dist: nupunkt; extra == "bench"
36
+ Requires-Dist: blingfire; extra == "bench"
37
+ Requires-Dist: sentence-splitter; extra == "bench"
38
+ Requires-Dist: setuptools<81; extra == "bench"
39
+ Requires-Dist: matplotlib; extra == "bench"
40
+ Provides-Extra: dev
41
+ Requires-Dist: pytest>=8.3.5; extra == "dev"
42
+ Requires-Dist: pytest-cov>=6.2.1; extra == "dev"
43
+ Requires-Dist: pre-commit>=4.4.0; extra == "dev"
44
+ Requires-Dist: python-docstring-markdown>=0.2.2; extra == "dev"
45
+ Requires-Dist: ruff>=0.14.14; extra == "dev"
46
+ Dynamic: license-file
47
+
48
+ <div align="center">
49
+ <img src="https://github.com/speedyk-005/yasbd-lib/blob/main/yasbd_logo.png?raw=true" alt="Yasbd-lib Logo" width="500"/>
50
+ <p><i>“Even a pair of scissors deserves to be smart. Welcome to cybernetic boundary shearing.”</i></p>
51
+ </div>
52
+
53
+ [![Python Version](https://img.shields.io/badge/Python-3.11%20--%203.14-blue)](https://www.python.org/downloads/)
54
+ [![PyPI](https://img.shields.io/pypi/v/yasbd-lib)](https://pypi.org/project/yasbd-lib)
55
+ [![Coverage Status](https://coveralls.io/repos/github/speedyk-005/yasbd-lib/badge.svg?branch=main&kill_cache=1)](https://coveralls.io/github/speedyk-005/yasbd-lib?branch=main)
56
+ [![Stability](https://img.shields.io/badge/stability-alpha-red)](https://github.com/speedyk-005/yasbd-lib)
57
+ [![License: MPL 2.0](https://img.shields.io/badge/License-MPL_2.0-brightgreen.svg)](https://opensource.org/licenses/MPL-2.0)
58
+ [![Tests](https://img.shields.io/badge/tests-passing-brightgreen)](https://github.com/speedyk-005/yasbd-lib/actions)
59
+ [![CodeFactor](https://www.codefactor.io/repository/github/speedyk-005/yasbd-lib/badge)](https://www.codefactor.io/repository/github/speedyk-005/yasbd-lib)
60
+ [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/speedyk-005/yasbd-lib)
61
+
62
+ > [!WARNING]
63
+ > This project is currently in alpha.
64
+
65
+ ---
66
+
67
+ <!-- START doctoc generated TOC please keep comment here to allow auto update -->
68
+ <!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
69
+ **Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)*
70
+
71
+ - [Manifesto](#manifesto)
72
+ - [✂ Why do I need a pair of "smart scissors" for text?](#-why-do-i-need-a-pair-of-smart-scissors-for-text)
73
+ - [🔪 Are these shears just a rusty regex loop spray-painted in carbon fiber?](#-are-these-shears-just-a-rusty-regex-loop-spray-painted-in-carbon-fiber)
74
+ - [📦 Installation](#-installation)
75
+ - [The Quick & Easy Way](#the-quick--easy-way)
76
+ - [The From-Source Way](#the-from-source-way)
77
+ - [Want to Help Make yasbd Even Better?](#want-to-help-make-yasbd-even-better)
78
+ - [📟 Usage](#-usage)
79
+ - [Initialization](#initialization)
80
+ - [Boundary detection](#boundary-detection)
81
+ - [Segmentation](#segmentation)
82
+ - [Cleaner](#cleaner)
83
+ - [Adapter](#adapter)
84
+ - [🗺 Features & Roadmap](#-features--roadmap)
85
+ - [🏁 Benchmarks](#-benchmarks)
86
+ - [📜 Last note](#-last-note)
87
+
88
+ <!-- END doctoc generated TOC please keep comment here to allow auto update -->
89
+
90
+ ---
91
+
92
+ ## Manifesto
93
+
94
+ **Y**et **A**nother **S**entence **B**oundary **D**etector is a pair of smart scissors for text. Pointer-based, from-scratch [SBD](https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation) for production pipelines. Features a drop-in adapter for pysbd to fix edge cases without heavy refactoring. Five languages are supported today (en, fr, es, ht, ja). The target is 22+.
95
+
96
+ ### ✂ Why do I need a pair of "smart scissors" for text?
97
+
98
+ The usual approach is running `re.split(r'\.\s+[A-Z]', text)` and praying. This blunt tool instantly shears titles like `Mr. Smith` or French corporate markers like `Sté. Générale` in half, scattering semantic fragments across your pipeline.
99
+ Punctuation is the most overloaded glyph set in text. A period alone does six jobs and only one is "sentence end." Generic split-on-punctuation fails on:
100
+
101
+ - `Dr.` `Inc.` `U.S.A.` (abbreviation markers, not boundaries. ~47% of periods in news text are these)
102
+ - `3.5M` `3.14` (decimal points, not sentence ends)
103
+ - `D. H. Lawrence` (initials. Two periods, zero boundaries)
104
+ - `...` (ellipsis. Trailing off or sentence end? ambiguous)
105
+ - `1.` `a.` at line start (inline list markers impersonating sentence ends)
106
+ - `?!` inside quotes (punctuation nesting across boundaries)
107
+
108
+ And multilingual quirks a naive splitter never saw coming.
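
To make the failure concrete, here is a small standard-library demo of the naive pattern quoted above. It does not use yasbd at all, and the sample sentence is made up for illustration.

```python
import re

text = "Mr. Smith arrived. He sat down."

# The naive pattern: a period, whitespace, then a capital letter.
# Note that the capital letter is part of the match, so re.split eats it too.
naive = re.split(r"\.\s+[A-Z]", text)
print(naive)
# ['Mr', 'mith arrived', 'e sat down.']
```

The title `Mr.` is cut off from the name, and the first letter of every following fragment is lost because it belongs to the separator.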
109
+
110
+ ### 🔪 Are these shears just a rusty regex loop spray-painted in carbon fiber?
111
+
112
+ Regex is how I cut. Not what I am. My brain is a two-pass pipeline. Pass one finds every possible boundary, greedy and over-inclusive. Pass two surgically removes false positives by cross-referencing 150+ curated abbreviations across 8 semantic categories, checking context before and after each candidate. Quote spans, parentheses, list markers, ellipsis, contiguous terminators -- each gets its own refiner.
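
The two-pass idea can be sketched in a few lines of plain Python. This is only an illustration of "over-generate, then filter", not yasbd's actual implementation: the tiny `ABBREVIATIONS` set and the function names `find_candidates`, `refine`, and `detect_sketch` are hypothetical, and the real library uses far richer rules and refiners.

```python
import re

# Hypothetical, tiny abbreviation list; the real one is much larger.
ABBREVIATIONS = {"mr", "mrs", "dr", "inc", "sr", "sra"}
CANDIDATE = re.compile(r"[.!?]+")
LAST_WORD = re.compile(r"(\w+)[.!?]+$")


def find_candidates(text):
    """Pass one: every run of terminators is a possible boundary."""
    return [m.end() for m in CANDIDATE.finditer(text)]


def refine(text, candidates):
    """Pass two: drop candidates whose period closes a known abbreviation."""
    kept = []
    for end in candidates:
        match = LAST_WORD.search(text[:end])
        word = match.group(1).lower() if match else ""
        if text[end - 1] == "." and word in ABBREVIATIONS:
            continue  # false positive: abbreviation marker, not a boundary
        kept.append(end)
    return kept


def detect_sketch(text):
    return refine(text, find_candidates(text))


offsets = detect_sketch("Dr. Smith works at Acme Inc. in Boston. He likes it!")
print(offsets)
# [39, 52]
```

Pass one proposes four boundaries (after `Dr.`, `Inc.`, `Boston.`, and `it!`); pass two keeps only the last two.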
113
+
114
+ ---
115
+
116
+ ## 📦 Installation
117
+
118
+ Ready to do some cybernetic boundary shearing? Let's get you set up quickly and painlessly.
119
+
120
+ ### The Quick & Easy Way
121
+
122
+ The simplest way to get started is with pip:
123
+
124
+ ```bash
125
+ pip install yasbd-lib
126
+ ```
127
+
128
+ > [!TIP]
129
+ > **Termux (Android)**
130
+ >
131
+ > No Rust toolchain? Install pydantic-core pre-built wheels first, then retry:
132
+ >
133
+ > ```bash
134
+ > pip install typing-extensions
135
+ > pip install pydantic-core --index-url https://termux-user-repository.github.io/pypi/
136
+ > pip install "pydantic>=2.12.4,<2.13"
137
+ > ```
138
+
139
+ That's it! Blade is armed.
140
+
141
+ ### The From-Source Way
142
+
143
+ Prefer building from source? Clone and install manually for full control:
144
+
145
+ ```bash
146
+ git clone https://github.com/speedyk-005/yasbd-lib.git
147
+ cd yasbd-lib
148
+ pip install .
149
+ ```
150
+
151
+ (But honestly, the pip way is way easier.)
152
+
153
+ ### Want to Help Make yasbd Even Better?
154
+
155
+ That's awesome. See [**Contributing Guide**](https://github.com/speedyk-005/yasbd-lib/blob/main/CONTRIBUTING.md).
156
+
157
+
158
+ ---
159
+
160
+ ## 📟 Usage
161
+
162
+ > [!TIP]
163
+ > Looking for the pysbd drop-in replacement? Jump straight to the [Adapter](#adapter) section.
164
+
165
+ ### Initialization
166
+
167
+ ```python
168
+ from yasbd.boundary_detector import BoundaryDetector
169
+ # Or from yasbd import BoundaryDetector
170
+
171
+ # Basic setup
172
+ detector = BoundaryDetector(lang="en")
173
+
174
+ # With all options (so far.)
175
+ detector = BoundaryDetector(
176
+ # ISO 639 code (e.g., en, fr, es, ...). Defaults to `en`.
177
+ # https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes
178
+ lang="fr",
179
+
180
+ # Don't split inside quotes and parentheses. (Block quotes are not protected.) Defaults to `True`.
181
+ # https://en.wikipedia.org/wiki/Block_quotation
182
+ preserve_quote_and_paren=True,
183
+
184
+ # Enable verbose logging. Defaults to `False`.
185
+ verbose=True,
186
+ )
187
+ ```
188
+
189
+ Switching languages at runtime is a single property assignment:
190
+
191
+ ```python
192
+ detector.lang = "es"
193
+ ```
194
+
195
+ The rule module loads lazily on first access. Switching mid-stream reimports the module and rebinds the pattern cache. Zero config, no restarts needed.
196
+
197
+ ### Boundary detection
198
+
199
+ [`detect()`](https://github.com/speedyk-005/yasbd-lib/blob/main/API_REFERENCES.md#yasbd-boundary_detector-BoundaryDetector-detect) tells you where each sentence stops. Integer offsets into the original string. No copies, no slicing, no bookkeeping. Feed them to whatever downstream logic you already have.
200
+
201
+ Two detection modes:
202
+
203
+ - **absolute**: (default) offsets count from the start of the entire input stream.
204
+ - **relative**: offsets reset at each paragraph boundary. A `ParagraphEOF` sentinel signals the gap between paragraphs.
205
+
206
+ ```python
207
+ # absolute mode (default)
208
+ res = list(detector.detect('She turned to him, "This is great." She held the book out to show him.'))
209
+ print(res)
210
+ # [35, 70]
211
+
212
+ # relative mode with paragraph break
213
+ detector.lang = "es"
214
+ res = list(detector.detect(
215
+ "El Sr. García llegó ayer. La Sra. López también.\n\nVéase la pág. 55 del libro.",
216
+ relative=True,
217
+ ))
218
+ print(res)
219
+ # [25, 48, ParagraphEOF, 27]
220
+ ```
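
If you do want to slice by hand, absolute offsets map onto string slices directly. The sketch below hard-codes the `[35, 70]` result shown above instead of calling the library, and `slices_from_offsets` is a hypothetical helper name, not part of yasbd.

```python
def slices_from_offsets(text, offsets):
    """Turn absolute end offsets into the substrings they delimit."""
    start = 0
    for end in offsets:
        yield text[start:end]
        start = end


text = 'She turned to him, "This is great." She held the book out to show him.'
offsets = [35, 70]  # copied from the absolute-mode example above

sentences = list(slices_from_offsets(text, offsets))
print(sentences)
# ['She turned to him, "This is great."', ' She held the book out to show him.']
```

The leading space stays attached to the second slice because nothing is stripped at this level.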
221
+
222
+ ### Segmentation
223
+
224
+ If you do not want to manage boundary offsets yourself (and who would?), [`segment()`](https://github.com/speedyk-005/yasbd-lib/blob/main/API_REFERENCES.md#yasbd-boundary_detector-BoundaryDetector-segment) wraps `detect()` with string slicing. It yields sentences as strings, one at a time. By default it strips leading and trailing whitespace and drops empty results. Set `preserve_whitespace=True` to keep original spacing around boundaries.
225
+
226
+ ```python
227
+ detector.lang = "en"
228
+
229
+ # Basic sentence splitting
230
+ res = list(detector.segment("Hello world. How are you? I am fine."))
231
+ print(res)
232
+ # ['Hello world.', 'How are you?', 'I am fine.']
233
+
234
+ # Multi-paragraph with whitespace preserved
235
+ res = list(detector.segment(
236
+ "First para.\nStill first.\n\nSecond para.\nFinished.",
237
+ preserve_whitespace=True,
238
+ ))
239
+ print(res)
240
+ # ['First para.', '\nStill first.', '\n\n', 'Second para.', '\nFinished.']
241
+ ```
242
+
243
+ > [!TIP]
244
+ > **Inputs & streaming** — `detect()` and `segment()` accept plain strings, open file streams (`TextIOBase`), or a `StreamCleaner`. Both are generators: they yield results lazily without loading the entire source into memory. Internally, the text is split on blank lines into paragraphs, and each paragraph is processed independently with offset tracking between them.
245
+
246
+ > [!TIP]
247
+ > **ParagraphStream** — yasbd uses [`ParagraphStream`](https://github.com/speedyk-005/yasbd-lib/blob/main/API_REFERENCES.md#yasbd-utils-paragraph_stream-ParagraphStream) internally to split text into paragraph blocks. You can import it directly if you need paragraph-level processing in your own code:
248
+ > ```python
249
+ > from yasbd.utils.paragraph_stream import ParagraphStream
250
+ >
251
+ > for para in ParagraphStream(text):
252
+ > print(para) # each paragraph block
253
+ > ```
254
+ > You can also skip empty lines with `skip_empty_lines=True`.
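
For intuition, lazy paragraph splitting with offset tracking might look roughly like the generator below. It is a standard-library stand-in written for this note, not the `ParagraphStream` source; the name `iter_paragraphs` and the `(start_offset, text)` tuple shape are assumptions.

```python
import io
from typing import Iterator, TextIO


def iter_paragraphs(stream: TextIO) -> Iterator[tuple[int, str]]:
    """Yield (start_offset, paragraph_text), reading one line at a time."""
    offset = 0  # characters consumed so far
    start = 0   # offset where the current paragraph began
    buf: list[str] = []
    for line in stream:
        if line.strip() == "":
            # A blank line closes the current paragraph, if any.
            if buf:
                yield start, "".join(buf)
                buf = []
        else:
            if not buf:
                start = offset
            buf.append(line)
        offset += len(line)
    if buf:
        yield start, "".join(buf)


source = io.StringIO("First para.\nStill first.\n\nSecond para.\nFinished.")
paragraphs = list(iter_paragraphs(source))
print(paragraphs)
# [(0, 'First para.\nStill first.\n'), (26, 'Second para.\nFinished.')]
```

Because only one paragraph is buffered at a time, the same loop works on an open file without reading it all into memory.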
255
+
256
+ ### Cleaner
257
+
258
+ OCR'd a PDF or scraping noisy text? [`StreamCleaner`](https://github.com/speedyk-005/yasbd-lib/blob/main/API_REFERENCES.md#yasbd-utils-cleaner-StreamCleaner) normalizes paragraphs before they hit the detector:
259
+
260
+ ```python
261
+ from yasbd.utils.cleaner import StreamCleaner
262
+
263
+ cleaner = StreamCleaner("Hello world. This is messy.")
264
+ list(cleaner)
265
+ # ['Hello world. This is messy.']
266
+ ```
267
+
268
+ It collapses multiple spaces, strips HTML tags, removes page numbers, re-joins hyphenated words split across lines, and more. Pass it directly to `detect()` or `segment()` instead of a string.
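
A few of those normalizations are easy to picture with plain `re` calls. The function below is a hedged approximation for illustration only: `clean_sketch` is a hypothetical name, and the actual `StreamCleaner` rules may differ in order and detail.

```python
import re


def clean_sketch(text):
    """Rough approximation of three cleaning steps, in a fixed order."""
    text = re.sub(r"<[^>]+>", "", text)           # strip HTML tags
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # re-join words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    return text.strip()


cleaned = clean_sketch("<p>Hello   world. This is a hyphen-\nated word.</p>")
print(cleaned)
# Hello world. This is a hyphenated word.
```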
269
+
270
+ ### Adapter
271
+
272
+ Migrating from pysbd? Swap the import and keep your pipeline:
273
+
274
+ ```python
275
+ # Before: from pysbd import Segmenter
276
+ from yasbd.utils.pysbd_adapter import Segmenter
277
+
278
+ seg = Segmenter(language="ja")
279
+ res = seg.segment('田中さんは「準備は完了しました」そう言って部屋を出た。U.S.A.の経済政策は非常に複雑です。')
280
+ print(res)
281
+ # ['田中さんは「準備は完了しました」そう言って部屋を出た。', 'U.S.A.の経済政策は非常に複雑です。']
282
+ ```
283
+
284
+ Same API surface. Same [`Segmenter`](https://github.com/speedyk-005/yasbd-lib/blob/main/API_REFERENCES.md#yasbd-utils-pysbd_adapter-Segmenter) class. Same [`segment()`](https://github.com/speedyk-005/yasbd-lib/blob/main/API_REFERENCES.md#yasbd-utils-pysbd_adapter-Segmenter-segment) method. Even the [`TextSpan`](https://github.com/speedyk-005/yasbd-lib/blob/main/API_REFERENCES.md#yasbd-utils-pysbd_adapter-TextSpan) class is there with `sent`, `start`, and `end` fields, hurray. It also places whitespace between sentences the way pysbd expects it (trailing on the previous sentence instead of leading on the next).
285
+
286
+ ---
287
+
288
+ ## 🗺 Features & Roadmap
289
+
290
+ - [x] Regex caching (compile once per language class)
291
+ - [x] Drop-in pysbd adapter (same API, no pipeline changes)
292
+ - [x] StreamCleaner for OCR'd and noisy text
293
+ - [ ] spaCy integration
294
+ - [ ] 22+ language targets
295
+ - [ ] CLI tool
296
+ - [ ] REST API for remote boundary detection
297
+
298
+ ---
299
+
300
+ ## 🏁 Benchmarks
301
+
302
+ Tested against 6 competitors (pysbd, sentencex, sentsplit, nupunkt, blingfire, sentence-splitter) across 5 languages and 7 edge cases: compound abbreviations, CJK quotes, newline wrapping, chat logs, URLs, and more.
303
+
304
+ **TL;DR:** yasbd ranked #1 in accuracy across almost every test, while staying competitive on speed as pure Python. blingfire is faster but brittle. pysbd and sentencex shred French abbreviations. nupunkt has an 11-second cold start.
305
+ Full results, terminal output, and a performance graph can be found in **[benchmarks/](https://github.com/speedyk-005/yasbd-lib/tree/main/benchmarks)**
306
+
307
+ ---
308
+
309
+ ## 📜 Last note
310
+
311
+ **yasbd** is maintained by [speedyk-005](https://github.com/speedyk-005). Licensed under [Mozilla Public License 2.0](https://github.com/speedyk-005/yasbd-lib/blob/main/LICENSE) — you can use it in proprietary software, but modifications to the source files must stay open under MPL 2.0. Contributions are welcome; see [CONTRIBUTING.md](https://github.com/speedyk-005/yasbd-lib/blob/main/CONTRIBUTING.md).