yasbd-union 0.1.2__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 speedyk-005
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,147 @@
1
+ Metadata-Version: 2.4
2
+ Name: yasbd-union
3
+ Version: 0.1.2
4
+ Summary: Experimental multilingual aggregate for yasbd-lib — best-effort sentence splitting over all installed language profiles.
5
+ Author-email: speedyk_005 <speedy40115719@gmail.com>
6
+ License-Expression: MIT
7
+ Project-URL: Homepage, https://github.com/speedyk-005/yasbd-union
8
+ Project-URL: Repository, https://github.com/speedyk-005/yasbd-union
9
+ Project-URL: Issues, https://github.com/speedyk-005/yasbd-union/issues
10
+ Keywords: sentence-segmentation,sentence-boundary-detection,sbd,multilingual,experimental,aggregate,language-agnostic,lang-pack,text-processing
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Intended Audience :: Science/Research
14
+ Classifier: Operating System :: OS Independent
15
+ Classifier: Programming Language :: Python :: 3.11
16
+ Classifier: Programming Language :: Python :: 3.12
17
+ Classifier: Programming Language :: Python :: 3.13
18
+ Classifier: Programming Language :: Python :: 3.14
19
+ Classifier: Topic :: Text Processing
20
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
21
+ Requires-Python: >=3.11
22
+ Description-Content-Type: text/markdown
23
+ License-File: LICENSE
24
+ Requires-Dist: yasbd-lib>=0.8.0
25
+ Provides-Extra: dev
26
+ Requires-Dist: pytest>=8.3.5; extra == "dev"
27
+ Requires-Dist: pytest-cov>=6.2.1; extra == "dev"
28
+ Dynamic: license-file
29
+
30
+ # yasbd-union
31
+
32
+ [![Python Version](https://img.shields.io/badge/Python-3.11%20--%203.14-blue)](https://www.python.org/downloads/)
33
+ [![PyPI](https://img.shields.io/pypi/v/yasbd-union?kill_cache=1)](https://pypi.org/project/yasbd-union)
34
+ [![Tests](https://img.shields.io/github/actions/workflow/status/speedyk-005/yasbd-union/build-and-test.yml?branch=main&label=tests)](https://github.com/speedyk-005/yasbd-union/actions)
35
+ [![Stability](https://img.shields.io/badge/stability-alpha-red)](https://github.com/speedyk-005/yasbd-union)
36
+ [![License: MIT](https://img.shields.io/badge/License-MIT-brightgreen.svg)](https://opensource.org/licenses/MIT)
37
+
38
+ **Sentence splitting for when you genuinely have no idea what language you're looking at.**
39
+
40
+ ---
41
+
42
+ ## What this is
43
+
44
+ `yasbd-union` is an experimental add-on for [yasbd-lib](https://github.com/speedyk-005/yasbd).
45
+
46
+ It takes sentence-splitting rules from every installed language pack and throws them into one shared space.
47
+
48
+ Sometimes it behaves nicely.
49
+ Sometimes it makes bold assumptions.
50
+ Sometimes it surprises even you.
51
+
52
+ That's basically the whole deal.
53
+
54
+ ---
55
+
56
+ ## `auto` vs `xx`
57
+
58
+ **`auto`** tries to be smart about it.
59
+
60
+ It looks at your text, decides what language it is, and uses the right rules for the job. Clean and structured.
61
+
62
+ **`xx`** doesn't bother with that step.
63
+
64
+ It assumes your text is already a mix of everything and just applies all available rules at once.
65
+
66
+ | | `auto` | `xx` |
67
+ |--------------------|---------------------|----------------------------------------|
68
+ | Language handling | Detects first | Doesn't care |
69
+ | Accuracy | Stable | Depends on what rules are installed |
70
+ | Mixed text | Not ideal | Basically its natural habitat |
71
+ | False splits       | Rare                | Happen sometimes                       |
72
+ | Personality | Careful | A bit chaotic, but trying its best |
73
+ | Best for | Clean text | Mixed-language messes |
74
+
75
+ ---
76
+
77
+ ## Install
78
+
79
+ ```bash
80
+ pip install yasbd-union
81
+ ```
82
+
83
+ Then register it:
84
+
85
+ ```python
86
+ from yasbd.rules import register_lang_packs
87
+ from yasbd import BoundaryDetector
88
+
89
+ register_lang_packs(["yasbd_union"])
90
+
91
+ detector = BoundaryDetector("xx")
92
+ ```
93
+
94
+ ---
95
+
96
+ ## Example
97
+ ```python
98
+ sentences = list(detector.segment(
99
+ "Dr. Wang said 你好世界。Prof. Li replied 是的。"
100
+ ))
101
+
102
+ print(sentences)
103
+ ```
104
+ Output:
105
+ ```text
106
+ ['Dr. Wang said 你好世界。', 'Prof. Li replied 是的。']
107
+ ```
108
+
109
+ ---
110
+
111
+ ## When to use `xx`
112
+
113
+ Use it when:
114
+
115
+ - You don't know what language your text is in
116
+ - Your input is messy, mixed, or unpredictable
117
+ - You're dealing with logs, chats, or scraped text
118
+ - You just want something that "tries its best"
119
+
120
+ ---
121
+
122
+ ## When not to use `xx`
123
+
124
+ Avoid it when:
125
+
126
+ - You need strict, repeatable results
127
+ - Your text is single-language
128
+ - You don't want surprises in sentence boundaries
129
+ - You're trying to explain results to someone very literal
130
+
131
+ In those cases, `auto` or a specific language pack will behave better.
132
+
133
+ ---
134
+
135
+ ## A few honest notes
136
+
137
+ - Some sentence splits will be slightly unexpected
138
+ - Results can change depending on installed language packs
139
+ - It is not fully predictable by design
140
+
141
+ If that sounds like a problem, `xx` is probably not what you want.
142
+
143
+ ---
144
+
145
+ ## License
146
+
147
+ **MIT:** If it breaks, it's still yours.
@@ -0,0 +1,118 @@
1
+ # yasbd-union
2
+
3
+ [![Python Version](https://img.shields.io/badge/Python-3.11%20--%203.14-blue)](https://www.python.org/downloads/)
4
+ [![PyPI](https://img.shields.io/pypi/v/yasbd-union?kill_cache=1)](https://pypi.org/project/yasbd-union)
5
+ [![Tests](https://img.shields.io/github/actions/workflow/status/speedyk-005/yasbd-union/build-and-test.yml?branch=main&label=tests)](https://github.com/speedyk-005/yasbd-union/actions)
6
+ [![Stability](https://img.shields.io/badge/stability-alpha-red)](https://github.com/speedyk-005/yasbd-union)
7
+ [![License: MIT](https://img.shields.io/badge/License-MIT-brightgreen.svg)](https://opensource.org/licenses/MIT)
8
+
9
+ **Sentence splitting for when you genuinely have no idea what language you're looking at.**
10
+
11
+ ---
12
+
13
+ ## What this is
14
+
15
+ `yasbd-union` is an experimental add-on for [yasbd-lib](https://github.com/speedyk-005/yasbd).
16
+
17
+ It takes sentence-splitting rules from every installed language pack and throws them into one shared space.
18
+
19
+ Sometimes it behaves nicely.
20
+ Sometimes it makes bold assumptions.
21
+ Sometimes it surprises even you.
22
+
23
+ That's basically the whole deal.
24
+
25
+ ---
26
+
27
+ ## `auto` vs `xx`
28
+
29
+ **`auto`** tries to be smart about it.
30
+
31
+ It looks at your text, decides what language it is, and uses the right rules for the job. Clean and structured.
32
+
33
+ **`xx`** doesn't bother with that step.
34
+
35
+ It assumes your text is already a mix of everything and just applies all available rules at once.
36
+
37
+ | | `auto` | `xx` |
38
+ |--------------------|---------------------|----------------------------------------|
39
+ | Language handling | Detects first | Doesn't care |
40
+ | Accuracy | Stable | Depends on what rules are installed |
41
+ | Mixed text | Not ideal | Basically its natural habitat |
42
+ | False splits       | Rare                | Happen sometimes                       |
43
+ | Personality | Careful | A bit chaotic, but trying its best |
44
+ | Best for | Clean text | Mixed-language messes |
45
+
46
+ ---
47
+
48
+ ## Install
49
+
50
+ ```bash
51
+ pip install yasbd-union
52
+ ```
53
+
54
+ Then register it:
55
+
56
+ ```python
57
+ from yasbd.rules import register_lang_packs
58
+ from yasbd import BoundaryDetector
59
+
60
+ register_lang_packs(["yasbd_union"])
61
+
62
+ detector = BoundaryDetector("xx")
63
+ ```
64
+
65
+ ---
66
+
67
+ ## Example
68
+ ```python
69
+ sentences = list(detector.segment(
70
+ "Dr. Wang said 你好世界。Prof. Li replied 是的。"
71
+ ))
72
+
73
+ print(sentences)
74
+ ```
75
+ Output:
76
+ ```text
77
+ ['Dr. Wang said 你好世界。', 'Prof. Li replied 是的。']
78
+ ```
79
+
80
+ ---
81
+
82
+ ## When to use `xx`
83
+
84
+ Use it when:
85
+
86
+ - You don't know what language your text is in
87
+ - Your input is messy, mixed, or unpredictable
88
+ - You're dealing with logs, chats, or scraped text
89
+ - You just want something that "tries its best"
90
+
91
+ ---
92
+
93
+ ## When not to use `xx`
94
+
95
+ Avoid it when:
96
+
97
+ - You need strict, repeatable results
98
+ - Your text is single-language
99
+ - You don't want surprises in sentence boundaries
100
+ - You're trying to explain results to someone very literal
101
+
102
+ In those cases, `auto` or a specific language pack will behave better.
103
+
104
+ ---
105
+
106
+ ## A few honest notes
107
+
108
+ - Some sentence splits will be slightly unexpected
109
+ - Results can change depending on installed language packs
110
+ - It is not fully predictable by design
111
+
112
+ If that sounds like a problem, `xx` is probably not what you want.
113
+
114
+ ---
115
+
116
+ ## License
117
+
118
+ **MIT:** If it breaks, it's still yours.
@@ -0,0 +1,67 @@
1
+ [project]
2
+ name = "yasbd-union"
3
+ version = "0.1.2"
4
+ description = "Experimental multilingual aggregate for yasbd-lib — best-effort sentence splitting over all installed language profiles."
5
+ authors = [
6
+ { name = "speedyk_005", email = "speedy40115719@gmail.com" }
7
+ ]
8
+
9
+ license = "MIT"
10
+ readme = "README.md"
11
+ requires-python = ">=3.11"
12
+
13
+ classifiers = [
14
+ "Development Status :: 3 - Alpha",
15
+ "Intended Audience :: Developers",
16
+ "Intended Audience :: Science/Research",
17
+ "Operating System :: OS Independent",
18
+ "Programming Language :: Python :: 3.11",
19
+ "Programming Language :: Python :: 3.12",
20
+ "Programming Language :: Python :: 3.13",
21
+ "Programming Language :: Python :: 3.14",
22
+ "Topic :: Text Processing",
23
+ "Topic :: Software Development :: Libraries :: Python Modules"
24
+ ]
25
+
26
+ keywords = [
27
+ "sentence-segmentation", "sentence-boundary-detection", "sbd",
28
+ "multilingual", "experimental", "aggregate", "language-agnostic", "lang-pack",
29
+ "text-processing"
30
+ ]
31
+
32
+ dependencies = [
33
+ "yasbd-lib>=0.8.0",
34
+ ]
35
+
36
+ [project.optional-dependencies]
37
+ dev = [
38
+ "pytest>=8.3.5",
39
+ "pytest-cov>=6.2.1",
40
+ ]
41
+
42
+ [project.urls]
43
+ Homepage = "https://github.com/speedyk-005/yasbd-union"
44
+ Repository = "https://github.com/speedyk-005/yasbd-union"
45
+ Issues = "https://github.com/speedyk-005/yasbd-union/issues"
46
+
47
+ [build-system]
48
+ requires = [
49
+ "setuptools>=64",
50
+ "wheel",
51
+ ]
52
+ build-backend = "setuptools.build_meta"
53
+
54
+ [tool.setuptools]
55
+ packages = [
56
+ "yasbd_union",
57
+ ]
58
+
59
+ [tool.pytest.ini_options]
60
+ minversion = "7.4"
61
+ addopts = "--cov=yasbd_union --cov-report=term-missing:skip-covered --durations=5"
62
+ testpaths = ["tests"]
63
+ python_files = "test_*.py"
64
+
65
+ [tool.coverage.run]
66
+ omit = []
67
+
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,50 @@
1
+ import pytest
2
+
3
+ from yasbd import BoundaryDetector, register_lang_packs
4
+
5
+ # Register self
6
+ register_lang_packs(["yasbd_union"])
7
+
8
+
9
+ TEST_DATA = [
10
+ # Basic sentences (Mixed language)
11
+ "Hello world!| How are you?| I'm fine.",
12
+ "ሰላም ሁሉም ሰው።| ደህና ነህ?| ደህና ነኝ።",
13
+ "你好世界。| 你好吗?| 我很好。",
14
+ "信じられない!| 本当にそうなの?| 早く教えてください。",
15
+ "เวลา 08.30 น. เริ่มเรียน เลิก 16.00 น.",
16
+ "我喜欢AI。|It is useful",
17
+ "U.S.A.的经济政策非常复杂。|下个月的动向值得关注。",
18
+ "那个会议在下午两点。|Please don't be late!",
19
+ # Abbreviations
20
+ "Prof. Smith und Dr. Schmidt arbeiten zusammen.| Das Projekt ist groß.",
21
+ "डॉ. सिंह और प्रो. वर्मा ने व्याख्यान दिया।| उन्होंने भौतिकी पढ़ाई।",
22
+ "D. José y Dña. María son los dueños.| El Cmdte. Rodríguez saludó al Cnel. Díaz.",
23
+ "Это, т.е. новый закон, вступает в силу.| Встреча назначена на пн. 15 января.",
24
+ "Η συνάντηση είναι τη Δευ. 15 Ιαν.| Γεννήθηκε στις 5 Μαρ. 1990.",
25
+ "Το μάθημα είναι Τετ. και Παρ. κάθε εβδομάδα.",
26
+ "The U.S. Army is recruiting.| Many join every year.",
27
+ "She lived in the U.S.A. for 20 years.| Now she lives in the E.U.",
28
+ "ዶ/ር ኃይሉ በሆስፒታሉ ውስጥ ናቸው።| ነገ ይመረመራሉ።",
29
+ "Das ist z. B. ein Beispiel.| Es funktioniert gut.",
30
+ # Quoted speech
31
+ "Er sagte: 'Ich bin müde.'| Dann ging er nach Hause.",
32
+ "„Das ist großartig!“ rief sie.",
33
+ "Léa dit : « Bonjour ! Je suis Léa. Et toi ? »",
34
+ "Elle s'est tournée vers lui, \"C'est magnifique.\" dit-elle.",
35
+ # Ellipsis
36
+ "Die Ergebnisse waren nicht eindeutig....| Wir haben es wiederholt.",
37
+ "Het project was bijna afgerond... of dat dachten we tenminste.",
38
+ ]
39
+
40
+
41
+ @pytest.mark.parametrize("test_case", TEST_DATA)
42
+ def test_segmentation(test_case):
43
+ """Test sentence segmentation for xx multilingual aggregate."""
44
+ expected = [sent.strip() for sent in test_case.split("|")]
45
+ input_text = test_case.replace("|", "")
46
+
47
+ seg = BoundaryDetector(lang="xx")
48
+ result = list(seg.segment(input_text))
49
+
50
+ assert result == expected, f"Input: {input_text}"
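
The test above encodes each expected boundary as a `|` inside the input string. A minimal sketch of that convention in isolation, with no dependency on yasbd (`split_case` is a hypothetical helper name, not part of the package):

```python
def split_case(test_case: str) -> tuple[str, list[str]]:
    """Turn a pipe-annotated string into (input_text, expected_sentences)."""
    # Each "|" marks where a sentence boundary is expected.
    expected = [part.strip() for part in test_case.split("|")]
    # The segmenter itself receives the text with the markers removed.
    input_text = test_case.replace("|", "")
    return input_text, expected


input_text, expected = split_case("Hello world!| How are you?| I'm fine.")
print(input_text)
print(expected)
```

Because the expected sentences are stripped, the convention appears to assume that the segmenter returns sentences without leading whitespace; that is an inference from the test, not a documented guarantee.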
@@ -0,0 +1,152 @@
1
+ from functools import cache
2
+ from importlib import import_module
3
+ from itertools import chain
4
+
5
+ import regex as re
6
+
7
+ from yasbd.rules import _LANG_PACK_REGISTRY, get_supported_langs
8
+ from yasbd.rules.base import Rules, build_abbr_pattern
9
+
10
+ # fmt: off
11
+ _RULE_SET_NAMES = {
12
+ "TERMINATORS",
13
+ "TITLE_ABBRVS",
14
+ "DOTTED_GEOPOL_ABBRVS",
15
+ "REFERENCE_ABBRVS",
16
+ "SECTION_MARKERS",
17
+ "INLINE_ONLY_ABBRVS",
18
+ "NAMES_WITH_EXCLAMATION",
19
+ "DATE_ABBRVS",
20
+ "COMMON_SENT_STARTERS",
21
+ "POST_QUOTATIVE_PARTICLES",
22
+ "REPORTING_WORDS",
23
+
24
+ # Specials
25
+ "DISCOURSE_FINAL_PARTICLES",
26
+ "STREET_ABBRVS",
27
+ "ORG_PROPER_NOUNS",
28
+ "DATE_WORDS",
29
+ }
30
+ # fmt: on
31
+
32
+
33
+ @cache
34
+ def _get_all_rules():
35
+     """Return the Rules subclasses of all supported languages, skipping "auto" and "xx"."""
36
+ rules = []
37
+ for lang in get_supported_langs():
38
+ if lang in {"auto", "xx"}:
39
+ continue
40
+
41
+ # Prioritize registered lang packs
42
+ if lang in _LANG_PACK_REGISTRY:
43
+ _, cls = _LANG_PACK_REGISTRY[lang]
44
+ else:
45
+ rule_mod = import_module(f"yasbd.rules.{lang}")
46
+ cls = getattr(rule_mod, f"{lang.capitalize()}Rules")
47
+
48
+ rules.append(cls)
49
+ return rules
50
+
51
+
52
+ def _get_rule_set(set_name):
53
+ """Union a named attribute from all Rules subclasses into a single set."""
54
+ rule_set = set()
55
+ for cls in _get_all_rules():
56
+ if found_set := getattr(cls, set_name, None):
57
+ rule_set.update(found_set)
58
+ return rule_set
59
+
60
+
61
+ # fmt: off
62
+ class XxRules(Rules):
63
+     """Aggregate "xx" profile built from the merged rule sets of every supported language."""
64
+
65
+ @classmethod
66
+ def _compile_regex_dynamically(cls):
67
+ """Aggregate all languages' rule sets, then compile regex from the merged data."""
68
+ for set_name in _RULE_SET_NAMES:
69
+ setattr(cls, set_name, _get_rule_set(set_name))
70
+ super()._compile_regex_dynamically()
71
+
72
+ cls.MID_SENTENCE_FINDER_LST.extend(
73
+ [
74
+ # Spaced three-dot ellipsis mid-thought (e.g., ". . . she didn't")
75
+ # Consecutive dots "..." or "...." still create sentence boundaries.
76
+ re.compile(r"(?<!\.)\.(?:\s\.){2}"),
77
+
78
+ # Ordinal numbers
79
+ # https://learngerman.dw.com/en/ordinal-numbers/l-57731450/gr-60885529
80
+ re.compile(r"\s\d{1,3}\."),
81
+
82
+ # Multi-part abbreviations with spaces (like "d. h.", "z. B.", "i. d. R.")
83
+ re.compile(r"\b[a-zA-Z]\.(?!\s+\w{2,})"),
84
+
85
+ # Number/Time abbreviations followed by a date token (e.g., 9 a.m. Monday)
86
+ re.compile(
87
+ rf"""
88
+ (?:\d\.|(?:(?<=\d)|\b)(?i:[ap]\.m\.))
89
+ (?=
90
+ \s+(?i:{build_abbr_pattern(cls.DATE_ABBRVS | cls.DATE_WORDS)})
91
+ (?:\.|\s|$)
92
+ )
93
+ """, re.X,
94
+ ),
95
+
96
+                 # Geopolitical abbreviation followed by a common organization noun (e.g., U.S.A. Army)
97
+ re.compile(
98
+ rf"""
99
+ \b(?i:{cls.DOTTED_GEOPOL_ABBRVS_PATTERN})\.
100
+ (?=\s+(?:{build_abbr_pattern(cls.ORG_PROPER_NOUNS)}))
101
+ """, re.X,
102
+ ),
103
+
104
+ # Full-width geopolitical abbreviations
105
+ re.compile(r"(?:[\uFF21-\uFF3A\uFF41-\uFF5A\uFF10-\uFF19].){1,5}"),
106
+
107
+ # Time abbreviations followed by a date token (e.g., 9 a.m. Monday)
108
+ re.compile(
109
+ rf"""
110
+ (?:(?<=\d)|\b)(?i:[ap]\.m\.)
111
+ (?=
112
+ \s+(?i:{build_abbr_pattern(cls.DATE_ABBRVS | cls.DATE_WORDS)})
113
+ (?:\.|\s|$)
114
+ )
115
+ """, re.X,
116
+ ),
117
+
118
+ # Ud./Vd. pronoun abbreviation not followed by a proper name
119
+ re.compile(
120
+ rf"""
121
+ \b(?i:{build_abbr_pattern({"ud", "uds", "vd", "vds"})})\.
122
+ (?!\s+(?:{cls.COMMON_STARTERS_PATTERN})\b)
123
+ """, re.X,
124
+ ),
125
+ ]
126
+ )
127
+
128
+         # Street abbreviation followed by a common sentence starter
129
+ cls.ENDING_STREET_ABBRVS_FINDER = re.compile(
130
+ rf"""
131
+ (?:\b(?i:{build_abbr_pattern(cls.STREET_ABBRVS)})\.)
132
+ (?=\s+(?:{cls.COMMON_STARTERS_PATTERN})\b)
133
+ """, re.X,
134
+ )
135
+
136
+ # Discourse final particles that should not end a sentence (Thai, Burmese, etc.)
137
+ cls.FINAL_PARTICLES_FINDER = re.compile(
138
+ rf"{build_abbr_pattern(cls.DISCOURSE_FINAL_PARTICLES)}(?![\s]*[.?!;:๚๛])"
139
+ )
140
+
141
+ # fmt: on
142
+ def _post_process_boundaries(self, main_boundaries: set[int], text: str) -> None:
143
+ main_boundaries.update(
144
+ m.end()
145
+ for m in chain(
146
+ self.FINAL_PARTICLES_FINDER.finditer(text),
147
+ self.ENDING_STREET_ABBRVS_FINDER.finditer(text),
148
+ )
149
+ )
150
+
151
+
152
+ PROFILES = [XxRules]
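
The aggregation in `_get_rule_set` is a plain set union over class attributes. A minimal sketch of the idea, runnable without yasbd; `EnRules`, `ZhRules`, and `union_rule_set` are hypothetical stand-ins, and the sample entries are illustrative rather than taken from any real language pack:

```python
class EnRules:
    """Hypothetical stand-in for an English rules class."""
    TERMINATORS = {".", "!", "?"}
    TITLE_ABBRVS = {"dr", "prof", "mr"}


class ZhRules:
    """Hypothetical stand-in for a Chinese rules class (no title abbreviations)."""
    TERMINATORS = {"。", "！", "？"}


def union_rule_set(classes, set_name):
    """Merge the attribute `set_name` from every class that defines it."""
    merged = set()
    for cls in classes:
        # Classes lacking the attribute (or holding an empty set) are skipped.
        if found := getattr(cls, set_name, None):
            merged.update(found)
    return merged


terminators = union_rule_set([EnRules, ZhRules], "TERMINATORS")
titles = union_rule_set([EnRules, ZhRules], "TITLE_ABBRVS")
print(sorted(terminators))
print(sorted(titles))
```

Under this scheme every additional rules class can only add entries to a merged set, which is consistent with the README's note that results depend on which language packs are installed.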
@@ -0,0 +1,10 @@
1
+ LICENSE
2
+ README.md
3
+ pyproject.toml
4
+ tests/test_segmentation.py
5
+ yasbd_union/__init__.py
6
+ yasbd_union.egg-info/PKG-INFO
7
+ yasbd_union.egg-info/SOURCES.txt
8
+ yasbd_union.egg-info/dependency_links.txt
9
+ yasbd_union.egg-info/requires.txt
10
+ yasbd_union.egg-info/top_level.txt
@@ -0,0 +1,5 @@
1
+ yasbd-lib>=0.8.0
2
+
3
+ [dev]
4
+ pytest>=8.3.5
5
+ pytest-cov>=6.2.1
@@ -0,0 +1 @@
1
+ yasbd_union