yasbd-union 0.1.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- yasbd_union-0.1.2/LICENSE +21 -0
- yasbd_union-0.1.2/PKG-INFO +147 -0
- yasbd_union-0.1.2/README.md +118 -0
- yasbd_union-0.1.2/pyproject.toml +67 -0
- yasbd_union-0.1.2/setup.cfg +4 -0
- yasbd_union-0.1.2/tests/test_segmentation.py +50 -0
- yasbd_union-0.1.2/yasbd_union/__init__.py +152 -0
- yasbd_union-0.1.2/yasbd_union.egg-info/PKG-INFO +147 -0
- yasbd_union-0.1.2/yasbd_union.egg-info/SOURCES.txt +10 -0
- yasbd_union-0.1.2/yasbd_union.egg-info/dependency_links.txt +1 -0
- yasbd_union-0.1.2/yasbd_union.egg-info/requires.txt +5 -0
- yasbd_union-0.1.2/yasbd_union.egg-info/top_level.txt +1 -0
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 speedyk-005
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,147 @@
+Metadata-Version: 2.4
+Name: yasbd-union
+Version: 0.1.2
+Summary: Experimental multilingual aggregate for yasbd-lib — best-effort sentence splitting over all installed language profiles.
+Author-email: speedyk_005 <speedy40115719@gmail.com>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/speedyk-005/yasbd-union
+Project-URL: Repository, https://github.com/speedyk-005/yasbd-union
+Project-URL: Issues, https://github.com/speedyk-005/yasbd-union/issues
+Keywords: sentence-segmentation,sentence-boundary-detection,sbd,multilingual,experimental,aggregate,language-agnostic,lang-pack,text-processing
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Topic :: Text Processing
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: yasbd-lib>=0.8.0
+Provides-Extra: dev
+Requires-Dist: pytest>=8.3.5; extra == "dev"
+Requires-Dist: pytest-cov>=6.2.1; extra == "dev"
+Dynamic: license-file
+
+# yasbd-union
+
+[](https://www.python.org/downloads/)
+[](https://pypi.org/project/yasbd-union)
+[](https://github.com/speedyk-005/yasbd-union/actions)
+[](https://github.com/speedyk-005/yasbd-union)
+[](https://opensource.org/licenses/MIT)
+
+**Sentence splitting for when you genuinely have no idea what language you're looking at.**
+
+---
+
+## What this is
+
+`yasbd-union` is an experimental add-on for [yasbd-lib](https://github.com/speedyk-005/yasbd).
+
+It takes sentence-splitting rules from every installed language pack and throws them into one shared space.
+
+Sometimes it behaves nicely.
+Sometimes it makes bold assumptions.
+Sometimes it surprises even you.
+
+That's basically the whole deal.
+
+---
+
+## `auto` vs `xx`
+
+**`auto`** tries to be smart about it.
+
+It looks at your text, decides what language it is, and uses the right rules for the job. Clean and structured.
+
+**`xx`** doesn't bother with that step.
+
+It assumes your text is already a mix of everything and just applies all available rules at once.
+
+| | `auto` | `xx` |
+|--------------------|---------------------|----------------------------------------|
+| Language handling | Detects first | Doesn't care |
+| Accuracy | Stable | Depends on what rules are installed |
+| Mixed text | Not ideal | Basically its natural habitat |
+| False splits | Rare | Happens sometimes |
+| Personality | Careful | A bit chaotic, but trying its best |
+| Best for | Clean text | Mixed-language messes |
+
+---
+
+## Install
+
+```bash
+pip install yasbd-union
+```
+
+Then register it:
+
+```python
+from yasbd.rules import register_lang_packs
+from yasbd import BoundaryDetector
+
+register_lang_packs(["yasbd_union"])
+
+detector = BoundaryDetector("xx")
+```
+
+---
+
+Example
+```python
+sentences = list(detector.segment(
+    "Dr. Wang said 你好世界。Prof. Li replied 是的。"
+))
+
+print(sentences)
+```
+Output:
+```bash
+["Dr. Wang said 你好世界。", "Prof. Li replied 是的。"]
+```
+
+---
+
+## When to use xx
+
+Use it when:
+
+- You don't know what language your text is in
+- Your input is messy, mixed, or unpredictable
+- You're dealing with logs, chats, or scraped text
+- You just want something that "tries its best"
+
+---
+
+## When not to use xx
+
+Avoid it when:
+
+- You need strict, repeatable results
+- Your text is single-language
+- You don't want surprises in sentence boundaries
+- You're trying to explain results to someone very literal
+
+In those cases, auto or a specific language pack will behave better.
+
+---
+
+## A few honest notes
+
+- Some sentence splits will be slightly unexpected
+- Results can change depending on installed language packs
+- It is not fully predictable by design
+
+If that sounds like a problem, xx is probably not what you want.
+
+---
+
+## License
+
+**MIT:** If it breaks, it's still yours.
@@ -0,0 +1,118 @@
+# yasbd-union
+
+[](https://www.python.org/downloads/)
+[](https://pypi.org/project/yasbd-union)
+[](https://github.com/speedyk-005/yasbd-union/actions)
+[](https://github.com/speedyk-005/yasbd-union)
+[](https://opensource.org/licenses/MIT)
+
+**Sentence splitting for when you genuinely have no idea what language you're looking at.**
+
+---
+
+## What this is
+
+`yasbd-union` is an experimental add-on for [yasbd-lib](https://github.com/speedyk-005/yasbd).
+
+It takes sentence-splitting rules from every installed language pack and throws them into one shared space.
+
+Sometimes it behaves nicely.
+Sometimes it makes bold assumptions.
+Sometimes it surprises even you.
+
+That's basically the whole deal.
+
+---
+
+## `auto` vs `xx`
+
+**`auto`** tries to be smart about it.
+
+It looks at your text, decides what language it is, and uses the right rules for the job. Clean and structured.
+
+**`xx`** doesn't bother with that step.
+
+It assumes your text is already a mix of everything and just applies all available rules at once.
+
+| | `auto` | `xx` |
+|--------------------|---------------------|----------------------------------------|
+| Language handling | Detects first | Doesn't care |
+| Accuracy | Stable | Depends on what rules are installed |
+| Mixed text | Not ideal | Basically its natural habitat |
+| False splits | Rare | Happens sometimes |
+| Personality | Careful | A bit chaotic, but trying its best |
+| Best for | Clean text | Mixed-language messes |
+
+---
+
+## Install
+
+```bash
+pip install yasbd-union
+```
+
+Then register it:
+
+```python
+from yasbd.rules import register_lang_packs
+from yasbd import BoundaryDetector
+
+register_lang_packs(["yasbd_union"])
+
+detector = BoundaryDetector("xx")
+```
+
+---
+
+Example
+```python
+sentences = list(detector.segment(
+    "Dr. Wang said 你好世界。Prof. Li replied 是的。"
+))
+
+print(sentences)
+```
+Output:
+```bash
+["Dr. Wang said 你好世界。", "Prof. Li replied 是的。"]
+```
+
+---
+
+## When to use xx
+
+Use it when:
+
+- You don't know what language your text is in
+- Your input is messy, mixed, or unpredictable
+- You're dealing with logs, chats, or scraped text
+- You just want something that "tries its best"
+
+---
+
+## When not to use xx
+
+Avoid it when:
+
+- You need strict, repeatable results
+- Your text is single-language
+- You don't want surprises in sentence boundaries
+- You're trying to explain results to someone very literal
+
+In those cases, auto or a specific language pack will behave better.
+
+---
+
+## A few honest notes
+
+- Some sentence splits will be slightly unexpected
+- Results can change depending on installed language packs
+- It is not fully predictable by design
+
+If that sounds like a problem, xx is probably not what you want.
+
+---
+
+## License
+
+**MIT:** If it breaks, it's still yours.
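The README's claim that rules from every installed language pack are thrown "into one shared space" can be pictured as a plain set union over same-named attributes. The sketch below is a minimal, standalone illustration of that idea only. The classes `EnRules` and `ZhRules`, their contents, and the helper `union_rule_set` are hypothetical stand-ins invented for this example; they are not yasbd-lib's real profiles or API.

```python
# Hypothetical stand-ins for language profiles (not the real yasbd-lib classes).
class EnRules:
    TITLE_ABBRVS = {"dr", "prof", "mr"}
    TERMINATORS = {".", "!", "?"}


class ZhRules:
    TITLE_ABBRVS = set()  # empty sets contribute nothing to the union
    TERMINATORS = {"。", "！", "？"}


def union_rule_set(profiles, set_name):
    """Merge one named set attribute across every profile class."""
    merged = set()
    for profile in profiles:
        # Profiles that lack the attribute (or define it empty) are skipped.
        found = getattr(profile, set_name, None)
        if found:
            merged.update(found)
    return merged


terminators = union_rule_set([EnRules, ZhRules], "TERMINATORS")
titles = union_rule_set([EnRules, ZhRules], "TITLE_ABBRVS")
print(sorted(terminators))
print(sorted(titles))
```

Because the merged set depends on which profiles are passed in, adding or removing a profile changes the result, which is consistent with the README's note that results can change depending on installed language packs.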
@@ -0,0 +1,67 @@
+[project]
+name = "yasbd-union"
+version = "0.1.2"
+description = "Experimental multilingual aggregate for yasbd-lib — best-effort sentence splitting over all installed language profiles."
+authors = [
+    { name = "speedyk_005", email = "speedy40115719@gmail.com" }
+]
+
+license = "MIT"
+readme = "README.md"
+requires-python = ">=3.11"
+
+classifiers = [
+    "Development Status :: 3 - Alpha",
+    "Intended Audience :: Developers",
+    "Intended Audience :: Science/Research",
+    "Operating System :: OS Independent",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Programming Language :: Python :: 3.14",
+    "Topic :: Text Processing",
+    "Topic :: Software Development :: Libraries :: Python Modules"
+]
+
+keywords = [
+    "sentence-segmentation", "sentence-boundary-detection", "sbd",
+    "multilingual", "experimental", "aggregate", "language-agnostic", "lang-pack",
+    "text-processing"
+]
+
+dependencies = [
+    "yasbd-lib>=0.8.0",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.3.5",
+    "pytest-cov>=6.2.1",
+]
+
+[project.urls]
+Homepage = "https://github.com/speedyk-005/yasbd-union"
+Repository = "https://github.com/speedyk-005/yasbd-union"
+Issues = "https://github.com/speedyk-005/yasbd-union/issues"
+
+[build-system]
+requires = [
+    "setuptools>=64",
+    "wheel",
+]
+build-backend = "setuptools.build_meta"
+
+[tool.setuptools]
+packages = [
+    "yasbd_union",
+]
+
+[tool.pytest.ini_options]
+minversion = "7.4"
+addopts = "--cov=yasbd_union --cov-report=term-missing:skip-covered --durations=5"
+testpaths = ["tests"]
+python_files = "test_*.py"
+
+[tool.coverage.run]
+omit = []
+
@@ -0,0 +1,50 @@
+import pytest
+
+from yasbd import BoundaryDetector, register_lang_packs
+
+# Register self
+register_lang_packs(["yasbd_union"])
+
+
+TEST_DATA = [
+    # Basic sentences (Mixed language)
+    "Hello world!| How are you?| I'm fine.",
+    "ሰላም ሁሉም ሰው።| ደህና ነህ?| ደህና ነኝ።",
+    "你好世界。| 你好吗?| 我很好。",
+    "信じられない!| 本当にそうなの?| 早く教えてください。",
+    "เวลา 08.30 น. เริ่มเรียน เลิก 16.00 น.",
+    "我喜欢AI。|It is useful",
+    "U.S.A.的经济政策非常复杂。|下个月的动向值得关注。",
+    "那个会议在下午两点。|Please don't be late!",
+    # Abbreviations
+    "Prof. Smith und Dr. Schmidt arbeiten zusammen.| Das Projekt ist groß.",
+    "डॉ. सिंह और प्रो. वर्मा ने व्याख्यान दिया।| उन्होंने भौतिकी पढ़ाई।",
+    "D. José y Dña. María son los dueños.| El Cmdte. Rodríguez saludó al Cnel. Díaz.",
+    "Это, т.е. новый закон, вступает в силу.| Встреча назначена на пн. 15 января.",
+    "Η συνάντηση είναι τη Δευ. 15 Ιαν.| Γεννήθηκε στις 5 Μαρ. 1990.",
+    "Το μάθημα είναι Τετ. και Παρ. κάθε εβδομάδα.",
+    "The U.S. Army is recruiting.| Many join every year.",
+    "She lived in the U.S.A. for 20 years.| Now she lives in the E.U.",
+    "ዶ/ር ኃይሉ በሆስፒታሉ ውስጥ ናቸው።| ነገ ይመረመራሉ።",
+    "Das ist z. B. ein Beispiel.| Es funktioniert gut.",
+    # Quoted speech
+    "Er sagte: 'Ich bin müde.'| Dann ging er nach Hause.",
+    "„Das ist großartig!“ rief sie.",
+    "Léa dit : « Bonjour ! Je suis Léa. Et toi ? »",
+    "Elle s'est tournée vers lui, \"C'est magnifique.\" dit-elle.",
+    # Ellipsis
+    "Die Ergebnisse waren nicht eindeutig....| Wir haben es wiederholt.",
+    "Het project was bijna afgerond... of dat dachten we tenminste.",
+]
+
+
+@pytest.mark.parametrize("test_case", TEST_DATA)
+def test_segmentation(subtests, test_case):
+    """Test sentence segmentation for xx multilingual aggregate."""
+    expected = [sent.strip() for sent in test_case.split("|")]
+    input_text = test_case.replace("|", "")
+
+    seg = BoundaryDetector(lang="xx")
+    result = list(seg.segment(input_text))
+
+    assert result == expected, f"Input: {input_text}"
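The test data above packs the input and the expected segmentation into one string, using `|` to mark where a boundary should fall. As a small standalone sketch of that encoding (the helper name `split_case` is made up for illustration and does not exist in the package), the two derived values can be computed without yasbd installed:

```python
def split_case(test_case: str) -> tuple[str, list[str]]:
    """Decode a pipe-annotated test string into (input text, expected sentences)."""
    # Each "|" marks an expected boundary; surrounding whitespace is not
    # part of the expected sentence, so it is stripped.
    expected = [part.strip() for part in test_case.split("|")]
    # The segmenter sees the same text with the markers removed.
    input_text = test_case.replace("|", "")
    return input_text, expected


input_text, expected = split_case("Hello world!| How are you?| I'm fine.")
print(input_text)
print(expected)
```

A string with no `|` decodes to a single expected sentence, which is how the test data expresses "this text must not be split".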
@@ -0,0 +1,152 @@
+from functools import cache
+from importlib import import_module
+from itertools import chain
+
+import regex as re
+
+from yasbd.rules import _LANG_PACK_REGISTRY, get_supported_langs
+from yasbd.rules.base import Rules, build_abbr_pattern
+
+# fmt: off
+_RULE_SET_NAMES = {
+    "TERMINATORS",
+    "TITLE_ABBRVS",
+    "DOTTED_GEOPOL_ABBRVS",
+    "REFERENCE_ABBRVS",
+    "SECTION_MARKERS",
+    "INLINE_ONLY_ABBRVS",
+    "NAMES_WITH_EXCLAMATION",
+    "DATE_ABBRVS",
+    "COMMON_SENT_STARTERS",
+    "POST_QUOTATIVE_PARTICLES",
+    "REPORTING_WORDS",
+
+    # Specials
+    "DISCOURSE_FINAL_PARTICLES",
+    "STREET_ABBRVS",
+    "ORG_PROPER_NOUNS",
+    "DATE_WORDS",
+}
+# fmt: on
+
+
+@cache
+def _get_all_rules():
+    """Return a list of all Rules subclasses from supported languages"""
+    rules = []
+    for lang in get_supported_langs():
+        if lang in {"auto", "xx"}:
+            continue
+
+        # Prioritize registered lang packs
+        if lang in _LANG_PACK_REGISTRY:
+            _, cls = _LANG_PACK_REGISTRY[lang]
+        else:
+            rule_mod = import_module(f"yasbd.rules.{lang}")
+            cls = getattr(rule_mod, f"{lang.capitalize()}Rules")
+
+        rules.append(cls)
+    return rules
+
+
+def _get_rule_set(set_name):
+    """Union a named attribute from all Rules subclasses into a single set."""
+    rule_set = set()
+    for cls in _get_all_rules():
+        if found_set := getattr(cls, set_name, None):
+            rule_set.update(found_set)
+    return rule_set
+
+
+# fmt: off
+class XxRules(Rules):
+
+
+    @classmethod
+    def _compile_regex_dynamically(cls):
+        """Aggregate all languages' rule sets, then compile regex from the merged data."""
+        for set_name in _RULE_SET_NAMES:
+            setattr(cls, set_name, _get_rule_set(set_name))
+        super()._compile_regex_dynamically()
+
+        cls.MID_SENTENCE_FINDER_LST.extend(
+            [
+                # Spaced three-dot ellipsis mid-thought (e.g., ". . . she didn't")
+                # Consecutive dots "..." or "...." still create sentence boundaries.
+                re.compile(r"(?<!\.)\.(?:\s\.){2}"),
+
+                # Ordinal numbers
+                # https://learngerman.dw.com/en/ordinal-numbers/l-57731450/gr-60885529
+                re.compile(r"\s\d{1,3}\."),
+
+                # Multi-part abbreviations with spaces (like "d. h.", "z. B.", "i. d. R.")
+                re.compile(r"\b[a-zA-Z]\.(?!\s+\w{2,})"),
+
+                # Number/Time abbreviations followed by a date token (e.g., 9 a.m. Monday)
+                re.compile(
+                    rf"""
+                    (?:\d\.|(?:(?<=\d)|\b)(?i:[ap]\.m\.))
+                    (?=
+                        \s+(?i:{build_abbr_pattern(cls.DATE_ABBRVS | cls.DATE_WORDS)})
+                        (?:\.|\s|$)
+                    )
+                    """, re.X,
+                ),
+
+                # Geopolitical abbrv is followed by a common org noun (e.g., U.S.A Army)
+                re.compile(
+                    rf"""
+                    \b(?i:{cls.DOTTED_GEOPOL_ABBRVS_PATTERN})\.
+                    (?=\s+(?:{build_abbr_pattern(cls.ORG_PROPER_NOUNS)}))
+                    """, re.X,
+                ),
+
+                # Full-width geopolitical abbreviations
+                re.compile(r"(?:[\uFF21-\uFF3A\uFF41-\uFF5A\uFF10-\uFF19].){1,5}"),
+
+                # Time abbreviations followed by a date token (e.g., 9 a.m. Monday)
+                re.compile(
+                    rf"""
+                    (?:(?<=\d)|\b)(?i:[ap]\.m\.)
+                    (?=
+                        \s+(?i:{build_abbr_pattern(cls.DATE_ABBRVS | cls.DATE_WORDS)})
+                        (?:\.|\s|$)
+                    )
+                    """, re.X,
+                ),
+
+                # Ud./Vd. pronoun abbreviation not followed by a proper name
+                re.compile(
+                    rf"""
+                    \b(?i:{build_abbr_pattern({"ud", "uds", "vd", "vds"})})\.
+                    (?!\s+(?:{cls.COMMON_STARTERS_PATTERN})\b)
+                    """, re.X,
+                ),
+            ]
+        )
+
+        # Street abbrv followed by a common starters
+        cls.ENDING_STREET_ABBRVS_FINDER = re.compile(
+            rf"""
+            (?:\b(?i:{build_abbr_pattern(cls.STREET_ABBRVS)})\.)
+            (?=\s+(?:{cls.COMMON_STARTERS_PATTERN})\b)
+            """, re.X,
+        )
+
+        # Discourse final particles that should not end a sentence (Thai, Burmese, etc.)
+        cls.FINAL_PARTICLES_FINDER = re.compile(
+            rf"{build_abbr_pattern(cls.DISCOURSE_FINAL_PARTICLES)}(?![\s]*[.?!;:๚๛])"
+        )
+
+    # fmt: on
+    def _post_process_boundaries(self, main_boundaries: set[int], text: str) -> None:
+        main_boundaries.update(
+            m.end()
+            for m in chain(
+                self.FINAL_PARTICLES_FINDER.finditer(text),
+                self.ENDING_STREET_ABBRVS_FINDER.finditer(text),
+            )
+        )
+
+
+PROFILES = [XxRules]
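Two of the simpler patterns added to `MID_SENTENCE_FINDER_LST` can be tried in isolation. The sketch below copies the spaced-ellipsis and ordinal-number patterns verbatim but compiles them with the standard-library `re` module instead of the third-party `regex` module the package imports (both accept these two patterns). The helper `matched_spans` is hypothetical, and how yasbd-lib actually consumes the list is not shown in this diff; going by the list's name, the matches presumably mark dots that should not end a sentence.

```python
import re

# Patterns copied from the hunk above; compiled with stdlib `re` for this demo.
SPACED_ELLIPSIS = re.compile(r"(?<!\.)\.(?:\s\.){2}")
ORDINAL = re.compile(r"\s\d{1,3}\.")


def matched_spans(pattern: re.Pattern, text: str) -> list[str]:
    """Return every substring of `text` that the pattern matches."""
    return [m.group() for m in pattern.finditer(text)]


# A spaced ellipsis is matched as one unit ...
print(matched_spans(SPACED_ELLIPSIS, "Well . . . she didn't."))
# ... while consecutive dots are left alone, as the source comment says.
print(matched_spans(SPACED_ELLIPSIS, "Wait... what?"))
# A 1-3 digit number followed by a dot is matched; "2024." has too many digits.
print(matched_spans(ORDINAL, "Am 3. Oktober 2024."))
```

The ordinal pattern is deliberately broad: any one-to-three-digit number preceded by whitespace and followed by a dot matches, so a sentence that genuinely ends in such a number would match as well.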
@@ -0,0 +1,147 @@
+Metadata-Version: 2.4
+Name: yasbd-union
+Version: 0.1.2
+Summary: Experimental multilingual aggregate for yasbd-lib — best-effort sentence splitting over all installed language profiles.
+Author-email: speedyk_005 <speedy40115719@gmail.com>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/speedyk-005/yasbd-union
+Project-URL: Repository, https://github.com/speedyk-005/yasbd-union
+Project-URL: Issues, https://github.com/speedyk-005/yasbd-union/issues
+Keywords: sentence-segmentation,sentence-boundary-detection,sbd,multilingual,experimental,aggregate,language-agnostic,lang-pack,text-processing
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Topic :: Text Processing
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: yasbd-lib>=0.8.0
+Provides-Extra: dev
+Requires-Dist: pytest>=8.3.5; extra == "dev"
+Requires-Dist: pytest-cov>=6.2.1; extra == "dev"
+Dynamic: license-file
+
+# yasbd-union
+
+[](https://www.python.org/downloads/)
+[](https://pypi.org/project/yasbd-union)
+[](https://github.com/speedyk-005/yasbd-union/actions)
+[](https://github.com/speedyk-005/yasbd-union)
+[](https://opensource.org/licenses/MIT)
+
+**Sentence splitting for when you genuinely have no idea what language you're looking at.**
+
+---
+
+## What this is
+
+`yasbd-union` is an experimental add-on for [yasbd-lib](https://github.com/speedyk-005/yasbd).
+
+It takes sentence-splitting rules from every installed language pack and throws them into one shared space.
+
+Sometimes it behaves nicely.
+Sometimes it makes bold assumptions.
+Sometimes it surprises even you.
+
+That's basically the whole deal.
+
+---
+
+## `auto` vs `xx`
+
+**`auto`** tries to be smart about it.
+
+It looks at your text, decides what language it is, and uses the right rules for the job. Clean and structured.
+
+**`xx`** doesn't bother with that step.
+
+It assumes your text is already a mix of everything and just applies all available rules at once.
+
+| | `auto` | `xx` |
+|--------------------|---------------------|----------------------------------------|
+| Language handling | Detects first | Doesn't care |
+| Accuracy | Stable | Depends on what rules are installed |
+| Mixed text | Not ideal | Basically its natural habitat |
+| False splits | Rare | Happens sometimes |
+| Personality | Careful | A bit chaotic, but trying its best |
+| Best for | Clean text | Mixed-language messes |
+
+---
+
+## Install
+
+```bash
+pip install yasbd-union
+```
+
+Then register it:
+
+```python
+from yasbd.rules import register_lang_packs
+from yasbd import BoundaryDetector
+
+register_lang_packs(["yasbd_union"])
+
+detector = BoundaryDetector("xx")
+```
+
+---
+
+Example
+```python
+sentences = list(detector.segment(
+    "Dr. Wang said 你好世界。Prof. Li replied 是的。"
+))
+
+print(sentences)
+```
+Output:
+```bash
+["Dr. Wang said 你好世界。", "Prof. Li replied 是的。"]
+```
+
+---
+
+## When to use xx
+
+Use it when:
+
+- You don't know what language your text is in
+- Your input is messy, mixed, or unpredictable
+- You're dealing with logs, chats, or scraped text
+- You just want something that "tries its best"
+
+---
+
+## When not to use xx
+
+Avoid it when:
+
+- You need strict, repeatable results
+- Your text is single-language
+- You don't want surprises in sentence boundaries
+- You're trying to explain results to someone very literal
+
+In those cases, auto or a specific language pack will behave better.
+
+---
+
+## A few honest notes
+
+- Some sentence splits will be slightly unexpected
+- Results can change depending on installed language packs
+- It is not fully predictable by design
+
+If that sounds like a problem, xx is probably not what you want.
+
+---
+
+## License
+
+**MIT:** If it breaks, it's still yours.
@@ -0,0 +1,10 @@
+LICENSE
+README.md
+pyproject.toml
+tests/test_segmentation.py
+yasbd_union/__init__.py
+yasbd_union.egg-info/PKG-INFO
+yasbd_union.egg-info/SOURCES.txt
+yasbd_union.egg-info/dependency_links.txt
+yasbd_union.egg-info/requires.txt
+yasbd_union.egg-info/top_level.txt
@@ -0,0 +1 @@
+
@@ -0,0 +1 @@
+yasbd_union