russian-tts-normalization 1.0.0__tar.gz

@@ -0,0 +1,27 @@
+ MIT License
+
+ Copyright (c) 2026 Ilya Shigabeev
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
+ ---
+ The embedded abbreviation and unit inventories are informed by
+ NVIDIA NeMo-text-processing (Apache License 2.0,
+ https://github.com/NVIDIA/NeMo-text-processing); the spoken forms
+ were rewritten and checked by hand.
@@ -0,0 +1,120 @@
+ Metadata-Version: 2.4
+ Name: russian-tts-normalization
+ Version: 1.0.0
+ Summary: Russian text normalization for TTS: numbers, dates, currency, units, case agreement - one file, regex only, no dependencies
+ Author-email: Ilya Shigabeev <shigabeevilya@gmail.com>
+ License: MIT
+ Project-URL: Homepage, https://github.com/frappuccino/russian_tts_normalization
+ Keywords: tts,text-normalization,russian,speech-synthesis
+ Classifier: Development Status :: 5 - Production/Stable
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Natural Language :: Russian
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
+ Classifier: Topic :: Text Processing :: Linguistic
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Dynamic: license-file
+
+ # Russian text normalization for TTS
+ Normalizes Russian text for text-to-speech (TTS) systems.
+
+
+ Install: `pip install russian-tts-normalization`, or just copy `russian.py`
+ (a single self-contained file, no dependencies) into the `text` folder of your
+ TTS system. It can also be used as a command-line filter:
+ `echo "цена 1 500 руб." | python3 russian.py`.
+
+ ```
+ from russian import normalize_russian
+
+ complex_test_text = """У меня есть $1234 и 5678 рублей. Кроме того, я должен 90.50€ и взял в долг 4321 GBP.
+ В моем кошельке было 876 UAH и 543.21 RUB, а также я нашел 20 центов."""
+
+ normalized_text = normalize_russian(complex_test_text)
+ print(normalized_text)
+ ```
+
+ Prints:
+
+ ```
+ У меня есть тысяча двести тридцать четыре доллара и пять тысяч шестьсот семьдесят восемь рублей. Кроме того, я должен девяносто евро пятьдесят евроцентов и взял в долг четыре тысячи триста двадцать один фунт.\nВ моем кошельке было восемьсот семьдесят шесть гривен и пятьсот сорок три рубля двадцать один копейка, а также я нашел двадцать центов.
+ ```
+
+ # Implemented
+ 1. Cyrillization of Latin-script words, e.g. "apple" -> "эппл".
+ 2. Abbreviation expansion, e.g. "СССР" -> "эс эс эс эр".
+ 3. Conversion of numbers of any size
+ 4. Currency expansion
+ 5. Phone number expansion
+ 6. Dates: "1862 год", "12 февраля 2013", "05.08.2008" -> ordinal year/day reading
+ 7. Ordinals with a suffix ("1-й" -> "первый") and Roman numerals ("XIX" -> "девятнадцатого")
+ 8. Decimals: "1,2" -> "одна целая и две десятых"; percentages: "50%" -> "пятьдесят процентов"
+ 9. Fractions: "2/3" -> "две третьих"
+ 10. Clock times: "06:06" -> "шесть часов шесть минут"
+ 11. Digit strings with a leading zero: "06" -> "ноль шесть"
+ 12. Symbols / foreign letters by name: "&" -> "и", "²" -> "в квадрате", "°C", Greek
+ 13. Space/NBSP-grouped thousands: "1 234 567" -> one number; negatives: "-5" -> "минус пять"
+ 14. Quantity multipliers: "5 млн" -> "пять миллионов" (agrees with the number)
+ 15. Units of measure: "5 кг" -> "пять килограммов", "90 км/ч" -> "...в час", "5 ГБ", "25°"
+ 16. Textual abbreviations: "и т.д." -> "и так далее"
+ 17. Acronyms: vowel-less spelled out ("СССР" -> "эс эс эс эр"), pronounceable kept as-is ("НАТО")
+ 18. E-mail/URL spell-out: "example.com" -> "ексампле точка ком"
+ 19. Dotted units ("82 т." -> "восемьдесят две тонны") and decimal counts ("1,5 км" -> "...километра")
+ 20. Years with г./гг. and ranges: "2008 г." -> "две тысячи восьмой год", "1941—1945 гг.", "XIX–XX вв."
+ 21. Context-governed case: "около 500 км" -> "около пятисот километров", "к 5" -> "к пяти",
+ "с 500 рублями" -> "с пятьюстами рублями" (closed-class cardinal declension tables)
+ 22. Ordinal trigger nouns: "2 место" -> "второе место", "5 этаж" -> "пятый этаж"
+ 23. Compound number adjectives: "25-этажный" -> "двадцатипятиэтажный"
+ 24. Times beyond HH:MM: "02:25:00", "2PM" -> "два часа дня"; scores: "3:1" -> "три один"
+ 25. Versions/IP: "Python 3.11" -> "питон три точка одиннадцать", "192.168.1.1"
+ 26. Structural refs before a number: "ст. 158" -> "статья сто пятьдесят восемь"; math: "2+2=4"
+ 27. English word dictionary: "Google" -> "гугл"; Latin acronyms by English letter name: "GPS" -> "джи пи эс"
+ 28. ё restoration (unambiguous words only): "еще" -> "ещё"; hashtags: "#новости" -> "хештег новости"
+
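Items 4, 14 and 15 in the list above all depend on choosing the noun form that agrees with the number. A minimal sketch of that agreement rule, assuming the usual three-form Russian scheme (one form for 1, one for 2-4, one for 5 and up, with 11-14 taking the last form). The helper name `choose_plural` and the form tuples are hypothetical and are not taken from `russian.py`.

```python
def choose_plural(n, forms):
    """Pick the noun form that agrees with the integer n.

    forms = (form after 1, form after 2-4, form after 5 and up),
    e.g. ("рубль", "рубля", "рублей"). Hypothetical helper for illustration.
    """
    n = abs(n)
    last_two = n % 100
    last = n % 10
    if 11 <= last_two <= 14:   # 11-14 always take the "many" form
        return forms[2]
    if last == 1:
        return forms[0]
    if 2 <= last <= 4:
        return forms[1]
    return forms[2]


RUBLE = ("рубль", "рубля", "рублей")
POUND = ("фунт", "фунта", "фунтов")
MILLION = ("миллион", "миллиона", "миллионов")
KILOGRAM = ("килограмм", "килограмма", "килограммов")

for n in (1, 3, 5, 21, 112, 5678):
    print(n, choose_plural(n, RUBLE))
```

Only the last two digits matter, which is why the same function can serve currencies, units and multipliers such as "млн". It covers nominative counting contexts only; oblique cases need the declension tables mentioned in item 21.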
+ Notes:
+ - The letter ё is kept in the output (it carries pronunciation for TTS).
+ - Vocabularies are embedded in `russian.py` (single-file module). Abbreviations come from NVIDIA NeMo-text-processing
+ (`ru/whitelist.tsv`, Apache-2.0); only single-sense entries are used.
+
+ # Validation
+ Tested against the Google/Kaggle Russian text-normalization set
+ (`ru_train.csv`, 10,574,516 tokens). Each token's input is normalized in
+ isolation and compared to the gold output; "accuracy" is exact string match,
+ compared ё/е-insensitively (the reference data writes only е, this script keeps ё).
+ "Original" is the script before these changes.
+
+ The evaluation harness (`eval_assess.py`, `eval_extension.csv`), regression
+ tests and the dataset-cleaning script live on the `ru-2.0-alpha` branch; this
+ branch ships only the module itself.
+
+ | Domain (class) | Tokens | Original acc. | Current acc. | Notes |
+ |---|--:|--:|--:|---|
+ | PLAIN | 7,360,439 | 69.9% | 92.5% | residual: Latin spelled per-letter in gold |
+ | PUNCT | 2,288,640 | 100.0% | 100.0% | passthrough |
+ | CARDINAL | 272,442 | 51.2% | 77.0% | residual: oblique case of bare numbers (no context in token) |
+ | LETTERS | 189,528 | 0.8% | 0.0% | not targeted (gold uses bare letters, worse for TTS) |
+ | DATE | 185,961 | 0.0% | 86.2% | residual: bare years, ambiguous day-case |
+ | VERBATIM | 157,912 | 91.1% | 95.7% | symbol / Greek map |
+ | ORDINAL | 46,738 | 0.0% | 40.6% | residual: bare-number ordinals (need context) |
+ | MEASURE | 40,537 | 3.1% | 50.8% | residual: oblique case agreement |
+ | TELEPHONE | 10,088 | 0.3% | 1.3% | not targeted (irregular ISBN grouping) |
+ | DECIMAL | 7,299 | 6.1% | 54.3% | residual: oblique case agreement |
+ | ELECTRONIC | 5,832 | 2.6% | 2.8% | not targeted (English G2P + markers) |
+ | MONEY | 2,690 | 14.4% | 34.0% | residual: case agreement, "долларов сэ ш а" artifact |
+ | FRACTION | 2,460 | 0.0% | 66.0% | residual: context-dependent case |
+ | DIGIT | 2,012 | 0.0% | 100.0% | leading-zero digit strings |
+ | TIME | 1,949 | 0.0% | 85.3% | residual: oblique case, timezone suffixes |
+ | **Overall** | **10,574,570** | **73.0%** | **91.4%** | exact-match token accuracy (incl. eval_extension rows) |
+
+ The remaining error is dominated by cases that rules cannot resolve without a token
+ classifier or sentence context. Case agreement is now rule-handled when the
+ context is inside the token (`около 500 км` -> `около пятисот километров`,
+ governed by the preposition and the noun ending), but the gold set scores tokens in
+ isolation, where a bare `500 км` gives no case signal. The same holds for
+ disambiguating a bare number as cardinal, ordinal or year, and for the
+ classes left untargeted on purpose (LETTERS, TELEPHONE, ELECTRONIC). The test set
+ is treated as a regression guard, not a target: some choices (keeping ё, reading
+ acronyms as words, nominative Roman numerals) favour TTS quality over this score.
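The paragraph above says that case can be resolved by rule when the governing preposition is part of the input. A minimal sketch of that idea, assuming a small preposition-to-case map and a closed declension table that here covers only 5 and 500. All names (`PREPOSITION_CASE`, `CARDINALS`, `decline_numbers`) are hypothetical, and the module's actual tables and rules are likely broader (for instance, it also uses noun endings).

```python
import re

# Case governed by the preceding preposition (illustrative subset only).
PREPOSITION_CASE = {"около": "gen", "до": "gen", "к": "dat"}

# Closed-class declension table; a real one would cover every cardinal word.
CARDINALS = {
    5: {"nom": "пять", "gen": "пяти", "dat": "пяти"},
    500: {"nom": "пятьсот", "gen": "пятисот", "dat": "пятистам"},
}


def decline_numbers(text):
    """Spell out known numbers, using the preposition (if any) to pick the case."""

    def with_preposition(match):
        prep, number = match.group(1), int(match.group(2))
        forms = CARDINALS.get(number)
        if forms is None:
            return match.group(0)          # unknown number: leave untouched
        return f"{prep} {forms[PREPOSITION_CASE[prep.lower()]]}"

    def bare(match):
        forms = CARDINALS.get(int(match.group(0)))
        return forms["nom"] if forms else match.group(0)

    preps = "|".join(PREPOSITION_CASE)
    # First pass: numbers that directly follow a case-governing preposition.
    text = re.sub(rf"\b({preps})\s+(\d+)\b", with_preposition, text,
                  flags=re.IGNORECASE)
    # Second pass: remaining bare numbers default to the nominative.
    return re.sub(r"\b\d+\b", bare, text)


print(decline_numbers("около 500 км"))
print(decline_numbers("500 км"))
```

The second call shows the limitation the README describes: with no preposition in the input, the sketch can only fall back to the nominative.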
@@ -0,0 +1,100 @@
+ # Russian text normalization for TTS
+ Normalizes Russian text for text-to-speech (TTS) systems.
+
+
+ Install: `pip install russian-tts-normalization`, or just copy `russian.py`
+ (a single self-contained file, no dependencies) into the `text` folder of your
+ TTS system. It can also be used as a command-line filter:
+ `echo "цена 1 500 руб." | python3 russian.py`.
+
+ ```
+ from russian import normalize_russian
+
+ complex_test_text = """У меня есть $1234 и 5678 рублей. Кроме того, я должен 90.50€ и взял в долг 4321 GBP.
+ В моем кошельке было 876 UAH и 543.21 RUB, а также я нашел 20 центов."""
+
+ normalized_text = normalize_russian(complex_test_text)
+ print(normalized_text)
+ ```
+
+ Prints:
+
+ ```
+ У меня есть тысяча двести тридцать четыре доллара и пять тысяч шестьсот семьдесят восемь рублей. Кроме того, я должен девяносто евро пятьдесят евроцентов и взял в долг четыре тысячи триста двадцать один фунт.\nВ моем кошельке было восемьсот семьдесят шесть гривен и пятьсот сорок три рубля двадцать один копейка, а также я нашел двадцать центов.
+ ```
+
+ # Implemented
+ 1. Cyrillization of Latin-script words, e.g. "apple" -> "эппл".
+ 2. Abbreviation expansion, e.g. "СССР" -> "эс эс эс эр".
+ 3. Conversion of numbers of any size
+ 4. Currency expansion
+ 5. Phone number expansion
+ 6. Dates: "1862 год", "12 февраля 2013", "05.08.2008" -> ordinal year/day reading
+ 7. Ordinals with a suffix ("1-й" -> "первый") and Roman numerals ("XIX" -> "девятнадцатого")
+ 8. Decimals: "1,2" -> "одна целая и две десятых"; percentages: "50%" -> "пятьдесят процентов"
+ 9. Fractions: "2/3" -> "две третьих"
+ 10. Clock times: "06:06" -> "шесть часов шесть минут"
+ 11. Digit strings with a leading zero: "06" -> "ноль шесть"
+ 12. Symbols / foreign letters by name: "&" -> "и", "²" -> "в квадрате", "°C", Greek
+ 13. Space/NBSP-grouped thousands: "1 234 567" -> one number; negatives: "-5" -> "минус пять"
+ 14. Quantity multipliers: "5 млн" -> "пять миллионов" (agrees with the number)
+ 15. Units of measure: "5 кг" -> "пять килограммов", "90 км/ч" -> "...в час", "5 ГБ", "25°"
+ 16. Textual abbreviations: "и т.д." -> "и так далее"
+ 17. Acronyms: vowel-less spelled out ("СССР" -> "эс эс эс эр"), pronounceable kept as-is ("НАТО")
+ 18. E-mail/URL spell-out: "example.com" -> "ексампле точка ком"
+ 19. Dotted units ("82 т." -> "восемьдесят две тонны") and decimal counts ("1,5 км" -> "...километра")
+ 20. Years with г./гг. and ranges: "2008 г." -> "две тысячи восьмой год", "1941—1945 гг.", "XIX–XX вв."
+ 21. Context-governed case: "около 500 км" -> "около пятисот километров", "к 5" -> "к пяти",
+ "с 500 рублями" -> "с пятьюстами рублями" (closed-class cardinal declension tables)
+ 22. Ordinal trigger nouns: "2 место" -> "второе место", "5 этаж" -> "пятый этаж"
+ 23. Compound number adjectives: "25-этажный" -> "двадцатипятиэтажный"
+ 24. Times beyond HH:MM: "02:25:00", "2PM" -> "два часа дня"; scores: "3:1" -> "три один"
+ 25. Versions/IP: "Python 3.11" -> "питон три точка одиннадцать", "192.168.1.1"
+ 26. Structural refs before a number: "ст. 158" -> "статья сто пятьдесят восемь"; math: "2+2=4"
+ 27. English word dictionary: "Google" -> "гугл"; Latin acronyms by English letter name: "GPS" -> "джи пи эс"
+ 28. ё restoration (unambiguous words only): "еще" -> "ещё"; hashtags: "#новости" -> "хештег новости"
+
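Item 13 in the list above treats digits grouped by spaces or non-breaking spaces as a single number. A minimal sketch of that pre-processing step only, assuming the grouping is recognised by requiring every group after the first to have exactly three digits. The function name `join_thousands` and the exact pattern are hypothetical; the pattern used in `russian.py` may differ.

```python
import re

# 1-3 leading digits, then one or more groups of exactly three digits, each
# preceded by a space, a no-break space (U+00A0) or a narrow no-break space
# (U+202F). The lookarounds stop the match from starting or ending mid-number.
_GROUPED = re.compile(r"(?<!\d)\d{1,3}(?:[ \u00a0\u202f]\d{3})+(?!\d)")
_SEPARATORS = re.compile(r"[ \u00a0\u202f]")


def join_thousands(text):
    """Collapse '1 234 567' into '1234567' so later rules see one number."""
    return _GROUPED.sub(lambda m: _SEPARATORS.sub("", m.group(0)), text)


print(join_thousands("цена 1 500 руб."))
print(join_thousands("1 234 567"))
print(join_thousands("2008 150"))   # not a valid grouping, left as-is
```

Requiring exactly three digits per group is what keeps two adjacent but unrelated numbers, such as a year followed by a count, from being merged.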
+ Notes:
+ - The letter ё is kept in the output (it carries pronunciation for TTS).
+ - Vocabularies are embedded in `russian.py` (single-file module). Abbreviations come from NVIDIA NeMo-text-processing
+ (`ru/whitelist.tsv`, Apache-2.0); only single-sense entries are used.
+
+ # Validation
+ Tested against the Google/Kaggle Russian text-normalization set
+ (`ru_train.csv`, 10,574,516 tokens). Each token's input is normalized in
+ isolation and compared to the gold output; "accuracy" is exact string match,
+ compared ё/е-insensitively (the reference data writes only е, this script keeps ё).
+ "Original" is the script before these changes.
+
+ The evaluation harness (`eval_assess.py`, `eval_extension.csv`), regression
+ tests and the dataset-cleaning script live on the `ru-2.0-alpha` branch; this
+ branch ships only the module itself.
+
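The metric described above, exact string match with ё and е treated as the same letter, can be sketched as follows. This is an illustration of the comparison only; it is not the `eval_assess.py` harness, which is not part of this diff, and the function names are hypothetical.

```python
def fold_yo(text):
    """Map ё/Ё to е/Е so the comparison ignores the ё distinction."""
    return text.replace("ё", "е").replace("Ё", "Е")


def exact_match_accuracy(predictions, gold):
    """Share of tokens whose prediction equals the gold string after ё-folding."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(fold_yo(p) == fold_yo(g) for p, g in zip(predictions, gold))
    return hits / len(gold)


# Toy data: the first pair differs only in ё/е, the last pair in case ending.
predicted = ["ещё", "пять", "первый"]
reference = ["еще", "пять", "первого"]
print(exact_match_accuracy(predicted, reference))
```

Folding both sides, rather than stripping ё from the output, lets the normalizer keep ё for pronunciation while still being scored against reference data that writes only е.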
+ | Domain (class) | Tokens | Original acc. | Current acc. | Notes |
+ |---|--:|--:|--:|---|
+ | PLAIN | 7,360,439 | 69.9% | 92.5% | residual: Latin spelled per-letter in gold |
+ | PUNCT | 2,288,640 | 100.0% | 100.0% | passthrough |
+ | CARDINAL | 272,442 | 51.2% | 77.0% | residual: oblique case of bare numbers (no context in token) |
+ | LETTERS | 189,528 | 0.8% | 0.0% | not targeted (gold uses bare letters, worse for TTS) |
+ | DATE | 185,961 | 0.0% | 86.2% | residual: bare years, ambiguous day-case |
+ | VERBATIM | 157,912 | 91.1% | 95.7% | symbol / Greek map |
+ | ORDINAL | 46,738 | 0.0% | 40.6% | residual: bare-number ordinals (need context) |
+ | MEASURE | 40,537 | 3.1% | 50.8% | residual: oblique case agreement |
+ | TELEPHONE | 10,088 | 0.3% | 1.3% | not targeted (irregular ISBN grouping) |
+ | DECIMAL | 7,299 | 6.1% | 54.3% | residual: oblique case agreement |
+ | ELECTRONIC | 5,832 | 2.6% | 2.8% | not targeted (English G2P + markers) |
+ | MONEY | 2,690 | 14.4% | 34.0% | residual: case agreement, "долларов сэ ш а" artifact |
+ | FRACTION | 2,460 | 0.0% | 66.0% | residual: context-dependent case |
+ | DIGIT | 2,012 | 0.0% | 100.0% | leading-zero digit strings |
+ | TIME | 1,949 | 0.0% | 85.3% | residual: oblique case, timezone suffixes |
+ | **Overall** | **10,574,570** | **73.0%** | **91.4%** | exact-match token accuracy (incl. eval_extension rows) |
+
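The overall figures in the table are consistent with a token-weighted average of the per-class rows. A short check, using only the numbers printed in the table; the per-class counts as listed sum to 10,574,527, so the sketch divides by that sum rather than by either quoted total. The accuracies are the rounded percentages from the table, so the result is approximate.

```python
# (class, tokens, original accuracy %, current accuracy %), copied from the table.
ROWS = [
    ("PLAIN", 7_360_439, 69.9, 92.5),
    ("PUNCT", 2_288_640, 100.0, 100.0),
    ("CARDINAL", 272_442, 51.2, 77.0),
    ("LETTERS", 189_528, 0.8, 0.0),
    ("DATE", 185_961, 0.0, 86.2),
    ("VERBATIM", 157_912, 91.1, 95.7),
    ("ORDINAL", 46_738, 0.0, 40.6),
    ("MEASURE", 40_537, 3.1, 50.8),
    ("TELEPHONE", 10_088, 0.3, 1.3),
    ("DECIMAL", 7_299, 6.1, 54.3),
    ("ELECTRONIC", 5_832, 2.6, 2.8),
    ("MONEY", 2_690, 14.4, 34.0),
    ("FRACTION", 2_460, 0.0, 66.0),
    ("DIGIT", 2_012, 0.0, 100.0),
    ("TIME", 1_949, 0.0, 85.3),
]


def weighted_accuracy(rows, column):
    """Token-weighted mean of the accuracy stored at index `column` of each row."""
    total_tokens = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_tokens


original = weighted_accuracy(ROWS, 2)
current = weighted_accuracy(ROWS, 3)
print(round(original, 1), round(current, 1))
```

Because PLAIN and PUNCT account for most tokens, the overall score moves mainly with PLAIN; large gains in small classes such as DIGIT or TIME barely change it.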
+ The remaining error is dominated by cases that rules cannot resolve without a token
+ classifier or sentence context. Case agreement is now rule-handled when the
+ context is inside the token (`около 500 км` -> `около пятисот километров`,
+ governed by the preposition and the noun ending), but the gold set scores tokens in
+ isolation, where a bare `500 км` gives no case signal. The same holds for
+ disambiguating a bare number as cardinal, ordinal or year, and for the
+ classes left untargeted on purpose (LETTERS, TELEPHONE, ELECTRONIC). The test set
+ is treated as a regression guard, not a target: some choices (keeping ё, reading
+ acronyms as words, nominative Roman numerals) favour TTS quality over this score.
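One of the choices named above, reading pronounceable acronyms as words while spelling out vowel-less ones (item 17), can be sketched with a simple heuristic. This assumes "pronounceable" means "contains at least one vowel", which is a simplification, and uses the conventional Russian letter names for consonants. The names `VOWELS`, `LETTER_NAMES` and `read_acronym` are hypothetical, and the module's own test may be more refined.

```python
VOWELS = set("АЕЁИОУЫЭЮЯ")

# Conventional spoken names of Russian consonant letters.
LETTER_NAMES = {
    "Б": "бэ", "В": "вэ", "Г": "гэ", "Д": "дэ", "Ж": "жэ", "З": "зэ",
    "Й": "и краткое", "К": "ка", "Л": "эль", "М": "эм", "Н": "эн",
    "П": "пэ", "Р": "эр", "С": "эс", "Т": "тэ", "Ф": "эф", "Х": "ха",
    "Ц": "цэ", "Ч": "че", "Ш": "ша", "Щ": "ща",
}


def read_acronym(token):
    """Spell out an all-caps, vowel-less Cyrillic token; return anything else as-is."""
    if not token.isupper():
        return token                       # ordinary word, not an acronym
    if any(ch in VOWELS for ch in token):
        return token                       # has a vowel: assumed pronounceable
    if any(ch not in LETTER_NAMES for ch in token):
        return token                       # non-Cyrillic or unsupported letters
    return " ".join(LETTER_NAMES[ch] for ch in token)


for word in ("СССР", "НАТО", "дом"):
    print(word, "->", read_acronym(word))
```

Leaving vowel-bearing acronyms untouched hands them to the TTS front end as ordinary words, which matches the README's stated preference for TTS quality over the reference set's letter-by-letter gold output.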
@@ -0,0 +1,28 @@
+ [build-system]
+ requires = ["setuptools>=64"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "russian-tts-normalization"
+ version = "1.0.0"
+ description = "Russian text normalization for TTS: numbers, dates, currency, units, case agreement - one file, regex only, no dependencies"
+ readme = "README.md"
+ requires-python = ">=3.8"
+ license = { text = "MIT" }
+ authors = [{ name = "Ilya Shigabeev", email = "shigabeevilya@gmail.com" }]
+ keywords = ["tts", "text-normalization", "russian", "speech-synthesis"]
+ classifiers = [
+     "Development Status :: 5 - Production/Stable",
+     "Intended Audience :: Developers",
+     "License :: OSI Approved :: MIT License",
+     "Natural Language :: Russian",
+     "Programming Language :: Python :: 3",
+     "Topic :: Multimedia :: Sound/Audio :: Speech",
+     "Topic :: Text Processing :: Linguistic",
+ ]
+
+ [project.urls]
+ Homepage = "https://github.com/frappuccino/russian_tts_normalization"
+
+ [tool.setuptools]
+ py-modules = ["russian"]