zhconv-rs-opencc 0.3.2.post2__cp39-abi3-win32.whl → 0.4.0__cp39-abi3-win32.whl

Binary file
@@ -1,6 +1,6 @@
1
- Metadata-Version: 2.3
1
+ Metadata-Version: 2.4
2
2
  Name: zhconv-rs-opencc
3
- Version: 0.3.2.post2
3
+ Version: 0.4.0
4
4
  Classifier: Programming Language :: Rust
5
5
  Classifier: Programming Language :: Python :: Implementation :: CPython
6
6
  Classifier: Programming Language :: Python :: Implementation :: PyPy
@@ -21,33 +21,36 @@ Project-URL: Source, https://github.com/Gowee/zhconv-rs/tree/main/pyo3
21
21
  [![PyPI version](https://img.shields.io/pypi/v/zhconv-rs)](https://pypi.org/project/zhconv-rs/)
22
22
  [![NPM version](https://badge.fury.io/js/zhconv.svg)](https://www.npmjs.com/package/zhconv)
23
23
 
24
- # zhconv-rs 中文简繁及地區詞轉換
24
+ # zhconv-rs 中文简繁及地區詞轉換
25
25
 
26
- zhconv-rs converts Chinese text among traditional/simplified scripts or regional variants (e.g. `zh-TW <-> zh-CN <-> zh-HK <-> zh-Hans <-> zh-Hant`), backed by rulesets from MediaWiki/Wikipedia and OpenCC.
26
+ zhconv-rs converts Chinese between Traditional, Simplified and regional variants, using rulesets sourced from [MediaWiki](https://github.com/wikimedia/mediawiki)/Wikipedia and [OpenCC](https://github.com/BYVoid/OpenCC), which are merged, flattened and prebuilt into [Aho‑Corasick](https://en.wikipedia.org/wiki/Aho–Corasick_algorithm) automata for single-pass, linear-time conversions.
27
27
 
28
- It leverages the [Aho-Corasick](https://github.com/daac-tools/daachorse) algorithm for linear time complexity with respect to the length of input text and conversion rules (`O(n+m)`), processing dozens of MiBs text per second.
28
+ 🔗 **Web app (wasm):** <https://zhconv.pages.dev> (w/ OpenCC dictionaries)
29
29
 
30
- 🔗 **Web app: https://zhconv.pages.dev** (powered by WASM)
30
+ ⚙️ **CLI**: `cargo install zhconv` or download from [releases](https://github.com/Gowee/zhconv-rs/releases)
31
31
 
32
- ⚙️ **Cli**: `cargo install zhconv-cli` or check [releases](https://github.com/Gowee/zhconv-rs/releases).
32
+ 🦀 **Rust crate**: `cargo add zhconv` (see [docs](https://docs.rs/zhconv/latest/zhconv/) for details)
33
33
 
34
- 🦀 **Rust crate**: `cargo add zhconv` (check [docs](https://docs.rs/zhconv/latest/zhconv/) for examples)
34
+ ```rust
35
+ use zhconv::{zhconv, Variant};
36
+ assert_eq!(zhconv("雾失楼台,月迷津渡", Variant::ZhTW), "霧失樓臺,月迷津渡");
37
+ assert_eq!(zhconv("驛寄梅花,魚傳尺素", "zh-Hans".parse().unwrap()), "驿寄梅花,鱼传尺素");
38
+ ```
35
39
 
36
- 🐍 **Python package w/ wheels via PyO3**: `pip install zhconv-rs` or `pip install zhconv-rs-opencc` (with rulesets from OpenCC)
40
+ 🐍 **Python package w/ wheels**: `pip install zhconv-rs` or `pip install zhconv-rs-opencc` (for OpenCC dictionaries)
37
41
 
38
42
  <details open>
39
43
  <summary>Python snippet</summary>
40
44
 
41
45
  ```python
42
46
  # > pip install zhconv_rs
43
- # Convert with builtin rulesets:
47
+ # Convert using the built-in rulesets:
44
48
  from zhconv_rs import zhconv
45
49
  assert zhconv("天干物燥 小心火烛", "zh-tw") == "天乾物燥 小心火燭"
46
- assert zhconv("霧失樓臺,月迷津渡", "zh-hans") == "雾失楼台,月迷津渡"
47
50
  assert zhconv("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。", "zh-cn", mediawiki=True) == "《三个火枪手》是亚历山大·仲马的作品。"
48
51
  assert zhconv("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。", "zh-tw", True) == "《孤雛淚》是查爾斯·狄更斯的作品。"
49
52
 
50
- # Convert with custom rules:
53
+ # Convert using custom rules:
51
54
  from zhconv_rs import make_converter
52
55
  assert make_converter(None, [("天", "地"), ("水", "火")])("甘肅天水") == "甘肅地火"
53
56
 
@@ -58,9 +61,15 @@ assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去
58
61
 
59
62
  </details>
60
63
 
61
- **JS (Webpack)**: `npm install zhconv` or `yarn add zhconv` (WASM, [instructions](https://rustwasm.github.io/wasm-pack/book/tutorials/npm-browser-packages/using-your-library.html))
64
+ <a href="https://deploy.workers.cloudflare.com/?url=https://github.com/gowee/zhconv-rs">
65
+ <img src="https://deploy.workers.cloudflare.com/button" align="right" alt="Deploy to Cloudflare Workers">
66
+ </a>
67
+
68
+ 🧩 **API demo**: <https://zhconv.bamboo.workers.dev>
62
69
 
63
- **JS in browser**: https://cdn.jsdelivr.net/npm/zhconv-web@latest (WASM)
70
+ **Node.js package**: `npm install zhconv` or `yarn add zhconv`
71
+
72
+ **JS in browser**: <https://cdn.jsdelivr.net/npm/zhconv-web@latest>
64
73
 
65
74
  <details>
66
75
  <summary>HTML snippet</summary>
@@ -88,7 +97,9 @@ assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去
88
97
 
89
98
  </details>
90
99
 
91
- ## Supported variants
100
+ ## Variants and dictionaries
101
+
102
+ Unlike OpenCC, whose dictionaries are bidirectional (e.g., `s2t`, `tw2s`), zhconv-rs follows MediaWiki’s approach and provides one dictionary per target variant:
92
103
 
93
104
  <details>
94
105
  <summary>zh-Hant, zh-Hans, zh-TW, zh-HK, zh-MO, zh-CN, zh-SG, zh-MY</summary>
@@ -104,12 +115,23 @@ assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去
104
115
  | Chinese (Singapore) / 新加坡简体 | `zh-SG` | SC / 简 | Same as `zh-CN` for now. |
105
116
  | Chinese (Malaysia) / 大马简体 | `zh-MY` | SC / 简 | Same as `zh-CN` for now. |
106
117
 
107
- *Note:* `zh-TW` and `zh-HK` are based on `zh-Hant`. `zh-CN` are based on `zh-Hans`. Currently, `zh-MO` shares the same rulesets with `zh-HK` unless additional rules are manually configured; `zh-MY` and `zh-SG` shares the same rulesets with `zh-CN` unless additional rules are manually configured.
118
+ *Note:* `zh-TW` and `zh-HK` are derived from `zh-Hant`. `zh-CN` is derived from `zh-Hans`. Currently, `zh-MO` shares the same dictionary as `zh-HK`, and `zh-MY`/`zh-SG` share the same dictionary as `zh-CN`, unless additional rules are provided.
108
119
  </details>
109
120
 
121
+ Chained dictionary groups from OpenCC are flattened and merged with MediaWiki dictionaries for each target variant, then compiled into a single Aho-Corasick automaton at build time. After internal compression, the bundled dictionaries and automata occupy ~0.6 MiB (without OpenCC) or ~2.7 MiB (with OpenCC enabled).
122
+
110
123
  ## Performance
111
124
 
112
- `cargo bench` on `AMD EPYC 7B13` (GitPod) by v0.3:
125
+ Even with all dictionaries enabled, zhconv-rs remains faster than most alternatives. Check with `cargo bench compare --features opencc`:
126
+
127
+ ![Comparison with other crates, targeting zh-Hans](violin-to-hans.svg)
128
+ ![Comparison with other crates, targeting zh-TW](violin-to-tw.svg)
129
+
130
+ Conversion runs in a single pass in `O(n+m)` linear time by default, where `n` is the length of the input text and `m` is the maximum length of a source word in the dictionaries, regardless of which dictionaries are enabled. When converting wikitext containing MediaWiki conversion rules, the time complexity may degrade to `O(n*m)` in the worst case, but only if the corresponding function or flag is explicitly chosen.
131
+
132
+ On a typical modern PC, prebuilt converters load in a few milliseconds with default features (~2–5 ms). Enabling the optional `opencc` feature increases load time (typically 20–25 ms per target). Throughput generally ranges from 100 to 200 MB/s.
133
+
134
+ `cargo bench --features opencc` on `AMD EPYC 7B13` (GitPod) by v0.3:
113
135
 
114
136
  <details>
115
137
  <summary>w/ default features</summary>
@@ -138,7 +160,8 @@ is_hans data55k time: [404.73 µs 407.11 µs 409.59 µs]
138
160
  infer_variant data55k time: [1.0468 ms 1.0515 ms 1.0570 ms]
139
161
  is_hans data3185k time: [22.442 ms 22.589 ms 22.757 ms]
140
162
  infer_variant data3185k time: [60.205 ms 60.412 ms 60.627 ms]
141
- ```
163
+ ```
164
+
142
165
  </details>
143
166
 
144
167
  <details>
@@ -172,40 +195,29 @@ infer_variant data3185k time: [74.878 ms 76.262 ms 77.818 ms]
172
195
 
173
196
  </details>
174
197
 
175
- By default, only rulesets from MediaWiki are used. `opencc` feature can be enabled with `zhconv = { version = "...", features = [ "opencc" ] }`.
176
- But be noted that, other than performance decrease, it accounts for at least several MiBs in build output.
177
-
178
- <!--
179
- ## Differences with other converters
180
- * `ZhConver{sion,ter}.php` of MediaWiki: zhconv-rs just takes conversion tables listed in [`ZhConversion.php`](https://github.com/wikimedia/mediawiki/blob/master/includes/languages/data/ZhConversion.php#L14). MediaWiki relies on the inefficient PHP built-in function [`strtr`](https://github.com/php/php-src/blob/217fd932fa57d746ea4786b01d49321199a2f3d5/ext/standard/string.c#L2974). Under the basic mode, zhconv-rs guarantees linear time complexity (`T = O(n+m)` instead of `O(nm)`) and single-pass scanning of input text. Optionally, zhconv-rs supports the same conversion rule syntax with MediaWiki.
181
- * OpenCC: The [conversion rulesets](https://github.com/BYVoid/OpenCC/tree/master/data/dictionary) of OpenCC is independent of MediaWiki. The core [conversion implementation](https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/Conversion.cpp#L27) of OpenCC is kinda similar to the aforementioned `strtr`. However, OpenCC supports pre-segmentation and maintains multiple rulesets which are applied successively. By contrast, the Aho-Corasick-powered zhconv-rs merges rulesets from MediaWiki and OpenCC in compile time and converts text in single-pass linear time, resulting in much more efficiency. Though, conversion results may differ in some cases.
182
- ## Comparisions with other tools
183
- - OpenCC: Dict::MatchPrefix (iterating from maxlen to minlen character by character to match) [https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/Dict.cpp#L25](MatchPrefix), [segments converter](https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/Conversion.cpp#L27) [segmentizer](https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/MaxMatchSegmentation.cpp#L34)
184
- - zhConversion.php: strtr (iterating from maxlen to minlen for every known key length to match) [https://github.dev/php/php-src/blob/217fd932fa57d746ea4786b01d49321199a2f3d5/ext/standard/string.c#L2974]
185
- - zhconv-rs regex-based automaton
186
- -->
187
-
188
198
  ## Limitations
189
199
 
190
200
  ### Accuracy
191
201
 
192
- A rule-based converter cannot capture every possible linguistic nuance, resulting in limited accuracy. Besides, the converter employs a leftmost-longest matching strategy, prioritizing to the earliest and longest matches in the text. For instance, if a ruleset includes both `干 -> 幹` and `天干物燥 -> 天乾物燥`, the converter would prioritize `天乾物燥` because `天干物燥` gets matched earlier compared to `干` at a later position. This approach generally produces accurate results but may occasionally lead to incorrect conversions.
202
+ Rule-based converters cannot capture every possible linguistic nuance. Like most others, the implementation employs a leftmost-longest matching strategy (a.k.a. forward maximum matching), prioritizing the earliest and longest matches in the text. For example, if a ruleset contains `干 → 幹`, `天干 → 天干`, and `天干物燥 → 天乾物燥`, the converter will apply the longest rule and output `天乾物燥`, since `天干物燥` starts earliest and spans the most characters. This generally works well but may cause occasional mis-conversions.
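
The leftmost-longest strategy described above can be illustrated with a minimal Python sketch. This is not how zhconv-rs is implemented (the README says it uses prebuilt Aho-Corasick automata); it is a naive `O(n*m)` scan, and the function name `convert_leftmost_longest` and the `rules` dictionary are hypothetical names introduced only for illustration.

```python
def convert_leftmost_longest(text, rules):
    """Naive forward maximum matching (illustrative only, not the zhconv-rs algorithm).

    `rules` maps source words to their replacements.
    """
    # Longest source word; bounds how far ahead we ever need to look.
    max_len = max((len(src) for src in rules), default=0)
    out = []
    i = 0
    while i < len(text):
        # At the current (leftmost) position, try the longest candidate first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in rules:
                out.append(rules[candidate])
                i += length
                break
        else:
            # No rule matches here: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# The three rules from the example above.
rules = {"干": "幹", "天干": "天干", "天干物燥": "天乾物燥"}
result = convert_leftmost_longest("天干物燥 小心火烛", rules)
print(result)
```

With these three rules, the four-character rule wins at position 0, so the standalone `干 → 幹` rule never gets a chance to fire inside `天干物燥`.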
193
203
 
194
204
  ### Wikitext support
195
205
 
196
- While the implementation supports most MediaWiki conversion rules, it is not fully compliant with the original MediaWiki implementation.
206
+ The implementation supports most MediaWiki conversion rules, but it is not fully compliant with the original MediaWiki implementation.
197
207
 
198
- For wikitext inputs containing global conversion rules (e.g., `-{H|zh-hans:鹿|zh-hant:马}-` in MediaWiki syntax), the implementation's time complexity may degrade to `O(n*m)` in the worst case, where `n` is the input text length and `m` is the maximum length of source words in the ruleset. This is equivalent to a brute-force approach.
208
+ Since rebuilding automata dynamically is impractical, rules (e.g., `-{H|zh-hans:鹿|zh-hant:马}-` in MediaWiki syntax) in text are extracted in a first pass, a temporary automaton is constructed, and the text is converted in a second pass. The time complexity may degrade to `O(n*m)` in the worst case, where `n` is the input text length and `m` is the maximum length of source words in dictionaries, which is equivalent to a brute-force approach.
199
209
 
200
210
  ## Credits
201
211
 
202
212
  Rulesets/Dictionaries: [MediaWiki](https://github.com/wikimedia/mediawiki) and [OpenCC](https://github.com/BYVoid/OpenCC).
203
213
 
204
- References:
205
- - https://github.com/gumblex/zhconv : Python implementation of `zhConver{ter,sion}.php`.
206
- - https://github.com/BYVoid/OpenCC/ : Widely adopted Chinese converter.
207
- - https://zh.wikipedia.org/wiki/Wikipedia:字詞轉換處理
208
- - https://zh.wikipedia.org/wiki/Help:高级字词转换语法
209
- - https://github.com/wikimedia/mediawiki/blob/master/includes/language/LanguageConverter.php
210
- <!--- https://www.hankcs.com/nlp/simplified-traditional-chinese-conversion.html-->
214
+ Fast double-array Aho-Corasick automata implementation in Rust: [daachorse](https://github.com/daac-tools/daachorse)
215
+
216
+ References & related implementations:
217
+
218
+ - <https://github.com/gumblex/zhconv> : Python implementation of `zhConver{ter,sion}.php`.
219
+ - <https://github.com/BYVoid/OpenCC/> : Widely adopted Chinese converter.
220
+ - <https://zh.wikipedia.org/wiki/Wikipedia:字詞轉換處理>
221
+ - <https://zh.wikipedia.org/wiki/Help:高级字词转换语法>
222
+ - <https://github.com/wikimedia/mediawiki/blob/master/includes/language/LanguageConverter.php>
211
223
 
@@ -0,0 +1,5 @@
1
+ zhconv_rs_opencc-0.4.0.dist-info/METADATA,sha256=gIYKvShqsUrC-KQgmtF-WOTuOz1KTce7SsEXVY_XfgA,12835
2
+ zhconv_rs_opencc-0.4.0.dist-info/WHEEL,sha256=-ACW9IiXdK4LnXaTRwEcdj6JI7tqGnsz4Ppj7XWagMY,90
3
+ zhconv_rs_opencc/__init__.py,sha256=L3TIYc7ax1hTtB8KsKhGqldXEvEJM9S9AoJGd0JIdaA,147
4
+ zhconv_rs_opencc/zhconv_rs_opencc.pyd,sha256=w5k77YnotdtGjA1cjTyMKLFV59ymvx_hUJDN9UkAMu4,5448192
5
+ zhconv_rs_opencc-0.4.0.dist-info/RECORD,,
@@ -1,4 +1,4 @@
1
1
  Wheel-Version: 1.0
2
- Generator: maturin (1.7.5)
2
+ Generator: maturin (1.8.7)
3
3
  Root-Is-Purelib: false
4
4
  Tag: cp39-abi3-win32
@@ -1,5 +0,0 @@
1
- zhconv_rs_opencc-0.3.2.post2.dist-info/METADATA,sha256=jUSJd9SxFqMWNNk2l9GS1Hnj1C5aRuSFQvDbHPftTHo,13329
2
- zhconv_rs_opencc-0.3.2.post2.dist-info/WHEEL,sha256=M0Ic6pAGuWT2rFmFPpb-SpFKB1PqNMLrguwJiCZANxw,90
3
- zhconv_rs_opencc/__init__.py,sha256=L3TIYc7ax1hTtB8KsKhGqldXEvEJM9S9AoJGd0JIdaA,147
4
- zhconv_rs_opencc/zhconv_rs_opencc.pyd,sha256=WJ5FGJlUOIzCsU-krE-5-jlqSmKuvsV7497xUCtH5YU,5547520
5
- zhconv_rs_opencc-0.3.2.post2.dist-info/RECORD,,