zhconv-rs-opencc 0.3.2.post2__cp39-abi3-win32.whl → 0.4.0__cp39-abi3-win32.whl
- zhconv_rs_opencc/zhconv_rs_opencc.pyd +0 -0
- {zhconv_rs_opencc-0.3.2.post2.dist-info → zhconv_rs_opencc-0.4.0.dist-info}/METADATA +53 -41
- zhconv_rs_opencc-0.4.0.dist-info/RECORD +5 -0
- {zhconv_rs_opencc-0.3.2.post2.dist-info → zhconv_rs_opencc-0.4.0.dist-info}/WHEEL +1 -1
- zhconv_rs_opencc-0.3.2.post2.dist-info/RECORD +0 -5
Binary file
@@ -1,6 +1,6 @@
-Metadata-Version: 2.
+Metadata-Version: 2.4
 Name: zhconv-rs-opencc
-Version: 0.
+Version: 0.4.0
 Classifier: Programming Language :: Rust
 Classifier: Programming Language :: Python :: Implementation :: CPython
 Classifier: Programming Language :: Python :: Implementation :: PyPy
@@ -21,33 +21,36 @@ Project-URL: Source, https://github.com/Gowee/zhconv-rs/tree/main/pyo3
 [](https://pypi.org/project/zhconv-rs/)
 [](https://www.npmjs.com/package/zhconv)
 
-# zhconv-rs 中文简繁及地區詞轉換
+# zhconv-rs — 中文简繁及地區詞轉換
 
-zhconv-rs converts Chinese
+zhconv-rs converts Chinese between Traditional, Simplified and regional variants, using rulesets sourced from [MediaWiki](https://github.com/wikimedia/mediawiki)/Wikipedia and [OpenCC](https://github.com/BYVoid/OpenCC), which are merged, flattened and prebuilt into [Aho‑Corasick](https://en.wikipedia.org/wiki/Aho–Corasick_algorithm) automata for single-pass, linear-time conversions.
 
-
+🔗 **Web app (wasm):** <https://zhconv.pages.dev> (w/ OpenCC dictionaries)
 
-
+⚙️ **Cli**: `cargo install zhconv` or download from [releases](https://github.com/Gowee/zhconv-rs/releases)
 
-
+🦀 **Rust crate**: `cargo add zhconv` (see [docs](https://docs.rs/zhconv/latest/zhconv/) for details)
 
-
+```rust
+use zhconv::{zhconv, Variant};
+assert_eq!(zhconv("雾失楼台,月迷津渡", Variant::ZhTW), "霧失樓臺,月迷津渡");
+assert_eq!(zhconv("驛寄梅花,魚傳尺素", "zh-Hans".parse().unwrap()), "驿寄梅花,鱼传尺素");
+```
 
-🐍 **Python package w/ wheels
+🐍 **Python package w/ wheels**: `pip install zhconv-rs` or `pip install zhconv-rs-opencc` (for OpenCC dictionaries)
 
 <details open>
 <summary>Python snippet</summary>
 
 ```python
 # > pip install zhconv_rs
-# Convert
+# Convert using the built-in rulesets:
 from zhconv_rs import zhconv
 assert zhconv("天干物燥 小心火烛", "zh-tw") == "天乾物燥 小心火燭"
-assert zhconv("霧失樓臺,月迷津渡", "zh-hans") == "雾失楼台,月迷津渡"
 assert zhconv("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。", "zh-cn", mediawiki=True) == "《三个火枪手》是亚历山大·仲马的作品。"
 assert zhconv("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。", "zh-tw", True) == "《孤雛淚》是查爾斯·狄更斯的作品。"
 
-# Convert
+# Convert using custom rules:
 from zhconv_rs import make_converter
 assert make_converter(None, [("天", "地"), ("水", "火")])("甘肅天水") == "甘肅地火"
 
@@ -58,9 +61,15 @@ assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去
 
 </details>
 
-
+<a href="https://deploy.workers.cloudflare.com/?url=https://github.com/gowee/zhconv-rs">
+<img src="https://deploy.workers.cloudflare.com/button" align="right" alt="Deploy to Cloudflare Workers">
+</a>
+
+🧩 **API demo**: <https://zhconv.bamboo.workers.dev>
 
-**
+**Node.js package**: `npm install zhconv` or `yarn add zhconv`
+
+**JS in browser**: <https://cdn.jsdelivr.net/npm/zhconv-web@latest>
 
 <details>
 <summary>HTML snippet</summary>
@@ -88,7 +97,9 @@ assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去
 
 </details>
 
-##
+## Variants and dictionaries
+
+Unlike OpenCC, whose dictionaries are bidirectional (e.g., `s2t`, `tw2s`), zhconv-rs follows MediaWiki’s approach and provides one dictionary per target variant:
 
 <details>
 <summary>zh-Hant, zh-Hans, zh-TW, zh-HK, zh-MO, zh-CN, zh-SG, zh-MY</summary>
@@ -104,12 +115,23 @@ assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去
 | Chinese (Singapore) / 新加坡简体 | `zh-SG` | SC / 简 | Same as `zh-CN` for now. |
 | Chinese (Malaysia) / 大马简体 | `zh-MY` | SC / 简 | Same as `zh-CN` for now. |
 
-*Note:* `zh-TW` and `zh-HK` are
+*Note:* `zh-TW` and `zh-HK` are derived from `zh-Hant`. `zh-CN` is derived from `zh-Hans`. Currently, `zh-MO` shares the same dictionary as `zh-HK`, and `zh-MY`/`zh-SG` share the same dictionary as `zh-CN`, unless additional rules are provided.
 </details>
 
+Chained dictionary groups from OpenCC are flattened and merged with MediaWiki dictionaries for each target variant, then compiled into a single Aho-Corasick automaton at build time. After internal compression, the bundled dictionaries and automata occupy ~0.6 MiB (without OpenCC) or ~2.7 MiB (with OpenCC enabled).
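The flattening step mentioned in the added paragraph above can be pictured as composing two chained dictionaries into one. The following is only a rough sketch under stated assumptions: `convert`, `flatten_chain`, and the sample entries are hypothetical names and data invented for illustration, the matcher is a naive loop rather than an Aho-Corasick automaton, and this is not the package's actual build step.

```python
def convert(text, rules):
    """Naive leftmost-longest replacement; a stand-in for the real automaton."""
    max_len = max((len(src) for src in rules), default=0)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in rules:
                out.append(rules[chunk])
                i += length
                break
        else:
            # No rule matches here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def flatten_chain(first, second):
    """Compose two chained dictionaries into one source -> final mapping."""
    flat = dict(second)  # entries the second stage would apply on its own
    for src, mid in first.items():
        # Push each first-stage output through the second stage once, ahead of time.
        flat[src] = convert(mid, second)
    return flat


# Hypothetical two-stage chain: Simplified -> Traditional, then regional phrasing.
s2t = {"软": "軟", "软件": "軟件"}
tw_phrases = {"軟件": "軟體"}
flat = flatten_chain(s2t, tw_phrases)
```

With the flattened table, a single pass over `"软件"` yields the same result as running the two stages one after another. A naive composition like this is not equivalent to the chain in every case (matches that straddle word boundaries can differ), so it should be read as an illustration of the idea only.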
+
 ## Performance
 
-
+Even with all dictionaries enabled, zhconv-rs remains faster than most alternatives. Check with `cargo bench compare --features opencc`:
+
+
+
+
+Conversion runs in a single pass in `O(n+m)` linear time by default, where `n` is the length of the input text and `m` is the maximum length of source word in dictionaries, regardless of enabled dictionaries. When converting wikitext containing MediaWiki conversion rules, the time complexity may degrade to `O(n*m)` in the worst case, if the corresponding function or flag is explicitly chosen.
+
+On a typical modern PC, prebuilt converters load in a few milliseconds with default features (~2–5 ms). Enabling the optional opencc feature increases load time (typically 20–25 ms per target). Throughput generally ranges from 100–200 MB/s.
+
+`cargo bench --features opencc` on `AMD EPYC 7B13` (GitPod) by v0.3:
 
 <details>
 <summary>w/ default features</summary>
@@ -138,7 +160,8 @@ is_hans data55k time: [404.73 µs 407.11 µs 409.59 µs]
 infer_variant data55k time: [1.0468 ms 1.0515 ms 1.0570 ms]
 is_hans data3185k time: [22.442 ms 22.589 ms 22.757 ms]
 infer_variant data3185k time: [60.205 ms 60.412 ms 60.627 ms]
-```
+```
+
 </details>
 
 <details>
@@ -172,40 +195,29 @@ infer_variant data3185k time: [74.878 ms 76.262 ms 77.818 ms]
 
 </details>
 
-By default, only rulesets from MediaWiki are used. `opencc` feature can be enabled with `zhconv = { version = "...", features = [ "opencc" ] }`.
-But be noted that, other than performance decrease, it accounts for at least several MiBs in build output.
-
-<!--
-## Differences with other converters
-* `ZhConver{sion,ter}.php` of MediaWiki: zhconv-rs just takes conversion tables listed in [`ZhConversion.php`](https://github.com/wikimedia/mediawiki/blob/master/includes/languages/data/ZhConversion.php#L14). MediaWiki relies on the inefficient PHP built-in function [`strtr`](https://github.com/php/php-src/blob/217fd932fa57d746ea4786b01d49321199a2f3d5/ext/standard/string.c#L2974). Under the basic mode, zhconv-rs guarantees linear time complexity (`T = O(n+m)` instead of `O(nm)`) and single-pass scanning of input text. Optionally, zhconv-rs supports the same conversion rule syntax with MediaWiki.
-* OpenCC: The [conversion rulesets](https://github.com/BYVoid/OpenCC/tree/master/data/dictionary) of OpenCC is independent of MediaWiki. The core [conversion implementation](https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/Conversion.cpp#L27) of OpenCC is kinda similar to the aforementioned `strtr`. However, OpenCC supports pre-segmentation and maintains multiple rulesets which are applied successively. By contrast, the Aho-Corasick-powered zhconv-rs merges rulesets from MediaWiki and OpenCC in compile time and converts text in single-pass linear time, resulting in much more efficiency. Though, conversion results may differ in some cases.
-## Comparisions with other tools
-- OpenCC: Dict::MatchPrefix (iterating from maxlen to minlen character by character to match) [https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/Dict.cpp#L25](MatchPrefix), [segments converter](https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/Conversion.cpp#L27) [segmentizer](https://github.dev/BYVoid/OpenCC/blob/21995f5ea058441423aaff3ee89b0a5d4747674c/src/MaxMatchSegmentation.cpp#L34)
-- zhConversion.php: strtr (iterating from maxlen to minlen for every known key length to match) [https://github.dev/php/php-src/blob/217fd932fa57d746ea4786b01d49321199a2f3d5/ext/standard/string.c#L2974]
-- zhconv-rs regex-based automaton
--->
-
 ## Limitations
 
 ### Accuracy
 
-
+Rule-based converters cannot capture every possible linguistic nuance. Like most others, the implementation employs a leftmost-longest matching strategy (a.k.a forward maximum matching), prioritizing to the earliest and longest matches in the text. For example, if a ruleset contains both `干 → 幹`, `天干 → 天干`, and `天干物燥 → 天乾物燥`, the converter will prefer the longer match `天乾物燥`, since it appears earlier and spans more characters. This generally works well but may cause occasional mis-conversions.
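The leftmost-longest strategy described in the added Accuracy paragraph can be sketched with a naive matcher. This is an illustrative sketch only: `convert_leftmost_longest` is a hypothetical helper, it runs in `O(n*m)` rather than using the package's Aho-Corasick automaton, and the ruleset is just the three example rules quoted in the paragraph.

```python
def convert_leftmost_longest(text, rules):
    """Naive forward-maximum-matching conversion (illustrative only)."""
    max_len = max((len(src) for src in rules), default=0)
    out = []
    i = 0
    while i < len(text):
        # At the earliest unconverted position, try the longest candidate first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            src = text[i:i + length]
            if src in rules:
                out.append(rules[src])
                i += length
                break
        else:
            # No rule starts here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# The three example rules from the paragraph above.
rules = {"干": "幹", "天干": "天干", "天干物燥": "天乾物燥"}

full_phrase = convert_leftmost_longest("天干物燥", rules)  # longest rule wins
short_word = convert_leftmost_longest("天干", rules)       # two-character rule applies
single_char = convert_leftmost_longest("干活", rules)      # only the one-character rule matches
```

The same input character `干` ends up as `乾`, `干`, or `幹` depending on which rule covers it, which is exactly why a longer rule placed earlier in the text can occasionally produce an unintended conversion.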
 
 ### Wikitext support
 
-
+The implementation supports most MediaWiki conversion rules, while not fully compliant with the original MediaWiki implementation.
 
-
+Since rebuilding automata dynamically is impractical, rules (e.g., `-{H|zh-hans:鹿|zh-hant:马}-` in MediaWiki syntax) in text are extracted in a first pass, a temporary automaton is constructed, and the text is converted in a second pass. The time complexity may degrade to `O(n*m)` in the worst case, where `n` is the input text length and `m` is the maximum length of source words in dictionaries, which is equivalent to a brute-force approach.
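The two-pass flow described in the added Wikitext paragraph can be sketched roughly as follows. Everything here is an assumption made for illustration: `extract_rules`, `convert_with_inline_rules`, and `convert` are hypothetical helpers, only hidden global rules of the form `-{H|variant:word;variant:word}-` (as in the README's Python snippet) are recognized, and a plain dictionary with a naive matcher stands in for the temporary automaton.

```python
import re

# Only hidden global rules of the form -{H|variant:word;variant:word}- are handled.
HIDDEN_RULE = re.compile(r"-\{H\|(.*?)\}-")


def convert(text, rules):
    """Naive leftmost-longest replacement; a stand-in for the temporary automaton."""
    max_len = max((len(src) for src in rules), default=0)
    out, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in rules:
                out.append(rules[chunk])
                i += length
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def extract_rules(text, target):
    """First pass: strip inline rules and collect source -> target pairs."""
    pairs = {}

    def collect(match):
        by_variant = {}
        for item in match.group(1).split(";"):
            variant, sep, word = item.partition(":")
            if sep:  # skip empty trailing items
                by_variant[variant.strip().lower()] = word.strip()
        wanted = by_variant.get(target)
        if wanted is not None:
            # Every other variant's wording should convert to the target's wording.
            for word in by_variant.values():
                if word != wanted:
                    pairs[word] = wanted
        return ""  # hidden rules leave no output of their own

    return HIDDEN_RULE.sub(collect, text), pairs


def convert_with_inline_rules(text, target, base_rules):
    stripped, inline = extract_rules(text, target)
    temporary = {**base_rules, **inline}  # inline rules override the base ruleset
    return convert(stripped, temporary)   # second pass over the stripped text


text = "-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。"
result = convert_with_inline_rules(text, "zh-tw", {"尔": "爾"})
```

The base ruleset here contains a single made-up entry so that the example stays small; real MediaWiki rule syntax has more flags and forms than this sketch parses.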
 
 ## Credits
 
 Rulesets/Dictionaries: [MediaWiki](https://github.com/wikimedia/mediawiki) and [OpenCC](https://github.com/BYVoid/OpenCC).
 
-
-
-
-
-- https://
-- https://github.com/
-
+Fast double-array Aho-Corasick automata implementation in Rust: [daachorse](https://github.com/daac-tools/daachorse)
+
+References & related implementations:
+
+- <https://github.com/gumblex/zhconv> : Python implementation of `zhConver{ter,sion}.php`.
+- <https://github.com/BYVoid/OpenCC/> : Widely adopted Chinese converter.
+- <https://zh.wikipedia.org/wiki/Wikipedia:字詞轉換處理>
+- <https://zh.wikipedia.org/wiki/Help:高级字词转换语法>
+- <https://github.com/wikimedia/mediawiki/blob/master/includes/language/LanguageConverter.php>
 
@@ -0,0 +1,5 @@
+zhconv_rs_opencc-0.4.0.dist-info/METADATA,sha256=gIYKvShqsUrC-KQgmtF-WOTuOz1KTce7SsEXVY_XfgA,12835
+zhconv_rs_opencc-0.4.0.dist-info/WHEEL,sha256=-ACW9IiXdK4LnXaTRwEcdj6JI7tqGnsz4Ppj7XWagMY,90
+zhconv_rs_opencc/__init__.py,sha256=L3TIYc7ax1hTtB8KsKhGqldXEvEJM9S9AoJGd0JIdaA,147
+zhconv_rs_opencc/zhconv_rs_opencc.pyd,sha256=w5k77YnotdtGjA1cjTyMKLFV59ymvx_hUJDN9UkAMu4,5448192
+zhconv_rs_opencc-0.4.0.dist-info/RECORD,,
@@ -1,5 +0,0 @@
-zhconv_rs_opencc-0.3.2.post2.dist-info/METADATA,sha256=jUSJd9SxFqMWNNk2l9GS1Hnj1C5aRuSFQvDbHPftTHo,13329
-zhconv_rs_opencc-0.3.2.post2.dist-info/WHEEL,sha256=M0Ic6pAGuWT2rFmFPpb-SpFKB1PqNMLrguwJiCZANxw,90
-zhconv_rs_opencc/__init__.py,sha256=L3TIYc7ax1hTtB8KsKhGqldXEvEJM9S9AoJGd0JIdaA,147
-zhconv_rs_opencc/zhconv_rs_opencc.pyd,sha256=WJ5FGJlUOIzCsU-krE-5-jlqSmKuvsV7497xUCtH5YU,5547520
-zhconv_rs_opencc-0.3.2.post2.dist-info/RECORD,,