pyhannom 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pyhannom-0.1.0/LICENSE +37 -0
- pyhannom-0.1.0/PKG-INFO +348 -0
- pyhannom-0.1.0/README.md +301 -0
- pyhannom-0.1.0/pyhannom/__init__.py +26 -0
- pyhannom-0.1.0/pyhannom/data_store.py +18 -0
- pyhannom-0.1.0/pyhannom/func_test_demo.py +28 -0
- pyhannom-0.1.0/pyhannom/get_chuhannom_from_latin.py +45 -0
- pyhannom-0.1.0/pyhannom/get_chuhannom_word_from_latin.py +42 -0
- pyhannom-0.1.0/pyhannom/get_latin_from_chuhannom.py +15 -0
- pyhannom-0.1.0/pyhannom/get_latin_word_from_chuhannom.py +21 -0
- pyhannom-0.1.0/pyhannom/load_syllable_char_table.py +50 -0
- pyhannom-0.1.0/pyhannom/load_word_table.py +70 -0
- pyhannom-0.1.0/pyhannom.egg-info/PKG-INFO +348 -0
- pyhannom-0.1.0/pyhannom.egg-info/SOURCES.txt +16 -0
- pyhannom-0.1.0/pyhannom.egg-info/dependency_links.txt +1 -0
- pyhannom-0.1.0/pyhannom.egg-info/top_level.txt +1 -0
- pyhannom-0.1.0/pyproject.toml +21 -0
- pyhannom-0.1.0/setup.cfg +4 -0
pyhannom-0.1.0/LICENSE
ADDED
@@ -0,0 +1,37 @@
MIT License (for code)

Copyright (c) 2025 ZHANG Zijie

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


-------------------------------------------------------------------------------
Data Usage Notice

This project includes data derived from the "委班復生漢喃越南 Uỷ ban Phục sinh Hán Nôm Việt Nam" project:

- https://www.hannom-rcv.org/BCHNCTD.html
- https://www.hannom-rcv.org/Lookup-CHNC.html

All rights to the original data belong to the "委班復生漢喃越南 Uỷ ban Phục sinh Hán Nôm Việt Nam".
The data included in this package is provided strictly for research and
educational purposes. Commercial use of the data is NOT permitted.

Redistribution or modification of the data must include proper attribution
to the original source and comply with any additional requirements set by
the "委班復生漢喃越南 Uỷ ban Phục sinh Hán Nôm Việt Nam".
pyhannom-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,348 @@
Metadata-Version: 2.1
Name: pyhannom
Version: 0.1.0
Summary: A Python toolkit for the digitization of Han-Nom characters, enabling AI and natural language processing applications with Vietnamese Han-Nom data.
Author-email: ZHANG Zijie <zijiezhang@link.cuhk.edu.hk>
License: MIT License (for code)

Copyright (c) 2025 ZHANG Zijie

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


-------------------------------------------------------------------------------
Data Usage Notice

This project includes data derived from the "委班復生漢喃越南 Uỷ ban Phục sinh Hán Nôm Việt Nam" project:

- https://www.hannom-rcv.org/BCHNCTD.html
- https://www.hannom-rcv.org/Lookup-CHNC.html

All rights to the original data belong to the "委班復生漢喃越南 Uỷ ban Phục sinh Hán Nôm Việt Nam".
The data included in this package is provided strictly for research and
educational purposes. Commercial use of the data is NOT permitted.

Redistribution or modification of the data must include proper attribution
to the original source and comply with any additional requirements set by
the "委班復生漢喃越南 Uỷ ban Phục sinh Hán Nôm Việt Nam".
Project-URL: Homepage, https://pypi.org/project/pyhannom/
Requires-Python: >=3.0
Description-Content-Type: text/markdown
License-File: LICENSE

# PyHanNom

## 📖 Introduction
**PyHanNom** is a Python package dedicated to the **modernization and digitization of Han-Nom characters**, the classical script system of Vietnam. In the era of **artificial intelligence**, where Python has become the dominant language for computation and research, it is essential to provide robust tools that connect **modern Vietnamese Latin script (Quốc Ngữ)** with its **Han-Nom heritage**.

This package offers a foundation for computational work with Han-Nom by enabling **bidirectional lookup and conversion ↔️** at the syllable, character, and word levels. Beyond simple character matching, PyHanNom is designed as part of a broader effort to make Han-Nom resources **machine-readable 💻, searchable 🔍, and ready for integration into AI-driven applications 🚀**.

By bridging modern orthography with historical scripts, PyHanNom contributes to:
- 🏛️ the **preservation and revival** of Han-Nom in digital form,
- 📚 the **development of linguistic resources** for computational linguistics and digital humanities,
- 🔮 and the **future of AI applications**, such as building parallel corpora between Latin Vietnamese and Han-Nom, or supporting natural language processing tasks.

PyHanNom is therefore not only a practical tool for researchers and developers today, but also a step toward the **long-term goal of bringing Han-Nom into the digital and AI era 🌏**.

---

## 📂 Data Source
- This project includes data derived from the **"委班復生漢喃越南 Uỷ ban Phục sinh Hán Nôm Việt Nam"** project:
  - https://www.hannom-rcv.org/BCHNCTD.html
  - https://www.hannom-rcv.org/Lookup-CHNC.html
- All rights to the original data belong to the *委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam*.
- The data included in this package is provided strictly for **research and educational purposes**. Commercial use of the data is **NOT permitted**.
- Since the mappings are directly derived from this source, **all case handling and annotations in PyHanNom remain exactly consistent with the original data**. For details, please refer to the original source above.

---

## ⚙️ Installation
```bash
pip install pyhannom
```

---

## 🚀 Usage

### 1. Load the syllable–character table
Before using any lookup functions, you need to load the Han-Nom syllable–character mapping table:

```python
from pyhannom import load_syllable_char_table

hannom_syllable_char_table = load_syllable_char_table()
```

Pass this `hannom_syllable_char_table` handle as the first argument to the syllable-level functions below, and to `load_word_table`.

---

### 2. Convert Latin syllable → Han-Nom character
Use `get_chuhannom_from_latin` to retrieve the Han-Nom character(s) corresponding to a given Vietnamese Latin syllable.

When an entry contains a character in parentheses, that character is the simplified form of the one preceding it.

```python
from pyhannom import get_chuhannom_from_latin

result = get_chuhannom_from_latin(
    hannom_syllable_char_table,
    "buộc"
)
print(result)  # e.g. ['𫃚']
```

```python
from pyhannom import get_chuhannom_from_latin

result = get_chuhannom_from_latin(
    hannom_syllable_char_table,
    "anh"
)
print(result)  # e.g. ['英', '英', '映', '罌', '嚶', '櫻', '鶯(𦾉)', '鸚']
```

#### Function signature
```python
get_chuhannom_from_latin(
    handle,
    input_latin_syllable: str,
    normalize_input_case: bool = True,
    case_insensitive_match: bool = True
)
```

- **handle**: the syllable–character table loaded by `load_syllable_char_table`.
- **input_latin_syllable** *(str)*: a single Vietnamese syllable in Latin script.
- **normalize_input_case** *(bool, default=True)*: whether to normalize the input to lowercase before matching.
- **case_insensitive_match** *(bool, default=True)*: whether to perform case-insensitive matching.
- **returns**: a list of the matched Han-Nom characters corresponding to the input syllable.

---

### 3. Convert Latin syllable → Han-Nom Unicode code points
Use `get_chuhannom_unicode_from_latin` to retrieve the **Unicode code points** of the Han-Nom character(s) corresponding to a given Vietnamese Latin syllable.

This function has the same input parameters as `get_chuhannom_from_latin`, but instead of returning the characters themselves, it returns their Unicode representations.

A code point shown in parentheses belongs to the simplified character that `get_chuhannom_from_latin` shows in parentheses (see the previous section).

```python
from pyhannom import get_chuhannom_unicode_from_latin

result = get_chuhannom_unicode_from_latin(
    hannom_syllable_char_table,
    "buộc"
)
print(result)  # e.g. ['U+2B0DA']
```

```python
from pyhannom import get_chuhannom_unicode_from_latin

result = get_chuhannom_unicode_from_latin(
    hannom_syllable_char_table,
    "anh"
)
print(result)
# e.g. ['U+82F1', 'U+82F1', 'U+6620', 'U+7F4C', 'U+56B6', 'U+6AFB', 'U+9DAF (U+26F89)', 'U+9E1A']
```

#### Function signature
```python
get_chuhannom_unicode_from_latin(
    handle,
    input_latin_syllable: str,
    normalize_input_case: bool = True,
    case_insensitive_match: bool = True
)
```

- **handle**: the syllable–character table loaded by `load_syllable_char_table`.
- **input_latin_syllable** *(str)*: a single Vietnamese syllable in Latin script.
- **normalize_input_case** *(bool, default=True)*: whether to normalize the input to lowercase before matching.
- **case_insensitive_match** *(bool, default=True)*: whether to perform case-insensitive matching.
- **returns**: a list of Unicode code points (as strings) corresponding to the matched Han-Nom characters.

---

### 4. Convert Han-Nom character → Latin syllable
Use `get_latin_from_chuhannom` to retrieve the Vietnamese Latin syllable(s) corresponding to a given Han-Nom character.

```python
from pyhannom import get_latin_from_chuhannom

result = get_latin_from_chuhannom(
    hannom_syllable_char_table,
    "心"
)
print(result)  # e.g. ['TÂM', 'tim']
```

#### Function signature
```python
get_latin_from_chuhannom(
    handle,
    input_chuhannom: str,
    normalize_output_case: bool = False
)
```

- **handle**: the syllable–character table loaded by `load_syllable_char_table`.
- **input_chuhannom** *(str)*: a single Han-Nom character.
- **normalize_output_case** *(bool, default=False)*: whether to normalize all returned Latin syllables to lowercase.
- **returns**: a list of corresponding Vietnamese Latin syllables.

---

### 5. Load the word-level table
Before using word-level lookup functions, you need to load the Han-Nom **word-level mapping table**.

This table is built on top of the syllable–character table:

```python
from pyhannom import load_word_table

hannom_word_table = load_word_table(hannom_syllable_char_table)
```

#### Function signature
```python
load_word_table(handle)
```

- **handle**: the syllable–character table loaded by `load_syllable_char_table`.
- **returns**: a word-level mapping table to be used with word-level functions.

---

### 6. Convert Latin syllables → Han-Nom words
Use `get_chuhannom_word_from_latin` to retrieve Han-Nom word(s) corresponding to a given sequence of Vietnamese Latin syllables.

The function returns every entry in the word-level dictionary whose Latin word contains all of the provided Latin syllables as substrings. Each result pairs the matching Han-Nom word with its Latin equivalent (and an optional annotation). Because matching is by substring, a partial syllable such as `tâ` also matches.
```python
from pyhannom import get_chuhannom_word_from_latin

result = get_chuhannom_word_from_latin(
    hannom_word_table,
    "ác tâ"
)
print(result)
# e.g. {('革新', 'cách tân'), ('賓客', 'tân khách'), ('惡心', 'ác tâm')}
```

```python
from pyhannom import get_chuhannom_word_from_latin

result = get_chuhannom_word_from_latin(
    hannom_word_table,
    "hưng hửng"
)
print(result)
# e.g. {('烝𬋙', 'chưng hửng', '[𠸨]'), ('𬋙𬋙', 'hưng hửng', '[𠸨]')}
```

#### Function signature
```python
get_chuhannom_word_from_latin(
    handle,
    input_latin_syllables: str,
    normalize_input_case: bool = False,
    case_insensitive_match: bool = True
)
```

- **handle**: the word-level table loaded by `load_word_table`.
- **input_latin_syllables** *(str)*: one or more Vietnamese Latin syllables.
- **normalize_input_case** *(bool, default=False)*: whether to normalize the input to lowercase before matching.
- **case_insensitive_match** *(bool, default=True)*: whether to perform case-insensitive matching.
- **returns**: a `set` of tuples. Each tuple contains:
  1. Han-Nom word (string)
  2. Corresponding Latin word (string)
  3. Optional annotation (string, may not always be present)

---

### 7. Convert Han-Nom words → Latin words
Use `get_latin_word_from_chuhannom` to retrieve the Vietnamese Latin word(s) corresponding to a given Han-Nom word or phrase.
The function returns every entry in the word-level dictionary whose Han-Nom word contains the provided Han-Nom string (one or more characters) as a substring, together with its Latin equivalent (and an optional annotation).

```python
from pyhannom import get_latin_word_from_chuhannom

result = get_latin_word_from_chuhannom(
    hannom_word_table,
    "稱雄"
)
print(result)
# e.g. {('稱雄', 'xưng hùng'), ('稱雄稱霸', 'xưng hùng xưng bá')}
```

```python
from pyhannom import get_latin_word_from_chuhannom

result = get_latin_word_from_chuhannom(
    hannom_word_table,
    "汴𠲅"
)
print(result)
# e.g. {('汴𠲅', 'bin (pin) sạc', '[摱]')}
```

#### Function signature
```python
get_latin_word_from_chuhannom(
    handle,
    input_chuhannom: str,
    normalize_output_case: bool = False
)
```

- **handle**: the word-level table loaded by `load_word_table`.
- **input_chuhannom** *(str)*: one or more Han-Nom characters forming a string.
- **normalize_output_case** *(bool, default=False)*: whether to normalize all returned Latin words to lowercase.
- **returns**: a `set` of tuples. Each tuple contains:
  1. Han-Nom word (string)
  2. Corresponding Latin word (string)
  3. Optional annotation (string, may not always be present)

---

## 📜 License
- **Code**: Licensed under the [MIT License](LICENSE).
- **Data**: Derived from the *委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam* project. Redistribution or modification of the data must include proper attribution and comply with the requirements set by the original source.

---

## 🤝 Contributing
- Pull requests are welcome.
- For major changes, please open an issue first to discuss what you’d like to change.
- Make sure to update tests as appropriate.

---

## 🌏 Acknowledgments
- This project makes use of data derived from the **"委班復生漢喃越南 Uỷ ban Phục sinh Hán Nôm Việt Nam"** project:
  - https://www.hannom-rcv.org/BCHNCTD.html
  - https://www.hannom-rcv.org/Lookup-CHNC.html
- All rights to the original data belong to the *委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam*.
- I would like to express my gratitude to the open-source Han-Nom community and the *委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam* project for making these resources available for research and educational purposes.
- If there is any infringement or concern regarding the use of this data, please contact me immediately. I will respond promptly to resolve the issue, including the possibility of removing this project if necessary.

---
pyhannom-0.1.0/README.md
ADDED
@@ -0,0 +1,301 @@
# PyHanNom

## 📖 Introduction
**PyHanNom** is a Python package dedicated to the **modernization and digitization of Han-Nom characters**, the classical script system of Vietnam. In the era of **artificial intelligence**, where Python has become the dominant language for computation and research, it is essential to provide robust tools that connect **modern Vietnamese Latin script (Quốc Ngữ)** with its **Han-Nom heritage**.

This package offers a foundation for computational work with Han-Nom by enabling **bidirectional lookup and conversion ↔️** at the syllable, character, and word levels. Beyond simple character matching, PyHanNom is designed as part of a broader effort to make Han-Nom resources **machine-readable 💻, searchable 🔍, and ready for integration into AI-driven applications 🚀**.

By bridging modern orthography with historical scripts, PyHanNom contributes to:
- 🏛️ the **preservation and revival** of Han-Nom in digital form,
- 📚 the **development of linguistic resources** for computational linguistics and digital humanities,
- 🔮 and the **future of AI applications**, such as building parallel corpora between Latin Vietnamese and Han-Nom, or supporting natural language processing tasks.

PyHanNom is therefore not only a practical tool for researchers and developers today, but also a step toward the **long-term goal of bringing Han-Nom into the digital and AI era 🌏**.

---

## 📂 Data Source
- This project includes data derived from the **"委班復生漢喃越南 Uỷ ban Phục sinh Hán Nôm Việt Nam"** project:
  - https://www.hannom-rcv.org/BCHNCTD.html
  - https://www.hannom-rcv.org/Lookup-CHNC.html
- All rights to the original data belong to the *委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam*.
- The data included in this package is provided strictly for **research and educational purposes**. Commercial use of the data is **NOT permitted**.
- Since the mappings are directly derived from this source, **all case handling and annotations in PyHanNom remain exactly consistent with the original data**. For details, please refer to the original source above.

---

## ⚙️ Installation
```bash
pip install pyhannom
```

---

## 🚀 Usage

### 1. Load the syllable–character table
Before using any lookup functions, you need to load the Han-Nom syllable–character mapping table:

```python
from pyhannom import load_syllable_char_table

hannom_syllable_char_table = load_syllable_char_table()
```

Pass this `hannom_syllable_char_table` handle as the first argument to the syllable-level functions below, and to `load_word_table`.

---

### 2. Convert Latin syllable → Han-Nom character
Use `get_chuhannom_from_latin` to retrieve the Han-Nom character(s) corresponding to a given Vietnamese Latin syllable.

When an entry contains a character in parentheses, that character is the simplified form of the one preceding it.

```python
from pyhannom import get_chuhannom_from_latin

result = get_chuhannom_from_latin(
    hannom_syllable_char_table,
    "buộc"
)
print(result)  # e.g. ['𫃚']
```

```python
from pyhannom import get_chuhannom_from_latin

result = get_chuhannom_from_latin(
    hannom_syllable_char_table,
    "anh"
)
print(result)  # e.g. ['英', '英', '映', '罌', '嚶', '櫻', '鶯(𦾉)', '鸚']
```

#### Function signature
```python
get_chuhannom_from_latin(
    handle,
    input_latin_syllable: str,
    normalize_input_case: bool = True,
    case_insensitive_match: bool = True
)
```

- **handle**: the syllable–character table loaded by `load_syllable_char_table`.
- **input_latin_syllable** *(str)*: a single Vietnamese syllable in Latin script.
- **normalize_input_case** *(bool, default=True)*: whether to normalize the input to lowercase before matching.
- **case_insensitive_match** *(bool, default=True)*: whether to perform case-insensitive matching.
- **returns**: a list of the matched Han-Nom characters corresponding to the input syllable.
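
#### Sketch: splitting an entry from its simplified variant
Some returned entries carry a simplified variant in parentheses, such as `'鶯(𦾉)'`. The helper below is a minimal sketch and not part of PyHanNom: the name `split_variant` and the regular expression are assumptions for illustration, based only on the output format shown in the examples above. It separates the main character from the optional variant. Because it tolerates whitespace before the parenthesis, it should also accept code point strings of the form `'U+9DAF (U+26F89)'`.

```python
import re

# Hypothetical helper, not a PyHanNom API.
# Matches "X" or "X(Y)", where Y is the simplified variant, e.g. "鶯(𦾉)".
_ENTRY = re.compile(r"^(?P<base>[^()\s]+)\s*(?:\((?P<simplified>[^()]+)\))?$")


def split_variant(entry):
    """Return (base, simplified_or_None) for one result entry."""
    match = _ENTRY.match(entry.strip())
    if match is None:
        raise ValueError(f"unexpected entry format: {entry!r}")
    return match.group("base"), match.group("simplified")


sample = ['英', '映', '鶯(𦾉)', '鸚']
pairs = [split_variant(entry) for entry in sample]
print(pairs)  # [('英', None), ('映', None), ('鶯', '𦾉'), ('鸚', None)]
```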

---

### 3. Convert Latin syllable → Han-Nom Unicode code points
Use `get_chuhannom_unicode_from_latin` to retrieve the **Unicode code points** of the Han-Nom character(s) corresponding to a given Vietnamese Latin syllable.

This function has the same input parameters as `get_chuhannom_from_latin`, but instead of returning the characters themselves, it returns their Unicode representations.

A code point shown in parentheses belongs to the simplified character that `get_chuhannom_from_latin` shows in parentheses (see the previous section).

```python
from pyhannom import get_chuhannom_unicode_from_latin

result = get_chuhannom_unicode_from_latin(
    hannom_syllable_char_table,
    "buộc"
)
print(result)  # e.g. ['U+2B0DA']
```

```python
from pyhannom import get_chuhannom_unicode_from_latin

result = get_chuhannom_unicode_from_latin(
    hannom_syllable_char_table,
    "anh"
)
print(result)
# e.g. ['U+82F1', 'U+82F1', 'U+6620', 'U+7F4C', 'U+56B6', 'U+6AFB', 'U+9DAF (U+26F89)', 'U+9E1A']
```

#### Function signature
```python
get_chuhannom_unicode_from_latin(
    handle,
    input_latin_syllable: str,
    normalize_input_case: bool = True,
    case_insensitive_match: bool = True
)
```

- **handle**: the syllable–character table loaded by `load_syllable_char_table`.
- **input_latin_syllable** *(str)*: a single Vietnamese syllable in Latin script.
- **normalize_input_case** *(bool, default=True)*: whether to normalize the input to lowercase before matching.
- **case_insensitive_match** *(bool, default=True)*: whether to perform case-insensitive matching.
- **returns**: a list of Unicode code points (as strings) corresponding to the matched Han-Nom characters.
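
#### Sketch: converting between `U+XXXX` strings and characters
The code points come back as strings such as `'U+82F1'`, so turning them into characters (or the reverse) takes one extra step. The two helpers below are a minimal sketch using only Python built-ins; the names `code_point_to_char` and `char_to_code_point` are assumptions for illustration and are not PyHanNom functions. Entries that carry a parenthesized variant would need to be split first.

```python
# Hypothetical helpers, not part of PyHanNom.

def code_point_to_char(code_point):
    """Convert a 'U+XXXX' string into the character it names."""
    if not code_point.upper().startswith("U+"):
        raise ValueError("expected a string like 'U+82F1', got %r" % code_point)
    # The hexadecimal digits follow the 'U+' prefix.
    return chr(int(code_point[2:], 16))


def char_to_code_point(char):
    """Convert a single character into 'U+XXXX' notation (at least 4 hex digits)."""
    return "U+{:04X}".format(ord(char))


print(code_point_to_char("U+82F1"))  # 英
print(char_to_code_point("英"))       # U+82F1
```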

---

### 4. Convert Han-Nom character → Latin syllable
Use `get_latin_from_chuhannom` to retrieve the Vietnamese Latin syllable(s) corresponding to a given Han-Nom character.

```python
from pyhannom import get_latin_from_chuhannom

result = get_latin_from_chuhannom(
    hannom_syllable_char_table,
    "心"
)
print(result)  # e.g. ['TÂM', 'tim']
```

#### Function signature
```python
get_latin_from_chuhannom(
    handle,
    input_chuhannom: str,
    normalize_output_case: bool = False
)
```

- **handle**: the syllable–character table loaded by `load_syllable_char_table`.
- **input_chuhannom** *(str)*: a single Han-Nom character.
- **normalize_output_case** *(bool, default=False)*: whether to normalize all returned Latin syllables to lowercase.
- **returns**: a list of corresponding Vietnamese Latin syllables.
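
#### Sketch: lowercasing and de-duplicating readings
With the default `normalize_output_case=False`, the result keeps the source data's casing, as in `['TÂM', 'tim']`. If a downstream step needs case-folded readings, something like the helper below could be applied to the returned list. It is a sketch under assumptions: `lowercase_unique` is a made-up name, and whether PyHanNom itself applies Unicode normalization or removes duplicates is not stated in this README. The sketch normalizes to NFC first so that precomposed and decomposed Vietnamese diacritics compare equal.

```python
import unicodedata

# Hypothetical post-processing helper, not part of PyHanNom.

def lowercase_unique(readings):
    """Lowercase each reading, drop duplicates, and keep first-seen order."""
    seen = set()
    ordered = []
    for reading in readings:
        # NFC makes composed and decomposed diacritics compare equal.
        lowered = unicodedata.normalize("NFC", reading).lower()
        if lowered not in seen:
            seen.add(lowered)
            ordered.append(lowered)
    return ordered


print(lowercase_unique(['TÂM', 'tim', 'Tim']))
```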

---

### 5. Load the word-level table
Before using word-level lookup functions, you need to load the Han-Nom **word-level mapping table**.

This table is built on top of the syllable–character table:

```python
from pyhannom import load_word_table

hannom_word_table = load_word_table(hannom_syllable_char_table)
```

#### Function signature
```python
load_word_table(handle)
```

- **handle**: the syllable–character table loaded by `load_syllable_char_table`.
- **returns**: a word-level mapping table to be used with word-level functions.

---

### 6. Convert Latin syllables → Han-Nom words
Use `get_chuhannom_word_from_latin` to retrieve Han-Nom word(s) corresponding to a given sequence of Vietnamese Latin syllables.

The function returns every entry in the word-level dictionary whose Latin word contains all of the provided Latin syllables as substrings. Each result pairs the matching Han-Nom word with its Latin equivalent (and an optional annotation). Because matching is by substring, a partial syllable such as `tâ` also matches.
```python
from pyhannom import get_chuhannom_word_from_latin

result = get_chuhannom_word_from_latin(
    hannom_word_table,
    "ác tâ"
)
print(result)
# e.g. {('革新', 'cách tân'), ('賓客', 'tân khách'), ('惡心', 'ác tâm')}
```

```python
from pyhannom import get_chuhannom_word_from_latin

result = get_chuhannom_word_from_latin(
    hannom_word_table,
    "hưng hửng"
)
print(result)
# e.g. {('烝𬋙', 'chưng hửng', '[𠸨]'), ('𬋙𬋙', 'hưng hửng', '[𠸨]')}
```

#### Function signature
```python
get_chuhannom_word_from_latin(
    handle,
    input_latin_syllables: str,
    normalize_input_case: bool = False,
    case_insensitive_match: bool = True
)
```

- **handle**: the word-level table loaded by `load_word_table`.
- **input_latin_syllables** *(str)*: one or more Vietnamese Latin syllables.
- **normalize_input_case** *(bool, default=False)*: whether to normalize the input to lowercase before matching.
- **case_insensitive_match** *(bool, default=True)*: whether to perform case-insensitive matching.
- **returns**: a `set` of tuples. Each tuple contains:
  1. Han-Nom word (string)
  2. Corresponding Latin word (string)
  3. Optional annotation (string, may not always be present)
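
#### Sketch: the substring matching rule on a toy table
The matching rule described in this section can be illustrated on a small in-memory list. This is a sketch of the rule as the README states it, not the package's actual implementation: `match_latin`, `toy_entries`, and the NFC normalization step are all assumptions made for illustration. It treats each whitespace-separated token of the query as a substring to find in the Latin word, which is consistent with the `"ác tâ"` example above also returning `'cách tân'`.

```python
import unicodedata

# Toy illustration only; not PyHanNom's implementation or data structures.

def _norm(text, fold_case):
    """Normalize to NFC and optionally lowercase for comparison."""
    text = unicodedata.normalize("NFC", text)
    return text.lower() if fold_case else text


def match_latin(entries, query, case_insensitive=True):
    """Return entries whose Latin word contains every query token as a substring."""
    tokens = _norm(query, case_insensitive).split()
    if not tokens:
        return set()
    return {
        entry
        for entry in entries
        if all(token in _norm(entry[1], case_insensitive) for token in tokens)
    }


# Hypothetical mini-table of (Han-Nom word, Latin word) pairs.
toy_entries = [
    ("革新", "cách tân"),
    ("惡心", "ác tâm"),
    ("稱雄", "xưng hùng"),
]

result = match_latin(toy_entries, "ác tâ")
print(sorted(result))
```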

---

### 7. Convert Han-Nom words → Latin words
Use `get_latin_word_from_chuhannom` to retrieve the Vietnamese Latin word(s) corresponding to a given Han-Nom word or phrase.
The function returns every entry in the word-level dictionary whose Han-Nom word contains the provided Han-Nom string (one or more characters) as a substring, together with its Latin equivalent (and an optional annotation).

```python
from pyhannom import get_latin_word_from_chuhannom

result = get_latin_word_from_chuhannom(
    hannom_word_table,
    "稱雄"
)
print(result)
# e.g. {('稱雄', 'xưng hùng'), ('稱雄稱霸', 'xưng hùng xưng bá')}
```

```python
from pyhannom import get_latin_word_from_chuhannom

result = get_latin_word_from_chuhannom(
    hannom_word_table,
    "汴𠲅"
)
print(result)
# e.g. {('汴𠲅', 'bin (pin) sạc', '[摱]')}
```

#### Function signature
```python
get_latin_word_from_chuhannom(
    handle,
    input_chuhannom: str,
    normalize_output_case: bool = False
)
```

- **handle**: the word-level table loaded by `load_word_table`.
- **input_chuhannom** *(str)*: one or more Han-Nom characters forming a string.
- **normalize_output_case** *(bool, default=False)*: whether to normalize all returned Latin words to lowercase.
- **returns**: a `set` of tuples. Each tuple contains:
  1. Han-Nom word (string)
  2. Corresponding Latin word (string)
  3. Optional annotation (string, may not always be present)
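
#### Sketch: handling the optional annotation
Because the annotation is optional, the returned set can mix two-item and three-item tuples, as the examples above show. One way to consume such results uniformly is to pad them into a fixed shape. The sketch below assumes nothing beyond that tuple layout; `WordEntry` and `to_entry` are hypothetical names, not PyHanNom APIs, and `sample_result` simply reuses values from the README examples.

```python
from collections import namedtuple

# Hypothetical container, not part of PyHanNom.
WordEntry = namedtuple("WordEntry", ["hannom", "latin", "annotation"])


def to_entry(raw):
    """Pad a 2- or 3-item result tuple so the annotation field always exists."""
    hannom, latin, *rest = raw
    return WordEntry(hannom, latin, rest[0] if rest else None)


# Values copied from the README examples above.
sample_result = {('稱雄', 'xưng hùng'), ('汴𠲅', 'bin (pin) sạc', '[摱]')}

# Sets are unordered, so sort by the Latin word for stable output.
entries = sorted((to_entry(raw) for raw in sample_result), key=lambda e: e.latin)
for entry in entries:
    print(entry.hannom, entry.latin, entry.annotation or "-")
```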

---

## 📜 License
- **Code**: Licensed under the [MIT License](LICENSE).
- **Data**: Derived from the *委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam* project. Redistribution or modification of the data must include proper attribution and comply with the requirements set by the original source.

---

## 🤝 Contributing
- Pull requests are welcome.
- For major changes, please open an issue first to discuss what you’d like to change.
- Make sure to update tests as appropriate.

---

## 🌏 Acknowledgments
- This project makes use of data derived from the **"委班復生漢喃越南 Uỷ ban Phục sinh Hán Nôm Việt Nam"** project:
  - https://www.hannom-rcv.org/BCHNCTD.html
  - https://www.hannom-rcv.org/Lookup-CHNC.html
- All rights to the original data belong to the *委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam*.
- I would like to express my gratitude to the open-source Han-Nom community and the *委班復生漢喃越南 Ủy ban Phục sinh Hán Nôm Việt Nam* project for making these resources available for research and educational purposes.
- If there is any infringement or concern regarding the use of this data, please contact me immediately. I will respond promptly to resolve the issue, including the possibility of removing this project if necessary.

---