sanzang 0.0.3 → 1.0.0
- checksums.yaml +7 -0
- data/HACKING +22 -40
- data/MANUAL +312 -0
- data/README +35 -265
- data/bin/{sanzang-reflow → sanzang} +1 -1
- data/lib/sanzang.rb +12 -35
- data/lib/sanzang/batch_translator.rb +77 -0
- data/lib/sanzang/command/batch.rb +131 -0
- data/lib/sanzang/command/reflow.rb +35 -32
- data/lib/sanzang/command/sanzang_cmd.rb +132 -0
- data/lib/sanzang/command/translate.rb +47 -70
- data/lib/sanzang/text_formatter.rb +1 -1
- data/lib/sanzang/translation_table.rb +34 -49
- data/lib/sanzang/translator.rb +1 -65
- data/lib/sanzang/version.rb +3 -2
- data/test/tc_reflow_encodings.rb +2 -2
- data/test/tc_simple_translation.rb +8 -12
- data/test/utf-8/stage_3.txt +4 -0
- metadata +25 -31
- data/bin/sanzang-translate +0 -21
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: ee6e8f6e1b7cbd116de22589a0089faa73b35726
|
4
|
+
data.tar.gz: 2110931d7a863e52dc6ade5b17c060d4dc9ab12e
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: 65710d20a08d6b20f82d3c909468c8e5d03b54fd5f96657247d0eaededc587d1394c66183660e50c654990f0bccfcb1efc600af5672e518cbaa777cd9a5ba3f8
|
7
|
+
data.tar.gz: 039837a397ed9f5561d5f2c6af23fe4a8991b305d134a7f4cfe6be3f67bad0506bf1ca6d8bf332586f51243a8bfbd1acb8302beb60b107e4ca165b56f66550ea
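The digests recorded in checksums.yaml cover the metadata.gz and data.tar.gz members of the gem archive. As a minimal sketch of how one of the SHA512 values could be checked, assuming data.tar.gz has already been extracted from the .gem file (the helper name `digest_matches?` is hypothetical and not part of RubyGems or Sanzang):

```ruby
require "digest"

# Hypothetical helper: compare the SHA512 digest of a file on disk
# with an expected hexadecimal string, such as one from checksums.yaml.
def digest_matches?(path, expected_hex)
  Digest::SHA512.file(path).hexdigest == expected_hex.strip.downcase
end
```

Calling `digest_matches?("data.tar.gz", expected)` with the SHA512 string listed above should return true for an unmodified archive member.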
|
data/HACKING
CHANGED
@@ -1,54 +1,36 @@
|
|
1
|
-
= Hacking Sanzang
|
1
|
+
= Hacking \Sanzang
|
2
2
|
|
3
|
-
==
|
4
|
-
|
5
|
-
* Use "rake test" to run all tests.
|
6
|
-
* Use "rake build" to create a gem package in the "dist" directory.
|
7
|
-
|
8
|
-
== Platforms
|
3
|
+
== Supported platforms
|
9
4
|
|
10
5
|
These programs should work on all platforms with Ruby 1.9 or later. Regular
|
11
|
-
testing takes place on GNU/Linux operating systems.
|
6
|
+
testing takes place on GNU/Linux operating systems. \Sanzang has not been fully
|
7
|
+
tested on other implementations of Ruby (e.g. JRuby, Rubinius, etc.).
|
12
8
|
|
13
|
-
== Languages
|
9
|
+
== Languages and scope
|
14
10
|
|
15
11
|
The translation program may not be very useful for all languages. It was
|
16
12
|
designed specifically for dealing with the more difficult aspects of
|
17
13
|
translating from ancient Chinese. For a language like Sanskrit or Tibetan, it
|
18
|
-
may not be so useful.
|
19
|
-
been investigated.
|
20
|
-
|
21
|
-
== Tables
|
22
|
-
|
23
|
-
A valid translation table is necessary when using the translator. More
|
24
|
-
information on this format is available in the README file. The following
|
25
|
-
tokens are reserved: "~|", "|", and "|~", and these should not be used in table
|
26
|
-
record data. No "escape sequences" are available for the inclusion of such
|
27
|
-
tokens as data.
|
14
|
+
may not be so useful.
|
28
15
|
|
29
|
-
== Multiprocessing
|
16
|
+
== Multiprocessing and fork(2)
|
30
17
|
|
31
18
|
Almost every Unix-like system supports the fork(2) system call and therefore has
|
32
|
-
the potential to run Sanzang in a multiprocessing mode for batches. However,
|
33
|
-
each Ruby implementation and port may be different, and Sanzang will check to
|
34
|
-
see if the fork method has been implemented before attempting to
|
35
|
-
|
36
|
-
|
37
|
-
|
38
|
-
|
39
|
-
|
40
|
-
fork method, yet the "parallel" library attempts to fork anyhow. To avoid
|
41
|
-
potential errors, the translator will not even attempt to use multiprocessing
|
42
|
-
on platforms in which the fork method is not implemented.
|
43
|
-
|
44
|
-
Note that for Windows, the "cygwin" port of Ruby does support
|
45
|
-
fork(2) perfectly, so Ruby 1.9+ on Cygwin can utilize the fork method and so
|
19
|
+
the potential to run \Sanzang in a multiprocessing mode for batches. However,
|
20
|
+
each Ruby implementation and port may be different, and \Sanzang will check to
|
21
|
+
see if the fork method has been implemented before attempting to utilize
|
22
|
+
multiprocessing. To avoid potential errors, the translator will not attempt to
|
23
|
+
use multiprocessing on platforms in which fork(2) is not implemented.
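One way such a check could be written is sketched below. This is not taken from the Sanzang source. It relies on the documented Ruby behavior that `Process.respond_to?(:fork)` returns false where fork is unavailable, and the helper names `fork_available?` and `worker_count` are hypothetical:

```ruby
# Sketch: decide whether Unix-style multiprocessing can be attempted.
def fork_available?
  Process.respond_to?(:fork)
end

# Hypothetical helper: fall back to a single process when fork is
# missing, and never return fewer than one worker.
def worker_count(requested)
  fork_available? ? [requested, 1].max : 1
end
```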
|
24
|
+
|
25
|
+
Note that for Windows, the Cygwin port of Ruby does support
|
26
|
+
fork(2) perfectly, so Ruby 1.9+ on Cygwin can utilize the fork method, and
|
46
27
|
it supports standard multiprocessing. Therefore, Cygwin is the most robust
|
47
|
-
environment for running Sanzang on a Windows
|
28
|
+
environment for running \Sanzang on a computer with Windows.
|
48
29
|
|
49
|
-
==
|
30
|
+
== Text encoding quirks
|
50
31
|
|
51
|
-
Converters for several encodings have not yet been implemented by
|
52
|
-
|
53
|
-
|
54
|
-
|
32
|
+
Converters for several encodings have not yet been implemented by YARV. Most of
|
33
|
+
these are obscure and not widely used. Perhaps the most notable is EUC-TW,
|
34
|
+
which is an old Unix encoding for traditional Chinese. Encoding conversion is
|
35
|
+
used by the _reflow_ command for formatting CJK text, but encoding conversion
|
36
|
+
is not required for translation.
|
data/MANUAL
ADDED
@@ -0,0 +1,312 @@
|
|
1
|
+
= The \Sanzang Manual
|
2
|
+
|
3
|
+
== Introduction
|
4
|
+
|
5
|
+
\Sanzang is a compact, cross-platform machine translation system. This program
|
6
|
+
was developed specifically to fill the need for a competent application for
|
7
|
+
aiding translators of the Chinese Buddhist canon into other languages. However,
|
8
|
+
the translation method it uses is general enough that it may extend to other
|
9
|
+
translation domains as well, especially translations with CJK source languages
|
10
|
+
(Chinese, Japanese, and Korean). The name \Sanzang (三藏) is a literal
|
11
|
+
translation of the Sanskrit word “Tripitaka,” which is a general term for the
|
12
|
+
Buddhist canon.
|
13
|
+
|
14
|
+
\Sanzang is implemented as a Unix style "command suite" program that executes
|
15
|
+
in your operating system's command shell. The _sanzang_ command includes
|
16
|
+
subcommands for carrying out each of the available functions of the system.
|
17
|
+
|
18
|
+
\Sanzang is programmed in the Ruby programming language and is free software
|
19
|
+
("free as in freedom"). This program is licensed under the GNU General Public
|
20
|
+
License, version 3, which ensures that anyone can use the program for any
|
21
|
+
purpose, and that extensions to this program will remain freely available to
|
22
|
+
others.
|
23
|
+
|
24
|
+
== Background
|
25
|
+
|
26
|
+
At the time of this writing, nearly all machine translation systems available
|
27
|
+
are based on statistical machine translation (SMT), a method which has not
|
28
|
+
proven very useful for ancient Chinese texts. \Sanzang was developed as an
|
29
|
+
alternative simply because there was the practical need for a better and more
|
30
|
+
reliable tool.
|
31
|
+
|
32
|
+
The most significant difference between \Sanzang and other machine translation
|
33
|
+
systems is that it does not attempt to interpret or translate grammar. Instead,
|
34
|
+
it simply translates names, terms, and phrases based on rules stored in a
|
35
|
+
translation table. The \Sanzang translator applies this translation table at
|
36
|
+
runtime to generate a text listing as the output.
|
37
|
+
|
38
|
+
This method is simple and efficient, and produces predictable results that can
|
39
|
+
be made immediately available to the user for verification. To facilitate this
|
40
|
+
task, all translation listings generated by the program are collated
|
41
|
+
line-by-line with the original source text.
|
42
|
+
|
43
|
+
== Translation Method
|
44
|
+
|
45
|
+
\Sanzang mainly provides a simple machine translation engine. To use this
|
46
|
+
translation engine, it will also be necessary to have a text file in which all
|
47
|
+
translation rules are defined. This is called a translation table, and its
|
48
|
+
format is simple delimited text. At runtime, the translation rules in this file
|
49
|
+
are applied to the source text to generate the translation listing.
|
50
|
+
|
51
|
+
In a translation table text file, each line is a translation rule containing a
|
52
|
+
source term and its equivalent meanings in other languages. Each line starts
|
53
|
+
with "~|", has records delimited by "|", and ends with "|~". In the translation
|
54
|
+
table, the first column represents the source language, while the subsequent
|
55
|
+
columns represent destination languages. In this example, we want to create a
|
56
|
+
table capable of rendering the following title into English:
|
57
|
+
|
58
|
+
金剛般若波羅蜜經
|
59
|
+
|
60
|
+
We start by creating a new text file, named _table.txt_, or something similar.
|
61
|
+
In this text file, we may add the following rules:
|
62
|
+
|
63
|
+
~|波羅蜜| pāramitā|~
|
64
|
+
~|金剛| diamond|~
|
65
|
+
~|般若| prajñā|~
|
66
|
+
~|經| sūtra/classic|~
|
67
|
+
|
68
|
+
Notice that spaces were included prior to the English equivalents. This is
|
69
|
+
because Chinese does not typically include spaces between words, so we need to
|
70
|
+
insert our own leading spaces as part of the translation rules we are defining.
|
71
|
+
After we have written this table file, we can then run the \Sanzang translation
|
72
|
+
engine with our table. When it reads the Chinese title as the input text, it
|
73
|
+
then produces the following translation listing:
|
74
|
+
|
75
|
+
1.1 金剛般若波羅蜜經
|
76
|
+
1.2 diamond prajñā pāramitā sūtra/classic
|
77
|
+
|
78
|
+
The program first sorted our terms by the length of the source column, and
|
79
|
+
then applied each of these rules in sequence. It then collated the output and
|
80
|
+
created a translation listing. In the left margin, we can see numbers denoting
|
81
|
+
the line number of the source text, along with the column number of the
|
82
|
+
translation table.
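The steps just described can be sketched in a few lines of Ruby. This is an illustration only, not the code from the gem: the method names are hypothetical, rules are assumed to be applied longest source term first, and the leading space of each translated line is trimmed purely for display.

```ruby
# Parse table text into rules. Each rule becomes an array of the form
# [source, target1, target2, ...].
def parse_table(text)
  text.each_line.map(&:chomp).reject(&:empty?).map do |line|
    unless line.start_with?("~|") && line.end_with?("|~")
      raise ArgumentError, "malformed rule: #{line.inspect}"
    end
    line[2...-2].split("|")
  end
end

# Apply every rule to one source line for the given table column,
# longest source term first (an assumption about the ordering).
def translate_line(rules, line, column)
  rules.sort_by { |rule| -rule[0].length }
       .reduce(line) { |text, rule| text.gsub(rule[0]) { rule[column] } }
end

# Collate each source line with its translations, numbering entries
# as "line.column".
def listing(rules, source_text)
  columns = rules.first.length
  source_text.each_line.map(&:chomp).each_with_index.flat_map do |line, i|
    ["#{i + 1}.1 #{line}"] +
      (1...columns).map do |c|
        "#{i + 1}.#{c + 1} #{translate_line(rules, line, c).strip}"
      end
  end
end
```

Matching longer terms first keeps a short rule from consuming part of a longer term before the longer rule has a chance to apply.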
|
83
|
+
|
84
|
+
As a final example, below is a snippet from an Indian Buddhist meditation text,
|
85
|
+
which was processed by the \Sanzang translation engine in the same manner:
|
86
|
+
|
87
|
+
105.1 阿難白佛言。
|
88
|
+
105.2 ānán bái-fó-yán ¶
|
89
|
+
105.3 ānanda addressed-the-buddha-saying ¶
|
90
|
+
|
91
|
+
106.1 唯然世尊。
|
92
|
+
106.2 wéi-rán shìzūn ¶
|
93
|
+
106.3 just-so bhagavān ¶
|
94
|
+
|
95
|
+
107.1 願樂欲聞。
|
96
|
+
107.2 yuànlè-yù-wén ¶
|
97
|
+
107.3 joyfully-wish-to-hear ¶
|
98
|
+
|
99
|
+
Here we can see a three-column translation table at work. The first column has
|
100
|
+
the traditional Chinese source text, the second column contains the Pinyin
|
101
|
+
transliteration, and the third column contains English. In this example we can
|
102
|
+
see that well-defined translation rules lead to a clear translation listing,
|
103
|
+
in which the meaning of the original text is readily understandable in English.
|
104
|
+
|
105
|
+
Considering the examples above, we can see that knowledge of the source
|
106
|
+
language and expertise in the relevant literary field is often still necessary.
|
107
|
+
Here again we can see that this translation system does not position itself as
|
108
|
+
a “silver bullet” for creating finished translations, but is rather a practical
|
109
|
+
tool for the purpose of assisting human readers and translators.
|
110
|
+
|
111
|
+
== Installation
|
112
|
+
|
113
|
+
=== Requirements
|
114
|
+
|
115
|
+
The standard way of installing \Sanzang is as a Ruby gem. To do this, the only
|
116
|
+
requirement is Ruby 1.9 or later, along with an Internet connection.
|
117
|
+
|
118
|
+
If you do not have an Internet connection available on the host computer, then
|
119
|
+
you may download the gem file and install it manually. If you choose this
|
120
|
+
route, then please be aware that the “sanzang” gem depends on the “parallel”
|
121
|
+
gem for multiprocessing.
|
122
|
+
|
123
|
+
For users of Microsoft Windows, please be aware that Windows ports of Ruby
|
124
|
+
typically do not include full multiprocessing support using the standard Unix
|
125
|
+
system calls. If you will be using \Sanzang in a Windows environment and you
|
126
|
+
require support for fast batch processing, then you should use the Cygwin port
|
127
|
+
of Ruby, which does not have this limitation. Unix-based platforms such as
|
128
|
+
Linux, BSD, and Mac OS X are unaffected by this issue.
|
129
|
+
|
130
|
+
In addition to installation requirements, it may also be very useful to have a
|
131
|
+
text editor that is aware of Unicode and other encodings, and able to display
|
132
|
+
multilingual texts. One such application that is known to work well for this
|
133
|
+
task is the _gedit_ text editor, which is free software and is available on a
|
134
|
+
variety of platforms.
|
135
|
+
|
136
|
+
=== Installation
|
137
|
+
|
138
|
+
To install \Sanzang, the following command should suffice.
|
139
|
+
|
140
|
+
# gem install sanzang
|
141
|
+
|
142
|
+
This command will download and install \Sanzang into your Ruby environment.
|
143
|
+
|
144
|
+
If you have installed Ruby 1.9 but cannot run the _gem_ command, then you may
|
145
|
+
need to set up your PATH environment variable first, so you can run _ruby_ and
|
146
|
+
_gem_ from the command line.
|
147
|
+
|
148
|
+
== Commands
|
149
|
+
|
150
|
+
\Sanzang functions are accessible through the _sanzang_ command and its
|
151
|
+
subcommands. Runtime behavior is set through command line options and
|
152
|
+
parameters. This allows _sanzang_ to be easily scripted and automated.
|
153
|
+
|
154
|
+
\Sanzang subcommands can also be abbreviated for the sake of convenience. For
|
155
|
+
example, the "translate" subcommand could be abbreviated as "trans", "tr", or
|
156
|
+
even the one letter "t".
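Abbreviations of this kind can be resolved by prefix matching. The following is a minimal sketch under that assumption; the method name `resolve_command` is hypothetical, and the actual dispatch code in the gem may differ:

```ruby
# Return the single subcommand that starts with the given abbreviation,
# or nil when the abbreviation is empty, unknown, or ambiguous.
def resolve_command(abbrev)
  commands = %w[batch reflow translate]
  return nil if abbrev.nil? || abbrev.empty?
  matches = commands.select { |name| name.start_with?(abbrev) }
  matches.length == 1 ? matches.first : nil
end
```

Because the three subcommands begin with different letters, every non-empty prefix of a command name is unambiguous.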
|
157
|
+
|
158
|
+
=== _sanzang_
|
159
|
+
|
160
|
+
The main _sanzang_ program acts as a front-end to subcommands, and also
|
161
|
+
includes options for printing platform and version information.
|
162
|
+
|
163
|
+
Usage: sanzang [options]
|
164
|
+
Usage: sanzang <command> [options] [args]
|
165
|
+
|
166
|
+
Sanzang commands:
|
167
|
+
batch translate many files in parallel
|
168
|
+
reflow format CJK text for translation
|
169
|
+
translate standard single text translation
|
170
|
+
|
171
|
+
Options:
|
172
|
+
-h, --help show this help message and exit
|
173
|
+
-P, --platform show platform information and exit
|
174
|
+
-V, --version show version number and exit
|
175
|
+
|
176
|
+
=== _sanzang_ _batch_
|
177
|
+
|
178
|
+
The _sanzang_ _batch_ command can translate files in parallel. A list of files
|
179
|
+
is read from STDIN, while progress information is printed to STDERR. The list
|
180
|
+
of output files written is printed to STDOUT at the end of the batch. The
|
181
|
+
output directory is specified as a parameter.
|
182
|
+
|
183
|
+
Usage: sanzang batch [options] table output_dir < queue
|
184
|
+
|
185
|
+
Options:
|
186
|
+
-h, --help show this help message and exit
|
187
|
+
-E, --encoding=ENC set data encoding to ENC
|
188
|
+
-L, --list-encodings list possible encodings
|
189
|
+
-j, --jobs=N allow N concurrent processes
|
190
|
+
|
191
|
+
=== _sanzang_ _reflow_
|
192
|
+
|
193
|
+
The command _sanzang_ _reflow_ can reformat Chinese, Japanese, or Korean text,
|
194
|
+
in which terms are often split between lines. This formatter "reflows" the text
|
195
|
+
based on its punctuation and horizontal spacing, separating the source text
|
196
|
+
into lines that are much safer for translation.
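The exact rules used by _reflow_ are not given here. As a rough sketch of the general idea, assuming that hard-wrapped lines are first joined and that a new line is then started after sentence-ending punctuation (the punctuation set and the method name `reflow` are assumptions):

```ruby
# Join hard-wrapped lines, then break after each run of sentence-ending
# punctuation so that no term is split across two lines.
def reflow(text)
  joined = text.lines.map(&:strip).join
  joined.gsub(/[。！？；]+/) { |punct| punct + "\n" }
end
```

With this sketch, a term such as 白佛言 that was split across two input lines ends up on a single output line.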
|
197
|
+
|
198
|
+
Usage: sanzang reflow [options]
|
199
|
+
|
200
|
+
Options:
|
201
|
+
-h, --help show this help message and exit
|
202
|
+
-E, --encoding=ENC set data encoding to ENC
|
203
|
+
-L, --list-encodings list possible encodings
|
204
|
+
-i, --infile=FILE read input text from FILE
|
205
|
+
-o, --outfile=FILE write output text to FILE
|
206
|
+
|
207
|
+
=== _sanzang_ _translate_
|
208
|
+
|
209
|
+
The command _sanzang_ _translate_ can perform translation of a single text
|
210
|
+
stream or file. By default, this command reads from STDIN and writes to STDOUT.
|
211
|
+
For concurrent translation of multiple files, see the _batch_ command.
|
212
|
+
|
213
|
+
Usage: sanzang translate [options] table
|
214
|
+
|
215
|
+
Options:
|
216
|
+
-h, --help show this help message and exit
|
217
|
+
-E, --encoding=ENC set data encoding to ENC
|
218
|
+
-L, --list-encodings list possible encodings
|
219
|
+
-i, --infile=FILE read input text from FILE
|
220
|
+
-o, --outfile=FILE write output text to FILE
|
221
|
+
|
222
|
+
== Basic Usage
|
223
|
+
|
224
|
+
In the following example, we are working with a small text that we want to
|
225
|
+
translate. With the first command, we reformat the text using _reflow_.
|
226
|
+
Then we run _translate_ with our translation table, to generate a translation
|
227
|
+
listing.
|
228
|
+
|
229
|
+
$ sanzang reflow -i xinjing.txt -o lines.txt
|
230
|
+
$ sanzang translate -i lines.txt -o trans.txt TABLE.txt
|
231
|
+
|
232
|
+
The next two commands illustrate how these programs use standard input and
|
233
|
+
output streams by default, how they can easily operate as text filters, and
|
234
|
+
the way that _sanzang_ subcommands can be abbreviated.
|
235
|
+
|
236
|
+
$ sanzang r < xinjing.txt | sanzang t TABLE.txt > trans.txt
|
237
|
+
$ cat xinjing.txt | sanzang r | sanzang t TABLE.txt | less
|
238
|
+
|
239
|
+
== Advanced Usage
|
240
|
+
|
241
|
+
=== Batch Mode
|
242
|
+
|
243
|
+
We may have thousands of texts that we want to generate translation listings
|
244
|
+
for with our translation table. For example, if our translation table was
|
245
|
+
updated recently, we may want to regenerate an entire corpus of translation
|
246
|
+
listings. To do this, we can use the _find_ command to retrieve the file paths
|
247
|
+
to our text files, and then pipe that output into _sanzang_ _batch_.
|
248
|
+
|
249
|
+
$ find /srv/texts -type f | sanzang batch TABLE.txt /srv/trans
|
250
|
+
|
251
|
+
This command will find all files in the location specified, and then feed the
|
252
|
+
file paths to _sanzang_ _batch_, which will process them as a batch. If
|
253
|
+
multiprocessing is supported on your platform, then the batch will be divided
|
254
|
+
among all available processors.
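Dividing a queue of file paths among several processes can be sketched as follows. This illustration uses plain fork rather than the "parallel" gem that \Sanzang depends on. The method names are hypothetical, and the work done for each path is left to the caller's block:

```ruby
# Split a queue of file paths into at most `jobs` roughly equal slices.
def partition_queue(paths, jobs)
  jobs = [jobs, 1].max
  slices = Array.new(jobs) { [] }
  paths.each_with_index { |path, i| slices[i % jobs] << path }
  slices.reject(&:empty?)
end

# Run the given block once per path. Slices run in child processes when
# fork is available and there is more than one slice, and sequentially
# in the current process otherwise.
def run_batch(paths, jobs)
  slices = partition_queue(paths, jobs)
  unless Process.respond_to?(:fork) && slices.length > 1
    slices.each { |slice| slice.each { |path| yield path } }
    return
  end
  pids = slices.map do |slice|
    fork { slice.each { |path| yield path } }
  end
  pids.each { |pid| Process.wait(pid) }
end
```

Each child process works on its own copy of the data, so results have to be written to files (as the batch command does with its output directory) rather than collected in memory.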
|
255
|
+
|
256
|
+
To determine whether multiprocessing is available in your version of Ruby, you
|
257
|
+
can run the _sanzang_ command with the "-P" option:
|
258
|
+
|
259
|
+
$ sanzang -P
|
260
|
+
|
261
|
+
If you see that "Fork implemented" is "true", then your platform supports Unix
|
262
|
+
style multiprocessing, and you can gain performance benefits from using the
|
263
|
+
_batch_ command. You should also examine the value for "Processors found".
|
264
|
+
This is the number of logical processors detected on your platform.
|
265
|
+
|
266
|
+
If you see that only one processor is detected and you know that there are more
|
267
|
+
available on your platform, then you may want to use the "-j" flag when running
|
268
|
+
the batch command, to manually specify the number of processes to use. This
|
269
|
+
number can be set to the number of CPU cores available on your system. This
|
270
|
+
option may be necessary on some less common platforms, such as some BSD
|
271
|
+
distributions and commercial Unix variants.
|
272
|
+
|
273
|
+
Microsoft Windows ports of Ruby typically do not support Unix style
|
274
|
+
multiprocessing. If you require higher performance by utilizing multiple CPU
|
275
|
+
cores, then you should look into the Cygwin port of Ruby, which does not have
|
276
|
+
this limitation.
|
277
|
+
|
278
|
+
The performance benefits of running in batch mode with multiprocessing may be
|
279
|
+
very significant. The performance increases are typically proportional to the
|
280
|
+
number of CPU cores available on the system. For example, on a large SMP system
|
281
|
+
with 50 processors available, _sanzang_ _batch_ can run up to 50x as fast.
|
282
|
+
|
283
|
+
=== Text Encodings
|
284
|
+
|
285
|
+
\Sanzang supports many possible text encodings. Option "-L" will list all
|
286
|
+
available text encodings. Option "-E" will set the encoding to be used for all
|
287
|
+
text data such as input texts, output texts, and table files. The other program
|
288
|
+
I/O, such as messages for the terminal, will still be in the default encoding
|
289
|
+
of the environment. For example, in a Windows environment that by default uses
|
290
|
+
the IBM-437 encoding, specifying "-E" with a value of "UTF-16LE" will cause
|
291
|
+
\Sanzang to read and write all text data in UTF-16LE, but all other program
|
292
|
+
messages will still be displayed in the console's native IBM-437 encoding.
|
293
|
+
|
294
|
+
$ sanzang t -E UTF-16LE -i in.txt -o out.txt TABLE.txt
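How the data encoding is chosen, including the default behavior described in this section, might be sketched as follows. The method name `data_encoding` is hypothetical, and the sketch assumes that an explicit "-E" value always takes precedence over the environment:

```ruby
# Pick the encoding for text data. An explicit -E value wins. Otherwise
# the environment default is used, except that IBM437 is replaced by
# UTF-8.
def data_encoding(requested, default = Encoding.default_external)
  return Encoding.find(requested) if requested
  default == Encoding::IBM437 ? Encoding::UTF_8 : default
end
```

Messages printed for the terminal are not affected by this choice and stay in the encoding of the environment.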
|
295
|
+
|
296
|
+
If the "-E" option is not specified, then \Sanzang will use the default
|
297
|
+
encoding inherited from the environment. For example, a GNU/Linux user running
|
298
|
+
\Sanzang in a UTF-8 terminal will by default have all text data read and
|
299
|
+
written in the UTF-8 encoding. The one *exception* to this is for
|
300
|
+
environments using the IBM-437 encoding (typically an old Windows command
|
301
|
+
shell). In this case, \Sanzang will take pity on you and automatically switch
|
302
|
+
to UTF-8 by default, as if you had specified the option "-E" with value
|
303
|
+
"UTF-8".
|
304
|
+
|
305
|
+
== Responsible Use
|
306
|
+
|
307
|
+
With comprehensive translation tables, \Sanzang can often be quite accurate
|
308
|
+
and effective. However, this program is still comparable to a simple machine,
|
309
|
+
and it can never replace a human translator. Please understand the scope of
|
310
|
+
this translation system when using it. No machines can take responsibility for
|
311
|
+
a poor translation. In the end, it is you who are responsible for any and all
|
312
|
+
publications.
|
data/README
CHANGED
@@ -1,280 +1,50 @@
|
|
1
|
-
= Sanzang (三藏)
|
2
|
-
|
3
|
-
== Contents
|
4
|
-
|
5
|
-
* Introduction
|
6
|
-
* Concepts
|
7
|
-
* Installation
|
8
|
-
* Components
|
9
|
-
* Basic Usage
|
10
|
-
* Advanced Usage
|
11
|
-
* Responsible Use
|
1
|
+
= \Sanzang (三藏)
|
12
2
|
|
13
3
|
== Introduction
|
14
4
|
|
15
|
-
Sanzang is a compact
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
a translation of "trepitaka," the title for someone who is a master of such
|
23
|
-
teachings.
|
24
|
-
|
25
|
-
Sanzang is implemented as a small set of programs written in the Ruby
|
26
|
-
programming language. This system is free software (“free as in freedom”), and
|
27
|
-
it is licensed under the GNU General Public License, version 3. This ensures
|
28
|
-
that anyone can use the program for any purpose, and that any extensions to
|
29
|
-
Sanzang will remain freely available to others.
|
30
|
-
|
31
|
-
== Background
|
32
|
-
|
33
|
-
The most significant difference between Sanzang and other machine translation
|
34
|
-
systems is that it does not attempt to interpret grammar in any way. Instead,
|
35
|
-
it relies on direct translation of names, terms, and phrases based on a large
|
36
|
-
translation table. The Sanzang translator simply applies this translation table
|
37
|
-
at runtime, and does not attempt to interpret grammar or syntax in any way
|
38
|
-
whatsoever. The end result is that the accuracy of the translation is highly
|
39
|
-
dependent on the accuracy of the translation table.
|
40
|
-
|
41
|
-
The strength of the Sanzang method is that it is extremely simple and easy to
|
42
|
-
work with, and eliminates virtually all complexity in the translation process.
|
43
|
-
This system will never produce incorrect syntax because it does not interpret
|
44
|
-
syntax in the first place. This method is also efficient and yields predictable
|
45
|
-
results that can be made immediately available to the user for verification. To
|
46
|
-
facilitate this task, all translation listings are collated line-by-line with
|
47
|
-
the original source text.
|
48
|
-
|
49
|
-
== Concepts
|
50
|
-
|
51
|
-
Sanzang provides mainly a simple translation engine. For any actual
|
52
|
-
translation work, Sanzang requires a translation table in which all
|
53
|
-
translation rules are defined. This translation table is stored in a simple
|
54
|
-
text file. Each line is a record containing a source term and its equivalent
|
55
|
-
meanings in other languages. Each line starts with "~|", has records delimited
|
56
|
-
by "|", and ends with "|~". In a table, the first column represents the source
|
57
|
-
language, while the subsequent columns represent destination languages. In
|
58
|
-
this example, we want to create a table capable of rendering the following
|
59
|
-
title into English:
|
60
|
-
|
61
|
-
金剛般若波羅蜜經
|
62
|
-
|
63
|
-
We start by creating a new text file, named TABLE.txt or something similar. In
|
64
|
-
this text file, we may add the following rules:
|
65
|
-
|
66
|
-
~|波羅蜜| pāramitā|~
|
67
|
-
~|金剛| diamond|~
|
68
|
-
~|般若| prajñā|~
|
69
|
-
~|經| sūtra|~
|
70
|
-
|
71
|
-
Did you notice that we included spaces prior to the translations of these
|
72
|
-
terms? This is because Chinese does not typically include spaces between
|
73
|
-
words, so we need to insert our own leading spaces as part of the rules we are
|
74
|
-
defining. After we have written this table file, we can run the Sanzang
|
75
|
-
translator with our table. When it reads the Chinese title as the input text,
|
76
|
-
it then produces the following translation listing:
|
77
|
-
|
78
|
-
1.1 金剛般若波羅蜜經
|
79
|
-
1.2 diamond prajñā pāramitā sūtra
|
5
|
+
\Sanzang is a compact and simple cross-platform machine translation system.
|
6
|
+
This program is especially useful for translating from CJK languages (Chinese,
|
7
|
+
Korean, and Japanese), and it is very suitable for ancient and otherwise
|
8
|
+
difficult texts. Due to its origins in translating texts from the Chinese
|
9
|
+
Buddhist canon, the program is called \Sanzang (三藏), a literal translation of
|
10
|
+
the Sanskrit word "Tripitaka," which is a general term for the Buddhist canon.
|
11
|
+
As demonstrated by the _sanzang_ program itself:
|
80
12
|
|
81
|
-
|
82
|
-
|
83
|
-
|
84
|
-
|
85
|
-
translation table.
|
13
|
+
$ echo '三藏' | sanzang t sztab
|
14
|
+
[1.1] 三藏
|
15
|
+
[1.2] sānzàng
|
16
|
+
[1.3] tripiṭaka
|
86
17
|
|
87
|
-
|
88
|
-
|
18
|
+
Anyone can learn how to use \Sanzang, and use it to read and analyze texts.
|
19
|
+
Unlike most other systems, \Sanzang is small and approachable. Any user can
|
20
|
+
develop his or her own translation rules, and these are simply stored in a text
|
21
|
+
file that the program can read. For full details, refer to the MANUAL.
|
89
22
|
|
90
|
-
|
91
|
-
|
92
|
-
105.3 ānanda addressed-the-buddha-saying ¶
|
23
|
+
\Sanzang is free software ("free as in freedom"), and it is released under the
|
24
|
+
GNU General Public License, version 3.
|
93
25
|
|
94
|
-
|
95
|
-
106.2 wéi-rán shìzūn ¶
|
96
|
-
106.3 just-so bhagavān ¶
|
26
|
+
== Quick Install
|
97
27
|
|
98
|
-
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
Here we can see a three-column translation table at work. The first column has
|
103
|
-
the traditional Chinese source text, the second column contains the Pinyin
|
104
|
-
transliteration, and the third column contains English. In this example we can
|
105
|
-
see that well-defined translation rules lead to a clear translation listing,
|
106
|
-
at which the meaning of the original text is readily understandable in
|
107
|
-
English. If we wished to add additional columns for simplified Chinese,
|
108
|
-
Vietnamese, Japanese, Spanish, French, German, Russian, or any other languages,
|
109
|
-
then these could all be handled similarly without any technical difficulties.
|
110
|
-
|
111
|
-
Comprehensive translation tables could be quite large, containing tens of
|
112
|
-
thousands of entries. However, the work of building such a table is not so
|
113
|
-
significant compared to the long-term benefits which may be gained from such
|
114
|
-
tables. In addition, rules in these translation tables may be translated into
|
115
|
-
other languages as well. There is a potential here to assist readers all over
|
116
|
-
the world with understanding otherwise difficult works.
|
117
|
-
|
118
|
-
Considering the examples above, we can see that knowledge of the source
|
119
|
-
language and expertise in the relevant literary field is often still necessary.
|
120
|
-
Here again we can see that this translation system does not position itself as
|
121
|
-
a “silver bullet” for creating finished translations, but is rather a practical
|
122
|
-
set of utilities for the purpose of assisting human readers and translators.
|
123
|
-
|
124
|
-
== Installation
|
125
|
-
|
126
|
-
=== Requirements
|
127
|
-
|
128
|
-
The Sanzang system can be installed either as a Ruby gem, or manually from
|
129
|
-
an archive file. The only prerequisite to using Sanzang is:
|
130
|
-
|
131
|
-
* Ruby 1.9 or later
|
132
|
-
|
133
|
-
The "parallel" gem is required by Sanzang, but is installed automatically when
|
134
|
-
installing Sanzang using the standard method. Using the "parallel" gem, Sanzang
|
135
|
-
can support multiprocessing in batch mode (if the platform supports it).
|
136
|
-
Currently this method of multiprocessing will work automatically on Ruby ports
|
137
|
-
that implement the Process#fork system call.
|
138
|
-
|
139
|
-
In addition to the actual runtime requirements, it may also be very useful to
|
140
|
-
have a text editor that is aware of Unicode and other encodings, and able to
|
141
|
-
display multilingual texts. One such application that is known to work well
|
142
|
-
for this task is the _gedit_ text editor, which is free software and also
|
143
|
-
available on a variety of platforms.
|
144
|
-
|
145
|
-
=== Installation
|
146
|
-
|
147
|
-
To install Sanzang, the following command should suffice.
|
28
|
+
To install \Sanzang, the prerequisite is Ruby 1.9 or later. After Ruby has been
|
29
|
+
installed, you can run the _gem_ command from a command shell to automatically
|
30
|
+
download and install \Sanzang onto your computer.
|
148
31
|
|
149
32
|
# gem install sanzang
|
150
33
|
|
151
|
-
|
152
|
-
|
153
|
-
_gem_ from the command line.
|
154
|
-
|
155
|
-
== Components
|
156
|
-
|
157
|
-
The programs in Sanzang are designed in a traditional Unix style in which
|
158
|
-
programs are executed in a terminal, and program settings are specified
|
159
|
-
through command line options and parameters. This allows Sanzang programs
|
160
|
-
to be easily scripted and automated.
|
161
|
-
|
162
|
-
=== sanzang-reflow
|
163
|
-
|
164
|
-
The program sanzang-reflow can reformat Chinese, Japanese, or Korean text, in
|
165
|
-
which terms are often split between lines. This formatter "reflows" the text
|
166
|
-
instead based on its punctuation and horizontal spacing, separating the source
|
167
|
-
text into lines that are much safer for translation using the sanzang-translate
|
168
|
-
program.
|
169
|
-
|
170
|
-
Usage: sanzang-reflow [options]
|
171
|
-
|
172
|
-
Options:
|
173
|
-
-h, --help show this help message and exit
|
174
|
-
-E, --encoding=ENC set data encoding to ENC
|
175
|
-
-L, --list-encodings list possible encodings
|
176
|
-
-i, --infile=FILE read input text from FILE
|
177
|
-
-o, --outfile=FILE write output text to FILE
|
178
|
-
-V, --version show version number and exit
|
179
|
-
|
180
|
-
=== sanzang-translate
|
181
|
-
|
182
|
-
The program sanzang-translate (1) reads a translation table file, (2) applies
|
183
|
-
this table's rules to an input text, and then (3) generates a translation
|
184
|
-
listing. This program can also run in a special batch mode that can utilize
|
185
|
-
multiprocessing (multiple processors and processor cores) for high
|
186
|
-
performance.
|
187
|
-
|
188
|
-
Usage: sanzang-translate [options] table
|
189
|
-
Usage: sanzang-translate -B output_dir table < file_list
|
190
|
-
|
191
|
-
Options:
|
192
|
-
-h, --help show this help message and exit
|
193
|
-
-B, --batch-dir=DIR process from a queue into DIR
|
194
|
-
-E, --encoding=ENC set data encoding to ENC
|
195
|
-
-L, --list-encodings list possible encodings
|
196
|
-
-i, --infile=FILE read input text from FILE
|
197
|
-
-o, --outfile=FILE write output text to FILE
|
198
|
-
-P, --platform show platform information
|
199
|
-
-V, --version show version number and exit
|
200
|
-
|
201
|
-
== Basic Usage
|
202
|
-
|
203
|
-
=== Formatting and translating a single text
|
204
|
-
|
205
|
-
In the following example, we are working with a small text that we want to
|
206
|
-
translate. With the first command, we reformat the text using sanzang-reflow.
|
207
|
-
Then we run the sanzang-translate program with our translation table, to
|
208
|
-
generate a translation listing.
|
209
|
-
|
210
|
-
$ sanzang-reflow -i xinjing.txt -o lines.txt
|
211
|
-
$ sanzang-translate -i lines.txt -o trans.txt TABLE.txt
|
212
|
-
|
213
|
-
=== Redirecting I/O
|
214
|
-
|
215
|
-
The next two commands illustrate how these programs use standard input and
|
216
|
-
output streams by default, and how they can easily operate as text filters.
|
217
|
-
|
218
|
-
$ sanzang-reflow -i xinjing.txt | sanzang-translate -o trans.txt TABLE.txt
|
219
|
-
$ cat xinjing.txt | sanzang-reflow | sanzang-translate TABLE.txt | less
|
220
|
-
|
221
|
-
== Advanced Usage
|
222
|
-
|
223
|
-
=== Batch Mode and Multiprocessing
|
224
|
-
|
225
|
-
In the following example, we may have several thousand texts that we want to
|
226
|
-
run through sanzang-translate with our translation table. For example, if our
|
227
|
-
translation table was updated recently, we may want to regenerate our corpus
|
228
|
-
of translation listings. To do this, we can use the "find" command to retrieve
|
229
|
-
the file paths to our text files, and then pipe that output into the Sanzang
|
230
|
-
translation program.
|
231
|
-
|
232
|
-
$ find /srv/texts -type f | sanzang-translate -B /srv/trans TABLE.txt
|
233
|
-
|
234
|
-
This command will find all files in the location specified, and then feed the
|
235
|
-
file paths to sanzang-translate, which will process them as a batch. If the
|
236
|
-
"parallel" gem is available and functioning on the system, then the batch will
|
237
|
-
be divided among all available processors.
|
238
|
-
|
239
|
-
If this gem has been installed, then when running in batch mode, if we have six
|
240
|
-
CPU cores on the local machine, then we should be able to expect six
|
241
|
-
translation processes running concurrently. The exception to this is on the
|
242
|
-
"mswin" and "mingw" platforms, which do not have the necessary system calls
|
243
|
-
for Unix style multiprocessing. In this case, running Sanzang in the
|
244
|
-
Cygwin environment is a viable alternative.
|
245
|
-
|
246
|
-
The performance benefits of running with the "parallel" library can be very
|
247
|
-
significant, leading to a series of translation listings being generated in a
|
248
|
-
mere fraction of the time it would take to process them otherwise. This
|
249
|
-
performance gain is typically proportional to the number of processors and
|
250
|
-
processor cores available on the local system.
|
251
|
-
|
252
|
-
=== Text Encodings
|
253
|
-
|
254
|
-
Sanzang supports many possible text encodings. Option "-L" will list all
|
255
|
-
available text encodings. Option "-E" will set the encoding to be used for all
|
256
|
-
text data such as input texts, output texts, and table files. The other program
|
257
|
-
I/O, such as messages for the terminal, will still be in the default encoding
|
258
|
-
of the environment. For example, in a Windows environment that by default uses
|
259
|
-
the IBM-437 encoding, specifying "-E" with a value of "UTF-16LE" will cause
|
260
|
-
Sanzang to read and write all text data in UTF-16LE, but all other program
|
261
|
-
messages will still be displayed in the console's native IBM-437 encoding.
|
34
|
+
After this, you should be able to run the _sanzang_ command. Run the following
|
35
|
+
command to verify your installation and print platform information.
|
262
36
|
|
263
|
-
|
37
|
+
# sanzang -P
|
264
38
|
|
265
|
-
|
266
|
-
inherited from the environment. For example, a GNU/Linux user running Sanzang in
|
267
|
-
a UTF-8 terminal will by default have all text data read and written to in the
|
268
|
-
UTF-8 encoding. The one *exception* to this is for environments using the
|
269
|
-
IBM-437 encoding (typically an old Windows command shell). In this case,
|
270
|
-
Sanzang will take pity on you and automatically switch to UTF-8 by default, as
|
271
|
-
if you had specified the option "-E" with value "UTF-8".
|
39
|
+
This command should show a summary of your platform for running \Sanzang.
|
272
40
|
|
273
|
-
|
41
|
+
Ruby platform: x86_64-linux
|
42
|
+
Ruby version: 2.0.0
|
43
|
+
External encoding: UTF-8
|
44
|
+
Internal encoding: none
|
45
|
+
Fork implemented: true
|
46
|
+
Parallel version: 0.6.4
|
47
|
+
Processors found: 4
|
48
|
+
Sanzang version: 1.0.0
|
274
49
|
|
275
|
-
|
276
|
-
and effective. However, this program is still comparable to a simple machine,
|
277
|
-
and it can never replace a human translator. Please understand the scope of
|
278
|
-
this translation system when using it. No machines can take responsibility for
|
279
|
-
a poor translation. In the end, it is you who are responsible for any and all
|
280
|
-
publications.
|
50
|
+
You now have \Sanzang installed and running on your computer.
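Note: the platform summary printed by "sanzang -P" above can be roughly approximated with core Ruby alone. The following is a minimal sketch, not Sanzang's own implementation. The method name platform_summary is hypothetical, and the Parallel version, processor count, and Sanzang version fields are left out because they depend on the installed gems.

```ruby
# Hypothetical helper, not part of Sanzang: gathers a few of the platform
# facts that "sanzang -P" reports, using only core Ruby constants and methods.
def platform_summary
  {
    "Ruby platform"     => RUBY_PLATFORM,
    "Ruby version"      => RUBY_VERSION,
    "External encoding" => Encoding.default_external.to_s,
    # Encoding.default_internal is nil when no internal encoding is set.
    "Internal encoding" => (Encoding.default_internal || "none").to_s,
    # respond_to?(:fork) is false on platforms where fork is unavailable.
    "Fork implemented"  => Process.respond_to?(:fork).to_s
  }
end

# Print one "label: value" line per entry.
platform_summary.each { |label, value| puts "#{label}: #{value}" }
```

If "Fork implemented" is false here, Unix style multiprocessing is probably unavailable on that platform.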
|