sanzang 0.0.3 → 1.0.0
- checksums.yaml +7 -0
- data/HACKING +22 -40
- data/MANUAL +312 -0
- data/README +35 -265
- data/bin/{sanzang-reflow → sanzang} +1 -1
- data/lib/sanzang.rb +12 -35
- data/lib/sanzang/batch_translator.rb +77 -0
- data/lib/sanzang/command/batch.rb +131 -0
- data/lib/sanzang/command/reflow.rb +35 -32
- data/lib/sanzang/command/sanzang_cmd.rb +132 -0
- data/lib/sanzang/command/translate.rb +47 -70
- data/lib/sanzang/text_formatter.rb +1 -1
- data/lib/sanzang/translation_table.rb +34 -49
- data/lib/sanzang/translator.rb +1 -65
- data/lib/sanzang/version.rb +3 -2
- data/test/tc_reflow_encodings.rb +2 -2
- data/test/tc_simple_translation.rb +8 -12
- data/test/utf-8/stage_3.txt +4 -0
- metadata +25 -31
- data/bin/sanzang-translate +0 -21
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: ee6e8f6e1b7cbd116de22589a0089faa73b35726
+  data.tar.gz: 2110931d7a863e52dc6ade5b17c060d4dc9ab12e
+SHA512:
+  metadata.gz: 65710d20a08d6b20f82d3c909468c8e5d03b54fd5f96657247d0eaededc587d1394c66183660e50c654990f0bccfcb1efc600af5672e518cbaa777cd9a5ba3f8
+  data.tar.gz: 039837a397ed9f5561d5f2c6af23fe4a8991b305d134a7f4cfe6be3f67bad0506bf1ca6d8bf332586f51243a8bfbd1acb8302beb60b107e4ca165b56f66550ea
data/HACKING
CHANGED
@@ -1,54 +1,36 @@
-= Hacking Sanzang
+= Hacking \Sanzang
 
-==
-
-* Use "rake test" to run all tests.
-* Use "rake build" to create a gem package in the "dist" directory.
-
-== Platforms
+== Supported platforms
 
 These programs should work on all platforms with Ruby 1.9 or later. Regular
-testing takes place on GNU/Linux operating systems.
+testing takes place on GNU/Linux operating systems. \Sanzang has not been fully
+tested on other implementations of Ruby (e.g. JRuby, Rubinius, etc.).
 
-== Languages
+== Languages and scope
 
 The translation program may not be very useful for all languages. It was
 designed specifically for dealing with the more difficult aspects of
 translating from ancient Chinese. For a language like Sanskrit or Tibetan, it
-may not be so useful.
-been investigated.
-
-== Tables
-
-A valid translation table is necessary when using the translator. More
-information on this format is available in the README file. The following
-tokens are reserved: "~|", "|", and "|~", and these should not be used in table
-record data. No "escape sequences" are available for the inclusion of such
-tokens as data.
+may not be so useful.
 
-== Multiprocessing
+== Multiprocessing and fork(2)
 
 Most every Unix-like system supports the fork(2) system call and therefore has
-the potential to run Sanzang in a multiprocessing mode for batches. However,
-each Ruby implementation and port may be different, and Sanzang will check to
-see if the fork method has been implemented before attempting to
-
-
-
-
-
-fork method, yet the "parallel" library attempts to fork anyhow. To avoid
-potential errors, the translator will not even attempt to use multiprocessing
-on platforms in which the fork method is not implemented.
-
-Note that for Windows, the "cygwin" port of Ruby does implement does support
-fork(2) perfectly, so Ruby 1.9+ on Cygwin can utilize the fork method and so
+the potential to run \Sanzang in a multiprocessing mode for batches. However,
+each Ruby implementation and port may be different, and \Sanzang will check to
+see if the fork method has been implemented before attempting to utilize
+multiprocessing. To avoid potential errors, the translator will not attempt to
+use multiprocessing on platforms in which fork(2) is not implemented.
+
+Note that for Windows, the Cygwin port of Ruby does support
+fork(2) perfectly, so Ruby 1.9+ on Cygwin can utilize the fork method, so
 it supports standard multiprocessing. Therefore, Cygwin is the most robust
-environment for running Sanzang on a Windows
+environment for running \Sanzang on a computer with Windows.
 
-==
+== Text encoding quirks
 
-Converters for several encodings have not yet been implemented by
-
-
-
+Converters for several encodings have not yet been implemented by YARV. Most of
+these are obscure and not widely used. Perhaps the most notable is EUC-TW,
+which is an old Unix encoding for traditional Chinese. Encoding conversion is
+used by the _reflow_ command for formatting CJK text, but encoding conversion
+is not required for translation.
data/MANUAL
ADDED
@@ -0,0 +1,312 @@
|
|
1
|
+
= The \Sanzang Manual
|
2
|
+
|
3
|
+
== Introduction
|
4
|
+
|
5
|
+
\Sanzang is a compact, cross-platform machine translation system. This program
|
6
|
+
was developed specifically to fill the need for a competent application for
|
7
|
+
aiding translators of the Chinese Buddhist canon into other languages. However,
|
8
|
+
the translation method it uses is general enough that it may extend to other
|
9
|
+
translation domains as well, especially translations with CJK source languages
|
10
|
+
(Chinese, Japanese, and Korean). The name \Sanzang (三藏) is a literal
|
11
|
+
translation of the Sanskrit word “Tripitaka,” which is a general term for the
|
12
|
+
Buddhist canon.
|
13
|
+
|
14
|
+
\Sanzang is implemented as a Unix style "command suite" program that executes
|
15
|
+
in your operating system's command shell. The _sanzang_ command includes
|
16
|
+
subcommands for carrying out each of the available functions of the system.
|
17
|
+
|
18
|
+
\Sanzang is programmed in the Ruby programming language and is free software
|
19
|
+
("free as in freedom"). This program is licensed under the GNU General Public
|
20
|
+
License, version 3, which ensures that anyone can use the program for any
|
21
|
+
purpose, and that extensions to this program will remain freely available to
|
22
|
+
others.
|
23
|
+
|
24
|
+
== Background
|
25
|
+
|
26
|
+
At the time of this writing, nearly all machine translation systems available
|
27
|
+
are based on statistical machine translation (SMT), a method which has not
|
28
|
+
proven very useful for ancient Chinese texts. \Sanzang was developed as an
|
29
|
+
alternative simply because there was the practical need for a better and more
|
30
|
+
reliable tool.
|
31
|
+
|
32
|
+
The most significant difference between \Sanzang and other machine translation
|
33
|
+
systems is that it does not attempt to interpret or translate grammar. Instead,
|
34
|
+
it simply translates names, terms, and phrases based on rules stored in a
|
35
|
+
translation table. The \Sanzang translator applies this translation table at
|
36
|
+
runtime to generate a text listing as the output.
|
37
|
+
|
38
|
+
This method is simple and efficient, and produces predictable results that can
|
39
|
+
be made immediately available to the user for verification. To facilitate this
|
40
|
+
task, all translation listings generated by the program are collated
|
41
|
+
line-by-line with the original source text.
|
42
|
+
|
43
|
+
== Translation Method
|
44
|
+
|
45
|
+
\Sanzang provides mainly a simple machine translation engine. To use this
|
46
|
+
translation engine, it will also be necessary to have a text file in which all
|
47
|
+
translation rules are defined. This is called a translation table, and its
|
48
|
+
format is simple delimited text. At runtime, the translation rules in this file
|
49
|
+
are applied to the source text to generate the translation listing.
|
50
|
+
|
51
|
+
In a translation table text file, each line is a translation rule containing a
|
52
|
+
source term and its equivalent meanings in other languages. Each line starts
|
53
|
+
with "~|", has records delimited by "|", and ends with "|~". In the translation
|
54
|
+
table, the first column represents the source language, while the subsequent
|
55
|
+
columns represent destination languages. In this example, we want to create a
|
56
|
+
table capable of rendering the following title into English:
|
57
|
+
|
58
|
+
金剛般若波羅蜜經
|
59
|
+
|
60
|
+
We start by creating a new text file, named _table.txt_, or something similar.
|
61
|
+
In this text file, we may add the following rules:
|
62
|
+
|
63
|
+
~|波羅蜜| pāramitā|~
|
64
|
+
~|金剛| diamond|~
|
65
|
+
~|般若| prajñā|~
|
66
|
+
~|經| sūtra/classic|~
|
67
|
+
|
68
|
+
Notice that spaces were included prior to the English equivalents. This is
|
69
|
+
because Chinese does not typically include spaces between words, so we need to
|
70
|
+
insert our own leading spaces as part of the translation rules we are defining.
|
71
|
+
After we have written this table file, we can then run the \Sanzang translation
|
72
|
+
engine with our table. When it reads the Chinese title as the input text, it
|
73
|
+
then produces the following translation listing:
|
74
|
+
|
75
|
+
1.1 金剛般若波羅蜜經
|
76
|
+
1.2 diamond prajñā pāramitā sūtra/classic
|
77
|
+
|
78
|
+
The program first sorted our terms by the length of the source column, and
|
79
|
+
then applied each of these rules in sequence. It then collated the output and
|
80
|
+
created a translation listing. In the left margin, we can see numbers denoting
|
81
|
+
the line number of the source text, along with the column number of the
|
82
|
+
translation table.
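The table format and rule ordering described in this part of the MANUAL can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the helper names parse_table and translate_line are hypothetical and are not part of the sanzang gem, the sort is assumed to put the longest source terms first, and the translated text is stripped of outer whitespace purely for display.

```ruby
# Illustrative sketch only; these helper names are hypothetical and are not
# part of the sanzang gem's API.

# Parse translation table text into rules (arrays of column strings).
# Each rule line starts with "~|", separates records with "|", ends with "|~".
def parse_table(text)
  rules = text.each_line.map(&:strip).reject(&:empty?).map do |line|
    unless line.start_with?("~|") && line.end_with?("|~")
      raise ArgumentError, "malformed rule: #{line.inspect}"
    end
    line[2...-2].split("|", -1)
  end
  # Assumption: rules with the longest source terms are applied first.
  rules.sort_by { |rule| -rule[0].length }
end

# Build the listing for one source line: the source text as column 1,
# then one translated line per destination column of the table.
def translate_line(rules, source, line_no)
  listing = ["#{line_no}.1 #{source}"]
  (1...rules.first.length).each do |col|
    # Apply every rule in sequence; the block form of gsub inserts the
    # replacement literally, without interpreting backslash sequences.
    text = rules.reduce(source) { |acc, rule| acc.gsub(rule[0]) { rule[col] } }
    listing << "#{line_no}.#{col + 1} #{text.strip}"
  end
  listing
end

table = <<~TABLE
  ~|波羅蜜| pāramitā|~
  ~|金剛| diamond|~
  ~|般若| prajñā|~
  ~|經| sūtra/classic|~
TABLE

puts translate_line(parse_table(table), "金剛般若波羅蜜經", 1)
```

Sorting by source-term length matters because a short term such as 經 could otherwise consume part of a longer term before the longer rule has a chance to match.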
|
83
|
+
|
84
|
+
As a final example, below is a snippet from an Indian Buddhist meditation text,
|
85
|
+
which was processed by the \Sanzang translation engine in the same manner:
|
86
|
+
|
87
|
+
105.1 阿難白佛言。
|
88
|
+
105.2 ānán bái-fó-yán ¶
|
89
|
+
105.3 ānanda addressed-the-buddha-saying ¶
|
90
|
+
|
91
|
+
106.1 唯然世尊。
|
92
|
+
106.2 wéi-rán shìzūn ¶
|
93
|
+
106.3 just-so bhagavān ¶
|
94
|
+
|
95
|
+
107.1 願樂欲聞。
|
96
|
+
107.2 yuànlè-yù-wén ¶
|
97
|
+
107.3 joyfully-wish-to-hear ¶
|
98
|
+
|
99
|
+
Here we can see a three-column translation table at work. The first column has
|
100
|
+
the traditional Chinese source text, the second column contains the Pinyin
|
101
|
+
transliteration, and the third column contains English. In this example we can
|
102
|
+
see that well-defined translation rules lead to a clear translation listing,
|
103
|
+
in which the meaning of the original text is readily understandable in English.
|
104
|
+
|
105
|
+
Considering the examples above, we can see that knowledge of the source
|
106
|
+
language and expertise in the relevant literary field is often still necessary.
|
107
|
+
Here again we can see that this translation system does not position itself as
|
108
|
+
a “silver bullet” for creating finished translations, but is rather a practical
|
109
|
+
tool for the purpose of assisting human readers and translators.
|
110
|
+
|
111
|
+
== Installation
|
112
|
+
|
113
|
+
=== Requirements
|
114
|
+
|
115
|
+
The standard way of installing \Sanzang is as a Ruby gem. To do this, the only
|
116
|
+
requirement is Ruby 1.9 or later, along with an Internet connection.
|
117
|
+
|
118
|
+
If you do not have an Internet connection available on the host computer, then
|
119
|
+
you may download the gem file and install it manually. If you choose this
|
120
|
+
route, then please be aware that the “sanzang” gem depends on the “parallel”
|
121
|
+
gem for multiprocessing.
|
122
|
+
|
123
|
+
For users of Microsoft Windows, please be aware that Windows ports of Ruby
|
124
|
+
typically do not include full multiprocessing support using the standard Unix
|
125
|
+
system calls. If you will be using \Sanzang in a Windows environment and you
|
126
|
+
require support for fast batch processing, then you should use the Cygwin port
|
127
|
+
of Ruby, which does not have this limitation. Unix-based platforms such as
|
128
|
+
Linux, BSD, and Mac OS X are unaffected by this issue.
|
129
|
+
|
130
|
+
In addition to installation requirements, it may also be very useful to have a
|
131
|
+
text editor that is aware of Unicode and other encodings, and able to display
|
132
|
+
multilingual texts. One such application that is known to work well for this
|
133
|
+
task is the _gedit_ text editor, which is free software and is available on a
|
134
|
+
variety of platforms.
|
135
|
+
|
136
|
+
=== Installation
|
137
|
+
|
138
|
+
To install \Sanzang, the following command should suffice.
|
139
|
+
|
140
|
+
# gem install sanzang
|
141
|
+
|
142
|
+
This command will download and install \Sanzang into your Ruby environment.
|
143
|
+
|
144
|
+
If you have installed Ruby 1.9 but cannot run the _gem_ command, then you may
|
145
|
+
need to set up your PATH environment variable first, so you can run _ruby_ and
|
146
|
+
_gem_ from the command line.
|
147
|
+
|
148
|
+
== Commands
|
149
|
+
|
150
|
+
\Sanzang functions are accessible through the _sanzang_ command and its
|
151
|
+
subcommands. Runtime behavior is set through command line options and
|
152
|
+
parameters. This allows _sanzang_ to be easily scripted and automated.
|
153
|
+
|
154
|
+
\Sanzang subcommands can also be abbreviated for the sake of convenience. For
|
155
|
+
example, the "translate" subcommand could be abbreviated as "trans", "tr", or
|
156
|
+
even the one letter "t".
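One plausible way to support such abbreviations is unique-prefix matching against the list of subcommands. The sketch below assumes that approach; the constant SANZANG_COMMANDS and the method resolve_command are hypothetical names, and the MANUAL does not say how the real program resolves abbreviations.

```ruby
# Hypothetical sketch; not the sanzang gem's actual dispatch code.
SANZANG_COMMANDS = %w[batch reflow translate].freeze

# Return the full command name for an abbreviation, or nil when the
# abbreviation is empty, unknown, or matches more than one command.
def resolve_command(abbrev)
  return nil if abbrev.nil? || abbrev.empty?
  matches = SANZANG_COMMANDS.select { |name| name.start_with?(abbrev) }
  matches.length == 1 ? matches.first : nil
end
```

With the three subcommands listed in this manual, every first letter is already unique, which is why a single letter such as "t" is enough.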
|
157
|
+
|
158
|
+
=== _sanzang_
|
159
|
+
|
160
|
+
The main _sanzang_ program acts as a front-end to subcommands, and also
|
161
|
+
includes options for printing platform and version information.
|
162
|
+
|
163
|
+
Usage: sanzang [options]
|
164
|
+
Usage: sanzang <command> [options] [args]
|
165
|
+
|
166
|
+
Sanzang commands:
|
167
|
+
batch translate many files in parallel
|
168
|
+
reflow format CJK text for translation
|
169
|
+
translate standard single text translation
|
170
|
+
|
171
|
+
Options:
|
172
|
+
-h, --help show this help message and exit
|
173
|
+
-P, --platform show platform information and exit
|
174
|
+
-V, --version show version number and exit
|
175
|
+
|
176
|
+
=== _sanzang_ _batch_
|
177
|
+
|
178
|
+
The _sanzang_ _batch_ command can translate files in parallel. A list of files
|
179
|
+
is read from STDIN, while progress information is printed to STDERR. The list
|
180
|
+
of output files written is printed to STDOUT at the end of the batch. The
|
181
|
+
output directory is specified as a parameter.
|
182
|
+
|
183
|
+
Usage: sanzang batch [options] table output_dir < queue
|
184
|
+
|
185
|
+
Options:
|
186
|
+
-h, --help show this help message and exit
|
187
|
+
-E, --encoding=ENC set data encoding to ENC
|
188
|
+
-L, --list-encodings list possible encodings
|
189
|
+
-j, --jobs=N allow N concurrent processes
|
190
|
+
|
191
|
+
=== _sanzang_ _reflow_
|
192
|
+
|
193
|
+
The command _sanzang_ _reflow_ can reformat Chinese, Japanese, or Korean text,
|
194
|
+
in which terms are often split between lines. This formatter "reflows" the text
|
195
|
+
based on its punctuation and horizontal spacing, separating the source text
|
196
|
+
into lines that are much safer for translation.
|
197
|
+
|
198
|
+
Usage: sanzang reflow [options]
|
199
|
+
|
200
|
+
Options:
|
201
|
+
-h, --help show this help message and exit
|
202
|
+
-E, --encoding=ENC set data encoding to ENC
|
203
|
+
-L, --list-encodings list possible encodings
|
204
|
+
-i, --infile=FILE read input text from FILE
|
205
|
+
-o, --outfile=FILE write output text to FILE
|
206
|
+
|
207
|
+
=== _sanzang_ _translate_
|
208
|
+
|
209
|
+
The command _sanzang_ _translate_ can perform translation of a single text
|
210
|
+
stream or file. By default, this command reads from STDIN and writes to STDOUT.
|
211
|
+
For concurrent translation of multiple files, see the _batch_ command.
|
212
|
+
|
213
|
+
Usage: sanzang translate [options] table
|
214
|
+
|
215
|
+
Options:
|
216
|
+
-h, --help show this help message and exit
|
217
|
+
-E, --encoding=ENC set data encoding to ENC
|
218
|
+
-L, --list-encodings list possible encodings
|
219
|
+
-i, --infile=FILE read input text from FILE
|
220
|
+
-o, --outfile=FILE write output text to FILE
|
221
|
+
|
222
|
+
== Basic Usage
|
223
|
+
|
224
|
+
In the following example, we are working with a small text that we want to
|
225
|
+
translate. With the first command, we reformat the text using _reflow_.
|
226
|
+
Then we run _translate_ with our translation table, to generate a translation
|
227
|
+
listing.
|
228
|
+
|
229
|
+
$ sanzang reflow -i xinjing.txt -o lines.txt
|
230
|
+
$ sanzang translate -i lines.txt -o trans.txt TABLE.txt
|
231
|
+
|
232
|
+
The next two commands illustrate how these programs use standard input and
|
233
|
+
output streams by default, how they can easily operate as text filters, and
|
234
|
+
the way that _sanzang_ subcommands can be abbreviated.
|
235
|
+
|
236
|
+
$ sanzang r < xinjing.txt | sanzang t TABLE.txt > trans.txt
|
237
|
+
$ cat xinjing.txt | sanzang r | sanzang t TABLE.txt | less
|
238
|
+
|
239
|
+
== Advanced Usage
|
240
|
+
|
241
|
+
=== Batch Mode
|
242
|
+
|
243
|
+
We may have thousands of texts that we want to generate translation listings
|
244
|
+
for with our translation table. For example, if our translation table was
|
245
|
+
updated recently, we may want to regenerate an entire corpus of translation
|
246
|
+
listings. To do this, we can use the _find_ command to retrieve the file paths
|
247
|
+
to our text files, and then pipe that output into _sanzang_ _batch_.
|
248
|
+
|
249
|
+
$ find /srv/texts -type f | sanzang batch TABLE.txt /srv/trans
|
250
|
+
|
251
|
+
This command will find all files in the location specified, and then feed the
|
252
|
+
file paths to _sanzang_ _batch_, which will process them as a batch. If
|
253
|
+
multiprocessing is supported on your platform, then the batch will be divided
|
254
|
+
among all available processors.
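The idea of dividing a batch among processes can be sketched with Ruby's built-in fork. Note that this is a swapped-in technique for illustration: according to this manual, the gem itself depends on the "parallel" gem for multiprocessing, whereas the hypothetical run_batch method below calls fork directly and falls back to sequential processing when fork is not available.

```ruby
require "fileutils"

# Hypothetical sketch of fork-based batch processing; not sanzang's own code,
# which relies on the "parallel" gem rather than calling fork directly.
# The block receives the text of one input file and returns the output text.
def run_batch(paths, out_dir, jobs)
  FileUtils.mkdir_p(out_dir)
  convert = lambda do |path|
    File.write(File.join(out_dir, File.basename(path)), yield(File.read(path)))
  end

  if jobs > 1 && !paths.empty? && Process.respond_to?(:fork)
    # Split the queue into at most `jobs` slices, one child process per slice.
    slice_size = (paths.length.to_f / jobs).ceil
    pids = paths.each_slice(slice_size).map do |slice|
      fork do
        slice.each(&convert)
        exit!(0) # leave the child without running the parent's at_exit hooks
      end
    end
    pids.each { |pid| Process.wait(pid) }
  else
    # No usable fork (or a single job): process the queue sequentially.
    paths.each(&convert)
  end

  # Report the output paths, in the same order as the input queue.
  paths.map { |path| File.join(out_dir, File.basename(path)) }
end
```

A caller would pass the file paths read from standard input, the output directory, and a job count, with a block that turns the text of one file into its translation listing. Input files that share a base name would overwrite each other in this simplified version.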
|
255
|
+
|
256
|
+
To determine whether multiprocessing is available in your version of Ruby, you
|
257
|
+
can run the _sanzang_ command with the "-P" option:
|
258
|
+
|
259
|
+
$ sanzang -P
|
260
|
+
|
261
|
+
If you see that "Fork implemented" is "true", then your platform supports Unix
|
262
|
+
style multiprocessing, and you can gain performance benefits from using the
|
263
|
+
_batch_ command. You should also examine the value for "Processors found".
|
264
|
+
This is the number of logical processors detected on your platform.
|
265
|
+
|
266
|
+
If you see that only one processor is detected and you know that there are more
|
267
|
+
available on your platform, then you may want to use the "-j" flag when running
|
268
|
+
the batch command, to manually specify the number of processes to use. This
|
269
|
+
number can be set to the number of CPU cores available on your system. This
|
270
|
+
option may be necessary on some less common platforms, such as some BSD
|
271
|
+
distributions and commercial Unix variants.
|
272
|
+
|
273
|
+
Microsoft Windows ports of Ruby typically do not support Unix style
|
274
|
+
multiprocessing. If you require higher performance and want to utilize multiple CPU
|
275
|
+
cores, then you should look into the Cygwin port of Ruby, which does not have
|
276
|
+
this limitation.
|
277
|
+
|
278
|
+
The performance benefits of running in batch mode with multiprocessing may be
|
279
|
+
very significant. The performance increases are typically proportional to the
|
280
|
+
number of CPU cores available on the system. For example, on a large SMP system
|
281
|
+
with 50 processors available, _sanzang_ _batch_ can run up to 50x as fast.
|
282
|
+
|
283
|
+
=== Text Encodings
|
284
|
+
|
285
|
+
\Sanzang supports many possible text encodings. Option "-L" will list all
|
286
|
+
available text encodings. Option "-E" will set the encoding to be used for all
|
287
|
+
text data such as input texts, output texts, and table files. The other program
|
288
|
+
I/O, such as messages for the terminal, will still be in the default encoding
|
289
|
+
of the environment. For example, in a Windows environment that by default uses
|
290
|
+
the IBM-437 encoding, specifying "-E" with a value of "UTF-16LE" will cause
|
291
|
+
\Sanzang to read and write all text data in UTF-16LE, but all other program
|
292
|
+
messages will still be displayed in the console's native IBM-437 encoding.
|
293
|
+
|
294
|
+
$ sanzang t -E UTF-16LE -i in.txt -o out.txt TABLE.txt
|
295
|
+
|
296
|
+
If the "-E" option is not specified, then \Sanzang will use the default
|
297
|
+
encoding inherited from the environment. For example, a GNU/Linux user running
|
298
|
+
\Sanzang in a UTF-8 terminal will by default have all text data read and
|
299
|
+
written in the UTF-8 encoding. The one *exception* to this is for
|
300
|
+
environments using the IBM-437 encoding (typically an old Windows command
|
301
|
+
shell). In this case, \Sanzang will take pity on you and automatically switch
|
302
|
+
to UTF-8 by default, as if you had specified the option "-E" with value
|
303
|
+
"UTF-8".
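The default-encoding rule described here can be expressed as a small Ruby function. This is a minimal sketch, assuming the decision depends only on the "-E" value and the environment's default external encoding; the method name data_encoding is hypothetical and is not taken from the gem.

```ruby
# Hypothetical sketch of the default-encoding rule described in the MANUAL.
# option:      the value given to "-E", or nil when the option is absent
# environment: the default encoding inherited from the environment
def data_encoding(option = nil, environment = Encoding.default_external)
  # An explicit "-E" value always wins.
  return Encoding.find(option) if option
  # The one exception: an IBM-437 environment falls back to UTF-8.
  environment == Encoding::IBM437 ? Encoding::UTF_8 : environment
end
```

Only text data (input texts, output texts, and table files) would use the encoding chosen this way; program messages stay in the console's own encoding, as the manual explains.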
|
304
|
+
|
305
|
+
== Responsible Use
|
306
|
+
|
307
|
+
With comprehensive translation tables, \Sanzang can often be quite accurate
|
308
|
+
and effective. However, this program is still comparable to a simple machine,
|
309
|
+
and it can never replace a human translator. Please understand the scope of
|
310
|
+
this translation system when using it. No machines can take responsibility for
|
311
|
+
a poor translation. In the end, it is you who are responsible for any and all
|
312
|
+
publications.
|
data/README
CHANGED
@@ -1,280 +1,50 @@
|
|
1
|
-
= Sanzang (三藏)
|
2
|
-
|
3
|
-
== Contents
|
4
|
-
|
5
|
-
* Introduction
|
6
|
-
* Concepts
|
7
|
-
* Installation
|
8
|
-
* Components
|
9
|
-
* Basic Usage
|
10
|
-
* Advanced Usage
|
11
|
-
* Responsible Use
|
1
|
+
= \Sanzang (三藏)
|
12
2
|
|
13
3
|
== Introduction
|
14
4
|
|
15
|
-
Sanzang is a compact
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
a translation of "trepitaka," the title for someone who is a master of such
|
23
|
-
teachings.
|
24
|
-
|
25
|
-
Sanzang is implemented as a small set of programs written in the Ruby
|
26
|
-
programming language. This system is free software (“free as in freedom”), and
|
27
|
-
it is licensed under the GNU General Public License, version 3. This ensures
|
28
|
-
that anyone can use the program for any purpose, and that any extensions to
|
29
|
-
Sanzang will remain freely available to others.
|
30
|
-
|
31
|
-
== Background
|
32
|
-
|
33
|
-
The most significant difference between Sanzang and other machine translation
|
34
|
-
systems is that it does not attempt to interpret grammar in any way. Instead,
|
35
|
-
it relies on direct translation of names, terms, and phrases based on a large
|
36
|
-
translation table. The Sanzang translator simply applies this translation table
|
37
|
-
at runtime, and does not attempt to interpret grammar or syntax in any way
|
38
|
-
whatsoever. The end result is that the accuracy of the translation is highly
|
39
|
-
dependent on the accuracy of the translation table.
|
40
|
-
|
41
|
-
The strength of the Sanzang method is that it is extremely simple and easy to
|
42
|
-
work with, and eliminates virtually all complexity in the translation process.
|
43
|
-
This system will never produce incorrect syntax because it does not interpret
|
44
|
-
syntax in the first place. This method is also efficient and yields predictable
|
45
|
-
results that can be made immediately available to the user for verification. To
|
46
|
-
facilitate this task, all translation listings are collated line-by-line with
|
47
|
-
the original source text.
|
48
|
-
|
49
|
-
== Concepts
|
50
|
-
|
51
|
-
Sanzang provides mainly a simple translation engine. For any actual
|
52
|
-
translation work, Sanzang requires a translation table in which all
|
53
|
-
translation rules are defined. This translation table is stored in a simple
|
54
|
-
text file. Each line is a record containing a source term and its equivalent
|
55
|
-
meanings in other languages. Each line starts with "~|", has records delimited
|
56
|
-
by "|", and ends with "|~". In a table, the first column represents the source
|
57
|
-
language, while the subsequent columns represent destination languages. In
|
58
|
-
this example, we want to create a table capable of rendering the following
|
59
|
-
title into English:
|
60
|
-
|
61
|
-
金剛般若波羅蜜經
|
62
|
-
|
63
|
-
We start by creating a new text file, named TABLE.txt or something similar. In
|
64
|
-
this text file, we may add the following rules:
|
65
|
-
|
66
|
-
~|波羅蜜| pāramitā|~
|
67
|
-
~|金剛| diamond|~
|
68
|
-
~|般若| prajñā|~
|
69
|
-
~|經| sūtra|~
|
70
|
-
|
71
|
-
Did you notice that we included spaces prior to the translations of these
|
72
|
-
terms? This is because Chinese does not typically include spaces between
|
73
|
-
words, so we need to insert our own leading spaces as part of the rules we are
|
74
|
-
defining. After we have written this table file, we can run the Sanzang
|
75
|
-
translator with our table. When it reads the Chinese title as the input text,
|
76
|
-
it then produces the following translation listing:
|
77
|
-
|
78
|
-
1.1 金剛般若波羅蜜經
|
79
|
-
1.2 diamond prajñā pāramitā sūtra
|
5
|
+
\Sanzang is a compact and simple cross-platform machine translation system.
|
6
|
+
This program is especially useful for translating from CJK languages (Chinese,
|
7
|
+
Korean, and Japanese), and it is very suitable for ancient and otherwise
|
8
|
+
difficult texts. Due to its origins in translating texts from the Chinese
|
9
|
+
Buddhist canon, the program is called \Sanzang (三藏), a literal translation of
|
10
|
+
the Sanskrit word "Tripitaka," which is a general term for the Buddhist canon.
|
11
|
+
As demonstrated by the _sanzang_ program itself:
|
80
12
|
|
81
|
-
|
82
|
-
|
83
|
-
|
84
|
-
|
85
|
-
translation table.
|
13
|
+
$ echo '三藏' | sanzang t sztab
|
14
|
+
[1.1] 三藏
|
15
|
+
[1.2] sānzàng
|
16
|
+
[1.3] tripiṭaka
|
86
17
|
|
87
|
-
|
88
|
-
|
18
|
+
Anyone can learn how to use \Sanzang, and use it to read and analyze texts.
|
19
|
+
Unlike most other systems, \Sanzang is small and approachable. Any user can
|
20
|
+
develop his or her own translation rules, and these are simply stored in a text
|
21
|
+
file that the program can read. For full details, refer to the MANUAL.
|
89
22
|
|
90
|
-
|
91
|
-
|
92
|
-
105.3 ānanda addressed-the-buddha-saying ¶
|
23
|
+
\Sanzang is free software ("free as in freedom"), and it is released under the
|
24
|
+
GNU General Public License, version 3.
|
93
25
|
|
94
|
-
|
95
|
-
106.2 wéi-rán shìzūn ¶
|
96
|
-
106.3 just-so bhagavān ¶
|
26
|
+
== Quick Install
|
97
27
|
|
98
|
-
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
Here we can see a three-column translation table at work. The first column has
|
103
|
-
the traditional Chinese source text, the second column contains the Pinyin
|
104
|
-
transliteration, and the third column contains English. In this example we can
|
105
|
-
see that well-defined translation rules lead to a clear translation listing,
|
106
|
-
at which the meaning of the original text is readily understandable in
|
107
|
-
English. If we wished to add additional columns for simplified Chinese,
|
108
|
-
Vietnamese, Japanese, Spanish, French, German, Russian, or any other languages,
|
109
|
-
then these could all be handled similarly without any technical difficulties.
|
110
|
-
|
111
|
-
Comprehensive translation tables could be quite large, containing tens of
|
112
|
-
thousands of entries. However, the work of building such a table is not so
|
113
|
-
significant compared to the long-term benefits which may be gained from such
|
114
|
-
tables. In addition, rules in these translation tables may be translated into
|
115
|
-
other languages as well. There is a potential here to assist readers all over
|
116
|
-
the world with understanding otherwise difficult works.
|
117
|
-
|
118
|
-
Considering the examples above, we can see that knowledge of the source
|
119
|
-
language and expertise in the relevant literary field is often still necessary.
|
120
|
-
Here again we can see that this translation system does not position itself as
|
121
|
-
a “silver bullet” for creating finished translations, but is rather a practical
|
122
|
-
set of utilities for the purpose of assisting human readers and translators.
|
123
|
-
|
124
|
-
== Installation
|
125
|
-
|
126
|
-
=== Requirements
|
127
|
-
|
128
|
-
The Sanzang system can be installed either as a Ruby gem, or manually from
|
129
|
-
an archive file. The only prerequisite to using Sanzang is:
|
130
|
-
|
131
|
-
* Ruby 1.9 or later
|
132
|
-
|
133
|
-
The "parallel" gem is required by Sanzang, but is installed automatically when
|
134
|
-
installing Sanzang using the standard method. Using the "parallel" gem, Sanzang
|
135
|
-
can support multiprocessing in batch mode (if the platform supports it).
|
136
|
-
Currently this method of multiprocessing will work automatically on Ruby ports
|
137
|
-
that implement the Process#fork system call.
|
138
|
-
|
139
|
-
In addition to the actual runtime requirements, it may also be very useful to
|
140
|
-
have a text editor that is aware of Unicode and other encodings, and able to
|
141
|
-
display multilingual texts. One such application that is known to work well
|
142
|
-
for this task is the _gedit_ text editor, which is free software and also
|
143
|
-
available on a variety of platforms.
|
144
|
-
|
145
|
-
=== Installation
|
146
|
-
|
147
|
-
To install Sanzang, the following command should suffice.
|
28
|
+
To install \Sanzang, the prerequisite is Ruby 1.9 or later. After Ruby has been
|
29
|
+
installed, you can run the _gem_ command from a command shell to automatically
|
30
|
+
download and install \Sanzang onto your computer.
|
148
31
|
|
149
32
|
# gem install sanzang
|
150
33
|
|
151
|
-
|
152
|
-
|
153
|
-
_gem_ from the command line.
|
154
|
-
|
155
|
-
== Components
|
156
|
-
|
157
|
-
The programs in Sanzang are designed in a traditional Unix style in which
|
158
|
-
programs are executed in a terminal, and program settings are specified
|
159
|
-
through command line options and parameters. This allows Sanzang programs
|
160
|
-
to be easily scripted and automated.
|
161
|
-
|
162
|
-
=== sanzang-reflow
|
163
|
-
|
164
|
-
The program sanzang-reflow can reformat Chinese, Japanese, or Korean text, in
|
165
|
-
which terms are often split between lines. This formatter "reflows" the text
|
166
|
-
instead based on its punctuation and horizontal spacing, separating the source
|
167
|
-
text into lines that are much safer for translation using the sanzang-translate
|
168
|
-
program.
|
169
|
-
|
170
|
-
Usage: sanzang-reflow [options]
|
171
|
-
|
172
|
-
Options:
|
173
|
-
-h, --help show this help message and exit
|
174
|
-
-E, --encoding=ENC set data encoding to ENC
|
175
|
-
-L, --list-encodings list possible encodings
|
176
|
-
-i, --infile=FILE read input text from FILE
|
177
|
-
-o, --outfile=FILE write output text to FILE
|
178
|
-
-V, --version show version number and exit
|
179
|
-
|
180
|
-
=== sanzang-translate
|
181
|
-
|
182
|
-
The program sanzang-translate (1) reads a translation table file, (2) applies
|
183
|
-
this table's rules to an input text, and then (3) generates a translation
|
184
|
-
listing. This program can also run in a special batch mode that can utilize
|
185
|
-
multiprocessing (multiple processors and processor cores) for high
|
186
|
-
performance.
|
187
|
-
|
188
|
-
Usage: sanzang-translate [options] table
|
189
|
-
Usage: sanzang-translate -B output_dir table < file_list
|
190
|
-
|
191
|
-
Options:
|
192
|
-
-h, --help show this help message and exit
|
193
|
-
-B, --batch-dir=DIR process from a queue into DIR
|
194
|
-
-E, --encoding=ENC set data encoding to ENC
|
195
|
-
-L, --list-encodings list possible encodings
|
196
|
-
-i, --infile=FILE read input text from FILE
|
197
|
-
-o, --outfile=FILE write output text to FILE
|
198
|
-
-P, --platform show platform information
|
199
|
-
-V, --version show version number and exit
|
200
|
-
|
201
|
-
== Basic Usage
|
202
|
-
|
203
|
-
=== Formatting and translating a single text
|
204
|
-
|
205
|
-
In the following example, we are working with a small text that we want to
|
206
|
-
translate. With the first command, we reformat the text using sanzang-reflow.
|
207
|
-
Then we run the sanzang-translate program with our translation table, to
|
208
|
-
generate a translation listing.
|
209
|
-
|
210
|
-
$ sanzang-reflow -i xinjing.txt -o lines.txt
|
211
|
-
$ sanzang-translate -i lines.txt -o trans.txt TABLE.txt
|
212
|
-
|
213
|
-
=== Redirecting I/O
|
214
|
-
|
215
|
-
The next two commands illustrate how these programs use standard input and
|
216
|
-
output streams by default, and how they can easily operate as text filters.
|
217
|
-
|
218
|
-
$ sanzang-reflow -i xinjing.txt | sanzang-translate -o trans.txt TABLE.txt
|
219
|
-
$ cat xinjing.txt | sanzang-reflow | sanzang-translate TABLE.txt | less
|
220
|
-
|
221
|
-
== Advanced Usage
|
222
|
-
|
223
|
-
=== Batch Mode and Multiprocessing
|
224
|
-
|
225
|
-
In the following example, we may have several thousand texts that we want to
|
226
|
-
run through sanzang-translate with our translation table. For example, if our
|
227
|
-
translation table was updated recently, we may want to regenerate our corpus
|
228
|
-
of translation listings. To do this, we can use the "find" command to retrieve
|
229
|
-
the file paths to our text files, and then pipe that output into the Sanzang
|
230
|
-
translation program.
|
231
|
-
|
232
|
-
$ find /srv/texts -type f | sanzang-translate -B /srv/trans TABLE.txt
|
233
|
-
|
234
|
-
This command will find all files in the location specified, and then feed the
|
235
|
-
file paths to sanzang-translate, which will process them as a batch. If the
|
236
|
-
"parallel" gem is available and functioning on the system, then the batch will
|
237
|
-
be divided among all available processors.
|
238
|
-
|
239
|
-
If this gem has been installed, then when running in batch mode, if we have six
|
240
|
-
CPU cores on the local machine, then we should be able to expect six
|
241
|
-
translation processes running concurrently. The exception to this is on the
|
242
|
-
"mswin" and "mingw" platforms, which do not have the necessary system calls
|
243
|
-
for Unix style multiprocessing. In this case, running Sanzang in the
|
244
|
-
Cygwin environment is a viable alternative.
|
245
|
-
|
246
|
-
The performance benefits of running with the "parallel" library can be very
|
247
|
-
significant, leading to a series of translation listings being generated in a
|
248
|
-
mere fraction of the time it would take to process them otherwise. This
|
249
|
-
performance gain is typically proportional to the number of processors and
|
250
|
-
processor cores available on the local system.
|
251
|
-
|
252
|
-
=== Text Encodings
|
253
|
-
|
254
|
-
Sanzang supports many possible text encodings. Option "-L" will list all
|
255
|
-
available text encodings. Option "-E" will set the encoding to be used for all
|
256
|
-
text data such as input texts, output texts, and table files. The other program
|
257
|
-
I/O, such as messages for the terminal, will still be in the default encoding
|
258
|
-
of the environment. For example, in a Windows environment that by default uses
|
259
|
-
the IBM-437 encoding, specifying "-E" with a value of "UTF-16LE" will cause
|
260
|
-
Sanzang to read and write all text data in UTF-16LE, but all other program
|
261
|
-
messages will still be displayed in the console's native IBM-437 encoding.
|
34
|
+
After this, you should be able to run the _sanzang_ command. Run the following
|
35
|
+
command to verify your installation and print platform information.
|
262
36
|
|
263
|
-
|
37
|
+
# sanzang -P
|
264
38
|
|
265
|
-
|
266
|
-
inherited from the environment. For example, a GNU/Linux user running Sanzang in
|
267
|
-
a UTF-8 terminal will by default have all text data read and written to in the
|
268
|
-
UTF-8 encoding. The one *exception* to this is for environments using the
|
269
|
-
IBM-437 encoding (typically an old Windows command shell). In this case,
|
270
|
-
Sanzang will take pity on you and automatically switch to UTF-8 by default, as
|
271
|
-
if you had specified the option "-E" with value "UTF-8".
|
39
|
+
This command should show a summary of your platform for running \Sanzang.
|
272
40
|
|
273
|
-
|
41
|
+
Ruby platform: x86_64-linux
|
42
|
+
Ruby version: 2.0.0
|
43
|
+
External encoding: UTF-8
|
44
|
+
Internal encoding: none
|
45
|
+
Fork implemented: true
|
46
|
+
Parallel version: 0.6.4
|
47
|
+
Processors found: 4
|
48
|
+
Sanzang version: 1.0.0
|
274
49
|
|
275
|
-
|
276
|
-
and effective. However, this program is still comparable to a simple machine,
|
277
|
-
and it can never replace a human translator. Please understand the scope of
|
278
|
-
this translation system when using it. No machines can take responsibility for
|
279
|
-
a poor translation. In the end, it is you who are responsible for any and all
|
280
|
-
publications.
|
50
|
+
You now have \Sanzang installed and running on your computer.
|