sanzang 0.0.3 → 1.0.0

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: ee6e8f6e1b7cbd116de22589a0089faa73b35726
4
+ data.tar.gz: 2110931d7a863e52dc6ade5b17c060d4dc9ab12e
5
+ SHA512:
6
+ metadata.gz: 65710d20a08d6b20f82d3c909468c8e5d03b54fd5f96657247d0eaededc587d1394c66183660e50c654990f0bccfcb1efc600af5672e518cbaa777cd9a5ba3f8
7
+ data.tar.gz: 039837a397ed9f5561d5f2c6af23fe4a8991b305d134a7f4cfe6be3f67bad0506bf1ca6d8bf332586f51243a8bfbd1acb8302beb60b107e4ca165b56f66550ea
data/HACKING CHANGED
@@ -1,54 +1,36 @@
1
- = Hacking Sanzang
1
+ = Hacking \Sanzang
2
2
 
3
- == Testing and building
4
-
5
- * Use "rake test" to run all tests.
6
- * Use "rake build" to create a gem package in the "dist" directory.
7
-
8
- == Platforms
3
+ == Supported platforms
9
4
 
10
5
  These programs should work on all platforms with Ruby 1.9 or later. Regular
11
- testing takes place on GNU/Linux operating systems.
6
+ testing takes place on GNU/Linux operating systems. \Sanzang has not been fully
7
+ tested on other implementations of Ruby (e.g. JRuby, Rubinius, etc.).
12
8
 
13
- == Languages
9
+ == Languages and scope
14
10
 
15
11
  The translation program may not be very useful for all languages. It was
16
12
  designed specifically for dealing with the more difficult aspects of
17
13
  translating from ancient Chinese. For a language like Sanskrit or Tibetan, it
18
- may not be so useful. However, the limitations of this program have not yet
19
- been investigated.
20
-
21
- == Tables
22
-
23
- A valid translation table is necessary when using the translator. More
24
- information on this format is available in the README file. The following
25
- tokens are reserved: "~|", "|", and "|~", and these should not be used in table
26
- record data. No "escape sequences" are available for the inclusion of such
27
- tokens as data.
14
+ may not be so useful.
28
15
 
29
- == Multiprocessing
16
+ == Multiprocessing and fork(2)
30
17
 
31
18
  Most every Unix-like system supports the fork(2) system call and therefore has
32
- the potential to run Sanzang in a multiprocessing mode for batches. However,
33
- each Ruby implementation and port may be different, and Sanzang will check to
34
- see if the fork method has been implemented before attempting to use the
35
- "parallel" gem for multiprocessing.
36
-
37
- Batch mode will print a warning message if it cannot use the "parallel" gem.
38
- It will then process the batch, but will not utilize multiprocessing, so it
39
- will run slower. The "mingw" and "mswin" ports of Ruby do not have support for
40
- fork method, yet the "parallel" library attempts to fork anyhow. To avoid
41
- potential errors, the translator will not even attempt to use multiprocessing
42
- on platforms in which the fork method is not implemented.
43
-
44
- Note that for Windows, the "cygwin" port of Ruby does implement does support
45
- fork(2) perfectly, so Ruby 1.9+ on Cygwin can utilize the fork method and so
19
+ the potential to run \Sanzang in a multiprocessing mode for batches. However,
20
+ each Ruby implementation and port may be different, and \Sanzang will check to
21
+ see if the fork method has been implemented before attempting to utilize
22
+ multiprocessing. To avoid potential errors, the translator will not attempt to
23
+ use multiprocessing on platforms in which fork(2) is not implemented.
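The fork check described above can be sketched in plain Ruby. This is only an illustration under stated assumptions: the helper names `fork_available?` and `process_batch` are hypothetical, and the sketch forks one child per item directly instead of reproducing whatever mechanism \Sanzang itself uses. The one documented fact it relies on is that `Process.respond_to?(:fork)` is false on Ruby ports that do not implement fork.

```ruby
# Hypothetical sketch: names below are illustrative, not Sanzang's own code.

# True when this Ruby port implements Process.fork (false on mswin/mingw).
def fork_available?
  Process.respond_to?(:fork)
end

# Run the given block once per item. Uses one forked child per item when
# fork is implemented, otherwise falls back to a sequential loop.
# Returns :parallel or :sequential so the caller can report the mode.
def process_batch(items)
  if fork_available?
    pids = items.map do |item|
      fork do
        yield item
        # exit! skips at_exit hooks inherited from the parent process.
        exit!(0)
      end
    end
    pids.each { |pid| Process.wait(pid) }
    :parallel
  else
    warn "fork is not implemented on this platform; running sequentially"
    items.each { |item| yield item }
    :sequential
  end
end

# Example: the block stands in for translating one file.
mode = process_batch(%w[a.txt b.txt]) { |path| path.upcase }
puts "batch mode: #{mode}"
```

Because each child is a separate process, results computed inside the block are not visible to the parent; a real batch runner would write its output to files, as the batch command does.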
24
+
25
+ Note that for Windows, the Cygwin port of Ruby does support
26
+ fork(2) perfectly, so Ruby 1.9+ on Cygwin can utilize the fork method, so
46
27
  it supports standard multiprocessing. Therefore, Cygwin is the most robust
47
- environment for running Sanzang on a Windows PC.
28
+ environment for running \Sanzang on a computer with Windows.
48
29
 
49
- == Encodings
30
+ == Text encoding quirks
50
31
 
51
- Converters for several encodings have not yet been implemented by MRI Ruby.
52
- Most of these are obscure and not widely used anyways. Perhaps the most
53
- notable is EUC-TW, which is an old Unix encoding for traditional Chinese.
54
- However, even this most "notable" missing encoding is obscure and unimportant.
32
+ Converters for several encodings have not yet been implemented by YARV. Most of
33
+ these are obscure and not widely used. Perhaps the most notable is EUC-TW,
34
+ which is an old Unix encoding for traditional Chinese. Encoding conversion is
35
+ used by the _reflow_ command for formatting CJK text, but encoding conversion
36
+ is not required for translation.
data/MANUAL ADDED
@@ -0,0 +1,312 @@
1
+ = The \Sanzang Manual
2
+
3
+ == Introduction
4
+
5
+ \Sanzang is a compact, cross-platform machine translation system. This program
6
+ was developed specifically to fill the need for a competent application for
7
+ aiding translators of the Chinese Buddhist canon into other languages. However,
8
+ the translation method it uses is general enough that it may extend to other
9
+ translation domains as well, especially translations with CJK source languages
10
+ (Chinese, Japanese, and Korean). The name \Sanzang (三藏) is a literal
11
+ translation of the Sanskrit word “Tripitaka,” which is a general term for the
12
+ Buddhist canon.
13
+
14
+ \Sanzang is implemented as a Unix style "command suite" program that executes
15
+ in your operating system's command shell. The _sanzang_ command includes
16
+ subcommands for carrying out each of the available functions of the system.
17
+
18
+ \Sanzang is programmed in the Ruby programming language and is free software
19
+ ("free as in freedom"). This program is licensed under the GNU General Public
20
+ License, version 3, which ensures that anyone can use the program for any
21
+ purpose, and that extensions to this program will remain freely available to
22
+ others.
23
+
24
+ == Background
25
+
26
+ At the time of this writing, nearly all machine translation systems available
27
+ are based on statistical machine translation (SMT), a method which has not
28
+ proven very useful for ancient Chinese texts. \Sanzang was developed as an
29
+ alternative simply because there was the practical need for a better and more
30
+ reliable tool.
31
+
32
+ The most significant difference between \Sanzang and other machine translation
33
+ systems is that it does not attempt to interpret or translate grammar. Instead,
34
+ it simply translates names, terms, and phrases based on rules stored in a
35
+ translation table. The \Sanzang translator applies this translation table at
36
+ runtime to generate a text listing as the output.
37
+
38
+ This method is simple and efficient, and produces predictable results that can
39
+ be made immediately available to the user for verification. To facilitate this
40
+ task, all translation listings generated by the program are collated
41
+ line-by-line with the original source text.
42
+
43
+ == Translation Method
44
+
45
+ \Sanzang provides mainly a simple machine translation engine. To use this
46
+ translation engine, it will also be necessary to have a text file in which all
47
+ translation rules are defined. This is called a translation table, and its
48
+ format is simple delimited text. At runtime, the translation rules in this file
49
+ are applied to the source text to generate the translation listing.
50
+
51
+ In a translation table text file, each line is a translation rule containing a
52
+ source term and its equivalent meanings in other languages. Each line starts
53
+ with "~|", has records delimited by "|", and ends with "|~". In the translation
54
+ table, the first column represents the source language, while the subsequent
55
+ columns represent destination languages. In this example, we want to create a
56
+ table capable of rendering the following title into English:
57
+
58
+ 金剛般若波羅蜜經
59
+
60
+ We start by creating a new text file, named _table.txt_, or something similar.
61
+ In this text file, we may add the following rules:
62
+
63
+ ~|波羅蜜| pāramitā|~
64
+ ~|金剛| diamond|~
65
+ ~|般若| prajñā|~
66
+ ~|經| sūtra/classic|~
67
+
68
+ Notice that spaces were included prior to the English equivalents. This is
69
+ because Chinese does not typically include spaces between words, so we need to
70
+ insert our own leading spaces as part of the translation rules we are defining.
71
+ After we have written this table file, we can then run the \Sanzang translation
72
+ engine with our table. When it reads the Chinese title as the input text, it
73
+ then produces the following translation listing:
74
+
75
+ 1.1 金剛般若波羅蜜經
76
+ 1.2 diamond prajñā pāramitā sūtra/classic
77
+
78
+ The program first sorted our terms by the length of the source column, and
79
+ then applied each of these rules in sequence. It then collated the output and
80
+ created a translation listing. In the left margin, we can see numbers denoting
81
+ the line number of the source text, along with the column number of the
82
+ translation table.
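The steps just described (parse the table, sort rules by source length, apply them in sequence, collate the output) can be sketched in a few lines of Ruby. This is a minimal illustration, not the actual \Sanzang implementation: the method names `parse_table`, `translate_line`, and `listing` are hypothetical, every rule is assumed to have the same number of columns, and surrounding whitespace is stripped from each listing row so the output matches the example above.

```ruby
# Hypothetical sketch: names below are illustrative, not Sanzang's own code.

# Parse table text into records (arrays of columns), longest source term first.
def parse_table(text)
  records = text.each_line.map(&:chomp).reject(&:empty?).map do |line|
    unless line.start_with?("~|") && line.end_with?("|~")
      raise ArgumentError, "malformed rule: #{line}"
    end
    # Drop the "~|" and "|~" markers, then split the columns on "|".
    line[2...-2].split("|", -1)
  end
  records.sort_by { |rec| -rec[0].length }
end

# Apply every rule in order to one source line, producing one rendering
# per destination column. Assumes all records have the same column count.
def translate_line(records, line)
  width = records.map(&:length).max || 1
  (1...width).map do |col|
    records.reduce(line) { |text, rec| text.gsub(rec[0], rec[col]) }
  end
end

# Collate each source line with its renderings as "line.column text" rows.
def listing(records, source)
  source.each_line.map(&:chomp).each_with_index.flat_map do |line, i|
    rows = [line] + translate_line(records, line)
    rows.each_with_index.map { |row, j| "#{i + 1}.#{j + 1} #{row.strip}" }
  end
end

TABLE = <<~EOS
  ~|波羅蜜| pāramitā|~
  ~|金剛| diamond|~
  ~|般若| prajñā|~
  ~|經| sūtra/classic|~
EOS

puts listing(parse_table(TABLE), "金剛般若波羅蜜經")
```

Sorting longest-first matters because a short rule applied too early could consume part of a longer term before the longer rule has a chance to match.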
83
+
84
+ As a final example, below is a snippet from an Indian Buddhist meditation text,
85
+ which was processed by the \Sanzang translation engine in the same manner:
86
+
87
+ 105.1 阿難白佛言。
88
+ 105.2 ānán bái-fó-yán ¶
89
+ 105.3 ānanda addressed-the-buddha-saying ¶
90
+
91
+ 106.1 唯然世尊。
92
+ 106.2 wéi-rán shìzūn ¶
93
+ 106.3 just-so bhagavān ¶
94
+
95
+ 107.1 願樂欲聞。
96
+ 107.2 yuànlè-yù-wén ¶
97
+ 107.3 joyfully-wish-to-hear ¶
98
+
99
+ Here we can see a three-column translation table at work. The first column has
100
+ the traditional Chinese source text, the second column contains the Pinyin
101
+ transliteration, and the third column contains English. In this example we can
102
+ see that well-defined translation rules lead to a clear translation listing,
103
+ in which the meaning of the original text is readily understandable in English.
104
+
105
+ Considering the examples above, we can see that knowledge of the source
106
+ language and expertise in the relevant literary field is often still necessary.
107
+ Here again we can see that this translation system does not position itself as
108
+ a “silver bullet” for creating finished translations, but is rather a practical
109
+ tool for the purpose of assisting human readers and translators.
110
+
111
+ == Installation
112
+
113
+ === Requirements
114
+
115
+ The standard way of installing \Sanzang is as a Ruby gem. To do this, the only
116
+ requirement is Ruby 1.9 or later, along with an Internet connection.
117
+
118
+ If you do not have an Internet connection available on the host computer, then
119
+ you may download the gem file and install it manually. If you choose this
120
+ route, then please be aware that the “sanzang” gem depends on the “parallel”
121
+ gem for multiprocessing.
122
+
123
+ For users of Microsoft Windows, please be aware that Windows ports of Ruby
124
+ typically do not include full multiprocessing support using the standard Unix
125
+ system calls. If you will be using \Sanzang in a Windows environment and you
126
+ require support for fast batch processing, then you should use the Cygwin port
127
+ of Ruby, which does not have this limitation. Unix-based platforms such as
128
+ Linux, BSD, and Mac OS X are unaffected by this issue.
129
+
130
+ In addition to installation requirements, it may also be very useful to have a
131
+ text editor that is aware of Unicode and other encodings, and able to display
132
+ multilingual texts. One such application that is known to work well for this
133
+ task is the _gedit_ text editor, which is free software and is available on a
134
+ variety of platforms.
135
+
136
+ === Installation
137
+
138
+ To install \Sanzang, the following command should suffice.
139
+
140
+ # gem install sanzang
141
+
142
+ This command will download and install \Sanzang into your Ruby environment.
143
+
144
+ If you have installed Ruby 1.9 but cannot run the _gem_ command, then you may
145
+ need to set up your PATH environment variable first, so you can run _ruby_ and
146
+ _gem_ from the command line.
147
+
148
+ == Commands
149
+
150
+ \Sanzang functions are accessible through the _sanzang_ command and its
151
+ subcommands. Runtime behavior is set through command line options and
152
+ parameters. This allows _sanzang_ to be easily scripted and automated.
153
+
154
+ \Sanzang subcommands can also be abbreviated for the sake of convenience. For
155
+ example, the "translate" subcommand could be abbreviated as "trans", "tr", or
156
+ even the one letter "t".
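One way such abbreviations could be resolved is by unique prefix matching against the list of subcommands. The sketch below is an assumption about the general technique, not a description of how \Sanzang actually does it; `COMMANDS` and `resolve_command` are hypothetical names.

```ruby
# Hypothetical sketch: names below are illustrative, not Sanzang's own code.
COMMANDS = %w[batch reflow translate].freeze

# Return the full subcommand name for a word that is a prefix of exactly
# one known command, or nil when nothing (or more than one) matches.
def resolve_command(word, commands = COMMANDS)
  return nil if word.nil? || word.empty?
  matches = commands.select { |name| name.start_with?(word) }
  matches.length == 1 ? matches.first : nil
end

%w[t tr trans r b x].each do |word|
  puts "#{word} -> #{resolve_command(word).inspect}"
end
```

With only three subcommands that start with different letters, every one-letter prefix is unambiguous, which is why "t" alone is enough to select "translate".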
157
+
158
+ === _sanzang_
159
+
160
+ The main _sanzang_ program acts as a front-end to subcommands, and also
161
+ includes options for printing platform and version information.
162
+
163
+ Usage: sanzang [options]
164
+ Usage: sanzang <command> [options] [args]
165
+
166
+ Sanzang commands:
167
+ batch translate many files in parallel
168
+ reflow format CJK text for translation
169
+ translate standard single text translation
170
+
171
+ Options:
172
+ -h, --help show this help message and exit
173
+ -P, --platform show platform information and exit
174
+ -V, --version show version number and exit
175
+
176
+ === _sanzang_ _batch_
177
+
178
+ The _sanzang_ _batch_ command can translate files in parallel. A list of files
179
+ is read from STDIN, while progress information is printed to STDERR. The list
180
+ of output files written is printed to STDOUT at the end of the batch. The
181
+ output directory is specified as a parameter.
182
+
183
+ Usage: sanzang batch [options] table output_dir < queue
184
+
185
+ Options:
186
+ -h, --help show this help message and exit
187
+ -E, --encoding=ENC set data encoding to ENC
188
+ -L, --list-encodings list possible encodings
189
+ -j, --jobs=N allow N concurrent processes
190
+
191
+ === _sanzang_ _reflow_
192
+
193
+ The command _sanzang_ _reflow_ can reformat Chinese, Japanese, or Korean text,
194
+ in which terms are often split between lines. This formatter "reflows" the text
195
+ based on its punctuation and horizontal spacing, separating the source text
196
+ into lines that are much safer for translation.
197
+
198
+ Usage: sanzang reflow [options]
199
+
200
+ Options:
201
+ -h, --help show this help message and exit
202
+ -E, --encoding=ENC set data encoding to ENC
203
+ -L, --list-encodings list possible encodings
204
+ -i, --infile=FILE read input text from FILE
205
+ -o, --outfile=FILE write output text to FILE
206
+
207
+ === _sanzang_ _translate_
208
+
209
+ The command _sanzang_ _translate_ can perform translation of a single text
210
+ stream or file. By default, this command reads from STDIN and writes to STDOUT.
211
+ For concurrent translation of multiple files, see the _batch_ command.
212
+
213
+ Usage: sanzang translate [options] table
214
+
215
+ Options:
216
+ -h, --help show this help message and exit
217
+ -E, --encoding=ENC set data encoding to ENC
218
+ -L, --list-encodings list possible encodings
219
+ -i, --infile=FILE read input text from FILE
220
+ -o, --outfile=FILE write output text to FILE
221
+
222
+ == Basic Usage
223
+
224
+ In the following example, we are working with a small text that we want to
225
+ translate. With the first command, we reformat the text using _reflow_.
226
+ Then we run _translate_ with our translation table, to generate a translation
227
+ listing.
228
+
229
+ $ sanzang reflow -i xinjing.txt -o lines.txt
230
+ $ sanzang translate -i lines.txt -o trans.txt TABLE.txt
231
+
232
+ The next two commands illustrate how these programs use standard input and
233
+ output streams by default, how they can easily operate as text filters, and
234
+ the way that _sanzang_ subcommands can be abbreviated.
235
+
236
+ $ sanzang r < xinjing.txt | sanzang t TABLE.txt > trans.txt
237
+ $ cat xinjing.txt | sanzang r | sanzang t TABLE.txt | less
238
+
239
+ == Advanced Usage
240
+
241
+ === Batch Mode
242
+
243
+ We may have thousands of texts that we want to generate translation listings
244
+ for with our translation table. For example, if our translation table was
245
+ updated recently, we may want to regenerate an entire corpus of translation
246
+ listings. To do this, we can use the _find_ command to retrieve the file paths
247
+ to our text files, and then pipe that output into _sanzang_ _batch_.
248
+
249
+ $ find /srv/texts -type f | sanzang batch TABLE.txt /srv/trans
250
+
251
+ This command will find all files in the location specified, and then feed the
252
+ file paths to _sanzang_ _batch_, which will process them as a batch. If
253
+ multiprocessing is supported on your platform, then the batch will be divided
254
+ among all available processors.
255
+
256
+ To determine whether multiprocessing is available in your version of Ruby, you
257
+ can run the _sanzang_ command with the "-P" option:
258
+
259
+ $ sanzang -P
260
+
261
+ If you see that "Fork implemented" is "true", then your platform supports Unix
262
+ style multiprocessing, and you can gain performance benefits from using the
263
+ _batch_ command. You should also examine the value for "Processors found".
264
+ This is the number of logical processors detected on your platform.
265
+
266
+ If you see that only one processor is detected and you know that there are more
267
+ available on your platform, then you may want to use the "-j" flag when running
268
+ the batch command, to manually specify the number of processes to use. This
269
+ number can be set to the number of CPU cores available on your system. This
270
+ option may be necessary on some less common platforms, such as some BSD
271
+ distributions and commercial Unix variants.
272
+
273
+ Microsoft Windows ports of Ruby typically do not support Unix style
274
+ multiprocessing. If you require higher performance by utilizing multiple CPU
275
+ cores, then you should look into the Cygwin port of Ruby, which does not have
276
+ this limitation.
277
+
278
+ The performance benefits of running in batch mode with multiprocessing may be
279
+ very significant. The performance increases are typically proportional to the
280
+ number of CPU cores available on the system. For example, on a large SMP system
281
+ with 50 processors available, _sanzang_ _batch_ can run up to 50x as fast.
282
+
283
+ === Text Encodings
284
+
285
+ \Sanzang supports many possible text encodings. Option "-L" will list all
286
+ available text encodings. Option "-E" will set the encoding to be used for all
287
+ text data such as input texts, output texts, and table files. The other program
288
+ I/O, such as messages for the terminal, will still be in the default encoding
289
+ of the environment. For example, in a Windows environment that by default uses
290
+ the IBM-437 encoding, specifying "-E" with a value of "UTF-16LE" will cause
291
+ \Sanzang to read and write all text data in UTF-16LE, but all other program
292
+ messages will still be displayed in the console's native IBM-437 encoding.
293
+
294
+ $ sanzang t -E UTF-16LE -i in.txt -o out.txt TABLE.txt
295
+
296
+ If the "-E" option is not specified, then \Sanzang will use the default
297
+ encoding inherited from the environment. For example, a GNU/Linux user running
298
+ \Sanzang in a UTF-8 terminal will by default have all text data read and
299
+ written in the UTF-8 encoding. The one *exception* to this is for
300
+ environments using the IBM-437 encoding (typically an old Windows command
301
+ shell). In this case, \Sanzang will take pity on you and automatically switch
302
+ to UTF-8 by default, as if you had specified the option "-E" with value
303
+ "UTF-8".
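The encoding rules above amount to a small decision: an explicit "-E" value wins, otherwise the environment's default is used, except that IBM-437 is replaced by UTF-8. A rough Ruby sketch of that decision follows. The method name `data_encoding` is hypothetical and the logic is inferred from the description above, not taken from the program itself; `Encoding.find`, `Encoding.default_external`, and `Encoding::IBM437` are standard Ruby.

```ruby
# Hypothetical sketch: the method name is illustrative, not Sanzang's own code.

# Choose the encoding for text data (input, output, and table files).
# option:  the value given to "-E", or nil when the option was not used.
# default: the encoding inherited from the environment.
def data_encoding(option, default = Encoding.default_external)
  # An explicit option always wins; Encoding.find raises ArgumentError
  # for names that Ruby does not know.
  return Encoding.find(option) if option
  # Special case from the manual: IBM-437 environments fall back to UTF-8.
  default == Encoding::IBM437 ? Encoding::UTF_8 : default
end

puts data_encoding("UTF-16LE")
puts data_encoding(nil, Encoding::IBM437)
puts data_encoding(nil, Encoding::UTF_8)
```

Program messages are outside the scope of this sketch; as noted above, they stay in the console's native encoding regardless of "-E".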
304
+
305
+ == Responsible Use
306
+
307
+ With comprehensive translation tables, \Sanzang can often be quite accurate
308
+ and effective. However, this program is still comparable to a simple machine,
309
+ and it can never replace a human translator. Please understand the scope of
310
+ this translation system when using it. No machines can take responsibility for
311
+ a poor translation. In the end, it is you who are responsible for any and all
312
+ publications.
data/README CHANGED
@@ -1,280 +1,50 @@
1
- = Sanzang (三藏)
2
-
3
- == Contents
4
-
5
- * Introduction
6
- * Concepts
7
- * Installation
8
- * Components
9
- * Basic Usage
10
- * Advanced Usage
11
- * Responsible Use
1
+ = \Sanzang (三藏)
12
2
 
13
3
  == Introduction
14
4
 
15
- Sanzang is a compact, cross-platform machine translation system. This program
16
- was developed specifically to fill the need for a competent application for
17
- aiding translators of the Chinese Buddhist canon into other languages. However,
18
- the translation method it uses is general enough that it may extend to other
19
- translation domains as well, especially those in which Chinese is the source
20
- language. Sanzang (三藏) is a literal translation of the Sanskrit word
21
- "Tripitaka," a general term for the Buddhist canon. Sanzang is alternately
22
- a translation of "trepitaka," the title for someone who is a master of such
23
- teachings.
24
-
25
- Sanzang is implemented as a small set of programs written in the Ruby
26
- programming language. This system is free software (“free as in freedom”), and
27
- it is licensed under the GNU General Public License, version 3. This ensures
28
- that anyone can use the program for any purpose, and that any extensions to
29
- Sanzang will remain freely available to others.
30
-
31
- == Background
32
-
33
- The most significant difference between Sanzang and other machine translation
34
- systems is that it does not attempt to interpret grammar in any way. Instead,
35
- it relies on direct translation of names, terms, and phrases based on a large
36
- translation table. The Sanzang translator simply applies this translation table
37
- at runtime, and does not attempt to interpret grammar or syntax in any way
38
- whatsoever. The end result is that the accuracy of the translation is highly
39
- dependent on the accuracy of the translation table.
40
-
41
- The strength of the Sanzang method is that it is extremely simple and easy to
42
- work with, and eliminates virtually all complexity in the translation process.
43
- This system will never produce incorrect syntax because it does not interpret
44
- syntax in the first place. This method is also efficient and yields predictable
45
- results that can be made immediately available to the user for verification. To
46
- facilitate this task, all translation listings are collated line-by-line with
47
- the original source text.
48
-
49
- == Concepts
50
-
51
- Sanzang provides mainly a simple translation engine. For any actual
52
- translation work, Sanzang requires a translation table in which all
53
- translation rules are defined. This translation table is stored in a simple
54
- text file. Each line is a record containing a source term and its equivalent
55
- meanings in other languages. Each line starts with "~|", has records delimited
56
- by "|", and ends with "|~". In a table, the first column represents the source
57
- language, while the subsequent columns represent destination languages. In
58
- this example, we want to create a table capable of rendering the following
59
- title into English:
60
-
61
- 金剛般若波羅蜜經
62
-
63
- We start by creating a new text file, named TABLE.txt or something similar. In
64
- this text file, we may add the following rules:
65
-
66
- ~|波羅蜜| pāramitā|~
67
- ~|金剛| diamond|~
68
- ~|般若| prajñā|~
69
- ~|經| sūtra|~
70
-
71
- Did you notice that we included spaces prior to the translations of these
72
- terms? This is because Chinese does not typically include spaces between
73
- words, so we need to insert our own leading spaces as part of the rules we are
74
- defining. After we have written this table file, we can run the Sanzang
75
- translator with our table. When it reads the Chinese title as the input text,
76
- it then produces the following translation listing:
77
-
78
- 1.1 金剛般若波羅蜜經
79
- 1.2 diamond prajñā pāramitā sūtra
5
+ \Sanzang is a compact and simple cross-platform machine translation system.
6
+ This program is especially useful for translating from CJK languages (Chinese,
7
+ Korean, and Japanese), and it is very suitable for ancient and otherwise
8
+ difficult texts. Due to its origins in translating texts from the Chinese
9
+ Buddhist canon, the program is called \Sanzang (三藏), a literal translation of
10
+ the Sanskrit word "Tripitaka," which is a general term for the Buddhist canon.
11
+ As demonstrated by the _sanzang_ program itself:
80
12
 
81
- The program first sorted our terms by the length of the source column, and
82
- then applied each of these rules in sequence. It then collated the output and
83
- created a translation listing. In the left margin, we can see numbers denoting
84
- the line number of the source text, along with the column number of the
85
- translation table.
13
+ $ echo '三藏' | sanzang t sztab
14
+ [1.1] 三藏
15
+ [1.2] sānzàng
16
+ [1.3] tripiṭaka
86
17
 
87
- As a final example, below is a snippet from an ancient meditation text, which
88
- was also processed by the Sanzang translator in the same manner:
18
+ Anyone can learn how to use \Sanzang, and use it to read and analyze texts.
19
+ Unlike most other systems, \Sanzang is small and approachable. Any user can
20
+ develop his or her own translation rules, and these are simply stored in a text
21
+ file that the program can read. For full details, refer to the MANUAL.
89
22
 
90
- 105.1 阿難白佛言。
91
- 105.2 ānán bái-fó-yán
92
- 105.3 ānanda addressed-the-buddha-saying ¶
23
+ \Sanzang is free software ("free as in freedom"), and it is released under the
24
+ GNU General Public License, version 3.
93
25
 
94
- 106.1 唯然世尊。
95
- 106.2 wéi-rán shìzūn ¶
96
- 106.3 just-so bhagavān ¶
26
+ == Quick Install
97
27
 
98
- 107.1 願樂欲聞。
99
- 107.2 yuànlè-yù-wén
100
- 107.3 joyfully-wish-to-hear ¶
101
-
102
- Here we can see a three-column translation table at work. The first column has
103
- the traditional Chinese source text, the second column contains the Pinyin
104
- transliteration, and the third column contains English. In this example we can
105
- see that well-defined translation rules lead to a clear translation listing,
106
- at which the meaning of the original text is readily understandable in
107
- English. If we wished to add additional columns for simplified Chinese,
108
- Vietnamese, Japanese, Spanish, French, German, Russian, or any other languages,
109
- then these could all be handled similarly without any technical difficulties.
110
-
111
- Comprehensive translation tables could be quite large, containing tens of
112
- thousands of entries. However, the work of building such a table is not so
113
- significant compared to the long-term benefits which may be gained from such
114
- tables. In addition, rules in these translation tables may be translated into
115
- other languages as well. There is a potential here to assist readers all over
116
- the world with understanding otherwise difficult works.
117
-
118
- Considering the examples above, we can see that knowledge of the source
119
- language and expertise in the relevant literary field is often still necessary.
120
- Here again we can see that this translation system does not position itself as
121
- a “silver bullet” for creating finished translations, but is rather a practical
122
- set of utilities for the purpose of assisting human readers and translators.
123
-
124
- == Installation
125
-
126
- === Requirements
127
-
128
- The Sanzang system can be installed either as a Ruby gem, or manually from
129
- an archive file. The only prerequisite to using Sanzang is:
130
-
131
- * Ruby 1.9 or later
132
-
133
- The "parallel" gem is required by Sanzang, but is installed automatically when
134
- installing Sanzang using the standard method. Using the "parallel" gem, Sanzang
135
- can support multiprocessing in batch mode (if the platform supports it).
136
- Currently this method of multiprocessing will work automatically on Ruby ports
137
- that implement the Process#fork system call.
138
-
139
- In addition to the actual runtime requirements, it may also be very useful to
140
- have a text editor that is aware of Unicode and other encodings, and able to
141
- display multilingual texts. One such application that is known to work well
142
- for this task is the _gedit_ text editor, which is free software and also
143
- available on a variety of platforms.
144
-
145
- === Installation
146
-
147
- To install Sanzang, the following command should suffice.
28
+ To install \Sanzang, the prerequisite is Ruby 1.9 or later. After Ruby has been
29
+ installed, you can run the _gem_ command from a command shell to automatically
30
+ download and install \Sanzang onto your computer.
148
31
 
149
32
  # gem install sanzang
150
33
 
151
- If you have installed Ruby 1.9 but cannot run the "gem" command, then you may
152
- need to set up your PATH environment variable first, so you can run _ruby_ and
153
- _gem_ from the command line.
154
-
155
- == Components
156
-
157
- The programs in Sanzang are designed in a traditional Unix style in which
158
- programs are executed in a terminal, and program settings are specified
159
- through command line options and parameters. This allows Sanzang programs
160
- to be easily scripted and automated.
161
-
162
- === sanzang-reflow
163
-
164
- The program sanzang-reflow can reformat Chinese, Japanese, or Korean text, in
165
- which terms are often split between lines. This formatter "reflows" the text
166
- instead based on its punctuation and horizontal spacing, separating the source
167
- text into lines that are much safer for translation using the sanzang-translate
168
- program.
169
-
170
- Usage: sanzang-reflow [options]
171
-
172
- Options:
173
- -h, --help show this help message and exit
174
- -E, --encoding=ENC set data encoding to ENC
175
- -L, --list-encodings list possible encodings
176
- -i, --infile=FILE read input text from FILE
177
- -o, --outfile=FILE write output text to FILE
178
- -V, --version show version number and exit
179
-
180
- === sanzang-translate
181
-
182
- The program sanzang-translate (1) reads a translation table file, (2) applies
183
- this table's rules to an input text, and then (3) generates a translation
184
- listing. This program can also run in a special batch mode that can utilize
185
- multiprocessing (multiple processors and processor cores) for high
186
- performance.
187
-
188
- Usage: sanzang-translate [options] table
189
- Usage: sanzang-translate -B output_dir table < file_list
190
-
191
- Options:
192
- -h, --help show this help message and exit
193
- -B, --batch-dir=DIR process from a queue into DIR
194
- -E, --encoding=ENC set data encoding to ENC
195
- -L, --list-encodings list possible encodings
196
- -i, --infile=FILE read input text from FILE
197
- -o, --outfile=FILE write output text to FILE
198
- -P, --platform show platform information
199
- -V, --version show version number and exit
200
-
201
- == Basic Usage
202
-
203
- === Formatting and translating a single text
204
-
205
- In the following example, we are working with a small text that we want to
206
- translate. With the first command, we reformat the text using sanzang-reflow.
207
- Then we run the sanzang-translate program with our translation table, to
208
- generate a translation listing.
209
-
210
- $ sanzang-reflow -i xinjing.txt -o lines.txt
211
- $ sanzang-translate -i lines.txt -o trans.txt TABLE.txt
212
-
213
- === Redirecting I/O
214
-
215
- The next two commands illustrate how these programs use standard input and
216
- output streams by default, and how they can easily operate as text filters.
217
-
218
- $ sanzang-reflow -i xinjing.txt | sanzang-translate -o trans.txt TABLE.txt
219
- $ cat xinjing.txt | sanzang-reflow | sanzang-translate TABLE.txt | less
220
-
221
- == Advanced Usage
222
-
223
- === Batch Mode and Multiprocessing
224
-
225
- In the following example, we may have several thousand texts that we want to
226
- run through sanzang-translate with our translation table. For example, if our
227
- translation table was updated recently, we may want to regenerate our corpus
228
- of translation listings. To do this, we can use the "find" command to retrieve
229
- the file paths to our text files, and then pipe that output into the Sanzang
230
- translation program.
231
-
232
- $ find /srv/texts -type f | sanzang-translate -B /srv/trans TABLE.txt
233
-
234
- This command will find all files in the location specified, and then feed the
235
- file paths to sanzang-translate, which will process them as a batch. If the
236
- "parallel" gem is available and functioning on the system, then the batch will
237
- be divided among all available processors.
238
-
239
- If this gem has been installed, then when running in batch mode, if we have six
240
- CPU cores on the local machine, then we should be able to expect six
241
- translation processes running concurrently. The exception to this is on the
242
- "mswin" and "mingw" platforms, which do not have the necessary system calls
243
- for Unix style multiprocessing. In this case, running Sanzang in the
244
- Cygwin environment is a viable alternative.
245
-
246
- The performance benefits of running with the "parallel" library can be very
247
- significant, leading to a series of translation listings being generated in a
248
- mere fraction of the time it would take to process them otherwise. This
249
- performance gain is typically proportional to the number of processors and
250
- processor cores available on the local system.
251
-
252
- === Text Encodings
253
-
254
- Sanzang supports many possible text encodings. Option "-L" will list all
255
- available text encodings. Option "-E" will set the encoding to be used for all
256
- text data such as input texts, output texts, and table files. The other program
257
- I/O, such as messages for the terminal, will still be in the default encoding
258
- of the environment. For example, in a Windows environment that by default uses
259
- the IBM-437 encoding, specifying "-E" with a value of "UTF-16LE" will cause
260
- Sanzang to read and write all text data in UTF-16LE, but all other program
261
- messages will still be displayed in the console's native IBM-437 encoding.
34
+ After this, you should be able to run the _sanzang_ command. Run the following
35
+ command to verify your installation and print platform information.
262
36
 
263
- $ sanzang-translate -E UTF-16LE -i in.txt -o out.txt TABLE.txt
37
+ # sanzang -P
264
38
 
265
- If the "-E" option is not specified, then Sanzang will use the default encoding
266
- inherited from the environment. For example, a GNU/Linux user running Sanzang in
267
- a UTF-8 terminal will by default have all text data read and written to in the
268
- UTF-8 encoding. The one *exception* to this is for environments using the
269
- IBM-437 encoding (typically an old Windows command shell). In this case,
270
- Sanzang will take pity on you and automatically switch to UTF-8 by default, as
271
- if you had specified the option "-E" with value "UTF-8".
39
+ This command should show a summary of your platform for running \Sanzang.
272
40
 
273
- == Responsible Use
41
+ Ruby platform: x86_64-linux
42
+ Ruby version: 2.0.0
43
+ External encoding: UTF-8
44
+ Internal encoding: none
45
+ Fork implemented: true
46
+ Parallel version: 0.6.4
47
+ Processors found: 4
48
+ Sanzang version: 1.0.0
274
49
 
275
- With comprehensive translation tables, Sanzang can often be quite accurate
276
- and effective. However, this program is still comparable to a simple machine,
277
- and it can never replace a human translator. Please understand the scope of
278
- this translation system when using it. No machines can take responsibility for
279
- a poor translation. In the end, it is you who are responsible for any and all
280
- publications.
50
+ You now have \Sanzang installed and running on your computer.