sanzang 0.0.1

data/README ADDED
@@ -0,0 +1,280 @@
1
+ = Sanzang (三藏)
2
+
3
+ == Contents
4
+
5
+ * Introduction
+ * Background
6
+ * Concepts
7
+ * Installation
8
+ * Components
9
+ * Basic Usage
10
+ * Advanced Usage
11
+ * Responsible Use
12
+
13
+ == Introduction
14
+
15
+ Sanzang is a compact, cross-platform machine translation system. This program
16
+ was developed specifically to fill the need for a competent application for
17
+ aiding translators of the Chinese Buddhist canon into other languages. However,
18
+ the translation method it uses is general enough that it may extend to other
19
+ translation domains as well, especially those in which Chinese is the source
20
+ language. Sanzang (三藏) is a literal translation of the Sanskrit word
21
+ "Tripitaka," a general term for the Buddhist canon. Sanzang is alternatively
22
+ a translation of "trepitaka," the title for someone who is a master of such
23
+ teachings.
24
+
25
+ Sanzang is implemented as a small set of programs written in the Ruby
26
+ programming language. This system is free software (“free as in freedom”), and
27
+ it is licensed under the GNU General Public License, version 3. This ensures
28
+ that anyone can use the program for any purpose, and that any extensions to
29
+ Sanzang will remain freely available to others.
30
+
31
+ == Background
32
+
33
+ The most significant difference between Sanzang and other machine translation
34
+ systems is that it does not attempt to interpret grammar in any way. Instead,
35
+ it relies on direct translation of names, terms, and phrases based on a large
36
+ translation table. The Sanzang translator simply applies this translation table
+ at runtime. As a result, the accuracy of the translation depends heavily on the
+ accuracy of the translation table.
40
+
41
+ The strength of the Sanzang method is that it is extremely simple and easy to
42
+ work with, and eliminates virtually all complexity in the translation process.
43
+ This system will never produce incorrect syntax because it does not interpret
44
+ syntax in the first place. This method is also efficient and yields predictable
45
+ results that can be made immediately available to the user for verification. To
46
+ facilitate this task, all translation listings are collated line-by-line with
47
+ the original source text.
48
+
49
+ == Concepts
50
+
51
+ Sanzang is primarily a simple translation engine. For any actual
52
+ translation work, Sanzang requires a translation table in which all
53
+ translation rules are defined. This translation table is stored in a simple
54
+ text file. Each line is a record containing a source term and its equivalent
55
+ meanings in other languages. Each line starts with "~|", has fields delimited
56
+ by "|", and ends with "|~". In a table, the first column represents the source
57
+ language, while the subsequent columns represent destination languages. In
58
+ this example, we want to create a table capable of rendering the following
59
+ title into English:
60
+
61
+ 金剛般若波羅蜜經
62
+
63
+ We start by creating a new text file, named TABLE.txt or something similar. In
64
+ this text file, we may add the following rules:
65
+
66
+ ~|波羅蜜| pāramitā|~
67
+ ~|金剛| diamond|~
68
+ ~|般若| prajñā|~
69
+ ~|經| sūtra|~
70
+
71
+ Did you notice that we included spaces prior to the translations of these
72
+ terms? This is because Chinese does not typically include spaces between
73
+ words, so we need to insert our own leading spaces as part of the rules we are
74
+ defining. After we have written this table file, we can run the Sanzang
75
+ translator with our table. When it reads the Chinese title as the input text,
76
+ it then produces the following translation listing:
77
+
78
+ 1.1 金剛般若波羅蜜經
79
+ 1.2 diamond prajñā pāramitā sūtra
80
+
81
+ The program first sorted our terms by the length of the source column, and
82
+ then applied each of these rules in sequence. It then collated the output and
83
+ created a translation listing. In the left margin, we can see numbers denoting
84
+ the line number of the source text, along with the column number of the
85
+ translation table.
86
+
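The table format and the rule application described above can be sketched in a few lines of Ruby. This is only an illustration, not Sanzang's actual implementation: the helper names parse_table and translate_line are hypothetical, and applying the longest source terms first is an assumption about how the length ordering is meant to work (it keeps a short term from breaking up a longer one).

```ruby
# Hypothetical helpers for illustration; not part of the Sanzang API.

# Parse table text into an array of records (arrays of fields).
# Each line is expected to look like "~|source| target|~".
def parse_table(text)
  records = text.each_line.map do |line|
    line = line.chomp
    next unless line.start_with?("~|") && line.end_with?("|~")
    line[2...-2].split("|")
  end
  # Assumption: longer source terms are applied before shorter ones.
  records.compact.sort_by { |rec| -rec[0].length }
end

# Apply column +col+ of every rule, in order, to one line of source text.
def translate_line(line, records, col = 1)
  records.reduce(line) { |text, rec| text.gsub(rec[0], rec[col]) }
end

table = parse_table(<<~TABLE)
  ~|波羅蜜| pāramitā|~
  ~|金剛| diamond|~
  ~|般若| prajñā|~
  ~|經| sūtra|~
TABLE

puts translate_line("金剛般若波羅蜜經", table)
```

Because every rule carries its own leading space, the translated words come out separated even though the source text contains no spaces.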
87
+ As a final example, below is a snippet from an ancient meditation text, which
88
+ was also processed by the Sanzang translator in the same manner:
89
+
90
+ 105.1 阿難白佛言。
91
+ 105.2 ānán bái-fó-yán ¶
92
+ 105.3 ānanda addressed-the-buddha-saying ¶
93
+
94
+ 106.1 唯然世尊。
95
+ 106.2 wéi-rán shìzūn ¶
96
+ 106.3 just-so bhagavān ¶
97
+
98
+ 107.1 願樂欲聞。
99
+ 107.2 yuànlè-yù-wén ¶
100
+ 107.3 joyfully-wish-to-hear ¶
101
+
102
+ Here we can see a three-column translation table at work. The first column has
103
+ the traditional Chinese source text, the second column contains the Pinyin
104
+ transliteration, and the third column contains English. In this example we can
105
+ see that well-defined translation rules lead to a clear translation listing,
106
+ in which the meaning of the original text is readily understandable in
107
+ English. If we wished to add additional columns for simplified Chinese,
108
+ Vietnamese, Japanese, Spanish, French, German, Russian, or any other languages,
109
+ then these could all be handled similarly without any technical difficulties.
110
+
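A collated listing like the one above could be produced along the following lines. This is a sketch under stated assumptions: listing is a hypothetical helper, and the three-column records are reconstructed from the first example line (the split into 阿難 and 白佛言 is a guess at how such rules might be written), not taken from a real Sanzang table.

```ruby
# Hypothetical sketch: build a collated listing from a multi-column table.
# Each record is [source, column 2, column 3, ...].
def listing(source_lines, records)
  # Assumption: longer source terms are applied before shorter ones.
  rules = records.sort_by { |rec| -rec[0].length }
  columns = records.map(&:length).max
  out = []
  source_lines.each_with_index do |line, i|
    out << "#{i + 1}.1 #{line}"                  # column 1: the source text
    (1...columns).each do |col|
      text = rules.reduce(line) { |t, rec| t.gsub(rec[0], rec[col]) }
      out << "#{i + 1}.#{col + 1} #{text.strip}"  # line number, column number
    end
    out << ""                                     # blank line between groups
  end
  out.join("\n")
end

records = [
  ["阿難", " ānán", " ānanda"],
  ["白佛言", " bái-fó-yán", " addressed-the-buddha-saying"],
  ["。", " ¶", " ¶"],
]

puts listing(["阿難白佛言。"], records)
```

Adding another destination language then only means adding one more field to each record; the loop over columns picks it up without other changes.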
111
+ Comprehensive translation tables could be quite large, containing tens of
112
+ thousands of entries. However, the work of building such a table is not so
113
+ significant compared to the long-term benefits which may be gained from such
114
+ tables. In addition, rules in these translation tables may be translated into
115
+ other languages as well. There is a potential here to assist readers all over
116
+ the world with understanding otherwise difficult works.
117
+
118
+ Considering the examples above, we can see that knowledge of the source
119
+ language and expertise in the relevant literary field is often still necessary.
120
+ Here again we can see that this translation system does not position itself as
121
+ a “silver bullet” for creating finished translations, but is rather a practical
122
+ set of utilities for the purpose of assisting human readers and translators.
123
+
124
+ == Installation
125
+
126
+ === Requirements
127
+
128
+ The Sanzang system can be installed either as a Ruby gem, or manually from
129
+ an archive file. The only prerequisite to using Sanzang is:
130
+
131
+ * Ruby 1.9 or later
132
+
133
+ The "parallel" gem is required by Sanzang, but is installed automatically when
134
+ installing Sanzang using the standard method. Using the "parallel" gem, Sanzang
135
+ can support multiprocessing in batch mode (if the platform supports it).
136
+ Currently this method of multiprocessing will work automatically on Ruby ports
137
+ that implement the Process.fork method.
138
+
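To find out whether a given Ruby port supports forking, a quick check such as the following may help. It relies only on core Ruby behavior (Process.respond_to?(:fork) is false where fork is not implemented); the messages are illustrative and are not output produced by Sanzang.

```ruby
# Check whether this Ruby port can fork worker processes.
can_fork = Process.respond_to?(:fork)

if can_fork
  puts "fork is available on this platform"
else
  puts "fork is not available on this platform"
end
```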
139
+ In addition to the actual runtime requirements, it may also be very useful to
140
+ have a text editor that is aware of Unicode and other encodings, and able to
141
+ display multilingual texts. One such application that is known to work well
142
+ for this task is the _gedit_ text editor, which is free software and also
143
+ available on a variety of platforms.
144
+
145
+ === Installation
146
+
147
+ To install Sanzang, the following command should suffice.
148
+
149
+ # gem install sanzang
150
+
151
+ If you have installed Ruby 1.9 but cannot run the "gem" command, then you may
152
+ need to set up your PATH environment variable first, so you can run _ruby_ and
153
+ _gem_ from the command line.
154
+
155
+ == Components
156
+
157
+ The programs in Sanzang are designed in a traditional Unix style in which
158
+ programs are executed in a terminal, and program settings are specified
159
+ through command line options and parameters. This allows Sanzang programs
160
+ to be easily scripted and automated.
161
+
162
+ === sanzang-reflow
163
+
164
+ The program sanzang-reflow can reformat Chinese, Japanese, or Korean text, in
165
+ which terms are often split between lines. This formatter "reflows" the text
166
+ instead based on its punctuation and horizontal spacing, separating the source
167
+ text into lines that are much safer for translation using the sanzang-translate
168
+ program.
169
+
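The idea of reflowing by punctuation can be sketched as follows. This is a deliberately simplified illustration: reflow is a hypothetical function, it only considers three sentence-ending marks, and the real Sanzang::TextFormatter may work quite differently.

```ruby
# Hypothetical sketch of punctuation-based reflow; the actual
# Sanzang::TextFormatter may use different rules.
def reflow(text)
  joined = text.gsub(/\s*\n\s*/, "")       # undo hard line wraps
  joined.gsub(/([。！？])/) { "#{$1}\n" }   # start a new line after each mark
end

wrapped = "阿難白佛\n言。唯然世\n尊。願樂欲聞。\n"
puts reflow(wrapped)
```

After this step no term is split across a line break, so each line can be translated on its own.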
170
+ Usage: sanzang-reflow [options]
171
+
172
+ Options:
173
+ -h, --help show this help message and exit
174
+ -E, --encoding=ENC set data encoding to ENC
175
+ -L, --list-encodings list possible encodings
176
+ -i, --infile=FILE read input text from FILE
177
+ -o, --outfile=FILE write output text to FILE
178
+ -V, --version show version number and exit
179
+
180
+ === sanzang-translate
181
+
182
+ The program sanzang-translate (1) reads a translation table file, (2) applies
183
+ this table's rules to an input text, and then (3) generates a translation
184
+ listing. This program can also run in a special batch mode that can utilize
185
+ multiprocessing (multiple processors and processor cores) for high
186
+ performance.
187
+
188
+ Usage: sanzang-translate [options] table
189
+ Usage: sanzang-translate -B output_dir table < file_list
190
+
191
+ Options:
192
+ -h, --help show this help message and exit
193
+ -B, --batch-dir=DIR process from a queue into DIR
194
+ -E, --encoding=ENC set data encoding to ENC
195
+ -L, --list-encodings list possible encodings
196
+ -i, --infile=FILE read input text from FILE
197
+ -o, --outfile=FILE write output text to FILE
198
+ -P, --platform show platform information
199
+ -V, --version show version number and exit
200
+
201
+ == Basic Usage
202
+
203
+ === Formatting and translating a single text
204
+
205
+ In the following example, we are working with a small text that we want to
206
+ translate. With the first command, we reformat the text using sanzang-reflow.
207
+ Then we run the sanzang-translate program with our translation table, to
208
+ generate a translation listing.
209
+
210
+ $ sanzang-reflow -i xinjing.txt -o lines.txt
211
+ $ sanzang-translate -i lines.txt -o trans.txt TABLE.txt
212
+
213
+ === Redirecting I/O
214
+
215
+ The next two commands illustrate how these programs use standard input and
216
+ output streams by default, and how they can easily operate as text filters.
217
+
218
+ $ sanzang-reflow -i xinjing.txt | sanzang-translate -o trans.txt TABLE.txt
219
+ $ cat xinjing.txt | sanzang-reflow | sanzang-translate TABLE.txt | less
220
+
221
+ == Advanced Usage
222
+
223
+ === Batch Mode and Multiprocessing
224
+
225
+ In the following example, we may have several thousand texts that we want to
226
+ run through sanzang-translate with our translation table. For example, if our
227
+ translation table was updated recently, we may want to regenerate our corpus
228
+ of translation listings. To do this, we can use the "find" command to retrieve
229
+ the file paths to our text files, and then pipe that output into the Sanzang
230
+ translation program.
231
+
232
+ $ find /srv/texts -type f | sanzang-translate -B /srv/trans TABLE.txt
233
+
234
+ This command will find all files in the location specified, and then feed the
235
+ file paths to sanzang-translate, which will process them as a batch. If the
236
+ "parallel" gem is available and functioning on the system, then the batch will
237
+ be divided among all available processors.
238
+
239
+ For example, when running in batch mode with this gem installed on a machine
+ with six CPU cores, we can expect six translation processes to run
+ concurrently. The exception to this is on the
242
+ "mswin" and "mingw" platforms, which do not have the necessary system calls
243
+ for Unix style multiprocessing. In this case, running Sanzang in the
244
+ Cygwin environment is a viable alternative.
245
+
246
+ The performance benefits of running with the "parallel" library can be very
247
+ significant, leading to a series of translation listings being generated in a
248
+ mere fraction of the time it would take to process them otherwise. This
249
+ performance gain is typically proportional to the number of processors and
250
+ processor cores available on the local system.
251
+
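The way a batch can be divided among processes may be sketched with Ruby's standard library alone. This is not how Sanzang or the "parallel" gem is implemented: process_batch is a hypothetical function, the static slicing is a simplification, and upcasing the text merely stands in for the translation step. It requires a platform with fork and Ruby 2.2 or later for Etc.nprocessors.

```ruby
require "etc"
require "fileutils"

# Hypothetical sketch: split a list of file paths among worker processes.
# Output files are named after the input basenames, so inputs with the
# same basename would overwrite each other in this simplified version.
def process_batch(paths, out_dir, workers = Etc.nprocessors)
  FileUtils.mkdir_p(out_dir)
  return if paths.empty?
  slice_size = (paths.length / workers.to_f).ceil
  pids = paths.each_slice(slice_size).map do |slice|
    Process.fork do
      slice.each do |path|
        text = File.read(path, encoding: "UTF-8")
        # Placeholder for the real translation step.
        File.write(File.join(out_dir, File.basename(path)), text.upcase)
      end
    end
  end
  pids.each { |pid| Process.wait(pid) }  # wait for every worker to finish
end
```

To mirror the "find" pipeline shown earlier, the paths could be read from standard input, for example with process_batch($stdin.each_line.map(&:chomp), "/srv/trans").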
252
+ === Text Encodings
253
+
254
+ Sanzang supports many possible text encodings. Option "-L" will list all
255
+ available text encodings. Option "-E" will set the encoding to be used for all
256
+ text data such as input texts, output texts, and table files. The other program
257
+ I/O, such as messages for the terminal, will still be in the default encoding
258
+ of the environment. For example, in a Windows environment that by default uses
259
+ the IBM-437 encoding, specifying "-E" with a value of "UTF-16LE" will cause
260
+ Sanzang to read and write all text data in UTF-16LE, but all other program
261
+ messages will still be displayed in the console's native IBM-437 encoding.
262
+
263
+ $ sanzang-translate -E UTF-16LE -i in.txt -o out.txt TABLE.txt
264
+
265
+ If the "-E" option is not specified, then Sanzang will use the default encoding
266
+ inherited from the environment. For example, a GNU/Linux user running Sanzang in
267
+ a UTF-8 terminal will by default have all text data read and written in the
268
+ UTF-8 encoding. The one *exception* to this is for environments using the
269
+ IBM-437 encoding (typically an old Windows command shell). In this case,
270
+ Sanzang will take pity on you and automatically switch to UTF-8 by default, as
271
+ if you had specified the option "-E" with value "UTF-8".
272
+
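The encoding rules above amount to a small decision procedure, sketched here. The function name data_encoding is hypothetical; the sketch only restates the rules described in this section using Ruby's standard Encoding class.

```ruby
# Hypothetical helper mirroring the rules described above.
# requested: the value given to "-E", or nil if the option was omitted.
# default:   the encoding inherited from the environment.
def data_encoding(requested = nil, default = Encoding.default_external)
  return Encoding.find(requested) if requested        # explicit "-E" wins
  return Encoding::UTF_8 if default == Encoding::IBM437
  default
end

p data_encoding("UTF-16LE")
p data_encoding(nil, Encoding::IBM437)
p data_encoding(nil, Encoding::UTF_8)
```

Only the text data (input, output, and table files) would use the result; console messages stay in the environment's own encoding.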
273
+ == Responsible Use
274
+
275
+ With comprehensive translation tables, Sanzang can often be quite accurate
276
+ and effective. However, this program is still comparable to a simple machine,
277
+ and it can never replace a human translator. Please understand the scope of
278
+ this translation system when using it. No machines can take responsibility for
279
+ a poor translation. In the end, it is you who are responsible for any and all
280
+ publications.
@@ -0,0 +1,21 @@
1
+ #!/usr/bin/env ruby
2
+ # -*- encoding: UTF-8 -*-
3
+ #--
4
+ # Copyright (C) 2012 Lapis Lazuli Texts
5
+ #
6
+ # This program is free software: you can redistribute it and/or modify it under
7
+ # the terms of the GNU General Public License as published by the Free Software
8
+ # Foundation, either version 3 of the License, or (at your option) any later
9
+ # version.
10
+ #
11
+ # This program is distributed in the hope that it will be useful, but WITHOUT
12
+ # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
13
+ # FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
14
+ # details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License along with
17
+ # this program. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ require_relative File.join("..", "lib", "sanzang")
20
+
21
+ Kernel.exit(Sanzang::Command::Reflow.new.run(ARGV))
@@ -0,0 +1,21 @@
1
+ #!/usr/bin/env ruby
2
+ # -*- encoding: UTF-8 -*-
3
+ #--
4
+ # Copyright (C) 2012 Lapis Lazuli Texts
5
+ #
6
+ # This program is free software: you can redistribute it and/or modify it under
7
+ # the terms of the GNU General Public License as published by the Free Software
8
+ # Foundation, either version 3 of the License, or (at your option) any later
9
+ # version.
10
+ #
11
+ # This program is distributed in the hope that it will be useful, but WITHOUT
12
+ # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
13
+ # FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
14
+ # details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License along with
17
+ # this program. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ require_relative File.join("..", "lib", "sanzang")
20
+
21
+ Kernel.exit(Sanzang::Command::Translate.new.run(ARGV))
data/lib/sanzang.rb ADDED
@@ -0,0 +1,65 @@
1
+ #!/usr/bin/env ruby
2
+ # -*- encoding: UTF-8 -*-
3
+
4
+ # == Description
5
+ #
6
+ # The Sanzang module contains a basic infrastructure for machine translation
7
+ # using a simple direct translation method that does not attempt to change the
8
+ # underlying grammar of the source text. The Sanzang module also contains
9
+ # functionality for preparing source texts by reformatting them in a manner
10
+ # that facilitates both machine translation and the readability of
11
+ # the final translation listing that is generated. All program source code for
12
+ # the Sanzang system is contained within the Sanzang module, with code for the
13
+ # Sanzang commands being located in the Sanzang::Command module.
14
+ #
15
+ # == Copyright
16
+ #
17
+ # Copyright (C) 2012 Lapis Lazuli Texts
18
+ #
19
+ # This program is free software: you can redistribute it and/or modify it under
20
+ # the terms of the GNU General Public License as published by the Free Software
21
+ # Foundation, either version 3 of the License, or (at your option) any later
22
+ # version.
23
+ #
24
+ # This program is distributed in the hope that it will be useful, but WITHOUT
25
+ # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
26
+ # FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
27
+ # details.
28
+ #
29
+ # You should have received a copy of the GNU General Public License along with
30
+ # this program. If not, see <http://www.gnu.org/licenses/>.
31
+ #
32
+ module Sanzang; end
33
+
34
+ require_relative File.join("sanzang", "text_formatter")
35
+ require_relative File.join("sanzang", "translation_table")
36
+ require_relative File.join("sanzang", "translator")
37
+ require_relative File.join("sanzang", "version")
38
+
39
+ # == Description
40
+ #
41
+ # The Sanzang::Command module contains Unix style commands utilizing the
42
+ # Sanzang module. Each class is typically a different command, with usage
43
+ # information given when running the command with the "-h" or "--help" options.
44
+ #
45
+ # == Copyright
46
+ #
47
+ # Copyright (C) 2012 Lapis Lazuli Texts
48
+ #
49
+ # This program is free software: you can redistribute it and/or modify it under
50
+ # the terms of the GNU General Public License as published by the Free Software
51
+ # Foundation, either version 3 of the License, or (at your option) any later
52
+ # version.
53
+ #
54
+ # This program is distributed in the hope that it will be useful, but WITHOUT
55
+ # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
56
+ # FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
57
+ # details.
58
+ #
59
+ # You should have received a copy of the GNU General Public License along with
60
+ # this program. If not, see <http://www.gnu.org/licenses/>.
61
+ #
62
+ module Sanzang::Command; end
63
+
64
+ require_relative File.join("sanzang", "command", "reflow")
65
+ require_relative File.join("sanzang", "command", "translate")
@@ -0,0 +1,136 @@
1
+ #!/usr/bin/env ruby
2
+ # -*- encoding: UTF-8 -*-
3
+ #--
4
+ # Copyright (C) 2012 Lapis Lazuli Texts
5
+ #
6
+ # This program is free software: you can redistribute it and/or modify it under
7
+ # the terms of the GNU General Public License as published by the Free Software
8
+ # Foundation, either version 3 of the License, or (at your option) any later
9
+ # version.
10
+ #
11
+ # This program is distributed in the hope that it will be useful, but WITHOUT
12
+ # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
13
+ # FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
14
+ # details.
15
+ #
16
+ # You should have received a copy of the GNU General Public License along with
17
+ # this program. If not, see <http://www.gnu.org/licenses/>.
18
+
19
+ require "optparse"
20
+
21
+ require_relative File.join("..", "text_formatter")
22
+ require_relative File.join("..", "version")
23
+
24
+ module Sanzang::Command
25
+
26
+ # The Sanzang::Command::Reflow class provides a Unix-style command for
27
+ # text reformatting. This reformatting is typically for use prior to
28
+ # processing the text with Sanzang::Command::Translate. The reason for
29
+ # this is to do initial text transformations to ensure (1) that terms will
30
+ # be translated reliably, and (2) that the final output of the translation
31
+ # will be readable by the user (i.e. lines not too long).
32
+ #
33
+ class Reflow
34
+
35
+ # Create a new instance of the Reflow class.
36
+ #
37
+ def initialize
38
+ @name = "sanzang-reflow"
39
+ @encoding = nil # resolved in set_data_encoding unless "-E" is given
40
+ @infile = nil
41
+ @outfile = nil
42
+ end
43
+
44
+ # Run the Reflow command with the given arguments. The parameter _args_
45
+ # would typically be an Array of Unix-style command parameters. Calling
46
+ # this with the "-h" or "--help" option will print full usage information
47
+ # necessary for running this command.
48
+ #
49
+ def run(args)
50
+ parser = option_parser
51
+ parser.parse!(args)
52
+
53
+ if args.length != 0
54
+ puts(parser)
55
+ return 1
56
+ end
57
+
58
+ set_data_encoding
59
+
60
+ begin
61
+ fin = @infile ? File.open(@infile, "r") : $stdin
62
+ fin.binmode.set_encoding(@encoding)
63
+ fout = @outfile ? File.open(@outfile, "w") : $stdout
64
+ fout.binmode.set_encoding(@encoding)
65
+ fout.write(Sanzang::TextFormatter.new.reflow_cjk_text(fin.read))
66
+ ensure
67
+ if fin and fin != $stdin
68
+ fin.close if not fin.closed?
69
+ end
70
+ if fout and fout != $stdout
71
+ fout.close if not fout.closed?
72
+ end
73
+ end
74
+
75
+ return 0
76
+ rescue SystemExit => err
77
+ return err.status
78
+ rescue Exception => err
79
+ $stderr.puts err.backtrace
80
+ $stderr.puts "ERROR: #{err.inspect}"
81
+ return 1
82
+ end
83
+
84
+ private
85
+
86
+ def set_data_encoding
87
+ if @encoding == nil
88
+ if Encoding.default_external == Encoding::IBM437
89
+ $stderr.puts "Switching to UTF-8 for text data encoding."
90
+ @encoding = Encoding::UTF_8
91
+ else
92
+ @encoding = Encoding.default_external
93
+ end
94
+ end
95
+ end
96
+
97
+ def option_parser
98
+ OptionParser.new do |pr|
99
+ pr.banner = "Usage: #{@name} [options]\n"
100
+
101
+ pr.banner << "\nReformat text file contents into lines based on "
102
+ pr.banner << "spacing, punctuation, etc.\n"
103
+ pr.banner << "\nExamples:\n"
104
+ pr.banner << " #{@name} -i in/mytext.txt -o out/mytext.txt\n"
105
+ pr.banner << "\nOptions:\n"
106
+
107
+ pr.on("-h", "--help", "show this help message and exit") do |v|
108
+ puts pr
109
+ exit 0
110
+ end
111
+ pr.on("-E", "--encoding=ENC", "set data encoding to ENC") do |v|
112
+ @encoding = Encoding.find(v)
113
+ end
114
+ pr.on("-L", "--list-encodings", "list possible encodings") do |v|
115
+ puts(Encoding.list.collect {|e| e.to_s }.sort)
116
+ exit 0
117
+ end
118
+ pr.on("-i", "--infile=FILE", "read input text from FILE") do |v|
119
+ @infile = v
120
+ end
121
+ pr.on("-o", "--outfile=FILE", "write output text to FILE") do |v|
122
+ @outfile = v
123
+ end
124
+ pr.on("-V", "--version", "show version number and exit") do |v|
125
+ puts "Sanzang version: #{Sanzang::VERSION}"
126
+ exit 0
127
+ end
128
+ end
129
+ end
130
+
131
+ # The standard name for the command.
132
+ #
133
+ attr_reader :name
+ public :name # the reader is defined after "private", so re-expose it
134
+
135
+ end
136
+ end