sanzang 0.0.1
- data/HACKING +54 -0
- data/LICENSE +628 -0
- data/README +280 -0
- data/bin/sanzang-reflow +21 -0
- data/bin/sanzang-translate +21 -0
- data/lib/sanzang.rb +65 -0
- data/lib/sanzang/command/reflow.rb +136 -0
- data/lib/sanzang/command/translate.rb +168 -0
- data/lib/sanzang/text_formatter.rb +71 -0
- data/lib/sanzang/translation_table.rb +113 -0
- data/lib/sanzang/translator.rb +174 -0
- data/lib/sanzang/version.rb +24 -0
- data/test/tc_commands.rb +17 -0
- data/test/tc_reflow_encodings.rb +98 -0
- data/test/tc_simple_translation.rb +97 -0
- data/test/utf-8/batch/file_1.txt +8 -0
- data/test/utf-8/batch/file_2.txt +8 -0
- data/test/utf-8/batch/file_3.txt +8 -0
- data/test/utf-8/batch/file_4.txt +8 -0
- data/test/utf-8/file_1.txt +2 -0
- data/test/utf-8/file_2.txt +2 -0
- data/test/utf-8/file_3.txt +2 -0
- data/test/utf-8/file_4.txt +2 -0
- data/test/utf-8/stage_1.txt +1 -0
- data/test/utf-8/stage_2.txt +2 -0
- data/test/utf-8/stage_3.txt +4 -0
- data/test/utf-8/table.txt +8 -0
- metadata +102 -0
data/README
ADDED
@@ -0,0 +1,280 @@
|
|
1
|
+
= Sanzang (三藏)
|
2
|
+
|
3
|
+
== Contents
|
4
|
+
|
5
|
+
* Introduction
|
6
|
+
* Concepts
|
7
|
+
* Installation
|
8
|
+
* Components
|
9
|
+
* Basic Usage
|
10
|
+
* Advanced Usage
|
11
|
+
* Responsible Use
|
12
|
+
|
13
|
+
== Introduction
|
14
|
+
|
15
|
+
Sanzang is a compact, cross-platform machine translation system. This program
|
16
|
+
was developed specifically to fill the need for a competent application for
|
17
|
+
aiding translators of the Chinese Buddhist canon into other languages. However,
|
18
|
+
the translation method it uses is general enough that it may extend to other
|
19
|
+
translation domains as well, especially those in which Chinese is the source
|
20
|
+
language. Sanzang (三藏) is a literal translation of the Sanskrit word
|
21
|
+
"Tripitaka," a general term for the Buddhist canon. Sanzang is alternately
|
22
|
+
a translation of "trepitaka," the title for someone who is a master of such
|
23
|
+
teachings.
|
24
|
+
|
25
|
+
Sanzang is implemented as a small set of programs written in the Ruby
|
26
|
+
programming language. This system is free software (“free as in freedom”), and
|
27
|
+
it is licensed under the GNU General Public License, version 3. This ensures
|
28
|
+
that anyone can use the program for any purpose, and that any extensions to
|
29
|
+
Sanzang will remain freely available to others.
|
30
|
+
|
31
|
+
== Background
|
32
|
+
|
33
|
+
The most significant difference between Sanzang and other machine translation
|
34
|
+
systems is that it does not attempt to interpret grammar in any way. Instead,
|
35
|
+
it relies on direct translation of names, terms, and phrases based on a large
|
36
|
+
translation table. The Sanzang translator simply applies this translation table
|
37
|
+
at runtime and performs no further analysis of the text. The end result is
|
38
|
+
that the accuracy of the translation is highly
|
39
|
+
dependent on the accuracy of the translation table.
|
40
|
+
|
41
|
+
The strength of the Sanzang method is that it is extremely simple and easy to
|
42
|
+
work with, and eliminates virtually all complexity in the translation process.
|
43
|
+
This system will never produce incorrect syntax because it does not interpret
|
44
|
+
syntax in the first place. This method is also efficient and yields predictable
|
45
|
+
results that can be made immediately available to the user for verification. To
|
46
|
+
facilitate this task, all translation listings are collated line-by-line with
|
47
|
+
the original source text.
|
48
|
+
|
49
|
+
== Concepts
|
50
|
+
|
51
|
+
Sanzang mainly provides a simple translation engine. For any actual
|
52
|
+
translation work, Sanzang requires a translation table in which all
|
53
|
+
translation rules are defined. This translation table is stored in a simple
|
54
|
+
text file. Each line is a record containing a source term and its equivalent
|
55
|
+
meanings in other languages. Each line starts with "~|", has fields delimited
|
56
|
+
by "|", and ends with "|~". In a table, the first column represents the source
|
57
|
+
language, while the subsequent columns represent destination languages. In
|
58
|
+
this example, we want to create a table capable of rendering the following
|
59
|
+
title into English:
|
60
|
+
|
61
|
+
金剛般若波羅蜜經
|
62
|
+
|
63
|
+
We start by creating a new text file, named TABLE.txt or something similar. In
|
64
|
+
this text file, we may add the following rules:
|
65
|
+
|
66
|
+
~|波羅蜜| pāramitā|~
|
67
|
+
~|金剛| diamond|~
|
68
|
+
~|般若| prajñā|~
|
69
|
+
~|經| sūtra|~
|
70
|
+
|
71
|
+
Did you notice that we included spaces prior to the translations of these
|
72
|
+
terms? This is because Chinese does not typically include spaces between
|
73
|
+
words, so we need to insert our own leading spaces as part of the rules we are
|
74
|
+
defining. After we have written this table file, we can run the Sanzang
|
75
|
+
translator with our table. When it reads the Chinese title as the input text,
|
76
|
+
it then produces the following translation listing:
|
77
|
+
|
78
|
+
1.1 金剛般若波羅蜜經
|
79
|
+
1.2 diamond prajñā pāramitā sūtra
|
80
|
+
|
81
|
+
The program first sorted our terms by the length of the source column, and
|
82
|
+
then applied each of these rules in sequence. It then collated the output and
|
83
|
+
created a translation listing. In the left margin, we can see numbers denoting
|
84
|
+
the line number of the source text, along with the column number of the
|
85
|
+
translation table.
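
The table format and rule application described above can be sketched in a few lines of Ruby. This is only an illustration and not Sanzang's actual implementation: the names parse_table and translate_line are hypothetical, and the sketch assumes that rules with longer source terms are applied first.

```ruby
# -*- encoding: UTF-8 -*-

# The example table from above, one record per line.
TABLE_LINES = [
  "~|波羅蜜| pāramitā|~",
  "~|金剛| diamond|~",
  "~|般若| prajñā|~",
  "~|經| sūtra|~"
]

# Hypothetical helper: split each "~|a|b|~" record into its fields.
def parse_table(lines)
  lines.map do |line|
    line = line.chomp
    unless line.start_with?("~|") && line.end_with?("|~")
      raise ArgumentError, "malformed record: #{line.inspect}"
    end
    line[2...-2].split("|")
  end
end

# Hypothetical helper: apply every rule to one line of source text,
# longest source term first (an assumption of this sketch), replacing
# each source term with the entry from the requested table column.
def translate_line(line, records, column)
  sorted = records.sort_by { |rec| -rec[0].length }
  sorted.reduce(line) { |text, rec| text.gsub(rec[0], rec[column]) }
end

records = parse_table(TABLE_LINES)
source = "金剛般若波羅蜜經"
puts "1.1 #{source}"
puts "1.2#{translate_line(source, records, 1)}"
```

Because the rules are applied one after another, a later rule could in principle match text produced by an earlier one. That is unlikely when the source is Chinese and the destination uses a Latin script, but it is worth keeping in mind when designing a table.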
|
86
|
+
|
87
|
+
As a final example, below is a snippet from an ancient meditation text, which
|
88
|
+
was also processed by the Sanzang translator in the same manner:
|
89
|
+
|
90
|
+
105.1 阿難白佛言。
|
91
|
+
105.2 ānán bái-fó-yán ¶
|
92
|
+
105.3 ānanda addressed-the-buddha-saying ¶
|
93
|
+
|
94
|
+
106.1 唯然世尊。
|
95
|
+
106.2 wéi-rán shìzūn ¶
|
96
|
+
106.3 just-so bhagavān ¶
|
97
|
+
|
98
|
+
107.1 願樂欲聞。
|
99
|
+
107.2 yuànlè-yù-wén ¶
|
100
|
+
107.3 joyfully-wish-to-hear ¶
|
101
|
+
|
102
|
+
Here we can see a three-column translation table at work. The first column has
|
103
|
+
the traditional Chinese source text, the second column contains the Pinyin
|
104
|
+
transliteration, and the third column contains English. In this example we can
|
105
|
+
see that well-defined translation rules lead to a clear translation listing,
|
106
|
+
in which the meaning of the original text is readily understandable in
|
107
|
+
English. If we wished to add additional columns for simplified Chinese,
|
108
|
+
Vietnamese, Japanese, Spanish, French, German, Russian, or any other languages,
|
109
|
+
then these could all be handled similarly without any technical difficulties.
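
A multi-column listing like the one above could be produced along the following lines. This is a sketch under stated assumptions: the three records are hypothetical (the table used for the real listing is not shown), the method name listing is invented for illustration, and longer source terms are assumed to be applied first.

```ruby
# -*- encoding: UTF-8 -*-

# Hypothetical three-column records: source, Pinyin, English.
RECORDS = [
  ["白佛言", " bái-fó-yán", " addressed-the-buddha-saying"],
  ["阿難", " ānán", " ānanda"],
  ["。", " ¶", " ¶"]
]

# Build a listing with one output line per source line per table column.
# Each output line is labeled "LINE.COLUMN", as in the listings above.
# Column 1 reproduces the source text because every rule maps a source
# term onto itself there.
def listing(lines, records)
  sorted = records.sort_by { |rec| -rec[0].length }
  columns = records.first.length
  lines.each_with_index.flat_map do |line, i|
    (0...columns).map do |col|
      text = sorted.reduce(line) { |acc, rec| acc.gsub(rec[0], rec[col]) }
      "#{i + 1}.#{col + 1} #{text}"
    end
  end
end

puts listing(["阿難白佛言。"], RECORDS)
```

Adding another destination language only requires one more field in each record; the loop over columns stays the same.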
|
110
|
+
|
111
|
+
Comprehensive translation tables could be quite large, containing tens of
|
112
|
+
thousands of entries. However, the work of building such a table is not so
|
113
|
+
significant compared to the long-term benefits which may be gained from such
|
114
|
+
tables. In addition, rules in these translation tables may be translated into
|
115
|
+
other languages as well. There is a potential here to assist readers all over
|
116
|
+
the world with understanding otherwise difficult works.
|
117
|
+
|
118
|
+
Considering the examples above, we can see that knowledge of the source
|
119
|
+
language and expertise in the relevant literary field is often still necessary.
|
120
|
+
Here again we can see that this translation system does not position itself as
|
121
|
+
a “silver bullet” for creating finished translations, but is rather a practical
|
122
|
+
set of utilities for the purpose of assisting human readers and translators.
|
123
|
+
|
124
|
+
== Installation
|
125
|
+
|
126
|
+
=== Requirements
|
127
|
+
|
128
|
+
The Sanzang system can be installed either as a Ruby gem, or manually from
|
129
|
+
an archive file. The only prerequisite to using Sanzang is:
|
130
|
+
|
131
|
+
* Ruby 1.9 or later
|
132
|
+
|
133
|
+
The "parallel" gem is required by Sanzang, but is installed automatically when
|
134
|
+
installing Sanzang using the standard method. Using the "parallel" gem, Sanzang
|
135
|
+
can support multiprocessing in batch mode (if the platform supports it).
|
136
|
+
Currently this method of multiprocessing will work automatically on Ruby ports
|
137
|
+
that implement Process.fork (the fork system call).
|
138
|
+
|
139
|
+
In addition to the actual runtime requirements, it may also be very useful to
|
140
|
+
have a text editor that is aware of Unicode and other encodings, and able to
|
141
|
+
display multilingual texts. One such application that is known to work well
|
142
|
+
for this task is the _gedit_ text editor, which is free software and also
|
143
|
+
available on a variety of platforms.
|
144
|
+
|
145
|
+
=== Installation
|
146
|
+
|
147
|
+
To install Sanzang, the following command should suffice:
|
148
|
+
|
149
|
+
# gem install sanzang
|
150
|
+
|
151
|
+
If you have installed Ruby 1.9 but cannot run the "gem" command, then you may
|
152
|
+
need to set up your PATH environment variable first, so you can run _ruby_ and
|
153
|
+
_gem_ from the command line.
|
154
|
+
|
155
|
+
== Components
|
156
|
+
|
157
|
+
The programs in Sanzang are designed in a traditional Unix style in which
|
158
|
+
programs are executed in a terminal, and program settings are specified
|
159
|
+
through command line options and parameters. This allows Sanzang programs
|
160
|
+
to be easily scripted and automated.
|
161
|
+
|
162
|
+
=== sanzang-reflow
|
163
|
+
|
164
|
+
The program sanzang-reflow can reformat Chinese, Japanese, or Korean text, in
|
165
|
+
which terms are often split between lines. This formatter "reflows" the text
|
166
|
+
instead based on its punctuation and horizontal spacing, separating the source
|
167
|
+
text into lines that are much safer for translation using the sanzang-translate
|
168
|
+
program.
|
169
|
+
|
170
|
+
Usage: sanzang-reflow [options]
|
171
|
+
|
172
|
+
Options:
|
173
|
+
-h, --help show this help message and exit
|
174
|
+
-E, --encoding=ENC set data encoding to ENC
|
175
|
+
-L, --list-encodings list possible encodings
|
176
|
+
-i, --infile=FILE read input text from FILE
|
177
|
+
-o, --outfile=FILE write output text to FILE
|
178
|
+
-V, --version show version number and exit
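
As a rough illustration of the reflow idea described for sanzang-reflow, the sketch below joins hard-wrapped lines and then breaks the text again after sentence-ending punctuation. It is an assumption about the general approach and not the actual Sanzang::TextFormatter logic; the method name reflow_sketch and the chosen punctuation set are hypothetical.

```ruby
# -*- encoding: UTF-8 -*-

# Hypothetical reflow: remove the original line breaks, which may split
# terms in the middle, then start a new line after each run of
# sentence-ending punctuation.
def reflow_sketch(text)
  joined = text.gsub(/\r?\n/, "")
  joined.gsub(/[。！？；]+/) { |marks| "#{marks}\n" }
end

# The term 白佛言 is split across two lines in this input.
print reflow_sketch("阿難白\n佛言。唯然\n世尊。")
```

After this step no term straddles a line break, so every rule in the translation table has a chance to match.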
|
179
|
+
|
180
|
+
=== sanzang-translate
|
181
|
+
|
182
|
+
The program sanzang-translate (1) reads a translation table file, (2) applies
|
183
|
+
this table's rules to an input text, and then (3) generates a translation
|
184
|
+
listing. This program can also run in a special batch mode that can utilize
|
185
|
+
multiprocessing (multiple processors and processor cores) for high
|
186
|
+
performance.
|
187
|
+
|
188
|
+
Usage: sanzang-translate [options] table
|
189
|
+
Usage: sanzang-translate -B output_dir table < file_list
|
190
|
+
|
191
|
+
Options:
|
192
|
+
-h, --help show this help message and exit
|
193
|
+
-B, --batch-dir=DIR process from a queue into DIR
|
194
|
+
-E, --encoding=ENC set data encoding to ENC
|
195
|
+
-L, --list-encodings list possible encodings
|
196
|
+
-i, --infile=FILE read input text from FILE
|
197
|
+
-o, --outfile=FILE write output text to FILE
|
198
|
+
-P, --platform show platform information
|
199
|
+
-V, --version show version number and exit
|
200
|
+
|
201
|
+
== Basic Usage
|
202
|
+
|
203
|
+
=== Formatting and translating a single text
|
204
|
+
|
205
|
+
In the following example, we are working with a small text that we want to
|
206
|
+
translate. With the first command, we reformat the text using sanzang-reflow.
|
207
|
+
Then we run the sanzang-translate program with our translation table, to
|
208
|
+
generate a translation listing.
|
209
|
+
|
210
|
+
$ sanzang-reflow -i xinjing.txt -o lines.txt
|
211
|
+
$ sanzang-translate -i lines.txt -o trans.txt TABLE.txt
|
212
|
+
|
213
|
+
=== Redirecting I/O
|
214
|
+
|
215
|
+
The next two commands illustrate how these programs use standard input and
|
216
|
+
output streams by default, and how they can easily operate as text filters.
|
217
|
+
|
218
|
+
$ sanzang-reflow -i xinjing.txt | sanzang-translate -o trans.txt TABLE.txt
|
219
|
+
$ cat xinjing.txt | sanzang-reflow | sanzang-translate TABLE.txt | less
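
The filter behavior shown above comes down to falling back on the standard streams when no file is named. A minimal sketch of that pattern follows; the helper name with_io is hypothetical, and the upcasing step merely stands in for real text processing.

```ruby
require "tmpdir"

# Hypothetical helper: open the named files, or fall back to the
# standard streams, and make sure that only files we opened get closed.
def with_io(infile = nil, outfile = nil)
  fin = infile ? File.open(infile, "r") : $stdin
  fout = outfile ? File.open(outfile, "w") : $stdout
  yield fin, fout
ensure
  fin.close if fin && fin != $stdin && !fin.closed?
  fout.close if fout && fout != $stdout && !fout.closed?
end

# Demonstration with temporary files; calling with_io without
# arguments would read $stdin and write $stdout instead.
Dir.mktmpdir do |dir|
  src = File.join(dir, "in.txt")
  dst = File.join(dir, "out.txt")
  File.write(src, "hello\n")
  with_io(src, dst) { |fin, fout| fout.write(fin.read.upcase) }
  puts File.read(dst)
end
```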
|
220
|
+
|
221
|
+
== Advanced Usage
|
222
|
+
|
223
|
+
=== Batch Mode and Multiprocessing
|
224
|
+
|
225
|
+
In the following example, we may have several thousand texts that we want to
|
226
|
+
run through sanzang-translate with our translation table. For example, if our
|
227
|
+
translation table was updated recently, we may want to regenerate our corpus
|
228
|
+
of translation listings. To do this, we can use the "find" command to retrieve
|
229
|
+
the file paths to our text files, and then pipe that output into the Sanzang
|
230
|
+
translation program.
|
231
|
+
|
232
|
+
$ find /srv/texts -type f | sanzang-translate -B /srv/trans TABLE.txt
|
233
|
+
|
234
|
+
This command will find all files in the location specified, and then feed the
|
235
|
+
file paths to sanzang-translate, which will process them as a batch. If the
|
236
|
+
"parallel" gem is available and functioning on the system, then the batch will
|
237
|
+
be divided among all available processors.
|
238
|
+
|
239
|
+
If this gem has been installed and we run in batch mode on a machine with six
|
240
|
+
CPU cores, then we should be able to expect six
|
241
|
+
translation processes running concurrently. The exception to this is on the
|
242
|
+
"mswin" and "mingw" platforms, which do not have the necessary system calls
|
243
|
+
for Unix style multiprocessing. In this case, running Sanzang in the
|
244
|
+
Cygwin environment is a viable alternative.
|
245
|
+
|
246
|
+
The performance benefits of running with the "parallel" library can be very
|
247
|
+
significant, leading to a series of translation listings being generated in a
|
248
|
+
mere fraction of the time it would take to process them otherwise. This
|
249
|
+
performance gain is typically proportional to the number of processors and
|
250
|
+
processor cores available on the local system.
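
To make the idea of dividing a batch among processes more concrete, here is a sketch that uses only Ruby's standard library. Sanzang itself relies on the "parallel" gem, so this is a different and simpler technique shown for illustration; the names process_file and process_batch are hypothetical, and upcasing stands in for translation.

```ruby
require "etc"
require "tmpdir"

# Hypothetical stand-in for translating one file: writes an upcased
# copy with the same base name into out_dir.
def process_file(path, out_dir)
  File.write(File.join(out_dir, File.basename(path)), File.read(path).upcase)
end

# Split the paths into one slice per worker and handle each slice in a
# forked child process. Without fork support, run sequentially.
def process_batch(paths, out_dir, workers = Etc.nprocessors)
  unless Process.respond_to?(:fork)
    paths.each { |path| process_file(path, out_dir) }
    return
  end
  slice_size = [(paths.length / workers.to_f).ceil, 1].max
  pids = paths.each_slice(slice_size).map do |slice|
    fork do
      slice.each { |path| process_file(path, out_dir) }
      exit!(0) # leave the child without running the parent's exit hooks
    end
  end
  pids.each { |pid| Process.wait(pid) }
end

Dir.mktmpdir do |dir|
  in_dir = File.join(dir, "texts")
  out_dir = File.join(dir, "trans")
  Dir.mkdir(in_dir)
  Dir.mkdir(out_dir)
  paths = (1..4).map do |n|
    path = File.join(in_dir, "file_#{n}.txt")
    File.write(path, "text #{n}\n")
    path
  end
  process_batch(paths, out_dir)
  puts Dir.glob(File.join(out_dir, "*")).map { |p| File.basename(p) }.sort
end
```

Note that this sketch names output files by base name only, so input files with the same name in different directories would overwrite one another.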
|
251
|
+
|
252
|
+
=== Text Encodings
|
253
|
+
|
254
|
+
Sanzang supports many possible text encodings. Option "-L" will list all
|
255
|
+
available text encodings. Option "-E" will set the encoding to be used for all
|
256
|
+
text data such as input texts, output texts, and table files. The other program
|
257
|
+
I/O, such as messages for the terminal, will still be in the default encoding
|
258
|
+
of the environment. For example, in a Windows environment that by default uses
|
259
|
+
the IBM-437 encoding, specifying "-E" with a value of "UTF-16LE" will cause
|
260
|
+
Sanzang to read and write all text data in UTF-16LE, but all other program
|
261
|
+
messages will still be displayed in the console's native IBM-437 encoding.
|
262
|
+
|
263
|
+
$ sanzang-translate -E UTF-16LE -i in.txt -o out.txt TABLE.txt
|
264
|
+
|
265
|
+
If the "-E" option is not specified, then Sanzang will use the default encoding
|
266
|
+
inherited from the environment. For example, a GNU/Linux user running Sanzang in
|
267
|
+
a UTF-8 terminal will by default have all text data read and written in the
|
268
|
+
UTF-8 encoding. The one *exception* to this is for environments using the
|
269
|
+
IBM-437 encoding (typically an old Windows command shell). In this case,
|
270
|
+
Sanzang will take pity on you and automatically switch to UTF-8 by default, as
|
271
|
+
if you had specified the option "-E" with value "UTF-8".
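
The encoding rules described above can be summarized in a short sketch. The method name data_encoding and its parameters are hypothetical; only Encoding.find, Encoding.default_external, and the encoding constants are standard Ruby.

```ruby
# Hypothetical helper: choose the encoding for text data.
# requested: the value given to "-E", or nil if the option was omitted.
# default:   the encoding inherited from the environment.
def data_encoding(requested = nil, default = Encoding.default_external)
  return Encoding.find(requested) if requested
  # Special case: an IBM-437 console falls back to UTF-8 for text data.
  default == Encoding::IBM437 ? Encoding::UTF_8 : default
end

puts data_encoding("UTF-16LE")
puts data_encoding(nil, Encoding::IBM437)
puts data_encoding(nil, Encoding::UTF_8)
```

Only text data is affected by this choice; messages for the terminal continue to use the environment's own encoding.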
|
272
|
+
|
273
|
+
== Responsible Use
|
274
|
+
|
275
|
+
With comprehensive translation tables, Sanzang can often be quite accurate
|
276
|
+
and effective. However, this program is still comparable to a simple machine,
|
277
|
+
and it can never replace a human translator. Please understand the scope of
|
278
|
+
this translation system when using it. No machines can take responsibility for
|
279
|
+
a poor translation. In the end, it is you who are responsible for any and all
|
280
|
+
publications.
|
data/bin/sanzang-reflow
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
# -*- encoding: UTF-8 -*-
|
3
|
+
#--
|
4
|
+
# Copyright (C) 2012 Lapis Lazuli Texts
|
5
|
+
#
|
6
|
+
# This program is free software: you can redistribute it and/or modify it under
|
7
|
+
# the terms of the GNU General Public License as published by the Free Software
|
8
|
+
# Foundation, either version 3 of the License, or (at your option) any later
|
9
|
+
# version.
|
10
|
+
#
|
11
|
+
# This program is distributed in the hope that it will be useful, but WITHOUT
|
12
|
+
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
|
13
|
+
# FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
|
14
|
+
# details.
|
15
|
+
#
|
16
|
+
# You should have received a copy of the GNU General Public License along with
|
17
|
+
# this program. If not, see <http://www.gnu.org/licenses/>.
|
18
|
+
|
19
|
+
require_relative File.join("..", "lib", "sanzang")
|
20
|
+
|
21
|
+
Kernel.exit(Sanzang::Command::Reflow.new.run(ARGV))
|
data/bin/sanzang-translate
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
# -*- encoding: UTF-8 -*-
|
3
|
+
#--
|
4
|
+
# Copyright (C) 2012 Lapis Lazuli Texts
|
5
|
+
#
|
6
|
+
# This program is free software: you can redistribute it and/or modify it under
|
7
|
+
# the terms of the GNU General Public License as published by the Free Software
|
8
|
+
# Foundation, either version 3 of the License, or (at your option) any later
|
9
|
+
# version.
|
10
|
+
#
|
11
|
+
# This program is distributed in the hope that it will be useful, but WITHOUT
|
12
|
+
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
|
13
|
+
# FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
|
14
|
+
# details.
|
15
|
+
#
|
16
|
+
# You should have received a copy of the GNU General Public License along with
|
17
|
+
# this program. If not, see <http://www.gnu.org/licenses/>.
|
18
|
+
|
19
|
+
require_relative File.join("..", "lib", "sanzang")
|
20
|
+
|
21
|
+
Kernel.exit(Sanzang::Command::Translate.new.run(ARGV))
|
data/lib/sanzang.rb
ADDED
@@ -0,0 +1,65 @@
|
|
1
|
+
#!/usr/bin/env ruby -w
|
2
|
+
# -*- encoding: UTF-8 -*-
|
3
|
+
|
4
|
+
# == Description
|
5
|
+
#
|
6
|
+
# The Sanzang module contains a basic infrastructure for machine translation
|
7
|
+
# using a simple direct translation method that does not attempt to change the
|
8
|
+
# underlying grammar of the source text. The Sanzang module also contains
|
9
|
+
# functionality for preparing source texts by reformatting them in a manner
|
10
|
+
# that facilitates both machine translation and the readability of
|
11
|
+
# the final translation listing that is generated. All program source code for
|
12
|
+
# the Sanzang system is contained within the Sanzang module, with code for the
|
13
|
+
# Sanzang commands being located in the Sanzang::Command module.
|
14
|
+
#
|
15
|
+
# == Copyright
|
16
|
+
#
|
17
|
+
# Copyright (C) 2012 Lapis Lazuli Texts
|
18
|
+
#
|
19
|
+
# This program is free software: you can redistribute it and/or modify it under
|
20
|
+
# the terms of the GNU General Public License as published by the Free Software
|
21
|
+
# Foundation, either version 3 of the License, or (at your option) any later
|
22
|
+
# version.
|
23
|
+
#
|
24
|
+
# This program is distributed in the hope that it will be useful, but WITHOUT
|
25
|
+
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
|
26
|
+
# FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
|
27
|
+
# details.
|
28
|
+
#
|
29
|
+
# You should have received a copy of the GNU General Public License along with
|
30
|
+
# this program. If not, see <http://www.gnu.org/licenses/>.
|
31
|
+
#
|
32
|
+
module Sanzang; end
|
33
|
+
|
34
|
+
require_relative File.join("sanzang", "text_formatter")
|
35
|
+
require_relative File.join("sanzang", "translation_table")
|
36
|
+
require_relative File.join("sanzang", "translator")
|
37
|
+
require_relative File.join("sanzang", "version")
|
38
|
+
|
39
|
+
# == Description
|
40
|
+
#
|
41
|
+
# The Sanzang::Command module contains Unix style commands utilizing the
|
42
|
+
# Sanzang module. Each class is typically a different command, with usage
|
43
|
+
# information given when running the command with the "-h" or "--help" options.
|
44
|
+
#
|
45
|
+
# == Copyright
|
46
|
+
#
|
47
|
+
# Copyright (C) 2012 Lapis Lazuli Texts
|
48
|
+
#
|
49
|
+
# This program is free software: you can redistribute it and/or modify it under
|
50
|
+
# the terms of the GNU General Public License as published by the Free Software
|
51
|
+
# Foundation, either version 3 of the License, or (at your option) any later
|
52
|
+
# version.
|
53
|
+
#
|
54
|
+
# This program is distributed in the hope that it will be useful, but WITHOUT
|
55
|
+
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
|
56
|
+
# FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
|
57
|
+
# details.
|
58
|
+
#
|
59
|
+
# You should have received a copy of the GNU General Public License along with
|
60
|
+
# this program. If not, see <http://www.gnu.org/licenses/>.
|
61
|
+
#
|
62
|
+
module Sanzang::Command; end
|
63
|
+
|
64
|
+
require_relative File.join("sanzang", "command", "reflow")
|
65
|
+
require_relative File.join("sanzang", "command", "translate")
|
data/lib/sanzang/command/reflow.rb
ADDED
@@ -0,0 +1,136 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
# -*- encoding: UTF-8 -*-
|
3
|
+
#--
|
4
|
+
# Copyright (C) 2012 Lapis Lazuli Texts
|
5
|
+
#
|
6
|
+
# This program is free software: you can redistribute it and/or modify it under
|
7
|
+
# the terms of the GNU General Public License as published by the Free Software
|
8
|
+
# Foundation, either version 3 of the License, or (at your option) any later
|
9
|
+
# version.
|
10
|
+
#
|
11
|
+
# This program is distributed in the hope that it will be useful, but WITHOUT
|
12
|
+
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
|
13
|
+
# FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
|
14
|
+
# details.
|
15
|
+
#
|
16
|
+
# You should have received a copy of the GNU General Public License along with
|
17
|
+
# this program. If not, see <http://www.gnu.org/licenses/>.
|
18
|
+
|
19
|
+
require "optparse"
|
20
|
+
|
21
|
+
require_relative File.join("..", "text_formatter")
|
22
|
+
require_relative File.join("..", "version")
|
23
|
+
|
24
|
+
module Sanzang::Command
|
25
|
+
|
26
|
+
# The Sanzang::Command::Reflow class provides a Unix-style command for
|
27
|
+
# text reformatting. This reformatting is typically for use prior to
|
28
|
+
# processing the text with Sanzang::Command::Translate. The reason for
|
29
|
+
# this is to do initial text transformations to ensure (1) that terms will
|
30
|
+
# be translated reliably, and (2) that the final output of the translation
|
31
|
+
# will be readable by the user (i.e. lines not too long).
|
32
|
+
#
|
33
|
+
class Reflow
|
34
|
+
|
35
|
+
# Create a new instance of the Reflow class.
|
36
|
+
#
|
37
|
+
def initialize
|
38
|
+
@name = "sanzang-reflow"
|
39
|
+
@encoding = nil # resolved later by set_data_encoding unless "-E" is given
|
40
|
+
@infile = nil
|
41
|
+
@outfile = nil
|
42
|
+
end
|
43
|
+
|
44
|
+
# Run the Reflow command with the given arguments. The parameter _args_
|
45
|
+
# would typically be an Array of Unix-style command parameters. Calling
|
46
|
+
# this with the "-h" or "--help" option will print full usage information
|
47
|
+
# necessary for running this command.
|
48
|
+
#
|
49
|
+
def run(args)
|
50
|
+
parser = option_parser
|
51
|
+
parser.parse!(args)
|
52
|
+
|
53
|
+
if args.length != 0
|
54
|
+
puts(parser)
|
55
|
+
return 1
|
56
|
+
end
|
57
|
+
|
58
|
+
set_data_encoding
|
59
|
+
|
60
|
+
begin
|
61
|
+
fin = @infile ? File.open(@infile, "r") : $stdin
|
62
|
+
fin.binmode.set_encoding(@encoding)
|
63
|
+
fout = @outfile ? File.open(@outfile, "w") : $stdout
|
64
|
+
fout.binmode.set_encoding(@encoding)
|
65
|
+
fout.write(Sanzang::TextFormatter.new.reflow_cjk_text(fin.read))
|
66
|
+
ensure
|
67
|
+
if fin and fin != $stdin
|
68
|
+
fin.close if not fin.closed?
|
69
|
+
end
|
70
|
+
if fout and fout != $stdout
|
71
|
+
fout.close if not fout.closed?
|
72
|
+
end
|
73
|
+
end
|
74
|
+
|
75
|
+
return 0
|
76
|
+
rescue SystemExit => err
|
77
|
+
return err.status
|
78
|
+
rescue Exception => err
|
79
|
+
$stderr.puts err.backtrace
|
80
|
+
$stderr.puts "ERROR: #{err.inspect}"
|
81
|
+
return 1
|
82
|
+
end
|
83
|
+
|
84
|
+
private
|
85
|
+
|
86
|
+
def set_data_encoding
|
87
|
+
if @encoding == nil
|
88
|
+
if Encoding.default_external == Encoding::IBM437
|
89
|
+
$stderr.puts "Switching to UTF-8 for text data encoding."
|
90
|
+
@encoding = Encoding::UTF_8
|
91
|
+
else
|
92
|
+
@encoding = Encoding.default_external
|
93
|
+
end
|
94
|
+
end
|
95
|
+
end
|
96
|
+
|
97
|
+
def option_parser
|
98
|
+
OptionParser.new do |pr|
|
99
|
+
pr.banner = "Usage: #{@name} [options]\n"
|
100
|
+
|
101
|
+
pr.banner << "\nReformat text file contents into lines based on "
|
102
|
+
pr.banner << "spacing, punctuation, etc.\n"
|
103
|
+
pr.banner << "\nExamples:\n"
|
104
|
+
pr.banner << " #{@name} -i in/mytext.txt -o out/mytext.txt\n"
|
105
|
+
pr.banner << "\nOptions:\n"
|
106
|
+
|
107
|
+
pr.on("-h", "--help", "show this help message and exit") do |v|
|
108
|
+
puts pr
|
109
|
+
exit 0
|
110
|
+
end
|
111
|
+
pr.on("-E", "--encoding=ENC", "set data encoding to ENC") do |v|
|
112
|
+
@encoding = Encoding.find(v)
|
113
|
+
end
|
114
|
+
pr.on("-L", "--list-encodings", "list possible encodings") do |v|
|
115
|
+
puts(Encoding.list.collect {|e| e.to_s }.sort)
|
116
|
+
exit 0
|
117
|
+
end
|
118
|
+
pr.on("-i", "--infile=FILE", "read input text from FILE") do |v|
|
119
|
+
@infile = v
|
120
|
+
end
|
121
|
+
pr.on("-o", "--outfile=FILE", "write output text to FILE") do |v|
|
122
|
+
@outfile = v
|
123
|
+
end
|
124
|
+
pr.on("-V", "--version", "show version number and exit") do |v|
|
125
|
+
puts "Sanzang version: #{Sanzang::VERSION}"
|
126
|
+
exit 0
|
127
|
+
end
|
128
|
+
end
|
129
|
+
end
|
130
|
+
|
131
|
+
# The standard name for the command.
|
132
|
+
#
|
133
|
+
attr_reader :name
|
134
|
+
|
135
|
+
end
|
136
|
+
end
|