sanzang 1.1.0 → 1.1.1
- checksums.yaml +4 -4
- data/HACKING.rdoc +7 -0
- data/NEWS.rdoc +80 -0
- data/README.rdoc +1 -1
- data/lib/sanzang/platform.rb +1 -1
- data/lib/sanzang/translation_table.rb +12 -7
- data/lib/sanzang/translator.rb +8 -6
- data/lib/sanzang/version.rb +2 -2
- data/test/tc_simple_translation.rb +4 -4
- data/test/utf-8/stage_3.txt +2 -2
- metadata +10 -9
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f5ca5ec3584fad6ed5ef48cce5338db925441921
+  data.tar.gz: 978ca3153152f25b8f0a9d61ae328296eb18b291
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d6d8cf630f82b81227946a005664e0afcbec73fd8209dffd554c8135e911311b8ec08878288d31f489d73b0c42cec3e26bbbdea6e697485ba3f105c51b332cf7
+  data.tar.gz: 8c1d3c7b1d1667f6ca778d8dc80dbf5171b54bd7b76432d75eca93d714db724904b279383954b8da0760b1a9a30a3ccb903a320fac157e167f7a83e5e484e392
data/HACKING.rdoc
CHANGED
@@ -38,3 +38,10 @@ Converters for several encodings have not yet been implemented by MRI. Most of
 these are obscure and not widely used. Perhaps the most notable is EUC-TW,
 which is an old Unix encoding for traditional Chinese. Text encodings that
 cannot be converted to and from UTF-8 are not currently supported.
+
+== Reserved characters
+
+\Sanzang internally uses an ASCII control character as a temporary marker. The
+following character will be removed from any translated text:
+
+* 0x1F -- "US" -- "Unit separator"
data/NEWS.rdoc
ADDED
@@ -0,0 +1,80 @@
+= News
+
+== Release History
+
+=== v1.1.1
+* Updated horizontal space handling to be more robust.
+* Horizontal spaces will not be added at the end of any lines.
+* Fixed transcoding logic for when translate_io parameters are file paths.
+* Documentation and build file revisions.
+
+=== v1.1.0
+* New feature -- automatic spacing between translated terms.
+* Updated Sanzang::Platform for greater compatibility between Ruby versions.
+
+=== v1.0.9
+* Limiting \Sanzang on JRuby to UTF-8 -- JRuby encoding support is limited.
+* Minor encoding handling fixes.
+
+=== v1.0.8
+* Added support for a SANZANG_ENCODING environment variable.
+* Documentation updates.
+* Minor output formatting fixes.
+* Fixed TypeError exception raised when listing encodings in Ruby 2.1.
+
+=== v1.0.7
+* Fixes for I/O exception handling logic and file descriptor handling.
+* Fixed processor counting logic on MS Windows platforms.
+
+=== v1.0.6
+* Fixed a bug in file descriptor handling when the FD is nil.
+* Rakefile updates for greater portability and accepting non-GNU tar.
+
+=== v1.0.5
+* Faster translation table loading.
+* Added support for JRuby including multithreaded batches.
+* Rakefile will not attempt to build tar archives by default.
+* Gemfile updates and revisions for better requirements specification.
+
+=== v1.0.4
+* Introduced buffered I/O for better performance and memory usage.
+* Added the Sanzang::Platform module for more accessing system information.
+
+=== v1.0.3
+* Test case file updates following encoding handling changes.
+* Rakefile updates for greater portability (aiming at BSD compatibility).
+* Using UTF-8 as the default encoding for ASCII and IBM CP terminals.
+
+=== v1.0.2
+* Encoding list should only display those that can convert to UTF-8.
+* Encoding fixes when transcoding. Using UTF-8 internally for translation.
+* Added a "verbose mode" for debugging.
+
+=== v1.0.1
+* Reflow command will only list encodings that can be converted to UTF-8.
+* Rewrote Sanzang::Translator#translate_io for simpler file handling.
+* Pipe handling updated to break quietly rather than report an error.
+* Added additional checks for translation table formatting.
+
+=== v1.0.0
+* Many documentation updates and additions.
+* Consolidated multiple executables into a single "sanzang" command suite.
+* Previous versions of \Sanzang should be uninstalled before installing this.
+
+=== v0.0.4
+* Error message formatting.
+* Enabled case-insensitive sorting for encodings list.
+
+=== v0.0.3
+* Source code formatting.
+* Fixed usage message error for the translate command.
+
+=== v0.0.2
+* Updated Parallel requirement to v0.5.19, following a SIGINT bug.
+* Rakefile additions and revisions for robustness.
+* Added empty file to batch testing directory.
+* Added README.md file.
+* Fixed file permissions.
+
+=== v0.0.1
+* Initial commit to version control, and the first release.
data/README.rdoc
CHANGED
@@ -40,6 +40,6 @@ command to verify your installation and print version information.
 
 This command should show a summary of your \Sanzang version and environment.
 
-sanzang 1.
+sanzang 1.1.1 (UTF-8) ruby-2.1.0p0 x86_64-linux
 
 You now have \Sanzang installed on your computer.
data/lib/sanzang/platform.rb
CHANGED
@@ -1,6 +1,6 @@
 # coding: UTF-8
 #--
-# Copyright (C) 2012-
+# Copyright (C) 2012-2014 Lapis Lazuli Texts
 #
 # This program is free software: you can redistribute it and/or modify it under
 # the terms of the GNU General Public License as published by the Free Software
data/lib/sanzang/translation_table.rb
CHANGED
@@ -1,6 +1,6 @@
 # coding: UTF-8
 #--
-# Copyright (C) 2012-
+# Copyright (C) 2012-2014 Lapis Lazuli Texts
 #
 # This program is free software: you can redistribute it and/or modify it under
 # the terms of the GNU General Public License as published by the Free Software
@@ -23,12 +23,19 @@ module Sanzang
   #
   class TranslationTable
 
+    # The records for the translation table, as an array
+    #
+    attr_reader :records
+
+    # Original encoding when the table was read
+    #
+    attr_reader :source_encoding
+
     # A table is created from a formatted string of translation rules. The
     # string is in the format of delimited text. The text format can be
     # summarized as follows:
     #
     # - Each line of text is a record for a translation rule.
-    # - Each record may begin with "~|" and end with "|~".
     # - Fields in the record are separated by the "|" character.
     # - The first field contains the term in the source language.
     # - Subsequent fields are equivalent terms in destination languages.
@@ -41,7 +48,9 @@ module Sanzang
     #
     def initialize(rules)
       contents = rules.kind_of?(String) ? rules : rules.read
+      @source_encoding = contents.encoding
       contents.encode!(Encoding::UTF_8)
+
       contents.strip! # Rm outside empty lines
       contents.gsub!("~|", "") # Rm left delimiter
       contents.gsub!("|~", "") # Rm right delimiter
@@ -72,7 +81,7 @@ module Sanzang
       @records[index]
     end
 
-    # The text encoding used for all translation table data
+    # The text encoding used internally for all translation table data
     #
     def encoding
       Encoding::UTF_8
@@ -96,9 +105,5 @@ module Sanzang
       @records[0].length
     end
 
-    # The records for the translation table, as an array
-    #
-    attr_reader :records
-
   end
 end
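
Note on the source_encoding change above: encode! converts the string in place, so the original encoding has to be read before the conversion or it is lost. A minimal sketch of that ordering, assuming a hypothetical helper named to_utf8_with_origin (not part of the gem) and Big5 sample text chosen only for illustration:

```ruby
# Hypothetical helper for illustration; the gem does this inside
# TranslationTable#initialize rather than in a standalone method.
def to_utf8_with_origin(text)
  contents = String.new(text)           # work on a copy, leave the caller's string alone
  source_encoding = contents.encoding   # must be captured BEFORE encode!
  contents.encode!(Encoding::UTF_8)     # in-place conversion; contents.encoding is now UTF-8
  [contents, source_encoding]
end

# Sample input in a non-UTF-8 encoding (an assumption for the demo).
big5_text = "大唐".encode("Big5")
utf8_text, original = to_utf8_with_origin(big5_text)

puts utf8_text.encoding                 # the internal working encoding
puts original                           # the encoding the text arrived in
puts utf8_text.encode(original) == big5_text
```

Keeping the original encoding around is what lets translate_io open file paths with the encoding of the table as it was read, instead of the internal UTF-8 value that the encoding method always returns.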
data/lib/sanzang/translator.rb
CHANGED
@@ -1,6 +1,6 @@
 # coding: UTF-8
 #--
-# Copyright (C) 2012-
+# Copyright (C) 2012-2014 Lapis Lazuli Texts
 #
 # This program is free software: you can redistribute it and/or modify it under
 # the terms of the GNU General Public License as published by the Free Software
@@ -56,10 +56,12 @@ module Sanzang
       vocab_terms = text_vocab(source_text)
       1.upto(@table.width - 1) do |column_i|
         translation = String.new(source_text)
+        translation.delete!("\x1F")
         vocab_terms.each do |term|
-          translation.gsub!(term[0], "\
+          translation.gsub!(term[0], "\x1F#{term[column_i]}\x1F")
         end
-        translation.gsub!(/\
+        translation.gsub!(/\x1F(?=[\r\n])/, "")
+        translation.gsub!(/\x1F+/, " ")
         text_collection << translation
       end
       text_collection
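
Note on the marker logic above: each translated term is wrapped in the reserved 0x1F character, markers that sit directly before a line break are dropped, and every remaining run of markers collapses to one space. A minimal standalone sketch of the same steps, assuming a hypothetical method name translate_with_spacing and a plain array of [term, replacement] pairs in place of the gem's table and text_vocab:

```ruby
# Hypothetical stand-in for the loop body in the hunk above; the method name
# and the rules array are assumptions, not the gem's API.
def translate_with_spacing(source_text, rules)
  translation = String.new(source_text)
  translation.delete!("\x1F")               # strip any reserved marker already in the input
  rules.each do |term, replacement|
    # Wrap each replacement in markers so adjacent terms stay separated.
    translation.gsub!(term, "\x1F#{replacement}\x1F")
  end
  translation.gsub!(/\x1F(?=[\r\n])/, "")   # no space is added at the end of a line
  translation.gsub!(/\x1F+/, " ")           # one space per run of markers
  translation
end

rules = [["三藏", "sānzàng"], ["法師", "fǎshī"]]
puts translate_with_spacing("三藏法師\r\n", rules).inspect
```

In this sketch a marker at the very start of the text still becomes a leading space; only markers immediately before "\r" or "\n" are removed, which matches the two substitutions shown in the hunk.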
@@ -95,17 +97,17 @@ module Sanzang
     #
     def translate_io(input, output)
       if input.kind_of?(String)
-        io_in = File.open(input, "rb", encoding: @table.
+        io_in = File.open(input, "rb", encoding: @table.source_encoding)
       else
         io_in = input
       end
       if output.kind_of?(String)
-        io_out = File.open(output, "wb", encoding: @table.
+        io_out = File.open(output, "wb", encoding: @table.source_encoding)
       else
         io_out = output
       end
 
-      buf_size =
+      buf_size = 100
       buffer = ""
       io_in.each do |line|
         buffer << line
data/lib/sanzang/version.rb
CHANGED
@@ -1,6 +1,6 @@
 # coding: UTF-8
 #--
-# Copyright (C) 2012-
+# Copyright (C) 2012-2014 Lapis Lazuli Texts
 #
 # This program is free software: you can redistribute it and/or modify it under
 # the terms of the GNU General Public License as published by the Free Software
@@ -19,6 +19,6 @@ module Sanzang
 
   # Current version number of Sanzang
   #
-  VERSION = "1.1.
+  VERSION = "1.1.1"
 
 end
data/test/tc_simple_translation.rb
CHANGED
@@ -27,13 +27,13 @@ class TestSanzang < Test::Unit::TestCase
 
   def stage_3
     "[1.1] 大唐三藏法師玄奘奉\r\n" \
-      << "[1.2] dà táng sānzàng fǎshī xuánzàng fèng
+      << "[1.2] dà táng sānzàng fǎshī xuánzàng fèng\r\n" \
       << "[1.3] great tang tripiṭaka dharma-master xuanzang " \
-      << "reverently
+      << "reverently\r\n" \
       << "\r\n" \
       << "[2.1] 詔譯\r\n" \
-      << "[2.2] zhào
-      << "[2.3] imperial-order translate/interpret
+      << "[2.2] zhào yì\r\n" \
+      << "[2.3] imperial-order translate/interpret\r\n" \
       << "\r\n"
   end
 
data/test/utf-8/stage_3.txt
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: sanzang
 version: !ruby/object:Gem::Version
-  version: 1.1.
+  version: 1.1.1
 platform: ruby
 authors:
 - Lapis Lazuli Texts
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-01-
+date: 2014-01-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: parallel
@@ -24,12 +24,11 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '0.8'
-description: Sanzang is a
-
-
-
-
-under the GNU GPL, version 3.
+description: Sanzang is a compact and simple cross-platform machine translation system.
+  It was designed especially for translating from the CJK languages (Chinese, Japanese,
+  and Korean), and it is suitable even for translating from ancient texts. Sanzang
+  is implemented as a Unix style command suite program, with each subcommand carrying
+  out a major function of the system.
 email: lapislazulitexts@gmail.com
 executables:
 - sanzang
@@ -38,11 +37,13 @@ extra_rdoc_files:
 - HACKING.rdoc
 - LICENSE.rdoc
 - MANUAL.rdoc
+- NEWS.rdoc
 - README.rdoc
 files:
 - HACKING.rdoc
 - LICENSE.rdoc
 - MANUAL.rdoc
+- NEWS.rdoc
 - README.rdoc
 - bin/sanzang
 - lib/sanzang.rb
@@ -91,7 +92,7 @@ rubyforge_project:
 rubygems_version: 2.2.0
 signing_key:
 specification_version: 4
-summary: Machine translation
+summary: Machine translation from CJK languages
 test_files:
 - test/tc_reflow_encodings.rb
 - test/tc_simple_translation.rb