wenlin_db_scanner 0.2.0
- data/.document +5 -0
- data/Gemfile +13 -0
- data/Gemfile.lock +23 -0
- data/LICENSE.txt +122 -0
- data/README.md +102 -0
- data/Rakefile +36 -0
- data/VERSION +1 -0
- data/bin/wenlin_dbdump +24 -0
- data/bin/wenlin_dict +24 -0
- data/bin/wenlin_hanzi +13 -0
- data/bin/wenlin_parts +23 -0
- data/lib/wenlin_db_scanner.rb +13 -0
- data/lib/wenlin_db_scanner/chars.rb +210 -0
- data/lib/wenlin_db_scanner/db.rb +453 -0
- data/lib/wenlin_db_scanner/db_record.rb +43 -0
- data/lib/wenlin_db_scanner/dict.rb +373 -0
- data/lib/wenlin_db_scanner/speech_parts.rb +68 -0
- data/reversed/README.md +38 -0
- data/reversed/code.asm +1616 -0
- data/reversed/magic.txt +27 -0
- data/reversed/notes.txt +235 -0
- metadata +147 -0
data/.document
ADDED
data/Gemfile
ADDED
@@ -0,0 +1,13 @@
+source "http://rubygems.org"
+# Add dependencies required to use your gem here.
+# Example:
+#   gem "activesupport", ">= 2.3.5"
+
+# Add dependencies to develop your gem here.
+# Include everything needed to run rake, tests, features, etc.
+group :development do
+  gem "yard", ">= 0.8.2.1"
+  gem "rdoc", ">= 3.12"
+  gem "bundler", ">= 1.2.0"
+  gem "jeweler", ">= 1.8.4"
+end
data/Gemfile.lock
ADDED
@@ -0,0 +1,23 @@
+GEM
+  remote: http://rubygems.org/
+  specs:
+    git (1.2.5)
+    jeweler (1.8.4)
+      bundler (~> 1.0)
+      git (>= 1.2.5)
+      rake
+      rdoc
+    json (1.7.5)
+    rake (0.9.2.2)
+    rdoc (3.12)
+      json (~> 1.4)
+    yard (0.8.2.1)
+
+PLATFORMS
+  ruby
+
+DEPENDENCIES
+  bundler (>= 1.2.0)
+  jeweler (>= 1.8.4)
+  rdoc (>= 3.12)
+  yard (>= 0.8.2.1)
data/LICENSE.txt
ADDED
@@ -0,0 +1,122 @@
+Creative Commons Legal Code
+
+CC0 1.0 Universal
+
+    CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
+    LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
+    ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
+    INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
+    REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
+    PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
+    THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
+    HEREUNDER.
+
+Statement of Purpose
+
+The laws of most jurisdictions throughout the world automatically confer
+exclusive Copyright and Related Rights (defined below) upon the creator
+and subsequent owner(s) (each and all, an "owner") of an original work of
+authorship and/or a database (each, a "Work").
+
+Certain owners wish to permanently relinquish those rights to a Work for
+the purpose of contributing to a commons of creative, cultural and
+scientific works ("Commons") that the public can reliably and without fear
+of later claims of infringement build upon, modify, incorporate in other
+works, reuse and redistribute as freely as possible in any form whatsoever
+and for any purposes, including without limitation commercial purposes.
+These owners may contribute to the Commons to promote the ideal of a free
+culture and the further production of creative, cultural and scientific
+works, or to gain reputation or greater distribution for their Work in
+part through the use and efforts of others.
+
+For these and/or other purposes and motivations, and without any
+expectation of additional consideration or compensation, the person
+associating CC0 with a Work (the "Affirmer"), to the extent that he or she
+is an owner of Copyright and Related Rights in the Work, voluntarily
+elects to apply CC0 to the Work and publicly distribute the Work under its
+terms, with knowledge of his or her Copyright and Related Rights in the
+Work and the meaning and intended legal effect of CC0 on those rights.
+
+1. Copyright and Related Rights. A Work made available under CC0 may be
+protected by copyright and related or neighboring rights ("Copyright and
+Related Rights"). Copyright and Related Rights include, but are not
+limited to, the following:
+
+  i. the right to reproduce, adapt, distribute, perform, display,
+     communicate, and translate a Work;
+ ii. moral rights retained by the original author(s) and/or performer(s);
+iii. publicity and privacy rights pertaining to a person's image or
+     likeness depicted in a Work;
+ iv. rights protecting against unfair competition in regards to a Work,
+     subject to the limitations in paragraph 4(a), below;
+  v. rights protecting the extraction, dissemination, use and reuse of data
+     in a Work;
+ vi. database rights (such as those arising under Directive 96/9/EC of the
+     European Parliament and of the Council of 11 March 1996 on the legal
+     protection of databases, and under any national implementation
+     thereof, including any amended or successor version of such
+     directive); and
+vii. other similar, equivalent or corresponding rights throughout the
+     world based on applicable law or treaty, and any national
+     implementations thereof.
+
+2. Waiver. To the greatest extent permitted by, but not in contravention
+of, applicable law, Affirmer hereby overtly, fully, permanently,
+irrevocably and unconditionally waives, abandons, and surrenders all of
+Affirmer's Copyright and Related Rights and associated claims and causes
+of action, whether now known or unknown (including existing as well as
+future claims and causes of action), in the Work (i) in all territories
+worldwide, (ii) for the maximum duration provided by applicable law or
+treaty (including future time extensions), (iii) in any current or future
+medium and for any number of copies, and (iv) for any purpose whatsoever,
+including without limitation commercial, advertising or promotional
+purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
+member of the public at large and to the detriment of Affirmer's heirs and
+successors, fully intending that such Waiver shall not be subject to
+revocation, rescission, cancellation, termination, or any other legal or
+equitable action to disrupt the quiet enjoyment of the Work by the public
+as contemplated by Affirmer's express Statement of Purpose.
+
+3. Public License Fallback. Should any part of the Waiver for any reason
+be judged legally invalid or ineffective under applicable law, then the
+Waiver shall be preserved to the maximum extent permitted taking into
+account Affirmer's express Statement of Purpose. In addition, to the
+extent the Waiver is so judged Affirmer hereby grants to each affected
+person a royalty-free, non transferable, non sublicensable, non exclusive,
+irrevocable and unconditional license to exercise Affirmer's Copyright and
+Related Rights in the Work (i) in all territories worldwide, (ii) for the
+maximum duration provided by applicable law or treaty (including future
+time extensions), (iii) in any current or future medium and for any number
+of copies, and (iv) for any purpose whatsoever, including without
+limitation commercial, advertising or promotional purposes (the
+"License"). The License shall be deemed effective as of the date CC0 was
+applied by Affirmer to the Work. Should any part of the License for any
+reason be judged legally invalid or ineffective under applicable law, such
+partial invalidity or ineffectiveness shall not invalidate the remainder
+of the License, and in such case Affirmer hereby affirms that he or she
+will not (i) exercise any of his or her remaining Copyright and Related
+Rights in the Work or (ii) assert any associated claims and causes of
+action with respect to the Work, in either case contrary to Affirmer's
+express Statement of Purpose.
+
+4. Limitations and Disclaimers.
+
+ a. No trademark or patent rights held by Affirmer are waived, abandoned,
+    surrendered, licensed or otherwise affected by this document.
+ b. Affirmer offers the Work as-is and makes no representations or
+    warranties of any kind concerning the Work, express, implied,
+    statutory or otherwise, including without limitation warranties of
+    title, merchantability, fitness for a particular purpose, non
+    infringement, or the absence of latent or other defects, accuracy, or
+    the present or absence of errors, whether or not discoverable, all to
+    the greatest extent permissible under applicable law.
+ c. Affirmer disclaims responsibility for clearing rights of other persons
+    that may apply to the Work or any use thereof, including without
+    limitation any person's Copyright and Related Rights in the Work.
+    Further, Affirmer disclaims responsibility for obtaining any necessary
+    consents, permissions or other rights required for any use of the
+    Work.
+ d. Affirmer understands and acknowledges that Creative Commons is not a
+    party to this document and has no duty or obligation with respect to
+    this CC0 or use of the Work.
+
data/README.md
ADDED
@@ -0,0 +1,102 @@
+# wenlin_db_scanner
+
+Extracts the data from the Wenlin dictionary program.
+
+The Wenlin Dictionary contains two great databases, the
+[ABC English<->Chinese Dictionary](http://www.wenlin.com/abc.htm) and the
+[Character Description Language](http://www.wenlin.com/cdl/) (CDL).
+
+Unfortunately, this great data is wrapped by a less-than-great UI. This code is
+intended to be useful to Chinese language students who wish to interact with
+the data on their own terms.
+
+
+## Installation
+
+The tool ships as a Ruby gem, and the standard installation process applies.
+The code relies on Ruby 1.9 syntax and String encoding. It was tested to work
+with MRI 1.9.3.
+
+```bash
+gem install wenlin_db_scanner
+```
+
+
+## Command-Line Usage
+
+The following commands assume that the current directory of your `Terminal` /
+`Command Prompt` is the Wenlin application's main directory. If your current
+directory contains a `W4DB` directory, you're probably in the right place.
+
+### wenlin_dict
+
+Parses a dictionary database into a file containing one JSON line per entry.
+
+```bash
+wenlin_dict W4DB/ en-zh > en_zh.json
+wenlin_dict W4DB/ zh-en > zh_en.json
+wenlin_dict W4DB/ hz-en > hz_en.json
+```
+
+### wenlin_hanzi
+
+Parses the database that breaks down hanzi (Chinese characters) into
+components.
+
+```bash
+wenlin_hanzi W4DB > hanzi.json
+```
+
+### wenlin_parts
+
+Parses a parts-of-speech database into a file containing one JSON line per part
+of speech.
+
+The parts of speech are referenced by the word definition databases, which use
+their abbreviations.
+
+```bash
+wenlin_parts W4DB/ en > en_parts.json
+wenlin_parts W4DB/ zh > zh_parts.json
+```
+
+### wenlin_dbdump
+
+Extracts the raw text entries in a .db file. Useful for debugging and
+understanding the record format.
+
+```bash
+wenlin_dbdump W4DB/abc_ce.db
+```
+
+
+## API Usage
+
+The scripts in the `bin` directory are thin wrappers over the API. Read them if
+you want to use the Ruby API directly.
+
+It is very likely that you'll get your job done faster by using the output of
+the CLI tools.
+
+
+## Testing
+
+I test this code by running the tools inside `bin` against the Wenlin databases,
+and by spot-checking the output.
+
+
+## Contributing
+
+This tool works fairly well on the Wenlin 4 data files. Bugfixes and support
+for new .db file formats are welcome; other features are most likely outside
+the project's scope.
+
+Note that this tool is designed to help move the data into another program,
+so it only supports full table scans. Support for random access using the
+B-tree indexes is outside the scope of this project.
+
+
+## Copyright
+
+This code is licensed under the
+[CC0 Public Domain](http://creativecommons.org/publicdomain/zero/1.0/) license.
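The README says each CLI tool writes one JSON document per line. A minimal sketch of reading such a file back into Ruby follows. The helper name `read_json_lines` and the sample records are hypothetical, and the sketch uses current Ruby syntax (squiggly heredoc) rather than the 1.9 dialect the gem targets.

```ruby
require 'json'
require 'stringio'

# Reads newline-delimited JSON from any IO-like object and returns an
# Array of Hashes with symbolized keys. Blank lines are skipped.
def read_json_lines(io)
  io.each_line.map(&:strip).reject(&:empty?).map do |line|
    JSON.parse(line, symbolize_names: true)
  end
end

# Hypothetical sample standing in for a file such as hz_en.json; real
# output carries more fields.
sample = StringIO.new(<<~LINES)
  {"char":"好","pinyin":["hǎo"],"meaning":"good"}
  {"char":"人","pinyin":["rén"],"meaning":"person"}
LINES

entries = read_json_lines(sample)
entries.each { |entry| puts "#{entry[:char]}: #{entry[:meaning]}" }
```

For a real dump, passing `File.open('hz_en.json', 'r:utf-8')` instead of the `StringIO` should work the same way, since only `each_line` is used.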
data/Rakefile
ADDED
@@ -0,0 +1,36 @@
+# encoding: utf-8
+
+require 'rubygems'
+require 'bundler'
+begin
+  Bundler.setup(:default, :development)
+rescue Bundler::BundlerError => e
+  $stderr.puts e.message
+  $stderr.puts "Run `bundle install` to install missing gems"
+  exit e.status_code
+end
+require 'rake'
+
+require 'jeweler'
+Jeweler::Tasks.new do |gem|
+  # gem is a Gem::Specification... see http://docs.rubygems.org/read/chapter/20 for more options
+  gem.name = "wenlin_db_scanner"
+  gem.homepage = "http://github.com/pwnall/wenlin_db_scanner"
+  gem.license = "CC0"
+  gem.summary = %Q{Extracts the data from the Wenlin dictionary}
+  gem.description = <<END
+The Wenlin dictionary contains two great databases, the ABC English<->Chinese
+dictionary, and the Character Description Language (CDL). Unfortunately, this
+data is wrapped by a less-than-great UI. This gem lets you extract the data so
+you can build your own UI for it.
+END
+  gem.email = "victor@costan.us"
+  gem.authors = ["Victor Costan"]
+  # dependencies defined in Gemfile
+end
+Jeweler::RubygemsDotOrgTasks.new
+
+task :default => :install
+
+require 'yard'
+YARD::Rake::YardocTask.new
data/VERSION
ADDED
@@ -0,0 +1 @@
+0.2.0
data/bin/wenlin_dbdump
ADDED
@@ -0,0 +1,24 @@
+#!/usr/bin/env ruby
+# Requires Ruby 1.9, tested on MRI 1.9.3.
+
+require 'wenlin_db_scanner'
+
+unless ARGV.length == 1
+  STDERR.puts "Usage: #{$0} path-to-db-file"
+  exit 1
+end
+
+db = WenlinDbScanner::Db.new ARGV[0]
+db.records.each do |record|
+  puts "---------- record tag: #{record.tag} = 0b#{'%b' % record.tag}"
+
+  if record.binary?
+    puts "---------- binary record, size: #{record.size}"
+    next
+  end
+
+  puts record.text
+  puts "---------- record end"
+end
+db.close
+
data/bin/wenlin_dict
ADDED
@@ -0,0 +1,24 @@
+#!/usr/bin/env ruby
+# Requires Ruby 1.9, tested on MRI 1.9.3.
+
+require 'json'
+require 'wenlin_db_scanner'
+
+unless ARGV.length == 2
+  STDERR.puts "Usage: #{$0} path-to-db-dir en-zh|zh-en|hz-en"
+  exit 1
+end
+
+case ARGV[1]
+when 'en-zh'
+  entries = WenlinDbScanner::Dicts.en_zh ARGV[0]
+when 'zh-en'
+  entries = WenlinDbScanner::Dicts.zh_en ARGV[0]
+when 'hz-en'
+  entries = WenlinDbScanner::Chars.hz_en ARGV[0]
+else
+  STDERR.puts "Unknown dictionary #{ARGV[1]}\nUse en-zh, zh-en, or hz-en\n"
+  exit 1
+end
+
+entries.each { |entry| puts entry.to_hash.to_json }
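The script above ends by calling `to_hash.to_json` on every entry. The following is a rough sketch of that serialization pattern, assuming entries are `Struct` subclasses whose `to_hash` drops unset members, as `CharMeaning` in `chars.rb` does. `SampleEntry` is a hypothetical stand-in with fewer fields than the real classes.

```ruby
require 'json'

# Hypothetical stand-in for the gem's entry structs.
class SampleEntry < Struct.new(:char, :meaning, :note)
  # Keeps only the members that were actually set, so unset fields do not
  # show up as null in the JSON output.
  def to_hash
    Hash[each_pair.reject { |_key, value| value.nil? }]
  end
end

entry = SampleEntry.new('好', 'good') # note is left nil
line = entry.to_hash.to_json
puts line
```

Dropping nil members keeps each output line short, at the cost that consumers must treat a missing key the same as an absent value.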
data/bin/wenlin_hanzi
ADDED
@@ -0,0 +1,13 @@
+#!/usr/bin/env ruby
+# Requires Ruby 1.9, tested on MRI 1.9.3.
+
+require 'json'
+require 'wenlin_db_scanner'
+
+unless ARGV.length == 1
+  abort "Usage: #{$0} path-to-db-dir"  # prints to STDERR, exits with status 1
+end
+
+chars = WenlinDbScanner::Chars.hanzi ARGV[0]
+
+chars.each { |char| puts char.to_hash.to_json }
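The hanzi entries printed by this script carry attribute values decoded by `Chars.cdl_attribute_value` in `chars.rb`. A small sketch of two of those decodings, coordinate lists and hexadecimal code points, is given here as standalone functions. The helper names and the sample attribute values are hypothetical; only the splitting and conversion steps mirror the gem's code.

```ruby
# Decodes a CDL-style points attribute ("x,y x,y ...") into integer pairs.
def decode_points(raw_value)
  raw_value.split(' ').map do |pair|
    pair.split(',').map { |coord| coord.strip.to_i }
  end
end

# Decodes a hexadecimal code point attribute into an Integer.
def decode_uni(raw_value)
  raw_value.strip.to_i(16)
end

points = decode_points('0,0 128,64') # hypothetical attribute value
code_point = decode_uni('597D')      # hypothetical attribute value

puts points.inspect
puts [code_point].pack('U')          # the character for that code point
```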
data/bin/wenlin_parts
ADDED
@@ -0,0 +1,23 @@
+#!/usr/bin/env ruby
+# Requires Ruby 1.9, tested on MRI 1.9.3.
+
+require 'json'
+require 'wenlin_db_scanner'
+
+unless ARGV.length == 2
+  STDERR.puts "Usage: #{$0} path-to-db-dir en|zh"
+  exit 1
+end
+
+case ARGV[1]
+when 'en'
+  parts = WenlinDbScanner::SpeechParts.en ARGV[0]
+when 'zh'
+  parts = WenlinDbScanner::SpeechParts.zh ARGV[0]
+else
+  STDERR.puts "Unknown language #{ARGV[1]}\nUse en or zh"
+  exit 1
+end
+
+parts.each { |part| puts part.to_hash.to_json }
+
data/lib/wenlin_db_scanner.rb
ADDED
@@ -0,0 +1,13 @@
+# Namespace for the Db scanner classes.
+#
+# The awfully long name was chosen on purpose, to save the good names for more
+# useful libraries.
+module WenlinDbScanner
+end
+
+require 'wenlin_db_scanner/chars.rb'
+require 'wenlin_db_scanner/db.rb'
+require 'wenlin_db_scanner/db_record.rb'
+require 'wenlin_db_scanner/dict.rb'
+require 'wenlin_db_scanner/speech_parts.rb'
+
data/lib/wenlin_db_scanner/chars.rb
ADDED
@@ -0,0 +1,210 @@
+# coding: utf-8
+
+require 'rexml/document'
+
+module WenlinDbScanner
+
+# Parses the data in the character (hanzi) databases.
+module Chars
+  # The entries in the database that breaks down hanzi into components.
+  #
+  # @param [String] db_root the directory containing the .db files
+  # @return [Enumerator<Hash>]
+  def self.hanzi(db_root)
+    _hanzi File.join(db_root, 'cdl.db')
+  end
+
+  # Decoder for a CDL database.
+  #
+  # @param [String] db_file path to the .db file containing CDL data
+  # @return [Enumerator<Hash>]
+  def self._hanzi(db_file)
+    Enumerator.new do |yielder|
+      db = Db.new db_file
+      db.records.each do |record|
+        next if record.binary?
+        xml = REXML::Document.new record.text
+
+        entry = {}
+        xml.root.attributes.each do |name, raw_value|
+          key = name.to_sym
+          entry[key] = cdl_attribute_value key, raw_value
+        end
+
+        entry[:parts] = xml.root.elements.map do |element|
+          part = { part: element.name.to_sym }
+          element.attributes.each do |name, raw_value|
+            key = name.to_sym
+            part[key] = cdl_attribute_value key, raw_value
+          end
+          part
+        end
+
+        yielder << entry
+      end
+    end
+  end
+
+  # Decodes known attributes for CDL XML elements.
+  #
+  # @param [Symbol] key the attribute's name, symbolized
+  # @param [String] raw_value the attribute's value
+  # @return [Integer, Array, Symbol, String] a more programmer-friendly value
+  def self.cdl_attribute_value(key, raw_value)
+    case key
+    when :points  # coordinates
+      raw_value.split(' ').map do |pair|
+        pair.split(',').map { |coord| coord.strip.to_i }
+      end
+    when :radical  # dictionary radicals?
+      raw_value.strip.split(' ').map(&:strip)
+    when :type  # stroke type
+      raw_value.strip.to_sym
+    when :uni  # unicode value
+      raw_value.strip.to_i(16)
+    else
+      raw_value.strip
+    end
+  end
+
+  # The entries in the hanzi -> English meaning dictionary.
+  #
+  # @param [String] db_root the directory containing the .db files
+  # @return [Enumerator<CharMeaning>]
+  def self.hz_en(db_root)
+    _hz_en File.join(db_root, 'zidian.db')
+  end
+
+  # Decoder for a database of hanzi -> English meaning entries.
+  #
+  # @param [String] db_file path to the .db file containing dictionary data
+  # @return [Enumerator<CharMeaning>]
+  def self._hz_en(db_file)
+    Enumerator.new do |yielder|
+      db = Db.new db_file
+      db.records.each do |record|
+        next if record.binary?
+        lines = record.text.split("\n").map(&:strip).reject(&:empty?)
+
+        header = lines[0]
+
+        entry = CharMeaning.new
+        entry.char = header[0, 1]
+        header = header[1..-1]
+
+        entry.pinyin = header.scan(/\[([^\]]*)\]/).
+                              map { |match| match.first.strip }
+        entry.latin_pinyin =
+            entry.pinyin.map { |pinyin| pinyin_to_latin pinyin }
+        header.gsub!(/\[[^\]]*\]/, '')
+        header.strip!
+
+        header.scan(/\([^\)]+\)/).each do |aside|
+          aside_text = aside[1...-1]
+          case aside_text[0]
+          when '='
+            entry.variants = aside_text[1..-1].chars.to_a
+            header.gsub! aside, ''
+          when '!', '?'
+            entry.related ||= []
+            entry.related += aside_text[1..-1].chars.to_a
+            header.gsub! aside, ''
+          when 'F'
+            entry.complex_forms = aside_text[1..-1].chars.to_a
+            header.gsub! aside, ''
+          when 'S'
+            entry.simplified_forms = aside_text[1..-1].chars.to_a
+            header.gsub! aside, ''
+          when 'u', 'U'
+            if /^Unihan/i =~ aside_text
+              header.gsub! aside, ''
+            end
+          end
+        end
+        header.strip!
+        # Many definitions start with a (note).
+        if note_match = /^\(([^\)]*)\)/.match(header)
+          entry.note = note_match[1]
+          header = header[note_match[0].length..-1].strip
+        end
+        entry.meaning = header.gsub(/\s*<hr\s*\/?>\s*/, "\n")
+
+        lines[1..-1].each do |line|
+          unless line[0] == '#'
+            if entry.note
+              entry.note << "/ #{line}"
+            else
+              entry.note = line
+            end
+            next
+          end
+
+          tag, data = line[1], line[2..-1].strip
+          case tag  # was `case 'tag'`, which never matched any branch
+          when 'c'
+            entry.components = data.chars.to_a
+          when 'r'
+            # NOTE: skipping remarks
+          when 'y'
+            entry.cantonese = data
+          end
+        end
+
+        yielder << entry
+      end
+    end
+  end
+
+  # Removes the accents from a pinyin string.
+  #
+  # This computes the closest Latin alphabet string matching the given pinyin
+  # string. It is what users will most likely type to refer to the character,
+  # word or phrase inside the pinyin-spelling string.
+  #
+  # @param [String] pinyin a string that uses pinyin spelling
+  # @return [String] the closest approximation to the given string that only
+  #     uses Latin characters
+  def self.pinyin_to_latin(pinyin)
+    pinyin.tr 'āēīōūǖĀĒĪŌŪǕáéíóúǘÁÉÍÓÚǗǎěǐǒǔǚǍĚǏǑǓǙàèìòùǜÀÈÌÒÙǛüÜ',
+              'aeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVvV'
+  end
+end  # module WenlinDbScanner::Chars
+
+# Wraps a record in a dictionary database
+class CharMeaning < Struct.new(:char, :meaning, :note, :pinyin, :variants,
+                               :complex_forms, :simplified_forms,
+                               :components, :cantonese, :related,
+                               :latin_pinyin)
+  # @!attribute [r] char
+  #   @return [String] 1-character string containing the defined character
+  # @!attribute [r] meaning
+  #   @return [String] the character's definition, in English
+  # @!attribute [r] note
+  #   @return [String] e.g., "same as X" or
+  # @!attribute [r] pinyin
+  #   @return [Array<String>] pinyin pronunciation(s) of the character
+  # @!attribute [r] latin_pinyin
+  #   @return [Array<String>] pinyin pronunciation(s) of the character,
+  #       with the accents removed; this is what users type to get the
+  #       character
+  # @!attribute [r] variants
+  #   @return [Array<String>] other variants of the character
+  # @!attribute [r] related
+  #   @return [Array<String>] characters that are somehow related
+  # @!attribute [r] simplified_forms
+  #   @return [Array<String>] simplified variants of the character
+  # @!attribute [r] complex_forms
+  #   @return [Array<String>] this character is a simplified variant of them
+  # @!attribute [r] components
+  #   @return [Array<String>] 1-character strings with characters that are
+  #       contained in this character's image
+  # @!attribute [r] cantonese
+  #   @return [String] character's pronunciation in Cantonese
+
+  # @return [Hash]
+  def to_hash
+    Hash[each_pair.reject { |k, v| v.nil? }.to_a]
+  end
+end  # class WenlinDbScanner::CharMeaning
+
+end  # namespace WenlinDbScanner
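The header handling in `_hz_en` pulls bracketed pinyin readings out of the first line of a record and then strips the tone marks with `String#tr`. A standalone sketch of those two steps follows. The helper `bracketed_readings`, the constant names, and the sample header line are hypothetical; the regular expression and the two translation strings are copied from the file above.

```ruby
# Accented pinyin vowels and their plain Latin replacements, in the same
# order as in Chars.pinyin_to_latin ("v" stands in for "ü").
ACCENTED = 'āēīōūǖĀĒĪŌŪǕáéíóúǘÁÉÍÓÚǗǎěǐǒǔǚǍĚǏǑǓǙàèìòùǜÀÈÌÒÙǛüÜ'
PLAIN    = 'aeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVvV'

# Replaces each accented vowel with the character at the same position in
# PLAIN; all other characters pass through unchanged.
def pinyin_to_latin(pinyin)
  pinyin.tr ACCENTED, PLAIN
end

# Returns the text of every [bracketed] section in a header line.
def bracketed_readings(header)
  header.scan(/\[([^\]]*)\]/).map { |match| match.first.strip }
end

header = '好 [hǎo] [hào] good; fine' # hypothetical header line
readings = bracketed_readings(header)
latin = readings.map { |reading| pinyin_to_latin(reading) }

puts readings.inspect
puts latin.inspect
```

Because `tr` maps by position, the two strings must stay the same length (50 characters each here); adding a vowel to one without the other would silently shift every later mapping.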