wenlin_db_scanner 0.2.0
- data/.document +5 -0
- data/Gemfile +13 -0
- data/Gemfile.lock +23 -0
- data/LICENSE.txt +122 -0
- data/README.md +102 -0
- data/Rakefile +36 -0
- data/VERSION +1 -0
- data/bin/wenlin_dbdump +24 -0
- data/bin/wenlin_dict +24 -0
- data/bin/wenlin_hanzi +13 -0
- data/bin/wenlin_parts +23 -0
- data/lib/wenlin_db_scanner.rb +13 -0
- data/lib/wenlin_db_scanner/chars.rb +210 -0
- data/lib/wenlin_db_scanner/db.rb +453 -0
- data/lib/wenlin_db_scanner/db_record.rb +43 -0
- data/lib/wenlin_db_scanner/dict.rb +373 -0
- data/lib/wenlin_db_scanner/speech_parts.rb +68 -0
- data/reversed/README.md +38 -0
- data/reversed/code.asm +1616 -0
- data/reversed/magic.txt +27 -0
- data/reversed/notes.txt +235 -0
- metadata +147 -0
data/.document
ADDED
data/Gemfile
ADDED
@@ -0,0 +1,13 @@
|
|
1
|
+
source "http://rubygems.org"
|
2
|
+
# Add dependencies required to use your gem here.
|
3
|
+
# Example:
|
4
|
+
# gem "activesupport", ">= 2.3.5"
|
5
|
+
|
6
|
+
# Add dependencies to develop your gem here.
|
7
|
+
# Include everything needed to run rake, tests, features, etc.
|
8
|
+
group :development do
|
9
|
+
gem "yard", ">= 0.8.2.1"
|
10
|
+
gem "rdoc", ">= 3.12"
|
11
|
+
gem "bundler", ">= 1.2.0"
|
12
|
+
gem "jeweler", ">= 1.8.4"
|
13
|
+
end
|
data/Gemfile.lock
ADDED
@@ -0,0 +1,23 @@
|
|
1
|
+
GEM
|
2
|
+
remote: http://rubygems.org/
|
3
|
+
specs:
|
4
|
+
git (1.2.5)
|
5
|
+
jeweler (1.8.4)
|
6
|
+
bundler (~> 1.0)
|
7
|
+
git (>= 1.2.5)
|
8
|
+
rake
|
9
|
+
rdoc
|
10
|
+
json (1.7.5)
|
11
|
+
rake (0.9.2.2)
|
12
|
+
rdoc (3.12)
|
13
|
+
json (~> 1.4)
|
14
|
+
yard (0.8.2.1)
|
15
|
+
|
16
|
+
PLATFORMS
|
17
|
+
ruby
|
18
|
+
|
19
|
+
DEPENDENCIES
|
20
|
+
bundler (>= 1.2.0)
|
21
|
+
jeweler (>= 1.8.4)
|
22
|
+
rdoc (>= 3.12)
|
23
|
+
yard (>= 0.8.2.1)
|
data/LICENSE.txt
ADDED
@@ -0,0 +1,122 @@
|
|
1
|
+
Creative Commons Legal Code
|
2
|
+
|
3
|
+
CC0 1.0 Universal
|
4
|
+
|
5
|
+
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
|
6
|
+
LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
|
7
|
+
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
|
8
|
+
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
|
9
|
+
REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
|
10
|
+
PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
|
11
|
+
THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
|
12
|
+
HEREUNDER.
|
13
|
+
|
14
|
+
Statement of Purpose
|
15
|
+
|
16
|
+
The laws of most jurisdictions throughout the world automatically confer
|
17
|
+
exclusive Copyright and Related Rights (defined below) upon the creator
|
18
|
+
and subsequent owner(s) (each and all, an "owner") of an original work of
|
19
|
+
authorship and/or a database (each, a "Work").
|
20
|
+
|
21
|
+
Certain owners wish to permanently relinquish those rights to a Work for
|
22
|
+
the purpose of contributing to a commons of creative, cultural and
|
23
|
+
scientific works ("Commons") that the public can reliably and without fear
|
24
|
+
of later claims of infringement build upon, modify, incorporate in other
|
25
|
+
works, reuse and redistribute as freely as possible in any form whatsoever
|
26
|
+
and for any purposes, including without limitation commercial purposes.
|
27
|
+
These owners may contribute to the Commons to promote the ideal of a free
|
28
|
+
culture and the further production of creative, cultural and scientific
|
29
|
+
works, or to gain reputation or greater distribution for their Work in
|
30
|
+
part through the use and efforts of others.
|
31
|
+
|
32
|
+
For these and/or other purposes and motivations, and without any
|
33
|
+
expectation of additional consideration or compensation, the person
|
34
|
+
associating CC0 with a Work (the "Affirmer"), to the extent that he or she
|
35
|
+
is an owner of Copyright and Related Rights in the Work, voluntarily
|
36
|
+
elects to apply CC0 to the Work and publicly distribute the Work under its
|
37
|
+
terms, with knowledge of his or her Copyright and Related Rights in the
|
38
|
+
Work and the meaning and intended legal effect of CC0 on those rights.
|
39
|
+
|
40
|
+
1. Copyright and Related Rights. A Work made available under CC0 may be
|
41
|
+
protected by copyright and related or neighboring rights ("Copyright and
|
42
|
+
Related Rights"). Copyright and Related Rights include, but are not
|
43
|
+
limited to, the following:
|
44
|
+
|
45
|
+
i. the right to reproduce, adapt, distribute, perform, display,
|
46
|
+
communicate, and translate a Work;
|
47
|
+
ii. moral rights retained by the original author(s) and/or performer(s);
|
48
|
+
iii. publicity and privacy rights pertaining to a person's image or
|
49
|
+
likeness depicted in a Work;
|
50
|
+
iv. rights protecting against unfair competition in regards to a Work,
|
51
|
+
subject to the limitations in paragraph 4(a), below;
|
52
|
+
v. rights protecting the extraction, dissemination, use and reuse of data
|
53
|
+
in a Work;
|
54
|
+
vi. database rights (such as those arising under Directive 96/9/EC of the
|
55
|
+
European Parliament and of the Council of 11 March 1996 on the legal
|
56
|
+
protection of databases, and under any national implementation
|
57
|
+
thereof, including any amended or successor version of such
|
58
|
+
directive); and
|
59
|
+
vii. other similar, equivalent or corresponding rights throughout the
|
60
|
+
world based on applicable law or treaty, and any national
|
61
|
+
implementations thereof.
|
62
|
+
|
63
|
+
2. Waiver. To the greatest extent permitted by, but not in contravention
|
64
|
+
of, applicable law, Affirmer hereby overtly, fully, permanently,
|
65
|
+
irrevocably and unconditionally waives, abandons, and surrenders all of
|
66
|
+
Affirmer's Copyright and Related Rights and associated claims and causes
|
67
|
+
of action, whether now known or unknown (including existing as well as
|
68
|
+
future claims and causes of action), in the Work (i) in all territories
|
69
|
+
worldwide, (ii) for the maximum duration provided by applicable law or
|
70
|
+
treaty (including future time extensions), (iii) in any current or future
|
71
|
+
medium and for any number of copies, and (iv) for any purpose whatsoever,
|
72
|
+
including without limitation commercial, advertising or promotional
|
73
|
+
purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
|
74
|
+
member of the public at large and to the detriment of Affirmer's heirs and
|
75
|
+
successors, fully intending that such Waiver shall not be subject to
|
76
|
+
revocation, rescission, cancellation, termination, or any other legal or
|
77
|
+
equitable action to disrupt the quiet enjoyment of the Work by the public
|
78
|
+
as contemplated by Affirmer's express Statement of Purpose.
|
79
|
+
|
80
|
+
3. Public License Fallback. Should any part of the Waiver for any reason
|
81
|
+
be judged legally invalid or ineffective under applicable law, then the
|
82
|
+
Waiver shall be preserved to the maximum extent permitted taking into
|
83
|
+
account Affirmer's express Statement of Purpose. In addition, to the
|
84
|
+
extent the Waiver is so judged Affirmer hereby grants to each affected
|
85
|
+
person a royalty-free, non transferable, non sublicensable, non exclusive,
|
86
|
+
irrevocable and unconditional license to exercise Affirmer's Copyright and
|
87
|
+
Related Rights in the Work (i) in all territories worldwide, (ii) for the
|
88
|
+
maximum duration provided by applicable law or treaty (including future
|
89
|
+
time extensions), (iii) in any current or future medium and for any number
|
90
|
+
of copies, and (iv) for any purpose whatsoever, including without
|
91
|
+
limitation commercial, advertising or promotional purposes (the
|
92
|
+
"License"). The License shall be deemed effective as of the date CC0 was
|
93
|
+
applied by Affirmer to the Work. Should any part of the License for any
|
94
|
+
reason be judged legally invalid or ineffective under applicable law, such
|
95
|
+
partial invalidity or ineffectiveness shall not invalidate the remainder
|
96
|
+
of the License, and in such case Affirmer hereby affirms that he or she
|
97
|
+
will not (i) exercise any of his or her remaining Copyright and Related
|
98
|
+
Rights in the Work or (ii) assert any associated claims and causes of
|
99
|
+
action with respect to the Work, in either case contrary to Affirmer's
|
100
|
+
express Statement of Purpose.
|
101
|
+
|
102
|
+
4. Limitations and Disclaimers.
|
103
|
+
|
104
|
+
a. No trademark or patent rights held by Affirmer are waived, abandoned,
|
105
|
+
surrendered, licensed or otherwise affected by this document.
|
106
|
+
b. Affirmer offers the Work as-is and makes no representations or
|
107
|
+
warranties of any kind concerning the Work, express, implied,
|
108
|
+
statutory or otherwise, including without limitation warranties of
|
109
|
+
title, merchantability, fitness for a particular purpose, non
|
110
|
+
infringement, or the absence of latent or other defects, accuracy, or
|
111
|
+
the present or absence of errors, whether or not discoverable, all to
|
112
|
+
the greatest extent permissible under applicable law.
|
113
|
+
c. Affirmer disclaims responsibility for clearing rights of other persons
|
114
|
+
that may apply to the Work or any use thereof, including without
|
115
|
+
limitation any person's Copyright and Related Rights in the Work.
|
116
|
+
Further, Affirmer disclaims responsibility for obtaining any necessary
|
117
|
+
consents, permissions or other rights required for any use of the
|
118
|
+
Work.
|
119
|
+
d. Affirmer understands and acknowledges that Creative Commons is not a
|
120
|
+
party to this document and has no duty or obligation with respect to
|
121
|
+
this CC0 or use of the Work.
|
122
|
+
|
data/README.md
ADDED
@@ -0,0 +1,102 @@
|
|
1
|
+
# wenlin_db_scanner
|
2
|
+
|
3
|
+
Extracts the data from the Wenlin dictionary program.
|
4
|
+
|
5
|
+
The Wenlin Dictionary contains two great databases, the
|
6
|
+
[ABC English<->Chinese Dictionary](http://www.wenlin.com/abc.htm) and the
|
7
|
+
[Character Description Language](http://www.wenlin.com/cdl/) (CDL).
|
8
|
+
|
9
|
+
Unfortunately, this great data is wrapped by a less-than-great UI. This code is
|
10
|
+
intended to be useful to Chinese language students who wish to interact with
|
11
|
+
the data on their own terms.
|
12
|
+
|
13
|
+
|
14
|
+
## Installation
|
15
|
+
|
16
|
+
The tool ships as a Ruby gem, and the standard installation process applies.
|
17
|
+
The code relies on Ruby 1.9 syntax and String encoding. It was tested to work
|
18
|
+
with MRI 1.9.3.
|
19
|
+
|
20
|
+
```bash
|
21
|
+
gem install wenlin_db_scanner
|
22
|
+
```
|
23
|
+
|
24
|
+
|
25
|
+
## Command-Line Usage
|
26
|
+
|
27
|
+
The following commands assume that the current directory of your `Terminal` /
|
28
|
+
`Command Prompt` is the Wenlin application's main directory. If your current
|
29
|
+
directory contains a `W4DB` directory, you're probably in the right place.
|
30
|
+
|
31
|
+
### wenlin_dict
|
32
|
+
|
33
|
+
Parses a dictionary database into a file containing one JSON line per entry.
|
34
|
+
|
35
|
+
```bash
|
36
|
+
wenlin_dict W4DB/ en-zh > en_zh.json
|
37
|
+
wenlin_dict W4DB/ zh-en > zh_en.json
|
38
|
+
wenlin_dict W4DB/ hz-en > hz_en.json
|
39
|
+
```
|
40
|
+
|
41
|
+
### wenlin_hanzi
|
42
|
+
|
43
|
+
Parses the database that breaks down hanzi (Chinese characters) into
|
44
|
+
components.
|
45
|
+
|
46
|
+
```bash
|
47
|
+
wenlin_hanzi W4DB > hanzi.json
|
48
|
+
```
|
49
|
+
|
50
|
+
### wenlin_parts
|
51
|
+
|
52
|
+
Parses a parts-of-speech database into a file containing one JSON line per part
|
53
|
+
of speech.
|
54
|
+
|
55
|
+
The parts of speech are referenced by the word definition databases, which use
|
56
|
+
their abbreviations.
|
57
|
+
|
58
|
+
```bash
|
59
|
+
wenlin_parts W4DB/ en > en_parts.json
|
60
|
+
wenlin_parts W4DB/ zh > zh_parts.json
|
61
|
+
```
|
62
|
+
|
63
|
+
### wenlin_dbdump
|
64
|
+
|
65
|
+
Extracts the raw text entries in a .db file. Useful for debugging and
|
66
|
+
understanding the record format.
|
67
|
+
|
68
|
+
```bash
|
69
|
+
wenlin_dbdump W4DB/abc_ce.db
|
70
|
+
```
|
71
|
+
|
72
|
+
|
73
|
+
## API Usage
|
74
|
+
|
75
|
+
The scripts in the `bin` directory are thin wrappers over the API. Read them if
|
76
|
+
you want to use the Ruby API directly.
|
77
|
+
|
78
|
+
It is very likely that you'll get your job done faster by using the output of
|
79
|
+
the CLI tools.
|
80
|
+
|
81
|
+
|
82
|
+
## Testing
|
83
|
+
|
84
|
+
I test this code by running the tools inside `bin` against the Wenlin databases,
|
85
|
+
and by spot-checking the output.
|
86
|
+
|
87
|
+
|
88
|
+
## Contributing
|
89
|
+
|
90
|
+
This tool works fairly well on the Wenlin 4 data files. Bugfixes and support
|
91
|
+
for new .db file formats are welcome; other features are most likely outside
|
92
|
+
the project's scope.
|
93
|
+
|
94
|
+
Note that this tool is designed to help move the data into another program,
|
95
|
+
so it only supports full table scans. Support for random access using the
|
96
|
+
B-tree indexes is outside the scope of this project.
|
97
|
+
|
98
|
+
|
99
|
+
## Copyright
|
100
|
+
|
101
|
+
This code is licensed under the
|
102
|
+
[CC0 Public Domain](http://creativecommons.org/publicdomain/zero/1.0/) license.
|
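Note on consuming the output described in the README above: each CLI tool writes one JSON object per line, so the resulting files can be processed as a stream rather than loaded whole. A minimal Ruby sketch, assuming entries carry keys such as `char`, `pinyin`, and `meaning`; the helper name `each_json_line` and the sample lines are made up for illustration and are not taken from the gem or from a Wenlin database.

```ruby
require 'json'
require 'stringio'

# Hypothetical sample standing in for a file such as hz_en.json.
SAMPLE = <<JSONL
{"char":"好","pinyin":["hǎo"],"meaning":"good"}

{"char":"中","pinyin":["zhōng"],"meaning":"middle"}
JSONL

# Yields one parsed Hash per non-empty line of the given IO.
# Returns an Enumerator when called without a block.
def each_json_line(io)
  return enum_for(:each_json_line, io) unless block_given?
  io.each_line do |line|
    line = line.strip
    next if line.empty?
    yield JSON.parse(line)
  end
end

entries = each_json_line(StringIO.new(SAMPLE)).to_a
entries.each { |entry| puts "#{entry['char']}: #{entry['meaning']}" }
```

With a real output file, `File.open('hz_en.json', 'r:utf-8')` could take the place of the `StringIO`.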
data/Rakefile
ADDED
@@ -0,0 +1,36 @@
|
|
1
|
+
# encoding: utf-8
|
2
|
+
|
3
|
+
require 'rubygems'
|
4
|
+
require 'bundler'
|
5
|
+
begin
|
6
|
+
Bundler.setup(:default, :development)
|
7
|
+
rescue Bundler::BundlerError => e
|
8
|
+
$stderr.puts e.message
|
9
|
+
$stderr.puts "Run `bundle install` to install missing gems"
|
10
|
+
exit e.status_code
|
11
|
+
end
|
12
|
+
require 'rake'
|
13
|
+
|
14
|
+
require 'jeweler'
|
15
|
+
Jeweler::Tasks.new do |gem|
|
16
|
+
# gem is a Gem::Specification... see http://docs.rubygems.org/read/chapter/20 for more options
|
17
|
+
gem.name = "wenlin_db_scanner"
|
18
|
+
gem.homepage = "http://github.com/pwnall/wenlin_db_scanner"
|
19
|
+
gem.license = "CC0"
|
20
|
+
gem.summary = %Q{Extracts the data from the Wenlin dictionary}
|
21
|
+
gem.description = <<END
|
22
|
+
The Wenlin dictionary contains two great databases, the ABC English<->Chinese
|
23
|
+
dictionary, and the Character Description Language (CDL). Unfortunately, this
|
24
|
+
data is wrapped by a less-than-great UI. This gem lets you extract the data so
|
25
|
+
you can build your own UI for it.
|
26
|
+
END
|
27
|
+
gem.email = "victor@costan.us"
|
28
|
+
gem.authors = ["Victor Costan"]
|
29
|
+
# dependencies defined in Gemfile
|
30
|
+
end
|
31
|
+
Jeweler::RubygemsDotOrgTasks.new
|
32
|
+
|
33
|
+
task :default => :install
|
34
|
+
|
35
|
+
require 'yard'
|
36
|
+
YARD::Rake::YardocTask.new
|
data/VERSION
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
0.2.0
|
data/bin/wenlin_dbdump
ADDED
@@ -0,0 +1,24 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
# Requires Ruby 1.9, tested on MRI 1.9.3.
|
3
|
+
|
4
|
+
require 'wenlin_db_scanner'
|
5
|
+
|
6
|
+
unless ARGV.length == 1
|
7
|
+
STDERR.puts "Usage: #{$0} path-to-db-file"
|
8
|
+
exit 1
|
9
|
+
end
|
10
|
+
|
11
|
+
db = WenlinDbScanner::Db.new ARGV[0]
|
12
|
+
db.records.each do |record|
|
13
|
+
puts "---------- record tag: #{record.tag} = 0b#{'%b' % record.tag}"
|
14
|
+
|
15
|
+
if record.binary?
|
16
|
+
puts "---------- binary record, size: #{record.size}"
|
17
|
+
next
|
18
|
+
end
|
19
|
+
|
20
|
+
puts record.text
|
21
|
+
puts "---------- record end"
|
22
|
+
end
|
23
|
+
db.close
|
24
|
+
|
data/bin/wenlin_dict
ADDED
@@ -0,0 +1,24 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
# Requires Ruby 1.9, tested on MRI 1.9.3.
|
3
|
+
|
4
|
+
require 'json'
|
5
|
+
require 'wenlin_db_scanner'
|
6
|
+
|
7
|
+
unless ARGV.length == 2
|
8
|
+
STDERR.puts "Usage: #{$0} path-to-db-dir en-zh|zh-en|hz-en"
|
9
|
+
exit 1
|
10
|
+
end
|
11
|
+
|
12
|
+
case ARGV[1]
|
13
|
+
when 'en-zh'
|
14
|
+
entries = WenlinDbScanner::Dicts.en_zh ARGV[0]
|
15
|
+
when 'zh-en'
|
16
|
+
entries = WenlinDbScanner::Dicts.zh_en ARGV[0]
|
17
|
+
when 'hz-en'
|
18
|
+
entries = WenlinDbScanner::Chars.hz_en ARGV[0]
|
19
|
+
else
|
20
|
+
STDERR.puts "Unknown dictionary #{ARGV[1]}\nUse en-zh, zh-en, or hz-en\n"
|
21
|
+
exit 1
|
22
|
+
end
|
23
|
+
|
24
|
+
entries.each { |entry| puts entry.to_hash.to_json }
|
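Note on the last line of `wenlin_dict`: each entry is converted with `to_hash` and printed as one JSON line. The entry structs in this package define `to_hash` so that unset members are dropped. A minimal sketch of that pattern, using a hypothetical three-member `Entry` struct rather than the gem's own classes:

```ruby
require 'json'

# Hypothetical stand-in for the gem's entry structs.
Entry = Struct.new(:char, :meaning, :note) do
  # Returns only the members that were actually set.
  def to_hash
    Hash[each_pair.reject { |_key, value| value.nil? }]
  end
end

entry = Entry.new('好', 'good')  # note is left nil
line = entry.to_hash.to_json     # the nil note is absent from the JSON
puts line
```

Dropping nil members keeps each output line short, at the cost that consumers must treat a missing key as "not present" rather than expecting an explicit null.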
data/bin/wenlin_hanzi
ADDED
@@ -0,0 +1,13 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
# Requires Ruby 1.9, tested on MRI 1.9.3.
|
3
|
+
|
4
|
+
require 'json'
|
5
|
+
require 'wenlin_db_scanner'
|
6
|
+
|
7
|
+
unless ARGV.length == 1
|
8
|
+
abort "Usage: #{$0} path-to-db-dir"  # prints to STDERR and exits with status 1
|
9
|
+
end
|
10
|
+
|
11
|
+
chars = WenlinDbScanner::Chars.hanzi ARGV[0]
|
12
|
+
|
13
|
+
chars.each { |char| puts char.to_hash.to_json }
|
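Note on the `wenlin_hanzi` output: the hashes it prints come from the CDL attribute decoding in `chars.rb`, where `points` becomes a list of integer pairs and `uni` is read as a hexadecimal code point. A standalone sketch of those two conversions; the helper names `decode_points` and `decode_uni` and the sample values are assumptions for illustration, not real CDL data.

```ruby
# Turns a CDL-style points string into [x, y] integer pairs.
def decode_points(raw_value)
  raw_value.split(' ').map do |pair|
    pair.split(',').map { |coord| coord.strip.to_i }
  end
end

# Parses a hexadecimal code point string, as done for the `uni` attribute.
def decode_uni(raw_value)
  raw_value.strip.to_i(16)
end

points = decode_points('0,0 128,64')  # made-up coordinates
codepoint = decode_uni(' 597D ')      # hexadecimal code point with stray spaces
puts points.inspect
puts [codepoint].pack('U')            # the character for that code point
```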
data/bin/wenlin_parts
ADDED
@@ -0,0 +1,23 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
# Requires Ruby 1.9, tested on MRI 1.9.3.
|
3
|
+
|
4
|
+
require 'json'
|
5
|
+
require 'wenlin_db_scanner'
|
6
|
+
|
7
|
+
unless ARGV.length == 2
|
8
|
+
STDERR.puts "Usage: #{$0} path-to-db-dir en|zh"
|
9
|
+
exit 1
|
10
|
+
end
|
11
|
+
|
12
|
+
case ARGV[1]
|
13
|
+
when 'en'
|
14
|
+
parts = WenlinDbScanner::SpeechParts.en ARGV[0]
|
15
|
+
when 'zh'
|
16
|
+
parts = WenlinDbScanner::SpeechParts.zh ARGV[0]
|
17
|
+
else
|
18
|
+
STDERR.puts "Unknown language #{ARGV[1]}\nUse en or zh"
|
19
|
+
exit 1
|
20
|
+
end
|
21
|
+
|
22
|
+
parts.each { |part| puts part.to_hash.to_json }
|
23
|
+
|
data/lib/wenlin_db_scanner.rb
ADDED
@@ -0,0 +1,13 @@
|
|
1
|
+
# Namespace for the Db scanner classes.
|
2
|
+
#
|
3
|
+
# The awfully long name was chosen on purpose, to save the good names for more
|
4
|
+
# useful libraries.
|
5
|
+
module WenlinDbScanner
|
6
|
+
end
|
7
|
+
|
8
|
+
require 'wenlin_db_scanner/chars.rb'
|
9
|
+
require 'wenlin_db_scanner/db.rb'
|
10
|
+
require 'wenlin_db_scanner/db_record.rb'
|
11
|
+
require 'wenlin_db_scanner/dict.rb'
|
12
|
+
require 'wenlin_db_scanner/speech_parts.rb'
|
13
|
+
|
data/lib/wenlin_db_scanner/chars.rb
ADDED
@@ -0,0 +1,210 @@
|
|
1
|
+
# coding: utf-8
|
2
|
+
|
3
|
+
require 'rexml/document'
|
4
|
+
|
5
|
+
module WenlinDbScanner
|
6
|
+
|
7
|
+
# Parses the data in the character (hanzi) databases.
|
8
|
+
module Chars
|
9
|
+
# The entries in the database that breaks down hanzi into components.
|
10
|
+
#
|
11
|
+
# @param [String] db_root the directory containing the .db files
|
12
|
+
# @return [Enumerator<Hash>]
|
13
|
+
def self.hanzi(db_root)
|
14
|
+
_hanzi File.join(db_root, 'cdl.db')
|
15
|
+
end
|
16
|
+
|
17
|
+
# Decoder for a CDL database.
|
18
|
+
#
|
19
|
+
# @param [String] db_file path to the .db file containing CDL data
|
20
|
+
# @return [Enumerator<Hash>]
|
21
|
+
def self._hanzi(db_file)
|
22
|
+
Enumerator.new do |yielder|
|
23
|
+
db = Db.new db_file
|
24
|
+
db.records.each do |record|
|
25
|
+
next if record.binary?
|
26
|
+
xml = REXML::Document.new record.text
|
27
|
+
|
28
|
+
entry = {}
|
29
|
+
xml.root.attributes.each do |name, raw_value|
|
30
|
+
key = name.to_sym
|
31
|
+
entry[key] = cdl_attribute_value key, raw_value
|
32
|
+
end
|
33
|
+
|
34
|
+
entry[:parts] = xml.root.elements.map do |element|
|
35
|
+
part = { part: element.name.to_sym }
|
36
|
+
element.attributes.each do |name, raw_value|
|
37
|
+
key = name.to_sym
|
38
|
+
part[key] = cdl_attribute_value key, raw_value
|
39
|
+
end
|
40
|
+
part
|
41
|
+
end
|
42
|
+
|
43
|
+
yielder << entry
|
44
|
+
end
|
45
|
+
end
|
46
|
+
end
|
47
|
+
|
48
|
+
# Decodes known attributes for CDL XML elements.
|
49
|
+
#
|
50
|
+
# @param [Symbol] key the attribute's name, symbolized
|
51
|
+
# @param [String] raw_value the attribute's value
|
52
|
+
# @return [Integer, Array, String] a more programmer-friendly value
|
53
|
+
def self.cdl_attribute_value(key, raw_value)
|
54
|
+
case key
|
55
|
+
when :points # coordinates
|
56
|
+
raw_value.split(' ').map do |pair|
|
57
|
+
pair.split(',').map { |coord| coord.strip.to_i }
|
58
|
+
end
|
59
|
+
when :radical # dictionary radicals?
|
60
|
+
raw_value.strip.split(' ').map(&:strip)
|
61
|
+
when :type # stroke type
|
62
|
+
raw_value.strip.to_sym
|
63
|
+
when :uni # unicode value
|
64
|
+
raw_value.strip.to_i(16)
|
65
|
+
else
|
66
|
+
raw_value.strip
|
67
|
+
end
|
68
|
+
end
|
69
|
+
|
70
|
+
# The entries in the hanzi -> English meaning dictionary.
|
71
|
+
#
|
72
|
+
# @param [String] db_root the directory containing the .db files
|
73
|
+
# @return [Enumerator<CharMeaning>]
|
74
|
+
def self.hz_en(db_root)
|
75
|
+
_hz_en File.join(db_root, 'zidian.db')
|
76
|
+
end
|
77
|
+
|
78
|
+
# Decoder for a database of hanzi -> English meaning entries.
|
79
|
+
#
|
80
|
+
# @param [String] db_file path to the .db file containing dictionary data
|
81
|
+
# @return [Enumerator<CharMeaning>]
|
82
|
+
def self._hz_en(db_file)
|
83
|
+
Enumerator.new do |yielder|
|
84
|
+
db = Db.new db_file
|
85
|
+
db.records.each do |record|
|
86
|
+
next if record.binary?
|
87
|
+
lines = record.text.split("\n").map(&:strip).reject(&:empty?)
|
88
|
+
|
89
|
+
header = lines[0]
|
90
|
+
|
91
|
+
entry = CharMeaning.new
|
92
|
+
entry.char = header[0, 1]
|
93
|
+
header = header[1..-1]
|
94
|
+
|
95
|
+
entry.pinyin = header.scan(/\[([^\]]*)\]/).
|
96
|
+
map { |match| match.first.strip }
|
97
|
+
entry.latin_pinyin =
|
98
|
+
entry.pinyin.map { |pinyin| pinyin_to_latin pinyin }
|
99
|
+
header.gsub!(/\[[^\]]*\]/, '')
|
100
|
+
header.strip!
|
101
|
+
|
102
|
+
header.scan(/\([^\)]+\)/).each do |aside|
|
103
|
+
aside_text = aside[1...-1]
|
104
|
+
case aside_text[0]
|
105
|
+
when '='
|
106
|
+
entry.variants = aside_text[1..-1].chars.to_a
|
107
|
+
header.gsub! aside, ''
|
108
|
+
when '!', '?'
|
109
|
+
entry.related ||= []
|
110
|
+
entry.related += aside_text[1..-1].chars.to_a
|
111
|
+
header.gsub! aside, ''
|
112
|
+
when 'F'
|
113
|
+
entry.complex_forms = aside_text[1..-1].chars.to_a
|
114
|
+
header.gsub! aside, ''
|
115
|
+
when 'S'
|
116
|
+
entry.simplified_forms = aside_text[1..-1].chars.to_a
|
117
|
+
header.gsub! aside, ''
|
118
|
+
when 'u', 'U'
|
119
|
+
if /^Unihan/i =~ aside_text
|
120
|
+
header.gsub! aside, ''
|
121
|
+
end
|
122
|
+
end
|
123
|
+
end
|
124
|
+
header.strip!
|
125
|
+
# Many definitions start with a (note).
|
126
|
+
if note_match = /^\(([^\)]*)\)/.match(header)
|
127
|
+
entry.note = note_match[1]
|
128
|
+
header = header[note_match[0].length..-1].strip
|
129
|
+
end
|
130
|
+
entry.meaning = header.gsub(/\s*<hr\s*\/?>\s*/, "\n")
|
131
|
+
|
132
|
+
lines[1..-1].each do |line|
|
133
|
+
unless line[0] == ?#
|
134
|
+
if entry.note
|
135
|
+
entry.note << "/ #{line}"
|
136
|
+
else
|
137
|
+
entry.note = line
|
138
|
+
end
|
139
|
+
next
|
140
|
+
end
|
141
|
+
|
142
|
+
tag, data = line[1], line[2..-1].strip
|
143
|
+
case tag
|
144
|
+
when 'c'
|
145
|
+
entry.components = data.chars.to_a
|
146
|
+
when 'r'
|
147
|
+
# NOTE: skipping remarks
|
148
|
+
when 'y'
|
149
|
+
entry.cantonese = data
|
150
|
+
end
|
151
|
+
end
|
152
|
+
|
153
|
+
yielder << entry
|
154
|
+
end
|
155
|
+
end
|
156
|
+
end
|
157
|
+
|
158
|
+
# Removes the accents from a pinyin string.
|
159
|
+
#
|
160
|
+
# This computes the closest Latin alphabet string matching the given pinyin
|
161
|
+
# string. It is what users will most likely type to refer to the character,
|
162
|
+
# word or phrase inside the pinyin-spelling string.
|
163
|
+
#
|
164
|
+
# @param [String] pinyin a string that uses pinyin spelling
|
165
|
+
# @return [String] the closest approximation to the given string that only
|
166
|
+
# uses Latin characters
|
167
|
+
def self.pinyin_to_latin(pinyin)
|
168
|
+
pinyin.tr 'āēīōūǖĀĒĪŌŪǕáéíóúǘÁÉÍÓÚǗǎěǐǒǔǚǍĚǏǑǓǙàèìòùǜÀÈÌÒÙǛüÜ',
|
169
|
+
'aeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVvV'
|
170
|
+
end
|
171
|
+
end # module WenlinDbScanner::Chars
|
172
|
+
|
173
|
+
# Wraps a record in a dictionary database
|
174
|
+
class CharMeaning < Struct.new(:char, :meaning, :note, :pinyin, :variants,
|
175
|
+
:complex_forms, :simplified_forms,
|
176
|
+
:components, :cantonese, :related,
|
177
|
+
:latin_pinyin)
|
178
|
+
# @!attribute [r] char
|
179
|
+
# @return [String] 1-character string containing the defined character
|
180
|
+
# @!attribute [r] meaning
|
181
|
+
# @return [String] the character's definition, in English
|
182
|
+
# @!attribute [r] note
|
183
|
+
# @return [String] e.g., "same as X" or
|
184
|
+
# @!attribute [r] pinyin
|
185
|
+
# @return [Array<String>] pinyin pronunciation(s) of the character
|
186
|
+
# @!attribute [r] latin_pinyin
|
187
|
+
# @return [Array<String>] pinyin pronunciation(s) of the character, with
|
188
|
+
# the accents removed; this is what users type to get the
|
189
|
+
# character
|
190
|
+
# @!attribute [r] variants
|
191
|
+
# @return [Array<String>] other variants of the character
|
192
|
+
# @!attribute [r] related
|
193
|
+
# @return [Array<String>] characters that are somehow related
|
194
|
+
# @!attribute [r] simplified_forms
|
195
|
+
# @return [Array<String>] simplified variants of the character
|
196
|
+
# @!attribute [r] complex_forms
|
197
|
+
# @return [Array<String>] this character is a simplified variant of them
|
198
|
+
# @!attribute [r] components
|
199
|
+
# @return [Array<String>] 1-character strings with characters that are
|
200
|
+
# contained in this character's image
|
201
|
+
# @!attribute [r] cantonese
|
202
|
+
# @return [String] character's pronunciation in Cantonese
|
203
|
+
|
204
|
+
# @return [Hash]
|
205
|
+
def to_hash
|
206
|
+
Hash[each_pair.reject { |k, v| v.nil? }.to_a]
|
207
|
+
end
|
208
|
+
end # class WenlinDbScanner::CharMeaning
|
209
|
+
|
210
|
+
end # namespace WenlinDbScanner
|
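Note on the pinyin handling in `chars.rb`: the header parser collects every bracketed reading and then derives an accent-free form with `String#tr`. A standalone sketch of both steps, reusing the translation table from `pinyin_to_latin`; the constant names and the sample header string are assumptions, not an actual database record.

```ruby
# Same translation table as pinyin_to_latin in chars.rb.
ACCENTED = 'āēīōūǖĀĒĪŌŪǕáéíóúǘÁÉÍÓÚǗǎěǐǒǔǚǍĚǏǑǓǙàèìòùǜÀÈÌÒÙǛüÜ'
PLAIN    = 'aeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVaeiouvAEIOUVvV'

# Maps each accented vowel to its plain letter; every form of u-umlaut maps to v.
def pinyin_to_latin(pinyin)
  pinyin.tr ACCENTED, PLAIN
end

header = '好 [hǎo] [hào] good'  # hypothetical header line

# Each [..] group is one reading; scan returns one capture array per match.
pinyin = header.scan(/\[([^\]]*)\]/).map { |match| match.first.strip }
latin = pinyin.map { |syllable| pinyin_to_latin syllable }

puts pinyin.inspect
puts latin.inspect
```

Because the tone marks are discarded, distinct readings can collapse to the same Latin string, so the result may contain duplicates.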