eiwa 0.0.2 → 0.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: c5888e4802408cc8efdb55ddadfb560ec38d10971fee95b20dd53af2f31f487c
4
- data.tar.gz: 93b7101b430ee123a905065f87e5d8ba336e1a18f145c7801daeb2b9b9a5ba72
3
+ metadata.gz: ef277fcf117e28dcbc32fc6e17c8d34f1783b546f9adf5b9452e04d073d31ee8
4
+ data.tar.gz: d1e744a10c6e688e532da9855790ba1a9fad89338b75f0924ee9a5f90294e3ee
5
5
  SHA512:
6
- metadata.gz: 64faccd9958b9c359fcd7a7ff40de013bd1060bbf9a59735be4d52c824dd3e6e77abbb81ab42ae037b9175990dd15baa33722503b47d8aaebbb037a8a303f965
7
- data.tar.gz: 209efe931acfa8563ea1819f4e9b5a7d07a996d1545458b139de14f82971b27f47e0da726ed5151b3922aa05f6c5809bd3196269e0f6d254cb87e801c81ffe6e
6
+ metadata.gz: '090773b16ffc636c53cbd957e5f6131449c455bf4e00c39908f78403c2c284bd7364b9e33ea9f6d8ac9a930777cfdc1f67e296d701ee7224fd6572da812f5f93'
7
+ data.tar.gz: d43f3a9ded86a0a3238eaac28c86b1d3d7b95dbe3d7a7c6eb8fe389014c7393600009f0c79f69cf7b06ff8dcd84650c0fa679524703cd36207b795b994892375
@@ -1,42 +1,50 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- eiwa (0.0.2)
4
+ eiwa (0.1.0)
5
5
  nokogiri
6
6
 
7
7
  GEM
8
8
  remote: https://rubygems.org/
9
9
  specs:
10
- ast (2.4.0)
11
- coderay (1.1.2)
12
- jaro_winkler (1.5.3)
13
- method_source (0.9.2)
14
- mini_portile2 (2.4.0)
15
- minitest (5.11.3)
16
- nokogiri (1.10.9)
17
- mini_portile2 (~> 2.4.0)
18
- parallel (1.17.0)
19
- parser (2.6.4.1)
20
- ast (~> 2.4.0)
21
- pry (0.12.2)
22
- coderay (~> 1.1.0)
23
- method_source (~> 0.9.0)
10
+ ast (2.4.1)
11
+ coderay (1.1.3)
12
+ method_source (1.0.0)
13
+ mini_portile2 (2.5.0)
14
+ minitest (5.14.3)
15
+ nokogiri (1.11.1)
16
+ mini_portile2 (~> 2.5.0)
17
+ racc (~> 1.4)
18
+ parallel (1.20.1)
19
+ parser (3.0.0.0)
20
+ ast (~> 2.4.1)
21
+ pry (0.13.1)
22
+ coderay (~> 1.1)
23
+ method_source (~> 1.0)
24
+ racc (1.5.2)
24
25
  rainbow (3.0.0)
25
- rake (13.0.1)
26
- rubocop (0.72.0)
27
- jaro_winkler (~> 1.5.1)
26
+ rake (13.0.3)
27
+ regexp_parser (2.0.3)
28
+ rexml (3.2.4)
29
+ rubocop (1.7.0)
28
30
  parallel (~> 1.10)
29
- parser (>= 2.6)
31
+ parser (>= 2.7.1.5)
30
32
  rainbow (>= 2.2.2, < 4.0)
33
+ regexp_parser (>= 1.8, < 3.0)
34
+ rexml
35
+ rubocop-ast (>= 1.2.0, < 2.0)
31
36
  ruby-progressbar (~> 1.7)
32
- unicode-display_width (>= 1.4.0, < 1.7)
33
- rubocop-performance (1.4.1)
34
- rubocop (>= 0.71.0)
35
- ruby-progressbar (1.10.1)
36
- standard (0.1.4)
37
- rubocop (~> 0.72.0)
38
- rubocop-performance (~> 1.4.0)
39
- unicode-display_width (1.6.0)
37
+ unicode-display_width (>= 1.4.0, < 2.0)
38
+ rubocop-ast (1.4.0)
39
+ parser (>= 2.7.1.5)
40
+ rubocop-performance (1.9.2)
41
+ rubocop (>= 0.90.0, < 2.0)
42
+ rubocop-ast (>= 0.4.0)
43
+ ruby-progressbar (1.11.0)
44
+ standard (0.11.0)
45
+ rubocop (= 1.7.0)
46
+ rubocop-performance (= 1.9.2)
47
+ unicode-display_width (1.7.0)
40
48
 
41
49
  PLATFORMS
42
50
  ruby
data/README.md CHANGED
@@ -1,7 +1,12 @@
1
1
  # eiwa / 英和
2
2
 
3
- Parses the Japanese-English version of JMDict, a daily export of the WWWJDIC
4
- online Japanese dictionary.
3
+ Parses two types of Japanese-English dictionaries:
4
+
5
+ * `:jmdict_e` - [JMDict](http://www.edrdg.org/jmdict/edict_doc.html)'s
6
+ English-only export of the WWWJDIC online Japanese dictionary.
7
+ * `:kanjidic2` - the
8
+ [KANJIDIC2](http://www.edrdg.org/wiki/index.php/KANJIDIC_Project) dictionary
9
+ of roughly 13,000 kanji characters
5
10
 
6
11
  ## Usage
7
12
 
@@ -23,15 +28,24 @@ gem 'eiwa'
23
28
 
24
29
  Get your hands on a supported dictionary. Right now eiwa only parses
25
30
  [JMDict](http://www.edrdg.org/jmdict/j_jmdict.html), which can be fetched from
26
- the [Monash ftp site](http://ftp.monash.edu/pub/nihongo/00INDEX.html) or with a
31
+ the [EDRDG ftp site](http://ftp.edrdg.org/pub/Nihongo/00INDEX.html) or with a
27
32
  script like this, for the Japanese-English export:
28
33
 
29
34
  ```bash
30
- curl http://ftp.monash.edu/pub/nihongo/JMdict_e -o jmdict.xml
35
+ # Download JMDICT-E:
36
+ $ curl http://ftp.edrdg.org/pub/Nihongo/JMdict_e.gz -o jmdict.xml.gz
37
+ # Unzip to jmdict.xml
38
+ $ gunzip jmdict.xml.gz
39
+
40
+ # Download KANJIDIC2:
41
+ $ curl http://www.edrdg.org/kanjidic/kanjidic2.xml.gz -o kanjidic2.xml.gz
42
+ # Unzip to kanjidic2.xml
43
+ $ gunzip kanjidic2.xml.gz
31
44
  ```
32
45
 
33
- This file is updated daily, and is essentially an export of all vocabulary on
34
- the [WWWJDIC application](http://nihongo.monash.edu/cgi-bin/wwwjdic?1C)
46
+ These files are updated daily, and are essentially an export of all vocabulary
47
+ and kanji in the [WWWJDIC
48
+ application](http://nihongo.monash.edu/cgi-bin/wwwjdic?1C)
35
49
 
36
50
  ### Parse the dictionary
37
51
 
@@ -44,13 +58,11 @@ array and one that will invoke a provided block with each entry, but which won't
44
58
  retain a reference to the entries, allowing Ruby to garbage collect them as it
45
59
  goes.
46
60
 
47
- Parsing the dictionary is CPU intensive, and takes about 13 seconds on my 2019
48
- 13" MacBook Pro.
49
-
50
61
  #### Passing a block
51
62
 
52
63
  If you just want to do some processing on each entry, it probably makes sense to
53
- invoke the library by passing a block
64
+ invoke the library by passing a block (note that supported types include only
65
+ `:jmdict_e` and `:kanjidic2`)
54
66
 
55
67
  ```ruby
56
68
  Eiwa.parse_file("path/to/some.xml", type: :jmdict_e) do |entry|
@@ -74,6 +86,3 @@ entries = Eiwa.parse_file("path/to/some.xml", type: :jmdict_e)
74
86
  Note that for the abridged Japanese-English dictionary, this will consume about
75
87
  500MB of RAM.
76
88
 
77
- ### The entry object model
78
-
79
- I haven't documented the [Entry](https://github.com/searls/eiwa/blob/master/lib/eiwa/tag/entry.rb) type or its child types yet, but they should be pretty easy to piece together by inspecting the output and [checking the source listings](https://github.com/searls/eiwa/blob/master/lib/eiwa/tag).
@@ -1,15 +1,27 @@
1
1
  require "eiwa/version"
2
- require "eiwa/parses_jmdict_file"
2
+
3
+ require "eiwa/tag/any"
4
+ require "eiwa/tag/character"
5
+ require "eiwa/tag/bag"
6
+ require "eiwa/tag/list"
7
+ require "eiwa/tag/reading_meaning"
8
+ require "eiwa/tag/entry"
9
+ require "eiwa/tag/spelling"
10
+ require "eiwa/tag/reading"
11
+ require "eiwa/tag/meaning"
12
+ require "eiwa/tag/entity"
13
+ require "eiwa/tag/cross_reference"
14
+ require "eiwa/tag/antonym"
15
+ require "eiwa/tag/source_language"
16
+ require "eiwa/tag/definition"
17
+ require "eiwa/tag/other"
18
+
19
+ require "eiwa/parses_file"
3
20
 
4
21
  module Eiwa
5
22
  class Error < StandardError; end
6
23
 
7
24
  def self.parse_file(filename, type: :jmdict_e, &each_entry_block)
8
- case type
9
- when :jmdict_e
10
- ParsesJmdictFile.new.call(filename, each_entry_block)
11
- else
12
- raise Eiwa::Error.new("Unknown file type: #{type}")
13
- end
25
+ ParsesFile.new.call(filename, type, each_entry_block)
14
26
  end
15
27
  end
@@ -0,0 +1,85 @@
1
+ require_relative "entities"
2
+
3
+ module Eiwa
4
+ module Jmdict
5
+ TAGS = {
6
+ "entry" => Tag::Entry,
7
+ "k_ele" => Tag::Spelling,
8
+ "r_ele" => Tag::Reading,
9
+ "sense" => Tag::Meaning,
10
+ "pos" => Tag::Entity,
11
+ "misc" => Tag::Entity,
12
+ "dial" => Tag::Entity,
13
+ "field" => Tag::Entity,
14
+ "ke_inf" => Tag::Entity,
15
+ "re_inf" => Tag::Entity,
16
+ "xref" => Tag::CrossReference,
17
+ "ant" => Tag::Antonym,
18
+ "lsource" => Tag::SourceLanguage,
19
+ "gloss" => Tag::Definition
20
+ }
21
+
22
+ class Doc < Nokogiri::XML::SAX::Document
23
+ def initialize(each_entry_block)
24
+ @each_entry_block = each_entry_block
25
+ @current = nil
26
+ end
27
+
28
+ def start_document
29
+ end
30
+
31
+ def end_document
32
+ end
33
+
34
+ def start_element(name, attrs)
35
+ parent = @current
36
+ @current = (TAGS[name] || Tag::Other).new
37
+ @current.start(name, attrs, parent)
38
+ end
39
+
40
+ def end_element(name)
41
+ raise Eiwa::Error.new("Parsing error. Expected <#{@current.tag_name}> to close before <#{name}>") if @current.tag_name != name
42
+ ending = @current
43
+ ending.end_self
44
+ if ending.is_a?(Tag::Entry)
45
+ @each_entry_block&.call(ending)
46
+ end
47
+
48
+ @current = ending.parent
49
+ @current&.end_child(ending)
50
+ end
51
+
52
+ def characters(s)
53
+ @current.add_characters(s)
54
+ end
55
+
56
+ # def comment string
57
+ # puts "comment #{string}"
58
+ # end
59
+
60
+ # def warning string
61
+ # puts "warning #{string}"
62
+ # end
63
+
64
+ def error(msg)
65
+ if (matches = msg.match(/Entity '(\S+)' not defined/))
66
+ # See: http://github.com/sparklemotion/nokogiri/issues/1926
67
+ code = matches[1]
68
+ @current.set_entity(code, ENTITIES[code])
69
+ elsif msg == "Detected an entity reference loop\n"
70
+ # Do nothing and hope this does not matter.
71
+ else
72
+ raise Eiwa::Error.new("Parsing error: #{msg}")
73
+ end
74
+ end
75
+
76
+ # def cdata_block string
77
+ # puts "cdata_block #{string}"
78
+ # end
79
+
80
+ # def processing_instruction name, content
81
+ # puts "processing_instruction #{name}, #{content}"
82
+ # end
83
+ end
84
+ end
85
+ end
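The handler above tracks nesting with a single `@current` pointer: each new element remembers its parent, and `end_element` pops back to that parent. A minimal sketch of that bookkeeping follows, with no Nokogiri involved. `Node` and `TreeHandler` are hypothetical names used only for illustration, and the events are fed in by hand rather than by a real SAX parser.

```ruby
# Hypothetical stand-in for the gem's Tag classes; not the gem's actual API.
Node = Struct.new(:tag_name, :parent, :children, :text)

class TreeHandler
  attr_reader :finished

  def initialize
    @current = nil   # innermost element that is still open
    @finished = []   # top-level elements that have been closed
  end

  def start_element(name)
    # The new node keeps a pointer to whatever was open before it.
    @current = Node.new(name, @current, [], +"")
  end

  def characters(s)
    @current.text << s
  end

  def end_element(name)
    if @current.tag_name != name
      raise "Expected <#{@current.tag_name}> to close before <#{name}>"
    end
    ending = @current
    @current = ending.parent          # pop back to the parent
    if @current
      @current.children << ending     # hand the finished child to its parent
    else
      @finished << ending
    end
  end
end

# Simulate the events for <entry><gloss>dictionary</gloss></entry>.
handler = TreeHandler.new
handler.start_element("entry")
handler.start_element("gloss")
handler.characters("dictionary")
handler.end_element("gloss")
handler.end_element("entry")
root = handler.finished.first
```

Because only the parent pointer is kept, a finished top-level node can be handed to a callback and then dropped, which is presumably what lets the block form of `Eiwa.parse_file` avoid holding every entry in memory.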
@@ -0,0 +1,180 @@
1
+ module Eiwa
2
+ module Jmdict
3
+ ENTITIES = {
4
+ "Buddh" => "Buddhist term",
5
+ "MA" => "martial arts term",
6
+ "Shinto" => "Shinto term",
7
+ "X" => "rude or X-rated term (not displayed in educational software)",
8
+ "abbr" => "abbreviation",
9
+ "adj-f" => "noun or verb acting prenominally",
10
+ "adj-i" => "adjective (keiyoushi)",
11
+ "adj-ix" => "adjective (keiyoushi) - yoi/ii class",
12
+ "adj-kari" => "`kari' adjective (archaic)",
13
+ "adj-ku" => "`ku' adjective (archaic)",
14
+ "adj-na" => "adjectival nouns or quasi-adjectives (keiyodoshi)",
15
+ "adj-nari" => "archaic/formal form of na-adjective",
16
+ "adj-no" => "nouns which may take the genitive case particle `no'",
17
+ "adj-pn" => "pre-noun adjectival (rentaishi)",
18
+ "adj-shiku" => "`shiku' adjective (archaic)",
19
+ "adj-t" => "`taru' adjective",
20
+ "adv" => "adverb (fukushi)",
21
+ "adv-to" => "adverb taking the `to' particle",
22
+ "anat" => "anatomical term",
23
+ "arch" => "archaism",
24
+ "archit" => "architecture term",
25
+ "astron" => "astronomy, etc. term",
26
+ "ateji" => "ateji (phonetic) reading",
27
+ "aux" => "auxiliary",
28
+ "aux-adj" => "auxiliary adjective",
29
+ "aux-v" => "auxiliary verb",
30
+ "baseb" => "baseball term",
31
+ "biol" => "biology term",
32
+ "bot" => "botany term",
33
+ "bus" => "business term",
34
+ "chem" => "chemistry term",
35
+ "chn" => "children's language",
36
+ "col" => "colloquialism",
37
+ "comp" => "computer terminology",
38
+ "conj" => "conjunction",
39
+ "cop" => "copula",
40
+ "cop-da" => "copula",
41
+ "ctr" => "counter",
42
+ "derog" => "derogatory",
43
+ "eK" => "exclusively kanji",
44
+ "econ" => "economics term",
45
+ "ek" => "exclusively kana",
46
+ "engr" => "engineering term",
47
+ "exp" => "expressions (phrases, clauses, etc.)",
48
+ "fam" => "familiar language",
49
+ "fem" => "female term or language",
50
+ "finc" => "finance term",
51
+ "food" => "food term",
52
+ "geol" => "geology, etc. term",
53
+ "geom" => "geometry term",
54
+ "gikun" => "gikun (meaning as reading) or jukujikun (special kanji reading)",
55
+ "hob" => "Hokkaido-ben",
56
+ "hon" => "honorific or respectful (sonkeigo) language",
57
+ "hum" => "humble (kenjougo) language",
58
+ "iK" => "word containing irregular kanji usage",
59
+ "id" => "idiomatic expression",
60
+ "ik" => "word containing irregular kana usage",
61
+ "int" => "interjection (kandoushi)",
62
+ "io" => "irregular okurigana usage",
63
+ "iv" => "irregular verb",
64
+ "joc" => "jocular, humorous term",
65
+ "ksb" => "Kansai-ben",
66
+ "ktb" => "Kantou-ben",
67
+ "kyb" => "Kyoto-ben",
68
+ "kyu" => "Kyuushuu-ben",
69
+ "law" => "law, etc. term",
70
+ "ling" => "linguistics terminology",
71
+ "m-sl" => "manga slang",
72
+ "mahj" => "mahjong term",
73
+ "male" => "male term or language",
74
+ "male-sl" => "male slang",
75
+ "math" => "mathematics",
76
+ "med" => "medicine, etc. term",
77
+ "mil" => "military",
78
+ "music" => "music term",
79
+ "n" => "noun (common) (futsuumeishi)",
80
+ "n-adv" => "adverbial noun (fukushitekimeishi)",
81
+ "n-pr" => "proper noun",
82
+ "n-pref" => "noun, used as a prefix",
83
+ "n-suf" => "noun, used as a suffix",
84
+ "n-t" => "noun (temporal) (jisoumeishi)",
85
+ "nab" => "Nagano-ben",
86
+ "num" => "numeric",
87
+ "oK" => "word containing out-dated kanji",
88
+ "obs" => "obsolete term",
89
+ "obsc" => "obscure term",
90
+ "oik" => "old or irregular kana form",
91
+ "ok" => "out-dated or obsolete kana usage",
92
+ "on-mim" => "onomatopoeic or mimetic word",
93
+ "osb" => "Osaka-ben",
94
+ "physics" => "physics terminology",
95
+ "pn" => "pronoun",
96
+ "poet" => "poetical term",
97
+ "pol" => "polite (teineigo) language",
98
+ "pref" => "prefix",
99
+ "proverb" => "proverb",
100
+ "prt" => "particle",
101
+ "quote" => "quotation",
102
+ "rare" => "rare",
103
+ "rkb" => "Ryuukyuu-ben",
104
+ "sens" => "sensitive",
105
+ "shogi" => "shogi term",
106
+ "sl" => "slang",
107
+ "sports" => "sports term",
108
+ "suf" => "suffix",
109
+ "sumo" => "sumo term",
110
+ "thb" => "Touhoku-ben",
111
+ "tsb" => "Tosa-ben",
112
+ "tsug" => "Tsugaru-ben",
113
+ "uK" => "word usually written using kanji alone",
114
+ "uk" => "word usually written using kana alone",
115
+ "unc" => "unclassified",
116
+ "v-unspec" => "verb unspecified",
117
+ "v1" => "Ichidan verb",
118
+ "v1-s" => "Ichidan verb - kureru special class",
119
+ "v2a-s" => "Nidan verb with 'u' ending (archaic)",
120
+ "v2b-k" => "Nidan verb (upper class) with `bu' ending (archaic)",
121
+ "v2b-s" => "Nidan verb (lower class) with `bu' ending (archaic)",
122
+ "v2d-k" => "Nidan verb (upper class) with `dzu' ending (archaic)",
123
+ "v2d-s" => "Nidan verb (lower class) with `dzu' ending (archaic)",
124
+ "v2g-k" => "Nidan verb (upper class) with `gu' ending (archaic)",
125
+ "v2g-s" => "Nidan verb (lower class) with `gu' ending (archaic)",
126
+ "v2h-k" => "Nidan verb (upper class) with `hu/fu' ending (archaic)",
127
+ "v2h-s" => "Nidan verb (lower class) with `hu/fu' ending (archaic)",
128
+ "v2k-k" => "Nidan verb (upper class) with `ku' ending (archaic)",
129
+ "v2k-s" => "Nidan verb (lower class) with `ku' ending (archaic)",
130
+ "v2m-k" => "Nidan verb (upper class) with `mu' ending (archaic)",
131
+ "v2m-s" => "Nidan verb (lower class) with `mu' ending (archaic)",
132
+ "v2n-s" => "Nidan verb (lower class) with `nu' ending (archaic)",
133
+ "v2r-k" => "Nidan verb (upper class) with `ru' ending (archaic)",
134
+ "v2r-s" => "Nidan verb (lower class) with `ru' ending (archaic)",
135
+ "v2s-s" => "Nidan verb (lower class) with `su' ending (archaic)",
136
+ "v2t-k" => "Nidan verb (upper class) with `tsu' ending (archaic)",
137
+ "v2t-s" => "Nidan verb (lower class) with `tsu' ending (archaic)",
138
+ "v2w-s" => "Nidan verb (lower class) with `u' ending and `we' conjugation (archaic)",
139
+ "v2y-k" => "Nidan verb (upper class) with `yu' ending (archaic)",
140
+ "v2y-s" => "Nidan verb (lower class) with `yu' ending (archaic)",
141
+ "v2z-s" => "Nidan verb (lower class) with `zu' ending (archaic)",
142
+ "v4b" => "Yodan verb with `bu' ending (archaic)",
143
+ "v4g" => "Yodan verb with `gu' ending (archaic)",
144
+ "v4h" => "Yodan verb with `hu/fu' ending (archaic)",
145
+ "v4k" => "Yodan verb with `ku' ending (archaic)",
146
+ "v4m" => "Yodan verb with `mu' ending (archaic)",
147
+ "v4n" => "Yodan verb with `nu' ending (archaic)",
148
+ "v4r" => "Yodan verb with `ru' ending (archaic)",
149
+ "v4s" => "Yodan verb with `su' ending (archaic)",
150
+ "v4t" => "Yodan verb with `tsu' ending (archaic)",
151
+ "v5aru" => "Godan verb - -aru special class",
152
+ "v5b" => "Godan verb with `bu' ending",
153
+ "v5g" => "Godan verb with `gu' ending",
154
+ "v5k" => "Godan verb with `ku' ending",
155
+ "v5k-s" => "Godan verb - Iku/Yuku special class",
156
+ "v5m" => "Godan verb with `mu' ending",
157
+ "v5n" => "Godan verb with `nu' ending",
158
+ "v5r" => "Godan verb with `ru' ending",
159
+ "v5r-i" => "Godan verb with `ru' ending (irregular verb)",
160
+ "v5s" => "Godan verb with `su' ending",
161
+ "v5t" => "Godan verb with `tsu' ending",
162
+ "v5u" => "Godan verb with `u' ending",
163
+ "v5u-s" => "Godan verb with `u' ending (special class)",
164
+ "v5uru" => "Godan verb - Uru old class verb (old form of Eru)",
165
+ "vi" => "intransitive verb",
166
+ "vk" => "Kuru verb - special class",
167
+ "vn" => "irregular nu verb",
168
+ "vr" => "irregular ru verb, plain form ends with -ri",
169
+ "vs" => "noun or participle which takes the aux. verb suru",
170
+ "vs-c" => "su verb - precursor to the modern suru",
171
+ "vs-i" => "suru verb - included",
172
+ "vs-s" => "suru verb - special class",
173
+ "vt" => "transitive verb",
174
+ "vulg" => "vulgar expression or word",
175
+ "vz" => "Ichidan verb - zuru verb (alternative form of -jiru verbs)",
176
+ "yoji" => "yojijukugo",
177
+ "zool" => "zoology term"
178
+ }
179
+ end
180
+ end
@@ -0,0 +1,43 @@
1
+ module Eiwa
2
+ module Kanjidic
3
+ TAGS = {
4
+ "character" => Tag::Character,
5
+ "misc" => Tag::Bag,
6
+ "reading_meaning" => Tag::ReadingMeaning,
7
+ "rmgroup" => Tag::List
8
+ }
9
+
10
+ class Doc < Nokogiri::XML::SAX::Document
11
+ def initialize(each_entry_block)
12
+ @each_entry_block = each_entry_block
13
+ @current = nil
14
+ end
15
+
16
+ def start_element(name, attrs)
17
+ parent = @current
18
+ @current = (TAGS[name] || Tag::Other).new
19
+ @current.start(name, attrs, parent)
20
+ end
21
+
22
+ def end_element(name)
23
+ raise Eiwa::Error.new("Parsing error. Expected <#{@current.tag_name}> to close before <#{name}>") if @current.tag_name != name
24
+ ending = @current
25
+ ending.end_self
26
+ if ending.is_a?(Tag::Character)
27
+ @each_entry_block&.call(ending)
28
+ end
29
+
30
+ @current = ending.parent
31
+ @current&.end_child(ending)
32
+ end
33
+
34
+ def characters(s)
35
+ @current.add_characters(s)
36
+ end
37
+
38
+ def error(msg)
39
+ raise Eiwa::Error.new("Parsing error: #{msg}")
40
+ end
41
+ end
42
+ end
43
+ end
@@ -0,0 +1,35 @@
1
+ require "nokogiri"
2
+ require_relative "jmdict/doc"
3
+ require_relative "kanjidic/doc"
4
+
5
+ module Eiwa
6
+ class ParsesFile
7
+ def call(filename, type, each_entry_block)
8
+ if each_entry_block.nil?
9
+ entries = []
10
+ each_entry_block ||= ->(e) { entries << e }
11
+ end
12
+
13
+ doc_for(type).new(each_entry_block).tap do |doc|
14
+ Nokogiri::XML::SAX::Parser.new(doc).parse_file(filename) do |ctx|
15
+ ctx.recovery = true
16
+ end
17
+ end
18
+
19
+ entries
20
+ end
21
+
22
+ private
23
+
24
+ def doc_for(type)
25
+ case type
26
+ when :jmdict_e
27
+ Jmdict::Doc
28
+ when :kanjidic2
29
+ Kanjidic::Doc
30
+ else
31
+ raise Eiwa::Error.new("Unknown file type: #{type}")
32
+ end
33
+ end
34
+ end
35
+ end
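`ParsesFile#call` above either streams entries to the caller's block or, when no block is given, collects them into an array. The sketch below isolates that pattern in plain Ruby. `each_record` and `parse` are hypothetical helpers that stand in for the SAX parse; they are not part of the gem.

```ruby
# Hypothetical stand-in for a streaming parser: passes each record to the sink.
def each_record(records, sink)
  records.each { |r| sink.call(r) }
end

def parse(records, &block)
  sink = block
  collected = nil
  if sink.nil?
    # No block given: fall back to a lambda that accumulates everything.
    collected = []
    sink = ->(r) { collected << r }
  end
  each_record(records, sink)
  collected # an Array without a block, nil with one
end

all = parse(%w[a b c])                              # collecting form
seen = []
result = parse(%w[a b c]) { |r| seen << r.upcase }  # streaming form
```

As in the diff, the streaming form returns `nil` rather than an array, because the accumulator is only created when no block is supplied.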
@@ -18,7 +18,7 @@ module Eiwa
18
18
  @text == other.text &&
19
19
  @sense_ordinal == other.sense_ordinal
20
20
  end
21
- alias == eql?
21
+ alias_method :==, :eql?
22
22
 
23
23
  def hash
24
24
  @text.hash + @sense_ordinal.hash
@@ -0,0 +1,21 @@
1
+ module Eiwa
2
+ module Tag
3
+ # For simple elements that contain child element_name, value pairs that could plop into a hash nicely
4
+ class Bag < Any
5
+ attr_reader :values
6
+
7
+ def initialize
8
+ @values = {}
9
+ end
10
+
11
+ def [](key)
12
+ @values[key]
13
+ end
14
+
15
+ def end_child(child)
16
+ # Don't overwrite; the first duplicate tends to be the authoritative one
17
+ @values[child.tag_name] = child.text unless @values.key?(child.tag_name)
18
+ end
19
+ end
20
+ end
21
+ end
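`Tag::Bag#end_child` keeps the first value seen for each child element name and ignores later duplicates. A small sketch of that first-wins rule follows. `Child` is a hypothetical struct standing in for a parsed element, and the sample values are made up.

```ruby
# Hypothetical child records standing in for parsed XML elements.
Child = Struct.new(:tag_name, :text)

children = [
  Child.new("grade", "1"),
  Child.new("stroke_count", "6"),
  Child.new("stroke_count", "7") # a later duplicate, ignored below
]

values = {}
children.each do |child|
  # Only store a value the first time a tag name appears.
  values[child.tag_name] = child.text unless values.key?(child.tag_name)
end
```

This matters for elements such as `stroke_count`, which can appear more than once inside a `misc` block; the sketch assumes, as the comment in the diff suggests, that the first occurrence is the one to trust.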
@@ -0,0 +1,24 @@
1
+ module Eiwa
2
+ module Tag
3
+ class Character < Any
4
+ attr_reader :text,
5
+ :grade, :stroke_count, :freq, :jlpt,
6
+ :onyomi, :kunyomi, :meanings
7
+
8
+ def end_child(child)
9
+ if child.tag_name == "literal"
10
+ @text = child.text
11
+ elsif child.tag_name == "reading_meaning"
12
+ @onyomi = child.rmgroup.items.select { |item| item.name == "reading" && item.attrs["r_type"] == "ja_on" }.map(&:text)
13
+ @kunyomi = child.rmgroup.items.select { |item| item.name == "reading" && item.attrs["r_type"] == "ja_kun" }.map(&:text)
14
+ @meanings = child.rmgroup.items.select { |item| item.name == "meaning" && (item.attrs["m_lang"].nil? || item.attrs["m_lang"] == "en") }.map(&:text)
15
+ elsif child.tag_name == "misc"
16
+ @grade = child["grade"]&.to_i
17
+ @stroke_count = child["stroke_count"]&.to_i
18
+ @freq = child["freq"]&.to_i
19
+ @jlpt = child["jlpt"]&.to_i
20
+ end
21
+ end
22
+ end
23
+ end
24
+ end
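`Tag::Character#end_child` splits the flat item list from `rmgroup` into on'yomi, kun'yomi, and English meanings by checking each item's name and attributes. The sketch below reproduces that filtering on hand-built data. `Item` mirrors the struct defined in `Tag::List`, `attrs` is assumed to behave like a Hash, and the sample items are illustrative rather than copied from KANJIDIC2.

```ruby
Item = Struct.new(:name, :attrs, :text, keyword_init: true)

# Illustrative items, loosely modeled on a KANJIDIC2 <rmgroup>.
items = [
  Item.new(name: "reading", attrs: {"r_type" => "ja_on"}, text: "ニチ"),
  Item.new(name: "reading", attrs: {"r_type" => "ja_kun"}, text: "ひ"),
  Item.new(name: "reading", attrs: {"r_type" => "pinyin"}, text: "ri4"),
  Item.new(name: "meaning", attrs: {}, text: "day"),
  Item.new(name: "meaning", attrs: {"m_lang" => "fr"}, text: "jour")
]

readings = items.select { |item| item.name == "reading" }
onyomi = readings.select { |item| item.attrs["r_type"] == "ja_on" }.map(&:text)
kunyomi = readings.select { |item| item.attrs["r_type"] == "ja_kun" }.map(&:text)

# A meaning with no m_lang attribute is treated as English, as in the diff.
meanings = items
  .select { |item| item.name == "meaning" && [nil, "en"].include?(item.attrs["m_lang"]) }
  .map(&:text)
```

Readings in other systems (pinyin here) and meanings in other languages fall through every filter and are dropped.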
@@ -21,7 +21,7 @@ module Eiwa
21
21
  @reading == other.reading &&
22
22
  @sense_ordinal == other.sense_ordinal
23
23
  end
24
- alias == eql?
24
+ alias_method :==, :eql?
25
25
 
26
26
  def hash
27
27
  @text.hash + @reading.hash + @sense_ordinal.hash
@@ -35,7 +35,7 @@ module Eiwa
35
35
  @gender == other.gender &&
36
36
  @type == other.type
37
37
  end
38
- alias == eql?
38
+ alias_method :==, :eql?
39
39
 
40
40
  def hash
41
41
  @text.hash + @language.hash + @gender.hash + @type.hash
@@ -1,5 +1,3 @@
1
- require_relative "any"
2
-
3
1
  module Eiwa
4
2
  module Tag
5
3
  class Entity < Any
@@ -19,7 +17,7 @@ module Eiwa
19
17
  @code == other.code &&
20
18
  @text == other.text
21
19
  end
22
- alias == eql?
20
+ alias_method :==, :eql?
23
21
 
24
22
  def hash
25
23
  @code.hash + @text.hash
@@ -1,5 +1,3 @@
1
- require_relative "any"
2
-
3
1
  module Eiwa
4
2
  module Tag
5
3
  class Entry < Any
@@ -0,0 +1,18 @@
1
+ module Eiwa
2
+ module Tag
3
+ # For containers of lists or repeated elements
4
+ class List < Any
5
+ Item = Struct.new(:name, :attrs, :text, keyword_init: true)
6
+
7
+ attr_reader :items
8
+
9
+ def initialize
10
+ @items = []
11
+ end
12
+
13
+ def end_child(child)
14
+ @items << Item.new(name: child.tag_name, attrs: child.attrs, text: child.text)
15
+ end
16
+ end
17
+ end
18
+ end
@@ -1,5 +1,3 @@
1
- require_relative "any"
2
-
3
1
  module Eiwa
4
2
  module Tag
5
3
  class Meaning < Any
@@ -1,9 +1,11 @@
1
- require_relative "any"
2
-
3
1
  module Eiwa
4
2
  module Tag
5
3
  class Other < Any
6
- attr_reader :text
4
+ attr_reader :attrs
5
+
6
+ def text
7
+ @characters
8
+ end
7
9
  end
8
10
  end
9
11
  end
@@ -1,5 +1,3 @@
1
- require_relative "any"
2
-
3
1
  module Eiwa
4
2
  module Tag
5
3
  class Reading < Any
@@ -0,0 +1,11 @@
1
+ module Eiwa
2
+ module Tag
3
+ class ReadingMeaning < Any
4
+ attr_reader :rmgroup
5
+
6
+ def end_child(child)
7
+ @rmgroup = child if child.tag_name == "rmgroup"
8
+ end
9
+ end
10
+ end
11
+ end
@@ -23,7 +23,7 @@ module Eiwa
23
23
  @wasei == other.wasei &&
24
24
  @type == other.type
25
25
  end
26
- alias == eql?
26
+ alias_method :==, :eql?
27
27
 
28
28
  def hash
29
29
  @text.hash + @language.hash + @wasei.hash + @type.hash
@@ -1,5 +1,3 @@
1
- require_relative "any"
2
-
3
1
  module Eiwa
4
2
  module Tag
5
3
  class Spelling < Any
@@ -1,3 +1,3 @@
1
1
  module Eiwa
2
- VERSION = "0.0.2"
2
+ VERSION = "0.1.0"
3
3
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: eiwa
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.2
4
+ version: 0.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Justin Searls
8
- autorequire:
8
+ autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2020-03-09 00:00:00.000000000 Z
11
+ date: 2021-01-19 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri
@@ -94,7 +94,7 @@ dependencies:
94
94
  - - ">="
95
95
  - !ruby/object:Gem::Version
96
96
  version: '0'
97
- description:
97
+ description:
98
98
  email:
99
99
  - searls@gmail.com
100
100
  executables: []
@@ -112,18 +112,23 @@ files:
112
112
  - bin/setup
113
113
  - eiwa.gemspec
114
114
  - lib/eiwa.rb
115
- - lib/eiwa/jmdict_doc.rb
116
- - lib/eiwa/jmdict_entities.rb
117
- - lib/eiwa/parses_jmdict_file.rb
115
+ - lib/eiwa/jmdict/doc.rb
116
+ - lib/eiwa/jmdict/entities.rb
117
+ - lib/eiwa/kanjidic/doc.rb
118
+ - lib/eiwa/parses_file.rb
118
119
  - lib/eiwa/tag/antonym.rb
119
120
  - lib/eiwa/tag/any.rb
121
+ - lib/eiwa/tag/bag.rb
122
+ - lib/eiwa/tag/character.rb
120
123
  - lib/eiwa/tag/cross_reference.rb
121
124
  - lib/eiwa/tag/definition.rb
122
125
  - lib/eiwa/tag/entity.rb
123
126
  - lib/eiwa/tag/entry.rb
127
+ - lib/eiwa/tag/list.rb
124
128
  - lib/eiwa/tag/meaning.rb
125
129
  - lib/eiwa/tag/other.rb
126
130
  - lib/eiwa/tag/reading.rb
131
+ - lib/eiwa/tag/reading_meaning.rb
127
132
  - lib/eiwa/tag/source_language.rb
128
133
  - lib/eiwa/tag/spelling.rb
129
134
  - lib/eiwa/version.rb
@@ -133,7 +138,7 @@ homepage: https://github.com/searls/eiwa
133
138
  licenses:
134
139
  - MIT
135
140
  metadata: {}
136
- post_install_message:
141
+ post_install_message:
137
142
  rdoc_options: []
138
143
  require_paths:
139
144
  - lib
@@ -148,8 +153,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
148
153
  - !ruby/object:Gem::Version
149
154
  version: '0'
150
155
  requirements: []
151
- rubygems_version: 3.0.3
152
- signing_key:
156
+ rubygems_version: 3.1.4
157
+ signing_key:
153
158
  specification_version: 4
154
159
  summary: Parses the JMDict Japanese-English dictionary
155
160
  test_files: []
@@ -1,93 +0,0 @@
1
- require_relative "tag/entry"
2
- require_relative "tag/spelling"
3
- require_relative "tag/reading"
4
- require_relative "tag/meaning"
5
- require_relative "tag/entity"
6
- require_relative "tag/cross_reference"
7
- require_relative "tag/antonym"
8
- require_relative "tag/source_language"
9
- require_relative "tag/definition"
10
- require_relative "tag/other"
11
-
12
- require_relative "jmdict_entities"
13
-
14
- module Eiwa
15
- TAGS = {
16
- "entry" => Tag::Entry,
17
- "k_ele" => Tag::Spelling,
18
- "r_ele" => Tag::Reading,
19
- "sense" => Tag::Meaning,
20
- "pos" => Tag::Entity,
21
- "misc" => Tag::Entity,
22
- "dial" => Tag::Entity,
23
- "field" => Tag::Entity,
24
- "ke_inf" => Tag::Entity,
25
- "re_inf" => Tag::Entity,
26
- "xref" => Tag::CrossReference,
27
- "ant" => Tag::Antonym,
28
- "lsource" => Tag::SourceLanguage,
29
- "gloss" => Tag::Definition
30
- }
31
-
32
- class JmdictDoc < Nokogiri::XML::SAX::Document
33
- def initialize(each_entry_block)
34
- @each_entry_block = each_entry_block
35
- end
36
-
37
- def start_document
38
- end
39
-
40
- def end_document
41
- end
42
-
43
- def start_element(name, attrs)
44
- parent = @current
45
- @current = (TAGS[name] || Tag::Other).new
46
- @current.start(name, attrs, parent)
47
- end
48
-
49
- def end_element(name)
50
- raise Eiwa::Error.new("Parsing error. Expected <#{@current.tag_name}> to close before <#{name}>") if @current.tag_name != name
51
- ending = @current
52
- ending.end_self
53
- if ending.is_a?(Tag::Entry)
54
- @each_entry_block&.call(ending)
55
- end
56
-
57
- @current = ending.parent
58
- @current&.end_child(ending)
59
- end
60
-
61
- def characters(s)
62
- @current.add_characters(s)
63
- end
64
-
65
- # def comment string
66
- # puts "comment #{string}"
67
- # end
68
-
69
- # def warning string
70
- # puts "warning #{string}"
71
- # end
72
-
73
- def error(msg)
74
- if (matches = msg.match(/Entity '([\S]+)' not defined/))
75
- # See: http://github.com/sparklemotion/nokogiri/issues/1926
76
- code = matches[1]
77
- @current.set_entity(code, JMDICT_ENTITIES[code])
78
- elsif msg == "Detected an entity reference loop\n"
79
- # Do nothing and hope this does not matter.
80
- else
81
- raise Eiwa::Error.new("Parsing error: #{msg}")
82
- end
83
- end
84
-
85
- # def cdata_block string
86
- # puts "cdata_block #{string}"
87
- # end
88
-
89
- # def processing_instruction name, content
90
- # puts "processing_instruction #{name}, #{content}"
91
- # end
92
- end
93
- end
@@ -1,178 +0,0 @@
1
- module Eiwa
2
- JMDICT_ENTITIES = {
3
- "Buddh" => "Buddhist term",
4
- "MA" => "martial arts term",
5
- "Shinto" => "Shinto term",
6
- "X" => "rude or X-rated term (not displayed in educational software)",
7
- "abbr" => "abbreviation",
8
- "adj-f" => "noun or verb acting prenominally",
9
- "adj-i" => "adjective (keiyoushi)",
10
- "adj-ix" => "adjective (keiyoushi) - yoi/ii class",
11
- "adj-kari" => "`kari' adjective (archaic)",
12
- "adj-ku" => "`ku' adjective (archaic)",
13
- "adj-na" => "adjectival nouns or quasi-adjectives (keiyodoshi)",
14
- "adj-nari" => "archaic/formal form of na-adjective",
15
- "adj-no" => "nouns which may take the genitive case particle `no'",
16
- "adj-pn" => "pre-noun adjectival (rentaishi)",
17
- "adj-shiku" => "`shiku' adjective (archaic)",
18
- "adj-t" => "`taru' adjective",
19
- "adv" => "adverb (fukushi)",
20
- "adv-to" => "adverb taking the `to' particle",
21
- "anat" => "anatomical term",
22
- "arch" => "archaism",
23
- "archit" => "architecture term",
24
- "astron" => "astronomy, etc. term",
25
- "ateji" => "ateji (phonetic) reading",
26
- "aux" => "auxiliary",
27
- "aux-adj" => "auxiliary adjective",
28
- "aux-v" => "auxiliary verb",
29
- "baseb" => "baseball term",
30
- "biol" => "biology term",
31
- "bot" => "botany term",
32
- "bus" => "business term",
33
- "chem" => "chemistry term",
34
- "chn" => "children's language",
35
- "col" => "colloquialism",
36
- "comp" => "computer terminology",
37
- "conj" => "conjunction",
38
- "cop" => "copula",
39
- "cop-da" => "copula",
40
- "ctr" => "counter",
41
- "derog" => "derogatory",
42
- "eK" => "exclusively kanji",
43
- "econ" => "economics term",
44
- "ek" => "exclusively kana",
45
- "engr" => "engineering term",
46
- "exp" => "expressions (phrases, clauses, etc.)",
47
- "fam" => "familiar language",
48
- "fem" => "female term or language",
49
- "finc" => "finance term",
50
- "food" => "food term",
51
- "geol" => "geology, etc. term",
52
- "geom" => "geometry term",
53
- "gikun" => "gikun (meaning as reading) or jukujikun (special kanji reading)",
54
- "hob" => "Hokkaido-ben",
55
- "hon" => "honorific or respectful (sonkeigo) language",
56
- "hum" => "humble (kenjougo) language",
57
- "iK" => "word containing irregular kanji usage",
- "id" => "idiomatic expression",
- "ik" => "word containing irregular kana usage",
- "int" => "interjection (kandoushi)",
- "io" => "irregular okurigana usage",
- "iv" => "irregular verb",
- "joc" => "jocular, humorous term",
- "ksb" => "Kansai-ben",
- "ktb" => "Kantou-ben",
- "kyb" => "Kyoto-ben",
- "kyu" => "Kyuushuu-ben",
- "law" => "law, etc. term",
- "ling" => "linguistics terminology",
- "m-sl" => "manga slang",
- "mahj" => "mahjong term",
- "male" => "male term or language",
- "male-sl" => "male slang",
- "math" => "mathematics",
- "med" => "medicine, etc. term",
- "mil" => "military",
- "music" => "music term",
- "n" => "noun (common) (futsuumeishi)",
- "n-adv" => "adverbial noun (fukushitekimeishi)",
- "n-pr" => "proper noun",
- "n-pref" => "noun, used as a prefix",
- "n-suf" => "noun, used as a suffix",
- "n-t" => "noun (temporal) (jisoumeishi)",
- "nab" => "Nagano-ben",
- "num" => "numeric",
- "oK" => "word containing out-dated kanji",
- "obs" => "obsolete term",
- "obsc" => "obscure term",
- "oik" => "old or irregular kana form",
- "ok" => "out-dated or obsolete kana usage",
- "on-mim" => "onomatopoeic or mimetic word",
- "osb" => "Osaka-ben",
- "physics" => "physics terminology",
- "pn" => "pronoun",
- "poet" => "poetical term",
- "pol" => "polite (teineigo) language",
- "pref" => "prefix",
- "proverb" => "proverb",
- "prt" => "particle",
- "quote" => "quotation",
- "rare" => "rare",
- "rkb" => "Ryuukyuu-ben",
- "sens" => "sensitive",
- "shogi" => "shogi term",
- "sl" => "slang",
- "sports" => "sports term",
- "suf" => "suffix",
- "sumo" => "sumo term",
- "thb" => "Touhoku-ben",
- "tsb" => "Tosa-ben",
- "tsug" => "Tsugaru-ben",
- "uK" => "word usually written using kanji alone",
- "uk" => "word usually written using kana alone",
- "unc" => "unclassified",
- "v-unspec" => "verb unspecified",
- "v1" => "Ichidan verb",
- "v1-s" => "Ichidan verb - kureru special class",
- "v2a-s" => "Nidan verb with 'u' ending (archaic)",
- "v2b-k" => "Nidan verb (upper class) with `bu' ending (archaic)",
- "v2b-s" => "Nidan verb (lower class) with `bu' ending (archaic)",
- "v2d-k" => "Nidan verb (upper class) with `dzu' ending (archaic)",
- "v2d-s" => "Nidan verb (lower class) with `dzu' ending (archaic)",
- "v2g-k" => "Nidan verb (upper class) with `gu' ending (archaic)",
- "v2g-s" => "Nidan verb (lower class) with `gu' ending (archaic)",
- "v2h-k" => "Nidan verb (upper class) with `hu/fu' ending (archaic)",
- "v2h-s" => "Nidan verb (lower class) with `hu/fu' ending (archaic)",
- "v2k-k" => "Nidan verb (upper class) with `ku' ending (archaic)",
- "v2k-s" => "Nidan verb (lower class) with `ku' ending (archaic)",
- "v2m-k" => "Nidan verb (upper class) with `mu' ending (archaic)",
- "v2m-s" => "Nidan verb (lower class) with `mu' ending (archaic)",
- "v2n-s" => "Nidan verb (lower class) with `nu' ending (archaic)",
- "v2r-k" => "Nidan verb (upper class) with `ru' ending (archaic)",
- "v2r-s" => "Nidan verb (lower class) with `ru' ending (archaic)",
- "v2s-s" => "Nidan verb (lower class) with `su' ending (archaic)",
- "v2t-k" => "Nidan verb (upper class) with `tsu' ending (archaic)",
- "v2t-s" => "Nidan verb (lower class) with `tsu' ending (archaic)",
- "v2w-s" => "Nidan verb (lower class) with `u' ending and `we' conjugation (archaic)",
- "v2y-k" => "Nidan verb (upper class) with `yu' ending (archaic)",
- "v2y-s" => "Nidan verb (lower class) with `yu' ending (archaic)",
- "v2z-s" => "Nidan verb (lower class) with `zu' ending (archaic)",
- "v4b" => "Yodan verb with `bu' ending (archaic)",
- "v4g" => "Yodan verb with `gu' ending (archaic)",
- "v4h" => "Yodan verb with `hu/fu' ending (archaic)",
- "v4k" => "Yodan verb with `ku' ending (archaic)",
- "v4m" => "Yodan verb with `mu' ending (archaic)",
- "v4n" => "Yodan verb with `nu' ending (archaic)",
- "v4r" => "Yodan verb with `ru' ending (archaic)",
- "v4s" => "Yodan verb with `su' ending (archaic)",
- "v4t" => "Yodan verb with `tsu' ending (archaic)",
- "v5aru" => "Godan verb - -aru special class",
- "v5b" => "Godan verb with `bu' ending",
- "v5g" => "Godan verb with `gu' ending",
- "v5k" => "Godan verb with `ku' ending",
- "v5k-s" => "Godan verb - Iku/Yuku special class",
- "v5m" => "Godan verb with `mu' ending",
- "v5n" => "Godan verb with `nu' ending",
- "v5r" => "Godan verb with `ru' ending",
- "v5r-i" => "Godan verb with `ru' ending (irregular verb)",
- "v5s" => "Godan verb with `su' ending",
- "v5t" => "Godan verb with `tsu' ending",
- "v5u" => "Godan verb with `u' ending",
- "v5u-s" => "Godan verb with `u' ending (special class)",
- "v5uru" => "Godan verb - Uru old class verb (old form of Eru)",
- "vi" => "intransitive verb",
- "vk" => "Kuru verb - special class",
- "vn" => "irregular nu verb",
- "vr" => "irregular ru verb, plain form ends with -ri",
- "vs" => "noun or participle which takes the aux. verb suru",
- "vs-c" => "su verb - precursor to the modern suru",
- "vs-i" => "suru verb - included",
- "vs-s" => "suru verb - special class",
- "vt" => "transitive verb",
- "vulg" => "vulgar expression or word",
- "vz" => "Ichidan verb - zuru verb (alternative form of -jiru verbs)",
- "yoji" => "yojijukugo",
- "zool" => "zoology term"
- }
- end
@@ -1,21 +0,0 @@
- require "nokogiri"
- require_relative "jmdict_doc"
-
- module Eiwa
- class ParsesJmdictFile
- def call(filename, each_entry_block)
- if each_entry_block.nil?
- entries = []
- each_entry_block ||= ->(e) { entries << e }
- end
-
- JmdictDoc.new(each_entry_block).tap do |doc|
- Nokogiri::XML::SAX::Parser.new(doc).parse_file(filename) do |ctx|
- ctx.recovery = true
- end
- end
-
- entries
- end
- end
- end
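
The removed `ParsesJmdictFile#call` either collects entries into an array (when no callback is given) or streams each entry to the caller's callback and returns `nil`. The following is a minimal sketch of that control flow only, with no Nokogiri involved. `FakeDoc` and `parse_entries` are hypothetical names made up for illustration and are not part of the gem; `FakeDoc` stands in for the SAX-driven `JmdictDoc` by feeding a fixed list to the callback.

```ruby
# Hypothetical stand-in for JmdictDoc: instead of reacting to SAX events,
# it hands each item of a fixed list to the callback it was given.
class FakeDoc
  def initialize(each_entry_block)
    @each_entry_block = each_entry_block
  end

  def parse(items)
    items.each { |item| @each_entry_block.call(item) }
  end
end

# Same shape as the removed method (assumption: only the control flow is
# mirrored here, not the XML parsing).
def parse_entries(items, each_entry_block = nil)
  if each_entry_block.nil?
    entries = []
    each_entry_block = ->(e) { entries << e }
  end

  FakeDoc.new(each_entry_block).parse(items)

  entries # stays nil when the caller supplied a callback
end

# Collecting mode: no callback, so the entries come back as an array.
collected = parse_entries(%w[a b])

# Streaming mode: the callback sees each entry and nothing is accumulated.
streamed = []
result = parse_entries(%w[a b], ->(e) { streamed << e.upcase })
```

In streaming mode the return value is `nil` because `entries` is only assigned inside the `if` branch, which keeps memory use flat for a large dictionary file.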