anystyle-parser 0.0.10 → 0.1.0

data/HISTORY.md CHANGED
@@ -1,3 +1,7 @@
1
+ 0.1.0 / 2012-03-03
2
+ ==================
3
+ * Added redis as data store option
4
+
1
5
  0.0.10 / 2012-03-01
2
6
  ===================
3
7
  * Added new output format: tags (to generate training data)
data/LICENSE CHANGED
@@ -1,4 +1,4 @@
1
- Copyright 2011 Sylvester Keil. All rights reserved.
1
+ Copyright 2011-2012 Sylvester Keil. All rights reserved.
2
2
 
3
3
  Redistribution and use in source and binary forms, with or without
4
4
  modification, are permitted provided that the following conditions are met:
data/README.md CHANGED
@@ -3,12 +3,14 @@ Anystyle-Parser
3
3
 
4
4
  Anystyle-Parser is a very fast and smart parser for academic references. It
5
5
  is inspired by [ParsCit](http://aye.comp.nus.edu.sg/parsCit/) and
6
- [FreeCite](http://freecite.library.brown.edu/); Anystyle-Parser is designed
7
- for raw speed (it uses [wapiti](https://github.com/inukshuk/wapiti-ruby) based
6
+ [FreeCite](http://freecite.library.brown.edu/); Anystyle-Parser uses machine
7
+ learning algorithms and is designed
8
+ for raw speed (it uses [wapiti](https://github.com/inukshuk/wapiti-ruby) based
8
9
  conditional random fields and [Kyoto Cabinet](http://fallabs.com/kyotocabinet/)
9
- as a key-value store), flexibility (it is easy to train the model with
10
- data that is relevant to your parsing needs), and compatibility (Anystyle-Parser
11
- exports to Ruby Hashes, BibTeX, or the CiteProc JSON format).
10
+ or [Redis](http://redis.io) as a key-value store), flexibility (it is easy to
11
+ train the model with data that is relevant to your parsing needs), and
12
+ compatibility (Anystyle-Parser exports to Ruby Hashes, BibTeX, or the
13
+ CSL/CiteProc JSON format).
12
14
 
13
15
  Installation
14
16
  ------------
@@ -31,7 +33,36 @@ note that you will need write permissions in the directory where the file
31
33
  is to be created. You can change the Dictionary's default path in the
32
34
  Dictionary's options:
33
35
 
34
- Anystyle::Parser::Dictionary.instance.options[:path]
36
+ Anystyle::Parser::Dictionary.instance.options[:cabinet]
37
+
38
+ Starting with version 0.1.0, Anystyle-Parser also supports
39
+ [Redis](http://redis.io); to use Redis as the data store you need to install
40
+ the `redis` gem (and, optionally, the `hiredis` gem).
41
+
42
+ $ [sudo] gem install hiredis
43
+ $ [sudo] gem install redis
44
+
45
+ To see which data store modes are available in your current environment,
46
+ check the output of `Dictionary.modes`:
47
+
48
+ > Anystyle::Parser::Dictionary.modes
49
+ => [:kyoto, :redis, :hash]
50
+
51
+ To select one of the available modes, use the dictionary instance options:
52
+
53
+ > Anystyle::Parser::Dictionary.instance.options[:mode]
54
+ => :kyoto
55
+
56
+ To use [Redis](http://redis.io) you also need to set the host or unix socket
57
+ where your redis server is available. For example:
58
+
59
+ Anystyle::Parser::Dictionary.instance.options[:mode] = :redis
60
+ Anystyle::Parser::Dictionary.instance.options[:host] = 'localhost'
61
+
62
+ When the data store is opened using redis-mode and the data store is empty,
63
+ the feature dictionary will be imported automatically. If you want to import
64
+ the data explicitly you can use `Dictionary#create` after setting the
65
+ required options.
35
66
 
36
67
 
37
68
  Usage
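
The README text in this hunk says that `Dictionary.modes` reflects what is available in the current environment. A minimal sketch of how such a list can be built from optional libraries, modelled on the `require` / `rescue LoadError` pattern used in this release. The helper name `detect_modes` and its argument are hypothetical and not part of Anystyle-Parser:

```ruby
# Hypothetical helper (not part of Anystyle-Parser): builds a list of
# usable data store modes by trying to load each optional library.
# The plain Ruby Hash fall-back is always available, so it goes in first
# and ends up last in the result.
def detect_modes(candidates)
  modes = [:hash]

  candidates.each do |mode, library|
    begin
      require library
      # Later candidates are put in front, so they take precedence.
      modes.unshift(mode)
    rescue LoadError
      # Library not installed: skip this mode silently.
    end
  end

  modes
end

# Order matters: the last loadable candidate becomes the first mode.
available = detect_modes([[:redis, 'redis'], [:kyoto, 'kyotocabinet']])
default_mode = available.first
```

Under this scheme the default mode is simply the first entry of the list, which mirrors the `:mode => @modes[0]` default in the changed source.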
@@ -152,7 +183,7 @@ and open a pull request on GitHub.
152
183
  License
153
184
  -------
154
185
 
155
- Copyright 2011 Sylvester Keil. All rights reserved.
186
+ Copyright 2011-2012 Sylvester Keil. All rights reserved.
156
187
 
157
188
  Some of the code in Anystyle-Parser's post processing (normalizing) routines
158
189
  was originally based on the source code of FreeCite and
@@ -15,7 +15,7 @@ Gem::Specification.new do |s|
15
15
  s.description = 'A sophisticated parser for academic references based on conditional random fields.'
16
16
  s.license = 'FreeBSD'
17
17
 
18
- s.add_runtime_dependency('bibtex-ruby', '~>1.3')
18
+ s.add_runtime_dependency('bibtex-ruby', '~>2.0')
19
19
  s.add_runtime_dependency('wapiti', '~>0.0')
20
20
 
21
21
  s.add_development_dependency('rake', ['~>0.9'])
@@ -1,165 +1,229 @@
1
1
  module Anystyle
2
- module Parser
3
-
4
- # Dictionary is a Singleton object that provides a key-value store of
5
- # the Anystyle Parser dictionary required for feature elicitation.
6
- # This dictionary acts essentially like a Ruby Hash object, but because
7
- # of the dictionary's size it is not efficient to keep the entire
8
- # dictionary in memory at all times. For that reason, Dictionary
9
- # creates a persistent data store on disk using Kyoto Cabinet; if
10
- # Kyoto Cabinet is not installed a Ruby Hash is used as a fall-back.
11
- #
12
- # The database will be automatically created from the dictionary file
13
- # using the best available DBM the first time it is accessed. Once
14
- # database file exists, the database will be restored from file.
15
- # Therefore, if you make changes to the dictionary file, you will have
16
- # to delete the old database file for a new one to be created.
17
- #
18
- # Database creation requires write permissions. By default, the database
19
- # will be created in the support directory of the Parser; if you have
20
- # installed the gem version of the Parser, you may not have write
21
- # permissions, but you can change the path in the Dictionary's options.
22
- #
23
- # Dictionary.instance.options[:path] # => the database file
24
- # Dictionary.instance.options[:source] # => the (zipped) dictionary file
25
- #
26
- class Dictionary
2
+ module Parser
3
+
4
+ # Dictionary is a Singleton object that provides a key-value store of
5
+ # the Anystyle Parser dictionary required for feature elicitation.
6
+ # This dictionary acts essentially like a Ruby Hash object, but because
7
+ # of the dictionary's size it is not efficient to keep the entire
8
+ # dictionary in memory at all times. For that reason, Dictionary
9
+ # creates a persistent data store on disk using Kyoto Cabinet; if
10
+ # Kyoto Cabinet is not installed a Ruby Hash is used as a fall-back.
11
+ #
12
+ # Starting with version 0.1.0 Redis support was added. If you would
13
+ # like to use Redis as the dictionary data store you can do so by
14
+ #
15
+ #
16
+ # The database will be automatically created from the dictionary file
17
+ # using the best available DBM the first time it is accessed. Once
18
+ # database file exists, the database will be restored from file.
19
+ # Therefore, if you make changes to the dictionary file, you will have
20
+ # to delete the old database file for a new one to be created.
21
+ #
22
+ # Database creation in Kyoto-Cabinet mode requires write permissions.
23
+ # By default, the database
24
+ # will be created in the support directory of the Parser; if you have
25
+ # installed the gem version of the Parser, you may not have write
26
+ # permissions, but you can change the path in the Dictionary's options.
27
+ #
28
+ # ## Configuration
29
+ #
30
+ # To set the database mode:
31
+ #
32
+ # Dictionary.instance.options[:mode] # => the database mode
33
+ #
34
+ # For a list of database modes available in your environment consult:
35
+ #
36
+ # Dictionary.modes # => [:kyoto, :redis, :hash]
37
+ #
38
+ # Further options include:
39
+ #
40
+ # Dictionary.instance.options[:source] # => the zipped dictionary file
41
+ # Dictionary.instance.options[:cabinet] # => the database file (kyoto)
42
+ # Dictionary.instance.options[:path] # => the database socket (redis)
43
+ # Dictionary.instance.options[:host] # => dictionary host (redis)
44
+ # Dictionary.instance.options[:port] # => dictionary port (redis)
45
+ #
46
+ class Dictionary
27
47
 
28
- include Singleton
29
-
30
- @defaults = {
31
- :source => File.expand_path('../support/dict.txt.gz', __FILE__),
32
- :path => File.expand_path('../support/dict.kch', __FILE__)
33
- }.freeze
34
-
35
- @keys = [:male, :female, :surname, :month, :place, :publisher, :journal].freeze
48
+ include Singleton
49
+
50
+ @keys = [:male, :female, :surname, :month, :place, :publisher, :journal].freeze
36
51
 
37
- @code = Hash[*@keys.zip(0.upto(@keys.length-1).map { |i| 2**i }).flatten]
38
- @code.default = 0
39
- @code.freeze
52
+ @code = Hash[*@keys.zip(0.upto(@keys.length-1).map { |i| 2**i }).flatten]
53
+ @code.default = 0
54
+ @code.freeze
40
55
 
41
- @mode = begin
42
- require 'kyotocabinet'
43
- :kyoto
44
- rescue LoadError
45
- :hash
46
- end
47
-
48
- class << self
49
-
50
- attr_reader :keys, :code, :defaults, :mode
51
-
52
- end
56
+ @modes = [:hash]
53
57
 
54
- attr_reader :options
55
-
56
- def initialize
57
- @options = Dictionary.defaults.dup
58
- end
59
-
60
- def [](key)
61
- db[key.to_s].to_i
62
- end
63
-
64
- def []=(key, value)
65
- db[key.to_s] = value
66
- end
67
-
68
- def create
69
- case Dictionary.mode
70
- when :kyoto
71
- truncate
72
- @db = KyotoCabinet::DB.new
73
- unless @db.open(path, KyotoCabinet::DB::OWRITER | KyotoCabinet::DB::OCREATE)
74
- raise DatabaseError, "failed to create cabinet file #{path}: #{@db.error}"
75
- end
76
- populate
77
- close
78
- else
79
- # nothing
80
- end
81
- end
82
-
83
- def truncate
84
- close
85
- File.unlink(path) if File.exists?(path)
86
- end
87
-
88
- def open
89
- create unless File.exists?(path)
58
+ begin
59
+ require 'redis/connection/hiredis'
60
+ rescue LoadError
61
+ # ignore
62
+ end
90
63
 
91
- case Dictionary.mode
92
- when :kyoto
93
- at_exit { ::Anystyle::Parser::Dictionary.instance.close }
94
-
95
- @db = KyotoCabinet::DB.new
96
- unless @db.open(path, KyotoCabinet::DB::OREADER)
97
- raise DictionaryError, "failed to open cabinet file #{path}: #{@db.error}"
98
- end
99
- else
100
- @db = Hash.new(0)
101
- populate
102
- end
103
-
104
- @db
105
- end
106
-
107
- def open?; !!@db; end
108
-
109
- def close
110
- @db.close if @db.respond_to?(:close)
111
- @db = nil
112
- end
113
-
114
- def path
115
- options[:path]
116
- end
117
-
118
- private
119
-
120
- def db
121
- @db || open
122
- end
123
-
124
- def populate
125
- require 'zlib'
64
+ begin
65
+ require 'redis'
66
+ @modes.unshift :redis
67
+ rescue LoadError
68
+ # info 'no redis support detected'
69
+ end
70
+
71
+ begin
72
+ require 'kyotocabinet'
73
+ @modes.unshift :kyoto
74
+ rescue LoadError
75
+ # info 'no kyoto-cabinet support detected'
76
+ end
77
+
78
+ @defaults = {
79
+ :mode => @modes[0],
80
+ :source => File.expand_path('../support/dict.txt.gz', __FILE__),
81
+ :cabinet => File.expand_path('../support/dict.kch', __FILE__),
82
+ :port => 6379
83
+ }.freeze
84
+
85
+
86
+ class << self
87
+
88
+ attr_reader :keys, :code, :defaults, :modes
89
+
90
+ end
126
91
 
127
- File.open(options[:source], 'r:UTF-8') do |f|
128
- mode = 0
92
+ attr_reader :options
93
+
94
+ def initialize
95
+ @options = Dictionary.defaults.dup
96
+ end
97
+
98
+ def [](key)
99
+ db[key.to_s].to_i
100
+ end
101
+
102
+ def []=(key, value)
103
+ db[key.to_s] = value
104
+ end
105
+
106
+ def create
107
+ case options[:mode]
108
+ when :kyoto
109
+ truncate
110
+ @db = KyotoCabinet::DB.new
111
+ unless @db.open(path, KyotoCabinet::DB::OWRITER | KyotoCabinet::DB::OCREATE)
112
+ raise DatabaseError, "failed to create cabinet file #{path}: #{@db.error}"
113
+ end
114
+ populate
115
+ close
116
+
117
+ when :redis
118
+ @db ||= Redis.new(options)
119
+ populate
120
+ close
121
+
122
+ else
123
+ # nothing
124
+ end
125
+ end
126
+
127
+ def truncate
128
+ close
129
+ File.unlink(path) if File.exists?(path)
130
+ end
131
+
132
+ def open
133
+ case options[:mode]
134
+ when :kyoto
135
+ at_exit { ::Anystyle::Parser::Dictionary.instance.close }
129
136
 
130
- Zlib::GzipReader.new(f).each do |line|
131
- line.strip!
132
-
133
- if line.start_with?('#')
134
- case line
135
- when /^## male/i
136
- mode = Dictionary.code[:male]
137
- when /^## female/i
138
- mode = Dictionary.code[:female]
139
- when /^## (?:surname|last|chinese)/i
140
- mode = Dictionary.code[:surname]
141
- when /^## months/i
142
- mode = Dictionary.code[:month]
143
- when /^## place/i
144
- mode = Dictionary.code[:place]
145
- when /^## publisher/i
146
- mode = Dictionary.code[:publisher]
147
- when /^## journal/i
148
- mode = Dictionary.code[:journal]
149
- else
150
- # skip comments
151
- end
152
- else
153
- key, probability = line.split(/\s+(\d+\.\d+)\s*$/)
154
- value = self[key]
155
- self[key] = value + mode if value < mode
156
- end
157
- end
158
- end
137
+ create unless File.exists?(path)
138
+
139
+ @db = KyotoCabinet::DB.new
140
+ unless @db.open(path, KyotoCabinet::DB::OREADER)
141
+ raise DictionaryError, "failed to open cabinet file #{path}: #{@db.error}"
142
+ end
143
+
144
+ when :redis
145
+ at_exit { ::Anystyle::Parser::Dictionary.instance.close }
146
+ @db = Redis.new(options)
147
+
148
+ populate if @db.dbsize.zero?
149
+
150
+ else
151
+ @db = Hash.new(0)
152
+ populate
153
+ end
154
+
155
+ @db
156
+ end
157
+
158
+ def open?; !!@db; end
159
+
160
+ def close
161
+ case
162
+ when @db.respond_to?(:close)
163
+ @db.close
164
+ when @db.respond_to?(:quit)
165
+ @db.quit
166
+ end
167
+
168
+ @db = nil
169
+ end
170
+
171
+ def path
172
+ case options[:mode]
173
+ when :kyoto
174
+ options[:cabinet] || options[:path]
175
+ when :redis
176
+ options[:path] || options.values_at(:host, :port).join(':')
177
+ else
178
+ 'hash'
179
+ end
180
+ end
181
+
182
+ private
183
+
184
+ def db
185
+ @db || open
186
+ end
187
+
188
+ def populate
189
+ require 'zlib'
159
190
 
160
- end
161
-
162
- end
163
-
164
- end
191
+ File.open(options[:source], 'r:UTF-8') do |f|
192
+ mode = 0
193
+
194
+ Zlib::GzipReader.new(f).each do |line|
195
+ line.strip!
196
+
197
+ if line.start_with?('#')
198
+ case line
199
+ when /^## male/i
200
+ mode = Dictionary.code[:male]
201
+ when /^## female/i
202
+ mode = Dictionary.code[:female]
203
+ when /^## (?:surname|last|chinese)/i
204
+ mode = Dictionary.code[:surname]
205
+ when /^## months/i
206
+ mode = Dictionary.code[:month]
207
+ when /^## place/i
208
+ mode = Dictionary.code[:place]
209
+ when /^## publisher/i
210
+ mode = Dictionary.code[:publisher]
211
+ when /^## journal/i
212
+ mode = Dictionary.code[:journal]
213
+ else
214
+ # skip comments
215
+ end
216
+ else
217
+ key, probability = line.split(/\s+(\d+\.\d+)\s*$/)
218
+ value = self[key]
219
+ self[key] = value + mode if value < mode
220
+ end
221
+ end
222
+ end
223
+
224
+ end
225
+
226
+ end
227
+
228
+ end
165
229
  end
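
The dictionary stores one integer per key, with each category assigned a power of two so that several categories can share a single value. A minimal standalone sketch of that encoding, using the same key list and the same `value < mode` guard as `populate`. The sample entries and the `categories` helper are made up for illustration and are not taken from the dictionary file:

```ruby
# Same category list as in the Dictionary class.
KEYS = [:male, :female, :surname, :month, :place, :publisher, :journal].freeze

# Each category gets its own bit: male => 1, female => 2, surname => 4, ...
CODE = Hash[KEYS.each_with_index.map { |key, i| [key, 2**i] }]

# Hypothetical helper: decode a stored integer back into category names.
def categories(value)
  KEYS.select { |key| (value & CODE[key]) != 0 }
end

# A plain Hash stands in for the key-value store (the :hash mode).
store = Hash.new(0)

# Illustrative entries only; listed in ascending category order,
# which the `value < mode` guard below relies on.
[['paris', :female], ['paris', :place], ['paris', :place]].each do |key, category|
  mode  = CODE[category]
  value = store[key]
  store[key] = value + mode if value < mode
end

decoded = categories(store['paris'])
```

Note that the guard only adds a category whose code is larger than the value already stored, so this sketch assumes sections are read in ascending code order; a bitwise OR would be order-independent, but that is a different technique from the one in the source.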
@@ -1,5 +1,5 @@
1
1
  module Anystyle
2
2
  module Parser
3
- VERSION = '0.0.10'.freeze
3
+ VERSION = '0.1.0'.freeze
4
4
  end
5
5
  end
@@ -10,6 +10,15 @@ module Anystyle
10
10
  it { Dictionary.should_not respond_to(:new) }
11
11
  it { dict.should_not be nil }
12
12
 
13
+ describe '.modes' do
14
+ it 'returns an array' do
15
+ Dictionary.modes.should be_a(Array)
16
+ end
17
+
18
+ it 'contains at least :hash' do
19
+ Dictionary.modes.should include(:hash)
20
+ end
21
+ end
13
22
 
14
23
  describe "the dictionary" do
15
24
 
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: anystyle-parser
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.10
4
+ version: 0.1.0
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,22 +9,22 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-03-01 00:00:00.000000000 Z
12
+ date: 2012-03-03 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: bibtex-ruby
16
- requirement: &70338916773120 !ruby/object:Gem::Requirement
16
+ requirement: &70166358058660 !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ~>
20
20
  - !ruby/object:Gem::Version
21
- version: '1.3'
21
+ version: '2.0'
22
22
  type: :runtime
23
23
  prerelease: false
24
- version_requirements: *70338916773120
24
+ version_requirements: *70166358058660
25
25
  - !ruby/object:Gem::Dependency
26
26
  name: wapiti
27
- requirement: &70338916772120 !ruby/object:Gem::Requirement
27
+ requirement: &70166358047080 !ruby/object:Gem::Requirement
28
28
  none: false
29
29
  requirements:
30
30
  - - ~>
@@ -32,10 +32,10 @@ dependencies:
32
32
  version: '0.0'
33
33
  type: :runtime
34
34
  prerelease: false
35
- version_requirements: *70338916772120
35
+ version_requirements: *70166358047080
36
36
  - !ruby/object:Gem::Dependency
37
37
  name: rake
38
- requirement: &70338916770100 !ruby/object:Gem::Requirement
38
+ requirement: &70166358046040 !ruby/object:Gem::Requirement
39
39
  none: false
40
40
  requirements:
41
41
  - - ~>
@@ -43,10 +43,10 @@ dependencies:
43
43
  version: '0.9'
44
44
  type: :development
45
45
  prerelease: false
46
- version_requirements: *70338916770100
46
+ version_requirements: *70166358046040
47
47
  - !ruby/object:Gem::Dependency
48
48
  name: racc
49
- requirement: &70338916769520 !ruby/object:Gem::Requirement
49
+ requirement: &70166358045460 !ruby/object:Gem::Requirement
50
50
  none: false
51
51
  requirements:
52
52
  - - ~>
@@ -54,10 +54,10 @@ dependencies:
54
54
  version: '1.4'
55
55
  type: :development
56
56
  prerelease: false
57
- version_requirements: *70338916769520
57
+ version_requirements: *70166358045460
58
58
  - !ruby/object:Gem::Dependency
59
59
  name: cucumber
60
- requirement: &70338916768880 !ruby/object:Gem::Requirement
60
+ requirement: &70166358044240 !ruby/object:Gem::Requirement
61
61
  none: false
62
62
  requirements:
63
63
  - - ~>
@@ -65,10 +65,10 @@ dependencies:
65
65
  version: '1.0'
66
66
  type: :development
67
67
  prerelease: false
68
- version_requirements: *70338916768880
68
+ version_requirements: *70166358044240
69
69
  - !ruby/object:Gem::Dependency
70
70
  name: rspec
71
- requirement: &70338916768220 !ruby/object:Gem::Requirement
71
+ requirement: &70166358043560 !ruby/object:Gem::Requirement
72
72
  none: false
73
73
  requirements:
74
74
  - - ~>
@@ -76,10 +76,10 @@ dependencies:
76
76
  version: '2.6'
77
77
  type: :development
78
78
  prerelease: false
79
- version_requirements: *70338916768220
79
+ version_requirements: *70166358043560
80
80
  - !ruby/object:Gem::Dependency
81
81
  name: ZenTest
82
- requirement: &70338916767680 !ruby/object:Gem::Requirement
82
+ requirement: &70166358042940 !ruby/object:Gem::Requirement
83
83
  none: false
84
84
  requirements:
85
85
  - - ~>
@@ -87,7 +87,7 @@ dependencies:
87
87
  version: '4.6'
88
88
  type: :development
89
89
  prerelease: false
90
- version_requirements: *70338916767680
90
+ version_requirements: *70166358042940
91
91
  description: A sophisticated parser for academic references based on conditional random
92
92
  fields.
93
93
  email:
@@ -148,7 +148,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
148
148
  version: '0'
149
149
  segments:
150
150
  - 0
151
- hash: -640366899922045737
151
+ hash: 1324194107450037111
152
152
  required_rubygems_version: !ruby/object:Gem::Requirement
153
153
  none: false
154
154
  requirements:
@@ -157,7 +157,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
157
157
  version: '0'
158
158
  segments:
159
159
  - 0
160
- hash: -640366899922045737
160
+ hash: 1324194107450037111
161
161
  requirements: []
162
162
  rubyforge_project:
163
163
  rubygems_version: 1.8.10