markov_words 1.0.1 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: ae8fa0d54564e0c30c1407409374d086c3d935a0
-   data.tar.gz: 9662260cf2cdb1ee137c3839900da60aec770977
+   metadata.gz: 2605351d6864f6d2cb9ffc5f498b647b306dc600
+   data.tar.gz: 85d646b15bb737aca9394f69cde8ea03922bcc10
  SHA512:
-   metadata.gz: 6b81af4c6a78491a42f009d7975a044ab3e8012e772cd78dae0f78efbd13a89274d2085e11396921502e8aba342d3b2a16507669af79dc7e77cf227b0a5af717
-   data.tar.gz: 510837581b8b014155a4fc1084329c18539bb33e0013733090fd9f8bec22c32dc6bedba647b81e78f93f4d33465de788bc4876b8617d82a7a2c1b1a91a791f5e
+   metadata.gz: 1222be879d0fec47f344f71893b43cfba653bb05e5fdca56ba4409dad723efd4158458b7f4065d80fb4a31bcaa09878568264564ba36b5c3620fe7afe2092802
+   data.tar.gz: 13bd043e2168d8f2dbe1cf3f33c38268b0bed4a6f0fcfde0f5ae360a770c0b376913ef4973b350a3ad143864a5766ddfddd22de23ddfedb6b328d1c1e8864a08
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
    remote: .
    specs:
-     markov_words (1.0.1)
+     markov_words (2.0.0)
        sqlite3-ruby (~> 1.3)

  GEM
data/README.md CHANGED
@@ -72,28 +72,72 @@ You can also clear out the contents of the data file (because `MarkovWords` will
  generator = MarkovWords::Generator.new(data_file: '/tmp/markov.data', flush_data: true)
  ```

+ ### Custom Metadata

- ### Caching
+ A `Generator` object gives you access to its `.data_store`, which is an instance of a `FileStore` object. This gives you the ability to store custom metadata into the same database that holds the n-gram information.

- Because calculation can get slow, especially at high n-gram sizes, `MarkovWords` will cache 100 words by default . If you want to control caching, you can adjust caching parameters eg:
+ One example of how you might use this would be to cache words for later use (since initial word generation can be slow, even after the database has been generated the first time):

  ```ruby
- # For no caching whatsoever
- generator = MarkovWords::Generator.new(perform_caching: false)
+ generator = MarkovWords::Generator.new
+ my_cache = 100.times.map { generator.word }
+ generator.data_store.store_data :cache, my_cache

- # To change the number of pre-computed/stored words to 1000:
- generator = MarkovWords::Generator.new(cache_size: 1000)
+ # then later, perhaps on another page load in a web server...
+ my_cache = generator.data_store.retrieve_data :cache
  ```

- You can "top off" the cache to make sure it's full with:
+ ### Benchmarking

- ```ruby
- generator = MarkovWords::Generator.new
- generator.refresh_cache
+ We've included a `bin/benchmark` script, which will measure initial load times, and then the time it takes to generate 100 words at various dictionary n-gram sizes.
+
+ Here is an example run:
+ ```
+ bin/benchmark 1 6 '/usr/share/dict/words'
+ Minimum n-gram size set to 1
+ Maximum n-gram size set to 6
+ Corpus file set to /usr/share/dict/words
+
+ Test initial database creation time versus gram size? (y/n) y
+ ------------------------------------------------------------
+              user     system      total        real
+ size: 1  4.080000   0.010000   4.090000 (  4.108898)
+ size: 2  8.320000   0.090000   8.410000 (  8.554122)
+ size: 3 12.710000   0.080000  12.790000 ( 12.869257)
+ size: 4 18.750000   0.160000  18.910000 ( 19.102232)
+ size: 5 25.440000   0.250000  25.690000 ( 25.953532)
+ size: 6 31.060000   0.340000  31.400000 ( 31.680680)
+ ------------------------------------------------------------
+
+ Test existing database on disk, initial memory load? (y/n) y
+ ------------------------------------------------------------
+              user     system      total        real
+ size: 1  0.000000   0.000000   0.000000 (  0.000587)
+ size: 2  0.000000   0.000000   0.000000 (  0.005109)
+ size: 3  0.080000   0.010000   0.090000 (  0.077303)
+ size: 4  0.330000   0.070000   0.400000 (  0.395079)
+ size: 5  1.030000   0.130000   1.160000 (  1.157014)
+ size: 6  2.920000   0.120000   3.040000 (  3.045219)
+ ------------------------------------------------------------
+
+ Test word generation averages for 100 words per gram size? (y/n) y
+ ------------------------------------------------------------
+              user     system      total        real
+ size: 1  0.010000   0.000000   0.010000 (  0.003971)
+ size: 2  0.010000   0.000000   0.010000 (  0.009460)
+ size: 3  0.120000   0.000000   0.120000 (  0.127297)
+ size: 4  0.350000   0.010000   0.360000 (  0.354564)
+ size: 5  2.250000   0.020000   2.270000 (  2.302405)
+ size: 6  4.000000   0.120000   4.120000 (  4.186757)
+ ------------------------------------------------------------
  ```

  ## Change Log

+ - `2.0.0`
+   - Breaking changes:
+     - Removed all caching functions from `Generator`. They were cluttering up the code, without being a necessary function of a `Generator`.
+     - Added an `attr_reader` for `Generator.data_store`, so that users can implement custom metadata for `Generator` objects, and store it in the same `FileStore` object that holds the database.
  - `1.0.0` introduced a couple of breaking changes:
    - `Words` class renamed to `Generator`.
    - `Generator`:
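For callers that relied on the cache removed in `2.0.0`, a similar word pool can be kept on the caller's side through `data_store`. The following is a minimal sketch rather than anything shipped with the gem: `WordPool` and its `size:` parameter are hypothetical names, and the sketch assumes only the `word`, `data_store.store_data`, and `data_store.retrieve_data` calls shown in the README changes above.

```ruby
# Hypothetical helper (not part of markov_words): keeps a pool of
# pre-generated words in the generator's data store, similar in spirit to
# the cache that 1.x maintained internally.
class WordPool
  # generator: any object that responds to #word and #data_store, where the
  # data store responds to #store_data(key, value) and #retrieve_data(key).
  def initialize(generator, size: 100)
    @generator = generator
    @size = size
  end

  # Return one word from the stored pool, refilling the pool first when it
  # is missing or empty.
  def word
    pool = @generator.data_store.retrieve_data(:cache)
    pool = Array.new(@size) { @generator.word } if pool.nil? || pool.empty?
    word = pool.pop
    @generator.data_store.store_data(:cache, pool)
    word
  end
end

# Usage, assuming the gem is installed:
#   require 'markov_words'
#   pool = WordPool.new(MarkovWords::Generator.new, size: 100)
#   puts pool.word
```

Keeping the pool outside `Generator` matches the stated reason for the removal: the generator only generates words, and the caller decides whether and how much to pre-compute.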
data/bin/benchmark ADDED
@@ -0,0 +1,84 @@
+ #!/usr/bin/env ruby
+ # frozen-string-literal: true
+
+ require 'benchmark'
+ require 'bundler/setup'
+ require 'markov_words'
+
+ # Utility class to generate benchmarks for MarkovWords
+ class GeneratorBenchmark
+   LABEL_WIDTH = 7
+   def run
+     test_if_desired 'initial database creation time versus gram size' do
+       Benchmark.bm(LABEL_WIDTH) do |x|
+         @min_gram_size.upto(@max_gram_size) do |size|
+           generator =
+             MarkovWords::Generator.new(flush_data: true,
+                                        gram_size: size,
+                                        corpus_file: @corpus_file)
+           x.report("size: #{size}") { generator.word }
+         end
+       end
+     end
+
+     test_if_desired 'existing database on disk, initial memory load' do
+       Benchmark.bm(LABEL_WIDTH) do |x|
+         @min_gram_size.upto(@max_gram_size) do |size|
+           generator =
+             MarkovWords::Generator.new(flush_data: true,
+                                        gram_size: size,
+                                        corpus_file: @corpus_file)
+           _word = generator.word # this will run initial setup
+           generator_load_data_from_file =
+             MarkovWords::Generator.new(gram_size: size,
+                                        corpus_file: @corpus_file)
+           x.report("size: #{size}") { generator_load_data_from_file.word }
+         end
+       end
+     end
+
+     test_if_desired 'word generation averages for 100 words per gram size' do
+       Benchmark.bm(LABEL_WIDTH) do |x|
+         @min_gram_size.upto(@max_gram_size) do |size|
+           generator =
+             MarkovWords::Generator.new(flush_data: true,
+                                        gram_size: size,
+                                        perform_caching: false,
+                                        corpus_file: @corpus_file)
+           _word = generator.word # this will run initial setup
+           x.report("size: #{size}") { 1.upto(100) { generator.word } }
+         end
+       end
+     end
+   end
+
+   def initialize(opts)
+     @min_gram_size = opts.fetch :min_gram_size, 1
+     @max_gram_size = opts.fetch :max_gram_size, 6
+     @corpus_file = opts.fetch :corpus_file, '/usr/share/dict/words'
+     puts "Minimum n-gram size set to #{@min_gram_size}"
+     puts "Maximum n-gram size set to #{@max_gram_size}"
+     puts "Corpus file set to #{@corpus_file}"
+   end
+
+   def print_separator
+     printf "%s\n", Array.new(60).map { '-' }.join
+   end
+
+   def test_if_desired(description, &block)
+     printf "\n%s", "Test #{description}? (y/n) "
+     if /y/.match?($stdin.readline)
+       print_separator
+       yield(block)
+       print_separator
+     end
+   end
+ end
+
+ if ARGV.empty?
+   puts "USAGE: bin/benchmark min_gram_size max_gram_size corpus_file\n"
+ end
+ bm = GeneratorBenchmark.new(min_gram_size: ARGV[0].to_i,
+                             max_gram_size: ARGV[1].to_i,
+                             corpus_file: ARGV[2])
+ bm.run
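One detail of the script above is worth noting: `ARGV[0].to_i` evaluates to `0` when the argument is missing (since `nil.to_i` is `0`), and because all three keys are always passed, the `opts.fetch` defaults in `initialize` never apply. One possible way to let those defaults take effect is to pass only the arguments that were actually supplied. This is a sketch under that assumption; `benchmark_options` is a hypothetical helper and is not part of the gem.

```ruby
# Hypothetical helper (not in the gem): build the options hash from the
# command-line arguments, leaving out any that were not supplied so that
# the Hash#fetch defaults in GeneratorBenchmark#initialize can take effect.
# Integer() raises ArgumentError on non-numeric input instead of silently
# returning 0 the way String#to_i does.
def benchmark_options(argv)
  opts = {}
  opts[:min_gram_size] = Integer(argv[0]) if argv[0]
  opts[:max_gram_size] = Integer(argv[1]) if argv[1]
  opts[:corpus_file] = argv[2] if argv[2]
  opts
end

# With no arguments the hash is empty, so fetch falls back to its default.
no_args = benchmark_options([])
puts no_args.fetch(:min_gram_size, 1)

# With arguments, the parsed values are used instead of the defaults.
with_args = benchmark_options(%w[2 4 words.txt])
puts with_args.fetch(:max_gram_size, 6)
```

With a helper like this, the script could call `GeneratorBenchmark.new(benchmark_options(ARGV))` and still print the usage line when `ARGV` is empty.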
@@ -1,26 +1,27 @@
  # frozen-string-literal: true

  module MarkovWords
-   # This class takes care of word generation, caching, and data storage.
+   # This class takes care of word generation, and will store the database into
+   # a `FileStore` object.
    class Generator
-     # The current list of cached words.
-     # @return [Array<String>] All words in the cache.
-     def cache
-       @data_store.retrieve_data(:cache)
-     end
+     # It's useful to be able to access the data store object directly, for
+     # example if you were to want to implement storage of related metadata
+     # into the same storage system that holds the database.
+     attr_reader :data_store

      # The current database of n-gram mappings
      # @return [Hash] n-gram database
      def grams
-       @grams = @grams ||
-                @data_store.retrieve_data(:grams) ||
-                markov_corpus(@corpus_file, @gram_size)
+       if @grams.nil?
+         @grams = @data_store.retrieve_data(:grams) ||
+                  markov_corpus(@corpus_file, @gram_size)
+       else
+         @grams
+       end
      end

      # Create a new "Generator" object
      # @param opts [Hash]
-     # @option opts [Integer] :cache_size How many words to pre-calculate +
-     # store in the cache for quick retrieval
      # @option opts [String] :corpus_file ('/usr/share/dict/words') Your
      # dictionary of words.
      # @option opts [String] :data_file Location where calculations are
@@ -34,7 +35,6 @@ module MarkovWords
      # NOTE: If your corpus size is very small (<1000 words or so), it's hard
      # to guarantee a min_length because so many n-grams will have no
      # association, which terminates word generation.
-     # @option opts [Boolean] :perform_caching (true) Perform caching?
      # @return [Generator] A `MarkovWords::Generator` object.
      def initialize(opts = {})
        @grams = nil
@@ -42,33 +42,13 @@ module MarkovWords
        @max_length = opts.fetch :max_length, 16
        @min_length = opts.fetch :min_length, 3

-       initialize_cache(opts)
        initialize_data(opts)
      end

-     # "Top off" the cache of stored words, and ensure that it's at
-     # `@cache_size`. If `perform_caching` is set to `false`, returns an empty
-     # array.
-     # @return [Array<String>] All words in the cache.
-     def refresh_cache
-       if @perform_caching
-         words_array = @data_store.retrieve_data(:cache) || []
-         words_array << generate_word while words_array.length < @cache_size
-         @data_store.store_data(:cache, words_array)
-         words_array
-       else
-         []
-       end
-     end
-
-     # Generate a new word, or return one from the cache if available.
+     # Generate a new word
      # @return [String] The word.
      def word
-       if @perform_caching
-         load_word_from_cache
-       else
-         generate_word
-       end
+       generate_word
      end

      private
@@ -81,11 +61,6 @@ module MarkovWords
        end
      end

-     def initialize_cache(opts)
-       @cache_size = opts.fetch :cache_size, 100
-       @perform_caching = opts.fetch :perform_caching, true
-     end
-
      def initialize_data(opts)
        @corpus_file = opts.fetch :corpus_file, '/usr/share/dict/words'
        @data_file = opts.fetch :data_file, 'tmp/markov_words.data'
@@ -138,18 +113,6 @@ module MarkovWords
        /[\r\n]/.match? word
      end

-     def load_word_from_cache
-       words_array = @data_store.retrieve_data(:cache)
-       if words_array.nil? || words_array.empty?
-         words_array = Array.new(@cache_size) { generate_word }
-       end
-
-       word = words_array.pop
-       @data_store.store_data(:cache, words_array)
-
-       word
-     end
-
      # Generate a MarkovWords corpus from a datafile, with a given size of
      # n-gram. Returns a hash of "grams", which are a map of a letter to the
      # frequency of the letters that follow it, eg: {"c" => {"a" => 1, "b" =>
@@ -165,6 +128,7 @@ module MarkovWords
          end
        end

+       @data_store.store_data(:grams, grams)
        grams
      end

@@ -2,5 +2,5 @@

  module MarkovWords
    # Current version
-   VERSION = '1.0.1'
+   VERSION = '2.0.0'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: markov_words
  version: !ruby/object:Gem::Version
-   version: 1.0.1
+   version: 2.0.0
  platform: ruby
  authors:
  - Donald Merand
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2017-12-09 00:00:00.000000000 Z
+ date: 2017-12-10 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler
@@ -114,6 +114,7 @@ files:
  - LICENSE.txt
  - README.md
  - Rakefile
+ - bin/benchmark
  - bin/console
  - bin/setup
  - lib/markov_words.rb