markovian 0.3.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 290b5c05432cd805aa1aafdae2d93b68cf1e9a8a
- data.tar.gz: c51deea8332351976638c6767603ad137c85fb4b
+ metadata.gz: a3434070498b33afcd46afc82fd46aa827a46abf
+ data.tar.gz: 0b416ef501bfbaec9bdaf7d2e072b2ef347891e3
  SHA512:
- metadata.gz: eca6c116a0e9686b90ebd3e9335cd55f3a48261a3824dd5d2d71c58e6ba97b8749c738b042da0f2b72c02df58936a924ae9953a1970b64d02d70b58f3f953ae9
- data.tar.gz: e2279a199969da3cf587952a57a6eb1fb3d7f22e967cba3f4dc700ba5022e5562e4e70eb5a34f3f199a65ee458ed232952a9f7e17b254054e0b3bd7327d89839
+ metadata.gz: 06f77a167d5e5f8e9699385dc160e239bd92f5f7ef484bfb9e96d1bdba8ba0fd96f36f9d10786a496f57500e79754c192d6cdac24a3afa2b2e1653b15245fa59
+ data.tar.gz: 3773b056fb2356813780e191cd27536a861a3b007a0f4fafe94e6c55c9122512fd5cd5ae11e5448f9b5033d2944cebd31b5d348c49182d797b4fdc6a8f1320c2
data/README.md CHANGED
@@ -15,27 +15,31 @@ Clone from Github, and then execute:
  Fuller documentation will come shortly. For now, let's see how we can use Markovian to build some tweets from a Twitter archive we've downloaded:

  ```ruby
- > path = #{path_to_twitter_archive}
- => path_to_twitter_archive
- > importer = Markovian::Importers::Twitter::CsvImporter.new(path)
- => #<Markovian::Importers::Twitter::CsvImporter:0x007fd0ca3282a8 @path=path_to_twitter_archive>
- # now assemble the chain based on the tweets -- this may take a few seconds to compile
- > chain = importer.chain
- => #<Markovian::Corpus:0x007fd0ca03df70 ...>
+ > chain = Markovian::Chain.new
+ > chain.lengthen("there", next_word: "friend")
+ > chain.lengthen("there", next_word: "are")
+ > chain.lengthen("are", next_word: "four", previous_word: "four")
+ > chain.lengthen("four", next_word: "lights", previous_word: "four")
+ > chain.lengthen("are", next_word: "we")
+ > chain.lengthen("friend", next_word: "cat")
+ > chain.lengthen("cat", next_word: "rocks", previous_word: "friend")

  # Now, we can build some text!
- > Markovian::TextBuilder.new(chain).construct("markov")
- => "markov chains a lot better than a month, i've been here half an hour of night when you can get behind belgium for the offline train journey"
+ > Markovian::TextBuilder.new(chain).construct("there")
+ => "there friend cat rocks"
  ```

  Exactly!

+ Markovian is most easily used with the [markovian-tools
+ gem](https://github.com/arsduo/markovian-tools), which provides utilities for importing
+ Twitter and Facebook archives and for posting tweets, among other things.
+
  ## Features

  So far, Markovian gives you the ability to, given a set of inputs, generate random text. In
  addition, your money gets you:

- * A built-in importer to turn Twitter csv archives into Markov chain-derived text
- * A built-in filter to remove final words that statistically (in the corpus) rarely end sentences.
+ * A built-in filter to remove final words that statistically (in the corpus) rarely end sentences.
  Avoid unsightly sentences ending in "and so of" and so on!

  ## Development
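The README example above can be mirrored with a tiny self-contained sketch. The `MiniChain` class below is hypothetical (it tracks only single-word transitions, while `Markovian::Chain` also keeps a two-word dictionary and word metadata), but it shows the same lengthen-then-walk flow:

```ruby
# MiniChain is a stand-in for Markovian::Chain: record transitions with
# lengthen, then walk them word by word. Unlike the gem, it ignores
# previous_word context entirely.
class MiniChain
  def initialize
    @transitions = Hash.new { |hash, key| hash[key] = [] }
  end

  def lengthen(word, next_word:)
    @transitions[word] << next_word
    word
  end

  def next_word(word)
    @transitions[word].sample
  end
end

chain = MiniChain.new
chain.lengthen("there", next_word: "friend")
chain.lengthen("friend", next_word: "cat")
chain.lengthen("cat", next_word: "rocks")

# Walk the chain from a seed word until no continuation exists.
words = ["there"]
while (continuation = chain.next_word(words.last))
  words << continuation
end
words.join(" ") # => "there friend cat rocks"
```

With a single candidate per word this walk is deterministic; with a real corpus, `sample` is where the randomness of the Markov walk comes in.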
data/changelog.md CHANGED
@@ -1,5 +1,14 @@
  # CHANGELOG

+ ## 0.4.0
+
+ * Extract SentenceBuilder from TextBuilder for future use
+ * Chain#lengthen can now take strings as well as Tokeneyes::Words
+ * Fix bug preventing reuse of TextBuilder objects
+ * Update EndOfSentenceFilter (works when no words match, has no limit, uses proper probabilities)
+ * Bumped up the significant occurrence threshold for filtering to 500 occurrences
+ * Handle edge cases of words that always end sentences
+
  ## 0.3.0

  * TextBuilder now filters out final words that statistically rarely end sentences (first filter!)
data/lib/markovian.rb CHANGED
@@ -1,8 +1,6 @@
  require 'markovian/text_builder'
  require 'markovian/chain'
  require 'markovian/chain/compiler'
- # importers
- require 'markovian/importers/twitter/csv_importer'

  # The base module.
  module Markovian
@@ -17,10 +17,7 @@ module Markovian
  end

  def lengthen(word, next_word:, previous_word: nil)
- # When we encounter a word, we track its metadata and and what words surround it
- write_to_dictionary(@one_key_dictionary, word, word, next_word)
- write_to_dictionary(@two_key_dictionary, two_word_key(previous_word, word), word, next_word)
- word
+ push(tokeneyes(word), tokeneyes(next_word), tokeneyes(previous_word))
  end

  def next_word(word, previous_word: nil)
@@ -40,6 +37,12 @@ module Markovian

  protected

+ def push(word, next_word, previous_word)
+ write_to_dictionary(@one_key_dictionary, word, word, next_word)
+ write_to_dictionary(@two_key_dictionary, two_word_key(previous_word, word), word, next_word)
+ word
+ end
+
  # for equality checking
  attr_reader :one_key_dictionary, :two_key_dictionary

@@ -77,5 +80,17 @@ module Markovian
  dictionary[key].record_observance(word_instance)
  dictionary[key].push(next_word)
  end
+
+ # Allow strings to be passed in natively. There won't be metadata, but for small things this
+ # makes the gem much easier to use.
+ def tokeneyes(word)
+ return nil unless word
+
+ if word.is_a?(Tokeneyes::Word)
+ word
+ else
+ Tokeneyes::Word.new(word)
+ end
+ end
  end
  end
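The `tokeneyes` helper in the hunk above is what lets `Chain#lengthen` accept plain strings. A minimal sketch of the same coercion, with a made-up `Token` struct standing in for `Tokeneyes::Word`:

```ruby
# Token is a hypothetical stand-in for Tokeneyes::Word, which carries
# word metadata in the real gem.
Token = Struct.new(:text) do
  def to_s
    text
  end
end

# Mirrors the diff's tokeneyes: preserve nil, pass tokens through unchanged,
# and wrap bare strings so callers can use either form.
def tokeneyes(word)
  return nil unless word

  word.is_a?(Token) ? word : Token.new(word)
end
```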
@@ -2,7 +2,9 @@ module Markovian
  class Chain
  class DictionaryEntry
  # Below this, we don't have enough occurrences to draw conclusions about how a word is used.
- SIGNIFICANT_OCCURRENCE_THRESHOLD = 50
+ # Longer-term, this could possibly be calculated in a more dynamic and effective way by
+ # analyzing the corpus itself.
+ SIGNIFICANT_OCCURRENCE_THRESHOLD = 500

  attr_reader :word, :counts
  def initialize(word)
@@ -38,7 +40,8 @@ module Markovian
  end

  def ==(other)
- self.word == other.word &&
+ other &&
+ self.word == other.word &&
  self.next_words == other.next_words &&
  self.previous_words == other.previous_words
  end
@@ -1,24 +1,19 @@
  require 'markovian/utils/text_splitter'
+ require 'markovian/text_builder/sentence_builder'
  require 'markovian/text_builder/end_of_sentence_filter'

  # This class, given a Markov chain, will attempt to construct a new text based on a given seed using
  # the Markov associations.
  module Markovian
  class TextBuilder
- attr_reader :seed_text, :chain
+ attr_reader :chain
  def initialize(chain)
  @chain = chain
  end

  def construct(seed_text, length: 140, exclude_seed_text: false)
- # TODO: if we don't hit a result for the first pair, move backward through the original text
- # until we get something
- seed_components = split_seed_text(seed_text)
- output = result_with_next_word(
- previous_pair: identify_starter_text(seed_components),
- result: exclude_seed_text ? [] : seed_components,
- length: length
- )
+ sentence_builder = SentenceBuilder.new(chain: chain, max_length: length, seed_text: seed_text)
+ output = sentence_builder.construct_sentence(exclude_seed_text)
  format_output(apply_filters(output))
  end

@@ -28,47 +23,17 @@ module Markovian
  EndOfSentenceFilter.new.filtered_sentence(sentence_with_word_data(output))
  end

- def identify_starter_text(seed_components)
- if seed_components.length >= 2
- seed_components[-2..-1]
- else
- # if we only have a one-word seed text, the previous word is nil
- [nil, seed_components.first]
- end
- end
-
- def result_with_next_word(previous_pair:, result:, length:)
- previous_word, current_word = previous_pair
- if next_word = chain.next_word(current_word, previous_word: previous_word)
- # we use join rather than + to avoid leading spaces, and strip to ignore leading nils or
- # empty strings
- interim_result = result + [next_word]
- if format_output(interim_result).length > length
- result
- else
- result_with_next_word(
- previous_pair: [current_word, next_word],
- result: interim_result,
- length: length
- )
- end
- else
- result
- end
- end
-
  # Turn an array of Word objects into an ongoing string
  def format_output(array_of_words)
  array_of_words.compact.map(&:to_s).map(&:strip).join(" ")
  end

  def sentence_with_word_data(sentence)
- @sentence_with_word_data ||= sentence.map {|word| chain.word_entry(word)}
+ sentence.map {|word| chain.word_entry(word)}
  end

- def split_seed_text(seed_text)
- # We get back Tokeneyes::Word objects, but for now only care about the strings within
- Utils::TextSplitter.new(seed_text).components
+ def sentence_builder
+ @sentence_builder ||= SentenceBuilder.new(chain)
  end
  end
  end
@@ -4,8 +4,6 @@ module Markovian
  # to a certain number of words if those words have a low likelihood of ending the sentence.
  # Future changes will increase the qualities filtered for.
  class EndOfSentenceFilter
- MAX_WORDS_FILTERED = 3
-
  def filtered_sentence(sentence)
  filter_unlikely_ending_words(sentence)
  end
@@ -13,11 +11,12 @@ module Markovian
  protected

  def filter_unlikely_ending_words(current_sentence, words_filtered = 0)
- return current_sentence if words_filtered >= MAX_WORDS_FILTERED
-
  last_word = current_sentence.last
- likelihood = last_word.likelihood_to_end_sentence
- if likelihood && rand < likelihood
+ if !last_word
+ # None of the words merit ending the sentence! The caller will deal with how to handle
+ # this situation.
+ []
+ elsif should_filter_out?(last_word)
  # if we pop a word, consider removing the next one
  filter_unlikely_ending_words(current_sentence[0..-2], words_filtered + 1)
  else
@@ -25,6 +24,19 @@ module Markovian
  current_sentence
  end
  end
+
+ def should_filter_out?(word)
+ likelihood = word.likelihood_to_end_sentence
+ # We filter words out that
+ # a) have enough data to say whether they end sentences
+ # b) do not always end the sentence AND
+ # c1) either literally never end a sentence OR
+ # c2) randomly fail a check based on how frequently they end stuff
+ likelihood &&
+ likelihood != 1 &&
+ (likelihood == 0 || rand > word.likelihood_to_end_sentence)
+
+ end
  end
  end
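The `should_filter_out?` predicate above combines three conditions. This isolated sketch takes the likelihood as a number and injects the random roll so the rule is testable; the `roll:` keyword is our addition, not part of the gem's API:

```ruby
# A sentence-ending word is filtered out when we have likelihood data for it,
# it does not always end a sentence (likelihood != 1), and it either never
# ends one (likelihood == 0) or fails a probability check against how
# frequently it ends sentences in the corpus.
def should_filter_out?(likelihood, roll: rand)
  !likelihood.nil? &&
    likelihood != 1 &&
    (likelihood == 0 || roll > likelihood)
end
```

Words with likelihood 1 are never dropped, which is the "words that always end sentences" edge case called out in the 0.4.0 changelog.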
@@ -0,0 +1,63 @@
+ module Markovian
+ class TextBuilder
+ class SentenceBuilder
+ attr_reader :seed_text, :chain, :max_length
+ def initialize(chain:, seed_text:, max_length:)
+ @chain = chain
+ @seed_text = seed_text
+ @max_length = max_length
+ end
+
+ def construct_sentence(exclude_seed_text = false)
+ seed_components = split_seed_text(seed_text)
+ result = result_with_next_word(
+ previous_pair: identify_starter_text(seed_components),
+ result: exclude_seed_text ? [] : seed_components
+ )
+ # Return a set of strings, not Tokeneyes::Word objects
+ result.map(&:to_s)
+ end
+
+ protected
+
+ def identify_starter_text(seed_components)
+ if seed_components.length >= 2
+ seed_components[-2..-1]
+ else
+ # if we only have a one-word seed text, the previous word is nil
+ [nil, seed_components.first]
+ end
+ end
+
+ def result_with_next_word(previous_pair:, result:)
+ previous_word, current_word = previous_pair
+ if next_word = chain.next_word(current_word, previous_word: previous_word)
+ # we use join rather than + to avoid leading spaces, and strip to ignore leading nils or
+ # empty strings
+ interim_result = result + [next_word]
+ if format_output(interim_result).length > max_length
+ result
+ else
+ result_with_next_word(
+ previous_pair: [current_word, next_word],
+ result: interim_result
+ )
+ end
+ else
+ result
+ end
+ end
+
+ def split_seed_text(seed_text)
+ # We get back Tokeneyes::Word objects, but for now only care about the strings within
+ Utils::TextSplitter.new(seed_text).components
+ end
+
+ # Turn an array of Word objects into an ongoing string
+ def format_output(array_of_words)
+ array_of_words.compact.map(&:to_s).map(&:strip).join(" ")
+ end
+ end
+ end
+ end
+
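The seed handling in the new `SentenceBuilder` reduces any seed text to a (previous word, current word) pair before walking the chain. That logic in isolation, operating on a plain array of words:

```ruby
# Take the last two words of a multi-word seed as the starting pair;
# for a one-word (or empty) seed, the previous word is nil.
def identify_starter_text(seed_components)
  if seed_components.length >= 2
    seed_components[-2..-1]
  else
    [nil, seed_components.first]
  end
end
```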
@@ -1,3 +1,3 @@
  module Markovian
- VERSION = "0.3.0"
+ VERSION = "0.4.0"
  end
metadata CHANGED
@@ -1,52 +1,52 @@
  --- !ruby/object:Gem::Specification
  name: markovian
  version: !ruby/object:Gem::Version
- version: 0.3.0
+ version: 0.4.0
  platform: ruby
  authors:
  - Alex Koppel
- autorequire:
+ autorequire:
  bindir: exe
  cert_chain: []
- date: 2015-10-09 00:00:00.000000000 Z
+ date: 2015-10-25 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
- name: tokeneyes
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
  version: 0.1.0
- type: :runtime
+ name: tokeneyes
  prerelease: false
+ type: :runtime
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
  version: 0.1.0
  - !ruby/object:Gem::Dependency
- name: bundler
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '1.7'
- type: :development
+ name: bundler
  prerelease: false
+ type: :development
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '1.7'
  - !ruby/object:Gem::Dependency
- name: rake
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '10.0'
- type: :development
+ name: rake
  prerelease: false
+ type: :development
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
@@ -79,10 +79,9 @@ files:
  - lib/markovian/chain/compiler.rb
  - lib/markovian/chain/dictionary.rb
  - lib/markovian/chain/dictionary_entry.rb
- - lib/markovian/importers/twitter/csv_importer.rb
- - lib/markovian/importers/twitter/tweet.rb
  - lib/markovian/text_builder.rb
  - lib/markovian/text_builder/end_of_sentence_filter.rb
+ - lib/markovian/text_builder/sentence_builder.rb
  - lib/markovian/utils/text_splitter.rb
  - lib/markovian/version.rb
  - markovian.gemspec
@@ -90,7 +89,7 @@ homepage: https://github.com/arsduo/markov-ahkoppel
  licenses:
  - MIT
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -105,9 +104,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubyforge_project:
- rubygems_version: 2.4.5.1
- signing_key:
+ rubyforge_project:
+ rubygems_version: 2.4.8
+ signing_key:
  specification_version: 4
  summary: A simple, hopefully easy-to-use Markov chain generator.
  test_files: []
@@ -1,47 +0,0 @@
- require 'csv'
- require 'markovian/importers/twitter/tweet'
-
- # This class will import a Twitter archive CSV, returning a set of tweets suitable for importation
- # into a Markovian chain.
- module Markovian
- module Importers
- module Twitter
- class CsvImporter
- attr_reader :path
- def initialize(path)
- @path = path
- end
-
- def texts_for_markov_analysis
- # reject any blank tweets -- in our case, those with only a stripped-out URL
- tweet_enumerator.reject {|t| t.empty?}
- end
-
- def chain
- Chain::Compiler.new.build_chain(texts_for_markov_analysis)
- end
-
- protected
-
- def csv_enumerator
- # returns an iterator object that we can roll through
- # this does not actually start reading the file
- @csv_enumerator ||= CSV.open(path, headers: true).each
- end
-
- # an iterator over personal tweets (e.g. not RTs)
- # the lazy iterator allows us to add the condition without having to parse the entire file at
- # once (which could easily encounter tens of thousands of rows).
- def personal_tweet_enumerator
- csv_enumerator.select {|row| row["retweeted_status_id"].empty? }
- end
-
- def tweet_enumerator
- personal_tweet_enumerator.map do |row|
- Tweet.new(row["text"]).interesting_text
- end
- end
- end
- end
- end
- end
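The removed `CsvImporter` leaned on `CSV.open(...).each` returning an enumerator, so rows are parsed as they are consumed. A self-contained sketch of that pattern against a throwaway file (the column names mirror the Twitter archive format the importer expected; the nil-safe `.to_s` guard is our addition):

```ruby
require "csv"
require "tempfile"

# Build a tiny stand-in for a Twitter archive CSV.
archive = Tempfile.new(["tweets", ".csv"])
archive.write("text,retweeted_status_id\nhello world,\nRT something,123\n")
archive.rewind

# CSV.open without a block returns a CSV object; calling #each without a
# block on it returns an enumerator over the parsed rows.
rows = CSV.open(archive.path, headers: true).each

# Keep only "personal" tweets, i.e. rows without a retweet id.
personal_texts = rows
  .select { |row| row["retweeted_status_id"].to_s.empty? }
  .map { |row| row["text"] }

archive.close
archive.unlink
```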
@@ -1,37 +0,0 @@
- module Markovian
- module Importers
- module Twitter
- # Represents an individual tweet
- class Tweet
- attr_reader :text
- def initialize(text)
- @text = text
- end
-
- # Not currently used, but we might want to weight mentions later.
- def mentions
- text.scan(/(\@[a-z0-9_]+)/).flatten
- end
-
- def interesting_text
- without_urls(without_leading_dot(text))
- end
-
- protected
-
- # We don't want URLs to be considered inside our Markov machine.
- # URL matching is nearly impossible, but this regexp should be good enough: http://stackoverflow.com/questions/17733236/optimize-gruber-url-regex-for-javascript
- # Nowadays Twitter replaces URLS with their own link shortener, but historically that wasn't
- # always true.
- def without_urls(string)
- string.gsub(/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)\S+(?:[^\s`!\[\]{};:'".,?«»“”‘’]))/i, "")
- end
-
- # Avoid dots used to trigger mentions
- def without_leading_dot(string)
- string.gsub(/^\.\@/, "@")
- end
- end
- end
- end
- end