markovian 0.3.0 → 0.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 290b5c05432cd805aa1aafdae2d93b68cf1e9a8a
-  data.tar.gz: c51deea8332351976638c6767603ad137c85fb4b
+  metadata.gz: a3434070498b33afcd46afc82fd46aa827a46abf
+  data.tar.gz: 0b416ef501bfbaec9bdaf7d2e072b2ef347891e3
 SHA512:
-  metadata.gz: eca6c116a0e9686b90ebd3e9335cd55f3a48261a3824dd5d2d71c58e6ba97b8749c738b042da0f2b72c02df58936a924ae9953a1970b64d02d70b58f3f953ae9
-  data.tar.gz: e2279a199969da3cf587952a57a6eb1fb3d7f22e967cba3f4dc700ba5022e5562e4e70eb5a34f3f199a65ee458ed232952a9f7e17b254054e0b3bd7327d89839
+  metadata.gz: 06f77a167d5e5f8e9699385dc160e239bd92f5f7ef484bfb9e96d1bdba8ba0fd96f36f9d10786a496f57500e79754c192d6cdac24a3afa2b2e1653b15245fa59
+  data.tar.gz: 3773b056fb2356813780e191cd27536a861a3b007a0f4fafe94e6c55c9122512fd5cd5ae11e5448f9b5033d2944cebd31b5d348c49182d797b4fdc6a8f1320c2
data/README.md CHANGED
@@ -15,27 +15,31 @@ Clone from Github, and then execute:
 Fuller documentation will come shortly. For now, let's see how we can use Markovian to build some tweets from a Twitter archive we've downloaded:
 
 ```ruby
-> path = #{path_to_twitter_archive}
-=> path_to_twitter_archive
-> importer = Markovian::Importers::Twitter::CsvImporter.new(path)
-=> #<Markovian::Importers::Twitter::CsvImporter:0x007fd0ca3282a8 @path=path_to_twitter_archive>
-# now assemble the chain based on the tweets -- this may take a few seconds to compile
-> chain = importer.chain
-=> #<Markovian::Corpus:0x007fd0ca03df70 ...>
+> chain = Markovian::Chain.new
+> chain.lengthen("there", next_word: "friend")
+> chain.lengthen("there", next_word: "are")
+> chain.lengthen("are", next_word: "four", previous_word: "there")
+> chain.lengthen("four", next_word: "lights", previous_word: "are")
+> chain.lengthen("are", next_word: "we")
+> chain.lengthen("friend", next_word: "cat")
+> chain.lengthen("cat", next_word: "rocks", previous_word: "friend")
 
 # Now, we can build some text!
-> Markovian::TextBuilder.new(chain).construct("markov")
-=> "markov chains a lot better than a month, i've been here half an hour of night when you can get behind belgium for the offline train journey"
+> Markovian::TextBuilder.new(chain).construct("there")
+=> "there friend cat rocks"
 ```
 
 Exactly!
 
+Markovian is most easily used with the [markovian-tools
+gem](https://github.com/arsduo/markovian-tools), which provides utilities for importing
+Twitter and Facebook archives and for posting tweets, among other things.
+
 ## Features
 
 So far, Markovian gives you the ability to, given a set of inputs, generate random text. In
 addition, your money gets you:
 
-* A built-in importer to turn Twitter csv archives into Markov chain-derived text
-* A built-in filter to remove final words that statistically (in the corpus) rarely end sentences.
+* A built-in filter to remove final words that statistically (in the corpus) rarely end sentences.
   Avoid unsightly sentences ending in "and so of" and so on!
 
 ## Development
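The `construct` call shown in the README example also takes keyword options, visible in the `TextBuilder` diff further down. A minimal sketch reusing the `chain` built above (output varies, since next words are chosen at random):

```ruby
builder = Markovian::TextBuilder.new(chain)

# length: caps the generated text at a number of characters (default 140)
builder.construct("there", length: 60)

# exclude_seed_text: true omits the seed words from the returned string
builder.construct("there", exclude_seed_text: true)
```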
data/changelog.md CHANGED
@@ -1,5 +1,14 @@
 # CHANGELOG
 
+## 0.4.0
+
+* Extract SentenceBuilder from TextBuilder for future use
+* Chain#lengthen can now take strings as well as Tokeneyes::Words
+* Fix bug preventing reuse of TextBuilder objects
+* Update EndOfSentenceFilter (works when no words match, has no limit, uses proper probabilities)
+* Bumped up the significant occurrence threshold for filtering to 500 occurrences
+* Handle edge cases of words that always end sentences
+
 ## 0.3.0
 
 * TextBuilder now filters out final words that statistically rarely end sentences (first filter!)
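The two user-facing 0.4.0 entries can be illustrated directly; a short sketch using only methods that appear in the diffs below (string arguments to `Chain#lengthen`, and a `TextBuilder` reused across calls):

```ruby
require 'markovian'

chain = Markovian::Chain.new
# 0.4.0: plain strings are accepted; they're wrapped in Tokeneyes::Word internally.
chain.lengthen("hello", next_word: "world")
chain.lengthen("world", next_word: "again", previous_word: "hello")

# 0.4.0: the same TextBuilder can now be reused -- sentence_with_word_data is
# no longer memoized, so each construct call computes fresh word data.
builder = Markovian::TextBuilder.new(chain)
builder.construct("hello")
builder.construct("world")
```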
data/lib/markovian.rb CHANGED
@@ -1,8 +1,6 @@
 require 'markovian/text_builder'
 require 'markovian/chain'
 require 'markovian/chain/compiler'
-# importers
-require 'markovian/importers/twitter/csv_importer'
 
 # The base module.
 module Markovian
data/lib/markovian/chain.rb CHANGED
@@ -17,10 +17,7 @@ module Markovian
     end
 
     def lengthen(word, next_word:, previous_word: nil)
-      # When we encounter a word, we track its metadata and and what words surround it
-      write_to_dictionary(@one_key_dictionary, word, word, next_word)
-      write_to_dictionary(@two_key_dictionary, two_word_key(previous_word, word), word, next_word)
-      word
+      push(tokeneyes(word), tokeneyes(next_word), tokeneyes(previous_word))
     end
 
     def next_word(word, previous_word: nil)
@@ -40,6 +37,12 @@ module Markovian
 
     protected
 
+    def push(word, next_word, previous_word)
+      write_to_dictionary(@one_key_dictionary, word, word, next_word)
+      write_to_dictionary(@two_key_dictionary, two_word_key(previous_word, word), word, next_word)
+      word
+    end
+
     # for equality checking
     attr_reader :one_key_dictionary, :two_key_dictionary
 
@@ -77,5 +80,17 @@ module Markovian
       dictionary[key].record_observance(word_instance)
       dictionary[key].push(next_word)
     end
+
+    # Allow strings to be passed in natively. There won't be metadata, but for small things this
+    # makes the gem much easier to use.
+    def tokeneyes(word)
+      return nil unless word
+
+      if word.is_a?(Tokeneyes::Word)
+        word
+      else
+        Tokeneyes::Word.new(word)
+      end
+    end
   end
 end
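With the `tokeneyes` helper above, the two calls below are equivalent; a small sketch (the `Tokeneyes::Word.new(string)` constructor is the one used in the helper itself):

```ruby
chain = Markovian::Chain.new

# Plain strings are coerced via the tokeneyes helper...
chain.lengthen("four", next_word: "lights", previous_word: "are")

# ...so this is the same operation, minus any metadata a real tokenizer would attach.
chain.lengthen(Tokeneyes::Word.new("four"),
               next_word: Tokeneyes::Word.new("lights"),
               previous_word: Tokeneyes::Word.new("are"))
```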
data/lib/markovian/chain/dictionary_entry.rb CHANGED
@@ -2,7 +2,9 @@ module Markovian
   class Chain
     class DictionaryEntry
       # Below this, we don't have enough occurrences to draw conclusions about how a word is used.
-      SIGNIFICANT_OCCURRENCE_THRESHOLD = 50
+      # Longer-term, this could possibly be calculated in a more dynamic and effective way by
+      # analyzing the corpus itself.
+      SIGNIFICANT_OCCURRENCE_THRESHOLD = 500
 
       attr_reader :word, :counts
       def initialize(word)
@@ -38,7 +40,8 @@ module Markovian
       end
 
       def ==(other)
-        self.word == other.word &&
+        other &&
+          self.word == other.word &&
           self.next_words == other.next_words &&
           self.previous_words == other.previous_words
       end
data/lib/markovian/text_builder.rb CHANGED
@@ -1,24 +1,19 @@
 require 'markovian/utils/text_splitter'
+require 'markovian/text_builder/sentence_builder'
 require 'markovian/text_builder/end_of_sentence_filter'
 
 # This class, given a Markov chain, will attempt to construct a new text based on a given seed using
 # the Markov associations.
 module Markovian
   class TextBuilder
-    attr_reader :seed_text, :chain
+    attr_reader :chain
     def initialize(chain)
       @chain = chain
     end
 
     def construct(seed_text, length: 140, exclude_seed_text: false)
-      # TODO: if we don't hit a result for the first pair, move backward through the original text
-      # until we get something
-      seed_components = split_seed_text(seed_text)
-      output = result_with_next_word(
-        previous_pair: identify_starter_text(seed_components),
-        result: exclude_seed_text ? [] : seed_components,
-        length: length
-      )
+      sentence_builder = SentenceBuilder.new(chain: chain, max_length: length, seed_text: seed_text)
+      output = sentence_builder.construct_sentence(exclude_seed_text)
       format_output(apply_filters(output))
     end
 
@@ -28,47 +23,17 @@ module Markovian
       EndOfSentenceFilter.new.filtered_sentence(sentence_with_word_data(output))
     end
 
-    def identify_starter_text(seed_components)
-      if seed_components.length >= 2
-        seed_components[-2..-1]
-      else
-        # if we only have a one-word seed text, the previous word is nil
-        [nil, seed_components.first]
-      end
-    end
-
-    def result_with_next_word(previous_pair:, result:, length:)
-      previous_word, current_word = previous_pair
-      if next_word = chain.next_word(current_word, previous_word: previous_word)
-        # we use join rather than + to avoid leading spaces, and strip to ignore leading nils or
-        # empty strings
-        interim_result = result + [next_word]
-        if format_output(interim_result).length > length
-          result
-        else
-          result_with_next_word(
-            previous_pair: [current_word, next_word],
-            result: interim_result,
-            length: length
-          )
-        end
-      else
-        result
-      end
-    end
-
     # Turn an array of Word objects into an ongoing string
     def format_output(array_of_words)
      array_of_words.compact.map(&:to_s).map(&:strip).join(" ")
     end
 
     def sentence_with_word_data(sentence)
-      @sentence_with_word_data ||= sentence.map {|word| chain.word_entry(word)}
+      sentence.map {|word| chain.word_entry(word)}
     end
 
-    def split_seed_text(seed_text)
-      # We get back Tokeneyes::Word objects, but for now only care about the strings within
-      Utils::TextSplitter.new(seed_text).components
+    def sentence_builder
+      @sentence_builder ||= SentenceBuilder.new(chain)
     end
   end
 end
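The `sentence_with_word_data` change above is the reuse fix from the changelog: in 0.3.0 the first call's word data was memoized in `@sentence_with_word_data` and silently reused by every later `construct`. A sketch of the difference (outputs hypothetical):

```ruby
builder = Markovian::TextBuilder.new(chain)
builder.construct("there")  # e.g. "there friend cat rocks"

# 0.3.0 would filter this second sentence against the *first* call's cached
# word data; 0.4.0 recomputes the word entries on every call.
builder.construct("are")    # e.g. "are four lights"
```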
data/lib/markovian/text_builder/end_of_sentence_filter.rb CHANGED
@@ -4,8 +4,6 @@ module Markovian
     # to a certain number of words if those words have a low likelihood of ending the sentence.
     # Future changes will increase the qualities filtered for.
     class EndOfSentenceFilter
-      MAX_WORDS_FILTERED = 3
-
       def filtered_sentence(sentence)
         filter_unlikely_ending_words(sentence)
       end
@@ -13,11 +11,12 @@ module Markovian
       protected
 
       def filter_unlikely_ending_words(current_sentence, words_filtered = 0)
-        return current_sentence if words_filtered >= MAX_WORDS_FILTERED
-
         last_word = current_sentence.last
-        likelihood = last_word.likelihood_to_end_sentence
-        if likelihood && rand < likelihood
+        if !last_word
+          # None of the words merit ending the sentence! The caller will deal with how to handle
+          # this situation.
+          []
+        elsif should_filter_out?(last_word)
           # if we pop a word, consider removing the next one
           filter_unlikely_ending_words(current_sentence[0..-2], words_filtered + 1)
         else
@@ -25,6 +24,19 @@ module Markovian
           current_sentence
         end
       end
+
+      def should_filter_out?(word)
+        likelihood = word.likelihood_to_end_sentence
+        # We filter words out that
+        # a) have enough data to say whether they end sentences
+        # b) do not always end the sentence AND
+        # c1) either literally never end a sentence OR
+        # c2) randomly fail a check based on how frequently they end stuff
+        likelihood &&
+          likelihood != 1 &&
+          (likelihood == 0 || rand > word.likelihood_to_end_sentence)
+      end
     end
   end
 end
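A self-contained sketch of the new filter behavior, using a hypothetical stand-in for the chain's word-data objects (the filter only needs `#likelihood_to_end_sentence`); the empty-array return is the new no-words-match case from the changelog:

```ruby
require 'markovian'

# Hypothetical stub: real callers pass DictionaryEntry-backed word data.
StubWord = Struct.new(:text, :likelihood_to_end_sentence) do
  def to_s; text; end
end

filter = Markovian::TextBuilder::EndOfSentenceFilter.new
sentence = [
  StubWord.new("there", 0.1), # rarely ends a sentence: popped ~90% of the time
  StubWord.new("and",   0.0), # never ends a sentence: always popped
  StubWord.new("so",    0.0)  # likewise
]
filter.filtered_sentence(sentence)
# => [the "there" stub] roughly 10% of the time; otherwise [] -- with no
#    MAX_WORDS_FILTERED cap, the filter can now empty the whole sentence.
```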
data/lib/markovian/text_builder/sentence_builder.rb ADDED
@@ -0,0 +1,63 @@
+module Markovian
+  class TextBuilder
+    class SentenceBuilder
+      attr_reader :seed_text, :chain, :max_length
+      def initialize(chain:, seed_text:, max_length:)
+        @chain = chain
+        @seed_text = seed_text
+        @max_length = max_length
+      end
+
+      def construct_sentence(exclude_seed_text = false)
+        seed_components = split_seed_text(seed_text)
+        result = result_with_next_word(
+          previous_pair: identify_starter_text(seed_components),
+          result: exclude_seed_text ? [] : seed_components
+        )
+        # Return a set of strings, not Tokeneyes::Word objects
+        result.map(&:to_s)
+      end
+
+      protected
+
+      def identify_starter_text(seed_components)
+        if seed_components.length >= 2
+          seed_components[-2..-1]
+        else
+          # if we only have a one-word seed text, the previous word is nil
+          [nil, seed_components.first]
+        end
+      end
+
+      def result_with_next_word(previous_pair:, result:)
+        previous_word, current_word = previous_pair
+        if next_word = chain.next_word(current_word, previous_word: previous_word)
+          # we use join rather than + to avoid leading spaces, and strip to ignore leading nils or
+          # empty strings
+          interim_result = result + [next_word]
+          if format_output(interim_result).length > max_length
+            result
+          else
+            result_with_next_word(
+              previous_pair: [current_word, next_word],
+              result: interim_result
+            )
+          end
+        else
+          result
+        end
+      end
+
+      def split_seed_text(seed_text)
+        # We get back Tokeneyes::Word objects, but for now only care about the strings within
+        Utils::TextSplitter.new(seed_text).components
+      end
+
+      # Turn an array of Word objects into an ongoing string
+      def format_output(array_of_words)
+        array_of_words.compact.map(&:to_s).map(&:strip).join(" ")
+      end
+    end
+  end
+end
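Since the changelog says `SentenceBuilder` was extracted "for future use", it can also be driven on its own; a sketch based on the constructor and return value above (a populated `chain` is assumed):

```ruby
builder = Markovian::TextBuilder::SentenceBuilder.new(
  chain: chain,
  seed_text: "there",
  max_length: 140
)

builder.construct_sentence        # => array of strings, seed words included
builder.construct_sentence(true)  # => array of strings, seed words excluded
```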
data/lib/markovian/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Markovian
-  VERSION = "0.3.0"
+  VERSION = "0.4.0"
 end
metadata CHANGED
@@ -1,52 +1,52 @@
 --- !ruby/object:Gem::Specification
 name: markovian
 version: !ruby/object:Gem::Version
-  version: 0.3.0
+  version: 0.4.0
 platform: ruby
 authors:
 - Alex Koppel
-autorequire:
+autorequire:
 bindir: exe
 cert_chain: []
-date: 2015-10-09 00:00:00.000000000 Z
+date: 2015-10-25 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
-  name: tokeneyes
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
         version: 0.1.0
-  type: :runtime
+  name: tokeneyes
   prerelease: false
+  type: :runtime
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
        version: 0.1.0
 - !ruby/object:Gem::Dependency
-  name: bundler
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '1.7'
-  type: :development
+  name: bundler
   prerelease: false
+  type: :development
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
        version: '1.7'
 - !ruby/object:Gem::Dependency
-  name: rake
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '10.0'
-  type: :development
+  name: rake
   prerelease: false
+  type: :development
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
@@ -79,10 +79,9 @@ files:
 - lib/markovian/chain/compiler.rb
 - lib/markovian/chain/dictionary.rb
 - lib/markovian/chain/dictionary_entry.rb
-- lib/markovian/importers/twitter/csv_importer.rb
-- lib/markovian/importers/twitter/tweet.rb
 - lib/markovian/text_builder.rb
 - lib/markovian/text_builder/end_of_sentence_filter.rb
+- lib/markovian/text_builder/sentence_builder.rb
 - lib/markovian/utils/text_splitter.rb
 - lib/markovian/version.rb
 - markovian.gemspec
@@ -90,7 +89,7 @@ homepage: https://github.com/arsduo/markov-ahkoppel
 licenses:
 - MIT
 metadata: {}
-post_install_message:
+post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -105,9 +104,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 - !ruby/object:Gem::Version
   version: '0'
 requirements: []
-rubyforge_project:
-rubygems_version: 2.4.5.1
-signing_key:
+rubyforge_project:
+rubygems_version: 2.4.8
+signing_key:
 specification_version: 4
 summary: A simple, hopefully easy-to-use Markov chain generator.
 test_files: []
data/lib/markovian/importers/twitter/csv_importer.rb DELETED
@@ -1,47 +0,0 @@
-require 'csv'
-require 'markovian/importers/twitter/tweet'
-
-# This class will import a Twitter archive CSV, returning a set of tweets suitable for importation
-# into a Markovian chain.
-module Markovian
-  module Importers
-    module Twitter
-      class CsvImporter
-        attr_reader :path
-        def initialize(path)
-          @path = path
-        end
-
-        def texts_for_markov_analysis
-          # reject any blank tweets -- in our case, those with only a stripped-out URL
-          tweet_enumerator.reject {|t| t.empty?}
-        end
-
-        def chain
-          Chain::Compiler.new.build_chain(texts_for_markov_analysis)
-        end
-
-        protected
-
-        def csv_enumerator
-          # returns an iterator object that we can roll through
-          # this does not actually start reading the file
-          @csv_enumerator ||= CSV.open(path, headers: true).each
-        end
-
-        # an iterator over personal tweets (e.g. not RTs)
-        # the lazy iterator allows us to add the condition without having to parse the entire file at
-        # once (which could easily encounter tens of thousands of rows).
-        def personal_tweet_enumerator
-          csv_enumerator.select {|row| row["retweeted_status_id"].empty? }
-        end
-
-        def tweet_enumerator
-          personal_tweet_enumerator.map do |row|
-            Tweet.new(row["text"]).interesting_text
-          end
-        end
-      end
-    end
-  end
-end
data/lib/markovian/importers/twitter/tweet.rb DELETED
@@ -1,37 +0,0 @@
-module Markovian
-  module Importers
-    module Twitter
-      # Represents an individual tweet
-      class Tweet
-        attr_reader :text
-        def initialize(text)
-          @text = text
-        end
-
-        # Not currently used, but we might want to weight mentions later.
-        def mentions
-          text.scan(/(\@[a-z0-9_]+)/).flatten
-        end
-
-        def interesting_text
-          without_urls(without_leading_dot(text))
-        end
-
-        protected
-
-        # We don't want URLs to be considered inside our Markov machine.
-        # URL matching is nearly impossible, but this regexp should be good enough: http://stackoverflow.com/questions/17733236/optimize-gruber-url-regex-for-javascript
-        # Nowadays Twitter replaces URLs with their own link shortener, but historically that wasn't
-        # always true.
-        def without_urls(string)
-          string.gsub(/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)\S+(?:[^\s`!\[\]{};:'".,?«»“”‘’]))/i, "")
-        end
-
-        # Avoid dots used to trigger mentions
-        def without_leading_dot(string)
-          string.gsub(/^\.\@/, "@")
-        end
-      end
-    end
-  end
-end