markovian 0.3.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 290b5c05432cd805aa1aafdae2d93b68cf1e9a8a
- data.tar.gz: c51deea8332351976638c6767603ad137c85fb4b
+ metadata.gz: a3434070498b33afcd46afc82fd46aa827a46abf
+ data.tar.gz: 0b416ef501bfbaec9bdaf7d2e072b2ef347891e3
  SHA512:
- metadata.gz: eca6c116a0e9686b90ebd3e9335cd55f3a48261a3824dd5d2d71c58e6ba97b8749c738b042da0f2b72c02df58936a924ae9953a1970b64d02d70b58f3f953ae9
- data.tar.gz: e2279a199969da3cf587952a57a6eb1fb3d7f22e967cba3f4dc700ba5022e5562e4e70eb5a34f3f199a65ee458ed232952a9f7e17b254054e0b3bd7327d89839
+ metadata.gz: 06f77a167d5e5f8e9699385dc160e239bd92f5f7ef484bfb9e96d1bdba8ba0fd96f36f9d10786a496f57500e79754c192d6cdac24a3afa2b2e1653b15245fa59
+ data.tar.gz: 3773b056fb2356813780e191cd27536a861a3b007a0f4fafe94e6c55c9122512fd5cd5ae11e5448f9b5033d2944cebd31b5d348c49182d797b4fdc6a8f1320c2
data/README.md CHANGED
@@ -15,27 +15,31 @@ Clone from Github, and then execute:
  Fuller documentation will come shortly. For now, let's see how we can use Markovian to build some tweets from a Twitter archive we've downloaded:

  ```ruby
- > path = #{path_to_twitter_archive}
- => path_to_twitter_archive
- > importer = Markovian::Importers::Twitter::CsvImporter.new(path)
- => #<Markovian::Importers::Twitter::CsvImporter:0x007fd0ca3282a8 @path=path_to_twitter_archive>
- # now assemble the chain based on the tweets -- this may take a few seconds to compile
- > chain = importer.chain
- => #<Markovian::Corpus:0x007fd0ca03df70 ...>
+ > chain = Markovian::Chain.new
+ > chain.lengthen("there", next_word: "friend")
+ > chain.lengthen("there", next_word: "are")
+ > chain.lengthen("are", next_word: "four", previous_word: "four")
+ > chain.lengthen("four", next_word: "lights", previous_word: "four")
+ > chain.lengthen("are", next_word: "we")
+ > chain.lengthen("friend", next_word: "cat")
+ > chain.lengthen("cat", next_word: "rocks", previous_word: "friend")

  # Now, we can build some text!
- > Markovian::TextBuilder.new(chain).construct("markov")
- => "markov chains a lot better than a month, i've been here half an hour of night when you can get behind belgium for the offline train journey"
+ > Markovian::TextBuilder.new(chain).construct("there")
+ => "there friend cat rocks"
  ```

  Exactly!

+ Markovian is most easily used with the [markovian-tools
+ gem](https://github.com/arsduo/markovian-tools), which provides utilities for importing
+ Twitter and Facebook archives and for posting tweets, among other things.
+
  ## Features

  So far, Markovian gives you the ability to, given a set of inputs, generate random text. In
  addition, your money gets you:

- * A built-in importer to turn Twitter csv archives into Markov chain-derived text
- * A built-in filter to remove final words that statistically (in the corpus) rarely end sentences.
+ * A built-in filter to remove final words that statistically (in the corpus) rarely end sentences.
  Avoid unsightly sentences ending in "and so of" and so on!

  ## Development
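The README example above can be mirrored with a tiny self-contained sketch. The `MiniChain` class below is hypothetical (it tracks only single-word transitions, while `Markovian::Chain` also keeps a two-word dictionary and word metadata), but it shows the same lengthen-then-walk flow:

```ruby
# MiniChain is a stand-in for Markovian::Chain: record transitions with
# lengthen, then walk them word by word. Unlike the gem, it ignores
# previous_word context entirely.
class MiniChain
  def initialize
    @transitions = Hash.new { |hash, key| hash[key] = [] }
  end

  def lengthen(word, next_word:)
    @transitions[word] << next_word
    word
  end

  def next_word(word)
    @transitions[word].sample
  end
end

chain = MiniChain.new
chain.lengthen("there", next_word: "friend")
chain.lengthen("friend", next_word: "cat")
chain.lengthen("cat", next_word: "rocks")

# Walk the chain from a seed word until no continuation exists.
words = ["there"]
while (continuation = chain.next_word(words.last))
  words << continuation
end
words.join(" ") # => "there friend cat rocks"
```

With a single candidate per word this walk is deterministic; with a real corpus, `sample` is where the randomness of the Markov walk comes in.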
data/changelog.md CHANGED
@@ -1,5 +1,14 @@
  # CHANGELOG

+ ## 0.4.0
+
+ * Extract SentenceBuilder from TextBuilder for future use
+ * Chain#lengthen can now take strings as well as Tokeneyes::Words
+ * Fix bug preventing reuse of TextBuilder objects
+ * Update EndOfSentenceFilter (works when no words match, has no limit, uses proper probabilities)
+ * Bumped up the significant occurrence threshold for filtering to 500 occurrences
+ * Handle edge cases of words that always end sentences
+
  ## 0.3.0

  * TextBuilder now filters out final words that statistically rarely end sentences (first filter!)
data/lib/markovian.rb CHANGED
@@ -1,8 +1,6 @@
  require 'markovian/text_builder'
  require 'markovian/chain'
  require 'markovian/chain/compiler'
- # importers
- require 'markovian/importers/twitter/csv_importer'

  # The base module.
  module Markovian
@@ -17,10 +17,7 @@ module Markovian
  end

  def lengthen(word, next_word:, previous_word: nil)
- # When we encounter a word, we track its metadata and and what words surround it
- write_to_dictionary(@one_key_dictionary, word, word, next_word)
- write_to_dictionary(@two_key_dictionary, two_word_key(previous_word, word), word, next_word)
- word
+ push(tokeneyes(word), tokeneyes(next_word), tokeneyes(previous_word))
  end

  def next_word(word, previous_word: nil)
@@ -40,6 +37,12 @@ module Markovian

  protected

+ def push(word, next_word, previous_word)
+ write_to_dictionary(@one_key_dictionary, word, word, next_word)
+ write_to_dictionary(@two_key_dictionary, two_word_key(previous_word, word), word, next_word)
+ word
+ end
+
  # for equality checking
  attr_reader :one_key_dictionary, :two_key_dictionary

@@ -77,5 +80,17 @@ module Markovian
  dictionary[key].record_observance(word_instance)
  dictionary[key].push(next_word)
  end
+
+ # Allow strings to be passed in natively. There won't be metadata, but for small things this
+ # makes the gem much easier to use.
+ def tokeneyes(word)
+ return nil unless word
+
+ if word.is_a?(Tokeneyes::Word)
+ word
+ else
+ Tokeneyes::Word.new(word)
+ end
+ end
  end
  end
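The `tokeneyes` helper in the hunk above is what lets `Chain#lengthen` accept plain strings. A minimal sketch of the same coercion, with a made-up `Token` struct standing in for `Tokeneyes::Word`:

```ruby
# Token is a hypothetical stand-in for Tokeneyes::Word, which carries
# word metadata in the real gem.
Token = Struct.new(:text) do
  def to_s
    text
  end
end

# Mirrors the diff's tokeneyes: preserve nil, pass tokens through unchanged,
# and wrap bare strings so callers can use either form.
def tokeneyes(word)
  return nil unless word

  word.is_a?(Token) ? word : Token.new(word)
end
```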
@@ -2,7 +2,9 @@ module Markovian
  class Chain
  class DictionaryEntry
  # Below this, we don't have enough occurrences to draw conclusions about how a word is used.
- SIGNIFICANT_OCCURRENCE_THRESHOLD = 50
+ # Longer-term, this could possibly be calculated in a more dynamic and effective way by
+ # analyzing the corpus itself.
+ SIGNIFICANT_OCCURRENCE_THRESHOLD = 500

  attr_reader :word, :counts
  def initialize(word)
@@ -38,7 +40,8 @@ module Markovian
  end

  def ==(other)
- self.word == other.word &&
+ other &&
+ self.word == other.word &&
  self.next_words == other.next_words &&
  self.previous_words == other.previous_words
  end
@@ -1,24 +1,19 @@
  require 'markovian/utils/text_splitter'
+ require 'markovian/text_builder/sentence_builder'
  require 'markovian/text_builder/end_of_sentence_filter'

  # This class, given a Markov chain, will attempt to construct a new text based on a given seed using
  # the Markov associations.
  module Markovian
  class TextBuilder
- attr_reader :seed_text, :chain
+ attr_reader :chain
  def initialize(chain)
  @chain = chain
  end

  def construct(seed_text, length: 140, exclude_seed_text: false)
- # TODO: if we don't hit a result for the first pair, move backward through the original text
- # until we get something
- seed_components = split_seed_text(seed_text)
- output = result_with_next_word(
- previous_pair: identify_starter_text(seed_components),
- result: exclude_seed_text ? [] : seed_components,
- length: length
- )
+ sentence_builder = SentenceBuilder.new(chain: chain, max_length: length, seed_text: seed_text)
+ output = sentence_builder.construct_sentence(exclude_seed_text)
  format_output(apply_filters(output))
  end

@@ -28,47 +23,17 @@ module Markovian
  EndOfSentenceFilter.new.filtered_sentence(sentence_with_word_data(output))
  end

- def identify_starter_text(seed_components)
- if seed_components.length >= 2
- seed_components[-2..-1]
- else
- # if we only have a one-word seed text, the previous word is nil
- [nil, seed_components.first]
- end
- end
-
- def result_with_next_word(previous_pair:, result:, length:)
- previous_word, current_word = previous_pair
- if next_word = chain.next_word(current_word, previous_word: previous_word)
- # we use join rather than + to avoid leading spaces, and strip to ignore leading nils or
- # empty strings
- interim_result = result + [next_word]
- if format_output(interim_result).length > length
- result
- else
- result_with_next_word(
- previous_pair: [current_word, next_word],
- result: interim_result,
- length: length
- )
- end
- else
- result
- end
- end
-
  # Turn an array of Word objects into an ongoing string
  def format_output(array_of_words)
  array_of_words.compact.map(&:to_s).map(&:strip).join(" ")
  end

  def sentence_with_word_data(sentence)
- @sentence_with_word_data ||= sentence.map {|word| chain.word_entry(word)}
+ sentence.map {|word| chain.word_entry(word)}
  end

- def split_seed_text(seed_text)
- # We get back Tokeneyes::Word objects, but for now only care about the strings within
- Utils::TextSplitter.new(seed_text).components
+ def sentence_builder
+ @sentence_builder ||= SentenceBuilder.new(chain)
  end
  end
  end
@@ -4,8 +4,6 @@ module Markovian
  # to a certain number of words if those words have a low likelihood of ending the sentence.
  # Future changes will increase the qualities filtered for.
  class EndOfSentenceFilter
- MAX_WORDS_FILTERED = 3
-
  def filtered_sentence(sentence)
  filter_unlikely_ending_words(sentence)
  end
@@ -13,11 +11,12 @@ module Markovian
  protected

  def filter_unlikely_ending_words(current_sentence, words_filtered = 0)
- return current_sentence if words_filtered >= MAX_WORDS_FILTERED
-
  last_word = current_sentence.last
- likelihood = last_word.likelihood_to_end_sentence
- if likelihood && rand < likelihood
+ if !last_word
+ # None of the words merit ending the sentence! The caller will deal with how to handle
+ # this situation.
+ []
+ elsif should_filter_out?(last_word)
  # if we pop a word, consider removing the next one
  filter_unlikely_ending_words(current_sentence[0..-2], words_filtered + 1)
  else
@@ -25,6 +24,19 @@ module Markovian
  current_sentence
  end
  end
+
+ def should_filter_out?(word)
+ likelihood = word.likelihood_to_end_sentence
+ # We filter words out that
+ # a) have enough data to say whether they end sentences
+ # b) do not always end the sentence AND
+ # c1) either literally never end a sentence OR
+ # c2) randomly fail a check based on how frequently they end stuff
+ likelihood &&
+ likelihood != 1 &&
+ (likelihood == 0 || rand > word.likelihood_to_end_sentence)
+
+ end
  end
  end
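The `should_filter_out?` predicate above combines three conditions. This isolated sketch takes the likelihood as a number and injects the random roll so the rule is testable; the `roll:` keyword is our addition, not part of the gem's API:

```ruby
# A sentence-ending word is filtered out when we have likelihood data for it,
# it does not always end a sentence (likelihood != 1), and it either never
# ends one (likelihood == 0) or fails a probability check against how
# frequently it ends sentences in the corpus.
def should_filter_out?(likelihood, roll: rand)
  !likelihood.nil? &&
    likelihood != 1 &&
    (likelihood == 0 || roll > likelihood)
end
```

Words with likelihood 1 are never dropped, which is the "words that always end sentences" edge case called out in the 0.4.0 changelog.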
@@ -0,0 +1,63 @@
+ module Markovian
+ class TextBuilder
+ class SentenceBuilder
+ attr_reader :seed_text, :chain, :max_length
+ def initialize(chain:, seed_text:, max_length:)
+ @chain = chain
+ @seed_text = seed_text
+ @max_length = max_length
+ end
+
+ def construct_sentence(exclude_seed_text = false)
+ seed_components = split_seed_text(seed_text)
+ result = result_with_next_word(
+ previous_pair: identify_starter_text(seed_components),
+ result: exclude_seed_text ? [] : seed_components
+ )
+ # Return a set of strings, not Tokeneyes::Word objects
+ result.map(&:to_s)
+ end
+
+ protected
+
+ def identify_starter_text(seed_components)
+ if seed_components.length >= 2
+ seed_components[-2..-1]
+ else
+ # if we only have a one-word seed text, the previous word is nil
+ [nil, seed_components.first]
+ end
+ end
+
+ def result_with_next_word(previous_pair:, result:)
+ previous_word, current_word = previous_pair
+ if next_word = chain.next_word(current_word, previous_word: previous_word)
+ # we use join rather than + to avoid leading spaces, and strip to ignore leading nils or
+ # empty strings
+ interim_result = result + [next_word]
+ if format_output(interim_result).length > max_length
+ result
+ else
+ result_with_next_word(
+ previous_pair: [current_word, next_word],
+ result: interim_result
+ )
+ end
+ else
+ result
+ end
+ end
+
+ def split_seed_text(seed_text)
+ # We get back Tokeneyes::Word objects, but for now only care about the strings within
+ Utils::TextSplitter.new(seed_text).components
+ end
+
+ # Turn an array of Word objects into an ongoing string
+ def format_output(array_of_words)
+ array_of_words.compact.map(&:to_s).map(&:strip).join(" ")
+ end
+ end
+ end
+ end
+
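The seed handling in the new `SentenceBuilder` reduces any seed text to a (previous word, current word) pair before walking the chain. That logic in isolation, operating on a plain array of words:

```ruby
# Take the last two words of a multi-word seed as the starting pair;
# for a one-word (or empty) seed, the previous word is nil.
def identify_starter_text(seed_components)
  if seed_components.length >= 2
    seed_components[-2..-1]
  else
    [nil, seed_components.first]
  end
end
```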
@@ -1,3 +1,3 @@
  module Markovian
- VERSION = "0.3.0"
+ VERSION = "0.4.0"
  end
metadata CHANGED
@@ -1,52 +1,52 @@
  --- !ruby/object:Gem::Specification
  name: markovian
  version: !ruby/object:Gem::Version
- version: 0.3.0
+ version: 0.4.0
  platform: ruby
  authors:
  - Alex Koppel
- autorequire:
+ autorequire:
  bindir: exe
  cert_chain: []
- date: 2015-10-09 00:00:00.000000000 Z
+ date: 2015-10-25 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
- name: tokeneyes
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
  version: 0.1.0
- type: :runtime
+ name: tokeneyes
  prerelease: false
+ type: :runtime
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
  version: 0.1.0
  - !ruby/object:Gem::Dependency
- name: bundler
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '1.7'
- type: :development
+ name: bundler
  prerelease: false
+ type: :development
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '1.7'
  - !ruby/object:Gem::Dependency
- name: rake
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '10.0'
- type: :development
+ name: rake
  prerelease: false
+ type: :development
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
@@ -79,10 +79,9 @@ files:
  - lib/markovian/chain/compiler.rb
  - lib/markovian/chain/dictionary.rb
  - lib/markovian/chain/dictionary_entry.rb
- - lib/markovian/importers/twitter/csv_importer.rb
- - lib/markovian/importers/twitter/tweet.rb
  - lib/markovian/text_builder.rb
  - lib/markovian/text_builder/end_of_sentence_filter.rb
+ - lib/markovian/text_builder/sentence_builder.rb
  - lib/markovian/utils/text_splitter.rb
  - lib/markovian/version.rb
  - markovian.gemspec
@@ -90,7 +89,7 @@ homepage: https://github.com/arsduo/markov-ahkoppel
  licenses:
  - MIT
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -105,9 +104,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubyforge_project:
- rubygems_version: 2.4.5.1
- signing_key:
+ rubyforge_project:
+ rubygems_version: 2.4.8
+ signing_key:
  specification_version: 4
  summary: A simple, hopefully easy-to-use Markov chain generator.
  test_files: []
@@ -1,47 +0,0 @@
- require 'csv'
- require 'markovian/importers/twitter/tweet'
-
- # This class will import a Twitter archive CSV, returning a set of tweets suitable for importation
- # into a Markovian chain.
- module Markovian
- module Importers
- module Twitter
- class CsvImporter
- attr_reader :path
- def initialize(path)
- @path = path
- end
-
- def texts_for_markov_analysis
- # reject any blank tweets -- in our case, those with only a stripped-out URL
- tweet_enumerator.reject {|t| t.empty?}
- end
-
- def chain
- Chain::Compiler.new.build_chain(texts_for_markov_analysis)
- end
-
- protected
-
- def csv_enumerator
- # returns an iterator object that we can roll through
- # this does not actually start reading the file
- @csv_enumerator ||= CSV.open(path, headers: true).each
- end
-
- # an iterator over personal tweets (e.g. not RTs)
- # the lazy iterator allows us to add the condition without having to parse the entire file at
- # once (which could easily encounter tens of thousands of rows).
- def personal_tweet_enumerator
- csv_enumerator.select {|row| row["retweeted_status_id"].empty? }
- end
-
- def tweet_enumerator
- personal_tweet_enumerator.map do |row|
- Tweet.new(row["text"]).interesting_text
- end
- end
- end
- end
- end
- end
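The removed `CsvImporter` leaned on `CSV.open(...).each` returning an enumerator, so rows are parsed as they are consumed. A self-contained sketch of that pattern against a throwaway file (the column names mirror the Twitter archive format the importer expected; the nil-safe `.to_s` guard is our addition):

```ruby
require "csv"
require "tempfile"

# Build a tiny stand-in for a Twitter archive CSV.
archive = Tempfile.new(["tweets", ".csv"])
archive.write("text,retweeted_status_id\nhello world,\nRT something,123\n")
archive.rewind

# CSV.open without a block returns a CSV object; calling #each without a
# block on it returns an enumerator over the parsed rows.
rows = CSV.open(archive.path, headers: true).each

# Keep only "personal" tweets, i.e. rows without a retweet id.
personal_texts = rows
  .select { |row| row["retweeted_status_id"].to_s.empty? }
  .map { |row| row["text"] }

archive.close
archive.unlink
```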
@@ -1,37 +0,0 @@
- module Markovian
- module Importers
- module Twitter
- # Represents an individual tweet
- class Tweet
- attr_reader :text
- def initialize(text)
- @text = text
- end
-
- # Not currently used, but we might want to weight mentions later.
- def mentions
- text.scan(/(\@[a-z0-9_]+)/).flatten
- end
-
- def interesting_text
- without_urls(without_leading_dot(text))
- end
-
- protected
-
- # We don't want URLs to be considered inside our Markov machine.
- # URL matching is nearly impossible, but this regexp should be good enough: http://stackoverflow.com/questions/17733236/optimize-gruber-url-regex-for-javascript
- # Nowadays Twitter replaces URLS with their own link shortener, but historically that wasn't
- # always true.
- def without_urls(string)
- string.gsub(/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)\S+(?:[^\s`!\[\]{};:'".,?«»“”‘’]))/i, "")
- end
-
- # Avoid dots used to trigger mentions
- def without_leading_dot(string)
- string.gsub(/^\.\@/, "@")
- end
- end
- end
- end
- end