twitter_ebooks 2.0.7 → 2.0.8

data/.gitignore CHANGED
File without changes
data/Gemfile CHANGED
File without changes
data/Gemfile.lock CHANGED
@@ -1,8 +1,7 @@
 PATH
   remote: .
   specs:
-    twitter_ebooks (2.0.3)
-      bloomfilter-rb
+    twitter_ebooks (2.0.7)
       engtagger
       fast-stemmer
       gingerice
@@ -19,8 +18,6 @@ GEM
     addressable (2.3.5)
     atomic (1.1.14)
     awesome_print (1.2.0)
-    bloomfilter-rb (2.1.1)
-      redis
     cookiejar (0.3.0)
     daemons (1.1.9)
     em-http-request (1.0.3)
@@ -50,7 +47,6 @@ GEM
     minitest (5.0.8)
     multi_json (1.8.2)
     multipart-post (1.2.0)
-    redis (3.0.5)
     rufus-scheduler (3.0.2)
       tzinfo
     simple_oauth (0.2.0)
data/LICENSE CHANGED
File without changes
data/NOTES.md CHANGED
File without changes
data/README.md CHANGED
@@ -1,20 +1,9 @@
-# twitter\_ebooks 2.0.0
+# twitter\_ebooks 2.0.8
 
-Complete rewrite of twitter\_ebooks. Allows context-sensitive responsive bots via the Twitter streaming API, along with higher-quality tokenization and ngram modeling.
+Complete rewrite of twitter\_ebooks. Allows context-sensitive responsive bots via the Twitter streaming API, along with higher-quality ngram modeling. Still needs a bit of cleanup and documentation.
 
 ## Installation
 
 ```bash
 gem install twitter_ebooks
 ```
-
-## Making a bot
-
-twitter\_ebooks uses a Rails-like skeleton app generator. Let's say we want to make a revolutionary Marxist bot based on the writings of Leon Trotsky (who doesn't?):
-
-```bash
-ebooks new trotsky_ebooks
-cd trotsky_ebooks
-```
-
-
data/Rakefile CHANGED
File without changes
data/bin/ebooks CHANGED
@@ -46,9 +46,9 @@ module Ebooks
   def self.gen(model_path, input)
     model = Model.load(model_path)
     if input && !input.empty?
-      puts "@cmd " + model.markov_response(input, 135)
+      puts "@cmd " + model.make_response(input, 135)
     else
-      puts model.markov_statement
+      puts model.make_statement
     end
   end
 
@@ -64,7 +64,7 @@ module Ebooks
   def self.tweet(modelpath, username)
     load File.join(APP_PATH, 'bots.rb')
     model = Model.load(modelpath)
-    statement = model.markov_statement
+    statement = model.make_statement
     log "@#{username}: #{statement}"
     bot = Bot.get(username)
     bot.configure
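
The `markov_*` → `make_*` rename tracks the switch from `MarkovModel` to the new `SuffixGenerator` (see below); the CLI behaves as before. A minimal sketch of the equivalent direct calls, assuming a hypothetical model path:

```ruby
require 'twitter_ebooks'

model = Ebooks::Model.load("model/corpus.model")     # hypothetical path
puts model.make_statement                            # free-standing tweet, 140-char default
puts "@cmd " + model.make_response("hi there", 135)  # reply, capped at 135 to leave room for the mention
```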
data/data/adjectives.txt CHANGED
File without changes
data/data/nouns.txt CHANGED
File without changes
data/data/stopwords.txt CHANGED
File without changes
data/lib/twitter_ebooks/archiver.rb CHANGED
File without changes
data/lib/twitter_ebooks/bot.rb CHANGED
File without changes
data/lib/twitter_ebooks/markov.rb CHANGED
@@ -54,9 +54,10 @@ module Ebooks
 
     def chain(tokens)
       if tokens.length == 1
-        matches = @unigrams[tokens[0]]
+        matches = @unigrams[tokens[-1]]
       else
         matches = @bigrams[tokens[-2]][tokens[-1]]
+        matches = @unigrams[tokens[-1]] if matches.length < 2
       end
 
       if matches.empty?
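
The added fallback switches to unigram continuations whenever the bigram context offers fewer than two choices, so sparse contexts no longer lock the chain into replaying the corpus. A toy illustration with hypothetical model internals:

```ruby
# Hypothetical @unigrams/@bigrams contents for a tiny corpus:
unigrams = { "the" => ["cat", "dog", "end"] }
bigrams  = { "pet" => { "the" => ["cat"] } }

tokens  = ["pet", "the"]
matches = bigrams[tokens[-2]][tokens[-1]]            # ["cat"] -- only one continuation
matches = unigrams[tokens[-1]] if matches.length < 2 # too sparse, fall back
matches                                              #=> ["cat", "dog", "end"]
```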
data/lib/twitter_ebooks/model.rb CHANGED
@@ -7,7 +7,7 @@ require 'digest/md5'
 
 module Ebooks
   class Model
-    attr_accessor :hash, :sentences, :markov, :keywords
+    attr_accessor :hash, :sentences, :generator, :keywords
 
     def self.consume(txtpath)
      Model.new.consume(txtpath)
@@ -67,16 +67,29 @@ module Ebooks
       NLP.htmlentities.decode tweet
     end
 
-    def markov_statement(limit=140, markov=nil)
-      markov ||= MarkovModel.build(@sentences)
+    def valid_tweet?(tokens, limit)
+      tweet = NLP.reconstruct(tokens)
+      tweet.length <= limit && !NLP.unmatched_enclosers?(tweet)
+    end
+
+    def make_statement(limit=140, generator=nil)
+      responding = !generator.nil?
+      generator ||= SuffixGenerator.build(@sentences)
       tweet = ""
 
-      while (tweet = markov.generate) do
-        next if tweet.length > limit
-        next if NLP.unmatched_enclosers?(tweet)
-        break if tweet.length > limit*0.4 || rand > 0.8
+      while (tokens = generator.generate(3, :bigrams)) do
+        next if tokens.length <= 3 && !responding
+        break if valid_tweet?(tokens, limit)
+      end
+
+      if @sentences.include?(tokens) && tokens.length > 3 # We made a verbatim tweet by accident
+        while (tokens = generator.generate(3, :unigrams)) do
+          break if valid_tweet?(tokens, limit) && !@sentences.include?(tokens)
+        end
       end
 
+      tweet = NLP.reconstruct(tokens)
+
       fix tweet
     end
 
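Acceptance is now delegated to `valid_tweet?`, which checks the length limit and balanced enclosers in one place. Roughly, with hypothetical token arrays (and assuming `NLP.unmatched_enclosers?` flags the dangling quote):

```ruby
model.valid_tweet?(["hello", "world"], 140)          #=> true
model.valid_tweet?(["\"", "dangling", "quote"], 140) #=> false (unmatched encloser)
model.valid_tweet?(["token"] * 50, 140)              #=> false (reconstruction exceeds limit)
```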
@@ -101,19 +114,19 @@ module Ebooks
     end
 
     # Generates a response by looking for related sentences
-    # in the corpus and building a smaller markov model from these
-    def markov_response(input, limit=140)
+    # in the corpus and building a smaller generator from these
+    def make_response(input, limit=140)
       # First try
       relevant, slightly_relevant = relevant_sentences(input)
 
       if relevant.length >= 3
-        markov = MarkovModel.new.consume(relevant)
-        markov_statement(limit, markov)
-      elsif slightly_relevant.length > 5
-        markov = MarkovModel.new.consume(slightly_relevant)
-        markov_statement(limit, markov)
+        generator = SuffixGenerator.build(relevant)
+        make_statement(limit, generator)
+      elsif slightly_relevant.length >= 5
+        generator = SuffixGenerator.build(slightly_relevant)
+        make_statement(limit, generator)
       else
-        markov_statement(limit)
+        make_statement(limit)
       end
     end
   end
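
A condensed sketch of the tier selection, assuming `relevant_sentences` returns a pair of token-array lists ordered from strongly to loosely relevant:

```ruby
relevant, slightly_relevant = model.relevant_sentences("cats")
pool = if relevant.length >= 3            # enough on-topic material
         relevant
       elsif slightly_relevant.length >= 5
         slightly_relevant
       end
# a nil pool means make_statement falls back to a full-corpus generator
generator = pool && Ebooks::SuffixGenerator.build(pool)
```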
data/lib/twitter_ebooks/nlp.rb CHANGED
@@ -61,7 +61,7 @@ module Ebooks
     # As above, this is ad hoc because tokenization libraries
     # do not behave well wrt. things like emoticons and timestamps
     def self.tokenize(sentence)
-      regex = /\s+|(?<=[#{PUNCTUATION}])(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+)/
+      regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
       sentence.split(regex)
     end
 
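The added `\s` in both lookarounds means punctuation only triggers a split at whitespace boundaries, so word-internal punctuation such as contractions survives. A rough before/after sketch, assuming `PUNCTUATION` covers the usual sentence marks:

```ruby
NLP.tokenize("it's fine, really")
# old regex: ["it", "'", "s", "fine", ",", "really"]  (splits inside the contraction)
# new regex: ["it's", "fine", ",", "really"]          (splits only at token boundaries)
```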
@@ -150,5 +150,12 @@ module Ebooks
 
       false
     end
+
+    # Determine if a2 occurs as a contiguous subsequence of a1
+    def self.subseq?(a1, a2)
+      a1.each_index.find do |i|
+        a1[i...i+a2.length] == a2
+      end
+    end
   end
 end
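
`subseq?` is used by the new `SuffixGenerator` to reject variants that reproduce runs of the corpus; note it returns a start index (truthy) rather than a boolean:

```ruby
NLP.subseq?([1, 2, 3, 4], [2, 3])  #=> 1 (contiguous match starting at index 1)
NLP.subseq?([1, 2, 3, 4], [3, 2])  #=> nil
```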
data/lib/twitter_ebooks/suffix.rb ADDED
@@ -0,0 +1,80 @@
+module Ebooks
+  class SuffixGenerator
+    def self.build(sentences)
+      SuffixGenerator.new(sentences)
+    end
+
+    def initialize(sentences)
+      @sentences = sentences.reject { |s| s.length < 2 }
+      @unigrams = {}
+      @bigrams = {}
+
+      @sentences.each_with_index do |tokens, i|
+        last_token = INTERIM
+        tokens.each_with_index do |token, j|
+          @unigrams[last_token] ||= []
+          @unigrams[last_token] << [i, j]
+
+          @bigrams[last_token] ||= {}
+          @bigrams[last_token][token] ||= []
+
+          if j == tokens.length-1 # Mark sentence endings
+            @unigrams[token] ||= []
+            @unigrams[token] << [i, INTERIM]
+            @bigrams[last_token][token] << [i, INTERIM]
+          else
+            @bigrams[last_token][token] << [i, j+1]
+          end
+
+          last_token = token
+        end
+      end
+
+      self
+    end
+
+    def generate(passes=5, n=:unigrams)
+      index = rand(@sentences.length)
+      tokens = @sentences[index]
+      used = [index] # Sentences we've already used
+      verbatim = [tokens] # Verbatim sentences to avoid reproducing
+
+      0.upto(passes-1) do
+        varsites = {} # Map bigram start site => next token alternatives
+
+        tokens.each_with_index do |token, i|
+          next_token = tokens[i+1]
+          break if next_token.nil?
+
+          alternatives = (n == :unigrams) ? @unigrams[next_token] : @bigrams[token][next_token]
+          alternatives.reject! { |a| a[1] == INTERIM || used.include?(a[0]) }
+          varsites[i] = alternatives unless alternatives.empty?
+        end
+
+        variant = nil
+        varsites.to_a.shuffle.each do |site|
+          start = site[0]
+
+          site[1].shuffle.each do |alt|
+            verbatim << @sentences[alt[0]]
+            suffix = @sentences[alt[0]][alt[1]..-1]
+            potential = tokens[0..start+1] + suffix
+
+            # Avoid reproducing any verbatim segment of the corpus
+            unless verbatim.find { |v| NLP.subseq?(v, potential) || NLP.subseq?(potential, v) }
+              used << alt[0]
+              variant = potential
+              break
+            end
+          end
+
+          break if variant
+        end
+
+        tokens = variant if variant
+      end
+
+      tokens
+    end
+  end
+end
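
A minimal usage sketch of the new generator with a toy corpus of token arrays (real input comes from `Model#consume`; `INTERIM` is assumed to be the sentence-boundary marker defined elsewhere in the gem):

```ruby
sentences = [
  ["the", "cat", "sat", "."],
  ["the", "dog", "sat", "down", "."],
  ["a", "dog", "barked", "."]
]

generator = Ebooks::SuffixGenerator.build(sentences)
tokens = generator.generate(3, :bigrams)  # three mutation passes over bigram sites
puts Ebooks::NLP.reconstruct(tokens)
```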
data/lib/twitter_ebooks/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Ebooks
-  VERSION = "2.0.7"
+  VERSION = "2.0.8"
 end
data/lib/twitter_ebooks.rb CHANGED
@@ -16,5 +16,6 @@ end
 require 'twitter_ebooks/nlp'
 require 'twitter_ebooks/archiver'
 require 'twitter_ebooks/markov'
+require 'twitter_ebooks/suffix'
 require 'twitter_ebooks/model'
 require 'twitter_ebooks/bot'
data/skeleton/.gitignore CHANGED
File without changes
data/skeleton/Procfile CHANGED
File without changes
data/skeleton/bots.rb CHANGED
File without changes
File without changes
File without changes
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: twitter_ebooks
 version: !ruby/object:Gem::Version
-  version: 2.0.7
+  version: 2.0.8
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-11-06 00:00:00.000000000 Z
+date: 2013-11-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: minitest
@@ -180,12 +180,12 @@ files:
 - lib/twitter_ebooks/markov.rb
 - lib/twitter_ebooks/model.rb
 - lib/twitter_ebooks/nlp.rb
+- lib/twitter_ebooks/suffix.rb
 - lib/twitter_ebooks/version.rb
 - script/process_anc_data.rb
 - skeleton/.gitignore
 - skeleton/Procfile
 - skeleton/bots.rb
-- skeleton/corpus/README.md
 - skeleton/run.rb
 - test/corpus/0xabad1dea.tweets
 - test/keywords.rb
data/skeleton/corpus/README.md DELETED
@@ -1 +0,0 @@
-Put any raw text files in here to be processed.