bot_twitter_ebooks 3.2.0 → 3.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 9bd28ebd17798bf3a09edbd2ecac05e840d3be98
- data.tar.gz: 035fcd79ba4d452030d514b5389792c11631a84d
+ metadata.gz: 0a68a4d52f738569df003bfac626ffeff9ef8a27
+ data.tar.gz: 0f639a347f2d15988d1559bcef4c09a78b66cbd5
  SHA512:
- metadata.gz: 42440202f4e9fb012ed01261cc575107e82ac2e21450241c3d12d912ff9921c17d47be37abe8cc0aa504cd64d5c858aaac78ace66a8f28602f796a493216b46d
- data.tar.gz: c1872982f4d8c16487fbfe70bcecf71b560105c052c96f06237a4bd3acf35028adc4c4ab9a15c27eacb5f19fd9628a1adadb81528949c71da17ce0488b7b051b
+ metadata.gz: 4ca1cc6ceed9c9a71313d92b9f844eb923382f9f28d38766d3fe3e8772420e05e41312a8392fd7ee15f5a9dba89288d92da926c0079d0eee7fba78d158b9f92d
+ data.tar.gz: 7fe92c90844b8b4e34a4dd75df136006b994925c40f998ac1f2643dd181dfbbc01ba14c64c19e1baaa56c167f4a1f40e8b404a4fd3bba8166a72c9731830d4d2
data/.gitignore CHANGED
@@ -1,5 +1,250 @@
 
- # Created by https://www.gitignore.io/api/git,vim,ruby,linux,macos,emacs,windows
+ # Created by https://www.gitignore.io/api/git,vim,linux,macos,emacs,ruby,dropbox,windows,archive,vagrant,archives,compressed,compression,visualstudiocode,archlinuxpackages,compressedarchive
+
+ ### Archive ###
+
+ ### Mostly from https://en.wikipedia.org/wiki/List_of_archive_formats
+
+ ## Archiving only
+ # The traditional archive format on Unix-like systems, now used mainly for the creation of static libraries.
+ *.a
+ *.ar
+ # RPM files consist of metadata concatenated with (usually) a cpio archive. Newer RPM systems also support other archives, as cpio is becoming obsolete. cpio is also used with initramfs.
+ *.cpio
+
+ # A self-extracting archive that uses the Bourne shell (sh).
+ *.shar
+ # A system for storing multiple files. LBR archives typically contained files processed by SQ, or the archive itself was compressed with SQ. LBR archives that were compressed with SQ ended with the extension .LQR
+ *.LBR
+ # An archive format originally used mainly for archiving and distribution of the exact, nearly-exact, or custom-modified contents of an optical storage medium such as a CD-ROM or DVD-ROM. However, it can be used to archive the contents of other storage media, selected partitions, folders, and/or files. The resulting archive is typically optimized for convenient rendering to (re-)writable CD or DVD media.
+ *.iso
+ # A library format used primarily on the Commodore 64 and 128 lines of computers. This bears no resemblance to the DOS LBR format. While library files were quick to implement (a number of programs exist to work with them) they are crippled in that they cannot grow with use: once a file has been created it cannot be amended (files added, changed or deleted) without recreating the entire file.
+ *.lbr
+ # An archive format used by Mozilla for storing binary diffs. Used in conjunction with bzip2.
+ *.mar
+ # A common archive format used on Unix-like systems. Generally used in conjunction with compressors such as gzip, bzip2, compress or xz to create .tar.gz, .tar.bz2, .tar.Z or tar.xz files.
+ *.tar
+
+ # Package managers
+ # Red Hat Package Manager
+ *.rpm
+ # Debian package
+ *.deb
+ # MicroSoft Installer
+ *.msi
+ *.msm
+ *.msp
+ # Mozilla package installer
+ *.xpi
+ # Ruby Package
+ *.gem
+
+
+ ### Archives ###
+ # It's better to unpack these files and commit the raw source because
+ # git has its own built in compression methods.
+ *.7z
+ *.jar
+ *.rar
+ *.zip
+ *.gz
+ *.tgz
+ *.bzip
+ *.bz2
+ *.xz
+ *.lzma
+ *.cab
+
+ #packing-only formats
+
+ #package management formats
+ *.dmg
+ *.egg
+
+ ### ArchLinuxPackages ###
+ *.tar.*
+ *.exe
+ *.log
+ *.log.*
+ *.sig
+
+ pkg/
+ src/
+
+ ### Compressed ###
+ *.pkg
+ *.sit
+ *.sitx
+ *.zipx
+
+ ### CompressedArchive ###
+
+ ### Mostly from https://en.wikipedia.org/wiki/List_of_archive_formats
+
+ ## Archiving and compression
+ # Open source file format. Used by 7-Zip.
+ # Mac OS X, restoration on different platforms is possible although not immediate Yes Based on 7z. Preserves Spotlight metadata, resource forks, owner/group information, dates and other data which would be otherwise lost with compression.
+ *.s7z
+ # Old archive versions only Proprietary format
+ *.ace
+ # A format that compresses and doubly encrypt the data (AES256 and CAS256) avoiding brute force attacks, also hide files in an AFA file. It has two ways to safeguard data integrity and subsequent repair of the file if has an error (repair with AstroA2P (online) or Astrotite (offline)).
+ *.afa
+ # A mainly Korean format designed for very large archives.
+ *.alz
+ # Android application package (variant of JAR file format).
+ *.apk
+ # ??
+ *.arc
+ # Originally DOS, now multiple
+ *.arj
+ # Open archive format, used by B1 Free Archiver (http://dev.b1.org/standard/archive-format.html)
+ *.b1
+ # Binary Archive with external header
+ *.ba
+ # Proprietary format from the ZipTV Compression Components
+ *.bh
+ # The Microsoft Windows native archive format, which is also used by many commercial installers such as InstallShield and WISE.
+ # Originally DOS, now DOS and Windows Created by Yaakov Gringeler; released last in 2003 (Compressia 1.0.0.1 beta), now apparently defunct. Free trial of 30 days lets user create and extract archives; after that it is possible to extract, but not to create.
+ *.car
+ # Open source file format.
+ *.cfs
+ # Compact Pro archive, a common archiver used on Mac platforms until about Mac OS 7.5.x. Competed with StuffIt; now obsolete.
+ *.cpt
+ # Windows, Unix-like, Mac OS X Open source file format. Files are compressed individually with either gzip, bzip2 or lzo.
+ *.dar
+ # DiskDoubler Mac OS obsolete
+ *.dd
+ # ??
+ *.dgc
+ # Apple Disk Image upports "Internet-enabled" disk images, which, once downloaded, are automatically decompressed, mounted, have the contents extracted, and thrown away. Currently, Safari is the only browser that supports this form of extraction; however, the images can be manually extracted as well. This format can also be password-protected or encrypted with 128-bit or 256-bit AES encryption.
+ # Enterprise Java Archive archive
+ *.ear
+ # ETSoft compressed archive
+ # The predecessor of DGCA.
+ *.gca
+ # Originally DOS Yes, but may be covered by patents DOS era format; uses arithmetic/Markov coding
+ *.ha
+ # MS Windows HKI
+ *.hki
+ # Produced by ICEOWS program. Excels at text file compression.
+ *.ice
+ # Java archive, compatible with ZIP files
+ *.jar
+ # Open sourced archiver with compression using the PAQ family of algorithms and optional encryption.
+ *.kgb
+ # Originally DOS, now multiple Multiple Yes The standard format on Amiga.
+ *.lzh
+ *.lha
+ # Archiver originally used on The Amiga. Now copied by Microsoft to use in their .cab and .chm files.
+ *.lzx
+ # file format from NoGate Consultings, a rival from ARC-Compressor.
+ *.pak
+ # A disk image archive format that supports several compression methods as well as splitting the archive into smaller pieces.
+ *.partimg
+ # An experimental open source packager (http://mattmahoney.net/dc)
+ *.paq*
+ # Open source archiver supporting authenticated encryption, volume spanning, customizable object level and volume level integrity checks (form CRCs to SHA-512 and Whirlpool hashes), fast deflate based compression
+ *.pea
+ # The format from the PIM - a freeware compression tool by Ilia Muraviev. It uses an LZP-based compression algorithm with set of filters for executable, image and audio files.
+ *.pim
+ # PackIt Mac OS obsolete
+ *.pit
+ # Used for data in games written using the Quadruple D library for Delphi. Uses byte pair compression.
+ *.qda
+ # A proprietary archive format, second in popularity to .zip files.
+ # The format from a commercial archiving package. Odd among commercial packages in that they focus on incorporating experimental algorithms with the highest possible compression (at the expense of speed and memory), such as PAQ, PPMD and PPMZ (PPMD with unlimited-length strings), as well as a proprietary algorithms.
+ *.rk
+ # Self Dissolving ARChive Commodore 64, Commodore 128 Commodore 64, Commodore 128 Yes SDAs refer to Self Dissolving ARC files, and are based on the Commodore 64 and Commodore 128 versions of ARC, originally written by Chris Smeets. While the files share the same extension, they are not compatible between platforms. That is, an SDA created on a Commodore 64 but run on a Commodore 128 in Commodore 128 mode will crash the machine, and vice versa. The intended successor to SDA is SFX.
+ *.sda
+ # A pre-Mac OS X Self-Extracting Archive format. StuffIt, Compact Pro, Disk Doubler and others could create .sea files, though the StuffIt versions were the most common.
+ *.sea
+ # Scifer Archive with internal header
+ *.sen
+ # Commodore 64, Commodore 128 SFX is a Self Extracting Archive which uses the LHArc compression algorithm. It was originally developed by Chris Smeets on the Commodore platform, and runs primarily using the CS-DOS extension for the Commodore 128. Unlike its predecessor SDA, SFX files will run on both the Commodore 64 and Commodore 128 regardless of which machine they were created on.
+ *.sfx
+ # An archive format designed for the Apple II series of computers. The canonical implementation is ShrinkIt, which can operate on disk images as well as files. Preferred compression algorithm is a combination of RLE and 12-bit LZW. Archives can be manipulated with the command-line NuLib tool, or the Windows-based CiderPress.
+ *.shk
+ # A compression format common on Apple Macintosh computers. The free StuffIt Expander is available for Windows and OS X.
+ # The replacement for the .sit format that supports more compression methods, UNIX file permissions, long file names, very large files, more encryption options, data specific compressors (JPEG, Zip, PDF, 24-bit image, MP3). The free StuffIt Expander is available for Windows and OS X.
+ # A royalty-free compressing format
+ *.sqx
+ # The "tarball" format combines tar archives with a file-based compression scheme (usually gzip). Commonly used for source and binary distribution on Unix-like platforms, widely available elsewhere.
+ *.tar.gz
+ *.tar.Z
+ *.tar.bz2
+ *.tbz2
+ *.tar.lzma
+ *.tlz
+ # UltraCompressor 2.3 was developed to act as an alternative to the then popular PKZIP application. The main feature of the application is its ability to create large archives. This means that compressed archives with the UC2 file extension can hold almost 1 million files.
+ *.uc
+ *.uc0
+ *.uc2
+ *.ucn
+ *.ur2
+ *.ue2
+ # Based on PAQ, RZM, CSC, CCM, and 7zip. The format consists of a PAQ, RZM, CSC, or CCM compressed file and a manifest with compression settings stored in a 7z archive.
+ *.uca
+ # A high compression rate archive format originally for DOS.
+ *.uha
+ # Web Application archive (Java-based web app)
+ *.war
+ # File-based disk image format developed to deploy Microsoft Windows.
+ *.wim
+ # XAR
+ *.xar
+ # Native format of the Open Source KiriKiri Visual Novel engine. Uses combination of block splitting and zlib compression. The filenames and pathes are stored in UTF-16 format. For integrity check, the Adler-32 hashsum is used. For many commercial games, the files are encrypted (and decoded on runtime) via so-called "cxdec" module, which implements xor-based encryption.
+ *.xp3
+ # Yamazaki zipper archive. Compression format used in DeepFreezer archiver utility created by Yamazaki Satoshi. Read and write support exists in TUGZip, IZArc and ZipZag
+ *.yz1
+ # The most widely used compression format on Microsoft Windows. Commonly used on Macintosh and Unix systems as well.
+ # application/x-zoo zoo Multiple Multiple Yes
+ *.zoo
+ # Journaling (append-only) archive format with rollback capability. Supports deduplication and incremental update based on last-modified dates. Multi-threaded. Compresses in LZ77, BWT, and context mixing formats. Open source.
+ *.zpaq
+ # Archiver with a compression algorithm based on the Burrows-Wheeler transform method.
+ *.zz
+
+
+ ### Compression ###
+
+ ### From https://en.wikipedia.org/wiki/List_of_archive_formats
+
+ ## Compression only
+ # An open source, patent- and royalty-free compression format. The compression algorithm is a Burrows-Wheeler transform followed by a move-to-front transform and finally Huffman coding
+ # Old compressor for QNX4 OS. The compression algorithm is a modified LZSS, with an adaptive Huffman coding.
+ *.F
+ # GNU Zip, the primary compression format used by Unix-like systems. The compression algorithm is DEFLATE.
+ # An alternate LZMA algorithm implementation, with support for checksums and ident bytes.
+ *.lz
+ # The LZMA compression algorithm as used by 7-Zip
+ # An implementation of the LZO data compression algorithm
+ *.lzo
+ # A compression program designed to do particularly well on very large files containing long distance redundancy.
+ *.rz
+ # Windows compress/decompress- Linux and Mac OS X decompress only A compression program designed to do high compression on SF2 files (SoundFont)
+ *.sfark
+ # A compression format invented by Google and open-sourced in 2011. Snappy aims for very high speeds, reasonable compression, and maximum stability rather than maximum compression or compatibility with any other compression library.
+ *.sz
+ # Squeeze: A program which compressed files. A file which was "squeezed" had the middle initial of the name changed to "Q", so that a squeezed text file would end with .TQT, a squeezed executable would end with .CQM or .EQE. Typically used with .LBR archives, either by storing the squeezed files in the archive, or by storing the files decompressed and then compressing the archive, which would have a name ending in ".LQR".
+ *.?Q?
+ # A compression program written by Steven Greenberg implementing the LZW algorithm. For several years in the CP/M world when no implementation was available of ARC, CRUNCHed files stored in .LBR archives were very popular. CRUNCH's implementation of LZW had a somewhat unique feature of modifying and occasionally clearing the code table in memory when it became full, resulting in a few percent better compression on many files.
+ *.?Z?
+ # A compression format using LZMA2 to yield very high compression ratios.
+ # The traditional Huffman coding compression format.
+ *.z
+ # The traditional LZW compression format.
+ *.Z
+ # Joke compression program, actually increasing file size
+ *.infl
+ # Compression format(s) used by some DOS and Windows install programs. MS-DOS includes expand.exe to decompress its install files. The compressed files are created with a matching compress.exe command. The compression algorithm is LZSS.
+ *.??_
+
+
+ ### Dropbox ###
+ # Dropbox settings and caches
+ .dropbox
+ .dropbox.attr
+ .dropbox.cache
 
  ### Emacs ###
  # -*- mode: gitignore; -*-
@@ -112,7 +357,6 @@ Temporary Items
  .apdisk
 
  ### Ruby ###
- *.gem
  *.rbc
  /.config
  /coverage/
@@ -163,6 +407,9 @@ Gemfile.lock
  # unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
  .rvmrc
 
+ ### Vagrant ###
+ .vagrant/
+
  ### Vim ###
  # swap
  [._]*.s[a-v][a-z]
@@ -176,6 +423,14 @@ Session.vim
  # auto-generated tag files
  tags
 
+ ### VisualStudioCode ###
+ .vscode/*
+ !.vscode/settings.json
+ !.vscode/tasks.json
+ !.vscode/launch.json
+ !.vscode/extensions.json
+ .history
+
  ### Windows ###
  # Windows thumbnail cache files
  Thumbs.db
@@ -189,12 +444,8 @@ Desktop.ini
  $RECYCLE.BIN/
 
  # Windows Installer files
- *.cab
- *.msi
- *.msm
- *.msp
 
  # Windows shortcuts
  *.lnk
 
- # End of https://www.gitignore.io/api/git,vim,ruby,linux,macos,emacs,windows
+ # End of https://www.gitignore.io/api/git,vim,linux,macos,emacs,ruby,dropbox,windows,archive,vagrant,archives,compressed,compression,visualstudiocode,archlinuxpackages,compressedarchive
data/README.md CHANGED
@@ -21,7 +21,7 @@ Note that 3.0 is not backwards compatible with 2.x, so upgrade carefully! In par
 
  ## Installation
 
- Requires Ruby 2.1+. Ruby 2.4+ is recommended.
+ Requires Ruby 2.4+.
 
  ```bash
  gem install bot_twitter_ebooks
@@ -2,7 +2,7 @@
  require File.expand_path('../lib/bot_twitter_ebooks/version', __FILE__)
 
  Gem::Specification.new do |spec|
- spec.required_ruby_version = '>= 2.1'
+ spec.required_ruby_version = '>= 2.4'
 
  spec.name = "bot_twitter_ebooks"
  spec.version = Ebooks::VERSION
@@ -127,7 +127,7 @@ module Ebooks
  sentences.map do |s|
  tokens = NLP.tokenize(s).reject do |t|
  # Don't include usernames/urls as tokens
- t.include?('@') || t.include?('http')
+ t.start_with?('@') || t.downcase.start_with?('http')
  end
 
  tokens.map { |t| tikify(t) }
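
The hunk above narrows the username/URL filter from substring matching to prefix matching, and makes the URL check case-insensitive. A minimal sketch of the difference, using hypothetical sample tokens that are not taken from the gem's corpus or tests:

```ruby
# Hypothetical sample tokens (an assumption for illustration only).
tokens = ['hello', '@someone', 'me@example', 'HTTP://example.com', 'https://t.co/x', 'shttp']

# 3.2.0 filter: drop any token that contains '@' or 'http' anywhere.
old_kept = tokens.reject { |t| t.include?('@') || t.include?('http') }

# 3.3.0 filter: drop only tokens that start with '@',
# or that start with 'http' in any letter case.
new_kept = tokens.reject { |t| t.start_with?('@') || t.downcase.start_with?('http') }

puts "3.2.0 keeps: #{old_kept.inspect}"
puts "3.3.0 keeps: #{new_kept.inspect}"
```

With these inputs the older filter lets the upper-case URL through while discarding ordinary tokens that merely contain `@` or `http`; the newer filter does the opposite.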
@@ -169,7 +169,7 @@ module Ebooks
  mentions = []
  lines.each do |l|
  next if l.start_with?('#') # Remove commented lines
- next if l.include?('RT') || l.include?('MT') # Remove soft retweets
+ next if l.include?(': "RT @') || l.include?(': "MT @') # Remove soft retweets
 
  if l.include?('@')
  mentions << NLP.normalize(l)
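
The soft-retweet check in the hunk above moves from matching the bare letters `RT`/`MT` to matching them only where a quoted tweet body begins with `RT @` or `MT @`. A rough sketch of the effect follows; the input line layout (a `text: "` prefix before the tweet body) is an assumption made for illustration, since the real corpus format is not shown in this diff:

```ruby
# Hypothetical corpus lines; the `text: "..."` layout is assumed, not confirmed.
lines = [
  'text: "RT @someone: hello"',
  'text: "MT @someone: hi there"',
  'text: "I love ART and SMTP servers"',
  'text: "plain tweet"'
]

# 3.2.0 check: any line containing the letters RT or MT is skipped.
old_kept = lines.reject { |l| l.include?('RT') || l.include?('MT') }

# 3.3.0 check: only lines whose quoted text begins with RT @ or MT @ are skipped.
new_kept = lines.reject { |l| l.include?(': "RT @') || l.include?(': "MT @') }

puts "3.2.0 keeps #{old_kept.size} line(s), 3.3.0 keeps #{new_kept.size} line(s)"
```

Under these assumptions the older check also throws away the line mentioning "ART" and "SMTP", which the newer check retains.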
@@ -190,7 +190,7 @@ module Ebooks
 
  log "Ranking keywords"
  @keywords = NLP.keywords(text).top(200).map(&:to_s)
- log "Top keywords: #{@keywords[0]} #{@keywords[1]} #{@keywords[2]}"
+ log "Top keywords: #{@keywords[0]} #{@keywords[1]} #{@keywords[2]} #{@keywords[3]} #{@keywords[4]} #{@keywords[5]} #{@keywords[6]} #{@keywords[7]}"
 
  self
  end
@@ -7,7 +7,7 @@ module Ebooks
  module NLP
  # We deliberately limit our punctuation handling to stuff we can do consistently
  # It'll just be a part of another token if we don't split it out, and that's fine
- PUNCTUATION = ".?!,"
+ PUNCTUATION = ".¿?¡!,"
 
  # Lazy-load NLP libraries and resources
  # Some of this stuff is pretty heavy and we don't necessarily need
@@ -86,10 +86,12 @@ module Ebooks
  # @param text [String]
  # @return [Highscore::Keywords]
  def self.keywords(text)
- # Preprocess to remove stopwords (highscore's blacklist is v. slow)
- text = NLP.tokenize(text).reject { |t| stopword?(t) }.join(' ')
+ # Preprocess to remove stopwords and urls (highscore's blacklist is v. slow)
+ text = NLP.tokenize(text).reject do |t|
+ t.downcase.start_with?('http') || stopword?(t)
+ end
 
- text = Highscore::Content.new(text)
+ text = Highscore::Content.new(text.join(' '))
 
  text.configure do
  #set :multiplier, 2
@@ -98,7 +100,7 @@ module Ebooks
  #set :long_words_threshold, 15
  #set :vowels, 1 # => default: 0 = not considered
  #set :consonants, 5 # => default: 0 = not considered
- #set :ignore_case, true # => default: false
+ set :ignore_case, true # => default: false
  set :word_pattern, /(?<!@)(?<=\s)[\p{Word}']+/ # => default: /\w+/
  #set :stemming, true # => default: false
  end
@@ -1,3 +1,3 @@
  module Ebooks
- VERSION = "3.2.0"
+ VERSION = "3.3.0"
  end
data/spec/model_spec.rb CHANGED
@@ -39,8 +39,8 @@ describe Ebooks::Model do
  expect(report2.total_memsize).to be < 4000000
 
  expect(model.tokens[0]).to be_a String
- expect(model.sentences[0][0]).to be_a Fixnum
- expect(model.mentions[0][0]).to be_a Integer
+ expect(model.sentences[0][0]).to be_a Integer
+ expect(model.mentions[0][0]).to be_a Integer
  expect(model.keywords[0]).to be_a String
 
  puts "elonmusk.model uses #{report2.total_memsize} bytes in memory"
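
The spec change from `Fixnum` to `Integer` lines up with the new Ruby 2.4+ requirement: Ruby 2.4 unified `Fixnum` and `Bignum` into `Integer`, and referencing `Fixnum` is deprecated from that release on. A small sketch of the check the spec now relies on, with arbitrary example values:

```ruby
# Arbitrary example values: one small integer and one well beyond 64 bits.
small = 42
large = 2**100

# Both are instances of Integer, so an is_a?/kind_of? check against Integer
# holds for values of either size.
checks = [small, large].map { |n| n.is_a?(Integer) }

puts checks.inspect
```

Checking against `Integer` avoids the deprecated constant and does not depend on how large the stored token indices happen to be.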
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: bot_twitter_ebooks
  version: !ruby/object:Gem::Version
- version: 3.2.0
+ version: 3.3.0
  platform: ruby
  authors:
  - astrolince
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2017-09-29 00:00:00.000000000 Z
+ date: 2017-10-07 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: rspec
@@ -289,7 +289,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- version: '2.1'
+ version: '2.4'
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="