tokeneyes 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: f648d93394449ac71d1d776d559e4b394edd08af
4
+ data.tar.gz: 8ee6a9db2a1bf74b9bf317e225379a463b436012
5
+ SHA512:
6
+ metadata.gz: a94e9a12e5c9c301b791588593e858fc4ddb389fbe98b4e5dc80489819b0447ba2de289019a426c6075aea9c803f243269bcfcea3a440b18ff209f99c30a6f08
7
+ data.tar.gz: cecaa693773ea68d7211e82ff0c1a59fdd1fc875ab64bc98709d4acc1b6946ee01e5348d7bd7931da9853d0b6f303be6fe00b1bdc424c20174d359e905b93199
data/.gitignore ADDED
@@ -0,0 +1,11 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+ .ruby-version
11
+ .ruby-gemset
data/.travis.yml ADDED
@@ -0,0 +1,5 @@
1
+ language: ruby
2
+ rvm:
3
+ - 2.2.3
4
+ - jruby-9000
5
+ before_install: gem install bundler -v 1.10.6
data/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,13 @@
1
+ # Contributor Code of Conduct
2
+
3
+ As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
4
+
5
+ We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.
6
+
7
+ Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.
8
+
9
+ Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.
10
+
11
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.
12
+
13
+ This Code of Conduct is adapted from the [Contributor Covenant](http://contributor-covenant.org), version 1.0.0, available at [http://contributor-covenant.org/version/1/0/0/](http://contributor-covenant.org/version/1/0/0/)
data/Gemfile ADDED
@@ -0,0 +1,8 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in tokeneyes.gemspec
4
+ gemspec
5
+
6
+ group :test do
7
+ gem "codeclimate-test-reporter", require: nil
8
+ end
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2015 Alex Koppel
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,79 @@
1
+ # Tokeneyes
2
+
3
+ A string tokenizer designed to capture words with associated punctuation and sentence flow
4
+ information (e.g. if they start or end a sentence).
5
+
6
+ ### Why write a tokenizer?
7
+
8
+ As I was writing [markovian](https://github.com/arsduo/markovian), I realized that the Markov text
9
+ generator needed significantly more information about the corpus than was possible by simply
10
+ calling String#split on the input text. To add punctuation or end sentences properly (rather than
11
+ with a series of short, frequent prepositions or pronouns), the gem has to better understand how
12
+ words are used in context.
13
+
14
+ There are a number of excellent tokenizers available, such as the [tokenizer
15
+ gem](https://github.com/arbox/tokenizer), [Apache's OpenNLP](http://opennlp.apache.org/index.html),
16
+ and the [OpeNER Project](http://www.opener-project.eu/) -- if you're looking to do serious language processing, you should click on one of those links.
17
+
18
+ Tokeneyes is a learning exercise; text parsing is a rich, fun, and deceptive
19
+ problem -- you can quickly get 80% of the way to proper tokenization, but it's the other 20% of
20
+ language use that makes the difference between "amusingly off" and "passes the Turing test". Mine
21
+ doesn't and won't, but I've still enjoyed writing it and look forward to refining it further.
22
+
23
+ ## Installation
24
+
25
+ Add this line to your application's Gemfile:
26
+
27
+ ```ruby
28
+ gem 'tokeneyes'
29
+ ```
30
+
31
+ And then execute:
32
+
33
+ $ bundle
34
+
35
+ Or install it yourself as:
36
+
37
+ $ gem install tokeneyes
38
+
39
+ ## Usage
40
+
41
+ In a console session, you can run
42
+
43
+ ```ruby
44
+ tokenizer = Tokeneyes::Tokenizer.new(text_to_parse)
45
+ tokens = tokenizer.parse_into_words
46
+ ```
47
+
48
+ This will return an array of
49
+ [Tokeneyes::Word](https://github.com/arsduo/tokeneyes/tree/master/lib/tokeneyes/word.rb) objects,
50
+ each of which provides the text of the word, punctuation before and after (if applicable) and
51
+ whether the word ended or began a sentence (as I have somewhat arbitrarily defined the concept 😁).
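As an illustration of what those fields make possible, here is a small sketch that rebuilds readable text from a list of words. To keep it runnable without the gem installed, it uses a hypothetical `SketchWord` Struct (an assumption, not part of Tokeneyes) that mirrors the readers on `Tokeneyes::Word`: `text`, `punctuation_after`, and `ends_sentence?`. With the gem loaded, the array returned by `parse_into_words` should work in its place.

```ruby
# Hypothetical stand-in for Tokeneyes::Word, exposing the same readers
# (text, punctuation_after, ends_sentence?) so the sketch runs on its own.
SketchWord = Struct.new(:text, :punctuation_after, :ends_sentence) do
  def ends_sentence?
    ends_sentence
  end
end

# Joins words back together, re-attaching any trailing punctuation.
# punctuation_after may be nil, which interpolates as an empty string.
def rebuild_text(words)
  words.map { |word| "#{word.text}#{word.punctuation_after}" }.join(" ")
end

# Splits the word list after every word flagged as ending a sentence,
# then rebuilds each group into a string.
def sentences(words)
  words.slice_when { |word, _next_word| word.ends_sentence? }
       .map { |group| rebuild_text(group) }
end

words = [
  SketchWord.new("Hello", ",", false),
  SketchWord.new("world", "!", true),
  SketchWord.new("Bye", nil, true)
]

puts rebuild_text(words)
puts sentences(words).inspect
```

`slice_when` requires Ruby 2.2 or later, which matches the Ruby version listed in `.travis.yml`.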
52
+
53
+ ## Still to do
54
+
55
+ Several significant areas still need work:
56
+
57
+ * Capture periods at the end of a sentence
58
+ * Capture dividing punctuation that occurs after spaces (e.g. -, —, etc.)
59
+ * Capture ellipses and other multiple-character punctuation (e.g. ?!, --, etc.)
60
+ * Capture URLs as one word
61
+
62
+ Most of these should be doable by rewriting WordBuilder. Currently, a new WordBuilder is
63
+ initialized for each character; if we instead initialize one per word and pass it each new
64
+ character, letting it build up the word and set or clear punctuation as the word's format
65
+ changes, that should allow us to properly handle many of these cases.
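A rough sketch of the per-word approach described above. Everything here is hypothetical and not part of the gem: the `StreamingWordBuilder` class, its `push` and `finish` methods, and the `sketch_tokenize` driver are names invented for illustration, and the character classes are simplified versions of the ones in WordBuilder. It handles only multi-character trailing punctuation and a single internal apostrophe, hyphen, or period. Punctuation after spaces and URLs are left out.

```ruby
# Hypothetical sketch (not part of the gem): one builder per word, fed each
# character in turn, instead of a new builder per character.
class StreamingWordBuilder
  WORD_ELEMENTS = /[[:alnum:]_@\#]/   # simplified; the gem's pattern is broader
  POSSIBLE_WORD_ELEMENTS = /[.'\-]/
  PUNCTUATION = /[.,\-;!?]/

  attr_reader :word, :punctuation

  def initialize
    @word = +""
    @pending = +""       # a possible word element we have not committed yet
    @punctuation = +""   # trailing punctuation collected so far
    @finished = false
  end

  def finished?
    @finished
  end

  # Returns true if the character was consumed, false if it belongs to the
  # next word (a word character arriving after trailing punctuation).
  def push(char)
    if char.match(WORD_ELEMENTS)
      unless @punctuation.empty?
        @finished = true
        return false
      end
      # A pending apostrophe/hyphen/period turned out to be inside the word.
      @word << @pending << char
      @pending = +""
    elsif @word.empty?
      # Leading non-word characters are discarded.
    elsif char.match(POSSIBLE_WORD_ELEMENTS) && @pending.empty? && @punctuation.empty?
      @pending = char
    elsif char.match(PUNCTUATION)
      flush_pending
      @punctuation << char
    else
      finish
    end
    true
  end

  # Call at a boundary or at the end of the input.
  def finish
    flush_pending
    @finished = true
  end

  private

  # A pending character that never rejoined the word counts as punctuation
  # only if it is one of the tracked punctuation marks.
  def flush_pending
    @punctuation << @pending if @pending.match(PUNCTUATION)
    @pending = +""
  end
end

# Hypothetical driver returning [word, punctuation] pairs.
def sketch_tokenize(text)
  results = []
  builder = StreamingWordBuilder.new
  text.each_char do |char|
    consumed = builder.push(char)
    next unless builder.finished?
    results << [builder.word, builder.punctuation]
    builder = StreamingWordBuilder.new
    builder.push(char) unless consumed
  end
  builder.finish
  results << [builder.word, builder.punctuation] unless builder.word.empty?
  results
end

p sketch_tokenize("Wait... what?! It's here")
```

Because the builder lives for the whole word, it can accumulate `...` or `?!` as a single punctuation string, which a per-character builder cannot see.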
66
+
67
+ ## Development
68
+
69
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
70
+
71
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
72
+
73
+ ## Contributing
74
+
75
+ Bug reports and pull requests are welcome on GitHub at https://github.com/arsduo/tokeneyes. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
76
+
77
+ ## License
78
+
79
+ The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
data/Rakefile ADDED
@@ -0,0 +1,8 @@
1
+ require "bundler/gem_tasks"
2
+
3
+ task :default => :spec
4
+
5
+ require 'rspec/core/rake_task'
6
+ RSpec::Core::RakeTask.new do |t|
7
+ t.rspec_opts = ["--color", '--format doc']
8
+ end
data/bin/console ADDED
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "tokeneyes"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start
data/bin/setup ADDED
@@ -0,0 +1,7 @@
1
+ #!/bin/bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+
5
+ bundle install
6
+
7
+ # Do any other automated setup that you need to do here
data/changelog.md ADDED
@@ -0,0 +1,4 @@
1
+ v0.1.0
2
+ ======
3
+
4
+ * Basic tokenization of texts into Word objects with punctuation and sentence states
data/lib/tokeneyes/tokenizer.rb ADDED
@@ -0,0 +1,56 @@
1
+ require 'tokeneyes/word_reader'
2
+
3
+ module Tokeneyes
4
+ class Tokenizer
5
+ attr_reader :text
6
+ def initialize(text)
7
+ @text = text
8
+ end
9
+
10
+ def parse_into_words
11
+ extract_words_from_stream until text_stream.eof?
12
+ results
13
+ end
14
+
15
+ protected
16
+
17
+ def extract_words_from_stream(previous_word = nil)
18
+ unless text_stream.eof?
19
+ word = read_next_word(previous_word)
20
+ results.push(word) if word
21
+ extract_words_from_stream(word)
22
+ end
23
+ end
24
+
25
+ def read_next_word(previous_word)
26
+ current_word = word_reader.read_word
27
+ # If we finished the text and just have nonsense afterward (or have nothing in the entire
28
+ # text that's a word), ignore that.
29
+ if current_word.length > 0
30
+ populate_previous_punctuation(previous_word, current_word)
31
+ current_word
32
+ end
33
+ end
34
+
35
+ def results
36
+ @results ||= []
37
+ end
38
+
39
+ # Various metadata about what preceded a word can (only) be drawn from the previous word.
40
+ def populate_previous_punctuation(previous_word, current_word)
41
+ if !previous_word || previous_word.ends_sentence?
42
+ current_word.begins_sentence = true
43
+ else
44
+ current_word.punctuation_before = previous_word.punctuation_after
45
+ end
46
+ end
47
+
48
+ def word_reader
49
+ @word_reader ||= WordReader.new(text_stream)
50
+ end
51
+
52
+ def text_stream
53
+ @text_stream ||= StringIO.new(text)
54
+ end
55
+ end
56
+ end
data/lib/tokeneyes/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module Tokeneyes
2
+ VERSION = "0.1.0"
3
+ end
data/lib/tokeneyes/word.rb ADDED
@@ -0,0 +1,29 @@
1
+ module Tokeneyes
2
+ # Represents a word as read from the stream, with certain useful related metadata.
3
+ class Word
4
+ attr_reader :text
5
+ attr_accessor :punctuation_before, :punctuation_after
6
+ attr_writer :begins_sentence, :ends_sentence
7
+
8
+ def initialize(text)
9
+ @text = text
10
+ @begins_sentence = @ends_sentence = false
11
+ end
12
+
13
+ def begins_sentence?
14
+ @begins_sentence
15
+ end
16
+
17
+ def ends_sentence?
18
+ @ends_sentence
19
+ end
20
+
21
+ def to_s
22
+ text
23
+ end
24
+
25
+ def length
26
+ text.length
27
+ end
28
+ end
29
+ end
data/lib/tokeneyes/word_builder.rb ADDED
@@ -0,0 +1,94 @@
1
+ module Tokeneyes
2
+ # Given a word fragment and the next character in the stream, continue building the word until we
3
+ # hit a boundary.
4
+ class WordBuilder
5
+ # We track both the word so far and the previous character (which may be punctuation and not
6
+ # part of the word).
7
+ attr_reader :previous_char, :current_char, :word_so_far
8
+ def initialize(previous, current, word)
9
+ @current_char = current.to_s
10
+ @previous_char = previous.to_s
11
+ @word_so_far = word
12
+ end
13
+
14
+ # Definite word elements, those that can repeat as much as they want and always be words:
15
+ # alphanumeric characters (including European symbols, all the Unicode blocks). If anyone has expertise on non-European
16
+ # languages, I would love to add support for other character groups.
17
+ # We include @ and # to support Twitter mentions, hashtags, and email addresses.
18
+ WORD_ELEMENTS = /[\w\d\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF\@\#]/
19
+ # Defines a word boundary that also ends a unit of text.
20
+ SENTENCE_BOUNDARY = /[\.;\?\!]/
21
+ # Possible word elements, those that mark a word boundary unless they're followed by a word
22
+ # element:
23
+ POSSIBLE_WORD_ELEMENTS = /[\.'\-]/
24
+ # We don't track all possible punctuation, just some. (In particular, we don't track those that
25
+ # come in pairs, like parentheses and brackets, etc.)
26
+ MEANINGFUL_PUNCTUATION = /[\.,\-;\!\?]/
27
+ # Everything else represents a word boundary.
28
+
29
+ def word_finished?
30
+ # if the word hasn't actually started, we always continue it until we find something
31
+ return false if word_so_far.length == 0
32
+ word_did_indeed_terminate_previously? || !word_continues?
33
+ end
34
+
35
+ def character_to_add_to_word
36
+ if word_continues? && previous_character_was_possible_boundary?
37
+ # If the word does continue after a possible edge, that means that we should add both
38
+ # parts
39
+ "#{previous_char}#{current_char}"
40
+ elsif current_char_is_word_element?
41
+ current_char
42
+ else
43
+ # If the word is over (or hasn't yet begun), we have nothing to add; if it's possibly over, we'll find out next
44
+ # character. (If the string were to end on a possible boundary, that would indicate that
45
+ # it isn't actually part of the word anyway.)
46
+ ""
47
+ end
48
+ end
49
+
50
+ # Which punctuation ended the word?
51
+ def punctuation
52
+ return nil unless word_finished?
53
+ punctuation_candidate if punctuation_candidate.match(MEANINGFUL_PUNCTUATION)
54
+ end
55
+
56
+ def sentence_ended?
57
+ !!(punctuation && punctuation.match(SENTENCE_BOUNDARY))
58
+ end
59
+
60
+ protected
61
+
62
+ # If our last character was a possible word element but this one isn't, we're done.
63
+ def word_did_indeed_terminate_previously?
64
+ previous_character_was_possible_boundary? && !current_char_is_word_element?
65
+ end
66
+
67
+ def word_continues?
68
+ current_char_is_word_element? || current_char_is_possible_boundary?
69
+ end
70
+
71
+ def current_char_is_word_element?
72
+ current_char.match(WORD_ELEMENTS)
73
+ end
74
+
75
+ def previous_character_was_possible_boundary?
76
+ # it's not a possible word boundary if the word hasn't yet started
77
+ previous_char.match(POSSIBLE_WORD_ELEMENTS) && word_so_far.length > 0
78
+ end
79
+
80
+ def current_char_is_possible_boundary?
81
+ # If the previous character was also a boundary, this one can't be as well -- we've ended the
82
+ # word.
83
+ current_char.match(POSSIBLE_WORD_ELEMENTS) && !previous_character_was_possible_boundary?
84
+ end
85
+
86
+ def punctuation_candidate
87
+ if previous_character_was_possible_boundary?
88
+ previous_char
89
+ else
90
+ current_char
91
+ end
92
+ end
93
+ end
94
+ end
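To make the character classes in WordBuilder easier to reason about, the following sketch copies its four patterns into a hypothetical `CharacterClassSketch` module (the module and its `classes_for` helper are illustrative names, not part of the gem) and reports which classes a given character falls into. Note that a period belongs to three classes at once, which is why the builder needs the next character to decide what it means.

```ruby
# Hypothetical helper module; the patterns are copied verbatim from
# WordBuilder so that the sketch runs without the gem installed.
module CharacterClassSketch
  PATTERNS = {
    word_element: /[\w\d\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF\@\#]/,
    possible_word_element: /[\.'\-]/,
    meaningful_punctuation: /[\.,\-;\!\?]/,
    sentence_boundary: /[\.;\?\!]/
  }.freeze

  # Returns the names of every class the character matches, in the order
  # the patterns are listed above.
  def self.classes_for(char)
    PATTERNS.select { |_name, pattern| char.match(pattern) }.keys
  end
end

["a", "@", "\u00E9", ".", "-", "'", ",", "("].each do |char|
  puts "#{char.inspect}: #{CharacterClassSketch.classes_for(char).inspect}"
end
```

Characters that match none of the patterns, such as `(` or a plain space, are the ones the source comment describes as "everything else": they always mark a word boundary and are not recorded as punctuation.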
data/lib/tokeneyes/word_reader.rb ADDED
@@ -0,0 +1,41 @@
1
+ require "tokeneyes/word_builder"
2
+ require "tokeneyes/word"
3
+
4
+ module Tokeneyes
5
+ # The WordReader class will read a single word from a StringIO, advancing the IO stream until a
6
+ # word and subsequent boundary are reached (or the string runs out). It will return a Word object
7
+ # containing info on the word and its ending (the object receiving this data will be responsible
8
+ # for filling in any data about the previous state, if any).
9
+ class WordReader
10
+ attr_reader :text_stream
11
+ def initialize(text_stream)
12
+ @text_stream = text_stream
13
+ end
14
+
15
+ def read_word(previous_char = "", word = "")
16
+ current_char = text_stream.readchar
17
+ word_builder = WordBuilder.new(previous_char, current_char, word)
18
+ word += word_builder.character_to_add_to_word
19
+
20
+ # if we detect a word boundary but don't actually have a word yet, keep going -- that is,
21
+ # discard leading punctuation not attached to a word (e.g. x,,y or ^,y)
22
+ if text_stream.eof? || (word_builder.word_finished? && word.length > 0)
23
+ build_word(word, word_builder)
24
+ else
25
+ read_word(current_char, word)
26
+ end
27
+ end
28
+
29
+ protected
30
+
31
+ def build_word(word, word_builder)
32
+ Word.new(word).tap do |word_object|
33
+ # we don't set punctuation before, even if it's at the beginning (e.g. no previous words)
34
+ # -- that will be set based on the punctuation of the previous word in the class that reads
35
+ # in the whole text.
36
+ word_object.punctuation_after = word_builder.punctuation
37
+ word_object.ends_sentence = word_builder.sentence_ended? || text_stream.eof?
38
+ end
39
+ end
40
+ end
41
+ end
data/lib/tokeneyes.rb ADDED
@@ -0,0 +1,5 @@
1
+ require "tokeneyes/version"
2
+ require 'tokeneyes/tokenizer'
3
+
4
+ module Tokeneyes
5
+ end
data/tokeneyes.gemspec ADDED
@@ -0,0 +1,26 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'tokeneyes/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "tokeneyes"
8
+ spec.version = Tokeneyes::VERSION
9
+ spec.authors = ["Alex Koppel"]
10
+ spec.email = ["alex@alexkoppel.com"]
11
+
12
+ spec.summary = %q{A simple string tokenizer designed to capture punctuation and sentence flow information.}
13
+ spec.description = %q{A simple string tokenizer designed to capture punctuation and sentence flow information.}
14
+ spec.homepage = "https://github.com/arsduo/tokeneyes"
15
+ spec.license = "MIT"
16
+
17
+ spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
18
+ spec.bindir = "exe"
19
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
20
+ spec.require_paths = ["lib"]
21
+
22
+ spec.add_development_dependency "bundler", "~> 1.10"
23
+ spec.add_development_dependency "rake", "~> 10.0"
24
+ spec.add_development_dependency "rspec", "~> 3.3"
25
+ spec.add_development_dependency "faker"
26
+ end
metadata ADDED
@@ -0,0 +1,119 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: tokeneyes
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Alex Koppel
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2015-09-28 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bundler
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '1.10'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '1.10'
27
+ - !ruby/object:Gem::Dependency
28
+ name: rake
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '10.0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '10.0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: rspec
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - "~>"
46
+ - !ruby/object:Gem::Version
47
+ version: '3.3'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - "~>"
53
+ - !ruby/object:Gem::Version
54
+ version: '3.3'
55
+ - !ruby/object:Gem::Dependency
56
+ name: faker
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - ">="
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - ">="
67
+ - !ruby/object:Gem::Version
68
+ version: '0'
69
+ description: A simple string tokenizer designed to capture punctuation and sentence
70
+ flow information.
71
+ email:
72
+ - alex@alexkoppel.com
73
+ executables: []
74
+ extensions: []
75
+ extra_rdoc_files: []
76
+ files:
77
+ - ".gitignore"
78
+ - ".travis.yml"
79
+ - CODE_OF_CONDUCT.md
80
+ - Gemfile
81
+ - LICENSE.txt
82
+ - README.md
83
+ - Rakefile
84
+ - bin/console
85
+ - bin/setup
86
+ - changelog.md
87
+ - lib/tokeneyes.rb
88
+ - lib/tokeneyes/tokenizer.rb
89
+ - lib/tokeneyes/version.rb
90
+ - lib/tokeneyes/word.rb
91
+ - lib/tokeneyes/word_builder.rb
92
+ - lib/tokeneyes/word_reader.rb
93
+ - tokeneyes.gemspec
94
+ homepage: https://github.com/arsduo/tokeneyes
95
+ licenses:
96
+ - MIT
97
+ metadata: {}
98
+ post_install_message:
99
+ rdoc_options: []
100
+ require_paths:
101
+ - lib
102
+ required_ruby_version: !ruby/object:Gem::Requirement
103
+ requirements:
104
+ - - ">="
105
+ - !ruby/object:Gem::Version
106
+ version: '0'
107
+ required_rubygems_version: !ruby/object:Gem::Requirement
108
+ requirements:
109
+ - - ">="
110
+ - !ruby/object:Gem::Version
111
+ version: '0'
112
+ requirements: []
113
+ rubyforge_project:
114
+ rubygems_version: 2.4.5.1
115
+ signing_key:
116
+ specification_version: 4
117
+ summary: A simple string tokenizer designed to capture punctuation and sentence flow
118
+ information.
119
+ test_files: []