RubyGems - tokeneyes - Versions diffs - 0.1.0 - Mend

tokeneyes 0.1.0

Files changed (19)

checksums.yaml +7 -0
data/.gitignore +11 -0
data/.travis.yml +5 -0
data/CODE_OF_CONDUCT.md +13 -0
data/Gemfile +8 -0
data/LICENSE.txt +21 -0
data/README.md +79 -0
data/Rakefile +8 -0
data/bin/console +14 -0
data/bin/setup +7 -0
data/changelog.md +4 -0
data/lib/tokeneyes/tokenizer.rb +56 -0
data/lib/tokeneyes/version.rb +3 -0
data/lib/tokeneyes/word.rb +29 -0
data/lib/tokeneyes/word_builder.rb +94 -0
data/lib/tokeneyes/word_reader.rb +41 -0
data/lib/tokeneyes.rb +5 -0
data/tokeneyes.gemspec +26 -0
metadata +119 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: f648d93394449ac71d1d776d559e4b394edd08af
+  data.tar.gz: 8ee6a9db2a1bf74b9bf317e225379a463b436012
+SHA512:
+  metadata.gz: a94e9a12e5c9c301b791588593e858fc4ddb389fbe98b4e5dc80489819b0447ba2de289019a426c6075aea9c803f243269bcfcea3a440b18ff209f99c30a6f08
+  data.tar.gz: cecaa693773ea68d7211e82ff0c1a59fdd1fc875ab64bc98709d4acc1b6946ee01e5348d7bd7931da9853d0b6f303be6fe00b1bdc424c20174d359e905b93199

data/.gitignore ADDED

@@ -0,0 +1,11 @@
+/.bundle/
+/.yardoc
+/Gemfile.lock
+/_yardoc/
+/coverage/
+/doc/
+/pkg/
+/spec/reports/
+/tmp/
+.ruby-version
+.ruby-gemset

data/.travis.yml ADDED

@@ -0,0 +1,5 @@
+language: ruby
+rvm:
+  - 2.2.3
+  - jruby-9000
+before_install: gem install bundler -v 1.10.6

data/CODE_OF_CONDUCT.md ADDED

@@ -0,0 +1,13 @@
+# Contributor Code of Conduct
+As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
+We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.
+Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.
+Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.
+Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.
+This Code of Conduct is adapted from the [Contributor Covenant](http://contributor-covenant.org), version 1.0.0, available at [http://contributor-covenant.org/version/1/0/0/](http://contributor-covenant.org/version/1/0/0/)

data/Gemfile ADDED

@@ -0,0 +1,8 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in tokeneyes.gemspec
+gemspec
+group :test do
+  gem "codeclimate-test-reporter", require: nil
+end

data/LICENSE.txt ADDED

@@ -0,0 +1,21 @@
+The MIT License (MIT)
+Copyright (c) 2015 Alex Koppel
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,79 @@
+# Tokeneyes
+A string tokenizer designed to capture words with associated punctuation and sentence flow
+information (e.g. if they start or end a sentence).
+### Why write a tokenizer?
+As I was writing [markovian](https://github.com/arsduo/markovian), I realized that the Markov text
+generator needed significantly more information about the corpus than was possible by simply
+calling String#split on the input text. To add punctuation or end sentences properly (rather than
+with a series of short, frequent prepositions or pronouns), the gem has to better understand how
+words are used in context.
+There are a number of excellent tokenizers available, such as the [tokenizer
+gem](https://github.com/arbox/tokenizer), [Apache's OpenNLP](http://opennlp.apache.org/index.html),
+and the [OpeNER Project](http://www.opener-project.eu/) -- if you're looking to do serious language processing, you should click on one of those links.
+Tokeneyes is a learning exercise; text parsing is a rich, fun, and deceptive
+problem -- you can quickly get 80% of the way to proper tokenization, but it's the other 20% of
+language use that makes the difference between "amusingly off" and "passes the Turing test". Mine
+doesn't and won't, but I've still enjoyed writing it and look forward to refining it further.
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'tokeneyes'
+```
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install tokeneyes
+## Usage
+In a console session, you can run
+```ruby
+tokenizer = Tokeneyes::Tokenizer.new(text_to_parse)
+tokens = tokenizer.parse_into_words
+```
+This will return an array of
+[Tokeneyes::Word](https://github.com/arsduo/tokeneyes/tree/master/lib/tokeneyes/word.rb) objects,
+each of which provides the text of the word, punctuation before and after (if applicable) and
+whether the word ended or began a sentence (as I have somewhat arbitrarily defined the concept 😁).
+## Still to do
+There are several significant areas left to do:
+* Capture periods at the end of a sentence
+* Capture dividing punctuation that occurs after spaces (e.g. -, —, etc.)
+* Capture ellipses and other multiple-character punctuation (e.g. ?!, --, etc.)
+* Capture URLs as one word
+Most of these should be doable by rewriting WordBuilder. Currently, a new WordBuilder is
+initialized for each character; if we instead initialize one per word and pass it each new
+character (so that it builds up the word and sets or clears punctuation as the word's format
+changes), we should be able to handle many of these cases properly.
+## Development
+After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+## Contributing
+Bug reports and pull requests are welcome on GitHub at https://github.com/arsduo/tokeneyes. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
+## License
+The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
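
Note on the README's "Why write a tokenizer?" section: the limitation of `String#split` it describes can be seen in a few lines. This is only a minimal sketch; `TRAILING_PUNCTUATION`, `naive_tokens`, and `split_with_punctuation` are hypothetical names made up for illustration and are not part of tokeneyes.

```ruby
# Hypothetical pattern (not from tokeneyes): a token body followed by
# zero or more trailing punctuation marks.
TRAILING_PUNCTUATION = /\A(.+?)([.,;!?]*)\z/

# String#split with no arguments separates on whitespace only, so any
# punctuation stays glued to the neighbouring word.
def naive_tokens(text)
  text.split
end

# Illustrative post-processing: peel trailing punctuation off each token
# and return [word, punctuation] pairs.
def split_with_punctuation(text)
  naive_tokens(text).map do |token|
    match = TRAILING_PUNCTUATION.match(token)
    [match[1], match[2]]
  end
end

p naive_tokens("Hello, world. How are you?")
# => ["Hello,", "world.", "How", "are", "you?"]
p split_with_punctuation("Hello, world. How are you?")
# => [["Hello", ","], ["world", "."], ["How", ""], ["are", ""], ["you", "?"]]
```

Even with the punctuation peeled off, nothing here records whether a word begins or ends a sentence, which is the extra context the README says a Markov generator needs.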

data/Rakefile ADDED

@@ -0,0 +1,8 @@
+require "bundler/gem_tasks"
+task :default => :spec
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new do |t|
+  t.rspec_opts = ["--color", '--format doc']
+end

data/bin/console ADDED

@@ -0,0 +1,14 @@
+#!/usr/bin/env ruby
+require "bundler/setup"
+require "tokeneyes"
+# You can add fixtures and/or initialization code here to make experimenting
+# with your gem easier. You can also use a different console, if you like.
+# (If you use this, don't forget to add pry to your Gemfile!)
+# require "pry"
+# Pry.start
+require "irb"
+IRB.start

data/bin/setup ADDED

@@ -0,0 +1,7 @@
+#!/bin/bash
+set -euo pipefail
+IFS=$'\n\t'
+bundle install
+# Do any other automated setup that you need to do here

data/changelog.md ADDED

@@ -0,0 +1,4 @@
+v0.1.0
+======
+* Basic tokenization of texts into Word objects with punctuation and sentence states

data/lib/tokeneyes/tokenizer.rb ADDED

@@ -0,0 +1,56 @@
+require 'stringio'
+require 'tokeneyes/word_reader'
+module Tokeneyes
+  class Tokenizer
+    attr_reader :text
+    def initialize(text)
+      @text = text
+    end
+    def parse_into_words
+      extract_words_from_stream until text_stream.eof?
+      results
+    end
+    protected
+    def extract_words_from_stream(previous_word = nil)
+      unless text_stream.eof?
+        word = read_next_word(previous_word)
+        results.push(word) if word
+        extract_words_from_stream(word)
+      end
+    end
+    def read_next_word(previous_word)
+      current_word = word_reader.read_word
+      # If we finished the text and just have nonsense afterward (or have nothing in the entire
+      # text that's a word), ignore that.
+      if current_word.length > 0
+        populate_previous_punctuation(previous_word, current_word)
+        current_word
+      end
+    end
+    def results
+      @results ||= []
+    end
+    # Various metadata about what preceded a word can (only) be drawn from the previous word.
+    def populate_previous_punctuation(previous_word, current_word)
+      if !previous_word || previous_word.ends_sentence?
+        current_word.begins_sentence = true
+      else
+        current_word.punctuation_before = previous_word.punctuation_after
+      end
+    end
+    def word_reader
+      @word_reader ||= WordReader.new(text_stream)
+    end
+    def text_stream
+      @text_stream ||= StringIO.new(text)
+    end
+  end
+end
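
Note on `populate_previous_punctuation`: the rule it applies (a word begins a sentence when there is no previous word or the previous word ended one; otherwise it inherits the previous word's trailing punctuation) can be sketched with a plain loop. `SketchWord` and `link_words` are hypothetical stand-ins written for this note, not the gem's classes.

```ruby
# Stand-in for Tokeneyes::Word, for illustration only.
SketchWord = Struct.new(:text, :punctuation_before, :punctuation_after,
                        :begins_sentence, :ends_sentence)

# Walks the words in order, carrying the previous word along so each word
# can pick up its "before" metadata from its predecessor.
def link_words(words)
  previous = nil
  words.each do |word|
    if previous.nil? || previous.ends_sentence
      word.begins_sentence = true
    else
      word.punctuation_before = previous.punctuation_after
    end
    previous = word
  end
end

words = [
  SketchWord.new("Hello", nil, ",", false, false),
  SketchWord.new("world", nil, ".", false, true),
  SketchWord.new("Bye",   nil, nil, false, true)
]
link_words(words)
words.each { |w| p [w.text, w.begins_sentence, w.punctuation_before] }
```

A loop like this also keeps the call stack flat. The released `extract_words_from_stream` calls itself once per word, so the stack depth grows with the length of the text, which may matter for very long inputs.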

data/lib/tokeneyes/version.rb ADDED

@@ -0,0 +1,3 @@
+module Tokeneyes
+  VERSION = "0.1.0"
+end

data/lib/tokeneyes/word.rb ADDED

@@ -0,0 +1,29 @@
+module Tokeneyes
+  # Represents a word as read from the stream, with certain useful related metadata.
+  class Word
+    attr_reader :text
+    attr_accessor :punctuation_before, :punctuation_after
+    attr_writer :begins_sentence, :ends_sentence
+    def initialize(text)
+      @text = text
+      @begins_sentence = @ends_sentence = false
+    end
+    def begins_sentence?
+      @begins_sentence
+    end
+    def ends_sentence?
+      @ends_sentence
+    end
+    def to_s
+      text
+    end
+    def length
+      text.length
+    end
+  end
+end

data/lib/tokeneyes/word_builder.rb ADDED

@@ -0,0 +1,94 @@
+module Tokeneyes
+  # Given a word fragment and the next character in the stream, continue building the word until we
+  # hit a boundary.
+  class WordBuilder
+    # We track both the word so far and the previous character (which may be punctuation and not
+    # part of the word).
+    attr_reader :previous_char, :current_char, :word_so_far
+    def initialize(previous, current, word)
+      @current_char = current.to_s
+      @previous_char = previous.to_s
+      @word_so_far = word
+    end
+    # Definite word elements, those that can repeat as much as they want and always be words:
+    # alphanumeric characters (including European symbols, all the Unicode blocks). If anyone has expertise on non-European
+    # languages, I would love to add support for other character groups.
+    # We include @ and # to support Twitter mentions, hashtags, and email addresses.
+    WORD_ELEMENTS = /[\w\d\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF\@\#]/
+    # Defines a word boundary that also ends a unit of text.
+    SENTENCE_BOUNDARY = /[\.;\?\!]/
+    # Possible word elements, those that mark a word boundary unless they're followed by a word
+    # element:
+    POSSIBLE_WORD_ELEMENTS = /[\.'\-]/
+    # We don't track all possible punctuation, just some. (In particular, we don't track those that
+    # come in pairs, like parentheses and brackets, etc.)
+    MEANINGFUL_PUNCTUATION = /[\.,\-;\!\?]/
+    # Everything else represents a word boundary.
+    def word_finished?
+      # if the word hasn't actually started, we always continue it until we find something
+      return false if word_so_far.length == 0
+      word_did_indeed_terminate_previously? || !word_continues?
+    end
+    def character_to_add_to_word
+      if word_continues? && previous_character_was_possible_boundary?
+        # If the word does continue after a possible edge, that means that we should add both
+        # parts
+        "#{previous_char}#{current_char}"
+      elsif current_char_is_word_element?
+        current_char
+      else
+        # If the word is over (or hasn't yet begun), we have nothing to add; if it's possibly over, we'll find out next
+        # character. (If the string were to end on a possible boundary, that would indicate that
+        # it isn't actually part of the word anyway.)
+        ""
+      end
+    end
+    # Which punctuation ended the word?
+    def punctuation
+      return nil unless word_finished?
+      punctuation_candidate if punctuation_candidate.match(MEANINGFUL_PUNCTUATION)
+    end
+    def sentence_ended?
+      !!(punctuation && punctuation.match(SENTENCE_BOUNDARY))
+    end
+    protected
+    # If our last character was a possible word element but this one isn't, we're done.
+    def word_did_indeed_terminate_previously?
+      previous_character_was_possible_boundary? && !current_char_is_word_element?
+    end
+    def word_continues?
+      current_char_is_word_element? || current_char_is_possible_boundary?
+    end
+    def current_char_is_word_element?
+      current_char.match(WORD_ELEMENTS)
+    end
+    def previous_character_was_possible_boundary?
+      # it's not a possible word boundary if the word hasn't yet started
+      previous_char.match(POSSIBLE_WORD_ELEMENTS) && word_so_far.length > 0
+    end
+    def current_char_is_possible_boundary?
+      # If the previous character was also a boundary, this one can't be as well -- we've ended the
+      # word.
+      current_char.match(POSSIBLE_WORD_ELEMENTS) && !previous_character_was_possible_boundary?
+    end
+    def punctuation_candidate
+      if previous_character_was_possible_boundary?
+        previous_char
+      else
+        current_char
+      end
+    end
+  end
+end
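
Note on the four character classes in `WordBuilder`: a quick way to see how they overlap is to run single characters through each pattern. The regular expressions below are copied from the file above; the `CharacterClassSketch` module and its `classify` method are hypothetical wrappers added only for this illustration.

```ruby
module CharacterClassSketch
  # Patterns copied verbatim from word_builder.rb.
  WORD_ELEMENTS          = /[\w\d\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF\@\#]/
  SENTENCE_BOUNDARY      = /[\.;\?\!]/
  POSSIBLE_WORD_ELEMENTS = /[\.'\-]/
  MEANINGFUL_PUNCTUATION = /[\.,\-;\!\?]/

  CLASSES = {
    word_element:           WORD_ELEMENTS,
    possible_word_element:  POSSIBLE_WORD_ELEMENTS,
    sentence_boundary:      SENTENCE_BOUNDARY,
    meaningful_punctuation: MEANINGFUL_PUNCTUATION
  }

  # Returns the names of every class the character matches.
  def self.classify(char)
    CLASSES.select { |_name, pattern| char =~ pattern }.keys
  end
end

["a", "@", ".", "'", "-", ",", ";", " ", "\u2014"].each do |char|
  p [char, CharacterClassSketch.classify(char)]
end
```

Two things stand out when this runs. A period belongs to three classes at once (possible word element, sentence boundary, meaningful punctuation), which is why the builder has to look at the following character before deciding. And because the `\u00A0-\uD7FF` range is so broad, U+2014 (the em dash) is classified as a word element, which may be related to the "dividing punctuation" item in the README's to-do list.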

data/lib/tokeneyes/word_reader.rb ADDED

@@ -0,0 +1,41 @@
+require "tokeneyes/word_builder"
+require "tokeneyes/word"
+module Tokeneyes
+  # The WordReader class will read a single word from a StringIO, advancing the IO stream until a
+  # word and subsequent boundary are reached (or the string runs out). It will return a Word object
+  # containing info on the word and its ending (the object receiving this data will be responsible
+  # for filling in any data about the previous state, if any).
+  class WordReader
+    attr_reader :text_stream
+    def initialize(text_stream)
+      @text_stream = text_stream
+    end
+    def read_word(previous_char = "", word = "")
+      current_char = text_stream.readchar
+      word_builder = WordBuilder.new(previous_char, current_char, word)
+      word += word_builder.character_to_add_to_word
+      # if we detect a word boundary but don't actually have a word yet, keep going -- that is,
+      # discard leading punctuation not attached to a word (e.g. x,,y or ^,y)
+      if text_stream.eof? || (word_builder.word_finished? && word.length > 0)
+        build_word(word, word_builder)
+      else
+        read_word(current_char, word)
+      end
+    end
+    protected
+    def build_word(word, word_builder)
+      Word.new(word).tap do |word_object|
+        # we don't set punctuation before, even if it's at the beginning (e.g. no previous words)
+        # -- that will be set based on the punctuation of the previous word in the class that reads
+        # in the whole text.
+        word_object.punctuation_after = word_builder.punctuation
+        word_object.ends_sentence = word_builder.sentence_ended? || text_stream.eof?
+      end
+    end
+  end
+end
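
Note on `read_word`: each step hands `WordBuilder` the previous character and the current one, read from a `StringIO`. A minimal sketch of that character-by-character walk, using only the standard library, might look like the following; `each_char_pair` is a hypothetical helper written for this note, and it uses a loop where the gem recurses.

```ruby
require "stringio"

# Collects [previous_char, current_char] pairs, the same two inputs that
# read_word passes to WordBuilder on every step.
def each_char_pair(text)
  stream = StringIO.new(text)
  previous = ""
  pairs = []
  # Check eof? before reading: StringIO#readchar raises EOFError at the end.
  until stream.eof?
    current = stream.readchar
    pairs << [previous, current]
    previous = current
  end
  pairs
end

p each_char_pair("a.b")
# => [["", "a"], ["a", "."], [".", "b"]]
```

The `eof?` guard matters because `readchar` raises `EOFError` at the end of the stream; in the gem, that guard lives in `Tokenizer#extract_words_from_stream` and in the `text_stream.eof?` check inside `read_word`.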

data/lib/tokeneyes.rb ADDED

@@ -0,0 +1,5 @@
+require "tokeneyes/version"
+require 'tokeneyes/tokenizer'
+module Tokeneyes
+end

data/tokeneyes.gemspec ADDED

@@ -0,0 +1,26 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'tokeneyes/version'
+Gem::Specification.new do |spec|
+  spec.name          = "tokeneyes"
+  spec.version       = Tokeneyes::VERSION
+  spec.authors       = ["Alex Koppel"]
+  spec.email         = ["alex@alexkoppel.com"]
+  spec.summary       = %q{A simple string tokenizer designed to capture punctuation and sentence flow information.}
+  spec.description   = %q{A simple string tokenizer designed to capture punctuation and sentence flow information.}
+  spec.homepage      = "https://github.com/arsduo/tokeneyes"
+  spec.license       = "MIT"
+  spec.files         = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
+  spec.bindir        = "exe"
+  spec.executables   = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+  spec.require_paths = ["lib"]
+  spec.add_development_dependency "bundler", "~> 1.10"
+  spec.add_development_dependency "rake", "~> 10.0"
+  spec.add_development_dependency "rspec", "~> 3.3"
+  spec.add_development_dependency "faker"
+end

metadata ADDED

@@ -0,0 +1,119 @@
+--- !ruby/object:Gem::Specification
+name: tokeneyes
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Alex Koppel
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2015-09-28 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.10'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.10'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.3'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.3'
+- !ruby/object:Gem::Dependency
+  name: faker
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+description: A simple string tokenizer designed to capture punctuation and sentence
+  flow information.
+email:
+- alex@alexkoppel.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- ".gitignore"
+- ".travis.yml"
+- CODE_OF_CONDUCT.md
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- bin/console
+- bin/setup
+- changelog.md
+- lib/tokeneyes.rb
+- lib/tokeneyes/tokenizer.rb
+- lib/tokeneyes/version.rb
+- lib/tokeneyes/word.rb
+- lib/tokeneyes/word_builder.rb
+- lib/tokeneyes/word_reader.rb
+- tokeneyes.gemspec
+homepage: https://github.com/arsduo/tokeneyes
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.4.5.1
+signing_key:
+specification_version: 4
+summary: A simple string tokenizer designed to capture punctuation and sentence flow
+  information.
+test_files: []