tactful_tokenizer 0.0.1

Sign up to get free protection for your applications and to get access to all the features.
data.tar.gz.sig ADDED
@@ -0,0 +1,4 @@
1
+ 1���8����&�֭�D����0�SLI�x�r1e��(|?;j��_y���r���y���@[��z�-��<)�� �zt�����K�<,�����u��Xxᯱ(�V�����!��J�,�
2
+ b����N�!y�B�*{z��=��Jh���D�Z��O������dI�m�F/
3
+ �p�YD~�4h�i�vV*y�!��QR���8ӕ�g���Ό��ʧ�1Hߚ�F
4
+ q��|���� v7G"xE=��)�#��.��
data/Manifest ADDED
@@ -0,0 +1,8 @@
1
+ Manifest
2
+ README.rdoc
3
+ Rakefile
4
+ lib/models/features.mar
5
+ lib/models/lower_words.mar
6
+ lib/models/non_abbrs.mar
7
+ lib/tactful_tokenizer.rb
8
+ lib/word_tokenizer.rb
data/README.rdoc ADDED
@@ -0,0 +1,26 @@
1
+ = TactfulTokenizer
2
+
3
+ TactfulTokenizer is a Ruby library for high quality sentence
4
+ tokenization. It uses a Naive Bayesian statistical model, and
5
+ is based on Splitta[http://code.google.com/p/splitta/], but
6
+ has support for '?' and '!' as well as primitive handling of
7
+ XHTML markup. Better support for XHTML parsing is coming shortly.
8
+
9
+ == Usage
10
+
11
+ require "tactful_tokenizer"
12
+ m = TactfulTokenizer::Model.new
13
+ m.tokenize_text("Here in the U.S. Senate we prefer to eat our friends. Is it easier that way? <em>Yes.</em> <em>Maybe</em>!")
14
+ #=> ["Here in the U.S. Senate we prefer to eat our friends.", "Is it easier that way?", "<em>Yes.</em>", "<em>Maybe</em>!"]
15
+
16
+ The input text is expected to consist of paragraphs delimited
17
+ by line breaks.
18
+
19
+ == Installation
20
+ git clone http://github.com/SlyShy/Tactful_Tokenizer.git
21
+ gem install andand
22
+
23
+ == Author
24
+
25
+ Copyright (c) 2010 Matthew Bunday. All rights reserved.
26
+ Released under the {GNU GPL v3}[http://www.gnu.org/licenses/gpl.html].
data/Rakefile ADDED
@@ -0,0 +1,12 @@
1
require 'rubygems'
require 'rake'
require 'echoe'

# Gem packaging tasks provided by Echoe (`rake manifest`, `rake release`, ...).
Echoe.new('tactful_tokenizer', '0.0.1') do |gem|
  gem.description = "A high accuracy naive bayesian sentence tokenizer based on Splitta."
  gem.url = "http://github.com/SlyShy/Tactful_Tokenizer"
  gem.author = "Matthew Bunday"
  gem.email = "mkbunday @nospam@ gmail.com"
  gem.ignore_pattern = ["tmp/*", "script/*"]
  gem.development_dependencies = []
end
Binary file
Binary file
Binary file
@@ -0,0 +1,210 @@
1
+ # TactfulTokenizer is a Ruby library for high quality sentence
2
+ # tokenization. It uses a Naive Bayesian statistical model, and
3
+ # is based on Splitta[http://code.google.com/p/splitta/]. But
4
+ # has support for '?' and '!' as well as primitive handling of
5
+ # XHTML markup. Better support for XHTML parsing is coming shortly.
6
+ #
7
+ # Example usage:
8
+ #
9
+ # require "tactful_tokenizer"
10
+ # m = TactfulTokenizer::Model.new
11
+ # m.tokenize_text("Here in the U.S. Senate we prefer to eat our friends. Is it easier that way, really? Yes.")
12
+ # #=> ["Here in the U.S. Senate we prefer to eat our friends.", "Is it easier that way, really?", "Yes."]
13
+ #
14
+ # The input text is expected to consist of paragraphs delimited
15
+ # by line breaks.
16
+ #
17
+ # Author:: Matthew Bunday (mailto:mkbunday@gmail.com)
18
+ # License:: GNU General Public License v3
19
+
20
+ require "word_tokenizer.rb"
21
+ include WordTokenizer
22
+
23
+ #--
24
+ ####### Performance TODOs.
25
+ # TODO: Use inline C where necessary?
26
+ # TODO: Use RE2 regexp extension.
27
+ #++
28
+
29
+ module TactfulTokenizer
30
+
31
# Basic String extensions.
String.class_eval do
  # True when the string contains no non-alphabetic characters.
  # NOTE: returns true for the empty string (nothing non-alpha to match).
  def is_alphabetic?
    !/[^[:alpha:]]/.match(self)
  end

  # True when the string is entirely upper case.
  # Fixed to return a real boolean: the old version returned the *strings*
  # 'true'/'false', both of which are truthy, so it was useless as a
  # predicate. Interpolated feature keys such as "w2cap_#{...}" are
  # unaffected, because booleans interpolate to the same "true"/"false".
  def is_upper_case?
    self == upcase
  end
end
46
+
47
# A model stores normalized probabilities of different features occurring.
class Model

  # Initialize the model. feats, lower_words, and non_abbrs
  # indicate the locations of the respective Marshal dumps.
  def initialize(feats = "#{File.dirname(__FILE__)}/models/features.mar",
                 lower_words = "#{File.dirname(__FILE__)}/models/lower_words.mar",
                 non_abbrs = "#{File.dirname(__FILE__)}/models/non_abbrs.mar")
    @feats, @lower_words, @non_abbrs = [feats, lower_words, non_abbrs].map do |file|
      File.open(file) { |f| Marshal.load(f.read) }
    end
    # Prior for a sentence break, raised to the 4th power (matches the
    # original training setup -- TODO confirm against Splitta).
    @p0 = @feats["<prior>"]**4
  end

  # feats       = {feature => normalized probability of feature}
  # lower_words = {token => log count of occurrences in lower case}
  # non_abbrs   = {token => log count of occurrences when not an abbreviation}
  attr_accessor :feats, :lower_words, :non_abbrs

  # This function is the only one that'll end up being used.
  #   m = TactfulTokenizer::Model.new
  #   m.tokenize_text("Hey, are these two sentences? I bet they should be.")
  #   #=> ["Hey, are these two sentences?", "I bet they should be."]
  def tokenize_text(text)
    data = Doc.new(text)
    featurize(data)
    classify(data)
    data.segment
  end

  # Assign a prediction (probability, to be precise) to each sentence fragment.
  # For each feature in each fragment we hunt up the normalized probability and
  # multiply. This is a fairly straightforward Bayesian probabilistic algorithm.
  def classify(doc)
    doc.frags.each do |frag|
      probs = @p0
      frag.features.each { |feat| probs *= @feats[feat] }
      frag.pred = probs / (probs + 1)
    end
  end

  # Get the features of every fragment.
  def featurize(doc)
    doc.frags.each { |frag| get_features(frag, self) }
  end

  # Finds the features in a text fragment of the form:
  #   ... w1. (sb?) w2 ...
  # Features listed in rough order of importance:
  # * w1: a word that includes a period.
  # * w2: the next word, if it exists.
  # * w1length: the number of alphabetic characters in w1.
  # * both: w1 and w2 taken together.
  # * w1abbr: logarithmic count of w1 occurring without a period.
  # * w2lower: logarithmic count of w2 occurring lowercased.
  def get_features(frag, model)
    w1 = frag.cleaned.last || ''
    w2 = frag.next || ''

    frag.features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]

    unless w2.empty?
      if w1.chop.is_alphabetic?
        frag.features.push "w1length_#{[10, w1.length].min}"
        frag.features.push "w1abbr_#{model.non_abbrs[w1.chop]}"
      end

      if w2.chop.is_alphabetic?
        frag.features.push "w2cap_#{w2[0].is_upper_case?}"
        frag.features.push "w2lower_#{model.lower_words[w2.downcase]}"
      end
    end
  end
end
129
+
130
# A document represents the input text. It holds a list of fragments generated
# from the text.
class Doc
  # List of fragments.
  attr_accessor :frags

  # Receives a text, which is then broken into fragments.
  # A fragment ends with a period, question mark, or exclamation mark followed
  # possibly by right handed punctuation like quotation marks or closing braces
  # and trailing whitespace. Failing that, it'll accept something like "I hate cheese\n"
  # No, it doesn't have a period, but that's the end of the paragraph.
  #
  # Input assumption: Paragraphs delimited by line breaks.
  #
  # Fix: removed leftover debug output (`puts "Hey!"` / `puts text.inspect`)
  # that was printed to stdout on every construction.
  def initialize(text)
    @frags = []
    text.each_line do |line|
      next if line.strip.empty?
      # split keeps each sentence-final separator via the capture group;
      # the non-captured pieces between matches are empty and skipped.
      line.split(/(.*?[.!?](?:["')\]}]|(?:<.*>))*[\s])/).each do |piece|
        next if piece.strip.empty?
        frag = Frag.new(piece)
        # Link the previous fragment to this fragment's first word so the
        # classifier can use the following word as a feature.
        @frags.last.next = frag.cleaned.first unless @frags.empty?
        @frags.push frag
      end
    end
  end

  # Segments the text. More precisely, it reassembles the fragments into sentences.
  # We call something a sentence whenever it is more likely to be a sentence than not.
  def segment
    sents, sent = [], []
    thresh = 0.5

    @frags.each do |frag|
      sent.push(frag.orig)
      if frag.pred > thresh
        break if frag.orig.nil?
        sents.push(sent.join('').strip)
        sent = []
      end
    end
    sents
  end
end
179
+
180
# A fragment is a potential sentence, but is based only on the existence of a period.
# The text "Here in the U.S. Senate we prefer to devour our friends." will be split
# into "Here in the U.S." and "Senate we prefer to devour our friends."
class Frag

  # orig     = The original text of the fragment.
  # next     = The next word following the fragment.
  # cleaned  = Array of the fragment's words after cleaning.
  # pred     = Probability that the fragment is a sentence.
  # features = Array of the fragment's features.
  attr_accessor :orig, :next, :cleaned, :pred, :features

  # Create a new fragment, keeping the raw text and a cleaned word list.
  def initialize(orig = '')
    @orig = orig
    clean(orig)
    @next = @pred = @features = nil
  end

  # Normalizes numbers and discards ambiguous punctuation. And then splits into an
  # array, because realistically only the last and first words are ever accessed.
  def clean(s)
    scratch = String.new(s)
    tokenize(scratch)                                 # in-place word tokenization
    scratch.gsub!(/[.,\d]*\d/, '<NUM>')               # collapse numeric tokens
    scratch.gsub!(/[^a-zA-Z0-9,.;:<>\-'\/$% ]/, '')   # drop ambiguous punctuation
    scratch.gsub!('--', ' ')
    @cleaned = scratch.split
  end
end
210
+ end
@@ -0,0 +1,51 @@
1
# In-place word tokenizer: applies an ordered list of regex rewrites to a
# string, Penn-Treebank style, so punctuation is separated from words.
module WordTokenizer
  @@tokenize_regexps = [
    # Uniform Quotes
    [/''|``/, '"'],

    # Separate punctuation (except for periods) from words.
    [/(^|\s)(')/, '\1\2'],
    [/(?=[\("`{\[:;&#*@])(.)/, '\1 '],

    [/(.)(?=[?!\)";}\]*:@'])|(?=[\)}\]])(.)|(.)(?=[({\[])|((^|\s)-)(?=[^-])/, '\1 '],

    # Treat double-hyphen as a single token.
    [/([^-])(--+)([^-])/, '\1 \2 \3'],
    [/(\s|^)(,)(?=(\S))/, '\1\2 '],

    # Only separate a comma if a space follows.
    [/(.)(,)(\s|$)/, '\1 \2\3'],

    # Combine dots separated by whitespace to be a single token.
    [/\.\s\.\s\./, '...'],

    # Separate "No.6"
    [/([a-zA-Z]\.)(\d+)/, '\1 \2'],

    # Separate words from ellipses
    [/([^\.]|^)(\.{2,})(.?)/, '\1 \2 \3'],
    # NOTE(review): the next two rules share one pattern but differ in
    # replacement; after the first rewrites a span the second rarely
    # re-matches it. Kept byte-for-byte to preserve behavior -- verify
    # intent before consolidating.
    [/(^|\s)(\.{2,})([^\.\s])/, '\1\2 \3'],
    [/(^|\s)(\.{2,})([^\.\s])/, '\1 \2\3'],

    ##### Some additional fixes.

    # Fix %, $, &
    [/(\d)%/, '\1 %'],
    [/\$(\.?\d)/, '$ \1'],
    [/(\w)& (\w)/, '\1&\2'],
    [/(\w\w+)&(\w\w+)/, '\1 & \2'],

    # Fix (n 't) -> ( n't)
    [/n 't( |$)/, " n't\\1"],
    [/N 'T( |$)/, " N'T\\1"],

    # Treebank tokenizer special words
    [/([Cc])annot/, '\1an not']
  ]

  # Applies every rule in order, mutating +s+ in place.
  # (Removed the dead `rules = []` local that was shadowed by the block
  # parameter; the rule pairs are now destructured for clarity.)
  def tokenize(s)
    @@tokenize_regexps.each { |pattern, replacement| s.gsub!(pattern, replacement) }
  end
end
@@ -0,0 +1,32 @@
1
# -*- encoding: utf-8 -*-

# Gemspec for tactful_tokenizer. Cleaned up: the generated version-check
# boilerplate contained only empty if/else branches and an unused local,
# and its dead branch referenced Gem::RubyGemsVersion, which no longer
# exists in modern RubyGems.
Gem::Specification.new do |s|
  s.name = %q{tactful_tokenizer}
  s.version = "0.0.1"

  s.required_rubygems_version = Gem::Requirement.new(">= 1.2") if s.respond_to? :required_rubygems_version=
  s.authors = ["Matthew Bunday"]
  s.cert_chain = ["/home/slyshy/.ssh/gem-public_cert.pem"]
  s.date = %q{2010-03-23}
  s.description = %q{A high accuracy naive bayesian sentence tokenizer based on Splitta.}
  s.email = %q{mkbunday @nospam@ gmail.com}
  s.extra_rdoc_files = ["README.rdoc", "lib/models/features.mar", "lib/models/lower_words.mar", "lib/models/non_abbrs.mar", "lib/tactful_tokenizer.rb", "lib/word_tokenizer.rb"]
  s.files = ["Manifest", "README.rdoc", "Rakefile", "lib/models/features.mar", "lib/models/lower_words.mar", "lib/models/non_abbrs.mar", "lib/tactful_tokenizer.rb", "lib/word_tokenizer.rb", "tactful_tokenizer.gemspec"]
  s.homepage = %q{http://github.com/SlyShy/Tactful_Tokenizer}
  s.rdoc_options = ["--line-numbers", "--inline-source", "--title", "Tactful_tokenizer", "--main", "README.rdoc"]
  s.require_paths = ["lib"]
  s.rubyforge_project = %q{tactful_tokenizer}
  s.rubygems_version = %q{1.3.6}
  s.signing_key = %q{/home/slyshy/.ssh/gem-private_key.pem}
  s.summary = %q{A high accuracy naive bayesian sentence tokenizer based on Splitta.}

  s.specification_version = 3 if s.respond_to? :specification_version
end
metadata ADDED
@@ -0,0 +1,102 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: tactful_tokenizer
3
+ version: !ruby/object:Gem::Version
4
+ prerelease: false
5
+ segments:
6
+ - 0
7
+ - 0
8
+ - 1
9
+ version: 0.0.1
10
+ platform: ruby
11
+ authors:
12
+ - Matthew Bunday
13
+ autorequire:
14
+ bindir: bin
15
+ cert_chain:
16
+ - |
17
+ -----BEGIN CERTIFICATE-----
18
+ MIIDMjCCAhqgAwIBAgIBADANBgkqhkiG9w0BAQUFADA/MREwDwYDVQQDDAhta2J1
19
+ bmRheTEVMBMGCgmSJomT8ixkARkWBWdtYWlsMRMwEQYKCZImiZPyLGQBGRYDY29t
20
+ MB4XDTEwMDMyMzE2MDkzOVoXDTExMDMyMzE2MDkzOVowPzERMA8GA1UEAwwIbWti
21
+ dW5kYXkxFTATBgoJkiaJk/IsZAEZFgVnbWFpbDETMBEGCgmSJomT8ixkARkWA2Nv
22
+ bTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAMk5+Wsur5ptIGUthPBG
23
+ VHECPqlV7TRgxiEMbH8vxkMVNnqFGDTezd9zsmqfX9kKR4/Jmu1fXKyBswGRxYxD
24
+ qx8nR+DCnWk0gfx2jjpnknPPWTQ6lHiZaPrGb+QuANhebPTwI6cDIz4A3dg2QIRo
25
+ ETdiAdOspNudUHu2Jf/QeNQPr5SURy9vGnSXkDhMcrnR3EjkRAP4suNIlHBNj3Hz
26
+ 7hYjZV5QzeFwVENR5K3zFSkbC3ZK6uZTUwPVngmCqWz3MLsNJiQhAhvn/XQ8OCJ3
27
+ Q8O/nPuIIqFNeT3TMvnfrbx+wyxX6FIBZ12M4lNmU6yoXxzmi/n/cBNLAkQ/hc2g
28
+ n68CAwEAAaM5MDcwCQYDVR0TBAIwADAdBgNVHQ4EFgQUZfQL/a3SzQ017Zj9MUwh
29
+ Y6BtLUgwCwYDVR0PBAQDAgSwMA0GCSqGSIb3DQEBBQUAA4IBAQAjdEGkZbV7tkOq
30
+ N0y3yL5n1JOMsVHsQF7/w2zeET3PyUgKmmobdq3V0rztqVcJ1oP/+fYUO1KYxC90
31
+ b8FOCGGvcKjMn1QJufFp1DTfiGFcz6nHRWmiAMRXbempzA5NDzocQP9jaRkoYEzK
32
+ pwsJwe0dlpJXs8/fqqljNdBe4AToDGLcbzdMmpGxZN63P70yAFL5G7sJy1Izp5ei
33
+ CvIRDtL1PdU1ESVLFJuoCAiCtpBfwwepv4kuuoca9Ykd5ldPCGzMq0n8+KIubb+2
34
+ xz7fp33atnZoMajdCOYKqwo2xVhUuFPZzBFZ3L6T6YLuEVGKHNyUAfcfr+8VSuB5
35
+ 3+l7cSZt
36
+ -----END CERTIFICATE-----
37
+
38
+ date: 2010-03-23 00:00:00 -05:00
39
+ default_executable:
40
+ dependencies: []
41
+
42
+ description: A high accuracy naive bayesian sentence tokenizer based on Splitta.
43
+ email: mkbunday @nospam@ gmail.com
44
+ executables: []
45
+
46
+ extensions: []
47
+
48
+ extra_rdoc_files:
49
+ - README.rdoc
50
+ - lib/models/features.mar
51
+ - lib/models/lower_words.mar
52
+ - lib/models/non_abbrs.mar
53
+ - lib/tactful_tokenizer.rb
54
+ - lib/word_tokenizer.rb
55
+ files:
56
+ - Manifest
57
+ - README.rdoc
58
+ - Rakefile
59
+ - lib/models/features.mar
60
+ - lib/models/lower_words.mar
61
+ - lib/models/non_abbrs.mar
62
+ - lib/tactful_tokenizer.rb
63
+ - lib/word_tokenizer.rb
64
+ - tactful_tokenizer.gemspec
65
+ has_rdoc: true
66
+ homepage: http://github.com/SlyShy/Tactful_Tokenizer
67
+ licenses: []
68
+
69
+ post_install_message:
70
+ rdoc_options:
71
+ - --line-numbers
72
+ - --inline-source
73
+ - --title
74
+ - Tactful_tokenizer
75
+ - --main
76
+ - README.rdoc
77
+ require_paths:
78
+ - lib
79
+ required_ruby_version: !ruby/object:Gem::Requirement
80
+ requirements:
81
+ - - ">="
82
+ - !ruby/object:Gem::Version
83
+ segments:
84
+ - 0
85
+ version: "0"
86
+ required_rubygems_version: !ruby/object:Gem::Requirement
87
+ requirements:
88
+ - - ">="
89
+ - !ruby/object:Gem::Version
90
+ segments:
91
+ - 1
92
+ - 2
93
+ version: "1.2"
94
+ requirements: []
95
+
96
+ rubyforge_project: tactful_tokenizer
97
+ rubygems_version: 1.3.6
98
+ signing_key:
99
+ specification_version: 3
100
+ summary: A high accuracy naive bayesian sentence tokenizer based on Splitta.
101
+ test_files: []
102
+
metadata.gz.sig ADDED
Binary file