punkt-segmenter 0.9.0

@@ -0,0 +1,13 @@
+ Copyright 2010 Luis Cipriani
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
@@ -0,0 +1,102 @@
+ # Punkt sentence tokenizer
+
+ This code is a Ruby 1.9.x port of the Punkt sentence tokenizer algorithm implemented by the NLTK Project ([http://www.nltk.org/]). Punkt is a **language-independent**, unsupervised approach to **sentence boundary detection**. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified.
+
+ The algorithm is described in the following academic paper:
+
+ > Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
+ > Computational Linguistics 32: 485-525.
+ > [Download paper]
+
+ Here are the credits for the original implementation:
+
+ - Willy (willy@csse.unimelb.edu.au) (original Python port)
+ - Steven Bird (sb@csse.unimelb.edu.au) (additions)
+ - Edward Loper (edloper@gradient.cis.upenn.edu) (rewrite)
+ - Joel Nothman (jnothman@student.usyd.edu.au) (almost rewrite)
+
+ I simply did the Ruby port and made some API changes.
+
+ ## Install
+
+     gem install punkt-segmenter
+
+ Currently, this gem runs only on Ruby 1.9.x (because of the unicode_utils dependency).
+
+ ## How to use
+
+ Let's suppose we have the following text:
+
+ *"A minute is a unit of measurement of time or of angle. The minute is a unit of time equal to 1/60th of an hour or 60 seconds by 1. In the UTC time scale, a minute occasionally has 59 or 61 seconds; see leap second. The minute is not an SI unit; however, it is accepted for use with SI units. The symbol for minute or minutes is min. The fact that an hour contains 60 minutes is probably due to influences from the Babylonians, who used a base-60 or sexagesimal counting system. Colloquially, a min. may also refer to an indefinite amount of time substantially longer than the standardized length."* (source: http://en.wikipedia.org/wiki/Minute)
+
+ You can split it into sentences using the Punkt::SentenceTokenizer object:
+
+     tokenizer = Punkt::SentenceTokenizer.new(text)
+     result = tokenizer.sentences_from_text(text, :output => :sentences_text)
+
+ The result will be:
+
+     result = [
+       [0] "A minute is a unit of measurement of time or of angle.",
+       [1] "The minute is a unit of time equal to 1/60th of an hour or 60 seconds by 1.",
+       [2] "In the UTC time scale, a minute occasionally has 59 or 61 seconds; see leap second.",
+       [3] "The minute is not an SI unit; however, it is accepted for use with SI units.",
+       [4] "The symbol for minute or minutes is min.",
+       [5] "The fact that an hour contains 60 minutes is probably due to influences from the Babylonians, who used a base-60 or sexagesimal counting system.",
+       [6] "Colloquially, a min. may also refer to an indefinite amount of time substantially longer than the standardized length."
+     ]
+
+ The algorithm uses the text passed as a parameter both to train and to split into sentences. Sometimes the input text is not large enough to train a good model, which may cause some mistakes in the sentence splitting. For these cases you can train the Punkt segmenter separately:
+
+     trainer = Punkt::Trainer.new
+     trainer.train(training_text)
+
+     tokenizer = Punkt::SentenceTokenizer.new(trainer.parameters)
+     result = tokenizer.sentences_from_text(text, :output => :sentences_text)
+
+ In this case, instead of passing the text to SentenceTokenizer, you pass the trainer's parameters.
+
+ A recommended use case for the trainer is to train it on a big corpus in a specific language and then marshal the object to a file, so you can later load the already trained tokenizer from that file. You can also add more texts to the training set whenever you want.
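+
+ A minimal sketch of that workflow, assuming a `big_corpus.txt` file and a `trained_params.dump` destination (both names are just illustrative):
+
+     # train once on a large corpus and persist the parameters
+     trainer = Punkt::Trainer.new
+     trainer.train(File.read("big_corpus.txt"))
+     File.open("trained_params.dump", "wb") { |f| f.write(Marshal.dump(trainer.parameters)) }
+
+     # later, load the trained parameters and tokenize without retraining
+     parameters = Marshal.load(File.binread("trained_params.dump"))
+     tokenizer  = Punkt::SentenceTokenizer.new(parameters)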
+
+ The available options for the *sentences_from_text* method are listed below; a combined example follows the list:
+
+ - array of sentence indexes (default)
+ - array of sentence strings (**:output => :sentences_text**)
+ - array of tokenized sentences (**:output => :tokenized_sentences**)
+ - realigned boundaries (**:realign_boundaries => true**): use this if you want to realign sentences that end with, for example, parentheses, quotes, brackets, etc.
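+
+ For example, combining the last two options (using the `text` variable from above):
+
+     tokenizer = Punkt::SentenceTokenizer.new(text)
+     tokenizer.sentences_from_text(text, :output => :sentences_text,
+                                         :realign_boundaries => true)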
+
+ If you have a list of tokens, you can use the *sentences_from_tokens* method, which takes only the list of tokens as a parameter.
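+
+ A minimal sketch (the training text and token list are hypothetical):
+
+     tokenizer = Punkt::SentenceTokenizer.new(training_text)
+     tokenizer.sentences_from_tokens(["Hello", "world", ".", "Bye", "."])
+     # => [["Hello", "world", "."], ["Bye", "."]]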
+
+ Check the unit tests for more detailed examples in English and Portuguese.
+
+ ----
+ *This code follows the terms and conditions of Apache License v2 (http://www.apache.org/licenses/LICENSE-2.0)*
+
+ *Copyright (C) Luis Cipriani*
+
+ [http://www.nltk.org/]: http://www.nltk.org/
+ [Download paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.5017&rep=rep1&type=pdf
+
@@ -0,0 +1,16 @@
+ require 'rake'
+ require 'rake/testtask'
+
+ Rake::TestTask.new do |t|
+   t.libs << "test"
+   t.test_files = FileList['test/**/*.rb']
+   t.verbose = true
+ end
+
+ desc "Run test coverage (needs the cover_me gem)"
+ task :coverage do
+   ENV["coverage"] = "true"
+   Rake::Task["test"].invoke
+ end
+
+ task :default => :test
@@ -0,0 +1,13 @@
+ if RUBY_VERSION >= "1.9"
+   $:.unshift(File.dirname(__FILE__)) unless $:.include?(File.dirname(__FILE__))
+
+   # Dependencies
+   require "unicode_utils"
+   require "set"
+
+   # Lib requires
+   require "punkt-segmenter/frequency_distribution"
+   require "punkt-segmenter/punkt"
+ else
+   raise "This gem requires Ruby 1.9 or later."
+ end
@@ -0,0 +1,121 @@
+ module Probability
+   class FrequencyDistribution < Hash
+
+     attr_reader :N
+
+     alias_method :B,       :size
+     alias_method :samples, :keys
+
+     def initialize
+       super
+       clear
+     end
+
+     def clear
+       super
+       @N = 0
+       @cache = {}
+     end
+
+     def [](sample)
+       super || 0
+     end
+
+     def []=(sample, value)
+       @N += (value - self[sample])
+       super
+       @cache = {}
+     end
+
+     def keys
+       result = @cache[:ordered_by_frequency_desc] || order_by_frequency_desc
+       result.map { |item| item[0] }
+     end
+
+     def values
+       result = @cache[:ordered_by_frequency_desc] || order_by_frequency_desc
+       result.map { |item| item[1] }
+     end
+
+     def items
+       @cache[:ordered_by_frequency_desc] || order_by_frequency_desc
+     end
+
+     def each(&block)
+       items = @cache[:ordered_by_frequency_desc] || order_by_frequency_desc
+       items.each { |item| yield(item[0], item[1]) }
+     end
+
+     def each_key(&block)
+       keys.each { |item| yield(item) }
+     end
+
+     def each_value(&block)
+       values.each { |value| yield(value) }
+     end
+
+     def <<(sample)
+       self.inc(sample)
+     end
+
+     def inc(sample, count = 1)
+       return if count == 0
+       self[sample] = self[sample] + count
+     end
+
+     def delete(sample, &block)
+       result = super
+       if result
+         @cache = {}
+         @N -= result
+       end
+       result
+     end
+
+     def delete_if(&block)
+       raise "Not implemented for Frequency Distributions"
+     end
+
+     def frequency_of(sample)
+       return 0 if @N == 0
+       return self[sample].to_f / @N
+     end
+
+     def max
+       unless @cache[:max]
+         max_sample = nil
+         max_count = -1
+         self.keys.each do |sample|
+           if self[sample] > max_count
+             max_sample = sample
+             max_count = self[sample]
+           end
+         end
+         @cache[:max] = max_sample
+       end
+       return @cache[:max]
+     end
+
+     def merge(other_frequency_distribution)
+       temp = self.dup
+       other_frequency_distribution.each do |sample, value|
+         temp.inc(sample, value)
+       end
+       return temp
+     end
+
+     def merge!(other_frequency_distribution)
+       other_frequency_distribution.each do |sample, value|
+         self.inc(sample, value)
+       end
+       self
+     end
+
+     private
+
+     def order_by_frequency_desc
+       @cache[:ordered_by_frequency_desc] = self.to_a.sort { |x, y| y[1] <=> x[1] }
+     end
+
+   end
+ end
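
A minimal usage sketch of the FrequencyDistribution class above (sample data invented for illustration):

    fd = Probability::FrequencyDistribution.new
    fd << "the"                # same as fd.inc("the")
    fd << "the"
    fd.inc("minute")
    fd.N                       # => 3, total observations
    fd["the"]                  # => 2 (missing samples return 0)
    fd.frequency_of("the")     # => 0.6666..., i.e. 2.0 / 3
    fd.max                     # => "the", the most frequent sample
    fd.keys                    # => ["the", "minute"], ordered by decreasing frequency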
@@ -0,0 +1,51 @@
+ # Ruby implementation of Punkt sentence tokenizer
+ #
+ # This code is a ruby port of the algorithm implemented by
+ # the NLTK Project. This code follows the terms and conditions
+ # of Apache License v2 (http://www.apache.org/licenses/LICENSE-2.0)
+ #
+ # Copyright (C) 2001-2010 NLTK Project
+ # Algorithm: Kiss & Strunk (2006)
+ # Author: Willy <willy@csse.unimelb.edu.au> (original Python port)
+ #         Steven Bird <sb@csse.unimelb.edu.au> (additions)
+ #         Edward Loper <edloper@gradient.cis.upenn.edu> (rewrite)
+ #         Joel Nothman <jnothman@student.usyd.edu.au> (almost rewrite)
+ #
+ #         Luis Cipriani (ruby port)
+ # URL: <http://www.nltk.org/>
+ #
+ #
+ # The Punkt sentence tokenizer. The algorithm for this tokenizer is
+ # described in Kiss & Strunk (2006)::
+ #
+ #   Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
+ #   Boundary Detection. Computational Linguistics 32: 485-525.
+ #
+ module Punkt
+
+   # Orthographic context constants
+   ORTHO_BEG_UC = 1 << 1 # Orthographic context: beginning of sentence with upper case
+   ORTHO_MID_UC = 1 << 2 # Orthographic context: middle of sentence with upper case
+   ORTHO_UNK_UC = 1 << 3 # Orthographic context: unknown position in a sentence with upper case
+   ORTHO_BEG_LC = 1 << 4 # Orthographic context: beginning of sentence with lower case
+   ORTHO_MID_LC = 1 << 5 # Orthographic context: middle of sentence with lower case
+   ORTHO_UNK_LC = 1 << 6 # Orthographic context: unknown position in a sentence with lower case
+   ORTHO_UC = ORTHO_BEG_UC + ORTHO_MID_UC + ORTHO_UNK_UC
+   ORTHO_LC = ORTHO_BEG_LC + ORTHO_MID_LC + ORTHO_UNK_LC
+   ORTHO_MAP = {
+     [:initial,  :upper] => ORTHO_BEG_UC,
+     [:internal, :upper] => ORTHO_MID_UC,
+     [:unknown,  :upper] => ORTHO_UNK_UC,
+     [:initial,  :lower] => ORTHO_BEG_LC,
+     [:internal, :lower] => ORTHO_MID_LC,
+     [:unknown,  :lower] => ORTHO_UNK_LC,
+   }
+
+ end
+
+ require "punkt-segmenter/punkt/language_vars"
+ require "punkt-segmenter/punkt/parameters"
+ require "punkt-segmenter/punkt/token"
+ require "punkt-segmenter/punkt/base"
+ require "punkt-segmenter/punkt/trainer"
+ require "punkt-segmenter/punkt/sentence_tokenizer"
@@ -0,0 +1,65 @@
+ module Punkt
+   class Base
+     def initialize(language_vars = Punkt::LanguageVars.new,
+                    token_class = Punkt::Token,
+                    parameters = Punkt::Parameters.new)
+
+       @parameters = parameters
+       @language_vars = language_vars
+       @token_class = token_class
+     end
+
+     def tokenize_words(plain_text, options = {})
+       return @language_vars.word_tokenize(plain_text) if options[:output] == :string
+       result = []
+       paragraph_start = false
+       plain_text.split("\n").each do |line|
+         unless line.strip.empty?
+           line_tokens = @language_vars.word_tokenize(line)
+           first_token = @token_class.new(line_tokens.shift,
+                                          :paragraph_start => paragraph_start,
+                                          :line_start => true)
+           paragraph_start = false
+           line_tokens.map! { |token| @token_class.new(token) }.unshift(first_token)
+
+           result += line_tokens
+         else
+           paragraph_start = true
+         end
+       end
+       return result
+     end
+
+     private
+
+     def annotate_first_pass(tokens)
+       tokens.each do |aug_token|
+         tok = aug_token.token
+
+         if @language_vars.sent_end_chars.include?(tok)
+           aug_token.sentence_break = true
+         elsif aug_token.is_ellipsis?
+           aug_token.ellipsis = true
+         elsif aug_token.ends_with_period? && !tok.end_with?("..")
+           tok_low = UnicodeUtils.downcase(tok.chop)
+           if @parameters.abbreviation_types.include?(tok_low) || @parameters.abbreviation_types.include?(tok_low.split("-")[-1])
+             aug_token.abbr = true
+           else
+             aug_token.sentence_break = true
+           end
+         end
+
+       end
+     end
+
+     def pair_each(list, &block)
+       previous = list[0]
+       list[1..-1].each do |item|
+         yield(previous, item)
+         previous = item
+       end
+       yield(previous, nil)
+     end
+
+   end
+ end
@@ -0,0 +1,34 @@
+ module Punkt
+   class LanguageVars
+
+     attr_reader :re_period_context
+     attr_reader :sent_end_chars
+     attr_reader :internal_punctuation
+     attr_reader :re_boundary_realignment
+
+     def initialize
+       @sent_end_chars = ['.', '?', '!']
+
+       @re_sent_end_chars = /[.?!]/
+
+       @internal_punctuation = [',', ':', ';']
+
+       @re_boundary_realignment = /^["\')\]}]+?(?:\s+|(?=--)|$)/m
+
+       @re_word_start = /[^\(\"\`{\[:;&\#\*@\)}\]\-,]/
+
+       @re_non_word_chars = /(?:[?!)\";}\]\*:@\'\({\[])/
+
+       @re_multi_char_punct = /(?:\-{2,}|\.{2,}|(?:\.\s){2,}\.)/
+
+       @re_word_tokenizer = /#{@re_multi_char_punct}|(?=#{@re_word_start})\S+?(?=\s|$|#{@re_non_word_chars}|#{@re_multi_char_punct}|,(?=$|\s|#{@re_non_word_chars}|#{@re_multi_char_punct}))|\S/
+
+       @re_period_context = /\S*#{@re_sent_end_chars}(?=(?<after_tok>#{@re_non_word_chars}|\s+(?<next_tok>\S+)))/
+     end
+
+     def word_tokenize(text)
+       text.scan(@re_word_tokenizer)
+     end
+
+   end
+ end
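
A quick sketch of the word tokenizer above. Note that trailing periods stay attached to the preceding word, which is what lets Punkt reason about abbreviations later:

    vars = Punkt::LanguageVars.new
    vars.word_tokenize("Mr. Smith arrived.")
    # => ["Mr.", "Smith", "arrived."]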
@@ -0,0 +1,37 @@
+ module Punkt
+   class Parameters
+
+     attr_accessor :abbreviation_types
+     attr_accessor :collocations
+     attr_accessor :sentence_starters
+     attr_accessor :orthographic_context
+
+     def initialize
+       clear_abbreviations
+       clear_collocations
+       clear_sentence_starters
+       clear_orthographic_context
+     end
+
+     def clear_abbreviations
+       @abbreviation_types = Set.new
+     end
+
+     def clear_collocations
+       @collocations = Set.new
+     end
+
+     def clear_sentence_starters
+       @sentence_starters = Set.new
+     end
+
+     def clear_orthographic_context
+       @orthographic_context = Hash.new(0)
+     end
+
+     def add_orthographic_context(type, flag)
+       @orthographic_context[type] |= flag
+     end
+
+   end
+ end
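
A small sketch of how a trainer fills these parameters in (the token types here are invented for illustration):

    params = Punkt::Parameters.new
    params.abbreviation_types << "mr"    # abbreviation types are stored lower-cased, without the period
    params.add_orthographic_context("smith", Punkt::ORTHO_BEG_UC)
    params.add_orthographic_context("smith", Punkt::ORTHO_MID_UC)
    (params.orthographic_context["smith"] & Punkt::ORTHO_UC) != 0   # => true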
@@ -0,0 +1,180 @@
+ module Punkt
+   class SentenceTokenizer < Base
+     def initialize(train_text_or_parameters,
+                    language_vars = Punkt::LanguageVars.new,
+                    token_class = Punkt::Token)
+
+       super(language_vars, token_class)
+
+       @trainer = nil
+
+       if train_text_or_parameters.kind_of?(String)
+         @parameters = train(train_text_or_parameters)
+       elsif train_text_or_parameters.kind_of?(Punkt::Parameters)
+         @parameters = train_text_or_parameters
+       else
+         raise "You need to pass trainer parameters or a text to train."
+       end
+     end
+
+     def sentences_from_text(text, options = {})
+       sentences = split_in_sentences(text)
+       sentences = realign_boundaries(text, sentences) if options[:realign_boundaries]
+       sentences = self.class.send(options[:output], text, sentences) if options[:output]
+
+       return sentences
+     end
+     alias_method :tokenize, :sentences_from_text
+
+     def sentences_from_tokens(tokens)
+       tokens = annotate_tokens(tokens.map { |t| @token_class.new(t) })
+
+       sentences = []
+       sentence = []
+       tokens.each do |t|
+         sentence << t.token
+         if t.sentence_break
+           sentences << sentence
+           sentence = []
+         end
+       end
+       sentences << sentence unless sentence.empty?
+
+       return sentences
+     end
+
+     class << self
+       def sentences_text(text, sentences_indexes)
+         sentences_indexes.map { |index| text[index[0]..index[1]] }
+       end
+
+       def tokenized_sentences(text, sentences_indexes)
+         tokenizer = Punkt::Base.new
+         self.sentences_text(text, sentences_indexes).map { |sentence| tokenizer.tokenize_words(sentence, :output => :string) }
+       end
+     end
+
+     private
+
+     def train(train_text)
+       @trainer = Punkt::Trainer.new(@language_vars, @token_class) unless @trainer
+       @trainer.train(train_text)
+       @parameters = @trainer.parameters
+     end
+
+     def split_in_sentences(text)
+       result = []
+       last_break = 0
+       current_sentence_start = 0
+       while match = @language_vars.re_period_context.match(text, last_break)
+         context = match[0] + match[:after_tok]
+         if text_contains_sentence_break?(context)
+           result << [current_sentence_start, (match.end(0) - 1)]
+           current_sentence_start = match[:next_tok] ? match.begin(:next_tok) : match.end(0)
+         end
+         if match[:next_tok]
+           last_break = match.begin(:next_tok)
+         else
+           last_break = match.end(0)
+         end
+       end
+       result << [current_sentence_start, (text.size - 1)]
+     end
+
+     def text_contains_sentence_break?(text)
+       found = false
+       annotate_tokens(tokenize_words(text)).each do |token|
+         return true if found
+         found = true if token.sentence_break
+       end
+       return false
+     end
+
+     def annotate_tokens(tokens)
+       tokens = annotate_first_pass(tokens)
+       tokens = annotate_second_pass(tokens)
+       return tokens
+     end
+
+     def annotate_second_pass(tokens)
+       pair_each(tokens) do |tok1, tok2|
+         next unless tok2
+         next unless tok1.ends_with_period?
+
+         token = tok1.token
+         type = tok1.type_without_period
+         next_token = tok2.token
+         next_type = tok2.type_without_sentence_period
+         token_is_initial = tok1.is_initial?
+
+         if @parameters.collocations.include?([type, next_type])
+           tok1.sentence_break = false
+           tok1.abbr = true
+           next
+         end
+
+         if (tok1.abbr || tok1.ellipsis) && !token_is_initial
+           is_sentence_starter = orthographic_heuristic(tok2)
+           if is_sentence_starter == true
+             tok1.sentence_break = true
+             next
+           end
+
+           if tok2.first_upper? && @parameters.sentence_starters.include?(next_type)
+             tok1.sentence_break = true
+             next
+           end
+         end
+
+         if token_is_initial || type == "##number##"
+           is_sentence_starter = orthographic_heuristic(tok2)
+           if is_sentence_starter == false
+             tok1.sentence_break = false
+             tok1.abbr = true
+             next
+           end
+
+           if is_sentence_starter == :unknown && token_is_initial &&
+              tok2.first_upper? && !(@parameters.orthographic_context[next_type] & Punkt::ORTHO_LC != 0)
+             tok1.sentence_break = false
+             tok1.abbr = true
+           end
+         end
+
+       end
+       return tokens
+     end
+
+     def orthographic_heuristic(aug_token)
+       return false if [';', ',', ':', '.', '!', '?'].include?(aug_token.token)
+
+       orthographic_context = @parameters.orthographic_context[aug_token.type_without_sentence_period]
+       return true if aug_token.first_upper? && (orthographic_context & Punkt::ORTHO_LC != 0) && !(orthographic_context & Punkt::ORTHO_MID_UC != 0)
+       return false if aug_token.first_lower? && ((orthographic_context & Punkt::ORTHO_UC != 0) || !(orthographic_context & Punkt::ORTHO_BEG_LC != 0))
+       return :unknown
+     end
+
+     def realign_boundaries(text, sentences)
+       result = []
+       realign = 0
+       pair_each(sentences) do |i1, i2|
+         s1 = text[i1[0]..i1[1]]
+         s2 = i2 ? text[i2[0]..i2[1]] : nil
+         # pull leading quotes/brackets of the next sentence back into this one
+         unless s2
+           result << [i1[0] + realign, i1[1]] if s1
+           next
+         end
+         if match = @language_vars.re_boundary_realignment.match(s2)
+           result << [i1[0] + realign, i1[1] + match[0].strip.size]
+           realign = match.end(0)
+         else
+           result << [i1[0] + realign, i1[1]] if s1
+           realign = 0
+         end
+       end
+       return result
+     end
+
+   end
+ end