DRMacIver-term-extractor 0.0.0

data/LICENSE ADDED
@@ -0,0 +1,25 @@
+ Copyright (c) 2009, Trampoline Systems
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions are met:
+     * Redistributions of source code must retain the above copyright
+       notice, this list of conditions and the following disclaimer.
+     * Redistributions in binary form must reproduce the above copyright
+       notice, this list of conditions and the following disclaimer in the
+       documentation and/or other materials provided with the distribution.
+     * Neither the name of Trampoline Systems nor the
+       names of its contributors may be used to endorse or promote products
+       derived from this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY Trampoline Systems ''AS IS'' AND ANY
+ EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ DISCLAIMED. IN NO EVENT SHALL Trampoline Systems BE LIABLE FOR ANY
+ DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
data/README.markdown ADDED
@@ -0,0 +1,40 @@
+ # The Trampoline Systems term extractor
+
+ The term extractor is a library for taking natural text and extracting a
+ set of terms from it which make sense without additional context. For example, feeding it the following text from my home page:
+
+     Hi. I’m David.
+
+     I’m also various other things. By training I’m a mathematician,
+     but I seem to have drifted away from that and become a programmer,
+     currently working on natural language processing and social analytic
+     software at Trampoline Systems.
+
+     This site is my public face on the internet. It contains my blog,
+     my OpenID and anything else I want to share with the world.
+
+ We get the following terms:
+
+     David
+     training
+     mathematician
+     programmer
+     natural language processing
+     social analytic software
+     Trampoline Systems
+     site
+     public face
+     public face on the internet
+     internet
+     blog
+     world
+
+ No attempt is made to assign meaning to the terms: They're not guaranteed to represent the content of the document. They're just intended to be coherent snippets of text which you can reuse in a broader context.
+
+ One limitation of this is that it doesn't necessarily extract all reasonable terms. For example, "natural language" is a reasonable term for this text which is not included in the output. The way we use the term extractor at Trampoline is to build a vocabulary of terms we consider interesting and then perform literal string searches for those terms. This allows us to be selective in the terms we generate and permissive in looking for matches for them.
+
+ Currently only English is supported. There are plans to support other languages, but nothing is implemented in that regard: it requires someone who is a native speaker of the language, a competent programmer and at least passingly familiar with NLP, so understandably we're a bit resource constrained on getting widespread non-English support.
+
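+ ## Usage
+
+ As a rough sketch (see lib/term-extractor.rb for the full API; the input file name below is just a placeholder), extraction from JRuby looks something like this:
+
+     require "term-extractor"
+
+     extractor = TermExtractor.new
+     text = File.read("homepage.txt")
+
+     # Each result is a Term carrying its string form, its part of speech
+     # tags and the index of the sentence it was found in.
+     extractor.extract_terms_from_text(text).each do |term|
+       puts "#{term.sentence}: #{term}"
+     end
+
+ The library wraps the OpenNLP Java tools, so it needs to run under JRuby.
+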
+ ## Copyright
+
+ Copyright (c) 2009 Trampoline Systems. See LICENSE for details.
data/Rakefile ADDED
@@ -0,0 +1,56 @@
+ require 'rubygems'
+ require 'rake'
+ require 'spec/rake/spectask'
+
+ begin
+   require 'jeweler'
+   Jeweler::Tasks.new do |gem|
+     gem.name = "term-extractor"
+     gem.summary = %Q{A library for extracting useful terms from text}
+     gem.email = "david.maciver@gmail.com"
+     gem.homepage = "http://github.com/DRMacIver/term-extractor"
+     gem.authors = ["David R. MacIver"]
+   end
+ rescue LoadError
+   puts "Jeweler (or a dependency) not available. Install it with: sudo gem install jeweler"
+ end
+
+ begin
+   require 'rcov/rcovtask'
+   Rcov::RcovTask.new do |test|
+     test.libs << 'test'
+     test.pattern = 'test/**/*_test.rb'
+     test.verbose = true
+   end
+ rescue LoadError
+   task :rcov do
+     abort "RCov is not available. In order to run rcov, you must: sudo gem install spicycode-rcov"
+   end
+ end
+
+ # Run the specs by default (the spec task is defined below).
+ task :default => :spec
+
+ require 'rake/rdoctask'
+ Rake::RDocTask.new do |rdoc|
+   if File.exist?('VERSION.yml')
+     config = YAML.load(File.read('VERSION.yml'))
+     version = "#{config[:major]}.#{config[:minor]}.#{config[:patch]}"
+   else
+     version = ""
+   end
+
+   rdoc.rdoc_dir = 'rdoc'
+   rdoc.title = "term-extractor #{version}"
+   rdoc.rdoc_files.include('README*')
+   rdoc.rdoc_files.include('lib/**/*.rb')
+ end
+
+ Spec::Rake::SpecTask.new do |t|
+   t.rcov = false
+   t.spec_files = FileList["test/**/*_spec.rb"]
+   t.libs << "./lib"
+ end
data/VERSION ADDED
@@ -0,0 +1 @@
+ 0.0.0
data/bin/terms.rb ADDED
@@ -0,0 +1,8 @@
+ #!/usr/bin/env jruby
+ # Print the terms extracted from the files named on the command line
+ # (or from standard input if none are given), one term per line.
+ require "term-extractor"
+
+ PE = TermExtractor.new
+
+ PE.nlp.each_sentence(ARGF) do |sentence|
+   puts PE.extract_terms_from_sentence(sentence)
+ end
data/lib/term-extractor.rb ADDED
@@ -0,0 +1,195 @@
+ require "term-extractor/nlp"
+
+ class Term
+   attr_accessor :to_s, :pos, :sentence
+
+   def initialize(ts, pos, sentence = nil)
+     @to_s, @pos, @sentence = ts, pos, sentence
+   end
+ end
+
+ class TermExtractor
+   attr_accessor :nlp, :max_term_length, :proscribed_start, :required_ending, :remove_urls, :remove_paths
+
+   def initialize(models = File.dirname(__FILE__) + "/../models")
+     @nlp = NLP.new(models)
+
+     # Empirically, terms longer than about 5 words seem to be either
+     # too specific to be useful or very noisy.
+     @max_term_length = 5
+
+     # Common sources of crap starting words
+     @proscribed_start = /CC|PRP|IN|DT|PRP\$|WP|WP\$|TO|EX/
+
+     # We have to end in a noun, foreign word or number.
+     @required_ending = /NN|NNS|NNP|NNPS|FW|CD/
+
+     self.remove_urls = true
+     self.remove_paths = true
+
+     yield self if block_given?
+   end
+
+   class TermContext
+     attr_accessor :parent, :tokens, :postags, :chunks
+
+     def nlp
+       parent.nlp
+     end
+
+     def initialize(parent, sentence)
+       @parent = parent
+       sentence = NLP.clean_sentence(sentence)
+
+       # User definable cleaning.
+       sentence = NLP.remove_urls(sentence) if parent.remove_urls
+       sentence = NLP.remove_paths(sentence) if parent.remove_paths
+
+       @tokens = NLP.tokenize_sentence(sentence)
+       @postags = nlp.postagger.tag(tokens)
+       @chunks = nlp.chunker.chunk(tokens, postags)
+
+       @sentence = sentence
+     end
+
+     def boundaries
+       return @boundaries if @boundaries
+
+       # To each token we assign three attributes which determine how it may occur within a term.
+       # can_cross determines if this token can appear internally in a term
+       # can_start determines if a term is allowed to start with this token
+       # can_end determines if a term is allowed to end with this token
+       @boundaries = tokens.map{|t| {}}
+
+       @boundaries.each_with_index do |b, i|
+         tok = tokens[i]
+         pos = postags[i]
+         chunk = chunks[i]
+
+         # Cannot cross commas or coordinating conjunctions (and, or, etc)
+         b[:can_cross] = !(pos =~ /,|CC/)
+
+         # Cannot cross the beginning of verb phrases,
+         # i.e. we may start with verb phrases but not include them
+         b[:can_cross] = (chunk != "B-VP") if b[:can_cross]
+
+         # We generate tags like <PATH>, <URL> and <QUOTE>
+         # to encapsulate various sorts of noise strings.
+         b[:can_cross] &&= !(tok =~ /<\w+>/)
+
+         # We are only allowed to start terms at the beginning of a noun phrase chunk
+         b[:can_start] = (chunks[i] == "B-NP")
+         if i > 0
+           if postags[i-1] =~ /DT|WDT|PRP|JJR|JJS/
+             # In some cases we want to move the start of a term to the right. These cases are:
+             # - a determiner (the, a, etc)
+             # - a possessive pronoun (my, your, etc)
+             # - comparative and superlative adjectives (best, better, etc.)
+             # In all cases we only do this for noun phrases, and will only move them to internal points.
+             b[:can_start] ||= (chunks[i] == "I-NP")
+             @boundaries[i - 1][:can_start] = false
+           end
+         end
+
+         # We must include any tokens internal to the current chunk
+         b[:can_end] = !(chunks[i + 1] =~ /I-/)
+
+         # It is permitted to cross stopwords, but they cannot lie at the term boundary
+         if (nlp.stopword? tok) || (nlp.stopword? tokens[i..i+1].join) # Need to take into account contractions, which span multiple tokens
+           b[:can_end] = false
+           b[:can_start] = false
+         end
+
+         # The presence of a ' at the start of a token is most likely an indicator that we've
+         # split across a contraction, e.g. would've -> would 've. We are not allowed to
+         # cross this transition point.
+         if tok =~ /^'/
+           b[:can_start] = false
+           @boundaries[i - 1][:can_end] = false
+         end
+
+         # Must match the requirements for POSes at the beginning and end.
+         b[:can_start] &&= !(pos =~ parent.proscribed_start)
+         b[:can_end] &&= (pos =~ parent.required_ending)
+       end
+
+       @boundaries
+     end
+
+     def terms
+       return @terms if @terms
+
+       @terms = []
+
+       i = 0
+       j = 0
+       while i < tokens.length
+         if !boundaries[i][:can_start] || !boundaries[i][:can_cross]
+           i += 1
+           next
+         end
+
+         j = i if j < i
+
+         if (j == tokens.length) || !boundaries[j][:can_cross] || (j >= i + parent.max_term_length)
+           i += 1
+           j = i
+           next
+         end
+
+         if !boundaries[j][:can_end]
+           j += 1
+           next
+         end
+
+         term = tokens[i..j]
+         poses = postags.to_a[i..j]
+         term = Term.new(TermExtractor.recombobulate_term(term), poses.join("-"))
+         terms << term if TermExtractor.allowed_term?(term)
+
+         j += 1
+       end
+
+       @terms
+     end
+   end
+
+   # Extract all terms in a given sentence.
+   def extract_terms_from_sentence(sentence)
+     TermContext.new(self, sentence).terms
+   end
+
+   def extract_terms_from_text(text)
+     if block_given?
+       nlp.sentences(text).each_with_index do |s, i|
+         terms = extract_terms_from_sentence(s)
+         terms.each{|p| p.sentence = i; yield(p) }
+       end
+     else
+       results = []
+       extract_terms_from_text(text){ |p| results << p }
+       results
+     end
+   end
+
+   # Final post filter on terms to determine if they're allowed.
+   def self.allowed_term?(p)
+     return false if p.pos =~ /^CD(-CD)*$/ # We don't allow things which are just sequences of numbers
+     return false if p.to_s.length > 255
+     true
+   end
+
+   # Take a sequence of tokens and turn them back into a term.
+   def self.recombobulate_term(term)
+     term = term.join(" ")
+     term.gsub!(/ '/, "'")
+     term.gsub!(/ \./, ".")
+     term
+   end
+ end
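For orientation, here is a small, hypothetical driver for the class above (the input text is made up; the calls are the ones defined in this file). It shows the configuration block accepted by the constructor and the two calling styles of extract_terms_from_text:

    require "term-extractor"

    te = TermExtractor.new do |t|
      t.max_term_length = 4   # default is 5 words
      t.remove_urls = true    # already the default; shown for illustration
    end

    text = "By training I'm a mathematician, but these days I work on natural language processing at Trampoline Systems."

    # With a block, each Term is yielded as it is found, tagged with the
    # index of the sentence it came from.
    te.extract_terms_from_text(text) do |term|
      puts "#{term.sentence}\t#{term.pos}\t#{term}"
    end

    # Without a block, the same call collects the Terms into an array.
    terms = te.extract_terms_from_text(text)
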
data/lib/term-extractor/nlp.rb ADDED
@@ -0,0 +1,262 @@
+ require "fileutils"
+ require "java"
+ require "term-extractor/opennlp-tools"
+ require "term-extractor/maxent-2.5.2"
+ require "term-extractor/trove"
+ require "term-extractor/snowball"
+ require "set"
+
+ class TermExtractor
+   # NLP contains a lot of general NLP related utilities.
+   # In particular it contains:
+   # - a selection of OpenNLP classes
+   # - a snowball stemmer
+   # - a stopword list
+   #
+   # And various utilities built on top of these.
+   class NLP
+     JV = Java::OpennlpToolsLangEnglish
+     include_class("org.tartarus.snowball.ext.englishStemmer") { |x, y| "EnglishStemmer" }
+
+     def stem(word)
+       stemmer.setCurrent(word)
+       stemmer.stem
+       stemmer.getCurrent
+     end
+
+     def sentdetect
+       @sentdetect ||= JV::SentenceDetector.new(loc("sd.bin.gz"))
+     end
+
+     def tagdict
+       @tagdict ||= Java::OpennlpToolsPostag::POSDictionary.new(loc("tagdict"), true)
+     end
+
+     def postagger
+       @postagger ||= JV::PosTagger.new(loc("tag.bin.gz"), tagdict)
+     end
+
+     def chunker
+       @chunker ||= JV::TreebankChunker.new(loc("chunk.bin.gz"))
+     end
+
+     def stopwords
+       @stopwords
+     end
+
+     def stemmer
+       @stemmer ||= EnglishStemmer.new
+     end
+
+     def initialize(models)
+       @models = models
+       @stopwords = Set.new
+
+       File.open(loc("stopwords")).each_line do |l|
+         l.gsub!(/#.+$/, "")
+         @stopwords.add clean_for_stopword(l)
+       end
+     end
+
+     # Canonicalisation gives a string that in some sense captures the "essential character"
+     # of a piece of text. It normalizes it by removing unnecessary words, rearranging, and
+     # stripping suffixes.
+     # It is not itself intended to be a useful representation of the string, but instead for
+     # determining if two strings are equal.
+     def canonicalize(str)
+       str.
+         to_s.
+         downcase.
+         gsub(/[^\w\s]/, " ").
+         split.
+         select{|p| !stopword?(p)}.
+         map{|p| stem(p) }.
+         sort.
+         join(" ")
+     end
+
+     def stopword?(word)
+       stopwords.include?(clean_for_stopword(word))
+     end
+
+     # Once we have split sentences, we clean them up prior to tokenization. We remove or normalize
+     # a bunch of noise sources and get it to a form where distinct tokens are separated by whitespace.
+     def NLP.clean_sentence(text)
+       text = text.dup
+       text.gsub!(/--+/, " -- ") # TODO: What's this for?
+
+       # Normalize bracket types.
+       # TODO: Shouldn't do this inside of tokens.
+       text.gsub!(/{\[/, "(")
+       text.gsub!(/\}\]/, ")")
+
+       # We turn most forms of punctuation which are not internal to tokens into commas
+       punct = /(\"|\(|\)|;|-|\:|-|\*|,)/
+
+       # Convert cunning "smart" apostrophes into plain old boring
+       # dumb ones.
+       text.gsub!(/’/, "'")
+
+       text.gsub!(/([\w])\.\.+([\w])/){ "#{$1} , #{$2}"}
+       text.gsub!(/(^| )#{punct}+/, " , ")
+       text.gsub!(/#{punct}( |$)/, " , ")
+       text.gsub!(/(\.+ |')/){" #{$1}"}
+
+       separators = /\//
+
+       text.gsub!(/ #{separators} /, " , ")
+
+       # We can be a bit overeager in turning things into commas, so we clean them up here.
+       # In particular we remove any we've accidentally added to the end of lines and we collapse
+       # consecutive ones into a single one.
+       text.gsub!(/(,|\.) *,/){ " #{$1} " }
+       text.gsub!(/(,| )+$/, "")
+       text.gsub!(/^(,| )+/, "")
+
+       text.gsub!(/((?:\.|\!|\?)+)$/){" #{$1}" }
+
+       # Clean up superfluous whitespace
+       text.gsub!(/\s+/, " ")
+       text
+     end
+
+     def NLP.tokenize_sentence(string)
+       clean_sentence(string).split
+     end
+
+     Ending = /(!|\?|\.)+/
+
+     def self.clean_text(text)
+       text = text.gsub(/\r(\n?)/, "\n") # Evil Microsoft line endings, die die die!
+       text.gsub!(/^\s+$/, "") # For convenience, remove all spaces from blank lines
+       text.gsub!(/\n\n+/m, ".\n.\n") # Collapse multiple line endings into periods
+       text.gsub!(/\n/, " ") # Squash the text onto a single line.
+       text.gsub!(/(\d+)\. /){ "#{$1} . " } # We separate out things of the form "1." as these are commonly lists and OpenNLP sentence detection handles them badly
+       text.strip!
+       text
+     end
+
+     def self.remove_urls(text)
+       text.gsub(/\w+:\/\/[^\s]+?(?=\.?(?= |$))/, "<URL>")
+     end
+
+     def self.remove_paths(text)
+       text = text.clone
+
+       # Fragments of windows paths
+       text.gsub!(/[\w:\\]*\\[\w:\\]*/, "<PATH>")
+
+       # Fragments of unix paths
+       text.gsub!(/\/[\w\/]+/, "<PATH>")
+       text.gsub!(/[\w\/]+\//, "<PATH>")
+
+       while text.gsub!(/<PATH>\s+\w+\s+<PATH>/, "<PATH>")
+         # Concatenate fragments where we have e.g. <PATH> and <PATH>
+         # into single paths. This is to take into account paths containing spaces.
+       end
+
+       text.gsub!(/<PATH>(\s*<PATH)*/, "<PATH>")
+       text
+     end
+
+     EmbedBoundaries = [
+       ["\"", "\""],
+       ["(", ")"],
+       ["[", "]"],
+       ["{", "}"]
+     ].map{|s| s.map{|x| Regexp.quote(x) }}
+
+     # Normalise a sentence by removing all parenthetical comments and replacing all embedded quotes contained therein.
+     # Returns an array of the sentence and all contained subterms.
+     def self.extract_embedded_sentences(text)
+       text = text.clone
+       fragments = [text]
+
+       l = nil
+       begin
+         l = fragments.length
+
+         EmbedBoundaries.each do |s, e|
+           replace = if s == e then "<QUOTE>" else "" end
+           matcher = /#{s}[^#{s}#{e}\n]*#{e}/
+           text.gsub!(matcher) { |frag| fragments << frag[1..-2]; replace }
+         end
+       end while fragments.length > l
+
+       if fragments.length > 1
+         fragments = fragments.map{|f| extract_embedded_sentences(f) }.flatten
+       end
+
+       fragments
+     end
+
+     def sentences(string)
+       sentdetect.sentDetect(NLP.clean_text(string)).to_a.map{|s| s.strip }.select{|s| (s.length > 0) && !(s =~ /^(\.|!|\?)+$/) }
+     end
+
+     def each_sentence(source)
+       lines = []
+
+       process_lines = lambda{
+         text = lines.join("\n").strip
+         if text != ""
+           sentences(text).each{|s| yield(s.gsub("\n", " ")) }
+         end
+         lines = []
+       }
+
+       source.each_line do |line|
+         line = line.strip
+
+         if line == ""
+           process_lines.call
+         end
+
+         lines << line
+       end
+
+       process_lines.call
+     end
+
+     def postag(tokens)
+       if tokens.is_a? String
+         tokens = NLP.tokenize_sentence(tokens)
+       else
+         tokens = tokens.to_a
+       end
+       tokens.zip(postagger.tag(tokens).to_a)
+     end
+
+     def chunk_text(text)
+       result = []
+       sentences(text).each{|x| result += chunk_sentence(x)}
+       result
+     end
+
+     def chunk_sentence(sentence)
+       tokens = NLP.tokenize_sentence(sentence)
+       postags = postagger.tag(tokens)
+       tokens.zip(chunker.chunk(tokens, postags).to_a)
+     end
+
+     private
+
+     def loc(file)
+       File.join(@models, file)
+     end
+
+     def clean_for_stopword(word)
+       word.downcase.gsub(/[^\w]/, "")
+     end
+
+     def chunk_type(tag)
+       case tag
+       when "O"
+         "O"
+       when /B-(.+)$/
+         $1
+       end
+     end
+   end
+ end
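The NLP helper above is also usable on its own. The following sketch only calls methods defined in this file; exact results depend on the bundled OpenNLP models and stopword list, so none are shown as literal values:

    require "term-extractor"

    nlp = TermExtractor.new.nlp

    # Sentence splitting using the OpenNLP sentence detector.
    nlp.sentences("First sentence. Second sentence.")

    # Snowball stemming and stopword checks.
    nlp.stem("running")            # a stemmed form, e.g. "run"
    nlp.stopword?("the")           # true if "the" appears in models/stopwords

    # Canonicalisation lowercases, drops stopwords, stems and sorts the words,
    # so different surface forms of a phrase are likely to compare equal.
    nlp.canonicalize("the quick brown foxes") == nlp.canonicalize("quick brown fox")

    # Part of speech tagging and chunking return token/tag pairs.
    nlp.postag("David works at Trampoline Systems")
    nlp.chunk_sentence("David works at Trampoline Systems")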