punkt-segmenter 0.9.0

@@ -0,0 +1,13 @@
+ Copyright 2010 Luis Cipriani
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
@@ -0,0 +1,102 @@
+ # Punkt sentence tokenizer
+
+ This code is a Ruby 1.9.x port of the Punkt sentence tokenizer algorithm implemented by the NLTK Project ([http://www.nltk.org/]). Punkt is a **language-independent**, unsupervised approach to **sentence boundary detection**. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified.
+
+ The algorithm is described in the following academic paper:
+
+ > Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
+ > Computational Linguistics 32: 485-525.
+ > [Download paper]
+
+ Here are the credits for the original implementation:
+
+ - Willy (willy@csse.unimelb.edu.au) (original Python port)
+ - Steven Bird (sb@csse.unimelb.edu.au) (additions)
+ - Edward Loper (edloper@gradient.cis.upenn.edu) (rewrite)
+ - Joel Nothman (jnothman@student.usyd.edu.au) (almost rewrite)
+
+ I simply did the Ruby port and made some API changes.
+
+ ## Install
+
+     gem install punkt-segmenter
+
+ Currently, this gem runs only on Ruby 1.9.x (because of the unicode_utils dependency).
+
+ ## How to use
+
+ Let's suppose we have the following text:
+
+ *"A minute is a unit of measurement of time or of angle. The minute is a unit of time equal to 1/60th of an hour or 60 seconds by 1. In the UTC time scale, a minute occasionally has 59 or 61 seconds; see leap second. The minute is not an SI unit; however, it is accepted for use with SI units. The symbol for minute or minutes is min. The fact that an hour contains 60 minutes is probably due to influences from the Babylonians, who used a base-60 or sexagesimal counting system. Colloquially, a min. may also refer to an indefinite amount of time substantially longer than the standardized length."* (source: http://en.wikipedia.org/wiki/Minute)
+
+ You can split it into sentences using the Punkt::SentenceTokenizer object:
+
+     tokenizer = Punkt::SentenceTokenizer.new(text)
+     result = tokenizer.sentences_from_text(text, :output => :sentences_text)
+
+ The result will be:
+
+     result = [
+       [0] "A minute is a unit of measurement of time or of angle.",
+       [1] "The minute is a unit of time equal to 1/60th of an hour or 60 seconds by 1.",
+       [2] "In the UTC time scale, a minute occasionally has 59 or 61 seconds; see leap second.",
+       [3] "The minute is not an SI unit; however, it is accepted for use with SI units.",
+       [4] "The symbol for minute or minutes is min.",
+       [5] "The fact that an hour contains 60 minutes is probably due to influences from the Babylonians, who used a base-60 or sexagesimal counting system.",
+       [6] "Colloquially, a min. may also refer to an indefinite amount of time substantially longer than the standardized length."
+     ]
+
+ The algorithm uses the text passed as a parameter both to train and to split into sentences. Sometimes the input text is not large enough to train a good model, which may cause some mistakes in the sentence splitting. For these cases you can train the Punkt segmenter separately:
+
+     trainer = Punkt::Trainer.new
+     trainer.train(training_text)
+
+     tokenizer = Punkt::SentenceTokenizer.new(trainer.parameters)
+     result = tokenizer.sentences_from_text(text, :output => :sentences_text)
+
+ In this case, instead of passing the text to SentenceTokenizer, you pass the trainer's parameters.
+
+ A recommended use case for the trainer is to train it on a big corpus in a specific language and then marshal the object to a file, so you can later load the already trained tokenizer from that file. You can also add more texts to the training set whenever you want.
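+
+ A minimal sketch of that workflow, assuming a `big_corpus.txt` file and a `trained_params.dump` destination (both names are just illustrative):
+
+     # train once on a large corpus and persist the parameters
+     trainer = Punkt::Trainer.new
+     trainer.train(File.read("big_corpus.txt"))
+     File.open("trained_params.dump", "wb") { |f| f.write(Marshal.dump(trainer.parameters)) }
+
+     # later, load the trained parameters and tokenize without retraining
+     parameters = Marshal.load(File.binread("trained_params.dump"))
+     tokenizer  = Punkt::SentenceTokenizer.new(parameters)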
+
+ The available options for the *sentences_from_text* method are listed below; a combined example follows the list:
+
+ - array of sentence indexes (default)
+ - array of sentence strings (**:output => :sentences_text**)
+ - array of tokenized sentences (**:output => :tokenized_sentences**)
+ - realigned boundaries (**:realign_boundaries => true**): use this if you want to realign sentences that end with, for example, parentheses, quotes, brackets, etc.
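+
+ For example, combining the last two options (using the `text` variable from above):
+
+     tokenizer = Punkt::SentenceTokenizer.new(text)
+     tokenizer.sentences_from_text(text, :output => :sentences_text,
+                                         :realign_boundaries => true)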
+
+ If you have a list of tokens, you can use the *sentences_from_tokens* method, which takes only the list of tokens as a parameter.
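+
+ A minimal sketch (the training text and token list are hypothetical):
+
+     tokenizer = Punkt::SentenceTokenizer.new(training_text)
+     tokenizer.sentences_from_tokens(["Hello", "world", ".", "Bye", "."])
+     # => [["Hello", "world", "."], ["Bye", "."]]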
+
+ Check the unit tests for more detailed examples in English and Portuguese.
+
+ ----
+ *This code follows the terms and conditions of Apache License v2 (http://www.apache.org/licenses/LICENSE-2.0)*
+
+ *Copyright (C) Luis Cipriani*
+
+ [http://www.nltk.org/]: http://www.nltk.org/
+ [Download paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.5017&rep=rep1&type=pdf
+
@@ -0,0 +1,16 @@
+ require 'rake'
+ require 'rake/testtask'
+
+ Rake::TestTask.new do |t|
+   t.libs << "test"
+   t.test_files = FileList['test/**/*.rb']
+   t.verbose = true
+ end
+
+ desc "Run test coverage (needs the cover_me gem)"
+ task :coverage do
+   ENV["coverage"] = "true"
+   Rake::Task["test"].invoke
+ end
+
+ task :default => :test
@@ -0,0 +1,13 @@
+ if RUBY_VERSION >= "1.9"
+   $:.unshift(File.dirname(__FILE__)) unless $:.include?(File.dirname(__FILE__))
+
+   # Dependencies
+   require "unicode_utils"
+   require "set"
+
+   # Lib requires
+   require "punkt-segmenter/frequency_distribution"
+   require "punkt-segmenter/punkt"
+ else
+   raise "This gem requires Ruby 1.9 or later."
+ end
@@ -0,0 +1,121 @@
+ module Probability
+   class FrequencyDistribution < Hash
+
+     attr_reader :N
+
+     alias_method :B,       :size
+     alias_method :samples, :keys
+
+     def initialize
+       super
+       clear
+     end
+
+     def clear
+       super
+       @N = 0
+       @cache = {}
+     end
+
+     def [](sample)
+       super || 0
+     end
+
+     def []=(sample, value)
+       @N += (value - self[sample])
+       super
+       @cache = {}
+     end
+
+     def keys
+       result = @cache[:ordered_by_frequency_desc] || order_by_frequency_desc
+       result.map { |item| item[0] }
+     end
+
+     def values
+       result = @cache[:ordered_by_frequency_desc] || order_by_frequency_desc
+       result.map { |item| item[1] }
+     end
+
+     def items
+       @cache[:ordered_by_frequency_desc] || order_by_frequency_desc
+     end
+
+     def each(&block)
+       items = @cache[:ordered_by_frequency_desc] || order_by_frequency_desc
+       items.each { |item| yield(item[0], item[1]) }
+     end
+
+     def each_key(&block)
+       keys.each { |item| yield(item) }
+     end
+
+     def each_value(&block)
+       values.each { |value| yield(value) }
+     end
+
+     def <<(sample)
+       self.inc(sample)
+     end
+
+     def inc(sample, count = 1)
+       return if count == 0
+       self[sample] = self[sample] + count
+     end
+
+     def delete(sample, &block)
+       result = super
+       if result
+         @cache = {}
+         @N -= result
+       end
+       result
+     end
+
+     def delete_if(&block)
+       raise "Not implemented for Frequency Distributions"
+     end
+
+     def frequency_of(sample)
+       return 0 if @N == 0
+       return self[sample].to_f / @N
+     end
+
+     def max
+       unless @cache[:max]
+         max_sample = nil
+         max_count = -1
+         self.keys.each do |sample|
+           if self[sample] > max_count
+             max_sample = sample
+             max_count = self[sample]
+           end
+         end
+         @cache[:max] = max_sample
+       end
+       return @cache[:max]
+     end
+
+     def merge(other_frequency_distribution)
+       temp = self.dup
+       other_frequency_distribution.each do |sample, value|
+         temp.inc(sample, value)
+       end
+       return temp
+     end
+
+     def merge!(other_frequency_distribution)
+       other_frequency_distribution.each do |sample, value|
+         self.inc(sample, value)
+       end
+       self
+     end
+
+     private
+
+     def order_by_frequency_desc
+       @cache[:ordered_by_frequency_desc] = self.to_a.sort { |x, y| y[1] <=> x[1] }
+     end
+
+   end
+ end
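
A minimal usage sketch of the FrequencyDistribution class above (sample data invented for illustration):

    fd = Probability::FrequencyDistribution.new
    fd << "the"                # same as fd.inc("the")
    fd << "the"
    fd.inc("minute")
    fd.N                       # => 3, total observations
    fd["the"]                  # => 2 (missing samples return 0)
    fd.frequency_of("the")     # => 0.6666..., i.e. 2.0 / 3
    fd.max                     # => "the", the most frequent sample
    fd.keys                    # => ["the", "minute"], ordered by decreasing frequency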
@@ -0,0 +1,51 @@
+ # Ruby implementation of Punkt sentence tokenizer
+ #
+ # This code is a ruby port of the algorithm implemented by
+ # the NLTK Project. This code follows the terms and conditions
+ # of Apache License v2 (http://www.apache.org/licenses/LICENSE-2.0)
+ #
+ # Copyright (C) 2001-2010 NLTK Project
+ # Algorithm: Kiss & Strunk (2006)
+ # Author: Willy <willy@csse.unimelb.edu.au> (original Python port)
+ #         Steven Bird <sb@csse.unimelb.edu.au> (additions)
+ #         Edward Loper <edloper@gradient.cis.upenn.edu> (rewrite)
+ #         Joel Nothman <jnothman@student.usyd.edu.au> (almost rewrite)
+ #
+ #         Luis Cipriani (ruby port)
+ # URL: <http://www.nltk.org/>
+ #
+ #
+ # The Punkt sentence tokenizer. The algorithm for this tokenizer is
+ # described in Kiss & Strunk (2006)::
+ #
+ #   Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
+ #   Boundary Detection. Computational Linguistics 32: 485-525.
+ #
+ module Punkt
+
+   # Orthographic context constants
+   ORTHO_BEG_UC = 1 << 1 # Orthographic context: beginning of sentence with upper case
+   ORTHO_MID_UC = 1 << 2 # Orthographic context: middle of sentence with upper case
+   ORTHO_UNK_UC = 1 << 3 # Orthographic context: unknown position in a sentence with upper case
+   ORTHO_BEG_LC = 1 << 4 # Orthographic context: beginning of sentence with lower case
+   ORTHO_MID_LC = 1 << 5 # Orthographic context: middle of sentence with lower case
+   ORTHO_UNK_LC = 1 << 6 # Orthographic context: unknown position in a sentence with lower case
+   ORTHO_UC = ORTHO_BEG_UC + ORTHO_MID_UC + ORTHO_UNK_UC
+   ORTHO_LC = ORTHO_BEG_LC + ORTHO_MID_LC + ORTHO_UNK_LC
+   ORTHO_MAP = {
+     [:initial,  :upper] => ORTHO_BEG_UC,
+     [:internal, :upper] => ORTHO_MID_UC,
+     [:unknown,  :upper] => ORTHO_UNK_UC,
+     [:initial,  :lower] => ORTHO_BEG_LC,
+     [:internal, :lower] => ORTHO_MID_LC,
+     [:unknown,  :lower] => ORTHO_UNK_LC,
+   }
+
+ end
+
+ require "punkt-segmenter/punkt/language_vars"
+ require "punkt-segmenter/punkt/parameters"
+ require "punkt-segmenter/punkt/token"
+ require "punkt-segmenter/punkt/base"
+ require "punkt-segmenter/punkt/trainer"
+ require "punkt-segmenter/punkt/sentence_tokenizer"
@@ -0,0 +1,65 @@
+ module Punkt
+   class Base
+     def initialize(language_vars = Punkt::LanguageVars.new,
+                    token_class = Punkt::Token,
+                    parameters = Punkt::Parameters.new)
+
+       @parameters = parameters
+       @language_vars = language_vars
+       @token_class = token_class
+     end
+
+     def tokenize_words(plain_text, options = {})
+       return @language_vars.word_tokenize(plain_text) if options[:output] == :string
+       result = []
+       paragraph_start = false
+       plain_text.split("\n").each do |line|
+         unless line.strip.empty?
+           line_tokens = @language_vars.word_tokenize(line)
+           first_token = @token_class.new(line_tokens.shift,
+                                          :paragraph_start => paragraph_start,
+                                          :line_start => true)
+           paragraph_start = false
+           line_tokens.map! { |token| @token_class.new(token) }.unshift(first_token)
+
+           result += line_tokens
+         else
+           paragraph_start = true
+         end
+       end
+       return result
+     end
+
+     private
+
+     def annotate_first_pass(tokens)
+       tokens.each do |aug_token|
+         tok = aug_token.token
+
+         if @language_vars.sent_end_chars.include?(tok)
+           aug_token.sentence_break = true
+         elsif aug_token.is_ellipsis?
+           aug_token.ellipsis = true
+         elsif aug_token.ends_with_period? && !tok.end_with?("..")
+           tok_low = UnicodeUtils.downcase(tok.chop)
+           if @parameters.abbreviation_types.include?(tok_low) || @parameters.abbreviation_types.include?(tok_low.split("-")[-1])
+             aug_token.abbr = true
+           else
+             aug_token.sentence_break = true
+           end
+         end
+
+       end
+     end
+
+     def pair_each(list, &block)
+       previous = list[0]
+       list[1..-1].each do |item|
+         yield(previous, item)
+         previous = item
+       end
+       yield(previous, nil)
+     end
+
+   end
+ end
@@ -0,0 +1,34 @@
+ module Punkt
+   class LanguageVars
+
+     attr_reader :re_period_context
+     attr_reader :sent_end_chars
+     attr_reader :internal_punctuation
+     attr_reader :re_boundary_realignment
+
+     def initialize
+       @sent_end_chars = ['.', '?', '!']
+
+       @re_sent_end_chars = /[.?!]/
+
+       @internal_punctuation = [',', ':', ';']
+
+       @re_boundary_realignment = /^["\')\]}]+?(?:\s+|(?=--)|$)/m
+
+       @re_word_start = /[^\(\"\`{\[:;&\#\*@\)}\]\-,]/
+
+       @re_non_word_chars = /(?:[?!)\";}\]\*:@\'\({\[])/
+
+       @re_multi_char_punct = /(?:\-{2,}|\.{2,}|(?:\.\s){2,}\.)/
+
+       @re_word_tokenizer = /#{@re_multi_char_punct}|(?=#{@re_word_start})\S+?(?=\s|$|#{@re_non_word_chars}|#{@re_multi_char_punct}|,(?=$|\s|#{@re_non_word_chars}|#{@re_multi_char_punct}))|\S/
+
+       @re_period_context = /\S*#{@re_sent_end_chars}(?=(?<after_tok>#{@re_non_word_chars}|\s+(?<next_tok>\S+)))/
+     end
+
+     def word_tokenize(text)
+       text.scan(@re_word_tokenizer)
+     end
+
+   end
+ end
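
A quick sketch of the word tokenizer above. Note that trailing periods stay attached to the preceding word, which is what lets Punkt reason about abbreviations later:

    vars = Punkt::LanguageVars.new
    vars.word_tokenize("Mr. Smith arrived.")
    # => ["Mr.", "Smith", "arrived."]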
@@ -0,0 +1,37 @@
+ module Punkt
+   class Parameters
+
+     attr_accessor :abbreviation_types
+     attr_accessor :collocations
+     attr_accessor :sentence_starters
+     attr_accessor :orthographic_context
+
+     def initialize
+       clear_abbreviations
+       clear_collocations
+       clear_sentence_starters
+       clear_orthographic_context
+     end
+
+     def clear_abbreviations
+       @abbreviation_types = Set.new
+     end
+
+     def clear_collocations
+       @collocations = Set.new
+     end
+
+     def clear_sentence_starters
+       @sentence_starters = Set.new
+     end
+
+     def clear_orthographic_context
+       @orthographic_context = Hash.new(0)
+     end
+
+     def add_orthographic_context(type, flag)
+       @orthographic_context[type] |= flag
+     end
+
+   end
+ end
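
A small sketch of how a trainer fills these parameters in (the token types here are invented for illustration):

    params = Punkt::Parameters.new
    params.abbreviation_types << "mr"    # abbreviation types are stored lower-cased, without the period
    params.add_orthographic_context("smith", Punkt::ORTHO_BEG_UC)
    params.add_orthographic_context("smith", Punkt::ORTHO_MID_UC)
    (params.orthographic_context["smith"] & Punkt::ORTHO_UC) != 0   # => true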
@@ -0,0 +1,180 @@
+ module Punkt
+   class SentenceTokenizer < Base
+     def initialize(train_text_or_parameters,
+                    language_vars = Punkt::LanguageVars.new,
+                    token_class = Punkt::Token)
+
+       super(language_vars, token_class)
+
+       @trainer = nil
+
+       if train_text_or_parameters.kind_of?(String)
+         @parameters = train(train_text_or_parameters)
+       elsif train_text_or_parameters.kind_of?(Punkt::Parameters)
+         @parameters = train_text_or_parameters
+       else
+         raise "You need to pass trainer parameters or a text to train."
+       end
+     end
+
+     def sentences_from_text(text, options = {})
+       sentences = split_in_sentences(text)
+       sentences = realign_boundaries(text, sentences) if options[:realign_boundaries]
+       sentences = self.class.send(options[:output], text, sentences) if options[:output]
+
+       return sentences
+     end
+     alias_method :tokenize, :sentences_from_text
+
+     def sentences_from_tokens(tokens)
+       tokens = annotate_tokens(tokens.map { |t| @token_class.new(t) })
+
+       sentences = []
+       sentence = []
+       tokens.each do |t|
+         sentence << t.token
+         if t.sentence_break
+           sentences << sentence
+           sentence = []
+         end
+       end
+       sentences << sentence unless sentence.empty?
+
+       return sentences
+     end
+
+     class << self
+       def sentences_text(text, sentences_indexes)
+         sentences_indexes.map { |index| text[index[0]..index[1]] }
+       end
+
+       def tokenized_sentences(text, sentences_indexes)
+         tokenizer = Punkt::Base.new
+         self.sentences_text(text, sentences_indexes).map { |sentence| tokenizer.tokenize_words(sentence, :output => :string) }
+       end
+     end
+
+     private
+
+     def train(train_text)
+       @trainer = Punkt::Trainer.new(@language_vars, @token_class) unless @trainer
+       @trainer.train(train_text)
+       @parameters = @trainer.parameters
+     end
+
+     def split_in_sentences(text)
+       result = []
+       last_break = 0
+       current_sentence_start = 0
+       while match = @language_vars.re_period_context.match(text, last_break)
+         context = match[0] + match[:after_tok]
+         if text_contains_sentence_break?(context)
+           result << [current_sentence_start, (match.end(0) - 1)]
+           current_sentence_start = match[:next_tok] ? match.begin(:next_tok) : match.end(0)
+         end
+         if match[:next_tok]
+           last_break = match.begin(:next_tok)
+         else
+           last_break = match.end(0)
+         end
+       end
+       result << [current_sentence_start, (text.size - 1)]
+     end
+
+     def text_contains_sentence_break?(text)
+       found = false
+       annotate_tokens(tokenize_words(text)).each do |token|
+         return true if found
+         found = true if token.sentence_break
+       end
+       return false
+     end
+
+     def annotate_tokens(tokens)
+       tokens = annotate_first_pass(tokens)
+       tokens = annotate_second_pass(tokens)
+       return tokens
+     end
+
+     def annotate_second_pass(tokens)
+       pair_each(tokens) do |tok1, tok2|
+         next unless tok2
+         next unless tok1.ends_with_period?
+
+         token = tok1.token
+         type = tok1.type_without_period
+         next_token = tok2.token
+         next_type = tok2.type_without_sentence_period
+         token_is_initial = tok1.is_initial?
+
+         if @parameters.collocations.include?([type, next_type])
+           tok1.sentence_break = false
+           tok1.abbr = true
+           next
+         end
+
+         if (tok1.abbr || tok1.ellipsis) && !token_is_initial
+           is_sentence_starter = orthographic_heuristic(tok2)
+           if is_sentence_starter == true
+             tok1.sentence_break = true
+             next
+           end
+
+           if tok2.first_upper? && @parameters.sentence_starters.include?(next_type)
+             tok1.sentence_break = true
+             next
+           end
+         end
+
+         if token_is_initial || type == "##number##"
+           is_sentence_starter = orthographic_heuristic(tok2)
+           if is_sentence_starter == false
+             tok1.sentence_break = false
+             tok1.abbr = true
+             next
+           end
+
+           if is_sentence_starter == :unknown && token_is_initial &&
+              tok2.first_upper? && !(@parameters.orthographic_context[next_type] & Punkt::ORTHO_LC != 0)
+             tok1.sentence_break = false
+             tok1.abbr = true
+           end
+         end
+
+       end
+       return tokens
+     end
+
+     def orthographic_heuristic(aug_token)
+       return false if [';', ',', ':', '.', '!', '?'].include?(aug_token.token)
+
+       orthographic_context = @parameters.orthographic_context[aug_token.type_without_sentence_period]
+       return true if aug_token.first_upper? && (orthographic_context & Punkt::ORTHO_LC != 0) && !(orthographic_context & Punkt::ORTHO_MID_UC != 0)
+       return false if aug_token.first_lower? && ((orthographic_context & Punkt::ORTHO_UC != 0) || !(orthographic_context & Punkt::ORTHO_BEG_LC != 0))
+       return :unknown
+     end
+
+     def realign_boundaries(text, sentences)
+       result = []
+       realign = 0
+       pair_each(sentences) do |i1, i2|
+         s1 = text[i1[0]..i1[1]]
+         s2 = i2 ? text[i2[0]..i2[1]] : nil
+         # pull leading quotes/brackets of the next sentence back into this one
+         unless s2
+           result << [i1[0] + realign, i1[1]] if s1
+           next
+         end
+         if match = @language_vars.re_boundary_realignment.match(s2)
+           result << [i1[0] + realign, i1[1] + match[0].strip.size]
+           realign = match.end(0)
+         else
+           result << [i1[0] + realign, i1[1]] if s1
+           realign = 0
+         end
+       end
+       return result
+     end
+
+   end
+ end