scalpel 0.2

Files changed (4)
  1. data/LICENSE +18 -0
  2. data/README.md +40 -0
  3. data/lib/scalpel.rb +74 -0
  4. metadata +51 -0
data/LICENSE ADDED
@@ -0,0 +1,18 @@
1
+ Scalpel - A fast and accurate sentence segmentation tool for Ruby.
2
+
3
+ This program is free software: you can redistribute it and/or modify
4
+ it under the terms of the GNU General Public License as published by
5
+ the Free Software Foundation, either version 3 of the License, or
6
+ (at your option) any later version.
7
+
8
+ This program is distributed in the hope that it will be useful,
9
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
10
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
11
+ GNU General Public License for more details.
12
+
13
+ This license also applies to the included Stanford CoreNLP files.
14
+
15
+ You should have received a copy of the GNU General Public License
16
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
17
+
18
+ Author: Louis-Antoine Mullie (louis.mullie@gmail.com). Copyright 2012.
data/README.md ADDED
@@ -0,0 +1,40 @@
1
+ [![Build Status](https://secure.travis-ci.org/louismullie/scalpel.png)](http://travis-ci.org/#!/louismullie/scalpel)
2
+
3
+ **What?**
4
+
5
+ Scalpel is the result of my inability to find a simple and elegant solution to sentence segmentation in Ruby. Machine learning approaches - both unsupervised ([punkt-segmenter](https://github.com/lfcipriani/punkt-segmenter)) and supervised ([tactful_tokenizer](https://github.com/SlyShy/Tactful_Tokenizer)) - depend on proper domain-specific training to work well. Stanford's tokenize-first group-later method ([stanford-core-nlp](https://github.com/louismullie/stanford-core-nlp)) does not work so well in the face of ill-formatted content. Finally, extensive rule-based methods ([srx-english](https://github.com/apohllo/srx-english)) are very accurate but suffer from poor performance.
6
+
7
+ Scalpel is based on a very simple principle that reduces the complexity of performing sentence segmentation. The idea is that it is simpler and more efficient to find occurrences of periods that do __not__ indicate the end of a sentence, rather than those that do. These occurrences are temporarily replaced by "placeholder" characters, and sentence splitting is subsequently performed. The placeholder characters are then replaced by the original characters.
8
+
9
+ **How?**
10
+
11
+ gem install scalpel
12
+
13
+ ```ruby
14
+ require 'scalpel'
15
+ Scalpel.cut("some text")
16
+ ```
17
+
18
+ **Why?**
19
+
20
+ ![Benchmark](http://www.louismullie.com/ruby/scalpel/comparison.png)
21
+
22
+ * Scalpel is 80x faster than the only other method with 100% accuracy.
23
+ * Scalpel is 5x faster than the next fastest-running segmentation tool.
24
+ * Scalpel loads twice as fast as the next fastest-loading segmentation tool.
25
+
26
+ A few notes on the benchmark:
27
+
28
+ * As with any benchmark, your mileage may vary, take it with a grain of salt, etc.
29
+ * Loading time for the Punkt segmenter is dependent on the size of the model.
30
+ * The Stanford segmenter that was used is the one in DocumentPreprocessor, accessed via Rjb.
31
+
32
+ The text that was used is the following:
33
+
34
+ ```ruby
35
+ "For years, people in the U.A.E.R. have accepted murky air, tainted waters and scarred landscapes as the unavoidable price of the country’s meteoric economic growth. But public dissent over environmental issues has been growing steadily in the communist nation, and now seems to be building the foundations of a fledgling green movement! In July alone, two separate demonstrations made international news when they turned violent after about 1.5 minutes... These recent successes come after a slew of ever-larger and more violent green protests over the past few years, as the environmentalist Dr. Jeung of China’s growth becomes harder to ignore.Some ask: “Are demonstrations are evidence of the public anger and frustration at opaque environmental management and decision-making?” Others yet say: \"Should we be scared about these 'protests'?\""
36
+ ```
37
+
38
+ **Contributing**
39
+
40
+ Feel free to fork the project and send me a pull request!
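The mask, split, restore principle described in the README can be sketched in a few lines. This is a minimal illustration only, not Scalpel's actual rule set: the `naive_cut` method and the `PLACEHOLDER` constant are hypothetical names, and the sketch handles a single case (periods inside decimal numbers).

```ruby
# Hypothetical illustration of mask -> split -> restore; not Scalpel's rule set.
PLACEHOLDER = '&@&' # stands in for a period that does not end a sentence

def naive_cut(text)
  # Work on a copy so the caller's string is left untouched.
  masked = text.to_s.dup
  # Mask the period inside decimal numbers such as 1.5.
  masked.gsub!(/(\d)\.(\d)/) { "#{$1}#{PLACEHOLDER}#{$2}" }
  # Cut after each run of remaining sentence enders, keeping the enders.
  pieces = masked.scan(/[^.!?]+[.!?]*/)
  # Restore the masked periods and trim surrounding whitespace.
  pieces.map { |s| s.gsub(PLACEHOLDER, '.').strip }.reject(&:empty?)
end

p naive_cut('It lasted 1.5 minutes. Then it ended!')
```

Each additional rule (titles, abbreviations, quotes) would follow the same pattern: one substitution before the split and one matching restoration after it.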
data/lib/scalpel.rb ADDED
@@ -0,0 +1,74 @@
1
+ # encoding: utf-8
2
+ #
3
+ # Sentence segmentation based on a set of predefined
4
+ # rules that handle a large number of usage cases of
5
+ # sentence enders. The idea is to remove all cases of
6
+ # .!? being used for other purposes than marking a
7
+ # full stop before naively segmenting the text.
8
+ class Scalpel
9
+
10
+ # Current version.
11
+ VERSION = '0.2'
12
+
13
+ # Segment a text using the Scalpel algorithm.
14
+ # This will eventually be ported to a gem.
15
+ def self.cut(text)
16
+
17
+ # Get a copy of the string.
18
+ text = text.to_s.dup
19
+ # Remove composite abbreviations.
20
+ text.gsub!('et al.', '&&&')
21
+ # Remove suspension points.
22
+ text.gsub!('...', '&;&.')
23
+ # Remove floating point numbers.
24
+ text.gsub!(/([0-9]+)\.([0-9]+)/) { $1 + '&@&' + $2 }
25
+ # Remove abbreviations.
26
+ text.gsub!(/(?:[A-Za-z]\.){2,}/) { |abbr| abbr.gsub('.', '&-&') }
27
+ # Remove titles.
28
+ text.gsub!(/[A-Z][a-z]{1,2}\./) { |title| title.gsub('.', '&*&') }
29
+ # Unstick sentences from each other.
30
+ text.gsub!(/([^.?!]\.|\!|\?)([^\s"'])/) { $1 + ' ' + $2 }
31
+ # Remove sentence enders next to quotes.
32
+ text.gsub!(/'([.?!])\s?"/) { '&^&' + $1 }
33
+ text.gsub!(/'([.?!])\s?”/) { '&*&' + $1 }
34
+ text.gsub!(/([.?!])\s?”/) { '&=&' + $1 }
35
+ text.gsub!(/([.?!])\s?'"/) { '&,&' + $1 }
36
+ text.gsub!(/([.?!])\s?'/) { '&%&' + $1 }
37
+ text.gsub!(/([.?!])\s?"/) { '&$&' + $1 }
38
+ # Split on any sentence ender.
39
+ sentences = text.split(/([.!?])/)
40
+ new_sents = []
41
+ # Join the obtained slices.
42
+ sentences.each_slice(2) do |slice|
43
+ new_sents << slice.join('')
44
+ end
45
+ # Repair the damage we've done.
46
+ results = []
47
+ new_sents.each do |sentence|
48
+ # Skip whitespace zones.
49
+ next if sentence.strip == ''
50
+ # Repair composite abbreviations.
51
+ sentence.gsub!('&&&', 'et al.')
52
+ # Repair abbreviations.
53
+ sentence.gsub!('&-&', '.')
54
+ # Repair titles; '&*&' followed by an ender is a curly-quote placeholder, restored below.
55
+ sentence.gsub!(/&\*&(?![.?!])/, '.')
56
+ # Repair suspension points.
57
+ sentence.gsub!('&;&.', '...')
58
+ # Repair floats.
59
+ sentence.gsub!(/([0-9]+)&@&([0-9]+)/) { $1 + '.' + $2 }
60
+ # Repair quotes with sentence enders
61
+ sentence.gsub!(/&=&([.!?])/) { $1 + '”' }
62
+ sentence.gsub!(/&,&([.!?])/) { $1 + "'\"" }
63
+ sentence.gsub!(/&%&([.!?])/) { $1 + "'" }
64
+ sentence.gsub!(/&\^&([.?!])/) { "'" + $1 + '"' }
65
+ sentence.gsub!(/&\*&([.?!])/) { "'" + $1 + '”' }
66
+ sentence.gsub!(/&\$&([.!?])/) { $1 + '"' }
67
+ results << sentence.strip
68
+ end
69
+
70
+ results
71
+
72
+ end
73
+
74
+ end
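Two details of the listing above are easy to miss, and a short standalone sketch may help. First, `String#split` with a capturing group keeps the delimiters in the result, which is why the slices are rejoined in pairs. Second, `$` is an end-of-line anchor in a Ruby regular expression, so a placeholder containing a literal dollar sign only matches when the `$` is escaped. The sample strings below are made up for illustration.

```ruby
# Splitting on a capturing group keeps the sentence enders in the array.
parts = 'One. Two? Three'.split(/([.!?])/)
# parts is ["One", ".", " Two", "?", " Three"]

# Pair each chunk with the ender that follows it and glue them back together.
sentences = parts.each_slice(2).map { |slice| slice.join.strip }
# sentences is ["One.", "Two?", "Three"]

# `$` anchors to end of line, so the unescaped pattern never matches here.
masked = 'hello&$&.'
unescaped = masked.gsub(/&$&([.!?])/) { $1 + '"' }  # string left unchanged
escaped   = masked.gsub(/&\$&([.!?])/) { $1 + '"' } # placeholder restored

p sentences, unescaped, escaped
```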
metadata ADDED
@@ -0,0 +1,51 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: scalpel
3
+ version: !ruby/object:Gem::Version
4
+ version: '0.2'
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Louis Mullie
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2012-08-15 00:00:00.000000000 Z
13
+ dependencies: []
14
+ description: ! ' Scalpel is a sentence segmentation tool for Ruby. It allows you to
15
+ split a text into an array of sentences. It is simple, lightweight, blazing fast
16
+ and does not require any domain-specific training. It works well even in the face
17
+ of ill-formatted texts. '
18
+ email:
19
+ - louis.mullie@gmail.com
20
+ executables: []
21
+ extensions: []
22
+ extra_rdoc_files: []
23
+ files:
24
+ - lib/scalpel.rb
25
+ - README.md
26
+ - LICENSE
27
+ homepage: https://github.com/louismullie/scalpel
28
+ licenses: []
29
+ post_install_message:
30
+ rdoc_options: []
31
+ require_paths:
32
+ - lib
33
+ required_ruby_version: !ruby/object:Gem::Requirement
34
+ none: false
35
+ requirements:
36
+ - - ! '>='
37
+ - !ruby/object:Gem::Version
38
+ version: '0'
39
+ required_rubygems_version: !ruby/object:Gem::Requirement
40
+ none: false
41
+ requirements:
42
+ - - ! '>='
43
+ - !ruby/object:Gem::Version
44
+ version: '0'
45
+ requirements: []
46
+ rubyforge_project:
47
+ rubygems_version: 1.8.24
48
+ signing_key:
49
+ specification_version: 3
50
+ summary: A fast and accurate rule-based sentence segmentation tool for Ruby.
51
+ test_files: []
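The metadata above is the serialized form of a gem specification. The original gemspec file is not part of this diff, so the following is only a sketch of a specification that could produce similar metadata. The field values are copied from the metadata, except for `licenses`, which the published metadata leaves empty; the `GPL-3.0-or-later` identifier is an assumption based on the LICENSE text.

```ruby
require 'rubygems'

# Hypothetical gemspec sketch; the real scalpel.gemspec is not shown in the diff.
spec = Gem::Specification.new do |s|
  s.name     = 'scalpel'
  s.version  = '0.2'
  s.authors  = ['Louis Mullie']
  s.email    = ['louis.mullie@gmail.com']
  s.homepage = 'https://github.com/louismullie/scalpel'
  s.summary  = 'A fast and accurate rule-based sentence segmentation tool for Ruby.'
  s.files    = ['lib/scalpel.rb', 'README.md', 'LICENSE']
  # Assumption: declared here to match the LICENSE file (GPL version 3 or later).
  s.licenses = ['GPL-3.0-or-later']
end

puts spec.full_name
```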