RubyGems - pragmatic_segmenter - Versions diffs - 0.0.6 → 0.0.7 - Mend

pragmatic_segmenter 0.0.6 → 0.0.7

Files changed (8)

checksums.yaml +4 -4
data/README.md +39 -8
data/lib/pragmatic_segmenter/abbreviation_replacer.rb +3 -1
data/lib/pragmatic_segmenter/list.rb +34 -14
data/lib/pragmatic_segmenter/rules.rb +17 -17
data/lib/pragmatic_segmenter/version.rb +1 -1
data/spec/pragmatic_segmenter_spec.rb +30 -0
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: ae3798fa47a86a8928835153af20c91d181ab2d5
-  data.tar.gz: 265670562de5e8b25aa90454919f044c910353f8
+  metadata.gz: 0f7d9ee20797db1385cc6b19dd6a9029f51355bc
+  data.tar.gz: 44a2066e7e3cc3a08e7a53a0a31b35d49687471a
 SHA512:
-  metadata.gz: 33eea4d021662c497763950fb5815e29b59214d2c4d7056f77f081ea3edc77a0fc9c54a9467d89dc5d278174e316ced772a6f826bb477c711239eb2b0d0b1722
-  data.tar.gz: 73fb101b5a2c6a3d2f57bdede1d37a70286519dd54ca93aaa2cfc467dd5202961bd3f27b3f25c906ea67e3038542823303bbb411bb4ffd6fafd9f7145f3aa116
+  metadata.gz: 6bc171f4cda7cddce161dc2ce1f7acddf0c90a9602316530a7378d48ec26fc41e335c34bd560ecc726c1f6cd16b363d37e153feac0ce11c04b01b67f89983522
+  data.tar.gz: 498ec2b1e6b8ef8b7f6f07f482e6805ab98cfc5c06d5a53cf074929b6928461876f46457f093e572342f2e716dcdc2914ce040562a20a945b46c48ca0c4af3ef

data/README.md CHANGED

@@ -641,14 +641,14 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 Name                                                                 | Programming Language | License                                             | GRS (English) | GRS (Other Languages)† | Speed‡
 ---------------------------------------------------------------------| -------------------- | --------------------------------------------------- | ------------- | ---------------------- | -------
-Pragmatic Segmenter                                                  | Ruby                 | [MIT](http://opensource.org/licenses/MIT)           | 98.04%        | 100.00%                | 3.84 s
-[TactfulTokenizer](https://github.com/zencephalon/Tactful_Tokenizer) | Ruby                 | [GNU GPLv3](http://www.gnu.org/copyleft/gpl.html)   | 66.67%        | 45.45%                 | 46.32 s
-[OpenNLP](https://opennlp.apache.org/)                               | Java                 | [APLv2](http://www.apache.org/licenses/LICENSE-2.0) | 60.78%        | 42.42%                 | 1.27 s
-[Standford CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml)  | Java                 | [GNU GPLv3](http://www.gnu.org/copyleft/gpl.html)   | 58.82%        | 27.27%                 | 0.92 s
-[Splitta](http://www.nltk.org/_modules/nltk/tokenize/punkt.html)     | Python               | [APLv2](http://www.apache.org/licenses/LICENSE-2.0) | 56.86%        | 33.33%                 | N/A
-[Punkt](http://www.nltk.org/_modules/nltk/tokenize/punkt.html)       | Python               | [APLv2](http://www.apache.org/licenses/LICENSE-2.0) | 47.06%        | 45.45%                 | 1.79 s
-[SRX English](https://github.com/apohllo/srx-english)                | Ruby                 | [GNU GPLv3](http://www.gnu.org/copyleft/gpl.html)   | 29.41%        | 24.24%                 | 6.19 s
-[Scapel](https://github.com/louismullie/scalpel)                     | Ruby                 | [GNU GPLv3](http://www.gnu.org/copyleft/gpl.html)   | 27.45%        | 15.15%                 | 0.13 s
+Pragmatic Segmenter                                                  | Ruby                 | [MIT](http://opensource.org/licenses/MIT)           | 98.08%        | 100.00%                | 3.84 s
+[TactfulTokenizer](https://github.com/zencephalon/Tactful_Tokenizer) | Ruby                 | [GNU GPLv3](http://www.gnu.org/copyleft/gpl.html)   | 65.38%        | 45.45%                 | 46.32 s
+[OpenNLP](https://opennlp.apache.org/)                               | Java                 | [APLv2](http://www.apache.org/licenses/LICENSE-2.0) | 59.62%        | 42.42%                 | 1.27 s
+[Standford CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml)  | Java                 | [GNU GPLv3](http://www.gnu.org/copyleft/gpl.html)   | 59.62%        | 27.27%                 | 0.92 s
+[Splitta](http://www.nltk.org/_modules/nltk/tokenize/punkt.html)     | Python               | [APLv2](http://www.apache.org/licenses/LICENSE-2.0) | 55.77%        | 33.33%                 | N/A
+[Punkt](http://www.nltk.org/_modules/nltk/tokenize/punkt.html)       | Python               | [APLv2](http://www.apache.org/licenses/LICENSE-2.0) | 46.15%        | 45.45%                 | 1.79 s
+[SRX English](https://github.com/apohllo/srx-english)                | Ruby                 | [GNU GPLv3](http://www.gnu.org/copyleft/gpl.html)   | 30.77%        | 24.24%                 | 6.19 s
+[Scapel](https://github.com/louismullie/scalpel)                     | Ruby                 | [GNU GPLv3](http://www.gnu.org/copyleft/gpl.html)   | 28.85%        | 15.15%                 | 0.13 s
 †GRS (Other Languages) is the total of the Golden Rules listed above for all languages other than English. This metric by no means includes all languages, only the ones that have Golden Rules listed above.
 ‡ Speed is based on the performance benchmark results detailed in the section "Speed Performance Benchmarks" below. The number is an average of 10 runs.
@@ -707,6 +707,37 @@ To test the relative performance of different segmentation tools and libraries I
 * Add abbreviation lists for any languages that do not currently have one (only relevant for languages that have the concept of abbreviations with periods)
 * Get Golden Rule #18 passing - Handling of a.m. or p.m. followed by a capitalized non sentence starter (ex. "At 5 p.m. Mr. Smith went to the bank. He left the bank at 6 p.m. Next he went to the store." --> ["At 5 p.m. Mr. Smith went to the bank.", "He left the bank at 6 p.m.", "Next he went to the store."])
+## Change Log
+**Version 0.0.1**
+* Initial Release
+**Version 0.0.2**
+* Major design refactor
+**Version 0.0.3**
+* Add travis.yml
+* Add Code Climate
+* Update README
+**Version 0.0.4**
+* Add `ConsecutiveForwardSlashRule` to cleaner
+* Refactor `segmenter.rb` and `process.rb`
+**Version 0.0.5**
+* Make symbol substitution safer
+* Refactor `process.rb`
+* Update cleaner with escaped newline rules
+**Version 0.0.6**
+* Add rule for escaped newlines that include a space between the slash and character
+* Add Golden Rule #52 and code to make it pass
+**Version 0.0.7**
+* Add change log to README
+* Add passing spec for new end of sentence abbreviation (EN)
+* Add roman numeral list support
 ## Contributing
 If you find a text that is incorrectly segmented using this gem, please submit an issue.

data/lib/pragmatic_segmenter/abbreviation_replacer.rb CHANGED

@@ -28,7 +28,7 @@ module PragmaticSegmenter
       All = [UpperCasePmRule, UpperCaseAmRule, LowerCasePmRule, LowerCaseAmRule]
     end
-    SENTENCE_STARTERS = %w(A Being Did For He How However I In Millions More She That The There They We What When Where Who Why)
+    SENTENCE_STARTERS = %w(A Being Did For He How However I In It Millions More She That The There They We What When Where Who Why)
     attr_reader :text
     def initialize(text:)
@@ -109,6 +109,8 @@ module PragmaticSegmenter
               .gsub(/U∯S∯A∯\s#{Regexp.escape(word)}\s/, "U∯S∯A\.\s#{Regexp.escape(word)}\s")
               .gsub(/U\.S\.A∯\s#{Regexp.escape(word)}\s/, "U\.S\.A\.\s#{Regexp.escape(word)}\s")
               .gsub(/I∯\s#{Regexp.escape(word)}\s/, "I\.\s#{Regexp.escape(word)}\s")
+              .gsub(/i.v∯\s#{Regexp.escape(word)}\s/, "i\.v\.\s#{Regexp.escape(word)}\s")
+              .gsub(/I.V∯\s#{Regexp.escape(word)}\s/, "I\.V\.\s#{Regexp.escape(word)}\s")
       end
       txt
     end
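
The two added `gsub` calls restore a real period after `i.v` / `I.V` when the abbreviation is followed by a sentence starter, so the segmenter can break there. A minimal sketch of that idea, assuming `∯` is the placeholder for an abbreviation period as in the hunk above; the method name `restore_sentence_final_periods` and the shortened starter list are hypothetical, and the sketch escapes the period between the letters, which the patterns in the diff leave as a wildcard:

```ruby
# Shortened, illustrative list; the gem's SENTENCE_STARTERS is longer.
STARTERS_SKETCH = %w(He It She The They We)

# Hypothetical helper, not the gem's method: turn the placeholder back into
# a period when the abbreviation is followed by a sentence starter.
def restore_sentence_final_periods(txt)
  STARTERS_SKETCH.each do |word|
    escaped = Regexp.escape(word)
    txt = txt
          .gsub(/i\.v∯\s#{escaped}\s/, "i.v. #{word} ")
          .gsub(/I\.V∯\s#{escaped}\s/, "I.V. #{word} ")
  end
  txt
end

puts restore_sentence_final_periods("She gave him the i.v∯ It was night.")
```

Text such as `the i.v∯ in his vein.` is left alone because `in` is not a sentence starter, so the placeholder keeps that period from ending the sentence.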

data/lib/pragmatic_segmenter/list.rb CHANGED

@@ -6,11 +6,11 @@ module PragmaticSegmenter
   class List
     # Rubular: http://rubular.com/r/XcpaJKH0sz
     ALPHABETICAL_LIST_WITH_PERIODS =
-      /(?<=^)[a-z](?=\.)|(?<=\A)[a-z](?=\.)|(?<=\s)[a-z](?=\.)/i
+      /(?<=^)[a-z](?=\.)|(?<=\A)[a-z](?=\.)|(?<=\s)[a-z](?=\.)/
-    # Rubular: http://rubular.com/r/0MIlImeBsC
+    # Rubular: http://rubular.com/r/Gu5rQapywf
     ALPHABETICAL_LIST_WITH_PARENS =
-      /(?<=^)[a-z](?=\))|(?<=\A)[a-z](?=\))|(?<=\s)[a-z](?=\))/i
+      /(?<=\()[a-z]+(?=\))|(?<=^)[a-z]+(?=\))|(?<=\A)[a-z]+(?=\))|(?<=\s)[a-z]+(?=\))/i
     SubstituteListPeriodRule = Rule.new(/♨/, '∯')
     ListMarkerRule = Rule.new(/☝/, '')
@@ -30,9 +30,9 @@ module PragmaticSegmenter
       /(?<=\s)\d+\.(?=\s)|^\d+\.(?=\s)|(?<=\s)\d+\.(?=\))|^\d+\.(?=\))|(?<=\s\-)\d+\.(?=\s)|(?<=^\-)\d+\.(?=\s)|(?<=\s\⁃)\d+\.(?=\s)|(?<=^\⁃)\d+\.(?=\s)|(?<=\s\-)\d+\.(?=\))|(?<=^\-)\d+\.(?=\))|(?<=\s\⁃)\d+\.(?=\))|(?<=^\⁃)\d+\.(?=\))/
     NUMBERED_LIST_PARENS_REGEX = /\d+(?=\)\s)/
-    # Rubular: http://rubular.com/r/0MIlImeBsC
+    # Rubular: http://rubular.com/r/NsNFSqrNvJ
     EXTRACT_ALPHABETICAL_LIST_LETTERS_REGEX =
-      /(?<=^)[a-z](?=\))|(?<=\A)[a-z](?=\))|(?<=\s)[a-z](?=\))/i
+      /(?<=\()[a-z]+(?=\))|(?<=^)[a-z]+(?=\))|(?<=\A)[a-z]+(?=\))|(?<=\s)[a-z]+(?=\))/i
     # Rubular: http://rubular.com/r/wMpnVedEIb
     ALPHABETICAL_LIST_LETTERS_AND_PERIODS_REGEX =
@@ -45,6 +45,7 @@ module PragmaticSegmenter
     def add_line_break
       formatted_text = format_alphabetical_lists(text)
+      formatted_text = format_roman_numeral_lists(formatted_text)
       formatted_text = format_numbered_list_with_periods(formatted_text)
       format_numbered_list_with_parens(formatted_text)
     end
@@ -64,8 +65,13 @@ module PragmaticSegmenter
     end
     def format_alphabetical_lists(txt)
-      new_txt = add_line_breaks_for_alphabetical_list_with_periods(txt)
-      add_line_breaks_for_alphabetical_list_with_parens(new_txt)
+      new_txt = add_line_breaks_for_alphabetical_list_with_periods(txt, false)
+      add_line_breaks_for_alphabetical_list_with_parens(new_txt, false)
+    end
+    def format_roman_numeral_lists(txt)
+      new_txt = add_line_breaks_for_alphabetical_list_with_periods(txt, true)
+      add_line_breaks_for_alphabetical_list_with_parens(new_txt, true)
     end
     def replace_periods_in_numbered_list(txt)
@@ -112,12 +118,12 @@ module PragmaticSegmenter
       end
     end
-    def add_line_breaks_for_alphabetical_list_with_periods(txt)
-      iterate_alphabet_array(ALPHABETICAL_LIST_WITH_PERIODS, false, txt)
+    def add_line_breaks_for_alphabetical_list_with_periods(txt, roman_numeral)
+      iterate_alphabet_array(ALPHABETICAL_LIST_WITH_PERIODS, false, txt, roman_numeral)
     end
-    def add_line_breaks_for_alphabetical_list_with_parens(txt)
-      iterate_alphabet_array(ALPHABETICAL_LIST_WITH_PARENS, true, txt)
+    def add_line_breaks_for_alphabetical_list_with_parens(txt, roman_numeral)
+      iterate_alphabet_array(ALPHABETICAL_LIST_WITH_PARENS, true, txt, roman_numeral)
     end
     def replace_alphabet_list(a, txt)
@@ -128,7 +134,11 @@ module PragmaticSegmenter
     def replace_alphabet_list_parens(a, txt)
       txt.gsub!(EXTRACT_ALPHABETICAL_LIST_LETTERS_REGEX).with_index do |m|
-        a.eql?(m) ? "\r#{Regexp.escape(a.to_s)}" : "#{m}"
+        if txt =~ /\(#{Regexp.escape(m.to_s)}\)/i
+          a.eql?(m.dup.downcase) ? "\rȸ(#{Regexp.escape(m.to_s)}" : "#{m}"
+        else
+          a.eql?(m.dup.downcase) ? "\r#{Regexp.escape(m.to_s)}" : "#{m}"
+        end
       end
     end
@@ -141,19 +151,29 @@ module PragmaticSegmenter
     end
     def last_array_item_replacement(a, i, alphabet, list_array, txt, parens)
+      return if alphabet & list_array == [] ||
+        !alphabet.include?(list_array[i - 1]) ||
+        !alphabet.include?(a)
       return if (alphabet.index(list_array[i - 1]) - alphabet.index(a)).abs != 1
       replace_correct_alphabet_list(a, txt, parens)
     end
     def other_items_replacement(a, i, alphabet, list_array, txt, parens)
+      return if alphabet & list_array == [] ||
+        !alphabet.include?(list_array[i - 1]) ||
+        !alphabet.include?(a)
       return if alphabet.index(list_array[i + 1]) - alphabet.index(a) != 1 &&
                 (alphabet.index(list_array[i - 1]) - alphabet.index(a)).abs != 1
       replace_correct_alphabet_list(a, txt, parens)
     end
-    def iterate_alphabet_array(regex, parens, txt)
+    def iterate_alphabet_array(regex, parens, txt, roman_numeral)
       list_array = txt.scan(regex).map(&:downcase)
-      alphabet = ('a'..'z').to_a
+      if roman_numeral
+        alphabet = %w(i ii iii iv v vi vii viii ix x xi xii xiii xiv x xi xii xiii xv xvi xvii xviii xix xx)
+      else
+        alphabet = ('a'..'z').to_a
+      end
       list_array.each_with_index do |a, i|
         if i.eql?(list_array.length - 1)
           last_array_item_replacement(a, i, alphabet, list_array, txt, parens)

data/lib/pragmatic_segmenter/rules.rb CHANGED

@@ -36,15 +36,15 @@ module PragmaticSegmenter
     end
     module ReinsertEllipsisRules
-      ThreeConsecutivePeriod = Rule.new(/ƪ/, '...')
-      ThreeSpacePeriod = Rule.new(/♟/, ' . . . ')
-      FourSpacePeriod = Rule.new(/♝/, '. . . .')
-      TwoConsecutivePeriod = Rule.new(/☏/, '..')
-      OnePeriod = Rule.new(/∮/, '.')
-      All = [ ThreeConsecutivePeriod, ThreeSpacePeriod,
-              FourSpacePeriod, TwoConsecutivePeriod,
-              OnePeriod ]
+      SubThreeConsecutivePeriod = Rule.new(/ƪ/, '...')
+      SubThreeSpacePeriod = Rule.new(/♟/, ' . . . ')
+      SubFourSpacePeriod = Rule.new(/♝/, '. . . .')
+      SubTwoConsecutivePeriod = Rule.new(/☏/, '..')
+      SubOnePeriod = Rule.new(/∮/, '.')
+      All = [ SubThreeConsecutivePeriod, SubThreeSpacePeriod,
+              SubFourSpacePeriod, SubTwoConsecutivePeriod,
+              SubOnePeriod ]
     end
     module SubSymbolsRules
@@ -86,14 +86,14 @@ module PragmaticSegmenter
     end
     module SubEscapedRegexReservedCharacters
-      LeftParen = Rule.new('\\(', '(')
-      RightParen = Rule.new('\\)', ')')
-      LeftBracket = Rule.new('\\[', '[')
-      RightBracket = Rule.new('\\]', ']')
-      Dash = Rule.new('\\-', '-')
-      All = [ LeftParen, RightParen,
-              LeftBracket, RightBracket, Dash ]
+      SubLeftParen = Rule.new('\\(', '(')
+      SubRightParen = Rule.new('\\)', ')')
+      SubLeftBracket = Rule.new('\\[', '[')
+      SubRightBracket = Rule.new('\\]', ']')
+      SubDash = Rule.new('\\-', '-')
+      All = [ SubLeftParen, SubRightParen,
+              SubLeftBracket, SubRightBracket, SubDash ]
     end
   end
 end
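
The renamed constants in `ReinsertEllipsisRules` each pair a placeholder symbol with the period sequence it stands for. A rough sketch of how such pattern/replacement pairs could be applied in order, assuming a `Struct`-based stand-in for `Rule` and a hypothetical `apply_rules` helper (neither is taken from the gem's source):

```ruby
# Stand-in for the gem's Rule: a pattern plus its replacement string.
Rule = Struct.new(:pattern, :replacement)

# Same placeholder-to-period pairs as the hunk above.
REINSERT_ELLIPSIS = [
  Rule.new(/ƪ/, '...'),
  Rule.new(/♟/, ' . . . '),
  Rule.new(/♝/, '. . . .'),
  Rule.new(/☏/, '..'),
  Rule.new(/∮/, '.')
].freeze

# Hypothetical helper: apply each rule in turn to the text.
def apply_rules(text, rules)
  rules.reduce(text) { |acc, rule| acc.gsub(rule.pattern, rule.replacement) }
end

puts apply_rules("Waitƪ what☏", REINSERT_ELLIPSIS)
```

Because each replacement contains only periods and spaces, the order of the rules does not affect the result in this sketch.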

data/lib/pragmatic_segmenter/version.rb CHANGED

@@ -1,3 +1,3 @@
 module PragmaticSegmenter
-  VERSION = "0.0.6"
+  VERSION = "0.0.7"
 end

data/spec/pragmatic_segmenter_spec.rb CHANGED

@@ -883,6 +883,36 @@ RSpec.describe PragmaticSegmenter::Segmenter do
         ps = PragmaticSegmenter::Segmenter.new(text: 'Hello World. \ r \ nHello.', language: 'en')
         expect(ps.segment).to eq(["Hello World.", "Hello."])
       end
+      it "correctly segments text #083" do
+        ps = PragmaticSegmenter::Segmenter.new(text: "The nurse gave him the i.v. in his vein. She gave him the i.v. It was a great I.V. that she gave him. She gave him the I.V. It was night.", language: "en")
+        expect(ps.segment).to eq(["The nurse gave him the i.v. in his vein.", "She gave him the i.v.", "It was a great I.V. that she gave him.", "She gave him the I.V.", "It was night."])
+      end
+      it "correctly segments text #084" do
+        ps = PragmaticSegmenter::Segmenter.new(text: "(i) Hello world. \n(ii) Hello world.\n(iii) Hello world.\n(iv) Hello world.\n(v) Hello world.\n(vi) Hello world.", language: "en")
+        expect(ps.segment).to eq(["(i) Hello world.", "(ii) Hello world.", "(iii) Hello world.", "(iv) Hello world.", "(v) Hello world.", "(vi) Hello world."])
+      end
+      it "correctly segments text #085" do
+        ps = PragmaticSegmenter::Segmenter.new(text: "i) Hello world. \nii) Hello world.\niii) Hello world.\niv) Hello world.\nv) Hello world.\nvi) Hello world.", language: "en")
+        expect(ps.segment).to eq(["i) Hello world.", "ii) Hello world.", "iii) Hello world.", "iv) Hello world.", "v) Hello world.", "vi) Hello world."])
+      end
+      it "correctly segments text #086" do
+        ps = PragmaticSegmenter::Segmenter.new(text: "(a) Hello world. \n(b) Hello world.\n(c) Hello world.\n(d) Hello world.\n(e) Hello world.\n(f) Hello world.", language: "en")
+        expect(ps.segment).to eq(["(a) Hello world.", "(b) Hello world.", "(c) Hello world.", "(d) Hello world.", "(e) Hello world.", "(f) Hello world."])
+      end
+      it "correctly segments text #087" do
+        ps = PragmaticSegmenter::Segmenter.new(text: "(A) Hello world. \n(B) Hello world.\n(C) Hello world.\n(D) Hello world.\n(E) Hello world.\n(F) Hello world.", language: "en")
+        expect(ps.segment).to eq(["(A) Hello world.", "(B) Hello world.", "(C) Hello world.", "(D) Hello world.", "(E) Hello world.", "(F) Hello world."])
+      end
+      it "correctly segments text #088" do
+        ps = PragmaticSegmenter::Segmenter.new(text: "A) Hello world. \nB) Hello world.\nC) Hello world.\nD) Hello world.\nE) Hello world.\nF) Hello world.", language: "en")
+        expect(ps.segment).to eq(["A) Hello world.", "B) Hello world.", "C) Hello world.", "D) Hello world.", "E) Hello world.", "F) Hello world."])
+      end
     end
   end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pragmatic_segmenter
 version: !ruby/object:Gem::Version
-  version: 0.0.6
+  version: 0.0.7
 platform: ruby
 authors:
 - Kevin S. Dias
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-01-11 00:00:00.000000000 Z
+date: 2015-01-12 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler