pragmatic_segmenter 0.3.13 → 0.3.14

Files changed (8)

checksums.yaml +4 -4
data/CODE_OF_CONDUCT.md +1 -1
data/NEWS +4 -0
data/README.md +24 -21
data/lib/pragmatic_segmenter/languages/common.rb +1 -1
data/lib/pragmatic_segmenter/version.rb +1 -1
data/spec/pragmatic_segmenter/languages/english_spec.rb +15 -0
metadata +3 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: c313d610281828819a76463bd3b42590927307e2
-  data.tar.gz: 3c9f340a197450a6dffac38f4a3c0b378f3b8edf
+  metadata.gz: 1f4cb40793083044dcf653f7c5c2c64d6e1f9476
+  data.tar.gz: df80bb40afbb05d6f9d1109f0fbb412b34fd7171
 SHA512:
-  metadata.gz: d8756402fcb03f456f27d8359b4e8adfd970fed42a89bdfebab9b7a75b25d83dd9b4db19b1f7982b71214c573500b519cbc1920b6a54fa2632a3521dabee9a68
-  data.tar.gz: 911a665d2609086e20aff601161e63ce22410e18c0d169fc9d8032cef029062aa7a3cdafa082c55db1ab78ae6f2e5f7a1f302246c716f5393a815ed9aec890dc
+  metadata.gz: 3296133dffb5dc2a1c3084f31ecd6115c9e5ffbcb53590ee751fa1e69159dbffadc0bbb6885f7fb09f38ec53aceaf07e2da8942ba6135ee344ae34fd1fac4d83
+  data.tar.gz: 6c3230195453e40c8a4bbfcb9b864a369db7a3beebcf4f61f37b11f2f8dff454a173456a59fa298d5962560441828fbec6b3d926e4dd033785cd4b8a7a716a05
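
The digests above change because both archive members were rebuilt for the new release. A minimal sketch of how such a recorded SHA512 value could be checked, assuming the `.gem` archive has already been unpacked and the bytes of `data.tar.gz` read into a string. The helper name `checksum_matches?` and the sample bytes are hypothetical and are not part of the gem or of RubyGems:

```ruby
require 'digest'

# Hypothetical helper: compare the SHA512 hex digest of some bytes
# against a value recorded in a checksums file.
def checksum_matches?(bytes, expected_sha512_hex)
  Digest::SHA512.hexdigest(bytes) == expected_sha512_hex.strip.downcase
end

# Stand-in bytes; in practice this would be File.binread("data.tar.gz").
sample = "example bytes standing in for data.tar.gz"
recorded = Digest::SHA512.hexdigest(sample)

puts checksum_matches?(sample, recorded)        # unchanged bytes
puts checksum_matches?(sample + "x", recorded)  # modified bytes
```

Any change to the packaged files, even a one-line edit, yields a different digest, which is why every release diff shows new values here.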

data/CODE_OF_CONDUCT.md CHANGED

@@ -1,4 +1,4 @@
-#Contributor Code of Conduct
+# Contributor Code of Conduct
 As contributors and maintainers of this project, and in the interest of fostering an open and welcoming community, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.

data/NEWS CHANGED

@@ -1,3 +1,7 @@
+0.3.13 (2017-06-28):
+* Improvement: Add English abbreviation Rs. to denote the Indian currency
 0.3.13 (2017-01-17):
 * Bug Fix: Unexpected sentence break between abbreviation and hyphen

data/README.md CHANGED

@@ -1,10 +1,10 @@
-#Pragmatic Segmenter
+# Pragmatic Segmenter
 [![Gem Version](https://badge.fury.io/rb/pragmatic_segmenter.svg)](http://badge.fury.io/rb/pragmatic_segmenter) [![Code Climate](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/gpa.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![Build Status](https://travis-ci.org/diasks2/pragmatic_segmenter.png)](https://travis-ci.org/diasks2/pragmatic_segmenter) [![Test Coverage](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/coverage.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/diasks2/pragmatic_segmenter/blob/master/LICENSE.txt)
 Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
-##Install
+## Install
 **Ruby**
 *Supports Ruby 2.1.5 and above*
@@ -18,7 +18,7 @@ Add this line to your application’s Gemfile:
 gem 'pragmatic_segmenter'
 ```
-##Usage
+## Usage
 * If no language is specified, the library will default to English.
 * To specify a language use its two character [ISO 639-1 code](https://www.tm-town.com/languages).
@@ -54,11 +54,11 @@ ps.clean
 # => "This is a sentence cut off in the middle because pdf."
 ```
-##Live Demo
+## Live Demo
 Try out a [live demo](https://www.tm-town.com/natural-language-processing) of Pragmatic Segmenter in the browser.
-##Background
+## Background
 According to Wikipedia, [sentence boundary disambiguation](http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation) (aka sentence boundary detection, sentence segmentation) is defined as:
@@ -99,7 +99,7 @@ The Holy Grail of sentence segmentation appears to be **Golden Rule #18** as no
 Download the Golden Rules: [[txt](https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt) | [Ruby RSpec](https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules_rspec.rb)]
-####Golden Rules (English)
+#### Golden Rules (English)
 1.) **Simple period to end sentence**
 ```
@@ -413,7 +413,7 @@ Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That i
 => ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]
 ```
-####Golden Rules (German)
+#### Golden Rules (German)
 1.) **Quotation at end of sentence**
 ```
@@ -433,7 +433,7 @@ Was sind die Konsequenzen der Abstimmung vom 12. Juni?
 => ["Was sind die Konsequenzen der Abstimmung vom 12. Juni?"]
 ```
-####Golden Rules (Japanese)
+#### Golden Rules (Japanese)
 1.) **Simple period to end sentence**
 ```
@@ -465,7 +465,7 @@ Was sind die Konsequenzen der Abstimmung vom 12. Juni?
 => ["これは父の家です。"]
 ```
-####Golden Rules (Arabic)
+#### Golden Rules (Arabic)
 1.) **Regular punctuation**
 ```
@@ -497,7 +497,7 @@ Was sind die Konsequenzen der Abstimmung vom 12. Juni?
 => ["عثر في الغرفة على بعض أدوية علاج ارتفاع ضغط الدم، والقلب،", "زرعها عملاء الموساد كما تقول مصادر إسرائيلية،", "وقرر الطبيب أن الفلسطيني قد توفي وفاة طبيعية ربما إثر نوبة قلبية،", "وبدأت مراسم الحداد عليه"]
 ```
-####Golden Rules (Italian)
+#### Golden Rules (Italian)
 1.) **Abbreviations**
 ```
@@ -517,7 +517,7 @@ La casa costa 170.500.000,00€!
 => ["La casa costa 170.500.000,00€!"]
 ```
-####Golden Rules (Russian)
+#### Golden Rules (Russian)
 1.) **Abbreviations**
 ```
@@ -537,7 +537,7 @@ La casa costa 170.500.000,00€!
 => ["Сегодня 27.10.14"]
 ```
-####Golden Rules (Spanish)
+#### Golden Rules (Spanish)
 1.) **Question mark to end sentence**
 ```
@@ -569,7 +569,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["«Ninguna mente extraordinaria está exenta de un toque de demencia.», dijo Aristóteles."]
 ```
-####Golden Rules (Greek)
+#### Golden Rules (Greek)
 1.) **Question mark to end sentence**
 ```
@@ -577,7 +577,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["Με συγχωρείτε· πού είναι οι τουαλέτες;", "Τις Κυριακές δε δούλευε κανένας.", "το κόστος του σπιτιού ήταν £260.950,00."]
 ```
-####Golden Rules (Hindi)
+#### Golden Rules (Hindi)
 1.) **Full stop**
 ```
@@ -585,7 +585,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["सच्चाई यह है कि इसे कोई नहीं जानता।", "हो सकता है यह फ़्रेन्को के खिलाफ़ कोई विद्रोह रहा हो, या फिर बेकाबू हो गया कोई आनंदोत्सव।"]
 ```
-####Golden Rules (Armenian)
+#### Golden Rules (Armenian)
 1.) **Sentence ending punctuation**
 ```
@@ -605,7 +605,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["Այսպիսով` մոտենում ենք ավարտին:", "Տրամաբանությյունը հետևյալն է. պարզություն և աշխատանք:"]
 ```
-####Golden Rules (Burmese)
+#### Golden Rules (Burmese)
 1.) **Sentence ending punctuation**
 ```
@@ -613,7 +613,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["ခင္ဗ်ားနာမည္ဘယ္လိုေခၚလဲ။", "၇ွင္ေနေကာင္းလား။"]
 ```
-####Golden Rules (Amharic)
+#### Golden Rules (Amharic)
 1.) **Sentence ending punctuation**
 ```
@@ -621,7 +621,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["እንደምን አለህ፧", "መልካም ቀን ይሁንልህ።", "እባክሽ ያልሽዉን ድገሚልኝ።"]
 ```
-####Golden Rules (Persian)
+#### Golden Rules (Persian)
 1.) **Sentence ending punctuation**
 ```
@@ -629,7 +629,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["خوشبختم، آقای رضا.", "شما کجایی هستید؟", "من از تهران هستم."]
 ```
-####Golden Rules (Urdu)
+#### Golden Rules (Urdu)
 1.) **Sentence ending punctuation**
 ```
@@ -637,7 +637,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["کیا حال ہے؟", "ميرا نام ___ ەے۔", "میں حالا تاوان دےدوں؟"]
 ```
-####Golden Rules (Dutch)
+#### Golden Rules (Dutch)
 1.) **Sentence starting with a number**
 ```
@@ -701,7 +701,7 @@ To test the relative performance of different segmentation tools and libraries I
 * Persian
 * Urdu
-##Segmentation Papers and Books
+## Segmentation Papers and Books
 * *Elephant: Sequence Labeling for Word and Sentence Segmentation* - Kilian Evang, Valerio Basile, Grzegorz Chrupała and Johan Bos (2013) [[pdf](http://www.aclweb.org/anthology/D13-1146) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/Elephant-+Sequence+Labeling+for+Word+and+Sentence+Segmentation.pdf)]
 * *Sentence Boundary Detection: A Long Solved Problem?* (Second Edition) - Jonathon Read, Rebecca Dridan, Stephan Oepen, Lars Jørgen Solberg (2012) [[pdf](http://www.aclweb.org/anthology/C12-2096) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/C12-2096.pdf)]
@@ -850,6 +850,9 @@ To test the relative performance of different segmentation tools and libraries I
 **Version 0.3.13**
 * Fix issue involving unexpected sentence break between abbreviation and hyphen
+**Version 0.3.14**
+* Add English abbreviation Rs. to denote the Indian currency
 ## Contributing
 If you find a text that is incorrectly segmented using this gem, please submit an issue.

data/lib/pragmatic_segmenter/languages/common.rb CHANGED

@@ -9,7 +9,7 @@ module PragmaticSegmenter
       # Defines the abbreviations for each language (if available)
       module Abbreviation
-        ABBREVIATIONS = Set.new(['adj', 'adm', 'adv', 'al', 'ala', 'alta', 'apr', 'arc', 'ariz', 'ark', 'art', 'assn', 'asst', 'attys', 'aug', 'ave', 'bart', 'bld', 'bldg', 'blvd', 'brig', 'bros', 'btw', 'cal', 'calif', 'capt', 'cl', 'cmdr', 'co', 'col', 'colo', 'comdr', 'con', 'conn', 'corp', 'cpl', 'cres', 'ct', 'd.phil', 'dak', 'dec', 'del', 'dept', 'det', 'dist', 'dr', 'dr.phil', 'dr.philos', 'drs', 'e.g', 'ens', 'esp', 'esq', 'etc', 'exp', 'expy', 'ext', 'feb', 'fed', 'fla', 'ft', 'fwy', 'fy', 'ga', 'gen', 'gov', 'hon', 'hosp', 'hr', 'hway', 'hwy', 'i.e', 'ia', 'id', 'ida', 'ill', 'inc', 'ind', 'ing', 'insp', 'is', 'jan', 'jr', 'jul', 'jun', 'kan', 'kans', 'ken', 'ky', 'la', 'lt', 'ltd', 'maj', 'man', 'mar', 'mass', 'may', 'md', 'me', 'med', 'messrs', 'mex', 'mfg', 'mich', 'min', 'minn', 'miss', 'mlle', 'mm', 'mme', 'mo', 'mont', 'mr', 'mrs', 'ms', 'msgr', 'mssrs', 'mt', 'mtn', 'neb', 'nebr', 'nev', 'no', 'nos', 'nov', 'nr', 'oct', 'ok', 'okla', 'ont', 'op', 'ord', 'ore', 'p', 'pa', 'pd', 'pde', 'penn', 'penna', 'pfc', 'ph', 'ph.d', 'pl', 'plz', 'pp', 'prof', 'pvt', 'que', 'rd', 'ref', 'rep', 'reps', 'res', 'rev', 'rt', 'sask', 'sec', 'sen', 'sens', 'sep', 'sept', 'sfc', 'sgt', 'sr', 'st', 'supt', 'surg', 'tce', 'tenn', 'tex', 'univ', 'usafa', 'u.s', 'ut', 'va', 'v', 'ver', 'vs', 'vt', 'wash', 'wis', 'wisc', 'wy', 'wyo', 'yuk']).freeze
+        ABBREVIATIONS = Set.new(['adj', 'adm', 'adv', 'al', 'ala', 'alta', 'apr', 'arc', 'ariz', 'ark', 'art', 'assn', 'asst', 'attys', 'aug', 'ave', 'bart', 'bld', 'bldg', 'blvd', 'brig', 'bros', 'btw', 'cal', 'calif', 'capt', 'cl', 'cmdr', 'co', 'col', 'colo', 'comdr', 'con', 'conn', 'corp', 'cpl', 'cres', 'ct', 'd.phil', 'dak', 'dec', 'del', 'dept', 'det', 'dist', 'dr', 'dr.phil', 'dr.philos', 'drs', 'e.g', 'ens', 'esp', 'esq', 'etc', 'exp', 'expy', 'ext', 'feb', 'fed', 'fla', 'ft', 'fwy', 'fy', 'ga', 'gen', 'gov', 'hon', 'hosp', 'hr', 'hway', 'hwy', 'i.e', 'ia', 'id', 'ida', 'ill', 'inc', 'ind', 'ing', 'insp', 'is', 'jan', 'jr', 'jul', 'jun', 'kan', 'kans', 'ken', 'ky', 'la', 'lt', 'ltd', 'maj', 'man', 'mar', 'mass', 'may', 'md', 'me', 'med', 'messrs', 'mex', 'mfg', 'mich', 'min', 'minn', 'miss', 'mlle', 'mm', 'mme', 'mo', 'mont', 'mr', 'mrs', 'ms', 'msgr', 'mssrs', 'mt', 'mtn', 'neb', 'nebr', 'nev', 'no', 'nos', 'nov', 'nr', 'oct', 'ok', 'okla', 'ont', 'op', 'ord', 'ore', 'p', 'pa', 'pd', 'pde', 'penn', 'penna', 'pfc', 'ph', 'ph.d', 'pl', 'plz', 'pp', 'prof', 'pvt', 'que', 'rd', 'rs', 'ref', 'rep', 'reps', 'res', 'rev', 'rt', 'sask', 'sec', 'sen', 'sens', 'sep', 'sept', 'sfc', 'sgt', 'sr', 'st', 'supt', 'surg', 'tce', 'tenn', 'tex', 'univ', 'usafa', 'u.s', 'ut', 'va', 'v', 'ver', 'vs', 'vt', 'wash', 'wis', 'wisc', 'wy', 'wyo', 'yuk']).freeze
         PREPOSITIVE_ABBREVIATIONS = Set.new(['adm', 'attys', 'brig', 'capt', 'cmdr', 'col', 'cpl', 'det', 'dr', 'gen', 'gov', 'ing', 'lt', 'maj', 'mr', 'mrs', 'ms', 'mt', 'messrs', 'mssrs', 'prof', 'ph', 'rep', 'reps', 'rev', 'sen', 'sens', 'sgt', 'st', 'supt', 'v', 'vs']).freeze
         NUMBER_ABBREVIATIONS = Set.new(['art', 'ext', 'no', 'nos', 'p', 'pp']).freeze
       end
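
The only functional change in this hunk is the new `'rs'` entry in `ABBREVIATIONS`. The sketch below illustrates why an entry in an abbreviation set can suppress a sentence break after "Rs.". It is a deliberately naive toy, not the gem's actual algorithm: `TOY_ABBREVIATIONS` and `toy_segment` are hypothetical names, and the real library applies many more rules.

```ruby
require 'set'

# Hypothetical, heavily trimmed stand-in for the gem's ABBREVIATIONS constant.
TOY_ABBREVIATIONS = Set.new(%w[dr mr mrs rs]).freeze

# Naive splitter: end a sentence at any token that ends with a period,
# unless the token (minus the period, lowercased) is a known abbreviation.
def toy_segment(text, abbreviations = TOY_ABBREVIATIONS)
  sentences = []
  current = []
  text.split(' ').each do |token|
    current << token
    next unless token.end_with?('.')
    next if abbreviations.include?(token.chomp('.').downcase)
    sentences << current.join(' ')
    current = []
  end
  sentences << current.join(' ') unless current.empty?
  sentences
end

text = "I will be paid Rs. 3520 at the end. Thank you."

# Without 'rs' in the set, the toy splitter breaks after "Rs."
without_rs = toy_segment(text, Set.new(%w[dr mr mrs]))
# With 'rs' in the set, "Rs. 3520" stays in one sentence.
with_rs = toy_segment(text)

p without_rs
p with_rs
```

Because the lookup lowercases the token, both "Rs." and "rs." are covered by the single `'rs'` entry in this toy, which mirrors the mixed-case inputs exercised by the new specs.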

data/lib/pragmatic_segmenter/version.rb CHANGED

@@ -1,3 +1,3 @@
 module PragmaticSegmenter
-  VERSION = "0.3.13"
+  VERSION = "0.3.14"
 end

data/spec/pragmatic_segmenter/languages/english_spec.rb CHANGED

@@ -1369,5 +1369,20 @@ RSpec.describe PragmaticSegmenter::Languages::English, "(en)" do
       ps = PragmaticSegmenter::Segmenter.new(text: "He has Ph.D. level training", clean: false)
       expect(ps.segment).to eq(["He has Ph.D. level training"])
     end
+    it "correctly segments text #111" do
+      ps = PragmaticSegmenter::Segmenter.new(text: "I will be paid Rs. 16720/- in total for the time spent and the inconvenience caused to me, only after completion of all aspects of the study.", clean: false)
+      expect(ps.segment).to eq(["I will be paid Rs. 16720/- in total for the time spent and the inconvenience caused to me, only after completion of all aspects of the study."])
+    end
+    it "correctly segments text #112" do
+      ps = PragmaticSegmenter::Segmenter.new(text: "If I decide to withdraw from the study for other reasons, I will be paid only up to the extent of my participation amount according to the approved procedure of Apotex BEC. If I complete all aspects in Period 1, I will be paid Rs. 3520 and if I complete all aspects in Period 1 and Period 2, I will be paid Rs. 7790 and if I complete all aspects in Period 1, Period 2 and Period 3, I will be paid Rs. 12060 at the end of the study.", clean: false)
+      expect(ps.segment).to eq(["If I decide to withdraw from the study for other reasons, I will be paid only up to the extent of my participation amount according to the approved procedure of Apotex BEC.", "If I complete all aspects in Period 1, I will be paid Rs. 3520 and if I complete all aspects in Period 1 and Period 2, I will be paid Rs. 7790 and if I complete all aspects in Period 1, Period 2 and Period 3, I will be paid Rs. 12060 at the end of the study."])
+    end
+    it "correctly segments text #113" do
+      ps = PragmaticSegmenter::Segmenter.new(text: "After completion of each Period, I will be paid an advance amount of rs. 1000 and this amount will be deducted from my final study compensation.", clean: false)
+      expect(ps.segment).to eq(["After completion of each Period, I will be paid an advance amount of rs. 1000 and this amount will be deducted from my final study compensation."])
+    end
   end
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pragmatic_segmenter
 version: !ruby/object:Gem::Version
-  version: 0.3.13
+  version: 0.3.14
 platform: ruby
 authors:
 - Kevin S. Dias
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-01-17 00:00:00.000000000 Z
+date: 2017-06-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: unicode
@@ -178,7 +178,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.6.8
+rubygems_version: 2.6.12
 signing_key:
 specification_version: 4
 summary: A rule-based sentence boundary detection gem that works out-of-the-box across