pragmatic_segmenter 0.3.13 → 0.3.14

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: c313d610281828819a76463bd3b42590927307e2
- data.tar.gz: 3c9f340a197450a6dffac38f4a3c0b378f3b8edf
+ metadata.gz: 1f4cb40793083044dcf653f7c5c2c64d6e1f9476
+ data.tar.gz: df80bb40afbb05d6f9d1109f0fbb412b34fd7171
  SHA512:
- metadata.gz: d8756402fcb03f456f27d8359b4e8adfd970fed42a89bdfebab9b7a75b25d83dd9b4db19b1f7982b71214c573500b519cbc1920b6a54fa2632a3521dabee9a68
- data.tar.gz: 911a665d2609086e20aff601161e63ce22410e18c0d169fc9d8032cef029062aa7a3cdafa082c55db1ab78ae6f2e5f7a1f302246c716f5393a815ed9aec890dc
+ metadata.gz: 3296133dffb5dc2a1c3084f31ecd6115c9e5ffbcb53590ee751fa1e69159dbffadc0bbb6885f7fb09f38ec53aceaf07e2da8942ba6135ee344ae34fd1fac4d83
+ data.tar.gz: 6c3230195453e40c8a4bbfcb9b864a369db7a3beebcf4f61f37b11f2f8dff454a173456a59fa298d5962560441828fbec6b3d926e4dd033785cd4b8a7a716a05
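The SHA1 and SHA512 entries in `checksums.yaml` are hex digests recorded for `metadata.gz` and `data.tar.gz`. As a rough sketch of how such digests could be recomputed and compared using Ruby's standard `digest` library (the helper names `checksums_for` and `checksum_matches?` are made up for illustration and are not part of RubyGems):

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Hypothetical helper: returns the two digest kinds that checksums.yaml
# records, computed over the given bytes.
def checksums_for(bytes)
  {
    'SHA1'   => Digest::SHA1.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end

# Hypothetical helper: true when the recomputed digest equals the expected one.
def checksum_matches?(bytes, algorithm, expected_hex)
  checksums_for(bytes).fetch(algorithm) == expected_hex
end

sums = checksums_for('example bytes')
puts sums['SHA1'].length   # SHA1 hex digests are 40 characters long
puts sums['SHA512'].length # SHA512 hex digests are 128 characters long
```

In practice the bytes would come from something like `File.binread` on an extracted archive member, and the expected value from the matching `checksums.yaml` entry.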
data/CODE_OF_CONDUCT.md CHANGED
@@ -1,4 +1,4 @@
- #Contributor Code of Conduct
+ # Contributor Code of Conduct
 
  As contributors and maintainers of this project, and in the interest of fostering an open and welcoming community, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
 
data/NEWS CHANGED
@@ -1,3 +1,7 @@
+ 0.3.13 (2017-06-28):
+
+ * Improvement: Add English abbreviation Rs. to denote the Indian currency
+
  0.3.13 (2017-01-17):
 
  * Bug Fix: Unexpected sentence break between abbreviation and hyphen
data/README.md CHANGED
@@ -1,10 +1,10 @@
- #Pragmatic Segmenter
+ # Pragmatic Segmenter
 
  [![Gem Version](https://badge.fury.io/rb/pragmatic_segmenter.svg)](http://badge.fury.io/rb/pragmatic_segmenter) [![Code Climate](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/gpa.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![Build Status](https://travis-ci.org/diasks2/pragmatic_segmenter.png)](https://travis-ci.org/diasks2/pragmatic_segmenter) [![Test Coverage](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/coverage.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/diasks2/pragmatic_segmenter/blob/master/LICENSE.txt)
 
  Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
 
- ##Install
+ ## Install
 
  **Ruby**
  *Supports Ruby 2.1.5 and above*
@@ -18,7 +18,7 @@ Add this line to your application’s Gemfile:
  gem 'pragmatic_segmenter'
  ```
 
- ##Usage
+ ## Usage
 
  * If no language is specified, the library will default to English.
  * To specify a language use its two character [ISO 639-1 code](https://www.tm-town.com/languages).
@@ -54,11 +54,11 @@ ps.clean
  # => "This is a sentence cut off in the middle because pdf."
  ```
 
- ##Live Demo
+ ## Live Demo
 
  Try out a [live demo](https://www.tm-town.com/natural-language-processing) of Pragmatic Segmenter in the browser.
 
- ##Background
+ ## Background
 
  According to Wikipedia, [sentence boundary disambiguation](http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation) (aka sentence boundary detection, sentence segmentation) is defined as:
 
@@ -99,7 +99,7 @@ The Holy Grail of sentence segmentation appears to be **Golden Rule #18** as no
 
  Download the Golden Rules: [[txt](https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt) | [Ruby RSpec](https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules_rspec.rb)]
 
- ####Golden Rules (English)
+ #### Golden Rules (English)
 
  1.) **Simple period to end sentence**
  ```
@@ -413,7 +413,7 @@ Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That i
  => ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]
  ```
 
- ####Golden Rules (German)
+ #### Golden Rules (German)
 
  1.) **Quotation at end of sentence**
  ```
@@ -433,7 +433,7 @@ Was sind die Konsequenzen der Abstimmung vom 12. Juni?
  => ["Was sind die Konsequenzen der Abstimmung vom 12. Juni?"]
  ```
 
- ####Golden Rules (Japanese)
+ #### Golden Rules (Japanese)
 
  1.) **Simple period to end sentence**
  ```
@@ -465,7 +465,7 @@ Was sind die Konsequenzen der Abstimmung vom 12. Juni?
  => ["これは父の家です。"]
  ```
 
- ####Golden Rules (Arabic)
+ #### Golden Rules (Arabic)
 
  1.) **Regular punctuation**
  ```
@@ -497,7 +497,7 @@ Was sind die Konsequenzen der Abstimmung vom 12. Juni?
  => ["عثر في الغرفة على بعض أدوية علاج ارتفاع ضغط الدم، والقلب،", "زرعها عملاء الموساد كما تقول مصادر إسرائيلية،", "وقرر الطبيب أن الفلسطيني قد توفي وفاة طبيعية ربما إثر نوبة قلبية،", "وبدأت مراسم الحداد عليه"]
  ```
 
- ####Golden Rules (Italian)
+ #### Golden Rules (Italian)
 
  1.) **Abbreviations**
  ```
@@ -517,7 +517,7 @@ La casa costa 170.500.000,00€!
  => ["La casa costa 170.500.000,00€!"]
  ```
 
- ####Golden Rules (Russian)
+ #### Golden Rules (Russian)
 
  1.) **Abbreviations**
  ```
@@ -537,7 +537,7 @@ La casa costa 170.500.000,00€!
  => ["Сегодня 27.10.14"]
  ```
 
- ####Golden Rules (Spanish)
+ #### Golden Rules (Spanish)
 
  1.) **Question mark to end sentence**
  ```
@@ -569,7 +569,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
  => ["«Ninguna mente extraordinaria está exenta de un toque de demencia.», dijo Aristóteles."]
  ```
 
- ####Golden Rules (Greek)
+ #### Golden Rules (Greek)
 
  1.) **Question mark to end sentence**
  ```
@@ -577,7 +577,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
  => ["Με συγχωρείτε· πού είναι οι τουαλέτες;", "Τις Κυριακές δε δούλευε κανένας.", "το κόστος του σπιτιού ήταν £260.950,00."]
  ```
 
- ####Golden Rules (Hindi)
+ #### Golden Rules (Hindi)
 
  1.) **Full stop**
  ```
@@ -585,7 +585,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
  => ["सच्चाई यह है कि इसे कोई नहीं जानता।", "हो सकता है यह फ़्रेन्को के खिलाफ़ कोई विद्रोह रहा हो, या फिर बेकाबू हो गया कोई आनंदोत्सव।"]
  ```
 
- ####Golden Rules (Armenian)
+ #### Golden Rules (Armenian)
 
  1.) **Sentence ending punctuation**
  ```
@@ -605,7 +605,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
  => ["Այսպիսով` մոտենում ենք ավարտին:", "Տրամաբանությյունը հետևյալն է. պարզություն և աշխատանք:"]
  ```
 
- ####Golden Rules (Burmese)
+ #### Golden Rules (Burmese)
 
  1.) **Sentence ending punctuation**
  ```
@@ -613,7 +613,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
  => ["ခင္ဗ်ားနာမည္ဘယ္လိုေခၚလဲ။", "၇ွင္ေနေကာင္းလား။"]
  ```
 
- ####Golden Rules (Amharic)
+ #### Golden Rules (Amharic)
 
  1.) **Sentence ending punctuation**
  ```
@@ -621,7 +621,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
  => ["እንደምን አለህ፧", "መልካም ቀን ይሁንልህ።", "እባክሽ ያልሽዉን ድገሚልኝ።"]
  ```
 
- ####Golden Rules (Persian)
+ #### Golden Rules (Persian)
 
  1.) **Sentence ending punctuation**
  ```
@@ -629,7 +629,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
  => ["خوشبختم، آقای رضا.", "شما کجایی هستید؟", "من از تهران هستم."]
  ```
 
- ####Golden Rules (Urdu)
+ #### Golden Rules (Urdu)
 
  1.) **Sentence ending punctuation**
  ```
@@ -637,7 +637,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
  => ["کیا حال ہے؟", "ميرا نام ___ ەے۔", "میں حالا تاوان دےدوں؟"]
  ```
 
- ####Golden Rules (Dutch)
+ #### Golden Rules (Dutch)
 
  1.) **Sentence starting with a number**
  ```
@@ -701,7 +701,7 @@ To test the relative performance of different segmentation tools and libraries I
  * Persian
  * Urdu
 
- ##Segmentation Papers and Books
+ ## Segmentation Papers and Books
 
  * *Elephant: Sequence Labeling for Word and Sentence Segmentation* - Kilian Evang, Valerio Basile, Grzegorz Chrupała and Johan Bos (2013) [[pdf](http://www.aclweb.org/anthology/D13-1146) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/Elephant-+Sequence+Labeling+for+Word+and+Sentence+Segmentation.pdf)]
  * *Sentence Boundary Detection: A Long Solved Problem?* (Second Edition) - Jonathon Read, Rebecca Dridan, Stephan Oepen, Lars Jørgen Solberg (2012) [[pdf](http://www.aclweb.org/anthology/C12-2096) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/C12-2096.pdf)]
@@ -850,6 +850,9 @@ To test the relative performance of different segmentation tools and libraries I
  **Version 0.3.13**
  * Fix issue involving unexpected sentence break between abbreviation and hyphen
 
+ **Version 0.3.14**
+ * Add English abbreviation Rs. to denote the Indian currency
+
  ## Contributing
 
  If you find a text that is incorrectly segmented using this gem, please submit an issue.
@@ -9,7 +9,7 @@ module PragmaticSegmenter
 
  # Defines the abbreviations for each language (if available)
  module Abbreviation
- ABBREVIATIONS = Set.new(['adj', 'adm', 'adv', 'al', 'ala', 'alta', 'apr', 'arc', 'ariz', 'ark', 'art', 'assn', 'asst', 'attys', 'aug', 'ave', 'bart', 'bld', 'bldg', 'blvd', 'brig', 'bros', 'btw', 'cal', 'calif', 'capt', 'cl', 'cmdr', 'co', 'col', 'colo', 'comdr', 'con', 'conn', 'corp', 'cpl', 'cres', 'ct', 'd.phil', 'dak', 'dec', 'del', 'dept', 'det', 'dist', 'dr', 'dr.phil', 'dr.philos', 'drs', 'e.g', 'ens', 'esp', 'esq', 'etc', 'exp', 'expy', 'ext', 'feb', 'fed', 'fla', 'ft', 'fwy', 'fy', 'ga', 'gen', 'gov', 'hon', 'hosp', 'hr', 'hway', 'hwy', 'i.e', 'ia', 'id', 'ida', 'ill', 'inc', 'ind', 'ing', 'insp', 'is', 'jan', 'jr', 'jul', 'jun', 'kan', 'kans', 'ken', 'ky', 'la', 'lt', 'ltd', 'maj', 'man', 'mar', 'mass', 'may', 'md', 'me', 'med', 'messrs', 'mex', 'mfg', 'mich', 'min', 'minn', 'miss', 'mlle', 'mm', 'mme', 'mo', 'mont', 'mr', 'mrs', 'ms', 'msgr', 'mssrs', 'mt', 'mtn', 'neb', 'nebr', 'nev', 'no', 'nos', 'nov', 'nr', 'oct', 'ok', 'okla', 'ont', 'op', 'ord', 'ore', 'p', 'pa', 'pd', 'pde', 'penn', 'penna', 'pfc', 'ph', 'ph.d', 'pl', 'plz', 'pp', 'prof', 'pvt', 'que', 'rd', 'ref', 'rep', 'reps', 'res', 'rev', 'rt', 'sask', 'sec', 'sen', 'sens', 'sep', 'sept', 'sfc', 'sgt', 'sr', 'st', 'supt', 'surg', 'tce', 'tenn', 'tex', 'univ', 'usafa', 'u.s', 'ut', 'va', 'v', 'ver', 'vs', 'vt', 'wash', 'wis', 'wisc', 'wy', 'wyo', 'yuk']).freeze
+ ABBREVIATIONS = Set.new(['adj', 'adm', 'adv', 'al', 'ala', 'alta', 'apr', 'arc', 'ariz', 'ark', 'art', 'assn', 'asst', 'attys', 'aug', 'ave', 'bart', 'bld', 'bldg', 'blvd', 'brig', 'bros', 'btw', 'cal', 'calif', 'capt', 'cl', 'cmdr', 'co', 'col', 'colo', 'comdr', 'con', 'conn', 'corp', 'cpl', 'cres', 'ct', 'd.phil', 'dak', 'dec', 'del', 'dept', 'det', 'dist', 'dr', 'dr.phil', 'dr.philos', 'drs', 'e.g', 'ens', 'esp', 'esq', 'etc', 'exp', 'expy', 'ext', 'feb', 'fed', 'fla', 'ft', 'fwy', 'fy', 'ga', 'gen', 'gov', 'hon', 'hosp', 'hr', 'hway', 'hwy', 'i.e', 'ia', 'id', 'ida', 'ill', 'inc', 'ind', 'ing', 'insp', 'is', 'jan', 'jr', 'jul', 'jun', 'kan', 'kans', 'ken', 'ky', 'la', 'lt', 'ltd', 'maj', 'man', 'mar', 'mass', 'may', 'md', 'me', 'med', 'messrs', 'mex', 'mfg', 'mich', 'min', 'minn', 'miss', 'mlle', 'mm', 'mme', 'mo', 'mont', 'mr', 'mrs', 'ms', 'msgr', 'mssrs', 'mt', 'mtn', 'neb', 'nebr', 'nev', 'no', 'nos', 'nov', 'nr', 'oct', 'ok', 'okla', 'ont', 'op', 'ord', 'ore', 'p', 'pa', 'pd', 'pde', 'penn', 'penna', 'pfc', 'ph', 'ph.d', 'pl', 'plz', 'pp', 'prof', 'pvt', 'que', 'rd', 'rs', 'ref', 'rep', 'reps', 'res', 'rev', 'rt', 'sask', 'sec', 'sen', 'sens', 'sep', 'sept', 'sfc', 'sgt', 'sr', 'st', 'supt', 'surg', 'tce', 'tenn', 'tex', 'univ', 'usafa', 'u.s', 'ut', 'va', 'v', 'ver', 'vs', 'vt', 'wash', 'wis', 'wisc', 'wy', 'wyo', 'yuk']).freeze
  PREPOSITIVE_ABBREVIATIONS = Set.new(['adm', 'attys', 'brig', 'capt', 'cmdr', 'col', 'cpl', 'det', 'dr', 'gen', 'gov', 'ing', 'lt', 'maj', 'mr', 'mrs', 'ms', 'mt', 'messrs', 'mssrs', 'prof', 'ph', 'rep', 'reps', 'rev', 'sen', 'sens', 'sgt', 'st', 'supt', 'v', 'vs']).freeze
  NUMBER_ABBREVIATIONS = Set.new(['art', 'ext', 'no', 'nos', 'p', 'pp']).freeze
  end
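The only functional change in this hunk is the new `'rs'` entry in `ABBREVIATIONS`. To illustrate why membership in such a set matters, here is a deliberately simplified sketch. It is not the gem's implementation: `SKETCH_ABBREVIATIONS` and `naive_segment` are hypothetical names, and the logic is only a whitespace-token heuristic that declines to break after a period when the preceding word is a known abbreviation.

```ruby
require 'set'

# Hypothetical, tiny abbreviation list for illustration only.
SKETCH_ABBREVIATIONS = Set.new(%w[dr mr mrs rs]).freeze

# Naive splitter: breaks after a token ending in ".", "?" or "!",
# unless the token is a period-terminated known abbreviation.
def naive_segment(text, abbreviations = SKETCH_ABBREVIATIONS)
  sentences = []
  current = +''
  text.split(' ').each do |token|
    current << ' ' unless current.empty?
    current << token
    next unless token.end_with?('.', '?', '!')
    # Compare case-insensitively, so "Rs." and "rs." are treated alike.
    word = token.chomp('.').downcase
    next if token.end_with?('.') && abbreviations.include?(word)
    sentences << current
    current = +''
  end
  sentences << current unless current.empty?
  sentences
end

text = 'I will be paid Rs. 3520 today. That is all.'
p naive_segment(text)
# => ["I will be paid Rs. 3520 today.", "That is all."]
p naive_segment(text, Set.new(%w[dr mr mrs]))
# => ["I will be paid Rs.", "3520 today.", "That is all."]
```

Without `rs` in the set, the sketch breaks right after "Rs.", which mirrors the kind of unwanted split that the added abbreviation is meant to prevent.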
@@ -1,3 +1,3 @@
  module PragmaticSegmenter
- VERSION = "0.3.13"
+ VERSION = "0.3.14"
  end
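The bump from 0.3.13 to 0.3.14 is a patch-level change. A small sketch of how RubyGems orders the two versions, assuming a dependency constraint of `~> 0.3.13` (an illustrative choice; this diff does not specify how consumers pin the gem):

```ruby
require 'rubygems'

old_version = Gem::Version.new('0.3.13')
new_version = Gem::Version.new('0.3.14')

# Assumed constraint, for illustration only.
requirement = Gem::Requirement.new('~> 0.3.13')

# Gem::Version is Comparable, so versions order segment by segment.
newer = new_version > old_version
# "~> 0.3.13" accepts any version >= 0.3.13 and < 0.4.0.
allowed = requirement.satisfied_by?(new_version)

puts "newer: #{newer}, allowed: #{allowed}"
```

Under that assumed constraint, a routine `bundle update pragmatic_segmenter` would be free to pick up the new release.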
@@ -1369,5 +1369,20 @@ RSpec.describe PragmaticSegmenter::Languages::English, "(en)" do
  ps = PragmaticSegmenter::Segmenter.new(text: "He has Ph.D. level training", clean: false)
  expect(ps.segment).to eq(["He has Ph.D. level training"])
  end
+
+ it "correctly segments text #111" do
+ ps = PragmaticSegmenter::Segmenter.new(text: "I will be paid Rs. 16720/- in total for the time spent and the inconvenience caused to me, only after completion of all aspects of the study.", clean: false)
+ expect(ps.segment).to eq(["I will be paid Rs. 16720/- in total for the time spent and the inconvenience caused to me, only after completion of all aspects of the study."])
+ end
+
+ it "correctly segments text #112" do
+ ps = PragmaticSegmenter::Segmenter.new(text: "If I decide to withdraw from the study for other reasons, I will be paid only up to the extent of my participation amount according to the approved procedure of Apotex BEC. If I complete all aspects in Period 1, I will be paid Rs. 3520 and if I complete all aspects in Period 1 and Period 2, I will be paid Rs. 7790 and if I complete all aspects in Period 1, Period 2 and Period 3, I will be paid Rs. 12060 at the end of the study.", clean: false)
+ expect(ps.segment).to eq(["If I decide to withdraw from the study for other reasons, I will be paid only up to the extent of my participation amount according to the approved procedure of Apotex BEC.", "If I complete all aspects in Period 1, I will be paid Rs. 3520 and if I complete all aspects in Period 1 and Period 2, I will be paid Rs. 7790 and if I complete all aspects in Period 1, Period 2 and Period 3, I will be paid Rs. 12060 at the end of the study."])
+ end
+
+ it "correctly segments text #113" do
+ ps = PragmaticSegmenter::Segmenter.new(text: "After completion of each Period, I will be paid an advance amount of rs. 1000 and this amount will be deducted from my final study compensation.", clean: false)
+ expect(ps.segment).to eq(["After completion of each Period, I will be paid an advance amount of rs. 1000 and this amount will be deducted from my final study compensation."])
+ end
  end
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: pragmatic_segmenter
  version: !ruby/object:Gem::Version
- version: 0.3.13
+ version: 0.3.14
  platform: ruby
  authors:
  - Kevin S. Dias
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2017-01-17 00:00:00.000000000 Z
+ date: 2017-06-28 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: unicode
@@ -178,7 +178,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.6.8
+ rubygems_version: 2.6.12
  signing_key:
  specification_version: 4
  summary: A rule-based sentence boundary detection gem that works out-of-the-box across