pragmatic_segmenter 0.3.13 → 0.3.14
- checksums.yaml +4 -4
- data/CODE_OF_CONDUCT.md +1 -1
- data/NEWS +4 -0
- data/README.md +24 -21
- data/lib/pragmatic_segmenter/languages/common.rb +1 -1
- data/lib/pragmatic_segmenter/version.rb +1 -1
- data/spec/pragmatic_segmenter/languages/english_spec.rb +15 -0
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1f4cb40793083044dcf653f7c5c2c64d6e1f9476
+  data.tar.gz: df80bb40afbb05d6f9d1109f0fbb412b34fd7171
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 3296133dffb5dc2a1c3084f31ecd6115c9e5ffbcb53590ee751fa1e69159dbffadc0bbb6885f7fb09f38ec53aceaf07e2da8942ba6135ee344ae34fd1fac4d83
+  data.tar.gz: 6c3230195453e40c8a4bbfcb9b864a369db7a3beebcf4f61f37b11f2f8dff454a173456a59fa298d5962560441828fbec6b3d926e4dd033785cd4b8a7a716a05
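The digests recorded in checksums.yaml can be recomputed locally from the two archive members of an unpacked .gem file. The following is a minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted into the working directory; the helper name `gem_digests` is hypothetical and not part of RubyGems.

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Hypothetical helper: returns the SHA1 and SHA512 hex digests of a file,
# the two algorithms that appear as top-level keys in checksums.yaml.
def gem_digests(path)
  data = File.binread(path) # read as raw bytes so the digest is not affected by encoding
  {
    'SHA1'   => Digest::SHA1.hexdigest(data),
    'SHA512' => Digest::SHA512.hexdigest(data)
  }
end
```

Under these assumptions, `gem_digests('data.tar.gz')['SHA1']` would be compared against the `data.tar.gz` entry listed under `SHA1`, and likewise for `SHA512` and `metadata.gz`.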
data/CODE_OF_CONDUCT.md
CHANGED
@@ -1,4 +1,4 @@
-#Contributor Code of Conduct
+# Contributor Code of Conduct
 
 As contributors and maintainers of this project, and in the interest of fostering an open and welcoming community, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
 
data/NEWS
CHANGED
data/README.md
CHANGED
@@ -1,10 +1,10 @@
-#Pragmatic Segmenter
+# Pragmatic Segmenter
 
 [![Gem Version](https://badge.fury.io/rb/pragmatic_segmenter.svg)](http://badge.fury.io/rb/pragmatic_segmenter) [![Code Climate](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/gpa.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![Build Status](https://travis-ci.org/diasks2/pragmatic_segmenter.png)](https://travis-ci.org/diasks2/pragmatic_segmenter) [![Test Coverage](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/coverage.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/diasks2/pragmatic_segmenter/blob/master/LICENSE.txt)
 
 Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
 
-##Install
+## Install
 
 **Ruby**
 *Supports Ruby 2.1.5 and above*
@@ -18,7 +18,7 @@ Add this line to your application’s Gemfile:
 gem 'pragmatic_segmenter'
 ```
 
-##Usage
+## Usage
 
 * If no language is specified, the library will default to English.
 * To specify a language use its two character [ISO 639-1 code](https://www.tm-town.com/languages).
@@ -54,11 +54,11 @@ ps.clean
 # => "This is a sentence cut off in the middle because pdf."
 ```
 
-##Live Demo
+## Live Demo
 
 Try out a [live demo](https://www.tm-town.com/natural-language-processing) of Pragmatic Segmenter in the browser.
 
-##Background
+## Background
 
 According to Wikipedia, [sentence boundary disambiguation](http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation) (aka sentence boundary detection, sentence segmentation) is defined as:
 
@@ -99,7 +99,7 @@ The Holy Grail of sentence segmentation appears to be **Golden Rule #18** as no
 
 Download the Golden Rules: [[txt](https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt) | [Ruby RSpec](https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules_rspec.rb)]
 
-####Golden Rules (English)
+#### Golden Rules (English)
 
 1.) **Simple period to end sentence**
 ```
@@ -413,7 +413,7 @@ Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That i
 => ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]
 ```
 
-####Golden Rules (German)
+#### Golden Rules (German)
 
 1.) **Quotation at end of sentence**
 ```
@@ -433,7 +433,7 @@ Was sind die Konsequenzen der Abstimmung vom 12. Juni?
 => ["Was sind die Konsequenzen der Abstimmung vom 12. Juni?"]
 ```
 
-####Golden Rules (Japanese)
+#### Golden Rules (Japanese)
 
 1.) **Simple period to end sentence**
 ```
@@ -465,7 +465,7 @@ Was sind die Konsequenzen der Abstimmung vom 12. Juni?
 => ["これは父の家です。"]
 ```
 
-####Golden Rules (Arabic)
+#### Golden Rules (Arabic)
 
 1.) **Regular punctuation**
 ```
@@ -497,7 +497,7 @@ Was sind die Konsequenzen der Abstimmung vom 12. Juni?
 => ["عثر في الغرفة على بعض أدوية علاج ارتفاع ضغط الدم، والقلب،", "زرعها عملاء الموساد كما تقول مصادر إسرائيلية،", "وقرر الطبيب أن الفلسطيني قد توفي وفاة طبيعية ربما إثر نوبة قلبية،", "وبدأت مراسم الحداد عليه"]
 ```
 
-####Golden Rules (Italian)
+#### Golden Rules (Italian)
 
 1.) **Abbreviations**
 ```
@@ -517,7 +517,7 @@ La casa costa 170.500.000,00€!
 => ["La casa costa 170.500.000,00€!"]
 ```
 
-####Golden Rules (Russian)
+#### Golden Rules (Russian)
 
 1.) **Abbreviations**
 ```
@@ -537,7 +537,7 @@ La casa costa 170.500.000,00€!
 => ["Сегодня 27.10.14"]
 ```
 
-####Golden Rules (Spanish)
+#### Golden Rules (Spanish)
 
 1.) **Question mark to end sentence**
 ```
@@ -569,7 +569,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["«Ninguna mente extraordinaria está exenta de un toque de demencia.», dijo Aristóteles."]
 ```
 
-####Golden Rules (Greek)
+#### Golden Rules (Greek)
 
 1.) **Question mark to end sentence**
 ```
@@ -577,7 +577,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["Με συγχωρείτε· πού είναι οι τουαλέτες;", "Τις Κυριακές δε δούλευε κανένας.", "το κόστος του σπιτιού ήταν £260.950,00."]
 ```
 
-####Golden Rules (Hindi)
+#### Golden Rules (Hindi)
 
 1.) **Full stop**
 ```
@@ -585,7 +585,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["सच्चाई यह है कि इसे कोई नहीं जानता।", "हो सकता है यह फ़्रेन्को के खिलाफ़ कोई विद्रोह रहा हो, या फिर बेकाबू हो गया कोई आनंदोत्सव।"]
 ```
 
-####Golden Rules (Armenian)
+#### Golden Rules (Armenian)
 
 1.) **Sentence ending punctuation**
 ```
@@ -605,7 +605,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["Այսպիսով` մոտենում ենք ավարտին:", "Տրամաբանությյունը հետևյալն է. պարզություն և աշխատանք:"]
 ```
 
-####Golden Rules (Burmese)
+#### Golden Rules (Burmese)
 
 1.) **Sentence ending punctuation**
 ```
@@ -613,7 +613,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["ခင္ဗ်ားနာမည္ဘယ္လိုေခၚလဲ။", "၇ွင္ေနေကာင္းလား။"]
 ```
 
-####Golden Rules (Amharic)
+#### Golden Rules (Amharic)
 
 1.) **Sentence ending punctuation**
 ```
@@ -621,7 +621,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["እንደምን አለህ፧", "መልካም ቀን ይሁንልህ።", "እባክሽ ያልሽዉን ድገሚልኝ።"]
 ```
 
-####Golden Rules (Persian)
+#### Golden Rules (Persian)
 
 1.) **Sentence ending punctuation**
 ```
@@ -629,7 +629,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["خوشبختم، آقای رضا.", "شما کجایی هستید؟", "من از تهران هستم."]
 ```
 
-####Golden Rules (Urdu)
+#### Golden Rules (Urdu)
 
 1.) **Sentence ending punctuation**
 ```
@@ -637,7 +637,7 @@ Hola Srta. Ledesma. Buenos días, soy el Lic. Naser Pastoriza, y él es mi padre
 => ["کیا حال ہے؟", "ميرا نام ___ ەے۔", "میں حالا تاوان دےدوں؟"]
 ```
 
-####Golden Rules (Dutch)
+#### Golden Rules (Dutch)
 
 1.) **Sentence starting with a number**
 ```
@@ -701,7 +701,7 @@ To test the relative performance of different segmentation tools and libraries I
 * Persian
 * Urdu
 
-##Segmentation Papers and Books
+## Segmentation Papers and Books
 
 * *Elephant: Sequence Labeling for Word and Sentence Segmentation* - Kilian Evang, Valerio Basile, Grzegorz Chrupała and Johan Bos (2013) [[pdf](http://www.aclweb.org/anthology/D13-1146) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/Elephant-+Sequence+Labeling+for+Word+and+Sentence+Segmentation.pdf)]
 * *Sentence Boundary Detection: A Long Solved Problem?* (Second Edition) - Jonathon Read, Rebecca Dridan, Stephan Oepen, Lars Jørgen Solberg (2012) [[pdf](http://www.aclweb.org/anthology/C12-2096) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/C12-2096.pdf)]
@@ -850,6 +850,9 @@ To test the relative performance of different segmentation tools and libraries I
 **Version 0.3.13**
 * Fix issue involving unexpected sentence break between abbreviation and hyphen
 
+**Version 0.3.14**
+* Add English abbreviation Rs. to denote the Indian currency
+
 ## Contributing
 
 If you find a text that is incorrectly segmented using this gem, please submit an issue.
data/lib/pragmatic_segmenter/languages/common.rb
CHANGED
@@ -9,7 +9,7 @@ module PragmaticSegmenter
 
 # Defines the abbreviations for each language (if available)
 module Abbreviation
-  ABBREVIATIONS = Set.new(['adj', 'adm', 'adv', 'al', 'ala', 'alta', 'apr', 'arc', 'ariz', 'ark', 'art', 'assn', 'asst', 'attys', 'aug', 'ave', 'bart', 'bld', 'bldg', 'blvd', 'brig', 'bros', 'btw', 'cal', 'calif', 'capt', 'cl', 'cmdr', 'co', 'col', 'colo', 'comdr', 'con', 'conn', 'corp', 'cpl', 'cres', 'ct', 'd.phil', 'dak', 'dec', 'del', 'dept', 'det', 'dist', 'dr', 'dr.phil', 'dr.philos', 'drs', 'e.g', 'ens', 'esp', 'esq', 'etc', 'exp', 'expy', 'ext', 'feb', 'fed', 'fla', 'ft', 'fwy', 'fy', 'ga', 'gen', 'gov', 'hon', 'hosp', 'hr', 'hway', 'hwy', 'i.e', 'ia', 'id', 'ida', 'ill', 'inc', 'ind', 'ing', 'insp', 'is', 'jan', 'jr', 'jul', 'jun', 'kan', 'kans', 'ken', 'ky', 'la', 'lt', 'ltd', 'maj', 'man', 'mar', 'mass', 'may', 'md', 'me', 'med', 'messrs', 'mex', 'mfg', 'mich', 'min', 'minn', 'miss', 'mlle', 'mm', 'mme', 'mo', 'mont', 'mr', 'mrs', 'ms', 'msgr', 'mssrs', 'mt', 'mtn', 'neb', 'nebr', 'nev', 'no', 'nos', 'nov', 'nr', 'oct', 'ok', 'okla', 'ont', 'op', 'ord', 'ore', 'p', 'pa', 'pd', 'pde', 'penn', 'penna', 'pfc', 'ph', 'ph.d', 'pl', 'plz', 'pp', 'prof', 'pvt', 'que', 'rd', 'ref', 'rep', 'reps', 'res', 'rev', 'rt', 'sask', 'sec', 'sen', 'sens', 'sep', 'sept', 'sfc', 'sgt', 'sr', 'st', 'supt', 'surg', 'tce', 'tenn', 'tex', 'univ', 'usafa', 'u.s', 'ut', 'va', 'v', 'ver', 'vs', 'vt', 'wash', 'wis', 'wisc', 'wy', 'wyo', 'yuk']).freeze
+  ABBREVIATIONS = Set.new(['adj', 'adm', 'adv', 'al', 'ala', 'alta', 'apr', 'arc', 'ariz', 'ark', 'art', 'assn', 'asst', 'attys', 'aug', 'ave', 'bart', 'bld', 'bldg', 'blvd', 'brig', 'bros', 'btw', 'cal', 'calif', 'capt', 'cl', 'cmdr', 'co', 'col', 'colo', 'comdr', 'con', 'conn', 'corp', 'cpl', 'cres', 'ct', 'd.phil', 'dak', 'dec', 'del', 'dept', 'det', 'dist', 'dr', 'dr.phil', 'dr.philos', 'drs', 'e.g', 'ens', 'esp', 'esq', 'etc', 'exp', 'expy', 'ext', 'feb', 'fed', 'fla', 'ft', 'fwy', 'fy', 'ga', 'gen', 'gov', 'hon', 'hosp', 'hr', 'hway', 'hwy', 'i.e', 'ia', 'id', 'ida', 'ill', 'inc', 'ind', 'ing', 'insp', 'is', 'jan', 'jr', 'jul', 'jun', 'kan', 'kans', 'ken', 'ky', 'la', 'lt', 'ltd', 'maj', 'man', 'mar', 'mass', 'may', 'md', 'me', 'med', 'messrs', 'mex', 'mfg', 'mich', 'min', 'minn', 'miss', 'mlle', 'mm', 'mme', 'mo', 'mont', 'mr', 'mrs', 'ms', 'msgr', 'mssrs', 'mt', 'mtn', 'neb', 'nebr', 'nev', 'no', 'nos', 'nov', 'nr', 'oct', 'ok', 'okla', 'ont', 'op', 'ord', 'ore', 'p', 'pa', 'pd', 'pde', 'penn', 'penna', 'pfc', 'ph', 'ph.d', 'pl', 'plz', 'pp', 'prof', 'pvt', 'que', 'rd', 'rs', 'ref', 'rep', 'reps', 'res', 'rev', 'rt', 'sask', 'sec', 'sen', 'sens', 'sep', 'sept', 'sfc', 'sgt', 'sr', 'st', 'supt', 'surg', 'tce', 'tenn', 'tex', 'univ', 'usafa', 'u.s', 'ut', 'va', 'v', 'ver', 'vs', 'vt', 'wash', 'wis', 'wisc', 'wy', 'wyo', 'yuk']).freeze
   PREPOSITIVE_ABBREVIATIONS = Set.new(['adm', 'attys', 'brig', 'capt', 'cmdr', 'col', 'cpl', 'det', 'dr', 'gen', 'gov', 'ing', 'lt', 'maj', 'mr', 'mrs', 'ms', 'mt', 'messrs', 'mssrs', 'prof', 'ph', 'rep', 'reps', 'rev', 'sen', 'sens', 'sgt', 'st', 'supt', 'v', 'vs']).freeze
   NUMBER_ABBREVIATIONS = Set.new(['art', 'ext', 'no', 'nos', 'p', 'pp']).freeze
 end
data/spec/pragmatic_segmenter/languages/english_spec.rb
CHANGED
@@ -1369,5 +1369,20 @@ RSpec.describe PragmaticSegmenter::Languages::English, "(en)" do
       ps = PragmaticSegmenter::Segmenter.new(text: "He has Ph.D. level training", clean: false)
       expect(ps.segment).to eq(["He has Ph.D. level training"])
     end
+
+    it "correctly segments text #111" do
+      ps = PragmaticSegmenter::Segmenter.new(text: "I will be paid Rs. 16720/- in total for the time spent and the inconvenience caused to me, only after completion of all aspects of the study.", clean: false)
+      expect(ps.segment).to eq(["I will be paid Rs. 16720/- in total for the time spent and the inconvenience caused to me, only after completion of all aspects of the study."])
+    end
+
+    it "correctly segments text #112" do
+      ps = PragmaticSegmenter::Segmenter.new(text: "If I decide to withdraw from the study for other reasons, I will be paid only up to the extent of my participation amount according to the approved procedure of Apotex BEC. If I complete all aspects in Period 1, I will be paid Rs. 3520 and if I complete all aspects in Period 1 and Period 2, I will be paid Rs. 7790 and if I complete all aspects in Period 1, Period 2 and Period 3, I will be paid Rs. 12060 at the end of the study.", clean: false)
+      expect(ps.segment).to eq(["If I decide to withdraw from the study for other reasons, I will be paid only up to the extent of my participation amount according to the approved procedure of Apotex BEC.", "If I complete all aspects in Period 1, I will be paid Rs. 3520 and if I complete all aspects in Period 1 and Period 2, I will be paid Rs. 7790 and if I complete all aspects in Period 1, Period 2 and Period 3, I will be paid Rs. 12060 at the end of the study."])
+    end
+
+    it "correctly segments text #113" do
+      ps = PragmaticSegmenter::Segmenter.new(text: "After completion of each Period, I will be paid an advance amount of rs. 1000 and this amount will be deducted from my final study compensation.", clean: false)
+      expect(ps.segment).to eq(["After completion of each Period, I will be paid an advance amount of rs. 1000 and this amount will be deducted from my final study compensation."])
+    end
   end
 end
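The new specs expect that a period after "Rs." (or "rs.") no longer ends a sentence. The idea behind an abbreviation list can be illustrated with a deliberately simplified sketch. This is not the gem's implementation: `toy_segment` and `TOY_ABBREVIATIONS` are hypothetical names, and the rule used here (split after a period followed by whitespace unless the preceding token is a listed abbreviation) is an assumption made for illustration only.

```ruby
require 'set'

# Toy abbreviation list (hypothetical); 'rs' mirrors the entry added in 0.3.14.
TOY_ABBREVIATIONS = Set.new(%w[mr dr rs]).freeze

# Simplified splitter, NOT the gem's algorithm: break after a period that is
# followed by whitespace, unless the token before the period (downcased) is a
# known abbreviation. Whitespace between joined chunks is normalized to one space.
def toy_segment(text, abbreviations = TOY_ABBREVIATIONS)
  sentences = []
  current = +''
  text.split(/(?<=\.)\s+/).each do |chunk|
    current << ' ' unless current.empty?
    current << chunk
    # Token immediately before the trailing period, e.g. "Rs" in "... paid Rs."
    last_word = chunk[/(\S+)\.\z/, 1].to_s.downcase
    next if abbreviations.include?(last_word) # abbreviation: keep accumulating
    sentences << current
    current = +''
  end
  sentences << current unless current.empty?
  sentences
end
```

With `'rs'` in the set, `toy_segment("I will be paid Rs. 3520 today. That is all.")` keeps "Rs. 3520" inside the first sentence; passing a set without `'rs'` splits the text after "Rs.", which is the kind of unwanted break the release addresses.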
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pragmatic_segmenter
 version: !ruby/object:Gem::Version
-  version: 0.3.
+  version: 0.3.14
 platform: ruby
 authors:
 - Kevin S. Dias
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-
+date: 2017-06-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: unicode
@@ -178,7 +178,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.6.
+rubygems_version: 2.6.12
 signing_key:
 specification_version: 4
 summary: A rule-based sentence boundary detection gem that works out-of-the-box across