pragmatic_segmenter 0.0.3 → 0.0.4

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 4a3193df3b8ca130acd3ce35c73cc1e6f33ec4c1
4
- data.tar.gz: d309a5c69fe5abe4cc06a4d97e58191b391a8e88
3
+ metadata.gz: 9d58b0da895c9249efd632eca9c21ba386ab0ba3
4
+ data.tar.gz: 3e9b46ad1c9bbc1704fe2a6b2c4ce899c73267fa
5
5
  SHA512:
6
- metadata.gz: c830d570cbc5d4cb36f93385d8a1e29da2101cf8831c4c3ce217bf079ea6c1990d059c246e3a13862109f71a2b4ddbbc5bf078308a43f91cc610ada7aaf10264
7
- data.tar.gz: 94ae13f93099f6fd5923e60a315313bac7122f5ca3b6536992f3fcd20352cdfd19fdf227e74a37a2483146861cfd6ed5c92396c008fd4413e142633f4cfad715
6
+ metadata.gz: 14b408a8e527d35d6da36082fa17240c1c9c6b50f66747fda05e52ed859ca0d1c3b6f5a8f57ef8fd0827b5134d960beed9536509738c94db31dbcd12b3a03fcd
7
+ data.tar.gz: 6c67d78cf0777504c4c21e6cf9a168897b4c8c43bf3e3e84a6161f81bcbb8355a5794d35a9924196fafed4a0d682ff74063548a676d454e1a198ad3ea9959a83
data/README.md CHANGED
@@ -1,6 +1,6 @@
1
1
  #Pragmatic Segmenter
2
2
 
3
- [![Gem Version](https://badge.fury.io/rb/pragmatic_segmenter.svg)](http://badge.fury.io/rb/pragmatic_segmenter) [![Code Climate](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/gpa.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![Build Status](https://travis-ci.org/diasks2/pragmatic_segmenter.png)](https://travis-ci.org/diasks2/pragmatic_segmenter) [![Test Coverage](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/coverage.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter)
3
+ [![Gem Version](https://badge.fury.io/rb/pragmatic_segmenter.svg)](http://badge.fury.io/rb/pragmatic_segmenter) [![Code Climate](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/gpa.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![Build Status](https://travis-ci.org/diasks2/pragmatic_segmenter.png)](https://travis-ci.org/diasks2/pragmatic_segmenter) [![Test Coverage](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/coverage.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/diasks2/pragmatic_segmenter/blob/master/LICENSE.txt)
4
4
 
5
5
  Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
6
6
 
@@ -97,6 +97,8 @@ Therefore, I created a set of distinct edge cases to compare segmentation tools
97
97
 
98
98
  The Holy Grail of sentence segmentation appears to be **Golden Rule #18** as no segmenter I tested was able to correctly segment that text. The difficulty being that an abbreviation (in this case a.m./A.M./p.m./P.M.) followed by a capitalized abbreviation (such as Mr., Mrs., etc.) or followed by a proper noun such as a name can be both a sentence boundary and a non sentence boundary.
99
99
 
100
+ Download the Golden Rules: [[txt](https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt) | [Ruby RSpec](https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules_rspec.rb)]
101
+
100
102
  ####Golden Rules (English)
101
103
 
102
104
  1.) **Simple period to end sentence**
@@ -654,7 +656,7 @@ Other tools not yet tested:
654
656
 
655
657
  ## Speed Performance Benchmarks
656
658
 
657
- To test the relative performance of different segmentation tools and libraries I created a simple benchmark test. The test takes the 50 English Golden Rules combined into one string and runs it 100 times through the segmenter. This speed benchmark is by no means the most scientific benchmark, but it should help to give some relative performance data. The tests were done on a Mac Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5. For Punkt the tests were run using the [Ruby port](https://github.com/lfcipriani/punkt-segmenter). For Standford CoreNLP the tests were run using the [Ruby port](https://github.com/louismullie/stanford-core-nlp). For OpenNLP the tests were run using the [Ruby port](https://github.com/louismullie/open-nlp).
659
+ To test the relative performance of different segmentation tools and libraries I created a simple benchmark test. The test takes the 50 English Golden Rules combined into one string and runs it 100 times through the segmenter. This speed benchmark is by no means the most scientific benchmark, but it should help to give some relative performance data. The tests were done on a Mac Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5. For Punkt the tests were run using this [Ruby port](https://github.com/lfcipriani/punkt-segmenter), for Standford CoreNLP the tests were run using this [Ruby port](https://github.com/louismullie/stanford-core-nlp), and for OpenNLP the tests were run using this [Ruby port](https://github.com/louismullie/open-nlp).
658
660
 
659
661
  ## Languages with sentence boundary punctuation that is different than English
660
662
 
@@ -674,6 +676,7 @@ To test the relative performance of different segmentation tools and libraries I
674
676
 
675
677
  ##Segmentation Papers and Books
676
678
 
679
+ * *Sentence Boundary Detection: A Long Solved Problem?* (Second Edition) - Jonathon Read, Rebecca Dridan, Stephan Oepen, Lars Jørgen Solberg (2012) [[pdf](http://www.aclweb.org/anthology/C12-2096) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/C12-2096.pdf)]
677
680
  * *Handbook of Natural Language Processing* (Second Edition) - Nitin Indurkhya and Fred J. Damerau (2010) [[amazon](http://www.amazon.com/Handbook-Language-Processing-Learning-Recognition/dp/1420085921)]
678
681
  * *Sentence Boundary Detection and the Problem with the U.S.* - Dan Gillick (2009) [[pdf](http://dgillick.com/resource/sbd_naacl_2009.pdf) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/sbd_naacl_2009.pdf)]
679
682
  * *Thoughts on Word and Sentence Segmentation in Thai* - Wirote Aroonmanakun (2007) [[pdf](http://pioneer.chula.ac.th/~awirote/ling/snlp2007-wirote.pdf) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/snlp2007-wirote.pdf)]
@@ -50,6 +50,9 @@ module PragmaticSegmenter
50
50
  # Rubular: http://rubular.com/r/DwNSuZrNtk
51
51
  ConsecutivePeriodsRule = Rule.new(/\.{5,}/, ' ')
52
52
 
53
+ # http://rubular.com/r/IQ4TPfsbd8
54
+ ConsecutiveForwardSlashRule = Rule.new(/\/{3}/, '')
55
+
53
56
  ReplaceNewlineWithCarriageReturnRule = Rule.new(/\n/, "\r")
54
57
 
55
58
  QuotationsFirstRule = Rule.new(/''/, '"')
@@ -135,7 +138,9 @@ module PragmaticSegmenter
135
138
  end
136
139
 
137
140
  def clean_table_of_contents(txt)
138
- txt.apply(TableOfContentsRule).apply(ConsecutivePeriodsRule)
141
+ txt.apply(TableOfContentsRule).
142
+ apply(ConsecutivePeriodsRule).
143
+ apply(ConsecutiveForwardSlashRule)
139
144
  end
140
145
  end
141
146
  end
@@ -0,0 +1,30 @@
1
+ module PragmaticSegmenter
2
+ module LanguageSupport
3
+ LANGUAGE_CODES = {
4
+ 'en' => 'English',
5
+ 'de' => 'Deutsch',
6
+ 'es' => 'Spanish',
7
+ 'fr' => 'French',
8
+ 'it' => 'Italian',
9
+ 'ja' => 'Japanese',
10
+ 'el' => 'Greek',
11
+ 'ru' => 'Russian',
12
+ 'ar' => 'Arabic',
13
+ 'am' => 'Amharic',
14
+ 'hi' => 'Hindi',
15
+ 'hy' => 'Armenian',
16
+ 'fa' => 'Persian',
17
+ 'my' => 'Burmese',
18
+ 'ur' => 'Urdu',
19
+ }
20
+
21
+ def process_class
22
+ Object.const_get("PragmaticSegmenter::Languages::#{LANGUAGE_CODES[language] || 'Common'}::Process")
23
+ end
24
+
25
+ def cleaner_class
26
+ Object.const_get("PragmaticSegmenter::Languages::#{LANGUAGE_CODES[language] || 'Common'}::Cleaner")
27
+ end
28
+
29
+ end
30
+ end
@@ -13,6 +13,9 @@ module PragmaticSegmenter
13
13
  end
14
14
  end
15
15
 
16
+ class Cleaner < PragmaticSegmenter::Cleaner
17
+ end
18
+
16
19
  class SentenceBoundaryPunctuation < PragmaticSegmenter::SentenceBoundaryPunctuation
17
20
  SENTENCE_BOUNDARY = /.*?[፧።!\?]|.*?$/
18
21
 
@@ -17,6 +17,9 @@ module PragmaticSegmenter
17
17
  end
18
18
  end
19
19
 
20
+ class Cleaner < PragmaticSegmenter::Cleaner
21
+ end
22
+
20
23
  class SentenceBoundaryPunctuation < PragmaticSegmenter::SentenceBoundaryPunctuation
21
24
  SENTENCE_BOUNDARY = /.*?[:\.!\?؟،]|.*?\z|.*?$/
22
25
 
@@ -13,6 +13,9 @@ module PragmaticSegmenter
13
13
  end
14
14
  end
15
15
 
16
+ class Cleaner < PragmaticSegmenter::Cleaner
17
+ end
18
+
16
19
  class SentenceBoundaryPunctuation < PragmaticSegmenter::SentenceBoundaryPunctuation
17
20
  SENTENCE_BOUNDARY = /.*?[։՜:]|.*?$/
18
21
 
@@ -13,6 +13,9 @@ module PragmaticSegmenter
13
13
  end
14
14
  end
15
15
 
16
+ class Cleaner < PragmaticSegmenter::Cleaner
17
+ end
18
+
16
19
  class SentenceBoundaryPunctuation < PragmaticSegmenter::SentenceBoundaryPunctuation
17
20
  SENTENCE_BOUNDARY = /.*?[။၏!\?]|.*?$/
18
21
 
@@ -0,0 +1,10 @@
1
+ module PragmaticSegmenter
2
+ module Languages
3
+ class Common
4
+ class Process < PragmaticSegmenter::Process
5
+ end
6
+ class Cleaner < PragmaticSegmenter::Cleaner
7
+ end
8
+ end
9
+ end
10
+ end
@@ -17,6 +17,9 @@ module PragmaticSegmenter
17
17
  end
18
18
  end
19
19
 
20
+ class Cleaner < PragmaticSegmenter::Cleaner
21
+ end
22
+
20
23
  class Number < PragmaticSegmenter::Number
21
24
  # Rubular: http://rubular.com/r/hZxoyQwKT1
22
25
  NumberPeriodSpaceRule = Rule.new(/(?<=\s[0-9]|\s([1-9][0-9]))\.(?=\s)/, '∯')
@@ -1,6 +1,9 @@
1
1
  module PragmaticSegmenter
2
2
  module Languages
3
3
  class English
4
+ class Process < PragmaticSegmenter::Process
5
+ end
6
+
4
7
  class Cleaner < PragmaticSegmenter::Cleaner
5
8
  def clean
6
9
  super
@@ -1,6 +1,12 @@
1
1
  module PragmaticSegmenter
2
2
  module Languages
3
3
  class French
4
+ class Process < PragmaticSegmenter::Process
5
+ end
6
+
7
+ class Cleaner < PragmaticSegmenter::Cleaner
8
+ end
9
+
4
10
  class Abbreviation < PragmaticSegmenter::Abbreviation
5
11
  ABBREVIATIONS = ['a.c.n', 'a.m', 'al', 'ann', 'apr', 'art', 'auj', 'av', 'b.p', 'boul', 'c.-à-d', 'c.n', 'c.n.s', 'c.p.i', 'c.q.f.d', 'c.s', 'ca', 'cf', 'ch.-l', 'chap', 'co', 'co', 'contr', 'dir', 'e.g', 'e.v', 'env', 'etc', 'ex', 'fasc', 'fig', 'fr', 'fém', 'hab', 'i.e', 'ibid', 'id', 'inf', 'l.d', 'lib', 'll.aa', 'll.aa.ii', 'll.aa.rr', 'll.aa.ss', 'll.ee', 'll.mm', 'll.mm.ii.rr', 'loc.cit', 'ltd', 'ltd', 'masc', 'mm', 'ms', 'n.b', 'n.d', 'n.d.a', 'n.d.l.r', 'n.d.t', 'n.p.a.i', 'n.s', 'n/réf', 'nn.ss', 'p.c.c', 'p.ex', 'p.j', 'p.s', 'pl', 'pp', 'r.-v', 'r.a.s', 'r.i.p', 'r.p', 's.a', 's.a.i', 's.a.r', 's.a.s', 's.e', 's.m', 's.m.i.r', 's.s', 'sec', 'sect', 'sing', 'sq', 'sqq', 'ss', 'suiv', 'sup', 'suppl', 't.s.v.p', 'tél', 'vb', 'vol', 'vs', 'x.o', 'z.i', 'éd']
6
12
 
@@ -9,6 +9,9 @@ module PragmaticSegmenter
9
9
  end
10
10
  end
11
11
 
12
+ class Cleaner < PragmaticSegmenter::Cleaner
13
+ end
14
+
12
15
  class SentenceBoundaryPunctuation < PragmaticSegmenter::SentenceBoundaryPunctuation
13
16
  SENTENCE_BOUNDARY = /.*?[\.;!\?]|.*?$/
14
17
 
@@ -13,6 +13,9 @@ module PragmaticSegmenter
13
13
  end
14
14
  end
15
15
 
16
+ class Cleaner < PragmaticSegmenter::Cleaner
17
+ end
18
+
16
19
  class SentenceBoundaryPunctuation < PragmaticSegmenter::SentenceBoundaryPunctuation
17
20
  SENTENCE_BOUNDARY = /.*?[।\|!\?]|.*?$/
18
21
 
@@ -9,6 +9,9 @@ module PragmaticSegmenter
9
9
  end
10
10
  end
11
11
 
12
+ class Cleaner < PragmaticSegmenter::Cleaner
13
+ end
14
+
12
15
  class Abbreviation < PragmaticSegmenter::Abbreviation
13
16
  ABBREVIATIONS = ['1°', 'a.c', 'a.c/a', 'a.cam', 'a.civ', 'a.cor', 'a.d.r', 'a.gov', 'a.mil', 'a.mon', 'a.smv', 'a.v', 'a/a', 'a/c', 'a/i', 'aa', 'aaaa', 'aaal', 'aacst', 'aamct', 'aams', 'aar', 'aato', 'ab', 'abbigl', 'abbrev', 'abc', 'abi', 'abl', 'abm', 'abr', 'abs', 'absp', 'ac', 'acam', 'acb', 'acbi', 'acc', 'accorc', 'accr', 'acd', 'ace', 'acec', 'acep', 'aci', 'acli', 'acp', 'acro', 'acsit', 'actl', 'ad', 'ad.mil', 'ada', 'adap', 'adatt', 'adc', 'add', 'adei', 'adeion', 'adhd', 'adi', 'adisco', 'adj', 'adm', 'adp', 'adr', 'ads', 'adsi', 'adsl', 'adv', 'ae.b', 'aefi', 'aer', 'aerodin', 'aeron', 'afa', 'afc', 'afci', 'affl', 'afi', 'afic', 'afm', 'afp', 'ag', 'agcm', 'agcom', 'age', 'agecs', 'agesci', 'agg', 'agip', 'agis', 'agm', 'ago', 'agr', 'agric', 'agt', 'ai', 'aia', 'aiab', 'aiac', 'aiace', 'aiap', 'aias', 'aiat', 'aib', 'aic', 'aica', 'aicel', 'aici', 'aics', 'aid', 'aida', 'aidaa', 'aidac', 'aidama', 'aidda', 'aidim', 'aido', 'aids', 'aies', 'aif', 'aih', 'aiip', 'aimi', 'aip', 'aipsc', 'airi', 'ais', 'aisa', 'aism', 'aiss', 'aissca', 'aitc', 'aiti', 'aitr', 'aits', 'aka', 'al', 'alai', 'alch', 'alg', 'ali', 'alim', 'all', 'allev', 'allus', 'alp', 'alq', 'alt', 'am', 'ama', 'amaci', 'amag', 'amami', 'amc', 'ammec', 'amn', 'ampas', 'amps', 'an', 'ana', 'anaai', 'anac', 'anaci', 'anad', 'anai', 'anaoo', 'anart', 'anat', 'anat. comp', 'ancci', 'anci', 'ancip', 'ancsa', 'andit', 'anec', 'anee', 'anem', 'anes', 'anffas', 'ani', 'ania', 'anica', 'anie', 'animi', 'anis', 'anisc', 'anm', 'anmfit', 'anmig', 'anmil', 'anmli', 'anms', 'anpa', 'anpas', 'anpci', 'anpe', 'anpi', 'ansi', 'ansv', 'ant', 'anta', 'antifr', 'antlo', 'anton', 'antrop', 'anusca', 'anvi', 'anx', 'ao', 'ap', 'apa', 'apd', 'apea', 'apec', 'apet', 'api', 'apos', 'app', 'app.sc', 'apr', 'aps', 'apt', 'aq', 'ar', 'ar.ind', 'ar.rep', 'arald', 'arame', 'arc', 'arch', 'archeol', 'arci', 'ardsu', 'are', 'arg', 'aritm', 'arpa', 'arpat', 'arred', 'arrt', 'arsia', 'art', 'arti min', 'artig', 'artigl', 'artt', 'as', 'asa', 'asae', 'asc', 'asci', 'ascii', 'ascom', 'ascop', 'asd', 'ase', 'asf', 'asfer', 'asg', 'asic', 'asifa', 'asl', 'asmdc', 'asmi', 'asp', 'aspic', 'aspp', 'assi', 'assic', 'assol', 'asst', 'aster', 'astr', 'astrol', 'astron', 'at', 'ata', 'atb', 'atic', 'atm', 'ats', 'att', 'attrav', 'atv', 'au', 'auc', 'aus', 'auser', 'aut', 'autom', 'av', 'avi', 'avis', 'avo', 'avv', 'avvers', 'awb', 'awdp', 'az', 'azh', 'b.a', 'b2b', 'b2c', 'ba', 'bafta', 'bal', 'ball', 'ban', 'banc', 'bar', 'bart', 'bas', 'bat', 'batt', 'bban', 'bbc', 'bbl', 'bbs', 'bbtc', 'bcc', 'bce', 'bcf', 'bdf', 'bei', 'bep', 'bers', 'bg', 'bi', 'bibl', 'bic', 'bioch', 'biol', 'bl', 'bld', 'bldg', 'blpc', 'bm', 'bmps', 'bmw', 'bn', 'bna', 'bncf', 'bncrm', 'bni', 'bnl', 'bo', 'bot', 'bpl', 'bpm', 'bpn', 'bpr', 'br', 'brd', 'bre', 'bric', 'brig', 'brig.ca', 'brig.gen', 'bros', 'bs', 'bsc', 'bsp', 'bsu', 'bt', 'btc', 'btg', 'btg.l', 'btr', 'bts', 'bu', 'bur', 'bz', 'c.a', 'c.a.p', 'c.c.p', 'c.cost', 'c.d a', 'c.d', 'c.le', 'c.m', 'c.opv', 'c.p', 'c.s', 'c.v', 'c.v.d', 'c/a', 'c/c', 'c/pag', 'ca', 'ca.rep', 'ca.sm', 'ca.sz', 'ca.uf', 'caaf', 'cab', 'cad', 'cae', 'cai', 'cal', 'cam', 'cap', 'capol', 'capt', 'car', 'car.sc', 'carat', 'card', 'cas', 'casaca', 'casd', 'cass.civ', 'cat', 'caus', 'cav', 'cavg', 'cb', 'cbd', 'cbr', 'cbs', 'cc', 'cca', 'ccap', 'ccda', 'ccdp', 'ccee', 'cciaa', 'ccie', 'ccip', 'cciss', 'ccna', 'ccnl', 'ccnp', 'ccpb', 'ccs', 'ccsp', 'cctld', 'cctv', 'ccv', 'cd', 'cda', 'cdma', 'cdo', 'cdpd', 'cdr', 'cds', 'cdw', 'ce', 'ced', 'cee', 'cei', 'cemat', 'cenelec', 'centr', 'cepis', 'ceps', 'cept', 'cerit', 'cese', 'cesis', 'cesvot', 'cet', 'cf', 'cfa', 'cfr', 'cg', 'cgi', 'cgil', 'cgs', 'ch', 'chf', 'chim', 'chim. ind', 'chir', 'ci', 'ci-europa', 'ciber', 'cicae', 'cid', 'cie', 'cif', 'cifej', 'cig', 'cigs', 'cii', 'cilea', 'cilo', 'cim', 'cime', 'cin', 'cinit', 'cio', 'cipe', 'cirm', 'cisal', 'ciscs', 'cisd', 'cisl', 'cism', 'citol', 'cl', 'class', 'cli', 'cm', 'cmdr', 'cme', 'cmo', 'cmr', 'cms', 'cmyk', 'cm²', 'cm³', 'cn', 'cna', 'cnb', 'cnc', 'cnel', 'cngei', 'cni', 'cnipa', 'cnit', 'cnn', 'cnr', 'cns', 'cnt', 'cnvvf', 'co', 'co.ing', 'co.sa', 'cobas', 'coc', 'cod', 'cod. civ', 'cod. deont. not', 'cod. pen', 'cod. proc. civ', 'cod. proc. pen', 'codec', 'coi', 'col', 'colf', 'coll', 'com', 'comdr', 'comm', 'comp', 'compar', 'compl', 'con', 'conai', 'conc', 'concl', 'condiz', 'confetra', 'confitarma', 'confr', 'cong', 'congeav', 'congiunt', 'coni', 'coniug', 'consec', 'consob', 'contab', 'contr', 'coreco', 'corp', 'corr', 'correl', 'corrisp', 'cosap', 'cospe', 'cost', 'costr', 'cpc', 'cpdel', 'cpe', 'cpi', 'cpl', 'cpt', 'cpu', 'cr', 'cral', 'credem', 'crf', 'cri', 'cric', 'cristall', 'crm', 'cro', 'cron', 'crsm', 'crt', 'cs', 'csa', 'csai', 'csc', 'csm', 'csn', 'css', 'ct', 'ctc', 'cti', 'ctr', 'ctsis', 'cuc', 'cud', 'cun', 'cup', 'cusi', 'cvb', 'cvbs', 'cwt', 'cz', 'd', 'd.c', 'd.i.a', 'dab', 'dac', 'dam', 'dams', 'dat', 'dau', 'db', 'dbms', 'dc', 'dca', 'dccc', 'dda', 'ddp', 'ddr', 'ddt', 'dea', 'decoraz', 'dect', 'dek', 'denom', 'deriv', 'derm', 'determ', 'df', 'dfp', 'dg', 'dga', 'dhcp', 'di', 'dia', 'dial', 'dic', 'dicomac', 'dif', 'difett', 'dig. iv', 'digos', 'dimin', 'dimostr', 'din', 'dipart', 'diplom', 'dir', 'dir. amm', 'dir. can', 'dir. civ', 'dir. d. lav', 'dir. giur', 'dir. internaz', 'dir. it', 'dir. pen', 'dir. priv', 'dir. proces', 'dir. pub', 'dir. rom', 'disus', 'diy', 'dl', 'dlf', 'dm', 'dme', 'dmf', 'dmo', 'dmoz', 'dm²', 'dm³', 'dnr', 'dns', 'doa', 'doc', 'docg', 'dom', 'dop', 'dos', 'dott', 'dpa', 'dpi', 'dpl', 'dpof', 'dps', 'dpt', 'dr', 'dra', 'drm', 'drs', 'dry pt', 'ds', 'dslam', 'dspn', 'dss', 'dtc', 'dtmf', 'dtp', 'dts', 'dv', 'dvb', 'dvb-t', 'dvd', 'dvi', 'dwdm', 'e.g', 'e.p.c', 'ead', 'eafrd', 'ean', 'eap', 'easw', 'eb', 'eban', 'ebr', 'ebri', 'ebtn', 'ecc', 'eccl', 'ecdl', 'ecfa', 'ecff', 'ecg', 'ecm', 'econ', 'econ. az', 'econ. dom', 'econ. pol', 'ecpnm', 'ed', 'ed agg', 'edge', 'edi', 'edil', 'edit', 'ef', 'efa', 'efcb', 'efp', 'efsa', 'efta', 'eg', 'egiz', 'egl', 'egr', 'ei', 'eisa', 'elab', 'elettr', 'elettron', 'ellitt', 'emap', 'emas', 'embr', 'emdr', 'emi', 'emr', 'en', 'enaip', 'enal', 'enaoli', 'enapi', 'encat', 'enclic', 'enea', 'enel', 'eni', 'enigm', 'enit', 'enol', 'enpa', 'enpaf', 'enpals', 'enpi', 'enpmf', 'ens', 'entom', 'epd', 'epigr', 'epirbs', 'epl', 'epo', 'ept', 'erc', 'ercom', 'ermes', 'erp', 'es', 'esa', 'escl', 'esist', 'eso', 'esp', 'estens', 'estr. min', 'etacs', 'etf', 'eti', 'etim', 'etn', 'etol', 'eu', 'eufem', 'eufic', 'eula', 'eva®', 'f.a', 'f.b', 'f.m', 'f.p', 'fa', 'fabi', 'fac', 'facl', 'facs', 'fad', 'fai', 'faile', 'failp', 'failpa', 'faisa', 'falcri', 'fam', 'famar', 'fans', 'fao', 'fapav', 'faq', 'farm', 'fasi', 'fasib', 'fatt', 'fbe', 'fbi', 'fc', 'fco', 'fcp', 'fcr', 'fcu', 'fdi', 'fe', 'feaog', 'feaosc', 'feb', 'fedic', 'fema', 'feoga', 'ferr', 'fesco', 'fesr', 'fess', 'fg', 'fi', 'fiaf', 'fiaip', 'fiais', 'fialtel', 'fiap', 'fiapf', 'fiat', 'fiavet', 'fic', 'ficc', 'fice', 'fidal', 'fidam', 'fidapa', 'fieg', 'fifa', 'fifo', 'fig', 'figc', 'figs', 'filat', 'filcams', 'file', 'filol', 'filos', 'fim', 'fima', 'fimmg', 'fin', 'finco', 'fio', 'fioto', 'fipe', 'fipresci', 'fis', 'fisar', 'fisc', 'fisg', 'fisiol', 'fisiopatol', 'fistel', 'fit', 'fita', 'fitav', 'fits', 'fiv', 'fivet', 'fivl', 'flo', 'flpd', 'fluid pt', 'fm', 'fmcg', 'fmi', 'fmth', 'fnas', 'fnomceo', 'fnsi', 'fob', 'fod', 'folcl', 'fon', 'fop', 'fotogr', 'fp', 'fpc', 'fpld', 'fr', 'fra', 'fs', 'fsc', 'fse', 'fsf', 'fsfi', 'fsh', 'ft', 'ftase', 'ftbcc', 'fte', 'ftp', 'fts', 'ft²', 'ft³', 'fuaav', 'fut', 'fv', 'fvg', 'g.fv', 'g.u', 'g.u.el', 'gal', 'gats', 'gatt', 'gb', 'gc', 'gccc', 'gco', 'gcost', 'gd', 'gdd', 'gdf', 'gdi', 'gdo', 'gdp', 'ge', 'gea', 'gel', 'gen', 'geneal', 'geod', 'geofis', 'geogr', 'geogr. antr', 'geogr. fis', 'geol', 'geom', 'gep', 'germ', 'gescal', 'gg', 'ggv', 'gi', 'gia', 'gides', 'gift', 'gio', 'giorn', 'gis', 'gisma', 'gismo', 'giu', 'gm', 'gmdss', 'gme', 'gmo', 'go', 'gov', 'gp', 'gpl', 'gprs', 'gps', 'gr', 'gr.sel.spec', 'gr.sel.tr', 'gr.sqd', 'gra', 'gram', 'grano', 'grd', 'grtn', 'grv', 'gsa', 'gsm', 'gsm-r', 'gsr', 'gtld', 'gu', 'guce', 'gui', 'gus', 'ha', 'haart', 'haccp', 'hba', 'hcg', 'hcrp', 'hd-dvd', 'hdcp', 'hdi', 'hdml', 'hdtv', 'hepa', 'hfpa', 'hg', 'hifi', 'hiperlan', 'hiv', 'hm', 'hmld', 'hon', 'hosp', 'hpv', 'hr', 'hrh', 'hrm', 'hrt', 'html', 'http', 'hvac', 'hz', 'i.e', 'i.g.m', 'iana', 'iasb', 'iasc', 'iass', 'iat', 'iata', 'iatse', 'iau', 'iban', 'ibid', 'ibm', 'icann', 'icao', 'icbi', 'iccu', 'ice', 'icf', 'ici', 'icm', 'icom', 'icon', 'ics', 'icsi', 'icstis', 'ict', 'icta', 'id', 'iden', 'idl', 'idraul', 'iec', 'iedm', 'ieee', 'ietf', 'ifat', 'ifel', 'ifla', 'ifrs', 'ifto', 'ifts', 'ig', 'igm', 'igmp', 'igp', 'iims', 'iipp', 'ilm', 'ilo', 'ilor', 'ils', 'im', 'imaie', 'imap', 'imc', 'imdb', 'imei', 'imi', 'imms', 'imo', 'imp', 'imper', 'imperf', 'impers', 'imq', 'ims', 'imsi', 'in', 'inail', 'inca', 'incb', 'inci', 'ind', 'ind. agr', 'ind. alim', 'ind. cart', 'ind. chim', 'ind. cuoio', 'ind. estratt', 'ind. graf', 'ind. mecc', 'ind. tess', 'indecl', 'indef', 'indeterm', 'indire', 'inea', 'inf', 'infea', 'infm', 'inform', 'ing', 'ingl', 'inmarsat', 'inpdai', 'inpdap', 'inpgi', 'inps', 'inr', 'inran', 'ins', 'insp', 'int', 'inter', 'intr', 'invar', 'invim', 'in²', 'in³', 'ioma', 'iosco', 'ip', 'ipab', 'ipasvi', 'ipi', 'ippc', 'ips', 'iptv', 'iq', 'ira', 'irap', 'ircc', 'ircs', 'irda', 'iref', 'ires', 'iron', 'irpef', 'irpeg', 'irpet', 'irreg', 'is', 'isae', 'isbd', 'isbn', 'isc', 'isdn', 'isee', 'isef', 'isfol', 'isg', 'isi', 'isia', 'ism', 'ismea', 'isnart', 'iso', 'isp', 'ispearmi', 'ispel', 'ispescuole', 'ispesl', 'ispo', 'ispro', 'iss', 'issn', 'istat', 'istol', 'isvap', 'it', 'iti', 'itt', 'ittiol', 'itu', 'iud', 'iugr', 'iulm', 'iva', 'iveco', 'ivg', 'ivr', 'ivs', 'iyhp', 'j', 'jal', 'jit', 'jr', 'jv', 'k', 'kb', 'kee', 'kg', 'kkk', 'klm', 'km', 'km/h', 'kmph', 'kmq', 'km²', 'kr', 'kw', 'kwh', 'l', 'l\'ing', 'l.n', 'l\'avv', 'la', 'lag', 'lan', 'lanc', 'larn', 'laser', 'lat', 'lav', 'lav. femm', 'lav. pubbl', 'laz', 'lb', 'lc', 'lcca', 'lcd', 'le', 'led', 'lett', 'lh', 'li', 'liaf', 'lib', 'lic', 'lic.ord', 'lic.strd', 'licd', 'lice', 'lida', 'lidci', 'liff', 'lifo', 'lig', 'liit', 'lila', 'lilt', 'linfa', 'ling', 'lipu', 'lis', 'lisaac', 'lism', 'lit', 'litab', 'lnp', 'lo', 'loc', 'loc. div', 'lolo', 'lom', 'long', 'lp', 'lrm', 'lrms', 'lsi', 'lsu', 'lt', 'ltd', 'lu', 'lug', 'luiss', 'lun', 'lwt', 'lww', 'm.a', 'm.b', 'm.o', 'm/s', 'ma', 'mac', 'macch', 'mag', 'magg.(maj)', 'magg.gen.(maj.gen.)', 'mai', 'maj', 'mar', 'mar.a', 'mar.ca', 'mar.ord', 'marc', 'mat', 'mater', 'max', 'mb', 'mbac', 'mc', 'mcl', 'mcpc', 'mcs', 'md', 'mdf', 'mdp', 'me', 'mec', 'mecc', 'med', 'mediev', 'mef', 'mer', 'merc', 'merid', 'mesa', 'messrs', 'metall', 'meteor', 'metr', 'metrol', 'mg', 'mgc', 'mgm', 'mi', 'mibac', 'mica', 'microb', 'mifed', 'miglio nautico', 'miglio nautico per ora', 'miglio nautico²', 'miglio²', 'mil', 'mile', 'miles/h', 'milesph', 'min', 'miner', 'mips', 'miptv', 'mit', 'mitol', 'miur', 'ml', 'mlle', 'mls', 'mm', 'mme', 'mms', 'mm²', 'mn', 'mnp', 'mo', 'mod', 'mol', 'mons', 'morf', 'mos', 'mpaa', 'mpd', 'mpeg', 'mpi', 'mps', 'mq', 'mr', 'mrs', 'ms', 'msgr', 'mss', 'mt', 'mto', 'murst', 'mus', 'mvds', 'mws', 'm²', 'm³', 'n.a', 'n.b', 'na', 'naa', 'nafta', 'napt', 'nars', 'nasa', 'nat', 'natas', 'nato', 'nb', 'nba', 'nbc', 'ncts', 'nd', 'nda', 'nde', 'ndr', 'ndt', 'ne', 'ned', 'neg', 'neol', 'netpac', 'neur', 'news!', 'ngcc', 'nhmf', 'nlcc', 'nmr', 'no', 'nodo', 'nom', 'nos', 'nov', 'novissdi', 'npi', 'nr', 'nt', 'nta', 'nts', 'ntsc', 'nu', 'nuct', 'numism', 'nwt', 'nyc', 'nz', 'o.m.i', 'oai-pmh', 'oav', 'oc', 'occ', 'occult', 'oci', 'ocr', 'ocse', 'oculist', 'od', 'odg', 'odp', 'oecd', 'oem', 'ofdm', 'oft', 'og', 'ogg', 'ogi', 'ogm', 'ohim', 'oic', 'oics', 'olaf', 'oland', 'ole', 'oled', 'omi', 'oms', 'on', 'ong', 'onig', 'onlus', 'onomat', 'onpi', 'onu', 'op', 'opac', 'opec', 'opord', 'opsosa', 'or', 'ord', 'ord. scol', 'ore', 'oref', 'orient', 'ornit', 'orogr', 'orp', 'ort', 'os', 'osa', 'osas', 'osd', 'ot', 'ote', 'ott', 'oz', 'p', 'p.a', 'p.c', 'p.c.c', 'p.es', 'p.f', 'p.m', 'p.r', 'p.s', 'p.t', 'p.v', 'pa', 'pac', 'pag./p', 'pagg./pp', 'pai', 'pal', 'paleobot', 'paleogr', 'paleont', 'paleozool', 'paletn', 'pamr', 'pan', 'papir', 'par', 'parapsicol', 'part', 'partic', 'pass', 'pat', 'patol', 'pb', 'pc', 'pci', 'pcm', 'pcmcia', 'pcs', 'pcss', 'pct', 'pd', 'pda', 'pdf', 'pdl', 'pds', 'pe', 'pec', 'ped', 'pedag', 'peg', 'pegg', 'per.ind', 'pers', 'pert', 'pesq', 'pet', 'petr', 'petrogr', 'pfc', 'pg', 'pga', 'pgp', 'pgut', 'ph', 'php', 'pi', 'pics', 'pie', 'pif', 'pii', 'pil', 'pime', 'pin', 'pine', 'pip', 'pir', 'pit', 'pitt', 'piuss', 'pkcs', 'pki', 'pko', 'pl', 'pli', 'plr', 'pm', 'pma', 'pmi', 'pmr', 'pn', 'pnf', 'pnl', 'po', 'poet', 'pof', 'pol', 'pop', 'popitt', 'popol', 'port', 'pos', 'poss', 'post', 'pots', 'pp', 'ppa', 'ppc', 'ppga', 'ppp', 'pps', 'pptt', 'ppv', 'pr', 'pra', 'praa', 'pref', 'preist', 'prep', 'pres', 'pret', 'prg', 'pri', 'priv', 'pro.civ', 'prof', 'pron', 'pronom', 'propr', 'prov', 'prs', 'prtl', 'prusst', 'ps', 'pse', 'psi', 'psicoan', 'psicol', 'pso', 'psp', 'pstn', 'pt', 'ptc', 'pti', 'ptsd', 'ptt', 'pu', 'pug', 'puk', 'put', 'pv', 'pvb', 'pvc', 'pvt', 'pz', 'qb', 'qcs', 'qfd', 'qg', 'qi', 'qlco', 'qlcu', 'qos', 'qualif', 'r-lan', 'r.s', 'ra', 'racc', 'radar', 'radc', 'radiotecn', 'raee', 'raf', 'rag', 'raid', 'ram', 'rar', 'ras', 'rass. avv. stato', 'rc', 'rca', 'rcdp', 'rcs', 'rdc', 'rdco', 'rdf', 'rdi', 'rdp', 'rds', 'rdt', 're', 'rea', 'recipr', 'recl', 'reg', 'region', 'rel', 'rem', 'rep', 'reps', 'res', 'retor', 'rev', 'rfi', 'rfid', 'rg', 'rgb', 'rgc', 'rge', 'rgi', 'rgi bdp', 'rgpt', 'rgt', 'ri', 'riaa', 'riaj', 'riba', 'ric', 'rid', 'rif', 'rifl', 'rina', 'rip', 'ris', 'rit', 'ritts', 'rm', 'rmn', 'rn', 'ro', 'roa', 'roc', 'roi', 'rom', 'roro', 'rov', 'rp', 'rpm', 'rr', 'rrf', 'rs', 'rsc', 'rspp', 'rss', 'rsu', 'rsvp', 'rt', 'rtdpc', 'rtg', 'rtn', 'rtp', 'rttt', 'rvm', 's-dab', 's.a', 's.b.f', 's.n.c', 's.p.a', 's.p.m', 's.r.l', 's.ten', 's.v', 's/m', 'sa', 'sab', 'saca', 'sace', 'sact', 'sad', 'sag', 'sahm', 'sai', 'saisa', 'sam', 'san', 'sanas', 'sape', 'sar', 'sars', 'sart', 'sas', 'sbaf', 'sbas', 'sbn', 'sc', 'sca.sm', 'scherz', 'scien', 'scn', 'scsi', 'scuba', 'scult', 'scut', 'sdds', 'sdiaf', 'sds', 'sdsl', 'se', 'seat', 'sebc', 'sec', 'seca', 'secam', 'secc', 'see', 'seg', 'segg', 'segredifesa', 'sem', 'sempo', 'sen', 'sens', 'seo', 'serg', 'serg.magg.(sgm)', 'serg.magg.ca', 'set', 'sfc', 'sfis', 'sfx', 'sg', 'sga', 'sgc', 'sgg', 'sgml', 'sgt', 'si', 'si@lt', 'sia', 'siae', 'siaic', 'siap', 'sias', 'sic', 'sicav', 'sid', 'sido', 'sie', 'sif', 'sig', 'sig.na', 'sig.ra', 'sige', 'sigg', 'sigill', 'sigo', 'siia', 'simb', 'simbdea', 'simg', 'simo', 'sin', 'sinalv', 'sing', 'sins', 'sinu', 'siocmf', 'siog', 'sioi', 'siommms', 'siot', 'sip', 'sipem', 'sips', 'sirf', 'sirm', 'sis', 'sisde', 'sismi', 'sissa', 'sit', 'siulp', 'siusa', 'sla', 'sldn', 'slm', 'slr', 'sm', 'sma', 'smau', 'smd', 'sme', 'smes', 'smm', 'smpt', 'sms', 'sn', 'snad', 'snai', 'snc', 'sncci', 'sncf', 'sngci', 'snit', 'so', 'soc', 'sociol', 'sogg', 'soho', 'soi', 'sol', 'somipar', 'somm', 'sonar', 'sp', 'spa', 'spe', 'spett', 'spi', 'spm', 'spot', 'spp', 'spreg', 'sq', 'sqd', 'sr', 'srd', 'srl', 'srr', 'ss', 'ssi', 'ssn', 'ssr', 'sss', 'st', 'st. d. arte', 'st. d. dir', 'st. d. filos', 'st. d. rel', 'stat', 'stg', 'stp', 'stw', 'su', 'suap', 'suem', 'suff', 'sup', 'superl', 'supt', 'surg', 'surl', 'susm', 'sut', 'suv', 'sv', 'svga', 'swics', 'swift', 'swot', 'sxga', 'sz', 't-dab', 't.sg', 'ta', 'taa', 'tac', 'tacan', 'tacs', 'taeg', 'tai', 'tan', 'tar', 'targa', 'tav', 'tb', 'tbt', 'tci', 'tcp', 'tcp/ip', 'tcsm', 'tdm', 'tdma', 'te', 'tecn', 'tecnol', 'ted', 'tel', 'telecom', 'temp', 'ten.(lt)', 'ten.col.(ltc)', 'ten.gen', 'teol', 'term', 'tesa', 'tese', 'tesol', 'tess', 'tet', 'tetra', 'tfr', 'tft', 'tfts', 'tgv', 'thx', 'tim', 'tipogr', 'tir', 'tit', 'tld', 'tm', 'tmc', 'tn', 'to', 'toefl', 'ton', 'top', 'topog', 'tos', 'tosap', 'tosc', 'tp', 'tpl', 'tr', 'trad', 'tramat', 'trasp', 'ts', 'tso', 'tuir', 'tuld', 'tv', 'twa', 'twain', 'u.ad', 'u.s', 'ucai', 'ucca', 'ucei', 'ucina', 'uclaf', 'ucoi', 'ucoii', 'ucsi', 'ud', 'udc', 'udi', 'udp', 'ue', 'uefa', 'uemri', 'ufo', 'ugc', 'uhci', 'uhf', 'uht', 'uibm', 'uic', 'uicc', 'uiga', 'uil', 'uilps', 'uisp', 'uits', 'uk', 'ul', 'ull', 'uma', 'umb', 'ummc', 'umss', 'umts', 'unac', 'unar', 'unasp', 'uncem', 'unctad', 'undp', 'unefa', 'unep', 'unesco', 'ungh', 'unhcr', 'uni', 'unicef', 'unitec', 'unpredep', 'unsa', 'upa', 'upc', 'urar', 'urban', 'url', 'urp', 'urss', 'usa', 'usb', 'usfi', 'usga', 'usl', 'usp', 'uspi', 'ussr', 'utap', 'v', 'v.brig', 'v.cte', 'v.m', 'v.p', 'v.r', 'v.s', 'va', 'vab', 'vaio', 'val', 'vas', 'vb', 'vbr', 'vc', 'vcc', 'vcr', 'vda', 've', 'ven', 'ves', 'vesa', 'veter', 'vezz', 'vfb', 'vfp', 'vfx', 'vga', 'vhf', 'vhs', 'vi', 'via', 'vip', 'vis', 'vn', 'vo', 'voc', 'voip', 'vol', 'volg', 'voll', 'vor', 'vpdn', 'vpn', 'vr', 'vs', 'vsp', 'vt', 'vtc', 'vts', 'vtt', 'vv', 'vvf', 'wai', 'wais', 'wan', 'wap', 'wasp', 'wc', 'wcdma', 'wcm', 'wga', 'wi-fi', 'wipo', 'wisp', 'wll', 'wml', 'wms', 'worm', 'wp', 'wpan', 'wssn', 'wto', 'wwan', 'wwf', 'www', 'wygiwys', 'xl', 'xml', 'xs', 'xxl', 'xxs', 'yaf', 'yb', 'yci', 'yd', 'yd²', 'yd³', 'ymca', 'zat', 'zb', 'zcs', 'zdf', 'zdg', 'zift', 'zool', 'zoot', 'ztc', 'ztl', '°c', '°f', '°n', '°ra', '°ré', 'µg']
14
17
  PREPOSITIVE_ABBREVIATIONS = ['a.c', 'acc', 'adj', 'adm', 'adv', 'all', 'amn', 'arch', 'asst', 'avv', 'banc', 'bart', 'bcc', 'bldg', 'brig', 'bros', 'c.a', 'c.a.p', 'c.c.p', 'c.m', 'c.p', 'c.p', 'c.s', 'c.v', 'capt', 'cc', 'cmdr', 'co', 'col', 'comdr', 'con', 'corp', 'corr', 'cpl', 'dir', 'dott', 'dott', 'dr', 'dr', 'drs', 'e.p.c', 'ecc', 'egr', 'ens', 'es', 'fatt', 'gen', 'geom', 'gg', 'gov', 'hon', 'hosp', 'hr', 'id', 'ing', 'insp', 'int', "l'avv", "l'ing", 'lett', 'lt', 'maj', 'messrs', 'mlle', 'mm', 'mme', 'mo', 'mons', 'mr', 'mr', 'mrs', 'mrs', 'ms', 'ms', 'msgr', 'n.b', 'ogg', 'on', 'op', 'ord', 'p.c', 'p.c.c', 'p.es', 'p.f', 'p.r', 'p.s', 'p.t', 'p.v', 'pfc', 'ph', 'post', 'pp', 'prof', 'psicol', 'pvt', 'racc', 'rag', 'rep', 'reps', 'res', 'rev', 'ric', 'rif', 'rp', 'rsvp', 'rt', 's.a', 's.b.f', 's.n.c', 's.p.a', 's.p.m', 's.r.l', 'seg', 'sen', 'sens', 'sfc', 'sgg', 'sgt', 'sig', 'sigg', 'soc', 'spett', 'sr', 'ss', 'st', 'supt', 'surg', 'tel', 'u.s', 'v.p', 'v.r', 'v.s']
@@ -13,6 +13,9 @@ module PragmaticSegmenter
13
13
  end
14
14
  end
15
15
 
16
+ class Cleaner < PragmaticSegmenter::Cleaner
17
+ end
18
+
16
19
  class SentenceBoundaryPunctuation < PragmaticSegmenter::SentenceBoundaryPunctuation
17
20
  SENTENCE_BOUNDARY = /.*?[:\.!\?؟]|.*?\z|.*?$/
18
21
 
@@ -9,6 +9,9 @@ module PragmaticSegmenter
9
9
  end
10
10
  end
11
11
 
12
+ class Cleaner < PragmaticSegmenter::Cleaner
13
+ end
14
+
12
15
  class Abbreviation < PragmaticSegmenter::Abbreviation
13
16
  ABBREVIATIONS = ['а', 'авт', 'адм.-терр', 'акад', 'в', 'вв', 'вкз', 'вост.-европ', 'г', 'гг', 'гос', 'гр', 'д', 'деп', 'дисс', 'дол', 'долл', 'ежедн', 'ж', 'жен', 'з', 'зап', 'зап.-европ', 'заруб', 'и', 'И', 'и', 'ин', 'иностр', 'инст', 'к', 'кв', 'К', 'Кв', 'куб', 'канд', 'кг', 'л', 'м', 'мин', 'моск', 'муж', 'нед', 'о', 'о', 'О', 'о', 'п', 'пер', 'пп', 'пр', 'просп', 'р', 'руб', 'с', 'сек', 'см', 'СПб', 'стр', 'т', 'т', 'тел', 'тов', 'тт', 'тыс', 'ул', 'у.е', 'y.e', 'у', 'y', 'Ф', 'ф', 'ч', 'пгт', 'проф', 'л.h', 'Л.Н', 'Н']
14
17
 
@@ -9,6 +9,9 @@ module PragmaticSegmenter
9
9
  end
10
10
  end
11
11
 
12
+ class Cleaner < PragmaticSegmenter::Cleaner
13
+ end
14
+
12
15
  class Abbreviation < PragmaticSegmenter::Abbreviation
13
16
  ABBREVIATIONS = ['a.c', 'a/c', 'abr', 'adj', 'admón', 'afmo', 'ago', 'almte', 'ap', 'apdo', 'arq', 'art', 'atte', 'av', 'avda', 'bco', 'bibl', 'bs. as', 'c', 'c.f', 'c.g', 'c/c', 'c/u', 'cap', 'cc.aa', 'cdad', 'cm', 'co', 'cra', 'cta', 'cv', 'd.e.p', 'da', 'dcha', 'dcho', 'dep', 'dic', 'dicc', 'dir', 'dn', 'doc', 'dom', 'dpto', 'dr', 'dra', 'dto', 'ee', 'ej', 'en', 'entlo', 'esq', 'etc', 'excmo', 'ext', 'f.c', 'fca', 'fdo', 'febr', 'ff. aa', 'ff.cc', 'fig', 'fil', 'fra', 'g.p', 'g/p', 'gob', 'gr', 'gral', 'grs', 'hnos', 'hs', 'igl', 'iltre', 'imp', 'impr', 'impto', 'incl', 'ing', 'inst', 'izdo', 'izq', 'izqdo', 'j.c', 'jue', 'jul', 'jun', 'kg', 'km', 'lcdo', 'ldo', 'let', 'lic', 'ltd', 'lun', 'mar', 'may', 'mg', 'min', 'mié', 'mm', 'máx', 'mín', 'mt', 'n. del t', 'n.b', 'no', 'nov', 'ntra. sra', 'núm', 'oct', 'p', 'p.a', 'p.d', 'p.ej', 'p.v.p', 'párrf', 'ppal', 'prev', 'prof', 'prov', 'ptas', 'pts', 'pza', 'pág', 'págs', 'párr', 'q.e.g.e', 'q.e.p.d', 'q.e.s.m', 'reg', 'rep', 'rr. hh', 'rte', 's', 's. a', 's.a.r', 's.e', 's.l', 's.r.c', 's.r.l', 's.s.s', 's/n', 'sdad', 'seg', 'sept', 'sig', 'sr', 'sra', 'sres', 'srta', 'sta', 'sto', 'sáb', 't.v.e', 'tamb', 'tel', 'tfno', 'ud', 'uu', 'uds', 'univ', 'v.b', 'v.e', 'vd', 'vds', 'vid', 'vie', 'vol', 'vs', 'vto', 'a', 'aero', 'ambi', 'an', 'anfi', 'ante', 'anti', 'archi', 'arci', 'auto', 'bi', 'bien', 'bis', 'co', 'com', 'con', 'contra', 'crio', 'cuadri', 'cuasi', 'cuatri', 'de', 'deci', 'des', 'di', 'dis', 'dr', 'ecto', 'en', 'endo', 'entre', 'epi', 'equi', 'ex', 'extra', 'geo', 'hemi', 'hetero', 'hiper', 'hipo', 'homo', 'i', 'im', 'in', 'infra', 'inter', 'intra', 'iso', 'lic', 'macro', 'mega', 'micro', 'mini', 'mono', 'multi', 'neo', 'omni', 'para', 'pen', 'pluri', 'poli', 'pos', 'post', 'pre', 'pro', 'pseudo', 're', 'retro', 'semi', 'seudo', 'sobre', 'sub', 'super', 'supra', 'trans', 'tras', 'tri', 'ulter', 'ultra', 'un', 'uni', 'vice', 'yuxta']
14
17
  PREPOSITIVE_ABBREVIATIONS = ['a', 'aero', 'ambi', 'an', 'anfi', 'ante', 'anti', 'archi', 'arci', 'auto', 'bi', 'bien', 'bis', 'co', 'com', 'con', 'contra', 'crio', 'cuadri', 'cuasi', 'cuatri', 'de', 'deci', 'des', 'di', 'dis', 'dr', 'ecto', 'ee', 'en', 'endo', 'entre', 'epi', 'equi', 'ex', 'extra', 'geo', 'hemi', 'hetero', 'hiper', 'hipo', 'homo', 'i', 'im', 'in', 'infra', 'inter', 'intra', 'iso', 'lic', 'macro', 'mega', 'micro', 'mini', 'mono', 'mt', 'multi', 'neo', 'omni', 'para', 'pen', 'pluri', 'poli', 'pos', 'post', 'pre', 'pro', 'pseudo', 're', 'retro', 'semi', 'seudo', 'sobre', 'sub', 'super', 'supra', 'srta', 'trans', 'tras', 'tri', 'ulter', 'ultra', 'un', 'uni', 'vice', 'yuxta']
@@ -13,6 +13,9 @@ module PragmaticSegmenter
13
13
  end
14
14
  end
15
15
 
16
+ class Cleaner < PragmaticSegmenter::Cleaner
17
+ end
18
+
16
19
  class SentenceBoundaryPunctuation < PragmaticSegmenter::SentenceBoundaryPunctuation
17
20
  SENTENCE_BOUNDARY = /.*?[۔؟!\?]|.*?$/
18
21
 
@@ -36,12 +36,16 @@ module PragmaticSegmenter
36
36
  private
37
37
 
38
38
  def split_lines(txt)
39
- segments = []
40
- lines = txt.split("\r")
41
- lines.each do |l|
42
- next if l.eql?('')
43
- analyze_lines(l, segments, punctuation_array)
39
+ segments = txt.split("\r")
40
+
41
+ segments.map! do |line|
42
+ line.apply(SingleNewLineRule, EllipsisRules::All, EmailRule)
44
43
  end
44
+
45
+ segments = segments.map { |line| analyze_lines(line) }.flatten
46
+
47
+ segments.map! {|segment| sub_symbols(segment) }
48
+
45
49
  sentence_array = []
46
50
  segments.each_with_index do |line|
47
51
  next if line.gsub(/_{3,}/, '').length.eql?(0) || line.length < 2
@@ -59,25 +63,16 @@ module PragmaticSegmenter
59
63
  sentence_array.reject(&:empty?)
60
64
  end
61
65
 
62
- def analyze_lines(line, segments, punctuation)
63
- line = line.apply(SingleNewLineRule, EllipsisRules::All, EmailRule)
64
- clause_1 = false
65
- end_punc_check = false
66
- punctuation.each do |p|
67
- end_punc_check = true if line[-1].include?(p)
68
- clause_1 = true if line.include?(p)
69
- end
70
- if clause_1
71
- segments = process_text(line, end_punc_check, segments)
66
+ def analyze_lines(line)
67
+ if punctuation_array.any? { |p| line.include?(p) }
68
+ process_text(line)
72
69
  else
73
- line.gsub!(/ȹ/, "\n")
74
- line.gsub!(/∯/, '.')
75
- segments << line
70
+ line
76
71
  end
77
72
  end
78
73
 
79
- def process_text(line, end_punc_check, segments)
80
- line << 'ȸ' if !end_punc_check
74
+ def process_text(line)
75
+ line << 'ȸ' unless punctuation_array.any? { |p| line[-1].include?(p) }
81
76
  PragmaticSegmenter::ExclamationWords.apply_rules(line)
82
77
  between_punctutation(line)
83
78
  line = line.apply(
@@ -85,10 +80,7 @@ module PragmaticSegmenter
85
80
  QuestionMarkInQuotationRule,
86
81
  ExclamationPointRules::All
87
82
  )
88
- subline = sentence_boundary_punctuation(line)
89
- subline.each_with_index do |s_l|
90
- segments << sub_symbols(s_l)
91
- end
83
+ sentence_boundary_punctuation(line)
92
84
  end
93
85
 
94
86
  def replace_numbers(txt)
@@ -100,7 +92,7 @@ module PragmaticSegmenter
100
92
  end
101
93
 
102
94
  def punctuation_array
103
- PragmaticSegmenter::Punctuation.new.punct
95
+ @punct_arr ||= PragmaticSegmenter::Punctuation.new.punct
104
96
  end
105
97
 
106
98
  def between_punctutation(txt)
@@ -17,65 +17,31 @@ require 'pragmatic_segmenter/languages/italian'
17
17
  require 'pragmatic_segmenter/languages/spanish'
18
18
  require 'pragmatic_segmenter/languages/russian'
19
19
  require 'pragmatic_segmenter/languages/japanese'
20
+ require 'pragmatic_segmenter/languages/common'
21
+ require 'pragmatic_segmenter/language_support'
20
22
  require 'pragmatic_segmenter/rules'
21
23
 
22
24
  module PragmaticSegmenter
23
25
  # This class segments a text into an array of sentences.
24
26
  class Segmenter
25
- include Rules
27
+ include LanguageSupport
26
28
  attr_reader :text, :language, :doc_type
29
+
27
30
  def initialize(text:, **args)
28
- return [] unless text
31
+ return unless text
29
32
  @language = args[:language] || 'en'
30
33
  @doc_type = args[:doc_type]
31
- if args[:clean].eql?(false)
32
- @text = text.dup
33
- else
34
- case @language
35
- when 'en'
36
- @text = PragmaticSegmenter::Languages::English::Cleaner.new(text: text.dup, doc_type: args[:doc_type]).clean
37
- when 'ja'
38
- @text = PragmaticSegmenter::Languages::Japanese::Cleaner.new(text: text.dup, doc_type: args[:doc_type]).clean
39
- else
40
- @text = PragmaticSegmenter::Cleaner.new(text: text.dup, doc_type: args[:doc_type]).clean
41
- end
34
+ @text = text.dup
35
+
36
+ unless args[:clean].eql?(false)
37
+ @text = cleaner_class.new(text: @text, doc_type: args[:doc_type]).clean
42
38
  end
43
39
  end
44
40
 
45
41
  def segment
46
42
  return [] unless text
47
- case language
48
- when 'en'
49
- PragmaticSegmenter::Process.new(text: text, doc_type: doc_type).process
50
- when 'de'
51
- PragmaticSegmenter::Languages::Deutsch::Process.new(text: text, doc_type: doc_type).process
52
- when 'es'
53
- PragmaticSegmenter::Languages::Spanish::Process.new(text: text, doc_type: doc_type).process
54
- when 'it'
55
- PragmaticSegmenter::Languages::Italian::Process.new(text: text, doc_type: doc_type).process
56
- when 'ja'
57
- PragmaticSegmenter::Languages::Japanese::Process.new(text: text, doc_type: doc_type).process
58
- when 'el'
59
- PragmaticSegmenter::Languages::Greek::Process.new(text: text, doc_type: doc_type).process
60
- when 'ru'
61
- PragmaticSegmenter::Languages::Russian::Process.new(text: text, doc_type: doc_type).process
62
- when 'ar'
63
- PragmaticSegmenter::Languages::Arabic::Process.new(text: text, doc_type: doc_type).process
64
- when 'am'
65
- PragmaticSegmenter::Languages::Amharic::Process.new(text: text, doc_type: doc_type).process
66
- when 'hi'
67
- PragmaticSegmenter::Languages::Hindi::Process.new(text: text, doc_type: doc_type).process
68
- when 'hy'
69
- PragmaticSegmenter::Languages::Armenian::Process.new(text: text, doc_type: doc_type).process
70
- when 'fa'
71
- PragmaticSegmenter::Languages::Persian::Process.new(text: text, doc_type: doc_type).process
72
- when 'my'
73
- PragmaticSegmenter::Languages::Burmese::Process.new(text: text, doc_type: doc_type).process
74
- when 'ur'
75
- PragmaticSegmenter::Languages::Urdu::Process.new(text: text, doc_type: doc_type).process
76
- else
77
- PragmaticSegmenter::Process.new(text: text, doc_type: doc_type).process
78
- end
43
+
44
+ process_class.new(text: text, doc_type: doc_type).process
79
45
  end
80
46
  end
81
47
  end
@@ -1,3 +1,3 @@
1
1
  module PragmaticSegmenter
2
- VERSION = "0.0.3"
2
+ VERSION = "0.0.4"
3
3
  end
@@ -863,6 +863,11 @@ RSpec.describe PragmaticSegmenter::Segmenter do
863
863
  ps = PragmaticSegmenter::Segmenter.new(text: "She turned to him, “This is great.” She held the book out to show him.", language: 'en')
864
864
  expect(ps.segment).to eq(["She turned to him, “This is great.”", "She held the book out to show him."])
865
865
  end
866
+
867
+ it 'correctly segments text #081' do
868
+ ps = PragmaticSegmenter::Segmenter.new(text: "////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////Header starts here\r////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////", language: 'en')
869
+ expect(ps.segment).to eq(["Header starts here"])
870
+ end
866
871
  end
867
872
  end
868
873
 
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: pragmatic_segmenter
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.3
4
+ version: 0.0.4
5
5
  platform: ruby
6
6
  authors:
7
7
  - Kevin S. Dias
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2015-01-07 00:00:00.000000000 Z
11
+ date: 2015-01-10 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -76,10 +76,12 @@ files:
76
76
  - lib/pragmatic_segmenter/cleaner.rb
77
77
  - lib/pragmatic_segmenter/ellipsis.rb
78
78
  - lib/pragmatic_segmenter/exclamation_words.rb
79
+ - lib/pragmatic_segmenter/language_support.rb
79
80
  - lib/pragmatic_segmenter/languages/amharic.rb
80
81
  - lib/pragmatic_segmenter/languages/arabic.rb
81
82
  - lib/pragmatic_segmenter/languages/armenian.rb
82
83
  - lib/pragmatic_segmenter/languages/burmese.rb
84
+ - lib/pragmatic_segmenter/languages/common.rb
83
85
  - lib/pragmatic_segmenter/languages/deutsch.rb
84
86
  - lib/pragmatic_segmenter/languages/english.rb
85
87
  - lib/pragmatic_segmenter/languages/french.rb