pragmatic_segmenter 0.3.16 → 0.3.17

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 394f3564183ae4fc92fd975bf324bea9e6100daa
- data.tar.gz: 3f1b25b6a29186fe9bbf7dca3a042c70787d516e
+ metadata.gz: baf0f9cb38a40398530e15df0d28ab8a654c49c6
+ data.tar.gz: 4a184ffa70092a3f99fe7d194a71e7a935b1d3b2
  SHA512:
- metadata.gz: 6e43431bc0371d590e23cb40068547f6d3131e8f54053cbde84d9a5058fc6e7f57f52e4a2b9ff8ecac69fae66b8649822d5b05c2326adc99647546b9eb1339aa
- data.tar.gz: 08ea92f8601d915c467df2112e5c49948bfcdf7c4fb80f7fd5d2a34cba2232f95906681cbd1a1310cf64679273c14e3afca5117f22ae272b21826474a66c17e5
+ metadata.gz: 089a40744464256d33ce30b4c6d40a6f00bcd5dd0178757a3d65b3f5ff18242751e5074902bd0aab0460e97e5db0040a30f206648e98c14ec861abf0ba92680c
+ data.tar.gz: e49bdbe345e2d2e394d2f4ef03eac2ca0488f3dac158dfd5c193a328368326963785bf643a565d599d0e4d49196e3480d7f8054ec1df4927c192958bb1e1de47
data/NEWS CHANGED
@@ -1,3 +1,7 @@
+ 0.3.17 (2017-12-07):
+
+ * Bug Fix: Regex for parsing HTML
+
  0.3.16 (2017-11-13):

  * Improvement: Support for Danish
data/README.md CHANGED
@@ -1,27 +1,27 @@
- # Pragmatic Segmenter
+ # Pragmatic Segmenter

  [![Gem Version](https://badge.fury.io/rb/pragmatic_segmenter.svg)](http://badge.fury.io/rb/pragmatic_segmenter) [![Code Climate](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/gpa.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![Build Status](https://travis-ci.org/diasks2/pragmatic_segmenter.png)](https://travis-ci.org/diasks2/pragmatic_segmenter) [![Test Coverage](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/coverage.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/diasks2/pragmatic_segmenter/blob/master/LICENSE.txt)

- Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
+ Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.

- ## Install
+ ## Install

- **Ruby**
- *Supports Ruby 2.1.5 and above*
+ **Ruby**
+ *Supports Ruby 2.1.5 and above*
  ```
  gem install pragmatic_segmenter
  ```

- **Ruby on Rails**
- Add this line to your application’s Gemfile:
- ```ruby
+ **Ruby on Rails**
+ Add this line to your application’s Gemfile:
+ ```ruby
  gem 'pragmatic_segmenter'
  ```

- ## Usage
+ ## Usage

- * If no language is specified, the library will default to English.
- * To specify a language use its two character [ISO 639-1 code](https://www.tm-town.com/languages).
+ * If no language is specified, the library will default to English.
+ * To specify a language use its two character [ISO 639-1 code](https://www.tm-town.com/languages).

  ```ruby
  text = "Hello world. My name is Mr. Smith. I work for the U.S. Government and I live in the U.S. I live in New York."
@@ -64,11 +64,11 @@ According to Wikipedia, [sentence boundary disambiguation](http://en.wikipedia.o

  > Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address – not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang. Languages like Japanese and Chinese have unambiguous sentence-ending markers.

- The goal of **Pragmatic Segmenter** is to provide a "real-world" segmenter that works out of the box across many languages and does a reasonable job when the format and domain of the input text are unknown. Pragmatic Segmenter does not use any machine-learning techniques and thus does not require training data.
+ The goal of **Pragmatic Segmenter** is to provide a "real-world" segmenter that works out of the box across many languages and does a reasonable job when the format and domain of the input text are unknown. Pragmatic Segmenter does not use any machine-learning techniques and thus does not require training data.

- Pragmatic Segmenter aims to improve on other segmentation engines in 2 main areas:
- 1) Language support (most segmentation tools only focus on English)
- 2) Text cleaning and preprocessing
+ Pragmatic Segmenter aims to improve on other segmentation engines in 2 main areas:
+ 1) Language support (most segmentation tools only focus on English)
+ 2) Text cleaning and preprocessing

  Pragmatic Segmenter is opinionated and made for the explicit purpose of segmenting texts to create translation memories. Therefore, things such as parenthesis within a sentence are kept as one segment, even if technically there are two or more sentences within the segment in order to maintain coherence. The algorithm is also conservative in that if it comes across an ambiguous sentence boundary it will ignore it rather than splitting.

@@ -85,11 +85,11 @@ Pragmatic Segmenter is specifically used for the purpose of segmenting texts for

  ## The Golden Rules

- *The Golden Rules* are a set of tests I developed that can be run through a segmenter to check its accuracy in regards to edge case scenarios. Most of the papers cited below in *Segmentation Papers and Books* either use the WSJ corpus or Brown corpus from the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) to test their segmentation algorithm. In my opinion there are 2 limits to using these corpora:
- 1) The corpora may be too expensive for some people ($1,700).
- 2) The majority of the sentences in the corpora are sentences that end with a regular word followed by a period, thus testing the same thing over and over again.
+ *The Golden Rules* are a set of tests I developed that can be run through a segmenter to check its accuracy in regards to edge case scenarios. Most of the papers cited below in *Segmentation Papers and Books* either use the WSJ corpus or Brown corpus from the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) to test their segmentation algorithm. In my opinion there are 2 limits to using these corpora:
+ 1) The corpora may be too expensive for some people ($1,700).
+ 2) The majority of the sentences in the corpora are sentences that end with a regular word followed by a period, thus testing the same thing over and over again.

- > In the Brown Corpus 92% of potential sentence boundaries come after a regular word. The WSJ Corpus is richer with abbreviations and only 83% [53% according to Gale and Church, 1991] of sentences end with a regular word followed by a period.
+ > In the Brown Corpus 92% of potential sentence boundaries come after a regular word. The WSJ Corpus is richer with abbreviations and only 83% [53% according to Gale and Church, 1991] of sentences end with a regular word followed by a period.

  Andrei Mikheev - *Periods, Capitalized Words, etc.*

@@ -664,7 +664,7 @@ Pragmatic Segmenter | Ruby
  [SRX English](https://github.com/apohllo/srx-english) | Ruby | [GNU GPLv3](http://www.gnu.org/copyleft/gpl.html) | 30.77% | 28.57% | 6.19 s
  [Scapel](https://github.com/louismullie/scalpel) | Ruby | [GNU GPLv3](http://www.gnu.org/copyleft/gpl.html) | 28.85% | 20.00% | 0.13 s

- †GRS (Other Languages) is the total of the Golden Rules listed above for all languages other than English. This metric by no means includes all languages, only the ones that have Golden Rules listed above.
+ †GRS (Other Languages) is the total of the Golden Rules listed above for all languages other than English. This metric by no means includes all languages, only the ones that have Golden Rules listed above.
  ‡ Speed is based on the performance benchmark results detailed in the section "Speed Performance Benchmarks" below. The number is an average of 10 runs.

  Other tools not yet tested:
@@ -729,140 +729,143 @@ To test the relative performance of different segmentation tools and libraries I
  * Add additional language support
  * Add abbreviation lists for any languages that do not currently have one (only relevant for languages that have the concept of abbreviations with periods)
  * Get Golden Rule #18 passing - Handling of a.m. or p.m. followed by a capitalized non sentence starter (ex. "At 5 p.m. Mr. Smith went to the bank. He left the bank at 6 p.m. Next he went to the store." --> ["At 5 p.m. Mr. Smith went to the bank.", "He left the bank at 6 p.m.", "Next he went to the store."])
- * Support for Thai. This is a very challenging problem due to the absence of explicit sentence markers (i.e. like a period in English) and the ambiguity in Thai regarding what constitutes a sentence even among native speakers. For more information see the following research papers ([#1](http://www.cs.cmu.edu/~paisarn/papers/iccpol2001.pdf) | [#2](http://pioneer.chula.ac.th/~awirote/ling/snlp2007-wirote.pdf)).
+ * Support for Thai. This is a very challenging problem due to the absence of explicit sentence markers (i.e. like a period in English) and the ambiguity in Thai regarding what constitutes a sentence even among native speakers. For more information see the following research papers ([#1](http://www.cs.cmu.edu/~paisarn/papers/iccpol2001.pdf) | [#2](http://pioneer.chula.ac.th/~awirote/ling/snlp2007-wirote.pdf)).

  ## Change Log

- **Version 0.0.1**
- * Initial Release
+ **Version 0.0.1**
+ * Initial Release

- **Version 0.0.2**
- * Major design refactor
+ **Version 0.0.2**
+ * Major design refactor

  **Version 0.0.3**
- * Add travis.yml
- * Add Code Climate
- * Update README
+ * Add travis.yml
+ * Add Code Climate
+ * Update README

- **Version 0.0.4**
- * Add `ConsecutiveForwardSlashRule` to cleaner
- * Refactor `segmenter.rb` and `process.rb`
+ **Version 0.0.4**
+ * Add `ConsecutiveForwardSlashRule` to cleaner
+ * Refactor `segmenter.rb` and `process.rb`

- **Version 0.0.5**
- * Make symbol substitution safer
- * Refactor `process.rb`
- * Update cleaner with escaped newline rules
+ **Version 0.0.5**
+ * Make symbol substitution safer
+ * Refactor `process.rb`
+ * Update cleaner with escaped newline rules

- **Version 0.0.6**
- * Add rule for escaped newlines that include a space between the slash and character
- * Add Golden Rule #52 and code to make it pass
+ **Version 0.0.6**
+ * Add rule for escaped newlines that include a space between the slash and character
+ * Add Golden Rule #52 and code to make it pass

- **Version 0.0.7**
- * Add change log to README
- * Add passing spec for new end of sentence abbreviation (EN)
- * Add roman numeral list support
+ **Version 0.0.7**
+ * Add change log to README
+ * Add passing spec for new end of sentence abbreviation (EN)
+ * Add roman numeral list support

- **Version 0.0.8**
- * Fix error in `list.rb`
+ **Version 0.0.8**
+ * Fix error in `list.rb`

- **Version 0.0.9**
- * Improve handling of alphabetical and roman numeral lists
+ **Version 0.0.9**
+ * Improve handling of alphabetical and roman numeral lists

- **Version 0.1.0**
- * Add Kommanditgesellschaft Rule
+ **Version 0.1.0**
+ * Add Kommanditgesellschaft Rule

- **Version 0.1.1**
- * Fix handling of German dates
+ **Version 0.1.1**
+ * Fix handling of German dates

- **Version 0.1.2**
- * Fix missing abbreviations
- * Add footnote rule to `cleaner.rb`
+ **Version 0.1.2**
+ * Fix missing abbreviations
+ * Add footnote rule to `cleaner.rb`

- **Version 0.1.3**
- * Improve punctuation in bracket replacement
+ **Version 0.1.3**
+ * Improve punctuation in bracket replacement

- **Version 0.1.4**
- * Fix missing abbreviations
+ **Version 0.1.4**
+ * Fix missing abbreviations

- **Version 0.1.5**
- * Fix comma at end of quotation bug
+ **Version 0.1.5**
+ * Fix comma at end of quotation bug

- **Version 0.1.6**
- * Fix bug in numbered list finder (ignore longer digits)
+ **Version 0.1.6**
+ * Fix bug in numbered list finder (ignore longer digits)

- **Version 0.1.7**
- * Add Alice in Wonderland specs
- * Fix parenthesis between double quotations bug
- * Fix split after quotation ending in dash bug
+ **Version 0.1.7**
+ * Add Alice in Wonderland specs
+ * Fix parenthesis between double quotations bug
+ * Fix split after quotation ending in dash bug

- **Version 0.1.8**
- * Fix bug in splitting new sentence after single quotes
+ **Version 0.1.8**
+ * Fix bug in splitting new sentence after single quotes

- **Version 0.2.0**
- * Add Dutch Golden Rules and abbreviations
- * Update README with additional tools
- * Update segmentation test scores in README with results of new Golden Rule tests
- * Add Polish abbreviations
+ **Version 0.2.0**
+ * Add Dutch Golden Rules and abbreviations
+ * Update README with additional tools
+ * Update segmentation test scores in README with results of new Golden Rule tests
+ * Add Polish abbreviations

- **Version 0.3.0**
- * Add support for square brackets
- * Add support for continuous exclamation points or questions marks or combinations of both
- * Fix Roman numeral support
- * Add English abbreviations
+ **Version 0.3.0**
+ * Add support for square brackets
+ * Add support for continuous exclamation points or questions marks or combinations of both
+ * Fix Roman numeral support
+ * Add English abbreviations

- **Version 0.3.1**
- * Fix undefined method 'gsub!' for nil:NilClass issue
+ **Version 0.3.1**
+ * Fix undefined method 'gsub!' for nil:NilClass issue

- **Version 0.3.2**
- * Add English abbreviations
+ **Version 0.3.2**
+ * Add English abbreviations

- **Version 0.3.3**
- * Fix cleaner bug
+ **Version 0.3.3**
+ * Fix cleaner bug

- **Version 0.3.4**
- * Large refactor
+ **Version 0.3.4**
+ * Large refactor

- **Version 0.3.5**
- * Reduce GC by replacing `#gsub` with `#gsub!` where possible
+ **Version 0.3.5**
+ * Reduce GC by replacing `#gsub` with `#gsub!` where possible

- **Version 0.3.6**
- * Refactor SENTENCE_STARTERS to each individual language and add SENTENCE_STARTERS for German
+ **Version 0.3.6**
+ * Refactor SENTENCE_STARTERS to each individual language and add SENTENCE_STARTERS for German

- **Version 0.3.7**
- * Add `unicode` gem and use it for downcasing to better handle cyrillic languages
+ **Version 0.3.7**
+ * Add `unicode` gem and use it for downcasing to better handle cyrillic languages

- **Version 0.3.8**
- * Fix bug that cleaned away single letter segments
+ **Version 0.3.8**
+ * Fix bug that cleaned away single letter segments

- **Version 0.3.9**
- * Remove `guard-rspec` development dependency
+ **Version 0.3.9**
+ * Remove `guard-rspec` development dependency

- **Version 0.3.10**
- * Change load order of dependencies to fix bug
+ **Version 0.3.10**
+ * Change load order of dependencies to fix bug

- **Version 0.3.11**
+ **Version 0.3.11**
  * Update German abbreviation list
- * Refactor 'remove_newline_in_middle_of_sentence' method
+ * Refactor 'remove_newline_in_middle_of_sentence' method

- **Version 0.3.12**
+ **Version 0.3.12**
  * Fix issue involving words with leading apostrophes

- **Version 0.3.13**
+ **Version 0.3.13**
  * Fix issue involving unexpected sentence break between abbreviation and hyphen

- **Version 0.3.14**
+ **Version 0.3.14**
  * Add English abbreviation Rs. to denote the Indian currency

- **Version 0.3.15**
+ **Version 0.3.15**
  * Handle em dashes that appear in the middle of a sentence and include a sentence ending punctuation mark

- **Version 0.3.16**
+ **Version 0.3.16**
  * Add support and tests for Danish

+ **Version 0.3.17**
+ * Fix issue involving the HTML regex in the cleaner
+
  ## Contributing

  If you find a text that is incorrectly segmented using this gem, please submit an issue.
-
+
  1. Fork it ( https://github.com/diasks2/pragmatic_segmenter/fork )
  2. Create your feature branch (`git checkout -b my-new-feature`)
  3. Commit your changes (`git commit -am 'Add some feature'`)
@@ -64,8 +64,8 @@ module PragmaticSegmenter


  module HTML
- # Rubular: http://rubular.com/r/ENrVFMdJ8v
- HTMLTagRule = Rule.new(/<\/?[^>]*>/, '')
+ # Rubular: http://rubular.com/r/9d0OVOEJWj
+ HTMLTagRule = Rule.new(/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[\^'">\s]+))?)+\s*|\s*)\/?>/, '')

  # Rubular: http://rubular.com/r/XZVqMPJhea
  EscapedHTMLTagRule = Rule.new(/&lt;\/?[^gt;]*gt;/, '')
@@ -1,3 +1,3 @@
  module PragmaticSegmenter
- VERSION = "0.3.16"
+ VERSION = "0.3.17"
  end
@@ -1399,5 +1399,12 @@ RSpec.describe PragmaticSegmenter::Languages::English, "(en)" do
  ps = PragmaticSegmenter::Segmenter.new(text: "What do you see? - Posted like silent sentinels all around the town, stand thousands upon thousands of mortal men fixed in ocean reveries.", clean: false)
  expect(ps.segment).to eq(["What do you see?", "- Posted like silent sentinels all around the town, stand thousands upon thousands of mortal men fixed in ocean reveries."])
  end
+
+ it 'correctly segments text #117' do
+ text = "In placebo-controlled studies of all uses of Tracleer, marked decreases in hemoglobin (>15% decrease from baseline resulting in values <11 g/ dL) were observed in 6% of Tracleer-treated patients and 3% of placebo-treated patients. Bosentan is highly bound (>98%) to plasma proteins, mainly albumin."
+ ps = PragmaticSegmenter::Segmenter.new(text: text)
+ expect(ps.segment).to eq(["In placebo-controlled studies of all uses of Tracleer, marked decreases in hemoglobin (>15% decrease from baseline resulting in values <11 g/ dL) were observed in 6% of Tracleer-treated patients and 3% of placebo-treated patients.", "Bosentan is highly bound (>98%) to plasma proteins, mainly albumin."])
+ end
+
  end
  end
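
The spec labelled `#117` hints at why `HTMLTagRule` was changed: the old pattern accepts any `<` followed later by a `>`, so in that text the span from `<11 g/ dL` up to the `>` in `(>98%` would be removed as if it were a tag, and the sentence boundary would go with it. The sketch below is a standalone illustration of that difference, not the gem's own code path. It copies the two patterns from the diff into hypothetical constants (`OLD_HTML_TAG`, `NEW_HTML_TAG`) and applies them with plain `String#gsub` through a hypothetical helper, `strip_tags`, to a shortened version of the spec text.

```ruby
# Standalone illustration; it does not load pragmatic_segmenter.
# Both patterns are copied verbatim from the HTMLTagRule lines in the diff.
# The constant and method names are hypothetical, chosen for this sketch.
OLD_HTML_TAG = /<\/?[^>]*>/
NEW_HTML_TAG = /<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[\^'">\s]+))?)+\s*|\s*)\/?>/

# Shortened from the spec text: a "<" and a later ">" that are not markup.
PROSE = "values <11 g/ dL) were observed. Bosentan is highly bound (>98%) to plasma proteins."
# Ordinary markup that both patterns are expected to remove.
MARKUP = '<p class="note">Hello <b>world</b>.<br/></p>'

# Remove every match of the given pattern from the text.
def strip_tags(text, pattern)
  text.gsub(pattern, '')
end

puts strip_tags(PROSE, OLD_HTML_TAG)   # values 98%) to plasma proteins.
puts strip_tags(PROSE, NEW_HTML_TAG)   # PROSE is printed unchanged
puts strip_tags(MARKUP, OLD_HTML_TAG)  # Hello world.
puts strip_tags(MARKUP, NEW_HTML_TAG)  # Hello world.
```

With the old pattern the first sentence loses its ending and merges into the second. The new pattern requires a tag name and attribute-like tokens after `<`, so the comparison operators in the prose are left alone while real tags are still stripped.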
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: pragmatic_segmenter
  version: !ruby/object:Gem::Version
- version: 0.3.16
+ version: 0.3.17
  platform: ruby
  authors:
  - Kevin S. Dias
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2017-11-12 00:00:00.000000000 Z
+ date: 2017-12-07 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: unicode
@@ -180,7 +180,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.6.12
+ rubygems_version: 2.6.14
  signing_key:
  specification_version: 4
  summary: A rule-based sentence boundary detection gem that works out-of-the-box across