pragmatic_segmenter 0.3.18 → 0.3.23

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
- SHA1:
3
- metadata.gz: 3bb581e56e988521adc41dbb94bc7281ee7dfa95
4
- data.tar.gz: 0c3b6fe877a5d39d36053b7ed68a860b5d17779b
2
+ SHA256:
3
+ metadata.gz: 2c66c757c1b4bd8d090e88d7db6c627720f58f6f26e6fab9916a20e8bc15471c
4
+ data.tar.gz: da3a9088f72c90ddde6f0deda67d3e3b4ea3bed317970416deef794e0f594d89
5
5
  SHA512:
6
- metadata.gz: 1b1dd64b5a382e8bb7ed5d79fbb9264565d71088f234ba4a9bd7cae47184e1c64f78b32a72f39326aa31936c6c6742aa2ff75cd75cd1b328987a6061a4d2534b
7
- data.tar.gz: c150b178c93b7183300559e89c117cf8f9f93adf6ef33790a3ce0b292c66588fe6461c724210f7f93e078d2d08a10e995d7234bad963f8d5aa1c52c378effe5e
6
+ metadata.gz: 503c52965b2f98eebbc24e1215204c45307958a0279d56834e0c929d18625e81ac8c5c78779efb1a5946b5fdda5d8496b54a72b009ad6b2a597a70c4ba0fff66
7
+ data.tar.gz: f23773139a3a6d9f45cecaacabb363a7fb825a21eb76b40514abf4d0407191ed3b1afa887a5bc5328626abe2dbac5864895add62a1da036234036984d19a3454
data/.gitignore CHANGED
@@ -12,4 +12,5 @@
12
12
  *.o
13
13
  *.a
14
14
  mkmf.log
15
- .DS_Store
15
+ .DS_Store
16
+ .vscode/launch.json
data/NEWS CHANGED
@@ -1,3 +1,27 @@
1
+ 0.3.23 (2021-05-03):
2
+
3
+ * Improvement: Refactor for Ruby 3.0 compatibility
4
+
5
+ 0.3.22 (2018-09-23):
6
+
7
+ * Improvement: Initial support for Kazakh
8
+
9
+ 0.3.21 (2018-08-30):
10
+
11
+ * Improvement: Add support for file formats
12
+ * Improvement: Add support for numeric references at the end of a sentence (i.e. Wikipedia references)
13
+
14
+ 0.3.20 (2018-08-28):
15
+
16
+ * Improvement: Handle slanted single quotation as a single quote
17
+ * Bug Fix: The text contains a single character abbreviation as part of a list
18
+ * Bug Fix: Chinese book quotes
19
+ * Improvement: Add viz as abbreviation
20
+
21
+ 0.3.19 (2018-07-19):
22
+
23
+ * Bug Fix: A parenthetical following an abbreviation is now included as part of the same segment. Example: "The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. (“Company B”)." is now treated as one segment.
24
+
1
25
  0.3.18 (2018-03-27):
2
26
 
3
27
  * Improvement: Performance optimizations
data/README.md CHANGED
@@ -1,6 +1,6 @@
1
1
  # Pragmatic Segmenter
2
2
 
3
- [![Gem Version](https://badge.fury.io/rb/pragmatic_segmenter.svg)](http://badge.fury.io/rb/pragmatic_segmenter) [![Code Climate](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/gpa.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![Build Status](https://travis-ci.org/diasks2/pragmatic_segmenter.png)](https://travis-ci.org/diasks2/pragmatic_segmenter) [![Test Coverage](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/coverage.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/diasks2/pragmatic_segmenter/blob/master/LICENSE.txt)
3
+ [![Gem Version](https://badge.fury.io/rb/pragmatic_segmenter.svg)](http://badge.fury.io/rb/pragmatic_segmenter) [![Build Status](https://travis-ci.org/diasks2/pragmatic_segmenter.png)](https://travis-ci.org/diasks2/pragmatic_segmenter) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/diasks2/pragmatic_segmenter/blob/master/LICENSE.txt)
4
4
 
5
5
  Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
6
6
 
@@ -433,6 +433,12 @@ Was sind die Konsequenzen der Abstimmung vom 12. Juni?
433
433
  => ["Was sind die Konsequenzen der Abstimmung vom 12. Juni?"]
434
434
  ```
435
435
 
436
+ 4.) **Cardinal numbers at end of sentence** *Credit: Dr. Michael Ustaszewski*
437
+ ```
438
+ Die Information steht auf Seite 12. Dort kannst du nachlesen.
439
+ => ["Die Information steht auf Seite 12.", "Dort kannst du nachlesen."]
440
+ ```
441
+
436
442
  #### Golden Rules (Japanese)
437
443
 
438
444
  1.) **Simple period to end sentence**
@@ -865,6 +871,25 @@ To test the relative performance of different segmentation tools and libraries I
865
871
  **Version 0.3.18**
866
872
  * Performance optimizations
867
873
 
874
+ **Version 0.3.19**
875
+ * Treat a parenthetical following an abbreviation as part of the same segment
876
+
877
+ **Version 0.3.20**
878
+ * Handle slanted single quotation as a single quote
879
+ * Handle a single character abbreviation as part of a list
880
+ * Add support for Chinese caret brackets
881
+ * Add viz as abbreviation
882
+
883
+ **Version 0.3.21**
884
+ * Add support for file formats
885
+ * Add support for numeric references at the end of a sentence (i.e. Wikipedia references)
886
+
887
+ **Version 0.3.22**
888
+ * Add initial support and tests for Kazakh
889
+
890
+ **Version 0.3.23**
891
+ * Refactor for Ruby 3.0 compatibility
892
+
868
893
  ## Contributing
869
894
 
870
895
  If you find a text that is incorrectly segmented using this gem, please submit an issue.
@@ -875,6 +900,11 @@ If you find a text that is incorrectly segmented using this gem, please submit a
875
900
  4. Push to the branch (`git push origin my-new-feature`)
876
901
  5. Create a new Pull Request
877
902
 
903
+ ## Ports
904
+
905
+ * [C# - PragmaticSegmenterNet](https://github.com/UglyToad/PragmaticSegmenterNet)
906
+ * [Python - pySBD](https://github.com/nipunsadvilkar/pySBD)
907
+
878
908
  ## License
879
909
 
880
910
  The MIT License (MIT)
@@ -10,18 +10,19 @@ module PragmaticSegmenter
10
10
 
11
11
  attr_reader :text
12
12
  def initialize(text:, language: )
13
- @text = Text.new(text)
13
+ @text = text.dup
14
14
  @language = language
15
15
  end
16
16
 
17
17
  def replace
18
- @text.apply(@language::PossessiveAbbreviationRule,
18
+ Rule.apply(@text,
19
+ @language::PossessiveAbbreviationRule,
19
20
  @language::KommanditgesellschaftRule,
20
21
  @language::SingleLetterAbbreviationRules::All)
21
22
 
22
23
  @text = search_for_abbreviations_in_string(@text)
23
24
  @text = replace_multi_period_abbreviations(@text)
24
- @text.apply(@language::AmPmRules::All)
25
+ Rule.apply(@text, @language::AmPmRules::All)
25
26
  replace_abbreviation_as_sentence_boundary(@text)
26
27
  end
27
28
 
@@ -108,7 +109,7 @@ module PragmaticSegmenter
108
109
  end
109
110
 
110
111
  def replace_period_of_abbr(txt, abbr)
111
- txt.gsub!(/(?<=\s#{abbr.strip})\.(?=((\.|\:|-|\?)|(\s([a-z]|I\s|I'm|I'll|\d))))|(?<=^#{abbr.strip})\.(?=((\.|\:|\?)|(\s([a-z]|I\s|I'm|I'll|\d))))/, '∯')
112
+ txt.gsub!(/(?<=\s#{abbr.strip})\.(?=((\.|\:|-|\?)|(\s([a-z]|I\s|I'm|I'll|\d|\())))|(?<=^#{abbr.strip})\.(?=((\.|\:|\?)|(\s([a-z]|I\s|I'm|I'll|\d))))/, '∯')
112
113
  txt.gsub!(/(?<=\s#{abbr.strip})\.(?=,)|(?<=^#{abbr.strip})\.(?=,)/, '∯')
113
114
  txt
114
115
  end
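The only change in this hunk is the extra `\(` branch in the first lookahead, which matches the 0.3.19 note about parentheticals following an abbreviation. A minimal sketch of the effect, assuming two hypothetical helpers (`protect_period_before` and `protect_period_after` are not part of the gem) that apply only the whitespace-preceded alternative of each regex, with `Regexp.escape` added for safety:

```ruby
# Hypothetical helpers for illustration only; they are not the gem's API.
# Each one applies just the first (whitespace-preceded) alternative of the regex.

# Lookahead as it appears on the removed (0.3.18) line.
def protect_period_before(txt, abbr)
  txt.gsub(/(?<=\s#{Regexp.escape(abbr)})\.(?=((\.|\:|-|\?)|(\s([a-z]|I\s|I'm|I'll|\d))))/, '∯')
end

# Lookahead as it appears on the added line, with the extra \( branch.
def protect_period_after(txt, abbr)
  txt.gsub(/(?<=\s#{Regexp.escape(abbr)})\.(?=((\.|\:|-|\?)|(\s([a-z]|I\s|I'm|I'll|\d|\())))/, '∯')
end

sample = 'The parties are Acme Inc. (the Company) and others.'
old_result = protect_period_before(sample, 'Inc')
new_result = protect_period_after(sample, 'Inc')

puts old_result # period after "Inc" is left alone, so it can still end a segment
puts new_result # period after "Inc" becomes the placeholder '∯'
```

Note that the diff adds `\(` only to the whitespace-preceded alternative; the line-start alternative (`(?<=^...)`) is unchanged.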
@@ -8,6 +8,8 @@ module PragmaticSegmenter
8
8
  # Rubular: http://rubular.com/r/2YFrKWQUYi
9
9
  BETWEEN_SINGLE_QUOTES_REGEX = /(?<=\s)'(?:[^']|'[a-zA-Z])*'/
10
10
 
11
+ BETWEEN_SINGLE_QUOTE_SLANTED_REGEX = /(?<=\s)‘(?:[^’]|’[a-zA-Z])*’/
12
+
11
13
  # Rubular: http://rubular.com/r/3Pw1QlXOjd
12
14
  BETWEEN_DOUBLE_QUOTES_REGEX = /"(?>[^"\\]+|\\{2}|\\.)*"/
13
15
 
@@ -42,6 +44,7 @@ module PragmaticSegmenter
42
44
 
43
45
  def sub_punctuation_between_quotes_and_parens(txt)
44
46
  sub_punctuation_between_single_quotes(txt)
47
+ sub_punctuation_between_single_quote_slanted(txt)
45
48
  sub_punctuation_between_double_quotes(txt)
46
49
  sub_punctuation_between_square_brackets(txt)
47
50
  sub_punctuation_between_parens(txt)
@@ -74,6 +77,13 @@ module PragmaticSegmenter
74
77
  end
75
78
  end
76
79
 
80
+ def sub_punctuation_between_single_quote_slanted(txt)
81
+ PragmaticSegmenter::PunctuationReplacer.new(
82
+ matches_array: txt.scan(BETWEEN_SINGLE_QUOTE_SLANTED_REGEX),
83
+ text: txt
84
+ ).replace
85
+ end
86
+
77
87
  def sub_punctuation_between_double_quotes(txt)
78
88
  PragmaticSegmenter::PunctuationReplacer.new(
79
89
  matches_array: btwn_dbl_quote(txt),
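A small sketch of what `BETWEEN_SINGLE_QUOTE_SLANTED_REGEX` captures, using only `String#scan` on a made-up sentence. The `PunctuationReplacer` step is left out; the sample text is an assumption chosen for illustration.

```ruby
# Regex copied from the added line in the hunk above.
BETWEEN_SINGLE_QUOTE_SLANTED_REGEX = /(?<=\s)‘(?:[^’]|’[a-zA-Z])*’/

# The inner ’ in "It’s" is followed by a letter, so it is treated as an
# apostrophe and does not close the quotation.
text = 'She said ‘It’s late. Go home.’ and left.'
quoted_spans = text.scan(BETWEEN_SINGLE_QUOTE_SLANTED_REGEX)

p quoted_spans
```

Because of the `(?<=\s)` lookbehind, an opening ‘ at the very start of the text is not matched by this pattern.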
@@ -11,7 +11,7 @@ module PragmaticSegmenter
11
11
 
12
12
  attr_reader :text, :doc_type
13
13
  def initialize(text:, doc_type: nil, language: Languages::Common)
14
- @text = Text.new(text)
14
+ @text = text.dup
15
15
  @doc_type = doc_type
16
16
  @language = language
17
17
  end
@@ -37,10 +37,10 @@ module PragmaticSegmenter
37
37
  replace_newlines
38
38
  replace_escaped_newlines
39
39
 
40
- @text.apply(HTML::All)
40
+ Rule.apply(@text, HTML::All)
41
41
 
42
42
  replace_punctuation_in_brackets
43
- @text.apply(InlineFormattingRule)
43
+ Rule.apply(@text, InlineFormattingRule)
44
44
  clean_quotations
45
45
  clean_table_of_contents
46
46
  check_for_no_space_in_between_sentences
@@ -72,7 +72,7 @@ module PragmaticSegmenter
72
72
  if word =~ regex
73
73
  unless URL_EMAIL_KEYWORDS.any? { |web| word =~ /#{web}/ }
74
74
  unless abbreviations.any? { |abbr| word =~ /#{abbr}/i }
75
- new_word = word.dup.apply(rule)
75
+ new_word = Rule.apply(word.dup, rule)
76
76
  txt.gsub!(/#{Regexp.escape(word)}/, new_word)
77
77
  end
78
78
  end
@@ -92,45 +92,45 @@ module PragmaticSegmenter
92
92
  end
93
93
 
94
94
  def remove_newline_in_middle_of_word
95
- @text.apply NewLineInMiddleOfWordRule
95
+ Rule.apply @text, NewLineInMiddleOfWordRule
96
96
  end
97
97
 
98
98
  def replace_escaped_newlines
99
- @text.apply EscapedNewLineRule, EscapedCarriageReturnRule,
99
+ Rule.apply @text, EscapedNewLineRule, EscapedCarriageReturnRule,
100
100
  TypoEscapedNewLineRule, TypoEscapedCarriageReturnRule
101
101
  end
102
102
 
103
103
  def replace_double_newlines
104
- @text.apply DoubleNewLineWithSpaceRule, DoubleNewLineRule
104
+ Rule.apply @text, DoubleNewLineWithSpaceRule, DoubleNewLineRule
105
105
  end
106
106
 
107
107
  def replace_newlines
108
108
  if doc_type.eql?('pdf')
109
109
  remove_pdf_line_breaks
110
110
  else
111
- @text.apply NewLineFollowedByPeriodRule,
111
+ Rule.apply @text, NewLineFollowedByPeriodRule,
112
112
  ReplaceNewlineWithCarriageReturnRule
113
113
  end
114
114
  end
115
115
 
116
116
  def remove_pdf_line_breaks
117
- @text.apply NewLineFollowedByBulletRule,
117
+ Rule.apply @text, NewLineFollowedByBulletRule,
118
118
 
119
119
  PDF::NewLineInMiddleOfSentenceRule,
120
120
  PDF::NewLineInMiddleOfSentenceNoSpacesRule
121
121
  end
122
122
 
123
123
  def clean_quotations
124
- @text.apply QuotationsFirstRule, QuotationsSecondRule
124
+ Rule.apply @text, QuotationsFirstRule, QuotationsSecondRule
125
125
  end
126
126
 
127
127
  def clean_table_of_contents
128
- @text.apply TableOfContentsRule, ConsecutivePeriodsRule,
128
+ Rule.apply @text, TableOfContentsRule, ConsecutivePeriodsRule,
129
129
  ConsecutiveForwardSlashRule
130
130
  end
131
131
 
132
132
  def clean_consecutive_characters
133
- @text.apply ConsecutivePeriodsRule, ConsecutiveForwardSlashRule
133
+ Rule.apply @text, ConsecutivePeriodsRule, ConsecutiveForwardSlashRule
134
134
  end
135
135
  end
136
136
  end
@@ -26,6 +26,7 @@ require 'pragmatic_segmenter/languages/polish'
26
26
  require 'pragmatic_segmenter/languages/chinese'
27
27
  require 'pragmatic_segmenter/languages/bulgarian'
28
28
  require 'pragmatic_segmenter/languages/danish'
29
+ require 'pragmatic_segmenter/languages/kazakh'
29
30
 
30
31
  module PragmaticSegmenter
31
32
  module Languages
@@ -49,7 +50,8 @@ module PragmaticSegmenter
49
50
  'nl' => Dutch,
50
51
  'pl' => Polish,
51
52
  'zh' => Chinese,
52
- 'da' => Danish
53
+ 'da' => Danish,
54
+ 'kk' => Kazakh
53
55
  }
54
56
 
55
57
  def self.get_language_by_code(code)
@@ -8,6 +8,32 @@ module PragmaticSegmenter
8
8
  class AbbreviationReplacer < AbbreviationReplacer
9
9
  SENTENCE_STARTERS = [].freeze
10
10
  end
11
+
12
+ class BetweenPunctuation < PragmaticSegmenter::BetweenPunctuation
13
+ BETWEEN_DOUBLE_ANGLE_QUOTATION_MARK_REGEX = /《(?>[^》\\]+|\\{2}|\\.)*》/
14
+ BETWEEN_L_BRACKET_REGEX = /「(?>[^」\\]+|\\{2}|\\.)*」/
15
+ private
16
+
17
+ def sub_punctuation_between_quotes_and_parens(txt)
18
+ super
19
+ sub_punctuation_between_double_angled_quotation_marks(txt)
20
+ sub_punctuation_between_l_bracket(txt)
21
+ end
22
+
23
+ def sub_punctuation_between_double_angled_quotation_marks(txt)
24
+ PunctuationReplacer.new(
25
+ matches_array: txt.scan(BETWEEN_DOUBLE_ANGLE_QUOTATION_MARK_REGEX),
26
+ text: txt
27
+ ).replace
28
+ end
29
+
30
+ def sub_punctuation_between_l_bracket(txt)
31
+ PunctuationReplacer.new(
32
+ matches_array: txt.scan(BETWEEN_L_BRACKET_REGEX),
33
+ text: txt
34
+ ).replace
35
+ end
36
+ end
11
37
  end
12
38
  end
13
39
  end
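The sketch below shows what the new `BETWEEN_DOUBLE_ANGLE_QUOTATION_MARK_REGEX` picks out of the sentence used in the Chinese spec further down in this diff. It only demonstrates the `scan` step; the replacement of punctuation inside the match is done by `PunctuationReplacer` and is not reproduced here.

```ruby
# Regex copied from the added line in the hunk above.
BETWEEN_DOUBLE_ANGLE_QUOTATION_MARK_REGEX = /《(?>[^》\\]+|\\{2}|\\.)*》/

# Sentence taken from the Chinese spec added in this diff.
text = '我们明天一起去看《摔跤吧!爸爸》好吗?好!'
titles = text.scan(BETWEEN_DOUBLE_ANGLE_QUOTATION_MARK_REGEX)

# The exclamation mark inside the book-title marks is part of the match,
# so it can be shielded from sentence splitting.
p titles
```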
@@ -11,7 +11,7 @@ module PragmaticSegmenter
11
11
 
12
12
  # Defines the abbreviations for each language (if available)
13
13
  module Abbreviation
14
- ABBREVIATIONS = Set.new(['adj', 'adm', 'adv', 'al', 'ala', 'alta', 'apr', 'arc', 'ariz', 'ark', 'art', 'assn', 'asst', 'attys', 'aug', 'ave', 'bart', 'bld', 'bldg', 'blvd', 'brig', 'bros', 'btw', 'cal', 'calif', 'capt', 'cl', 'cmdr', 'co', 'col', 'colo', 'comdr', 'con', 'conn', 'corp', 'cpl', 'cres', 'ct', 'd.phil', 'dak', 'dec', 'del', 'dept', 'det', 'dist', 'dr', 'dr.phil', 'dr.philos', 'drs', 'e.g', 'ens', 'esp', 'esq', 'etc', 'exp', 'expy', 'ext', 'feb', 'fed', 'fla', 'ft', 'fwy', 'fy', 'ga', 'gen', 'gov', 'hon', 'hosp', 'hr', 'hway', 'hwy', 'i.e', 'ia', 'id', 'ida', 'ill', 'inc', 'ind', 'ing', 'insp', 'is', 'jan', 'jr', 'jul', 'jun', 'kan', 'kans', 'ken', 'ky', 'la', 'lt', 'ltd', 'maj', 'man', 'mar', 'mass', 'may', 'md', 'me', 'med', 'messrs', 'mex', 'mfg', 'mich', 'min', 'minn', 'miss', 'mlle', 'mm', 'mme', 'mo', 'mont', 'mr', 'mrs', 'ms', 'msgr', 'mssrs', 'mt', 'mtn', 'neb', 'nebr', 'nev', 'no', 'nos', 'nov', 'nr', 'oct', 'ok', 'okla', 'ont', 'op', 'ord', 'ore', 'p', 'pa', 'pd', 'pde', 'penn', 'penna', 'pfc', 'ph', 'ph.d', 'pl', 'plz', 'pp', 'prof', 'pvt', 'que', 'rd', 'rs', 'ref', 'rep', 'reps', 'res', 'rev', 'rt', 'sask', 'sec', 'sen', 'sens', 'sep', 'sept', 'sfc', 'sgt', 'sr', 'st', 'supt', 'surg', 'tce', 'tenn', 'tex', 'univ', 'usafa', 'u.s', 'ut', 'va', 'v', 'ver', 'vs', 'vt', 'wash', 'wis', 'wisc', 'wy', 'wyo', 'yuk']).freeze
14
+ ABBREVIATIONS = Set.new(['adj', 'adm', 'adv', 'al', 'ala', 'alta', 'apr', 'arc', 'ariz', 'ark', 'art', 'assn', 'asst', 'attys', 'aug', 'ave', 'bart', 'bld', 'bldg', 'blvd', 'brig', 'bros', 'btw', 'cal', 'calif', 'capt', 'cl', 'cmdr', 'co', 'col', 'colo', 'comdr', 'con', 'conn', 'corp', 'cpl', 'cres', 'ct', 'd.phil', 'dak', 'dec', 'del', 'dept', 'det', 'dist', 'dr', 'dr.phil', 'dr.philos', 'drs', 'e.g', 'ens', 'esp', 'esq', 'etc', 'exp', 'expy', 'ext', 'feb', 'fed', 'fla', 'ft', 'fwy', 'fy', 'ga', 'gen', 'gov', 'hon', 'hosp', 'hr', 'hway', 'hwy', 'i.e', 'ia', 'id', 'ida', 'ill', 'inc', 'ind', 'ing', 'insp', 'is', 'jan', 'jr', 'jul', 'jun', 'kan', 'kans', 'ken', 'ky', 'la', 'lt', 'ltd', 'maj', 'man', 'mar', 'mass', 'may', 'md', 'me', 'med', 'messrs', 'mex', 'mfg', 'mich', 'min', 'minn', 'miss', 'mlle', 'mm', 'mme', 'mo', 'mont', 'mr', 'mrs', 'ms', 'msgr', 'mssrs', 'mt', 'mtn', 'neb', 'nebr', 'nev', 'no', 'nos', 'nov', 'nr', 'oct', 'ok', 'okla', 'ont', 'op', 'ord', 'ore', 'p', 'pa', 'pd', 'pde', 'penn', 'penna', 'pfc', 'ph', 'ph.d', 'pl', 'plz', 'pp', 'prof', 'pvt', 'que', 'rd', 'rs', 'ref', 'rep', 'reps', 'res', 'rev', 'rt', 'sask', 'sec', 'sen', 'sens', 'sep', 'sept', 'sfc', 'sgt', 'sr', 'st', 'supt', 'surg', 'tce', 'tenn', 'tex', 'univ', 'usafa', 'u.s', 'ut', 'va', 'v', 'ver', 'viz', 'vs', 'vt', 'wash', 'wis', 'wisc', 'wy', 'wyo', 'yuk']).freeze
15
15
  PREPOSITIVE_ABBREVIATIONS = Set.new(['adm', 'attys', 'brig', 'capt', 'cmdr', 'col', 'cpl', 'det', 'dr', 'gen', 'gov', 'ing', 'lt', 'maj', 'mr', 'mrs', 'ms', 'mt', 'messrs', 'mssrs', 'prof', 'ph', 'rep', 'reps', 'rev', 'sen', 'sens', 'sgt', 'st', 'supt', 'v', 'vs']).freeze
16
16
  NUMBER_ABBREVIATIONS = Set.new(['art', 'ext', 'no', 'nos', 'p', 'pp']).freeze
17
17
  end
@@ -24,6 +24,8 @@ module PragmaticSegmenter
24
24
  # Rubular: http://rubular.com/r/G2opjedIm9
25
25
  GeoLocationRule = Rule.new(/(?<=[a-zA-z]°)\.(?=\s*\d+)/, '∯')
26
26
 
27
+ FileFormatRule = Rule.new(/(?<=\s)\.(?=(jpe?g|png|gif|tiff?|pdf|ps|docx?|xlsx?|svg|bmp|tga|exif|odt|html?|txt|rtf|bat|sxw|xml|zip|exe|msi|blend|wmv|mp[34]|pptx?|flac|rb|cpp|cs|js)\s)/, '∯')
28
+
27
29
  SingleNewLineRule = Rule.new(/\n/, 'ȹ')
28
30
 
29
31
  module DoublePunctuationRules
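To illustrate `FileFormatRule`, here is a minimal sketch that applies the same pattern with `String#gsub`. The constant name `FILE_FORMAT_PATTERN` and the sample sentence are assumptions for illustration; the regex itself is copied from the added line.

```ruby
# Pattern copied from FileFormatRule; the constant name is hypothetical.
FILE_FORMAT_PATTERN = /(?<=\s)\.(?=(jpe?g|png|gif|tiff?|pdf|ps|docx?|xlsx?|svg|bmp|tga|exif|odt|html?|txt|rtf|bat|sxw|xml|zip|exe|msi|blend|wmv|mp[34]|pptx?|flac|rb|cpp|cs|js)\s)/

text = 'Please export the chart as a .png file.'

# A dot that follows whitespace and precedes a known extension is swapped
# for the placeholder, so it is not read as a sentence boundary.
protected_text = text.gsub(FILE_FORMAT_PATTERN, '∯')

puts protected_text
```

The pattern requires whitespace before the dot, so a name such as `chart.png` is not affected by this rule.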
@@ -47,6 +47,8 @@ module PragmaticSegmenter
47
47
  # Rubular: http://rubular.com/r/mQ8Es9bxtk
48
48
  CONTINUOUS_PUNCTUATION_REGEX = /(?<=\S)(!|\?){3,}(?=(\s|\z|$))/
49
49
 
50
+ NUMBERED_REFERENCE_REGEX = /(?<=[^\d\s])(\.|∯)((\[(\d{1,3},?\s?-?\s?)*\b\d{1,3}\])+|((\d{1,3}\s?)*\d{1,3}))(\s)(?=[A-Z])/
51
+
50
52
  # Rubular: http://rubular.com/r/yqa4Rit8EY
51
53
  PossessiveAbbreviationRule = Rule.new(/\.(?='s\s)|\.(?='s$)|\.(?='s\z)/, '∯')
52
54
 
@@ -76,10 +78,10 @@ module PragmaticSegmenter
76
78
  # replaces the periods.
77
79
  module SingleLetterAbbreviationRules
78
80
  # Rubular: http://rubular.com/r/e3H6kwnr6H
79
- SingleUpperCaseLetterAtStartOfLineRule = Rule.new(/(?<=^[A-Z])\.(?=\s)/, '∯')
81
+ SingleUpperCaseLetterAtStartOfLineRule = Rule.new(/(?<=^[A-Z])\.(?=,?\s)/, '∯')
80
82
 
81
83
  # Rubular: http://rubular.com/r/gitvf0YWH4
82
- SingleUpperCaseLetterRule = Rule.new(/(?<=\s[A-Z])\.(?=\s)/, '∯')
84
+ SingleUpperCaseLetterRule = Rule.new(/(?<=\s[A-Z])\.(?=,?\s)/, '∯')
83
85
 
84
86
  All = [
85
87
  SingleUpperCaseLetterAtStartOfLineRule,
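The `,?` added to both lookaheads lets a single capital letter followed by a period and a comma count as an abbreviation, which appears to be the list case mentioned in the 0.3.20 notes. A sketch comparing the two versions of `SingleUpperCaseLetterRule`; the constant names and the sample sentence are assumptions:

```ruby
# Patterns copied from the removed and added lines; constant names are hypothetical.
OLD_SINGLE_UPPER = /(?<=\s[A-Z])\.(?=\s)/
NEW_SINGLE_UPPER = /(?<=\s[A-Z])\.(?=,?\s)/

text = 'See items A., B. and C. below.'

old_result = text.gsub(OLD_SINGLE_UPPER, '∯')
new_result = text.gsub(NEW_SINGLE_UPPER, '∯')

puts old_result # the period in "A.," is not protected
puts new_result # all three single-letter periods are protected
```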
@@ -47,7 +47,7 @@ module PragmaticSegmenter
47
47
  private
48
48
 
49
49
  def replace_numbers
50
- @text.apply Numbers::All
50
+ Rule.apply @text, Numbers::All
51
51
 
52
52
  replace_period_in_deutsch_dates
53
53
  end
@@ -68,7 +68,8 @@ module PragmaticSegmenter
68
68
  ).freeze
69
69
 
70
70
  def replace
71
- @text = text.apply(
71
+ @text = Rule.apply(
72
+ text,
72
73
  @language::PossessiveAbbreviationRule,
73
74
  @language::SingleLetterAbbreviationRules::All,
74
75
  SingleLowerCaseLetterRule,
@@ -76,7 +77,7 @@ module PragmaticSegmenter
76
77
 
77
78
  @text = search_for_abbreviations_in_string(@text)
78
79
  @text = replace_multi_period_abbreviations(@text)
79
- @text.apply(Languages::Common::AmPmRules::All)
80
+ Rule.apply(@text, Languages::Common::AmPmRules::All)
80
81
  replace_abbreviation_as_sentence_boundary(@text)
81
82
  end
82
83
 
@@ -17,7 +17,7 @@ module PragmaticSegmenter
17
17
  private
18
18
 
19
19
  def remove_newline_in_middle_of_word
20
- @text.apply NewLineInMiddleOfWordRule
20
+ Rule.apply @text, NewLineInMiddleOfWordRule
21
21
  end
22
22
  end
23
23
 
@@ -0,0 +1,44 @@
1
+ # frozen_string_literal: true
2
+
3
+ module PragmaticSegmenter
4
+ module Languages
5
+ module Kazakh
6
+ include Languages::Common
7
+
8
+ MULTI_PERIOD_ABBREVIATION_REGEX = /\b\p{Cyrillic}(?:\.\s?\p{Cyrillic})+[.]|b[a-z](?:\.[a-z])+[.]/i
9
+
10
+ module Abbreviation
11
+ ABBREVIATIONS = Set.new(['afp', 'anp', 'atp', 'bae', 'bg', 'bp', 'cam', 'cctv', 'cd', 'cez', 'cgi', 'cnpc', 'farc', 'fbi', 'eiti', 'epo', 'er', 'gp', 'gps', 'has', 'hiv', 'hrh', 'http', 'icu', 'idf', 'imd', 'ime', 'icu', 'idf', 'ip', 'iso', 'kaz', 'kpo', 'kpa', 'kz', 'kz', 'mri', 'nasa', 'nba', 'nbc', 'nds', 'ohl', 'omlt', 'ppm', 'pda', 'pkk', 'psm', 'psp', 'raf', 'rss', 'rtl', 'sas', 'sme', 'sms', 'tnt', 'udf', 'uefa', 'usb', 'utc', 'x', 'zdf', 'әқбк', 'әқбк', 'аақ', 'авг.', 'aбб', 'аек', 'ак', 'ақ', 'акцион.', 'акср', 'ақш', 'англ', 'аөсшк', 'апр', 'м.', 'а.', 'р.', 'ғ.', 'апр.', 'аум.', 'ацат', 'әч', 'т. б.', 'б. з. б.', 'б. з. б.', 'б. з. д.', 'б. з. д.', 'биікт.', 'б. т.', 'биол.', 'биохим', 'бө', 'б. э. д.', 'бта', 'бұұ', 'вич', 'всоонл', 'геогр.', 'геол.', 'гленкор', 'гэс', 'қк', 'км', 'г', 'млн', 'млрд', 'т', 'ғ. с.', 'ғ.', 'қ.', 'ғ.', 'дек.', 'днқ', 'дсұ', 'еақк', 'еқыұ', 'ембімұнайгаз', 'ео', 'еуразэқ', 'еуроодақ', 'еұу', 'ж.', 'ж.', 'жж.', 'жоо', 'жіө', 'жсдп', 'жшс', 'іім', 'инта', 'исаф', 'камаз', 'кгб', 'кеу', 'кг', 'км²', 'км²', 'км³', 'км³', 'кимеп', 'кср', 'ксро', 'кокп', 'кхдр', 'қазатомпром', 'қазкср', 'қазұу', 'қазмұнайгаз', 'қазпошта', 'қазтаг', 'қазұу', 'қкп', 'қмдб', 'қр', 'қхр', 'лат.', 'м²', 'м²', 'м³', 'м³', 'магатэ', 'май.', 'максам', 'мб', 'мвт', 'мемл', 'м', 'мсоп', 'мтк', 'мыс.', 'наса', 'нато', 'нквд', 'нояб.', 'обл.', 'огпу', 'окт.', 'оңт.', 'опек', 'оеб', 'өзенмұнайгаз', 'өф', 'пәк', 'пед.', 'ркфср', 'рнқ', 'рсфср', 'рф', 'свс', 'сву', 'сду', 'сес', 'сент.', 'см', 'снпс', 'солт.', 'солт.', 'сооно', 'ссро', 'сср', 'ссср', 'ссс', 'сэс', 'дк', 'т. б.', 'т', 'тв', 'тереңд.', 'тех.', 'тжқ', 'тмд', 'төм.', 'трлн', 'тр', 'т.', 'и.', 'м.', 'с.', 'ш.', 'т.', 'т. с. с.', 'тэц', 'уаз', 'уефа', 'еқыұ', 'ұқк', 'ұқшұ', 'февр.', 'фққ', 'фсб', 'хим.', 'хқко', 'шұар', 'шыұ', 'экон.', 'экспо', 'цтп', 'цас', 'янв.', 'dvd', 'жкт', 'ққс', 'км', 'ацат', 'юнеско', 'ббс', 'mgm', 'жск', 'зоо', 'бсн', 'өұқ', 'оар', 'боак', 'эөкк', 'хтқо', 'әөк', 'жэк', 'хдо', 'спбму', 'аф', 'сбд', 'амт', 'гсдп', 'гсбп', 'эыдұ', 'нұсжп', 'шыұ', 'жтсх', 'хдп', 'эқк', 'фкққ', 'пиқ', 'өгк', 'мбф', 'маж', 'кота', 'тж', 'ук', 'обб', 'сбл', 'жхл', 'кмс', 'бмтрк', 'жққ', 'бхооо', 'мқо', 'ржмб', 'гулаг', 'жко', 'еэы', 'еаэы', 'кхдр', 'рфкп', 'рлдп', 'хвқ', 'мр', 'мт', 'кту', 'ртж', 'тим', 'мемдум', 'ксро', 'т.с.с', 'с.ш.', 'ш.б.', 'б.б.', 'руб', 'мин', 'акад.', 'ғ.', 'мм', 'мм.']).freeze
12
+ PREPOSITIVE_ABBREVIATIONS = [].freeze
13
+ NUMBER_ABBREVIATIONS = [].freeze
14
+ end
15
+
16
+ class Processor < PragmaticSegmenter::Processor
17
+ private
18
+
19
+ # Rubular: http://rubular.com/r/WRWy56Z5zp
20
+ QuestionMarkFollowedByDashLowercaseRule = Rule.new(/(?<=\p{Ll})\?(?=\s*[-—]\s*\p{Ll})/, '&ᓷ&')
21
+ # Rubular: http://rubular.com/r/lixxP7puSa
22
+ ExclamationMarkFollowedByDashLowercaseRule = Rule.new(/(?<=\p{Ll})!(?=\s*[-—]\s*\p{Ll})/, '&ᓴ&')
23
+
24
+ def between_punctuation(txt)
25
+ super(txt)
26
+ Rule.apply(txt, QuestionMarkFollowedByDashLowercaseRule, ExclamationMarkFollowedByDashLowercaseRule)
27
+ end
28
+ end
29
+
30
+ class AbbreviationReplacer < AbbreviationReplacer
31
+ SENTENCE_STARTERS = [].freeze
32
+
33
+ SingleUpperCaseCyrillicLetterAtStartOfLineRule = Rule.new(/(?<=^[А-ЯЁ])\.(?=\s)/, '∯')
34
+ SingleUpperCaseCyrillicLetterRule = Rule.new(/(?<=\s[А-ЯЁ])\.(?=\s)/, '∯')
35
+
36
+ def replace
37
+ super
38
+ Rule.apply(@text, SingleUpperCaseCyrillicLetterAtStartOfLineRule, SingleUpperCaseCyrillicLetterRule)
39
+ end
40
+ end
41
+ end
42
+ end
43
+ end
44
+
@@ -48,7 +48,7 @@ module PragmaticSegmenter
48
48
 
49
49
  attr_reader :text
50
50
  def initialize(text:)
51
- @text = Text.new(text)
51
+ @text = text.dup
52
52
  end
53
53
 
54
54
  def add_line_break
@@ -68,13 +68,13 @@ module PragmaticSegmenter
68
68
  def format_numbered_list_with_parens
69
69
  replace_parens_in_numbered_list
70
70
  add_line_breaks_for_numbered_list_with_parens
71
- @text.apply(ListMarkerRule)
71
+ Rule.apply(@text, ListMarkerRule)
72
72
  end
73
73
 
74
74
  def format_numbered_list_with_periods
75
75
  replace_periods_in_numbered_list
76
76
  add_line_breaks_for_numbered_list_with_periods
77
- @text.apply(SubstituteListPeriodRule)
77
+ Rule.apply(@text, SubstituteListPeriodRule)
78
78
  end
79
79
 
80
80
  def format_alphabetical_lists
@@ -93,7 +93,7 @@ module PragmaticSegmenter
93
93
 
94
94
  def add_line_breaks_for_numbered_list_with_periods
95
95
  if @text.include?('♨') && @text !~ /♨.+\n.+♨|♨.+\r.+♨/ && @text !~ /for\s\d{1,2}♨\s[a-z]/
96
- @text.apply(SpaceBetweenListItemsFirstRule, SpaceBetweenListItemsSecondRule)
96
+ Rule.apply(@text, SpaceBetweenListItemsFirstRule, SpaceBetweenListItemsSecondRule)
97
97
  end
98
98
  end
99
99
 
@@ -105,7 +105,7 @@ module PragmaticSegmenter
105
105
 
106
106
  def add_line_breaks_for_numbered_list_with_parens
107
107
  if @text.include?('☝') && @text !~ /☝.+\n.+☝|☝.+\r.+☝/
108
- @text.apply(SpaceBetweenListItemsThirdRule)
108
+ Rule.apply(@text, SpaceBetweenListItemsThirdRule)
109
109
  end
110
110
  end
111
111
 
@@ -23,8 +23,10 @@ module PragmaticSegmenter
23
23
  replace_abbreviations
24
24
  replace_numbers
25
25
  replace_continuous_punctuation
26
- @text.apply(@language::Abbreviations::WithMultiplePeriodsAndEmailRule)
27
- @text.apply(@language::GeoLocationRule)
26
+ replace_periods_before_numeric_references
27
+ Rule.apply(@text, @language::Abbreviations::WithMultiplePeriodsAndEmailRule)
28
+ Rule.apply(@text, @language::GeoLocationRule)
29
+ Rule.apply(@text, @language::FileFormatRule)
28
30
  split_into_segments
29
31
  end
30
32
 
@@ -32,18 +34,19 @@ module PragmaticSegmenter
32
34
 
33
35
  def split_into_segments
34
36
  check_for_parens_between_quotes(@text).split("\r")
35
- .map! { |segment| segment.apply(@language::SingleNewLineRule, @language::EllipsisRules::All) }
37
+ .map! { |segment| Rule.apply(segment, @language::SingleNewLineRule, @language::EllipsisRules::All) }
36
38
  .map { |segment| check_for_punctuation(segment) }.flatten
37
- .map! { |segment| segment.apply(@language::SubSymbolsRules::All) }
39
+ .map! { |segment| Rule.apply(segment, @language::SubSymbolsRules::All) }
38
40
  .map { |segment| post_process_segments(segment) }
39
41
  .flatten.compact.delete_if(&:empty?)
40
- .map! { |segment| segment.apply(@language::SubSingleQuoteRule) }
42
+ .map! { |segment| Rule.apply(segment, @language::SubSingleQuoteRule) }
41
43
  end
42
44
 
43
45
  def post_process_segments(txt)
44
46
  return txt if txt.length < 2 && txt =~ /\A[a-zA-Z]*\Z/
45
47
  return if consecutive_underscore?(txt) || txt.length < 2
46
- txt.apply(
48
+ Rule.apply(
49
+ txt,
47
50
  @language::ReinsertEllipsisRules::All,
48
51
  @language::ExtraWhiteSpaceRule
49
52
  )
@@ -68,6 +71,10 @@ module PragmaticSegmenter
68
71
  end
69
72
  end
70
73
 
74
+ def replace_periods_before_numeric_references
75
+ @text.gsub!(@language::NUMBERED_REFERENCE_REGEX, "∯\\2\r\\7")
76
+ end
77
+
71
78
  def consecutive_underscore?(txt)
72
79
  # Rubular: http://rubular.com/r/fTF2Ff3WBL
73
80
  txt.gsub(/_{3,}/, '').length.eql?(0)
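A sketch of what `replace_periods_before_numeric_references` does to a Wikipedia-style reference, using the `NUMBERED_REFERENCE_REGEX` and the replacement string from this diff. The sample sentence is made up, and the final `split("\r")` only mimics the first step of `split_into_segments`; later rules in the gem presumably restore the period and trim whitespace.

```ruby
# Regex copied from the added NUMBERED_REFERENCE_REGEX line in this diff.
NUMBERED_REFERENCE_REGEX = /(?<=[^\d\s])(\.|∯)((\[(\d{1,3},?\s?-?\s?)*\b\d{1,3}\])+|((\d{1,3}\s?)*\d{1,3}))(\s)(?=[A-Z])/

text = 'The city was founded in the north.[1] It grew quickly.'

# Group 2 is the reference itself and group 7 is the whitespace after it.
# The period becomes the placeholder and "\r" marks the segment break.
marked = text.gsub(NUMBERED_REFERENCE_REGEX, "∯\\2\r\\7")
pieces = marked.split("\r")

p pieces
```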
@@ -85,7 +92,8 @@ module PragmaticSegmenter
85
92
  txt << 'ȸ' unless @language::Punctuations.any? { |p| txt[-1].include?(p) }
86
93
  ExclamationWords.apply_rules(txt)
87
94
  between_punctuation(txt)
88
- txt = txt.apply(
95
+ txt = Rule.apply(
96
+ txt,
89
97
  @language::DoublePunctuationRules::All,
90
98
  @language::QuestionMarkInQuotationRule,
91
99
  @language::ExclamationPointRules::All
@@ -95,7 +103,7 @@ module PragmaticSegmenter
95
103
  end
96
104
 
97
105
  def replace_numbers
98
- @text.apply @language::Numbers::All
106
+ Rule.apply @text, @language::Numbers::All
99
107
  end
100
108
 
101
109
  def abbreviations_replacer
@@ -123,8 +131,8 @@ module PragmaticSegmenter
123
131
  end
124
132
 
125
133
  def sentence_boundary_punctuation(txt)
126
- txt = txt.apply @language::ReplaceColonBetweenNumbersRule if defined? @language::ReplaceColonBetweenNumbersRule
127
- txt = txt.apply @language::ReplaceNonSentenceBoundaryCommaRule if defined? @language::ReplaceNonSentenceBoundaryCommaRule
134
+ txt = Rule.apply txt, @language::ReplaceColonBetweenNumbersRule if defined? @language::ReplaceColonBetweenNumbersRule
135
+ txt = Rule.apply txt, @language::ReplaceNonSentenceBoundaryCommaRule if defined? @language::ReplaceNonSentenceBoundaryCommaRule
128
136
 
129
137
  txt.scan(@language::SENTENCE_BOUNDARY_REGEX)
130
138
  end
@@ -45,9 +45,9 @@ module PragmaticSegmenter
45
45
 
46
46
  def replace_punctuation(array)
47
47
  return if !array || array.empty?
48
- @text.apply(Rules::EscapeRegexReservedCharacters::All)
48
+ Rule.apply(@text, Rules::EscapeRegexReservedCharacters::All)
49
49
  array.each do |a|
50
- a.apply(Rules::EscapeRegexReservedCharacters::All)
50
+ Rule.apply(a, Rules::EscapeRegexReservedCharacters::All)
51
51
  sub = sub_characters(a, '.', '∯')
52
52
  sub_1 = sub_characters(sub, '。', '&ᓰ&')
53
53
  sub_2 = sub_characters(sub_1, '.', '&ᓱ&')
@@ -59,7 +59,7 @@ module PragmaticSegmenter
59
59
  sub_7 = sub_characters(sub_6, "'", '&⎋&')
60
60
  end
61
61
  end
62
- @text.apply(Rules::SubEscapedRegexReservedCharacters::All)
62
+ Rule.apply(@text, Rules::SubEscapedRegexReservedCharacters::All)
63
63
  end
64
64
 
65
65
  def sub_characters(string, char_a, char_b)
@@ -1,14 +1,14 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module PragmaticSegmenter
4
- Rule = Struct.new(:pattern, :replacement)
5
-
6
- class Text < String
7
- def apply(*rules)
8
- rules.flatten.each do |rule|
9
- self.gsub!(rule.pattern, rule.replacement)
4
+ class Rule < Struct.new(:pattern, :replacement)
5
+ class << self
6
+ def apply(str, *rules)
7
+ rules.flatten.each do |rule|
8
+ str.gsub!(rule.pattern, rule.replacement)
9
+ end
10
+ str
10
11
  end
11
- self
12
12
  end
13
13
  end
14
14
  end
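The refactor above replaces the `Text < String` subclass with a class-level `Rule.apply` that mutates and returns a plain `String`. One plausible motivation, given the "Ruby 3.0 compatibility" note, is that Ruby 3.0 made many `String` methods return `String` rather than subclass instances; that link is an inference, not something the diff states. A self-contained sketch mirroring the new class (wrapped in a hypothetical `Sketch` module to avoid clashing with the gem):

```ruby
# Sketch mirroring the Rule class added in this diff; the Sketch namespace
# is hypothetical and only keeps the example self-contained.
module Sketch
  class Rule < Struct.new(:pattern, :replacement)
    # Applies each rule in place with gsub! and returns the same string object.
    def self.apply(str, *rules)
      rules.flatten.each { |rule| str.gsub!(rule.pattern, rule.replacement) }
      str
    end
  end
end

# Pattern copied from PossessiveAbbreviationRule elsewhere in this diff.
possessive = Sketch::Rule.new(/\.(?='s\s)|\.(?='s$)|\.(?='s\z)/, '∯')

original = "Acme Inc.'s profits rose."
working  = original.dup # same idea as `@text = text.dup` in the initializers
result   = Sketch::Rule.apply(working, [possessive])

puts result
puts original # unchanged, because the rules ran on the copy
```

Since `apply` uses `gsub!`, callers that must not modify their input need the `dup` shown here, which is what the updated initializers do.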
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module PragmaticSegmenter
4
- VERSION = "0.3.18"
4
+ VERSION = "0.3.23"
5
5
  end
@@ -19,8 +19,8 @@ Gem::Specification.new do |spec|
19
19
  spec.require_paths = ["lib"]
20
20
 
21
21
  spec.add_runtime_dependency "unicode"
22
- spec.add_development_dependency "bundler", "~> 1.7"
23
- spec.add_development_dependency "rake", "~> 10.0"
22
+ spec.add_development_dependency "bundler", ">= 1.7"
23
+ spec.add_development_dependency "rake", ">= 12.3.3"
24
24
  spec.add_development_dependency "rspec"
25
25
  spec.add_development_dependency "stackprof"
26
26
  end
@@ -7,5 +7,10 @@ RSpec.describe PragmaticSegmenter::Languages::Chinese, '(zh)' do
7
7
  ps = PragmaticSegmenter::Segmenter.new(text: "安永已聯繫周怡安親屬,協助辦理簽證相關事宜,周怡安家屬1月1日晚間搭乘東方航空班機抵達上海,他們步入入境大廳時神情落寞、不發一語。周怡安來自台中,去年剛從元智大學畢業,同年9月加入安永。", language: 'zh')
8
8
  expect(ps.segment).to eq(["安永已聯繫周怡安親屬,協助辦理簽證相關事宜,周怡安家屬1月1日晚間搭乘東方航空班機抵達上海,他們步入入境大廳時神情落寞、不發一語。", "周怡安來自台中,去年剛從元智大學畢業,同年9月加入安永。"])
9
9
  end
10
+
11
+ it 'correctly segments text #002' do
12
+ ps = PragmaticSegmenter::Segmenter.new(text: "我们明天一起去看《摔跤吧!爸爸》好吗?好!", language: 'zh')
13
+ expect(ps.segment).to eq(["我们明天一起去看《摔跤吧!爸爸》好吗?", "好!"])
14
+ end
10
15
  end
11
16
  end
@@ -507,7 +507,7 @@ RSpec.describe PragmaticSegmenter::Languages::Danish, "(da)" do
507
507
 
508
508
  it 'correctly segments text #048' do
509
509
  ps = PragmaticSegmenter::Segmenter.new(text: "CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co. (cited from WSJ 05/29/1987)", language: 'en')
510
- expect(ps.segment).to eq(["CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co.", "(cited from WSJ 05/29/1987)"])
510
+ expect(ps.segment).to eq(["CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co. (cited from WSJ 05/29/1987)"])
511
511
  end
512
512
 
513
513
  it 'correctly segments text #049' do
@@ -17,6 +17,24 @@ RSpec.describe PragmaticSegmenter::Languages::Deutsch, "(de)" do
17
17
  ps = PragmaticSegmenter::Segmenter.new(text: "Was sind die Konsequenzen der Abstimmung vom 12. Juni?", language: "de")
18
18
  expect(ps.segment).to eq(["Was sind die Konsequenzen der Abstimmung vom 12. Juni?"])
19
19
  end
20
+
21
+ it "Numbers #004" do
22
+ # Credit: Dr. Michael Ustaszewski
23
+ # A numeral followed by a dot within the sentence should not be treated as a sentence,
24
+ # because the meaning of numeral + dot is that of an ordinal number.
25
+ # However, if the numeral + dot is in the last position of the sentence, then it is not an ordinal,
26
+ # but a cardinal number and thus a sentence break should be made. See the following example:
27
+
28
+ # <INPUT>Die Information steht auf Seite 12. Dort kannst du nachlesen.</INPUT>
29
+ # <SHOULDBE>["Die Information steht auf Seite 12.", "Dort kannst du nachlesen."]</SHOULDBE>
30
+ # The sentence translates as "The information can be found on page 12. You can read it there".
31
+
32
+ # That's a tricky one I guess, because the capitalisation of the word following the dot is not necessarily a clue,
33
+ # since German nouns are usually always capitalised.
34
+ skip "NOT IMPLEMENTED"
35
+ ps = PragmaticSegmenter::Segmenter.new(text: "Die Information steht auf Seite 12. Dort kannst du nachlesen.", language: "de")
36
+ expect(ps.segment).to eq(["Die Information steht auf Seite 12.", "Dort kannst du nachlesen."])
37
+ end
20
38
  end
21
39
 
22
40
  # Thanks to Silvia Busse for the German test examples.
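The skipped "Numbers #004" example above is hard because capitalisation after "numeral + dot" does not separate the ordinal case from the cardinal one. The sketch below illustrates this with a deliberately naive splitter. `naive_split` is a hypothetical helper written only for this illustration; it is not part of pragmatic_segmenter and does not reflect how the gem segments text.

```ruby
# Hypothetical naive rule (an assumption for illustration, NOT the gem's logic):
# break after a period when the next word starts with an uppercase letter.
def naive_split(text)
  text.split(/(?<=\.)\s+(?=\p{Lu})/)
end

# "12." is an ordinal here ("12th of June"), so no break is wanted.
ordinal = "Was sind die Konsequenzen der Abstimmung vom 12. Juni?"
# "12." is a cardinal ending the sentence here, so a break is wanted.
cardinal = "Die Information steht auf Seite 12. Dort kannst du nachlesen."

p naive_split(ordinal)
# => ["Was sind die Konsequenzen der Abstimmung vom 12.", "Juni?"]
p naive_split(cardinal)
# => ["Die Information steht auf Seite 12.", "Dort kannst du nachlesen."]
```

The naive rule breaks in both cases: correct for the cardinal, wrong for the ordinal, because "Juni" is capitalised as a noun. Telling the two apart would need more context than the next word's case, which is presumably why the spec is marked as not implemented.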
@@ -512,7 +512,7 @@ RSpec.describe PragmaticSegmenter::Languages::English, "(en)" do
512
512
 
513
513
  it 'correctly segments text #048' do
514
514
  ps = PragmaticSegmenter::Segmenter.new(text: "CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co. (cited from WSJ 05/29/1987)", language: 'en')
515
- expect(ps.segment).to eq(["CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co.", "(cited from WSJ 05/29/1987)"])
515
+ expect(ps.segment).to eq(["CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co. (cited from WSJ 05/29/1987)"])
516
516
  end
517
517
 
518
518
  it 'correctly segments text #049' do
@@ -1406,5 +1406,61 @@ RSpec.describe PragmaticSegmenter::Languages::English, "(en)" do
1406
1406
  expect(ps.segment).to eq(["In placebo-controlled studies of all uses of Tracleer, marked decreases in hemoglobin (>15% decrease from baseline resulting in values <11 g/ dL) were observed in 6% of Tracleer-treated patients and 3% of placebo-treated patients.", "Bosentan is highly bound (>98%) to plasma proteins, mainly albumin."])
1407
1407
  end
1408
1408
 
1409
+ it 'correctly segments text #118' do
1410
+ text = "The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. (“Company B”)."
1411
+ ps = PragmaticSegmenter::Segmenter.new(text: text, clean: false)
1412
+ expect(ps.segment).to eq(["The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. (“Company B”)."])
1413
+ end
1414
+
1415
+ it 'correctly segments text #119' do
1416
+ ps = PragmaticSegmenter::Segmenter.new(text: "Unlike the abbreviations i.e. and e.g., viz. is used to indicate a detailed description of something stated before.")
1417
+ expect(ps.segment).to eq(["Unlike the abbreviations i.e. and e.g., viz. is used to indicate a detailed description of something stated before."])
1418
+ end
1419
+
1420
+ it 'correctly segments text #120' do
1421
+ ps = PragmaticSegmenter::Segmenter.new(text: "For example, ‘dragonswort… is said that it should be grown in dragon’s blood. It grows at the tops of mountains where there are groves of trees, chiefly in holy places and in the country that is called Apulia’ (translated by Anne Van Arsdall, in Medieval Herbal Remedies: The Old English Herbarium and Anglo-Saxon Medicine p. 154). The Herbal also includes lore about other plants, such as the mandrake.")
1422
+ expect(ps.segment).to eq(["For example, ‘dragonswort… is said that it should be grown in dragon’s blood. It grows at the tops of mountains where there are groves of trees, chiefly in holy places and in the country that is called Apulia’ (translated by Anne Van Arsdall, in Medieval Herbal Remedies: The Old English Herbarium and Anglo-Saxon Medicine p. 154).", "The Herbal also includes lore about other plants, such as the mandrake."])
1423
+ end
1424
+
1425
+ it 'correctly segments text #121' do
1426
+ ps = PragmaticSegmenter::Segmenter.new(text: "Here’s the - ahem - official citation: Baker, C., Anderson, Kenneth, Martin, James, & Palen, Leysia. Modeling Open Source Software Communities, ProQuest Dissertations and Theses.")
1427
+ expect(ps.segment).to eq(["Here’s the - ahem - official citation: Baker, C., Anderson, Kenneth, Martin, James, & Palen, Leysia.", "Modeling Open Source Software Communities, ProQuest Dissertations and Theses."])
1428
+ end
1429
+
1430
+ it 'correctly segments text #122' do
1431
+ ps = PragmaticSegmenter::Segmenter.new(text: "These include images of various modes of transport and members of the team, all available in .jpeg format. Images can be downloaded from our website. We also offer archives as .zip files.")
1432
+ expect(ps.segment).to eq(["These include images of various modes of transport and members of the team, all available in .jpeg format.", "Images can be downloaded from our website.", "We also offer archives as .zip files."])
1433
+ end
1434
+
1435
+ it 'correctly segments text #123' do
1436
+ ps = PragmaticSegmenter::Segmenter.new(text: "Saint Maximus (died 250) is a Christian saint and martyr.[1] The emperor Decius published a decree ordering the veneration of busts of the deified emperors.")
1437
+ expect(ps.segment).to eq(["Saint Maximus (died 250) is a Christian saint and martyr.[1]", "The emperor Decius published a decree ordering the veneration of busts of the deified emperors."])
1438
+ end
1439
+
1440
+ it 'correctly segments text #124' do
1441
+ ps = PragmaticSegmenter::Segmenter.new(text: "Differing agendas can potentially create an understanding gap in a consultation.11 12 Take the example of one of the most common presentations in ill health: the common cold.")
1442
+ expect(ps.segment).to eq(["Differing agendas can potentially create an understanding gap in a consultation.11 12", "Take the example of one of the most common presentations in ill health: the common cold."])
1443
+ end
1444
+
1445
+ it 'correctly segments text #125' do
1446
+ ps = PragmaticSegmenter::Segmenter.new(text: "Daniel Kahneman popularised the concept of fast and slow thinking: the distinction between instinctive (type 1 thinking) and reflective, analytical cognition (type 2).10 This model relates to doctors achieving a balance between efficiency and effectiveness.")
1447
+ expect(ps.segment).to eq(["Daniel Kahneman popularised the concept of fast and slow thinking: the distinction between instinctive (type 1 thinking) and reflective, analytical cognition (type 2).10", "This model relates to doctors achieving a balance between efficiency and effectiveness."])
1448
+ end
1449
+
1450
+ it 'correctly segments text #126' do
1451
+ ps = PragmaticSegmenter::Segmenter.new(text: "Its traditional use[1] is well documented in the ethnobotanical literature [2–11]. Leaves, buds, tar and essential oils are used to treat a wide spectrum of diseases.")
1452
+ expect(ps.segment).to eq(["Its traditional use[1] is well documented in the ethnobotanical literature [2–11].", "Leaves, buds, tar and essential oils are used to treat a wide spectrum of diseases."])
1453
+ end
1454
+
1455
+ it 'correctly segments text #127' do
1456
+ ps = PragmaticSegmenter::Segmenter.new(text: "Thus increasing the desire for political reform both in Lancashire and in the country at large.[7][8] This was a serious misdemeanour,[16] encouraging them to declare the assembly illegal as soon as it was announced on 31 July.[17][18] The radicals sought a second opinion on the meeting's legality.")
1457
+ expect(ps.segment).to eq(["Thus increasing the desire for political reform both in Lancashire and in the country at large.[7][8]", "This was a serious misdemeanour,[16] encouraging them to declare the assembly illegal as soon as it was announced on 31 July.[17][18]", "The radicals sought a second opinion on the meeting's legality."])
1458
+ end
1459
+
1460
+ it 'correctly segments text #128' do
1461
+ ps = PragmaticSegmenter::Segmenter.new(text: "The table in (4) is a sample from the Wall Street Journal (1987).1 According to the distribution all the pairs given in (4) count as candidates for abbreviations.")
1462
+ expect(ps.segment).to eq([ "The table in (4) is a sample from the Wall Street Journal (1987).1", "According to the distribution all the pairs given in (4) count as candidates for abbreviations."])
1463
+
1464
+ end
1409
1465
  end
1410
1466
  end
@@ -0,0 +1,73 @@
1
+ require 'spec_helper'
2
+
3
+ RSpec.describe PragmaticSegmenter::Languages::Kazakh, "(kk)" do
4
+
5
+ context "Golden Rules" do
6
+ it "Simple period to end sentence #001" do
7
+ ps = PragmaticSegmenter::Segmenter.new(text: "Мұхитқа тікелей шыға алмайтын мемлекеттердің ішінде Қазақстан - ең үлкені.", language: "kk")
8
+ expect(ps.segment).to eq(["Мұхитқа тікелей шыға алмайтын мемлекеттердің ішінде Қазақстан - ең үлкені."])
9
+ end
10
+
11
+ it "Question mark to end sentence #002" do
12
+ ps = PragmaticSegmenter::Segmenter.new(text: "Оқушылар үйі, Достық даңғылы, Абай даналығы, ауыл шаруашылығы – кім? не?", language: "kk")
13
+ expect(ps.segment).to eq(["Оқушылар үйі, Достық даңғылы, Абай даналығы, ауыл шаруашылығы – кім?", "не?"])
14
+ end
15
+
16
+ it "Parenthetical inside sentence #003" do
17
+ ps = PragmaticSegmenter::Segmenter.new(text: "Әр түрлі өлшемнің атауы болып табылатын м (метр), см (сантиметр), кг (киллограмм), т (тонна), га (гектар), ц (центнер), т. б. (тағы басқа), тәрізді белгілер де қысқарған сөздер болып табылады.", language: "kk")
18
+ expect(ps.segment).to eq(["Әр түрлі өлшемнің атауы болып табылатын м (метр), см (сантиметр), кг (киллограмм), т (тонна), га (гектар), ц (центнер), т. б. (тағы басқа), тәрізді белгілер де қысқарған сөздер болып табылады."])
19
+ end
20
+
21
+ it "Two letter abbreviation to end sentence #004" do
22
+ ps = PragmaticSegmenter::Segmenter.new(text: "Мысалы: обкомға (облыстық комитетке) барды, ауаткомда (аудандық атқару комитетінде) болды, педучилищеге (педагогтік училищеге) түсті, медпункттің (медициналық пункттің) алдында т. б.", language: "kk")
23
+ expect(ps.segment).to eq(["Мысалы: обкомға (облыстық комитетке) барды, ауаткомда (аудандық атқару комитетінде) болды, педучилищеге (педагогтік училищеге) түсті, медпункттің (медициналық пункттің) алдында т. б."])
24
+ end
25
+
26
+ it "Number as non sentence boundary #005" do
27
+ ps = PragmaticSegmenter::Segmenter.new(text: "Елдің жалпы ішкі өнімі ЖІӨ (номинал) = $225.619 млрд (2014)", language: "kk")
28
+ expect(ps.segment).to eq(["Елдің жалпы ішкі өнімі ЖІӨ (номинал) = $225.619 млрд (2014)"])
29
+ end
30
+
31
+ it "No whitespace between sentence boundary #006" do
32
+ ps = PragmaticSegmenter::Segmenter.new(text: "Ресейдiң әлеуметтiк-экономикалық жағдайы.XVIII ғасырдың бiрiншi ширегiнде Ресейге тән нәрсе.", language: "kk")
33
+ expect(ps.segment).to eq(["Ресейдiң әлеуметтiк-экономикалық жағдайы.", "XVIII ғасырдың бiрiншi ширегiнде Ресейге тән нәрсе."])
34
+ end
35
+
36
+ it "Dates within sentence #007" do
37
+ ps = PragmaticSegmenter::Segmenter.new(text: "(«Егемен Қазақстан», 7 қыркүйек 2012 жыл. №590-591); Бұл туралы кеше санпедқадағалау комитетінің облыыстық департаменті хабарлады. («Айқын», 23 сəуір 2010 жыл. № 70).", language: "kk")
38
+ expect(ps.segment).to eq(["(«Егемен Қазақстан», 7 қыркүйек 2012 жыл. №590-591); Бұл туралы кеше санпедқадағалау комитетінің облыыстық департаменті хабарлады.", "(«Айқын», 23 сəуір 2010 жыл. № 70)."])
39
+ end
40
+
41
+ it "Multi period abbreviation within sentence #008" do
42
+ ps = PragmaticSegmenter::Segmenter.new(text: "Иран революциясы (1905 — 11) және азаматтық қозғалыс (1918 — 21) кезінде А. Фарахани, М. Кермани, М. Т. Бехар, т.б. ақындар демократиялық идеяның жыршысы болды.", language: "kk")
43
+ expect(ps.segment).to eq(["Иран революциясы (1905 — 11) және азаматтық қозғалыс (1918 — 21) кезінде А. Фарахани, М. Кермани, М. Т. Бехар, т.б. ақындар демократиялық идеяның жыршысы болды."])
44
+ end
45
+
46
+ it "Web addresses #009" do
47
+ ps = PragmaticSegmenter::Segmenter.new(text: "Владимир Федосеев: Аттар магиясы енді жоқ http://www.vremya.ru/2003/179/10/80980.html", language: "kk")
48
+ expect(ps.segment).to eq(["Владимир Федосеев: Аттар магиясы енді жоқ http://www.vremya.ru/2003/179/10/80980.html"])
49
+ end
50
+
51
+ it "Question mark not at end of sentence #010" do
52
+ ps = PragmaticSegmenter::Segmenter.new(text: "Бірақ оның енді не керегі бар? — деді.", language: "kk")
53
+ expect(ps.segment).to eq(["Бірақ оның енді не керегі бар? — деді."])
54
+ end
55
+
56
+ it "Exclamation mark not at end of sentence #011" do
57
+ ps = PragmaticSegmenter::Segmenter.new(text: "Сондықтан шапаныма жегізіп отырғаным! - деп, жауап береді.", language: "kk")
58
+ expect(ps.segment).to eq(["Сондықтан шапаныма жегізіп отырғаным! - деп, жауап береді."])
59
+ end
60
+ end
61
+
62
+ describe '#segment' do
63
+ it 'correctly segments text #001' do
64
+ ps = PragmaticSegmenter::Segmenter.new(text: "Б.з.б. 6 – 3 ғасырларда конфуцийшілдік, моизм, легизм мектептерінің қалыптасуы нәтижесінде Қытай философиясы пайда болды.", language: 'kk')
65
+ expect(ps.segment).to eq(["Б.з.б. 6 – 3 ғасырларда конфуцийшілдік, моизм, легизм мектептерінің қалыптасуы нәтижесінде Қытай философиясы пайда болды."])
66
+ end
67
+
68
+ it 'correctly segments text #002' do
69
+ ps = PragmaticSegmenter::Segmenter.new(text: "'Та марбута' тек сөз соңында екі түрде жазылады:", language: "kk")
70
+ expect(ps.segment).to eq(["'Та марбута' тек сөз соңында екі түрде жазылады:"])
71
+ end
72
+ end
73
+ end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: pragmatic_segmenter
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.3.18
4
+ version: 0.3.23
5
5
  platform: ruby
6
6
  authors:
7
7
  - Kevin S. Dias
8
- autorequire:
8
+ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2018-03-27 00:00:00.000000000 Z
11
+ date: 2021-05-02 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: unicode
@@ -28,30 +28,30 @@ dependencies:
28
28
  name: bundler
29
29
  requirement: !ruby/object:Gem::Requirement
30
30
  requirements:
31
- - - "~>"
31
+ - - ">="
32
32
  - !ruby/object:Gem::Version
33
33
  version: '1.7'
34
34
  type: :development
35
35
  prerelease: false
36
36
  version_requirements: !ruby/object:Gem::Requirement
37
37
  requirements:
38
- - - "~>"
38
+ - - ">="
39
39
  - !ruby/object:Gem::Version
40
40
  version: '1.7'
41
41
  - !ruby/object:Gem::Dependency
42
42
  name: rake
43
43
  requirement: !ruby/object:Gem::Requirement
44
44
  requirements:
45
- - - "~>"
45
+ - - ">="
46
46
  - !ruby/object:Gem::Version
47
- version: '10.0'
47
+ version: 12.3.3
48
48
  type: :development
49
49
  prerelease: false
50
50
  version_requirements: !ruby/object:Gem::Requirement
51
51
  requirements:
52
- - - "~>"
52
+ - - ">="
53
53
  - !ruby/object:Gem::Version
54
- version: '10.0'
54
+ version: 12.3.3
55
55
  - !ruby/object:Gem::Dependency
56
56
  name: rspec
57
57
  requirement: !ruby/object:Gem::Requirement
@@ -124,6 +124,7 @@ files:
124
124
  - lib/pragmatic_segmenter/languages/hindi.rb
125
125
  - lib/pragmatic_segmenter/languages/italian.rb
126
126
  - lib/pragmatic_segmenter/languages/japanese.rb
127
+ - lib/pragmatic_segmenter/languages/kazakh.rb
127
128
  - lib/pragmatic_segmenter/languages/persian.rb
128
129
  - lib/pragmatic_segmenter/languages/polish.rb
129
130
  - lib/pragmatic_segmenter/languages/russian.rb
@@ -152,6 +153,7 @@ files:
152
153
  - spec/pragmatic_segmenter/languages/hindi_spec.rb
153
154
  - spec/pragmatic_segmenter/languages/italian_spec.rb
154
155
  - spec/pragmatic_segmenter/languages/japanese_spec.rb
156
+ - spec/pragmatic_segmenter/languages/kazakh_spec.rb
155
157
  - spec/pragmatic_segmenter/languages/persian_spec.rb
156
158
  - spec/pragmatic_segmenter/languages/polish_spec.rb
157
159
  - spec/pragmatic_segmenter/languages/russian_spec.rb
@@ -164,7 +166,7 @@ homepage: https://github.com/diasks2/pragmatic_segmenter
164
166
  licenses:
165
167
  - MIT
166
168
  metadata: {}
167
- post_install_message:
169
+ post_install_message:
168
170
  rdoc_options: []
169
171
  require_paths:
170
172
  - lib
@@ -179,9 +181,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
179
181
  - !ruby/object:Gem::Version
180
182
  version: '0'
181
183
  requirements: []
182
- rubyforge_project:
183
- rubygems_version: 2.4.1
184
- signing_key:
184
+ rubyforge_project:
185
+ rubygems_version: 2.7.6
186
+ signing_key:
185
187
  specification_version: 4
186
188
  summary: A rule-based sentence boundary detection gem that works out-of-the-box across
187
189
  many languages
@@ -202,6 +204,7 @@ test_files:
202
204
  - spec/pragmatic_segmenter/languages/hindi_spec.rb
203
205
  - spec/pragmatic_segmenter/languages/italian_spec.rb
204
206
  - spec/pragmatic_segmenter/languages/japanese_spec.rb
207
+ - spec/pragmatic_segmenter/languages/kazakh_spec.rb
205
208
  - spec/pragmatic_segmenter/languages/persian_spec.rb
206
209
  - spec/pragmatic_segmenter/languages/polish_spec.rb
207
210
  - spec/pragmatic_segmenter/languages/russian_spec.rb