pragmatic_segmenter 0.3.18 → 0.3.23
- checksums.yaml +5 -5
- data/.gitignore +2 -1
- data/NEWS +24 -0
- data/README.md +31 -1
- data/lib/pragmatic_segmenter/abbreviation_replacer.rb +5 -4
- data/lib/pragmatic_segmenter/between_punctuation.rb +10 -0
- data/lib/pragmatic_segmenter/cleaner.rb +12 -12
- data/lib/pragmatic_segmenter/languages.rb +3 -1
- data/lib/pragmatic_segmenter/languages/chinese.rb +26 -0
- data/lib/pragmatic_segmenter/languages/common.rb +3 -1
- data/lib/pragmatic_segmenter/languages/common/numbers.rb +4 -2
- data/lib/pragmatic_segmenter/languages/deutsch.rb +4 -3
- data/lib/pragmatic_segmenter/languages/japanese.rb +1 -1
- data/lib/pragmatic_segmenter/languages/kazakh.rb +44 -0
- data/lib/pragmatic_segmenter/list.rb +5 -5
- data/lib/pragmatic_segmenter/processor.rb +18 -10
- data/lib/pragmatic_segmenter/punctuation_replacer.rb +3 -3
- data/lib/pragmatic_segmenter/types.rb +7 -7
- data/lib/pragmatic_segmenter/version.rb +1 -1
- data/pragmatic_segmenter.gemspec +2 -2
- data/spec/pragmatic_segmenter/languages/chinese_spec.rb +5 -0
- data/spec/pragmatic_segmenter/languages/danish_spec.rb +1 -1
- data/spec/pragmatic_segmenter/languages/deutsch_spec.rb +18 -0
- data/spec/pragmatic_segmenter/languages/english_spec.rb +57 -1
- data/spec/pragmatic_segmenter/languages/kazakh_spec.rb +73 -0
- metadata +16 -13
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
-
-metadata.gz:
-data.tar.gz:
+SHA256:
+metadata.gz: 2c66c757c1b4bd8d090e88d7db6c627720f58f6f26e6fab9916a20e8bc15471c
+data.tar.gz: da3a9088f72c90ddde6f0deda67d3e3b4ea3bed317970416deef794e0f594d89
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 503c52965b2f98eebbc24e1215204c45307958a0279d56834e0c929d18625e81ac8c5c78779efb1a5946b5fdda5d8496b54a72b009ad6b2a597a70c4ba0fff66
+data.tar.gz: f23773139a3a6d9f45cecaacabb363a7fb825a21eb76b40514abf4d0407191ed3b1afa887a5bc5328626abe2dbac5864895add62a1da036234036984d19a3454
data/.gitignore
CHANGED
data/NEWS
CHANGED
@@ -1,3 +1,27 @@
+0.3.22 (2021-05-03):
+
+* Improvement: Refactor for Ruby 3.0 compatibility
+
+0.3.22 (2018-09-23):
+
+* Improvement: Initial support for Kazakh
+
+0.3.21 (2018-08-30):
+
+* Improvement: Add support for file formats
+* Improvement: Add support for numeric references at the end of a sentence (i.e. Wikipedia references)
+
+0.3.20 (2018-08-28):
+
+* Improvement: Handle slanted single quotation as a single quote
+* Bug Fix: The text contains a single character abbreviation as part of a list
+* Bug Fix: Chinese book quotes
+* Improvement: Add viz as abbreviation
+
+0.3.19 (2018-07-19):
+
+* Bug Fix: A parenthetical following an abbreviation is now included as part of the same segment. Example: "The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. (“Company B”)." is now treated as one segment.
+
 0.3.18 (2018-03-27):
 
 * Improvement: Performance optimizations
data/README.md
CHANGED
@@ -1,6 +1,6 @@
|
|
1
1
|
# Pragmatic Segmenter
|
2
2
|
|
3
|
-
[![Gem Version](https://badge.fury.io/rb/pragmatic_segmenter.svg)](http://badge.fury.io/rb/pragmatic_segmenter) [![
|
3
|
+
[![Gem Version](https://badge.fury.io/rb/pragmatic_segmenter.svg)](http://badge.fury.io/rb/pragmatic_segmenter) [![Build Status](https://travis-ci.org/diasks2/pragmatic_segmenter.png)](https://travis-ci.org/diasks2/pragmatic_segmenter) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/diasks2/pragmatic_segmenter/blob/master/LICENSE.txt)
|
4
4
|
|
5
5
|
Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
|
6
6
|
|
@@ -433,6 +433,12 @@ Was sind die Konsequenzen der Abstimmung vom 12. Juni?
|
|
433
433
|
=> ["Was sind die Konsequenzen der Abstimmung vom 12. Juni?"]
|
434
434
|
```
|
435
435
|
|
436
|
+
4.) **Cardinal numbers at end of sentence** *Credit: Dr. Michael Ustaszewski*
|
437
|
+
```
|
438
|
+
Die Information steht auf Seite 12. Dort kannst du nachlesen.
|
439
|
+
=> ["Die Information steht auf Seite 12.", "Dort kannst du nachlesen."]
|
440
|
+
```
|
441
|
+
|
436
442
|
#### Golden Rules (Japanese)
|
437
443
|
|
438
444
|
1.) **Simple period to end sentence**
|
@@ -865,6 +871,25 @@ To test the relative performance of different segmentation tools and libraries I
|
|
865
871
|
**Version 0.3.18**
|
866
872
|
* Performance optimizations
|
867
873
|
|
874
|
+
**Version 0.3.19**
|
875
|
+
* Treat a parenthetical following an abbreviation as part of the same segment
|
876
|
+
|
877
|
+
**Version 0.3.20**
|
878
|
+
* Handle slanted single quotation as a single quote
|
879
|
+
* Handle a single character abbreviation as part of a list
|
880
|
+
* Add support for Chinese caret brackets
|
881
|
+
* Add viz as abbreviation
|
882
|
+
|
883
|
+
**Version 0.3.21**
|
884
|
+
* Add support for file formats
|
885
|
+
* Add support for numeric references at the end of a sentence (i.e. Wikipedia references)
|
886
|
+
|
887
|
+
**Version 0.3.22**
|
888
|
+
* Add initial support and tests for Kazakh
|
889
|
+
|
890
|
+
**Version 0.3.23**
|
891
|
+
* Refactor for Ruby 3.0 compatibility
|
892
|
+
|
868
893
|
## Contributing
|
869
894
|
|
870
895
|
If you find a text that is incorrectly segmented using this gem, please submit an issue.
|
@@ -875,6 +900,11 @@ If you find a text that is incorrectly segmented using this gem, please submit a
|
|
875
900
|
4. Push to the branch (`git push origin my-new-feature`)
|
876
901
|
5. Create a new Pull Request
|
877
902
|
|
903
|
+
## Ports
|
904
|
+
|
905
|
+
* [C# - PragmaticSegmenterNet](https://github.com/UglyToad/PragmaticSegmenterNet)
|
906
|
+
* [Python - pySBD](https://github.com/nipunsadvilkar/pySBD)
|
907
|
+
|
878
908
|
## License
|
879
909
|
|
880
910
|
The MIT License (MIT)
|
data/lib/pragmatic_segmenter/abbreviation_replacer.rb
CHANGED
@@ -10,18 +10,19 @@ module PragmaticSegmenter
|
|
10
10
|
|
11
11
|
attr_reader :text
|
12
12
|
def initialize(text:, language: )
|
13
|
-
@text =
|
13
|
+
@text = text.dup
|
14
14
|
@language = language
|
15
15
|
end
|
16
16
|
|
17
17
|
def replace
|
18
|
-
|
18
|
+
Rule.apply(@text,
|
19
|
+
@language::PossessiveAbbreviationRule,
|
19
20
|
@language::KommanditgesellschaftRule,
|
20
21
|
@language::SingleLetterAbbreviationRules::All)
|
21
22
|
|
22
23
|
@text = search_for_abbreviations_in_string(@text)
|
23
24
|
@text = replace_multi_period_abbreviations(@text)
|
24
|
-
|
25
|
+
Rule.apply(@text, @language::AmPmRules::All)
|
25
26
|
replace_abbreviation_as_sentence_boundary(@text)
|
26
27
|
end
|
27
28
|
|
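Reviewer note: `initialize` stores `text.dup`, and the replacement steps in this hunk mutate `@text` in place. The sketch below is an illustration only (the class `InPlaceReplacer` is hypothetical, not part of the gem). It shows why copying the input is likely useful here: `dup` yields an unfrozen copy, so an in-place `gsub!` neither raises on frozen input nor changes the caller's string.

```ruby
# Hypothetical stand-in for a class that rewrites its input in place.
class InPlaceReplacer
  attr_reader :text

  def initialize(text:)
    # dup returns an unfrozen copy, so the caller's string is never touched.
    @text = text.dup
  end

  def replace
    # In-place substitution; on a frozen string this would raise FrozenError.
    @text.gsub!('.', '∯')
    @text
  end
end

original = 'Mr. Smith'.freeze
replacer = InPlaceReplacer.new(text: original)
puts replacer.replace # Mr∯ Smith
puts original         # Mr. Smith
```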
@@ -108,7 +109,7 @@ module PragmaticSegmenter
|
|
108
109
|
end
|
109
110
|
|
110
111
|
def replace_period_of_abbr(txt, abbr)
|
111
|
-
txt.gsub!(/(?<=\s#{abbr.strip})\.(?=((\.|\:|-|\?)|(\s([a-z]|I\s|I'm|I'll|\d))))|(?<=^#{abbr.strip})\.(?=((\.|\:|\?)|(\s([a-z]|I\s|I'm|I'll|\d))))/, '∯')
|
112
|
+
txt.gsub!(/(?<=\s#{abbr.strip})\.(?=((\.|\:|-|\?)|(\s([a-z]|I\s|I'm|I'll|\d|\())))|(?<=^#{abbr.strip})\.(?=((\.|\:|\?)|(\s([a-z]|I\s|I'm|I'll|\d))))/, '∯')
|
112
113
|
txt.gsub!(/(?<=\s#{abbr.strip})\.(?=,)|(?<=^#{abbr.strip})\.(?=,)/, '∯')
|
113
114
|
txt
|
114
115
|
end
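Reviewer note: the only change in `replace_period_of_abbr` is the extra `\(` alternative in the first lookahead, which protects a period that is followed by a space and an opening parenthesis. A minimal sketch of that behaviour, using a simplified copy of the first alternative only; the helper name `protect_abbreviation_period` and the use of `Regexp.escape` are assumptions of this example, not the gem's code.

```ruby
# Simplified from the first alternative of the changed pattern.
# `abbr` is any abbreviation without its trailing period, e.g. "Inc".
def protect_abbreviation_period(text, abbr)
  pattern = /(?<=\s#{Regexp.escape(abbr)})\.(?=\s([a-z]|I\s|I'm|I'll|\d|\())/
  text.gsub(pattern, '∯')
end

puts protect_abbreviation_period('ExampleCo Inc. (the Company) signed.', 'Inc')
# ExampleCo Inc∯ (the Company) signed.
```

A period followed by a capitalised word is left alone by this alternative, so it can still be treated as a sentence boundary later.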
|
data/lib/pragmatic_segmenter/between_punctuation.rb
CHANGED
@@ -8,6 +8,8 @@ module PragmaticSegmenter
|
|
8
8
|
# Rubular: http://rubular.com/r/2YFrKWQUYi
|
9
9
|
BETWEEN_SINGLE_QUOTES_REGEX = /(?<=\s)'(?:[^']|'[a-zA-Z])*'/
|
10
10
|
|
11
|
+
BETWEEN_SINGLE_QUOTE_SLANTED_REGEX = /(?<=\s)‘(?:[^’]|’[a-zA-Z])*’/
|
12
|
+
|
11
13
|
# Rubular: http://rubular.com/r/3Pw1QlXOjd
|
12
14
|
BETWEEN_DOUBLE_QUOTES_REGEX = /"(?>[^"\\]+|\\{2}|\\.)*"/
|
13
15
|
|
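Reviewer note: the new `BETWEEN_SINGLE_QUOTE_SLANTED_REGEX` mirrors the straight-quote pattern above it but uses the typographic marks ‘ and ’. A small sketch of what it captures; the pattern is copied from the diff and the sample sentences are made up for illustration.

```ruby
# Pattern copied from the hunk above.
BETWEEN_SINGLE_QUOTE_SLANTED_REGEX = /(?<=\s)‘(?:[^’]|’[a-zA-Z])*’/

sample = 'She said ‘Stop. Wait here.’ and left.'
puts sample.scan(BETWEEN_SINGLE_QUOTE_SLANTED_REGEX)
# ‘Stop. Wait here.’

# A closing mark followed by a letter is treated as an apostrophe, not the end.
puts 'He wrote ‘It’s late. Go.’ twice.'.scan(BETWEEN_SINGLE_QUOTE_SLANTED_REGEX)
# ‘It’s late. Go.’
```

The matched spans are what `sub_punctuation_between_single_quote_slanted` hands to `PunctuationReplacer`, so the periods inside them are not treated as sentence boundaries.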
@@ -42,6 +44,7 @@ module PragmaticSegmenter
|
|
42
44
|
|
43
45
|
def sub_punctuation_between_quotes_and_parens(txt)
|
44
46
|
sub_punctuation_between_single_quotes(txt)
|
47
|
+
sub_punctuation_between_single_quote_slanted(txt)
|
45
48
|
sub_punctuation_between_double_quotes(txt)
|
46
49
|
sub_punctuation_between_square_brackets(txt)
|
47
50
|
sub_punctuation_between_parens(txt)
|
@@ -74,6 +77,13 @@ module PragmaticSegmenter
|
|
74
77
|
end
|
75
78
|
end
|
76
79
|
|
80
|
+
def sub_punctuation_between_single_quote_slanted(txt)
|
81
|
+
PragmaticSegmenter::PunctuationReplacer.new(
|
82
|
+
matches_array: txt.scan(BETWEEN_SINGLE_QUOTE_SLANTED_REGEX),
|
83
|
+
text: txt
|
84
|
+
).replace
|
85
|
+
end
|
86
|
+
|
77
87
|
def sub_punctuation_between_double_quotes(txt)
|
78
88
|
PragmaticSegmenter::PunctuationReplacer.new(
|
79
89
|
matches_array: btwn_dbl_quote(txt),
|
data/lib/pragmatic_segmenter/cleaner.rb
CHANGED
@@ -11,7 +11,7 @@ module PragmaticSegmenter
|
|
11
11
|
|
12
12
|
attr_reader :text, :doc_type
|
13
13
|
def initialize(text:, doc_type: nil, language: Languages::Common)
|
14
|
-
@text =
|
14
|
+
@text = text.dup
|
15
15
|
@doc_type = doc_type
|
16
16
|
@language = language
|
17
17
|
end
|
@@ -37,10 +37,10 @@ module PragmaticSegmenter
|
|
37
37
|
replace_newlines
|
38
38
|
replace_escaped_newlines
|
39
39
|
|
40
|
-
|
40
|
+
Rule.apply(@text, HTML::All)
|
41
41
|
|
42
42
|
replace_punctuation_in_brackets
|
43
|
-
|
43
|
+
Rule.apply(@text, InlineFormattingRule)
|
44
44
|
clean_quotations
|
45
45
|
clean_table_of_contents
|
46
46
|
check_for_no_space_in_between_sentences
|
@@ -72,7 +72,7 @@ module PragmaticSegmenter
|
|
72
72
|
if word =~ regex
|
73
73
|
unless URL_EMAIL_KEYWORDS.any? { |web| word =~ /#{web}/ }
|
74
74
|
unless abbreviations.any? { |abbr| word =~ /#{abbr}/i }
|
75
|
-
new_word = word.dup
|
75
|
+
new_word = Rule.apply(word.dup, rule)
|
76
76
|
txt.gsub!(/#{Regexp.escape(word)}/, new_word)
|
77
77
|
end
|
78
78
|
end
|
@@ -92,45 +92,45 @@ module PragmaticSegmenter
|
|
92
92
|
end
|
93
93
|
|
94
94
|
def remove_newline_in_middle_of_word
|
95
|
-
|
95
|
+
Rule.apply @text, NewLineInMiddleOfWordRule
|
96
96
|
end
|
97
97
|
|
98
98
|
def replace_escaped_newlines
|
99
|
-
|
99
|
+
Rule.apply @text, EscapedNewLineRule, EscapedCarriageReturnRule,
|
100
100
|
TypoEscapedNewLineRule, TypoEscapedCarriageReturnRule
|
101
101
|
end
|
102
102
|
|
103
103
|
def replace_double_newlines
|
104
|
-
|
104
|
+
Rule.apply @text, DoubleNewLineWithSpaceRule, DoubleNewLineRule
|
105
105
|
end
|
106
106
|
|
107
107
|
def replace_newlines
|
108
108
|
if doc_type.eql?('pdf')
|
109
109
|
remove_pdf_line_breaks
|
110
110
|
else
|
111
|
-
|
111
|
+
Rule.apply @text, NewLineFollowedByPeriodRule,
|
112
112
|
ReplaceNewlineWithCarriageReturnRule
|
113
113
|
end
|
114
114
|
end
|
115
115
|
|
116
116
|
def remove_pdf_line_breaks
|
117
|
-
|
117
|
+
Rule.apply @text, NewLineFollowedByBulletRule,
|
118
118
|
|
119
119
|
PDF::NewLineInMiddleOfSentenceRule,
|
120
120
|
PDF::NewLineInMiddleOfSentenceNoSpacesRule
|
121
121
|
end
|
122
122
|
|
123
123
|
def clean_quotations
|
124
|
-
|
124
|
+
Rule.apply @text, QuotationsFirstRule, QuotationsSecondRule
|
125
125
|
end
|
126
126
|
|
127
127
|
def clean_table_of_contents
|
128
|
-
|
128
|
+
Rule.apply @text, TableOfContentsRule, ConsecutivePeriodsRule,
|
129
129
|
ConsecutiveForwardSlashRule
|
130
130
|
end
|
131
131
|
|
132
132
|
def clean_consecutive_characters
|
133
|
-
|
133
|
+
Rule.apply @text, ConsecutivePeriodsRule, ConsecutiveForwardSlashRule
|
134
134
|
end
|
135
135
|
end
|
136
136
|
end
|
data/lib/pragmatic_segmenter/languages.rb
CHANGED
@@ -26,6 +26,7 @@ require 'pragmatic_segmenter/languages/polish'
|
|
26
26
|
require 'pragmatic_segmenter/languages/chinese'
|
27
27
|
require 'pragmatic_segmenter/languages/bulgarian'
|
28
28
|
require 'pragmatic_segmenter/languages/danish'
|
29
|
+
require 'pragmatic_segmenter/languages/kazakh'
|
29
30
|
|
30
31
|
module PragmaticSegmenter
|
31
32
|
module Languages
|
@@ -49,7 +50,8 @@ module PragmaticSegmenter
|
|
49
50
|
'nl' => Dutch,
|
50
51
|
'pl' => Polish,
|
51
52
|
'zh' => Chinese,
|
52
|
-
'da' => Danish
|
53
|
+
'da' => Danish,
|
54
|
+
'kk' => Kazakh
|
53
55
|
}
|
54
56
|
|
55
57
|
def self.get_language_by_code(code)
|
data/lib/pragmatic_segmenter/languages/chinese.rb
CHANGED
@@ -8,6 +8,32 @@ module PragmaticSegmenter
|
|
8
8
|
class AbbreviationReplacer < AbbreviationReplacer
|
9
9
|
SENTENCE_STARTERS = [].freeze
|
10
10
|
end
|
11
|
+
|
12
|
+
class BetweenPunctuation < PragmaticSegmenter::BetweenPunctuation
|
13
|
+
BETWEEN_DOUBLE_ANGLE_QUOTATION_MARK_REGEX = /《(?>[^》\\]+|\\{2}|\\.)*》/
|
14
|
+
BETWEEN_L_BRACKET_REGEX = /「(?>[^」\\]+|\\{2}|\\.)*」/
|
15
|
+
private
|
16
|
+
|
17
|
+
def sub_punctuation_between_quotes_and_parens(txt)
|
18
|
+
super
|
19
|
+
sub_punctuation_between_double_angled_quotation_marks(txt)
|
20
|
+
sub_punctuation_between_l_bracket(txt)
|
21
|
+
end
|
22
|
+
|
23
|
+
def sub_punctuation_between_double_angled_quotation_marks(txt)
|
24
|
+
PunctuationReplacer.new(
|
25
|
+
matches_array: txt.scan(BETWEEN_DOUBLE_ANGLE_QUOTATION_MARK_REGEX),
|
26
|
+
text: txt
|
27
|
+
).replace
|
28
|
+
end
|
29
|
+
|
30
|
+
def sub_punctuation_between_l_bracket(txt)
|
31
|
+
PunctuationReplacer.new(
|
32
|
+
matches_array: txt.scan(BETWEEN_L_BRACKET_REGEX),
|
33
|
+
text: txt
|
34
|
+
).replace
|
35
|
+
end
|
36
|
+
end
|
11
37
|
end
|
12
38
|
end
|
13
39
|
end
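Reviewer note: the Chinese `BetweenPunctuation` subclass adds patterns for book-title marks 《…》 and corner brackets 「…」. The sketch below copies the first pattern from the diff and shows the span it finds in the sentence used by the new spec. The helper `mask_inside_book_titles` and its choice of placeholder are assumptions for illustration; the gem delegates this step to `PunctuationReplacer`.

```ruby
# Pattern copied from the hunk above.
BETWEEN_DOUBLE_ANGLE_QUOTATION_MARK_REGEX = /《(?>[^》\\]+|\\{2}|\\.)*》/

text = '我们明天一起去看《摔跤吧!爸爸》好吗?好!'
puts text.scan(BETWEEN_DOUBLE_ANGLE_QUOTATION_MARK_REGEX)
# 《摔跤吧!爸爸》

# Hypothetical helper: hide the exclamation mark inside a title so a later
# boundary scan would not split there.
def mask_inside_book_titles(str)
  str.gsub(BETWEEN_DOUBLE_ANGLE_QUOTATION_MARK_REGEX) { |m| m.gsub('!', '&ᓴ&') }
end

puts mask_inside_book_titles(text)
# 我们明天一起去看《摔跤吧&ᓴ&爸爸》好吗?好!
```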
|
data/lib/pragmatic_segmenter/languages/common.rb
CHANGED
@@ -11,7 +11,7 @@ module PragmaticSegmenter
|
|
11
11
|
|
12
12
|
# Defines the abbreviations for each language (if available)
|
13
13
|
module Abbreviation
|
14
|
-
ABBREVIATIONS = Set.new(['adj', 'adm', 'adv', 'al', 'ala', 'alta', 'apr', 'arc', 'ariz', 'ark', 'art', 'assn', 'asst', 'attys', 'aug', 'ave', 'bart', 'bld', 'bldg', 'blvd', 'brig', 'bros', 'btw', 'cal', 'calif', 'capt', 'cl', 'cmdr', 'co', 'col', 'colo', 'comdr', 'con', 'conn', 'corp', 'cpl', 'cres', 'ct', 'd.phil', 'dak', 'dec', 'del', 'dept', 'det', 'dist', 'dr', 'dr.phil', 'dr.philos', 'drs', 'e.g', 'ens', 'esp', 'esq', 'etc', 'exp', 'expy', 'ext', 'feb', 'fed', 'fla', 'ft', 'fwy', 'fy', 'ga', 'gen', 'gov', 'hon', 'hosp', 'hr', 'hway', 'hwy', 'i.e', 'ia', 'id', 'ida', 'ill', 'inc', 'ind', 'ing', 'insp', 'is', 'jan', 'jr', 'jul', 'jun', 'kan', 'kans', 'ken', 'ky', 'la', 'lt', 'ltd', 'maj', 'man', 'mar', 'mass', 'may', 'md', 'me', 'med', 'messrs', 'mex', 'mfg', 'mich', 'min', 'minn', 'miss', 'mlle', 'mm', 'mme', 'mo', 'mont', 'mr', 'mrs', 'ms', 'msgr', 'mssrs', 'mt', 'mtn', 'neb', 'nebr', 'nev', 'no', 'nos', 'nov', 'nr', 'oct', 'ok', 'okla', 'ont', 'op', 'ord', 'ore', 'p', 'pa', 'pd', 'pde', 'penn', 'penna', 'pfc', 'ph', 'ph.d', 'pl', 'plz', 'pp', 'prof', 'pvt', 'que', 'rd', 'rs', 'ref', 'rep', 'reps', 'res', 'rev', 'rt', 'sask', 'sec', 'sen', 'sens', 'sep', 'sept', 'sfc', 'sgt', 'sr', 'st', 'supt', 'surg', 'tce', 'tenn', 'tex', 'univ', 'usafa', 'u.s', 'ut', 'va', 'v', 'ver', 'vs', 'vt', 'wash', 'wis', 'wisc', 'wy', 'wyo', 'yuk']).freeze
|
14
|
+
ABBREVIATIONS = Set.new(['adj', 'adm', 'adv', 'al', 'ala', 'alta', 'apr', 'arc', 'ariz', 'ark', 'art', 'assn', 'asst', 'attys', 'aug', 'ave', 'bart', 'bld', 'bldg', 'blvd', 'brig', 'bros', 'btw', 'cal', 'calif', 'capt', 'cl', 'cmdr', 'co', 'col', 'colo', 'comdr', 'con', 'conn', 'corp', 'cpl', 'cres', 'ct', 'd.phil', 'dak', 'dec', 'del', 'dept', 'det', 'dist', 'dr', 'dr.phil', 'dr.philos', 'drs', 'e.g', 'ens', 'esp', 'esq', 'etc', 'exp', 'expy', 'ext', 'feb', 'fed', 'fla', 'ft', 'fwy', 'fy', 'ga', 'gen', 'gov', 'hon', 'hosp', 'hr', 'hway', 'hwy', 'i.e', 'ia', 'id', 'ida', 'ill', 'inc', 'ind', 'ing', 'insp', 'is', 'jan', 'jr', 'jul', 'jun', 'kan', 'kans', 'ken', 'ky', 'la', 'lt', 'ltd', 'maj', 'man', 'mar', 'mass', 'may', 'md', 'me', 'med', 'messrs', 'mex', 'mfg', 'mich', 'min', 'minn', 'miss', 'mlle', 'mm', 'mme', 'mo', 'mont', 'mr', 'mrs', 'ms', 'msgr', 'mssrs', 'mt', 'mtn', 'neb', 'nebr', 'nev', 'no', 'nos', 'nov', 'nr', 'oct', 'ok', 'okla', 'ont', 'op', 'ord', 'ore', 'p', 'pa', 'pd', 'pde', 'penn', 'penna', 'pfc', 'ph', 'ph.d', 'pl', 'plz', 'pp', 'prof', 'pvt', 'que', 'rd', 'rs', 'ref', 'rep', 'reps', 'res', 'rev', 'rt', 'sask', 'sec', 'sen', 'sens', 'sep', 'sept', 'sfc', 'sgt', 'sr', 'st', 'supt', 'surg', 'tce', 'tenn', 'tex', 'univ', 'usafa', 'u.s', 'ut', 'va', 'v', 'ver', 'viz', 'vs', 'vt', 'wash', 'wis', 'wisc', 'wy', 'wyo', 'yuk']).freeze
|
15
15
|
PREPOSITIVE_ABBREVIATIONS = Set.new(['adm', 'attys', 'brig', 'capt', 'cmdr', 'col', 'cpl', 'det', 'dr', 'gen', 'gov', 'ing', 'lt', 'maj', 'mr', 'mrs', 'ms', 'mt', 'messrs', 'mssrs', 'prof', 'ph', 'rep', 'reps', 'rev', 'sen', 'sens', 'sgt', 'st', 'supt', 'v', 'vs']).freeze
|
16
16
|
NUMBER_ABBREVIATIONS = Set.new(['art', 'ext', 'no', 'nos', 'p', 'pp']).freeze
|
17
17
|
end
|
@@ -24,6 +24,8 @@ module PragmaticSegmenter
|
|
24
24
|
# Rubular: http://rubular.com/r/G2opjedIm9
|
25
25
|
GeoLocationRule = Rule.new(/(?<=[a-zA-z]°)\.(?=\s*\d+)/, '∯')
|
26
26
|
|
27
|
+
FileFormatRule = Rule.new(/(?<=\s)\.(?=(jpe?g|png|gif|tiff?|pdf|ps|docx?|xlsx?|svg|bmp|tga|exif|odt|html?|txt|rtf|bat|sxw|xml|zip|exe|msi|blend|wmv|mp[34]|pptx?|flac|rb|cpp|cs|js)\s)/, '∯')
|
28
|
+
|
27
29
|
SingleNewLineRule = Rule.new(/\n/, 'ȹ')
|
28
30
|
|
29
31
|
module DoublePunctuationRules
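Reviewer note: `FileFormatRule` protects a period that starts a bare file extension, such as " .pdf ", from being read as a sentence boundary. The sketch below copies the pattern from the diff into a constant whose name (`FILE_FORMAT_PATTERN`) is chosen for this example; the sample sentences are made up.

```ruby
# Pattern copied from FileFormatRule in the hunk above.
FILE_FORMAT_PATTERN = /(?<=\s)\.(?=(jpe?g|png|gif|tiff?|pdf|ps|docx?|xlsx?|svg|bmp|tga|exif|odt|html?|txt|rtf|bat|sxw|xml|zip|exe|msi|blend|wmv|mp[34]|pptx?|flac|rb|cpp|cs|js)\s)/

puts 'Please send the .pdf and .docx files today.'.gsub(FILE_FORMAT_PATTERN, '∯')
# Please send the ∯pdf and ∯docx files today.
```

As written, the pattern needs whitespace on both sides of the extension, so `report.pdf` and a sentence-final `.pdf.` are left unchanged.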
|
data/lib/pragmatic_segmenter/languages/common/numbers.rb
CHANGED
@@ -47,6 +47,8 @@ module PragmaticSegmenter
|
|
47
47
|
# Rubular: http://rubular.com/r/mQ8Es9bxtk
|
48
48
|
CONTINUOUS_PUNCTUATION_REGEX = /(?<=\S)(!|\?){3,}(?=(\s|\z|$))/
|
49
49
|
|
50
|
+
NUMBERED_REFERENCE_REGEX = /(?<=[^\d\s])(\.|∯)((\[(\d{1,3},?\s?-?\s?)*\b\d{1,3}\])+|((\d{1,3}\s?)*\d{1,3}))(\s)(?=[A-Z])/
|
51
|
+
|
50
52
|
# Rubular: http://rubular.com/r/yqa4Rit8EY
|
51
53
|
PossessiveAbbreviationRule = Rule.new(/\.(?='s\s)|\.(?='s$)|\.(?='s\z)/, '∯')
|
52
54
|
|
@@ -76,10 +78,10 @@ module PragmaticSegmenter
|
|
76
78
|
# replaces the periods.
|
77
79
|
module SingleLetterAbbreviationRules
|
78
80
|
# Rubular: http://rubular.com/r/e3H6kwnr6H
|
79
|
-
SingleUpperCaseLetterAtStartOfLineRule = Rule.new(/(?<=^[A-Z])\.(
|
81
|
+
SingleUpperCaseLetterAtStartOfLineRule = Rule.new(/(?<=^[A-Z])\.(?=,?\s)/, '∯')
|
80
82
|
|
81
83
|
# Rubular: http://rubular.com/r/gitvf0YWH4
|
82
|
-
SingleUpperCaseLetterRule = Rule.new(/(?<=\s[A-Z])\.(
|
84
|
+
SingleUpperCaseLetterRule = Rule.new(/(?<=\s[A-Z])\.(?=,?\s)/, '∯')
|
83
85
|
|
84
86
|
All = [
|
85
87
|
SingleUpperCaseLetterAtStartOfLineRule,
|
data/lib/pragmatic_segmenter/languages/deutsch.rb
CHANGED
@@ -47,7 +47,7 @@ module PragmaticSegmenter
|
|
47
47
|
private
|
48
48
|
|
49
49
|
def replace_numbers
|
50
|
-
|
50
|
+
Rule.apply @text, Numbers::All
|
51
51
|
|
52
52
|
replace_period_in_deutsch_dates
|
53
53
|
end
|
@@ -68,7 +68,8 @@ module PragmaticSegmenter
|
|
68
68
|
).freeze
|
69
69
|
|
70
70
|
def replace
|
71
|
-
@text =
|
71
|
+
@text = Rule.apply(
|
72
|
+
text,
|
72
73
|
@language::PossessiveAbbreviationRule,
|
73
74
|
@language::SingleLetterAbbreviationRules::All,
|
74
75
|
SingleLowerCaseLetterRule,
|
@@ -76,7 +77,7 @@ module PragmaticSegmenter
|
|
76
77
|
|
77
78
|
@text = search_for_abbreviations_in_string(@text)
|
78
79
|
@text = replace_multi_period_abbreviations(@text)
|
79
|
-
|
80
|
+
Rule.apply(@text, Languages::Common::AmPmRules::All)
|
80
81
|
replace_abbreviation_as_sentence_boundary(@text)
|
81
82
|
end
|
82
83
|
|
data/lib/pragmatic_segmenter/languages/kazakh.rb
ADDED
@@ -0,0 +1,44 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module PragmaticSegmenter
|
4
|
+
module Languages
|
5
|
+
module Kazakh
|
6
|
+
include Languages::Common
|
7
|
+
|
8
|
+
MULTI_PERIOD_ABBREVIATION_REGEX = /\b\p{Cyrillic}(?:\.\s?\p{Cyrillic})+[.]|b[a-z](?:\.[a-z])+[.]/i
|
9
|
+
|
10
|
+
module Abbreviation
|
11
|
+
ABBREVIATIONS = Set.new(['afp', 'anp', 'atp', 'bae', 'bg', 'bp', 'cam', 'cctv', 'cd', 'cez', 'cgi', 'cnpc', 'farc', 'fbi', 'eiti', 'epo', 'er', 'gp', 'gps', 'has', 'hiv', 'hrh', 'http', 'icu', 'idf', 'imd', 'ime', 'icu', 'idf', 'ip', 'iso', 'kaz', 'kpo', 'kpa', 'kz', 'kz', 'mri', 'nasa', 'nba', 'nbc', 'nds', 'ohl', 'omlt', 'ppm', 'pda', 'pkk', 'psm', 'psp', 'raf', 'rss', 'rtl', 'sas', 'sme', 'sms', 'tnt', 'udf', 'uefa', 'usb', 'utc', 'x', 'zdf', 'әқбк', 'әқбк', 'аақ', 'авг.', 'aбб', 'аек', 'ак', 'ақ', 'акцион.', 'акср', 'ақш', 'англ', 'аөсшк', 'апр', 'м.', 'а.', 'р.', 'ғ.', 'апр.', 'аум.', 'ацат', 'әч', 'т. б.', 'б. з. б.', 'б. з. б.', 'б. з. д.', 'б. з. д.', 'биікт.', 'б. т.', 'биол.', 'биохим', 'бө', 'б. э. д.', 'бта', 'бұұ', 'вич', 'всоонл', 'геогр.', 'геол.', 'гленкор', 'гэс', 'қк', 'км', 'г', 'млн', 'млрд', 'т', 'ғ. с.', 'ғ.', 'қ.', 'ғ.', 'дек.', 'днқ', 'дсұ', 'еақк', 'еқыұ', 'ембімұнайгаз', 'ео', 'еуразэқ', 'еуроодақ', 'еұу', 'ж.', 'ж.', 'жж.', 'жоо', 'жіө', 'жсдп', 'жшс', 'іім', 'инта', 'исаф', 'камаз', 'кгб', 'кеу', 'кг', 'км²', 'км²', 'км³', 'км³', 'кимеп', 'кср', 'ксро', 'кокп', 'кхдр', 'қазатомпром', 'қазкср', 'қазұу', 'қазмұнайгаз', 'қазпошта', 'қазтаг', 'қазұу', 'қкп', 'қмдб', 'қр', 'қхр', 'лат.', 'м²', 'м²', 'м³', 'м³', 'магатэ', 'май.', 'максам', 'мб', 'мвт', 'мемл', 'м', 'мсоп', 'мтк', 'мыс.', 'наса', 'нато', 'нквд', 'нояб.', 'обл.', 'огпу', 'окт.', 'оңт.', 'опек', 'оеб', 'өзенмұнайгаз', 'өф', 'пәк', 'пед.', 'ркфср', 'рнқ', 'рсфср', 'рф', 'свс', 'сву', 'сду', 'сес', 'сент.', 'см', 'снпс', 'солт.', 'солт.', 'сооно', 'ссро', 'сср', 'ссср', 'ссс', 'сэс', 'дк', 'т. б.', 'т', 'тв', 'тереңд.', 'тех.', 'тжқ', 'тмд', 'төм.', 'трлн', 'тр', 'т.', 'и.', 'м.', 'с.', 'ш.', 'т.', 'т. с. 
с.', 'тэц', 'уаз', 'уефа', 'еқыұ', 'ұқк', 'ұқшұ', 'февр.', 'фққ', 'фсб', 'хим.', 'хқко', 'шұар', 'шыұ', 'экон.', 'экспо', 'цтп', 'цас', 'янв.', 'dvd', 'жкт', 'ққс', 'км', 'ацат', 'юнеско', 'ббс', 'mgm', 'жск', 'зоо', 'бсн', 'өұқ', 'оар', 'боак', 'эөкк', 'хтқо', 'әөк', 'жэк', 'хдо', 'спбму', 'аф', 'сбд', 'амт', 'гсдп', 'гсбп', 'эыдұ', 'нұсжп', 'шыұ', 'жтсх', 'хдп', 'эқк', 'фкққ', 'пиқ', 'өгк', 'мбф', 'маж', 'кота', 'тж', 'ук', 'обб', 'сбл', 'жхл', 'кмс', 'бмтрк', 'жққ', 'бхооо', 'мқо', 'ржмб', 'гулаг', 'жко', 'еэы', 'еаэы', 'кхдр', 'рфкп', 'рлдп', 'хвқ', 'мр', 'мт', 'кту', 'ртж', 'тим', 'мемдум', 'ксро', 'т.с.с', 'с.ш.', 'ш.б.', 'б.б.', 'руб', 'мин', 'акад.', 'ғ.', 'мм', 'мм.']).freeze
|
12
|
+
PREPOSITIVE_ABBREVIATIONS = [].freeze
|
13
|
+
NUMBER_ABBREVIATIONS = [].freeze
|
14
|
+
end
|
15
|
+
|
16
|
+
class Processor < PragmaticSegmenter::Processor
|
17
|
+
private
|
18
|
+
|
19
|
+
# Rubular: http://rubular.com/r/WRWy56Z5zp
|
20
|
+
QuestionMarkFollowedByDashLowercaseRule = Rule.new(/(?<=\p{Ll})\?(?=\s*[-—]\s*\p{Ll})/, '&ᓷ&')
|
21
|
+
# Rubular: http://rubular.com/r/lixxP7puSa
|
22
|
+
ExclamationMarkFollowedByDashLowercaseRule = Rule.new(/(?<=\p{Ll})!(?=\s*[-—]\s*\p{Ll})/, '&ᓴ&')
|
23
|
+
|
24
|
+
def between_punctuation(txt)
|
25
|
+
super(txt)
|
26
|
+
Rule.apply(txt, QuestionMarkFollowedByDashLowercaseRule, ExclamationMarkFollowedByDashLowercaseRule)
|
27
|
+
end
|
28
|
+
end
|
29
|
+
|
30
|
+
class AbbreviationReplacer < AbbreviationReplacer
|
31
|
+
SENTENCE_STARTERS = [].freeze
|
32
|
+
|
33
|
+
SingleUpperCaseCyrillicLetterAtStartOfLineRule = Rule.new(/(?<=^[А-ЯЁ])\.(?=\s)/, '∯')
|
34
|
+
SingleUpperCaseCyrillicLetterRule = Rule.new(/(?<=\s[А-ЯЁ])\.(?=\s)/, '∯')
|
35
|
+
|
36
|
+
def replace
|
37
|
+
super
|
38
|
+
Rule.apply(@text, SingleUpperCaseCyrillicLetterAtStartOfLineRule, SingleUpperCaseCyrillicLetterRule)
|
39
|
+
end
|
40
|
+
end
|
41
|
+
end
|
42
|
+
end
|
43
|
+
end
|
44
|
+
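Reviewer note: the Kazakh `AbbreviationReplacer` adds two rules that protect the period after a single uppercase Cyrillic initial. The sketch below copies both patterns from the diff; the constant names, the helper `protect_cyrillic_initials`, and the sample sentences are assumptions of this example.

```ruby
# Patterns copied from the two Cyrillic single-letter rules above.
CYRILLIC_INITIAL_AT_LINE_START = /(?<=^[А-ЯЁ])\.(?=\s)/
CYRILLIC_INITIAL = /(?<=\s[А-ЯЁ])\.(?=\s)/

# Hypothetical helper that applies both patterns in order.
def protect_cyrillic_initials(text)
  text.gsub(CYRILLIC_INITIAL_AT_LINE_START, '∯').gsub(CYRILLIC_INITIAL, '∯')
end

puts protect_cyrillic_initials('Өлеңді А. Құнанбаев жазған.')
# Өлеңді А∯ Құнанбаев жазған.
```

One observation from the character class itself: `[А-ЯЁ]` covers the basic Russian alphabet range, so an initial written with a Kazakh-specific capital such as Қ is not matched by these two rules.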
|
data/lib/pragmatic_segmenter/list.rb
CHANGED
@@ -48,7 +48,7 @@ module PragmaticSegmenter
|
|
48
48
|
|
49
49
|
attr_reader :text
|
50
50
|
def initialize(text:)
|
51
|
-
@text =
|
51
|
+
@text = text.dup
|
52
52
|
end
|
53
53
|
|
54
54
|
def add_line_break
|
@@ -68,13 +68,13 @@ module PragmaticSegmenter
|
|
68
68
|
def format_numbered_list_with_parens
|
69
69
|
replace_parens_in_numbered_list
|
70
70
|
add_line_breaks_for_numbered_list_with_parens
|
71
|
-
|
71
|
+
Rule.apply(@text, ListMarkerRule)
|
72
72
|
end
|
73
73
|
|
74
74
|
def format_numbered_list_with_periods
|
75
75
|
replace_periods_in_numbered_list
|
76
76
|
add_line_breaks_for_numbered_list_with_periods
|
77
|
-
|
77
|
+
Rule.apply(@text, SubstituteListPeriodRule)
|
78
78
|
end
|
79
79
|
|
80
80
|
def format_alphabetical_lists
|
@@ -93,7 +93,7 @@ module PragmaticSegmenter
|
|
93
93
|
|
94
94
|
def add_line_breaks_for_numbered_list_with_periods
|
95
95
|
if @text.include?('♨') && @text !~ /♨.+\n.+♨|♨.+\r.+♨/ && @text !~ /for\s\d{1,2}♨\s[a-z]/
|
96
|
-
|
96
|
+
Rule.apply(@text, SpaceBetweenListItemsFirstRule, SpaceBetweenListItemsSecondRule)
|
97
97
|
end
|
98
98
|
end
|
99
99
|
|
@@ -105,7 +105,7 @@ module PragmaticSegmenter
|
|
105
105
|
|
106
106
|
def add_line_breaks_for_numbered_list_with_parens
|
107
107
|
if @text.include?('☝') && @text !~ /☝.+\n.+☝|☝.+\r.+☝/
|
108
|
-
|
108
|
+
Rule.apply(@text, SpaceBetweenListItemsThirdRule)
|
109
109
|
end
|
110
110
|
end
|
111
111
|
|
data/lib/pragmatic_segmenter/processor.rb
CHANGED
@@ -23,8 +23,10 @@ module PragmaticSegmenter
|
|
23
23
|
replace_abbreviations
|
24
24
|
replace_numbers
|
25
25
|
replace_continuous_punctuation
|
26
|
-
|
27
|
-
|
26
|
+
replace_periods_before_numeric_references
|
27
|
+
Rule.apply(@text, @language::Abbreviations::WithMultiplePeriodsAndEmailRule)
|
28
|
+
Rule.apply(@text, @language::GeoLocationRule)
|
29
|
+
Rule.apply(@text, @language::FileFormatRule)
|
28
30
|
split_into_segments
|
29
31
|
end
|
30
32
|
|
@@ -32,18 +34,19 @@ module PragmaticSegmenter
|
|
32
34
|
|
33
35
|
def split_into_segments
|
34
36
|
check_for_parens_between_quotes(@text).split("\r")
|
35
|
-
.map! { |segment|
|
37
|
+
.map! { |segment| Rule.apply(segment, @language::SingleNewLineRule, @language::EllipsisRules::All) }
|
36
38
|
.map { |segment| check_for_punctuation(segment) }.flatten
|
37
|
-
.map! { |segment|
|
39
|
+
.map! { |segment| Rule.apply(segment, @language::SubSymbolsRules::All) }
|
38
40
|
.map { |segment| post_process_segments(segment) }
|
39
41
|
.flatten.compact.delete_if(&:empty?)
|
40
|
-
.map! { |segment|
|
42
|
+
.map! { |segment| Rule.apply(segment, @language::SubSingleQuoteRule) }
|
41
43
|
end
|
42
44
|
|
43
45
|
def post_process_segments(txt)
|
44
46
|
return txt if txt.length < 2 && txt =~ /\A[a-zA-Z]*\Z/
|
45
47
|
return if consecutive_underscore?(txt) || txt.length < 2
|
46
|
-
|
48
|
+
Rule.apply(
|
49
|
+
txt,
|
47
50
|
@language::ReinsertEllipsisRules::All,
|
48
51
|
@language::ExtraWhiteSpaceRule
|
49
52
|
)
|
@@ -68,6 +71,10 @@ module PragmaticSegmenter
|
|
68
71
|
end
|
69
72
|
end
|
70
73
|
|
74
|
+
def replace_periods_before_numeric_references
|
75
|
+
@text.gsub!(@language::NUMBERED_REFERENCE_REGEX, "∯\\2\r\\7")
|
76
|
+
end
|
77
|
+
|
71
78
|
def consecutive_underscore?(txt)
|
72
79
|
# Rubular: http://rubular.com/r/fTF2Ff3WBL
|
73
80
|
txt.gsub(/_{3,}/, '').length.eql?(0)
|
@@ -85,7 +92,8 @@ module PragmaticSegmenter
|
|
85
92
|
txt << 'ȸ' unless @language::Punctuations.any? { |p| txt[-1].include?(p) }
|
86
93
|
ExclamationWords.apply_rules(txt)
|
87
94
|
between_punctuation(txt)
|
88
|
-
txt =
|
95
|
+
txt = Rule.apply(
|
96
|
+
txt,
|
89
97
|
@language::DoublePunctuationRules::All,
|
90
98
|
@language::QuestionMarkInQuotationRule,
|
91
99
|
@language::ExclamationPointRules::All
|
@@ -95,7 +103,7 @@ module PragmaticSegmenter
|
|
95
103
|
end
|
96
104
|
|
97
105
|
def replace_numbers
|
98
|
-
|
106
|
+
Rule.apply @text, @language::Numbers::All
|
99
107
|
end
|
100
108
|
|
101
109
|
def abbreviations_replacer
|
@@ -123,8 +131,8 @@ module PragmaticSegmenter
|
|
123
131
|
end
|
124
132
|
|
125
133
|
def sentence_boundary_punctuation(txt)
|
126
|
-
txt =
|
127
|
-
txt =
|
134
|
+
txt = Rule.apply txt, @language::ReplaceColonBetweenNumbersRule if defined? @language::ReplaceColonBetweenNumbersRule
|
135
|
+
txt = Rule.apply txt, @language::ReplaceNonSentenceBoundaryCommaRule if defined? @language::ReplaceNonSentenceBoundaryCommaRule
|
128
136
|
|
129
137
|
txt.scan(@language::SENTENCE_BOUNDARY_REGEX)
|
130
138
|
end
|
data/lib/pragmatic_segmenter/punctuation_replacer.rb
CHANGED
@@ -45,9 +45,9 @@ module PragmaticSegmenter
|
|
45
45
|
|
46
46
|
def replace_punctuation(array)
|
47
47
|
return if !array || array.empty?
|
48
|
-
|
48
|
+
Rule.apply(@text, Rules::EscapeRegexReservedCharacters::All)
|
49
49
|
array.each do |a|
|
50
|
-
|
50
|
+
Rule.apply(a, Rules::EscapeRegexReservedCharacters::All)
|
51
51
|
sub = sub_characters(a, '.', '∯')
|
52
52
|
sub_1 = sub_characters(sub, '。', '&ᓰ&')
|
53
53
|
sub_2 = sub_characters(sub_1, '.', '&ᓱ&')
|
@@ -59,7 +59,7 @@ module PragmaticSegmenter
|
|
59
59
|
sub_7 = sub_characters(sub_6, "'", '&⎋&')
|
60
60
|
end
|
61
61
|
end
|
62
|
-
|
62
|
+
Rule.apply(@text, Rules::SubEscapedRegexReservedCharacters::All)
|
63
63
|
end
|
64
64
|
|
65
65
|
def sub_characters(string, char_a, char_b)
|
data/lib/pragmatic_segmenter/types.rb
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
# frozen_string_literal: true
|
2
2
|
|
3
3
|
module PragmaticSegmenter
|
4
|
-
Rule
|
5
|
-
|
6
|
-
|
7
|
-
|
8
|
-
|
9
|
-
|
4
|
+
class Rule < Struct.new(:pattern, :replacement)
|
5
|
+
class << self
|
6
|
+
def apply(str, *rules)
|
7
|
+
rules.flatten.each do |rule|
|
8
|
+
str.gsub!(rule.pattern, rule.replacement)
|
9
|
+
end
|
10
|
+
str
|
10
11
|
end
|
11
|
-
self
|
12
12
|
end
|
13
13
|
end
|
14
14
|
end
|
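Reviewer note: most hunks in this release route substitutions through `Rule.apply(str, *rules)`. The standalone sketch below re-declares an equivalent `Rule` so it runs without the gem. The two example rules and the constant names are made up for illustration, although `/\n/ => 'ȹ'` matches `SingleNewLineRule` shown earlier in the diff.

```ruby
# Equivalent of the Rule class in the diff, re-declared for a standalone run.
Rule = Struct.new(:pattern, :replacement) do
  def self.apply(str, *rules)
    # flatten lets callers pass single rules and arrays such as SomeRules::All.
    rules.flatten.each { |rule| str.gsub!(rule.pattern, rule.replacement) }
    # gsub! returns nil when nothing matched, so return the string explicitly.
    str
  end
end

PERIOD_BEFORE_DIGIT = Rule.new(/\.(?=\d)/, '∯') # illustrative rule
NEW_LINE = Rule.new(/\n/, 'ȹ')
NUMBER_RULES = [PERIOD_BEFORE_DIGIT].freeze

text = +"Pi is 3.14\nroughly" # unary plus yields a mutable string
result = Rule.apply(text, NUMBER_RULES, NEW_LINE)
puts result # Pi is 3∯14ȹroughly
```

Because `apply` mutates its argument and returns the same object, callers can either use the return value or rely on the in-place change, which is how the hunks above use it.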
data/pragmatic_segmenter.gemspec
CHANGED
@@ -19,8 +19,8 @@ Gem::Specification.new do |spec|
|
|
19
19
|
spec.require_paths = ["lib"]
|
20
20
|
|
21
21
|
spec.add_runtime_dependency "unicode"
|
22
|
-
spec.add_development_dependency "bundler", "
|
23
|
-
spec.add_development_dependency "rake", "
|
22
|
+
spec.add_development_dependency "bundler", ">= 1.7"
|
23
|
+
spec.add_development_dependency "rake", ">= 12.3.3"
|
24
24
|
spec.add_development_dependency "rspec"
|
25
25
|
spec.add_development_dependency "stackprof"
|
26
26
|
end
|
data/spec/pragmatic_segmenter/languages/chinese_spec.rb
CHANGED
@@ -7,5 +7,10 @@ RSpec.describe PragmaticSegmenter::Languages::Chinese, '(zh)' do
|
|
7
7
|
ps = PragmaticSegmenter::Segmenter.new(text: "安永已聯繫周怡安親屬,協助辦理簽證相關事宜,周怡安家屬1月1日晚間搭乘東方航空班機抵達上海,他們步入入境大廳時神情落寞、不發一語。周怡安來自台中,去年剛從元智大學畢業,同年9月加入安永。", language: 'zh')
|
8
8
|
expect(ps.segment).to eq(["安永已聯繫周怡安親屬,協助辦理簽證相關事宜,周怡安家屬1月1日晚間搭乘東方航空班機抵達上海,他們步入入境大廳時神情落寞、不發一語。", "周怡安來自台中,去年剛從元智大學畢業,同年9月加入安永。"])
|
9
9
|
end
|
10
|
+
|
11
|
+
it 'correctly segments text #002' do
|
12
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "我们明天一起去看《摔跤吧!爸爸》好吗?好!", language: 'zh')
|
13
|
+
expect(ps.segment).to eq(["我们明天一起去看《摔跤吧!爸爸》好吗?", "好!"])
|
14
|
+
end
|
10
15
|
end
|
11
16
|
end
|
data/spec/pragmatic_segmenter/languages/danish_spec.rb
CHANGED
@@ -507,7 +507,7 @@ RSpec.describe PragmaticSegmenter::Languages::Danish, "(da)" do
|
|
507
507
|
|
508
508
|
it 'correctly segments text #048' do
|
509
509
|
ps = PragmaticSegmenter::Segmenter.new(text: "CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co. (cited from WSJ 05/29/1987)", language: 'en')
|
510
|
-
expect(ps.segment).to eq(["CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co.
|
510
|
+
expect(ps.segment).to eq(["CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co. (cited from WSJ 05/29/1987)"])
|
511
511
|
end
|
512
512
|
|
513
513
|
it 'correctly segments text #049' do
|
data/spec/pragmatic_segmenter/languages/deutsch_spec.rb
CHANGED
@@ -17,6 +17,24 @@ RSpec.describe PragmaticSegmenter::Languages::Deutsch, "(de)" do
|
|
17
17
|
ps = PragmaticSegmenter::Segmenter.new(text: "Was sind die Konsequenzen der Abstimmung vom 12. Juni?", language: "de")
|
18
18
|
expect(ps.segment).to eq(["Was sind die Konsequenzen der Abstimmung vom 12. Juni?"])
|
19
19
|
end
|
20
|
+
|
21
|
+
it "Numbers #004" do
|
22
|
+
# Credit: Dr. Michael Ustaszewski
|
23
|
+
# A numeral followed by a dot within the sentence should not be treated as a sentence,
|
24
|
+
# because the meaning of numeral + dot is that of an ordinal number.
|
25
|
+
# However, if the numeral + dot is in the last position of the sentence, then it is not an ordinal,
|
26
|
+
# but a cardinal number and thus a sentence break should be made. See the following example:
|
27
|
+
|
28
|
+
# <INPUT>Die Information steht auf Seite 12. Dort kannst du nachlesen.</INPUT>
|
29
|
+
# <SHOULDBE>["Die Information steht auf Seite 12.", "Dort kannst du nachlesen."]</SHOULDBE>
|
30
|
+
# The sentence translates as "The information can be found on page 12. You can read it there".
|
31
|
+
|
32
|
+
# That's a tricky one I guess, because the capitalisation of the word following the dot is not necessarily a clue,
|
33
|
+
# since German nouns are usually always capitalised.
|
34
|
+
skip "NOT IMPLEMENTED"
|
35
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Die Information steht auf Seite 12. Dort kannst du nachlesen.", language: "de")
|
36
|
+
expect(ps.segment).to eq(["Die Information steht auf Seite 12.", "Dort kannst du nachlesen."])
|
37
|
+
end
|
20
38
|
end
|
21
39
|
|
22
40
|
# Thanks to Silvia Busse for the German test examples.
|
@@ -512,7 +512,7 @@ RSpec.describe PragmaticSegmenter::Languages::English, "(en)" do
|
|
512
512
|
|
513
513
|
it 'correctly segments text #048' do
|
514
514
|
ps = PragmaticSegmenter::Segmenter.new(text: "CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co. (cited from WSJ 05/29/1987)", language: 'en')
|
515
|
-
expect(ps.segment).to eq(["CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co.
|
515
|
+
expect(ps.segment).to eq(["CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co. (cited from WSJ 05/29/1987)"])
|
516
516
|
end
|
517
517
|
|
518
518
|
it 'correctly segments text #049' do
|
@@ -1406,5 +1406,61 @@ RSpec.describe PragmaticSegmenter::Languages::English, "(en)" do
|
|
1406
1406
|
expect(ps.segment).to eq(["In placebo-controlled studies of all uses of Tracleer, marked decreases in hemoglobin (>15% decrease from baseline resulting in values <11 g/ dL) were observed in 6% of Tracleer-treated patients and 3% of placebo-treated patients.", "Bosentan is highly bound (>98%) to plasma proteins, mainly albumin."])
|
1407
1407
|
end
|
1408
1408
|
|
1409
|
+
it 'correctly segments text #118' do
|
1410
|
+
text = "The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. (“Company B”)."
|
1411
|
+
ps = PragmaticSegmenter::Segmenter.new(text: text, clean: false)
|
1412
|
+
expect(ps.segment).to eq(["The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. (“Company B”)."])
|
1413
|
+
end
|
1414
|
+
|
1415
|
+
it 'correctly segments text #119' do
|
1416
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Unlike the abbreviations i.e. and e.g., viz. is used to indicate a detailed description of something stated before.")
|
1417
|
+
expect(ps.segment).to eq(["Unlike the abbreviations i.e. and e.g., viz. is used to indicate a detailed description of something stated before."])
|
1418
|
+
end
|
1419
|
+
|
1420
|
+
it 'correctly segments text #120' do
|
1421
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "For example, ‘dragonswort… is said that it should be grown in dragon’s blood. It grows at the tops of mountains where there are groves of trees, chiefly in holy places and in the country that is called Apulia’ (translated by Anne Van Arsdall, in Medieval Herbal Remedies: The Old English Herbarium and Anglo-Saxon Medicine p. 154). The Herbal also includes lore about other plants, such as the mandrake.")
|
1422
|
+
expect(ps.segment).to eq(["For example, ‘dragonswort… is said that it should be grown in dragon’s blood. It grows at the tops of mountains where there are groves of trees, chiefly in holy places and in the country that is called Apulia’ (translated by Anne Van Arsdall, in Medieval Herbal Remedies: The Old English Herbarium and Anglo-Saxon Medicine p. 154).", "The Herbal also includes lore about other plants, such as the mandrake."])
|
1423
|
+
end
|
1424
|
+
|
1425
|
+
it 'correctly segments text #121' do
|
1426
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Here’s the - ahem - official citation: Baker, C., Anderson, Kenneth, Martin, James, & Palen, Leysia. Modeling Open Source Software Communities, ProQuest Dissertations and Theses.")
|
1427
|
+
expect(ps.segment).to eq(["Here’s the - ahem - official citation: Baker, C., Anderson, Kenneth, Martin, James, & Palen, Leysia.", "Modeling Open Source Software Communities, ProQuest Dissertations and Theses."])
|
1428
|
+
end
|
1429
|
+
|
1430
|
+
it 'correctly segments text #122' do
|
1431
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "These include images of various modes of transport and members of the team, all available in .jpeg format. Images can be downloaded from our website. We also offer archives as .zip files.")
|
1432
|
+
expect(ps.segment).to eq(["These include images of various modes of transport and members of the team, all available in .jpeg format.", "Images can be downloaded from our website.", "We also offer archives as .zip files."])
|
1433
|
+
end
|
1434
|
+
|
1435
|
+
it 'correctly segments text #123' do
|
1436
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Saint Maximus (died 250) is a Christian saint and martyr.[1] The emperor Decius published a decree ordering the veneration of busts of the deified emperors.")
|
1437
|
+
expect(ps.segment).to eq(["Saint Maximus (died 250) is a Christian saint and martyr.[1]", "The emperor Decius published a decree ordering the veneration of busts of the deified emperors."])
|
1438
|
+
end
|
1439
|
+
|
1440
|
+
it 'correctly segments text #124' do
|
1441
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Differing agendas can potentially create an understanding gap in a consultation.11 12 Take the example of one of the most common presentations in ill health: the common cold.")
|
1442
|
+
expect(ps.segment).to eq(["Differing agendas can potentially create an understanding gap in a consultation.11 12", "Take the example of one of the most common presentations in ill health: the common cold."])
|
1443
|
+
end
|
1444
|
+
|
1445
|
+
it 'correctly segments text #125' do
|
1446
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Daniel Kahneman popularised the concept of fast and slow thinking: the distinction between instinctive (type 1 thinking) and reflective, analytical cognition (type 2).10 This model relates to doctors achieving a balance between efficiency and effectiveness.")
|
1447
|
+
expect(ps.segment).to eq(["Daniel Kahneman popularised the concept of fast and slow thinking: the distinction between instinctive (type 1 thinking) and reflective, analytical cognition (type 2).10", "This model relates to doctors achieving a balance between efficiency and effectiveness."])
|
1448
|
+
end
|
1449
|
+
|
1450
|
+
it 'correctly segments text #126' do
|
1451
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Its traditional use[1] is well documented in the ethnobotanical literature [2–11]. Leaves, buds, tar and essential oils are used to treat a wide spectrum of diseases.")
|
1452
|
+
expect(ps.segment).to eq(["Its traditional use[1] is well documented in the ethnobotanical literature [2–11].", "Leaves, buds, tar and essential oils are used to treat a wide spectrum of diseases."])
|
1453
|
+
end
|
1454
|
+
|
1455
|
+
it 'correctly segments text #127' do
|
1456
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Thus increasing the desire for political reform both in Lancashire and in the country at large.[7][8] This was a serious misdemeanour,[16] encouraging them to declare the assembly illegal as soon as it was announced on 31 July.[17][18] The radicals sought a second opinion on the meeting's legality.")
|
1457
|
+
expect(ps.segment).to eq(["Thus increasing the desire for political reform both in Lancashire and in the country at large.[7][8]", "This was a serious misdemeanour,[16] encouraging them to declare the assembly illegal as soon as it was announced on 31 July.[17][18]", "The radicals sought a second opinion on the meeting's legality."])
|
1458
|
+
end
|
1459
|
+
|
1460
|
+
it 'correctly segments text #128' do
|
1461
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "The table in (4) is a sample from the Wall Street Journal (1987).1 According to the distribution all the pairs given in (4) count as candidates for abbreviations.")
|
1462
|
+
expect(ps.segment).to eq(["The table in (4) is a sample from the Wall Street Journal (1987).1", "According to the distribution all the pairs given in (4) count as candidates for abbreviations."])
|
1463
|
+
|
1464
|
+
end
|
1409
1465
|
end
|
1410
1466
|
end
|
@@ -0,0 +1,73 @@
|
|
1
|
+
require 'spec_helper'
|
2
|
+
|
3
|
+
RSpec.describe PragmaticSegmenter::Languages::Kazakh, "(kk)" do
|
4
|
+
|
5
|
+
context "Golden Rules" do
|
6
|
+
it "Simple period to end sentence #001" do
|
7
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Мұхитқа тікелей шыға алмайтын мемлекеттердің ішінде Қазақстан - ең үлкені.", language: "kk")
|
8
|
+
expect(ps.segment).to eq(["Мұхитқа тікелей шыға алмайтын мемлекеттердің ішінде Қазақстан - ең үлкені."])
|
9
|
+
end
|
10
|
+
|
11
|
+
it "Question mark to end sentence #002" do
|
12
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Оқушылар үйі, Достық даңғылы, Абай даналығы, ауыл шаруашылығы – кім? не?", language: "kk")
|
13
|
+
expect(ps.segment).to eq(["Оқушылар үйі, Достық даңғылы, Абай даналығы, ауыл шаруашылығы – кім?", "не?"])
|
14
|
+
end
|
15
|
+
|
16
|
+
it "Parenthetical inside sentence #003" do
|
17
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Әр түрлі өлшемнің атауы болып табылатын м (метр), см (сантиметр), кг (киллограмм), т (тонна), га (гектар), ц (центнер), т. б. (тағы басқа), тәрізді белгілер де қысқарған сөздер болып табылады.", language: "kk")
|
18
|
+
expect(ps.segment).to eq(["Әр түрлі өлшемнің атауы болып табылатын м (метр), см (сантиметр), кг (киллограмм), т (тонна), га (гектар), ц (центнер), т. б. (тағы басқа), тәрізді белгілер де қысқарған сөздер болып табылады."])
|
19
|
+
end
|
20
|
+
|
21
|
+
it "Two letter abbreviation to end sentence #004" do
|
22
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Мысалы: обкомға (облыстық комитетке) барды, ауаткомда (аудандық атқару комитетінде) болды, педучилищеге (педагогтік училищеге) түсті, медпункттің (медициналық пункттің) алдында т. б.", language: "kk")
|
23
|
+
expect(ps.segment).to eq(["Мысалы: обкомға (облыстық комитетке) барды, ауаткомда (аудандық атқару комитетінде) болды, педучилищеге (педагогтік училищеге) түсті, медпункттің (медициналық пункттің) алдында т. б."])
|
24
|
+
end
|
25
|
+
|
26
|
+
it "Number as non sentence boundary #005" do
|
27
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Елдің жалпы ішкі өнімі ЖІӨ (номинал) = $225.619 млрд (2014)", language: "kk")
|
28
|
+
expect(ps.segment).to eq(["Елдің жалпы ішкі өнімі ЖІӨ (номинал) = $225.619 млрд (2014)"])
|
29
|
+
end
|
30
|
+
|
31
|
+
it "No whitespace between sentence boundary #006" do
|
32
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Ресейдiң әлеуметтiк-экономикалық жағдайы.XVIII ғасырдың бiрiншi ширегiнде Ресейге тән нәрсе.", language: "kk")
|
33
|
+
expect(ps.segment).to eq(["Ресейдiң әлеуметтiк-экономикалық жағдайы.", "XVIII ғасырдың бiрiншi ширегiнде Ресейге тән нәрсе."])
|
34
|
+
end
|
35
|
+
|
36
|
+
it "Dates within sentence #007" do
|
37
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "(«Егемен Қазақстан», 7 қыркүйек 2012 жыл. №590-591); Бұл туралы кеше санпедқадағалау комитетінің облыыстық департаменті хабарлады. («Айқын», 23 сəуір 2010 жыл. № 70).", language: "kk")
|
38
|
+
expect(ps.segment).to eq(["(«Егемен Қазақстан», 7 қыркүйек 2012 жыл. №590-591); Бұл туралы кеше санпедқадағалау комитетінің облыыстық департаменті хабарлады.", "(«Айқын», 23 сəуір 2010 жыл. № 70)."])
|
39
|
+
end
|
40
|
+
|
41
|
+
it "Multi period abbreviation within sentence #008" do
|
42
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Иран революциясы (1905 — 11) және азаматтық қозғалыс (1918 — 21) кезінде А. Фарахани, М. Кермани, М. Т. Бехар, т.б. ақындар демократиялық идеяның жыршысы болды.", language: "kk")
|
43
|
+
expect(ps.segment).to eq(["Иран революциясы (1905 — 11) және азаматтық қозғалыс (1918 — 21) кезінде А. Фарахани, М. Кермани, М. Т. Бехар, т.б. ақындар демократиялық идеяның жыршысы болды."])
|
44
|
+
end
|
45
|
+
|
46
|
+
it "Web addresses #009" do
|
47
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Владимир Федосеев: Аттар магиясы енді жоқ http://www.vremya.ru/2003/179/10/80980.html", language: "kk")
|
48
|
+
expect(ps.segment).to eq(["Владимир Федосеев: Аттар магиясы енді жоқ http://www.vremya.ru/2003/179/10/80980.html"])
|
49
|
+
end
|
50
|
+
|
51
|
+
it "Question mark not at end of sentence #010" do
|
52
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Бірақ оның енді не керегі бар? — деді.", language: "kk")
|
53
|
+
expect(ps.segment).to eq(["Бірақ оның енді не керегі бар? — деді."])
|
54
|
+
end
|
55
|
+
|
56
|
+
it "Exclamation mark not at end of sentence #011" do
|
57
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Сондықтан шапаныма жегізіп отырғаным! - деп, жауап береді.", language: "kk")
|
58
|
+
expect(ps.segment).to eq(["Сондықтан шапаныма жегізіп отырғаным! - деп, жауап береді."])
|
59
|
+
end
|
60
|
+
end
|
61
|
+
|
62
|
+
describe '#segment' do
|
63
|
+
it 'correctly segments text #001' do
|
64
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "Б.з.б. 6 – 3 ғасырларда конфуцийшілдік, моизм, легизм мектептерінің қалыптасуы нәтижесінде Қытай философиясы пайда болды.", language: 'kk')
|
65
|
+
expect(ps.segment).to eq(["Б.з.б. 6 – 3 ғасырларда конфуцийшілдік, моизм, легизм мектептерінің қалыптасуы нәтижесінде Қытай философиясы пайда болды."])
|
66
|
+
end
|
67
|
+
|
68
|
+
it 'correctly segments text #002' do
|
69
|
+
ps = PragmaticSegmenter::Segmenter.new(text: "'Та марбута' тек сөз соңында екі түрде жазылады:", language: "kk")
|
70
|
+
expect(ps.segment).to eq(["'Та марбута' тек сөз соңында екі түрде жазылады:"])
|
71
|
+
end
|
72
|
+
end
|
73
|
+
end
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: pragmatic_segmenter
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.3.
|
4
|
+
version: 0.3.23
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Kevin S. Dias
|
8
|
-
autorequire:
|
8
|
+
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date:
|
11
|
+
date: 2021-05-02 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: unicode
|
@@ -28,30 +28,30 @@ dependencies:
|
|
28
28
|
name: bundler
|
29
29
|
requirement: !ruby/object:Gem::Requirement
|
30
30
|
requirements:
|
31
|
-
- - "
|
31
|
+
- - ">="
|
32
32
|
- !ruby/object:Gem::Version
|
33
33
|
version: '1.7'
|
34
34
|
type: :development
|
35
35
|
prerelease: false
|
36
36
|
version_requirements: !ruby/object:Gem::Requirement
|
37
37
|
requirements:
|
38
|
-
- - "
|
38
|
+
- - ">="
|
39
39
|
- !ruby/object:Gem::Version
|
40
40
|
version: '1.7'
|
41
41
|
- !ruby/object:Gem::Dependency
|
42
42
|
name: rake
|
43
43
|
requirement: !ruby/object:Gem::Requirement
|
44
44
|
requirements:
|
45
|
-
- - "
|
45
|
+
- - ">="
|
46
46
|
- !ruby/object:Gem::Version
|
47
|
-
version:
|
47
|
+
version: 12.3.3
|
48
48
|
type: :development
|
49
49
|
prerelease: false
|
50
50
|
version_requirements: !ruby/object:Gem::Requirement
|
51
51
|
requirements:
|
52
|
-
- - "
|
52
|
+
- - ">="
|
53
53
|
- !ruby/object:Gem::Version
|
54
|
-
version:
|
54
|
+
version: 12.3.3
|
55
55
|
- !ruby/object:Gem::Dependency
|
56
56
|
name: rspec
|
57
57
|
requirement: !ruby/object:Gem::Requirement
|
@@ -124,6 +124,7 @@ files:
|
|
124
124
|
- lib/pragmatic_segmenter/languages/hindi.rb
|
125
125
|
- lib/pragmatic_segmenter/languages/italian.rb
|
126
126
|
- lib/pragmatic_segmenter/languages/japanese.rb
|
127
|
+
- lib/pragmatic_segmenter/languages/kazakh.rb
|
127
128
|
- lib/pragmatic_segmenter/languages/persian.rb
|
128
129
|
- lib/pragmatic_segmenter/languages/polish.rb
|
129
130
|
- lib/pragmatic_segmenter/languages/russian.rb
|
@@ -152,6 +153,7 @@ files:
|
|
152
153
|
- spec/pragmatic_segmenter/languages/hindi_spec.rb
|
153
154
|
- spec/pragmatic_segmenter/languages/italian_spec.rb
|
154
155
|
- spec/pragmatic_segmenter/languages/japanese_spec.rb
|
156
|
+
- spec/pragmatic_segmenter/languages/kazakh_spec.rb
|
155
157
|
- spec/pragmatic_segmenter/languages/persian_spec.rb
|
156
158
|
- spec/pragmatic_segmenter/languages/polish_spec.rb
|
157
159
|
- spec/pragmatic_segmenter/languages/russian_spec.rb
|
@@ -164,7 +166,7 @@ homepage: https://github.com/diasks2/pragmatic_segmenter
|
|
164
166
|
licenses:
|
165
167
|
- MIT
|
166
168
|
metadata: {}
|
167
|
-
post_install_message:
|
169
|
+
post_install_message:
|
168
170
|
rdoc_options: []
|
169
171
|
require_paths:
|
170
172
|
- lib
|
@@ -179,9 +181,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
179
181
|
- !ruby/object:Gem::Version
|
180
182
|
version: '0'
|
181
183
|
requirements: []
|
182
|
-
rubyforge_project:
|
183
|
-
rubygems_version: 2.
|
184
|
-
signing_key:
|
184
|
+
rubyforge_project:
|
185
|
+
rubygems_version: 2.7.6
|
186
|
+
signing_key:
|
185
187
|
specification_version: 4
|
186
188
|
summary: A rule-based sentence boundary detection gem that works out-of-the-box across
|
187
189
|
many languages
|
@@ -202,6 +204,7 @@ test_files:
|
|
202
204
|
- spec/pragmatic_segmenter/languages/hindi_spec.rb
|
203
205
|
- spec/pragmatic_segmenter/languages/italian_spec.rb
|
204
206
|
- spec/pragmatic_segmenter/languages/japanese_spec.rb
|
207
|
+
- spec/pragmatic_segmenter/languages/kazakh_spec.rb
|
205
208
|
- spec/pragmatic_segmenter/languages/persian_spec.rb
|
206
209
|
- spec/pragmatic_segmenter/languages/polish_spec.rb
|
207
210
|
- spec/pragmatic_segmenter/languages/russian_spec.rb
|