pragmatic_tokenizer 0.5.0 → 1.0.0

Files changed (35)
  1. checksums.yaml +4 -4
  2. data/README.md +133 -151
  3. data/lib/pragmatic_tokenizer/ending_punctuation_separator.rb +31 -0
  4. data/lib/pragmatic_tokenizer/full_stop_separator.rb +38 -0
  5. data/lib/pragmatic_tokenizer/languages/arabic.rb +3 -3
  6. data/lib/pragmatic_tokenizer/languages/bulgarian.rb +3 -3
  7. data/lib/pragmatic_tokenizer/languages/catalan.rb +3 -3
  8. data/lib/pragmatic_tokenizer/languages/common.rb +14 -8
  9. data/lib/pragmatic_tokenizer/languages/czech.rb +3 -3
  10. data/lib/pragmatic_tokenizer/languages/danish.rb +3 -3
  11. data/lib/pragmatic_tokenizer/languages/deutsch.rb +2 -2
  12. data/lib/pragmatic_tokenizer/languages/dutch.rb +3 -3
  13. data/lib/pragmatic_tokenizer/languages/english.rb +2 -2
  14. data/lib/pragmatic_tokenizer/languages/finnish.rb +3 -3
  15. data/lib/pragmatic_tokenizer/languages/french.rb +3 -3
  16. data/lib/pragmatic_tokenizer/languages/greek.rb +3 -3
  17. data/lib/pragmatic_tokenizer/languages/indonesian.rb +3 -3
  18. data/lib/pragmatic_tokenizer/languages/italian.rb +3 -3
  19. data/lib/pragmatic_tokenizer/languages/latvian.rb +3 -3
  20. data/lib/pragmatic_tokenizer/languages/norwegian.rb +3 -3
  21. data/lib/pragmatic_tokenizer/languages/persian.rb +3 -3
  22. data/lib/pragmatic_tokenizer/languages/polish.rb +3 -3
  23. data/lib/pragmatic_tokenizer/languages/portuguese.rb +3 -3
  24. data/lib/pragmatic_tokenizer/languages/romanian.rb +3 -3
  25. data/lib/pragmatic_tokenizer/languages/russian.rb +3 -3
  26. data/lib/pragmatic_tokenizer/languages/slovak.rb +3 -3
  27. data/lib/pragmatic_tokenizer/languages/spanish.rb +3 -3
  28. data/lib/pragmatic_tokenizer/languages/swedish.rb +3 -3
  29. data/lib/pragmatic_tokenizer/languages/turkish.rb +3 -3
  30. data/lib/pragmatic_tokenizer/languages.rb +0 -2
  31. data/lib/pragmatic_tokenizer/post_processor.rb +49 -0
  32. data/lib/pragmatic_tokenizer/{processor.rb → pre_processor.rb} +35 -98
  33. data/lib/pragmatic_tokenizer/tokenizer.rb +186 -159
  34. data/lib/pragmatic_tokenizer/version.rb +1 -1
  35. metadata +6 -3
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: e86a121879d806b58f855e311c14be249ba6ce95
- data.tar.gz: 9be42b0a437ddaa0e03630d0fd6eee64f242bc9a
+ metadata.gz: c4834da7c6c1b1d6c614226840bb2fd5ef8b48b6
+ data.tar.gz: 395868d67e973b2a6e9e28b4b9883c95d1746fe6
  SHA512:
- metadata.gz: 9f8fbf0b2de1674c557144568dc771fadcb882b892318eae04c7aae3f1ec53743f29dd208c971c45707ded50d88304aaa824ca2f5364fc65b96bd0b72d93e0d6
- data.tar.gz: 39ee1f3e32cd243ef28c4f6b0823aa4cc523ca8a2adbb90cf6feca33fb966ccd452b70166f9ab7d2b64d859165b799c6ba3d45a410a2d5ff7116210170e77d02
+ metadata.gz: cc69a6f19545c9f5755df5c996e0625f0e65883fea81f01a877d10fce5f5b4eba8931529aecff9afb2ce56f8b993350d9bad15a94a5bb718db4eeafbbe611a29
+ data.tar.gz: f08442f148d59d98d3970e50ccc3bab2d59c1728fb06d9fefe4670ff5b4aca688168c81c30da3a49d1d800d8b398d53e76d9482c120450f2edc90b1b3c174617
data/README.md CHANGED
@@ -26,36 +26,70 @@ Or install it yourself as:
  * To specify a language use its two character [ISO 639-1 code](https://www.tm-town.com/languages).
  * Pragmatic Tokenizer will unescape any HTML entities.
 
+ **Example Usage**
+ ```ruby
+ text = "\"I said, 'what're you? Crazy?'\" said Sandowsky. \"I can't afford to do that.\""
+
+ PragmaticTokenizer::Tokenizer.new(text).tokenize
+ # => ["\"", "i", "said", ",", "'", "what're", "you", "?", "crazy", "?", "'", "\"", "said", "sandowsky", ".", "\"", "i", "can't", "afford", "to", "do", "that", ".", "\""]
+
+ # You can pass many different options:
+ options = {
+   language: :en, # the language of the string you are tokenizing
+   abbreviations: ['a.b', 'a'], # a user-supplied array of abbreviations (downcased with ending period removed)
+   stop_words: ['is', 'the'], # a user-supplied array of stop words (downcased)
+   remove_stop_words: true, # remove stop words
+   contractions: { "i'm" => "i am" }, # a user-supplied hash of contractions (key is the contracted form; value is the expanded form - both the key and value should be downcased)
+   expand_contractions: true, # (i.e. ["isn't"] will change to two tokens ["is", "not"])
+   filter_languages: [:en, :de], # process abbreviations, contractions and stop words for this array of languages
+   punctuation: :none, # see below for more details
+   numbers: :none, # see below for more details
+   remove_emoji: true, # remove any emoji tokens
+   remove_urls: true, # remove any URLs
+   remove_emails: true, # remove any emails
+   remove_domains: true, # remove any domains
+   hashtags: :keep_and_clean, # remove the hashtag prefix
+   mentions: :keep_and_clean, # remove the @ prefix
+   clean: true, # remove some special characters
+   classic_filter: true, # removes dots from acronyms and 's from the end of tokens
+   downcase: false, # do not downcase tokens
+   minimum_length: 3, # remove any tokens shorter than 3 characters
+   long_word_split: 10 # split tokens longer than 10 characters at hyphens or underscores
+ }
+ ```
+
  **Options**
 
- ##### `punctuation`
- **default** = `'all'`
- - `'all'`
- Does not remove any punctuation from the result.
- - `'semi'`
- Removes full stops (i.e. periods) ['。', '.', '.'].
- - `'none'`
- Removes all punctuation from the result.
- - `'only'`
- Removes everything except punctuation. The returned result is an array of only the punctuation.
+ ##### `language`
+ **default** = `'en'`
+ - To specify a language use its two character [ISO 639-1 code](https://www.tm-town.com/languages) as a symbol (i.e. `:en`) or string (i.e. `'en'`)
 
  <hr>
 
- ##### `remove_stop_words`
- **default** = `'false'`
- - `true`
- Removes all stop words.
- - `false`
- Does not remove stop words.
+ ##### `abbreviations`
+ **default** = `nil`
+ - You can pass an array of abbreviations to override or complement the abbreviations that come stored in this gem. Each element of the array should be a downcased String with the ending period removed.
+
+ <hr>
+
+ ##### `stop_words`
+ **default** = `nil`
+ - You can pass an array of stop words to override or complement the stop words that come stored in this gem. Each element of the array should be a downcased String.
+
+ <hr>
+
+ ##### `contractions`
+ **default** = `nil`
+ - You can pass a hash of contractions to override or complement the contractions that come stored in this gem. Each key is the contracted form downcased and each value is the expanded form downcased.
 
  <hr>
 
- ##### `remove_en_stop_words`
+ ##### `remove_stop_words`
  **default** = `'false'`
  - `true`
- Removes all English stop words (sometimes foreign language strings have English mixed in).
+ Removes all stop words.
  - `false`
- Does not remove English stop words.
+ Does not remove stop words.
 
  <hr>
 
@@ -68,180 +102,128 @@ Or install it yourself as:
 
  <hr>
 
- ##### `clean`
+ ##### `filter_languages`
+ **default** = `nil`
+ - You can pass an array of languages for which you would like to process abbreviations, stop words and contractions. These languages can be independent of the language of the string you are tokenizing (for example, your text might be German but contain some English stop words that you want to remove). If you supply your own abbreviations, stop words or contractions, they will be merged with the abbreviations, stop words and contractions of any languages you add in this option. You can pass an array of symbols or strings (i.e. `[:en, :de]` or `['en', 'de']`).
+
+ <hr>
+
+ ##### `punctuation`
+ **default** = `'all'`
+ - `:all`
+ Does not remove any punctuation from the result.
+ - `:semi`
+ Removes full stops (i.e. periods) ['。', '.', '.'].
+ - `:none`
+ Removes all punctuation from the result.
+ - `:only`
+ Removes everything except punctuation. The returned result is an array of only the punctuation.
+
+ <hr>
+
+ ##### `numbers`
+ **default** = `'all'`
+ - `:all`
+ Does not remove any numbers from the result.
+ - `:semi`
+ Removes tokens that include only digits.
+ - `:none`
+ Removes all tokens that include a number from the result (including Roman numerals).
+ - `:only`
+ Removes everything except tokens that include a number.
+
+ <hr>
+
+ ##### `remove_emoji`
  **default** = `'false'`
  - `true`
- Removes tokens consisting of only hypens, underscores, or periods as well as some special characters (®, ©, ™). Also removes long tokens or tokens with a backslash.
+ Removes any token that contains an emoji.
  - `false`
  Leaves tokens as is.
 
  <hr>
 
- ##### `remove_numbers`
+ ##### `remove_urls`
  **default** = `'false'`
  - `true`
- Removes any token that contains a number.
+ Removes any token that contains a URL.
  - `false`
  Leaves tokens as is.
 
  <hr>
 
- ##### `remove_roman_numerals`
+ ##### `remove_emails`
  **default** = `'false'`
  - `true`
- Removes any token that contains a Roman numeral.
+ Removes any token that contains an email.
  - `false`
  Leaves tokens as is.
 
  <hr>
 
- ##### `downcase`
- **default** = `'true'`
-
- <hr>
-
- ##### `minimum_length`
- **default** = `0`
- The minimum number of characters a token should be.
-
- **Methods**
-
- #### `#tokenize`
-
- **Example Usage**
- ```ruby
- text = "\"I said, 'what're you? Crazy?'\" said Sandowsky. \"I can't afford to do that.\""
-
- PragmaticTokenizer::Tokenizer.new(text).tokenize
- # => ["\"", "i", "said", ",", "'", "what're", "you", "?", "crazy", "?", "'", "\"", "said", "sandowsky", ".", "\"", "i", "can't", "afford", "to", "do", "that", ".", "\""]
-
- PragmaticTokenizer::Tokenizer.new(text, remove_stop_words: true).tokenize
- # => ["\"", ",", "'", "what're", "?", "crazy", "?", "'", "\"", "sandowsky", ".", "\"", "afford", ".", "\""]
-
- PragmaticTokenizer::Tokenizer.new(text, punctuation: 'none').tokenize
- # => ["i", "said", "what're", "you", "crazy", "said", "sandowsky", "i", "can't", "afford", "to", "do", "that"]
-
- PragmaticTokenizer::Tokenizer.new(text, punctuation: 'only').tokenize
- # => ["\"", ",", "'", "?", "?", "'", "\"", ".", "\"", ".", "\""]
-
- PragmaticTokenizer::Tokenizer.new(text, punctuation: 'semi').tokenize
- # => ["\"", "i", "said", ",", "'", "what're", "you", "?", "crazy", "?", "'", "\"", "said", "sandowsky", "\"", "i", "can't", "afford", "to", "do", "that", "\""]
-
- PragmaticTokenizer::Tokenizer.new(text, expand_contractions: true).tokenize
- # => ['"', 'i', 'said', ',', "'", 'what', 'are', 'you', '?', 'crazy', '?', "'", '"', 'said', 'sandowsky', '.', '"', 'i', 'cannot', 'afford', 'to', 'do', 'that', '.', '"']
-
- PragmaticTokenizer::Tokenizer.new(text,
-   expand_contractions: true,
-   remove_stop_words: true,
-   punctuation: 'none'
- ).tokenize
- # => ["crazy", "sandowsky", "afford"]
-
- text = "The price is $5.50 and it works for 5 hours."
- PragmaticTokenizer::Tokenizer.new(text, remove_numbers: true).tokenize
- # => ["the", "price", "is", "and", "it", "works", "for", "hours", "."]
-
- text = "Hello ______ ."
- PragmaticTokenizer::Tokenizer.new(text, clean: true).tokenize
- # => ["hello", "."]
-
- text = "Let's test the minimum length."
- PragmaticTokenizer::Tokenizer.new(text, minimum_length: 6).tokenize
- # => ["minimum", "length"]
- ```
+ ##### `remove_domains`
+ **default** = `'false'`
+ - `true`
+ Removes any token that contains a domain.
+ - `false`
+ Leaves tokens as is.
 
  <hr>
 
- #### `#urls`
- Extract only valid URL tokens
-
- **Example Usage**
- ```ruby
- text = "Go to http://www.example.com"
-
- PragmaticTokenizer::Tokenizer.new(text).urls
- # => ["http://www.example.com"]
- ```
+ ##### `clean`
+ **default** = `'false'`
+ - `true`
+ Removes tokens consisting of only hyphens, underscores, or periods as well as some special characters (®, ©, ™). Also removes long tokens or tokens with a backslash.
+ - `false`
+ Leaves tokens as is.
 
  <hr>
 
- #### `#domains`
- Extract only valid domain tokens
-
- **Example Usage**
- ```ruby
- text = "See the breaking news stories about X on cnn.com/europe and english.alarabiya.net, here’s a screenshot: https://t.co/s83k28f29d31s83"
-
- PragmaticTokenizer::Tokenizer.new(text).urls
- # => ["cnn.com/europe", "english.alarabiya.net"]
- ```
+ ##### `hashtags`
+ **default** = `'keep_original'`
+ - `:keep_original`
+ Does not alter the token at all.
+ - `:keep_and_clean`
+ Removes the hashtag (#) prefix from the token.
+ - `:remove`
+ Removes the token completely.
 
  <hr>
 
- #### `#emails`
- Extract only valid email tokens
-
- **Example Usage**
- ```ruby
- text = "Please email example@example.com for more info."
-
- PragmaticTokenizer::Tokenizer.new(text).emails
- # => ["example@example.com"]
- ```
+ ##### `mentions`
+ **default** = `'keep_original'`
+ - `:keep_original`
+ Does not alter the token at all.
+ - `:keep_and_clean`
+ Removes the mention (@) prefix from the token.
+ - `:remove`
+ Removes the token completely.
 
  <hr>
 
- #### `#hashtags`
- Extract only valid hashtag tokens
-
- **Example Usage**
- ```ruby
- text = "Find me all the #fun #hashtags and give me #backallofthem."
-
- PragmaticTokenizer::Tokenizer.new(text).hashtags
- # => ["#fun", "#hashtags", "#backallofthem"]
- ```
+ ##### `classic_filter`
+ **default** = `'false'`
+ - `true`
+ Removes dots from acronyms and 's from the end of tokens.
+ - `false`
+ Leaves tokens as is.
 
  <hr>
 
- #### `#mentions`
- Extract only valid @ mention tokens
-
- **Example Usage**
- ```ruby
- text = "Find me all the @awesome mentions."
-
- PragmaticTokenizer::Tokenizer.new(text).hashtags
- # => ["@awesome"]
- ```
+ ##### `downcase`
+ **default** = `'true'`
 
  <hr>
 
- #### `#emoticons`
- Extract only simple emoticon tokens
-
- **Example Usage**
- ```ruby
- text = "Hello ;-) :) 😄"
-
- PragmaticTokenizer::Tokenizer.new(text).emoticons
- # => [";-)", ":)"]
- ```
+ ##### `minimum_length`
+ **default** = `0`
+ The minimum number of characters a token should be.
 
  <hr>
 
- #### `#emoji`
- Extract only valid† emoji tokens
-
- *†matches all 1012 single-character Unicode Emoji (all except for two-character flags)*
-
- **Example Usage**
- ```ruby
- text = "Return the emoji 👿😍😱🐔🌚."
-
- PragmaticTokenizer::Tokenizer.new(text).emoticons
- # => ["👿", "😍", "😱", "🐔", "🌚"]
- ```
+ ##### `long_word_split`
+ **default** = `nil`
+ The number of characters after which a token should be split at hyphens or underscores.
 
  ## Language Support
 
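The `long_word_split` rule documented above can be sketched on its own. A minimal, hypothetical illustration (not the gem's internal implementation), assuming tokens have already been split on whitespace:

```ruby
# Hypothetical sketch of the long_word_split rule described above: any token
# longer than `limit` characters is split at hyphens or underscores.
def long_word_split(tokens, limit)
  tokens.flat_map do |token|
    token.length > limit ? token.split(/[-_]/) : [token]
  end
end

long_word_split(["extra-long-hyphenated-token", "short"], 10)
# => ["extra", "long", "hyphenated", "token", "short"]
```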
@@ -0,0 +1,31 @@
+ # -*- encoding : utf-8 -*-
+
+ module PragmaticTokenizer
+   # This class separates ending punctuation from a token
+   class EndingPunctuationSeparator
+     attr_reader :tokens
+     def initialize(tokens:)
+       @tokens = tokens
+     end
+
+     def separate
+       cleaned_tokens = []
+       tokens.each do |a|
+         split_punctuation = a.scan(/(?<=\S)[。.!!??]+$/)
+         if split_punctuation[0].nil?
+           cleaned_tokens << a
+         else
+           cleaned_tokens << a.tr(split_punctuation[0], '')
+           if split_punctuation[0].length.eql?(1)
+             cleaned_tokens << split_punctuation[0]
+           else
+             split_punctuation[0].split("").each do |s|
+               cleaned_tokens << s
+             end
+           end
+         end
+       end
+       cleaned_tokens
+     end
+   end
+ end
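The separation logic in the class above can be exercised without the gem. A standalone sketch of the same idea (simplified: `sub` in place of `tr`, and an ASCII-leaning punctuation set):

```ruby
# Standalone sketch of the ending-punctuation separation shown above:
# trailing sentence-final marks are stripped from a token and emitted one per
# token. A token that is a single punctuation mark is left intact, because the
# (?<=\S) lookbehind requires a non-space character before the trailing run.
def separate_ending_punctuation(tokens)
  tokens.flat_map do |token|
    trailing = token[/(?<=\S)[.!?。!?]+\z/]
    next [token] unless trailing
    [token.sub(/[.!?。!?]+\z/, ''), *trailing.chars]
  end
end

separate_ending_punctuation(["Hello!!", "world"])
# => ["Hello", "!", "!", "world"]
```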
@@ -0,0 +1,38 @@
+ # -*- encoding : utf-8 -*-
+
+ module PragmaticTokenizer
+   # This class separates true full stops while ignoring
+   # periods that are part of an abbreviation
+   class FullStopSeparator
+     attr_reader :tokens, :abbreviations
+     def initialize(tokens:, abbreviations:)
+       @tokens = tokens
+       @abbreviations = abbreviations
+     end
+
+     def separate
+       abbr = {}
+       abbreviations.each do |i|
+         abbr[i] = true
+       end
+       cleaned_tokens = []
+       tokens.each_with_index do |_t, i|
+         if tokens[i + 1] && tokens[i] =~ /\A(.+)\.\z/
+           w = $1
+           unless abbr[Unicode::downcase(w)] || w =~ /\A[a-z]\z/i ||
+                  w =~ /[a-z](?:\.[a-z])+\z/i
+             cleaned_tokens << w
+             cleaned_tokens << '.'
+             next
+           end
+         end
+         cleaned_tokens << tokens[i]
+       end
+       if cleaned_tokens[-1] && cleaned_tokens[-1] =~ /\A(.*\w)\.\z/
+         cleaned_tokens[-1] = $1
+         cleaned_tokens.push '.'
+       end
+       cleaned_tokens
+     end
+   end
+ end
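The abbreviation-aware splitting above can also be sketched standalone. This simplified version uses plain `String#downcase` where the class above uses `Unicode::downcase` from the unicode gem, and it applies the same rule to the final token rather than handling it separately:

```ruby
# Standalone sketch of the full-stop separation shown above: a trailing period
# is split off unless the token is a known abbreviation, a single letter, or
# an inner-period acronym like "u.s.a".
def separate_full_stops(tokens, abbreviations)
  abbr = abbreviations.to_h { |a| [a, true] }
  tokens.flat_map do |token|
    word = token[/\A(.+)\.\z/, 1]
    keep = word.nil? || abbr[word.downcase] ||
           word =~ /\A[a-z]\z/i || word =~ /[a-z](?:\.[a-z])+\z/i
    keep ? [token] : [word, '.']
  end
end

separate_full_stops(["Mr.", "Smith", "arrived."], ["mr"])
# => ["Mr.", "Smith", "arrived", "."]
```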
@@ -2,9 +2,9 @@ module PragmaticTokenizer
  module Languages
  module Arabic
  include Languages::Common
- ABBREVIATIONS = ['ا', 'ا. د', 'ا.د', 'ا.ش.ا', 'ا.ش.ا', 'إلخ', 'ت.ب', 'ت.ب', 'ج.ب', 'جم', 'ج.ب', 'ج.م.ع', 'ج.م.ع', 'س.ت', 'س.ت', 'سم', 'ص.ب.', 'ص.ب', 'كج.', 'كلم.', 'م', 'م.ب', 'م.ب', 'ه', 'د‪']
- STOP_WORDS = ["فى", "في", "كل", "لم", "لن", "له", "من", "هو", "هي", "قوة", "كما", "لها", "منذ", "وقد", "ولا", "نفسه", "لقاء", "مقابل", "هناك", "وقال", "وكان", "نهاية", "وقالت", "وكانت", "للامم", "فيه", "كلم", "لكن", "وفي", "وقف", "ولم", "ومن", "وهو", "وهي", "يوم", "فيها", "منها", "مليار", "لوكالة", "يكون", "يمكن", "مليون", "حيث", "اكد", "الا", "اما", "امس", "السابق", "التى", "التي", "اكثر", "ايار", "ايضا", "ثلاثة", "الذاتي", "الاخيرة", "الثاني", "الثانية", "الذى", "الذي", "الان", "امام", "ايام", "خلال", "حوالى", "الذين", "الاول", "الاولى", "بين", "ذلك", "دون", "حول", "حين", "الف", "الى", "انه", "اول", "ضمن", "انها", "جميع", "الماضي", "الوقت", "المقبل", "اليوم", "ـ", "ف", "و", "و6", "قد", "لا", "ما", "مع", "مساء", "هذا", "واحد", "واضاف", "واضافت", "فان", "قبل", "قال", "كان", "لدى", "نحو", "هذه", "وان", "واكد", "كانت", "واوضح", "مايو", "ب", "ا", "أ", "،", "عشر", "عدد", "عدة", "عشرة", "عدم", "عام", "عاما", "عن", "عند", "عندما", "على", "عليه", "عليها", "زيارة", "سنة", "سنوات", "تم", "ضد", "بعد", "بعض", "اعادة", "اعلنت", "بسبب", "حتى", "اذا", "احد", "اثر", "برس", "باسم", "غدا", "شخصا", "صباح", "اطار", "اربعة", "اخرى", "بان", "اجل", "غير", "بشكل", "حاليا", "بن", "به", "ثم", "اف", "ان", "او", "اي", "بها", "صفر", "فى"]
- CONTRACTIONS = {}
+ ABBREVIATIONS = ['ا', 'ا. د', 'ا.د', 'ا.ش.ا', 'ا.ش.ا', 'إلخ', 'ت.ب', 'ت.ب', 'ج.ب', 'جم', 'ج.ب', 'ج.م.ع', 'ج.م.ع', 'س.ت', 'س.ت', 'سم', 'ص.ب.', 'ص.ب', 'كج.', 'كلم.', 'م', 'م.ب', 'م.ب', 'ه', 'د‪'].freeze
+ STOP_WORDS = ["فى", "في", "كل", "لم", "لن", "له", "من", "هو", "هي", "قوة", "كما", "لها", "منذ", "وقد", "ولا", "نفسه", "لقاء", "مقابل", "هناك", "وقال", "وكان", "نهاية", "وقالت", "وكانت", "للامم", "فيه", "كلم", "لكن", "وفي", "وقف", "ولم", "ومن", "وهو", "وهي", "يوم", "فيها", "منها", "مليار", "لوكالة", "يكون", "يمكن", "مليون", "حيث", "اكد", "الا", "اما", "امس", "السابق", "التى", "التي", "اكثر", "ايار", "ايضا", "ثلاثة", "الذاتي", "الاخيرة", "الثاني", "الثانية", "الذى", "الذي", "الان", "امام", "ايام", "خلال", "حوالى", "الذين", "الاول", "الاولى", "بين", "ذلك", "دون", "حول", "حين", "الف", "الى", "انه", "اول", "ضمن", "انها", "جميع", "الماضي", "الوقت", "المقبل", "اليوم", "ـ", "ف", "و", "و6", "قد", "لا", "ما", "مع", "مساء", "هذا", "واحد", "واضاف", "واضافت", "فان", "قبل", "قال", "كان", "لدى", "نحو", "هذه", "وان", "واكد", "كانت", "واوضح", "مايو", "ب", "ا", "أ", "،", "عشر", "عدد", "عدة", "عشرة", "عدم", "عام", "عاما", "عن", "عند", "عندما", "على", "عليه", "عليها", "زيارة", "سنة", "سنوات", "تم", "ضد", "بعد", "بعض", "اعادة", "اعلنت", "بسبب", "حتى", "اذا", "احد", "اثر", "برس", "باسم", "غدا", "شخصا", "صباح", "اطار", "اربعة", "اخرى", "بان", "اجل", "غير", "بشكل", "حاليا", "بن", "به", "ثم", "اف", "ان", "او", "اي", "بها", "صفر", "فى"].freeze
+ CONTRACTIONS = {}.freeze
  end
  end
  end
@@ -2,9 +2,9 @@ module PragmaticTokenizer
  module Languages
  module Bulgarian
  include Languages::Common
- ABBREVIATIONS = ["акад", "ал", "б.р", "б.ред", "бел.а", "бел.пр", "бр", "бул", "в", "вж", "вкл", "вм", "вр", "г", "ген", "гр", "дж", "дм", "доц", "др", "ем", "заб", "зам", "инж", "к.с", "кв", "кв.м", "кг", "км", "кор", "куб", "куб.м", "л", "лв", "м", "м.г", "мин", "млн", "млрд", "мм", "н.с", "напр", "пл", "полк", "проф", "р", "рис", "с", "св", "сек", "см", "сп", "срв", "ст", "стр", "т", "т.г", "т.е", "т.н", "т.нар", "табл", "тел", "у", "ул", "фиг", "ха", "хил", "ч", "чл", "щ.д"]
- STOP_WORDS = ["а", "автентичен", "аз", "ако", "ала", "бе", "без", "беше", "би", "бивш", "бивша", "бившо", "бил", "била", "били", "било", "благодаря", "близо", "бъдат", "бъде", "бяха", "в", "вас", "ваш", "ваша", "вероятно", "вече", "взема", "ви", "вие", "винаги", "внимава", "време", "все", "всеки", "всички", "всичко", "всяка", "във", "въпреки", "върху", "г", "г.", "ги", "главен", "главна", "главно", "глас", "го", "година", "години", "годишен", "д", "да", "дали", "два", "двама", "двамата", "две", "двете", "ден", "днес", "дни", "до", "добра", "добре", "добро", "добър", "докато", "докога", "дори", "досега", "доста", "друг", "друга", "други", "е", "евтин", "едва", "един", "една", "еднаква", "еднакви", "еднакъв", "едно", "екип", "ето", "живот", "за", "забавям", "зад", "заедно", "заради", "засега", "заспал", "затова", "защо", "защото", "и", "из", "или", "им", "има", "имат", "иска", "й", "каза", "как", "каква", "какво", "както", "какъв", "като", "кога", "когато", "което", "които", "кой", "който", "колко", "която", "къде", "където", "към", "лесен", "лесно", "ли", "лош", "м", "май", "малко", "ме", "между", "мек", "мен", "месец", "ми", "много", "мнозина", "мога", "могат", "може", "мокър", "моля", "момента", "му", "н", "на", "над", "назад", "най", "направи", "напред", "например", "нас", "не", "него", "нещо", "нея", "ни", "ние", "никой", "нито", "нищо", "но", "нов", "нова", "нови", "новина", "някои", "някой", "няколко", "няма", "обаче", "около", "освен", "особено", "от", "отгоре", "отново", "още", "пак", "по", "повече", "повечето", "под", "поне", "поради", "после", "почти", "прави", "пред", "преди", "през", "при", "пък", "първата", "първи", "първо", "пъти", "равен", "равна", "с", "са", "сам", "само", "се", "сега", "си", "син", "скоро", "след", "следващ", "сме", "смях", "според", "сред", "срещу", "сте", "съм", "със", "също", "т", "т.н.", "тази", "така", "такива", "такъв", "там", "твой", "те", "тези", "ти", "то", "това", "тогава", "този", "той", "толкова", "точно", "три", 
"трябва", "тук", "тъй", "тя", "тях", "у", "утре", "харесва", "хиляди", "ч", "часа", "че", "често", "чрез", "ще", "щом", "юмрук", "я", "як"]
- CONTRACTIONS = {}
+ ABBREVIATIONS = ["акад", "ал", "б.р", "б.ред", "бел.а", "бел.пр", "бр", "бул", "в", "вж", "вкл", "вм", "вр", "г", "ген", "гр", "дж", "дм", "доц", "др", "ем", "заб", "зам", "инж", "к.с", "кв", "кв.м", "кг", "км", "кор", "куб", "куб.м", "л", "лв", "м", "м.г", "мин", "млн", "млрд", "мм", "н.с", "напр", "пл", "полк", "проф", "р", "рис", "с", "св", "сек", "см", "сп", "срв", "ст", "стр", "т", "т.г", "т.е", "т.н", "т.нар", "табл", "тел", "у", "ул", "фиг", "ха", "хил", "ч", "чл", "щ.д"].freeze
+ STOP_WORDS = ["а", "автентичен", "аз", "ако", "ала", "бе", "без", "беше", "би", "бивш", "бивша", "бившо", "бил", "била", "били", "било", "благодаря", "близо", "бъдат", "бъде", "бяха", "в", "вас", "ваш", "ваша", "вероятно", "вече", "взема", "ви", "вие", "винаги", "внимава", "време", "все", "всеки", "всички", "всичко", "всяка", "във", "въпреки", "върху", "г", "г.", "ги", "главен", "главна", "главно", "глас", "го", "година", "години", "годишен", "д", "да", "дали", "два", "двама", "двамата", "две", "двете", "ден", "днес", "дни", "до", "добра", "добре", "добро", "добър", "докато", "докога", "дори", "досега", "доста", "друг", "друга", "други", "е", "евтин", "едва", "един", "една", "еднаква", "еднакви", "еднакъв", "едно", "екип", "ето", "живот", "за", "забавям", "зад", "заедно", "заради", "засега", "заспал", "затова", "защо", "защото", "и", "из", "или", "им", "има", "имат", "иска", "й", "каза", "как", "каква", "какво", "както", "какъв", "като", "кога", "когато", "което", "които", "кой", "който", "колко", "която", "къде", "където", "към", "лесен", "лесно", "ли", "лош", "м", "май", "малко", "ме", "между", "мек", "мен", "месец", "ми", "много", "мнозина", "мога", "могат", "може", "мокър", "моля", "момента", "му", "н", "на", "над", "назад", "най", "направи", "напред", "например", "нас", "не", "него", "нещо", "нея", "ни", "ние", "никой", "нито", "нищо", "но", "нов", "нова", "нови", "новина", "някои", "някой", "няколко", "няма", "обаче", "около", "освен", "особено", "от", "отгоре", "отново", "още", "пак", "по", "повече", "повечето", "под", "поне", "поради", "после", "почти", "прави", "пред", "преди", "през", "при", "пък", "първата", "първи", "първо", "пъти", "равен", "равна", "с", "са", "сам", "само", "се", "сега", "си", "син", "скоро", "след", "следващ", "сме", "смях", "според", "сред", "срещу", "сте", "съм", "със", "също", "т", "т.н.", "тази", "така", "такива", "такъв", "там", "твой", "те", "тези", "ти", "то", "това", "тогава", "този", "той", "толкова", "точно", "три", 
"трябва", "тук", "тъй", "тя", "тях", "у", "утре", "харесва", "хиляди", "ч", "часа", "че", "често", "чрез", "ще", "щом", "юмрук", "я", "як"].freeze
+ CONTRACTIONS = {}.freeze
  end
  end
  end
@@ -2,9 +2,9 @@ module PragmaticTokenizer
  module Languages
  module Catalan
  include Languages::Common
- ABBREVIATIONS = []
- STOP_WORDS = ["a", "abans", "algun", "alguna", "algunes", "alguns", "altre", "amb", "ambdós", "anar", "ans", "aquell", "aquelles", "aquells", "aquí", "bastant", "bé", "cada", "com", "consegueixo", "conseguim", "conseguir", "consigueix", "consigueixen", "consigueixes", "dalt", "de", "des de", "dins", "el", "elles", "ells", "els", "en", "ens", "entre", "era", "erem", "eren", "eres", "es", "és", "éssent", "està", "estan", "estat", "estava", "estem", "esteu", "estic", "ets", "fa", "faig", "fan", "fas", "fem", "fer", "feu", "fi", "haver", "i", "inclòs", "jo", "la", "les", "llarg", "llavors", "mentre", "meu", "mode", "molt", "molts", "nosaltres", "o", "on", "per", "per que", "però", "perquè", "podem", "poden", "poder", "podeu", "potser", "primer", "puc", "quan", "quant", "qui", "sabem", "saben", "saber", "sabeu", "sap", "saps", "sense", "ser", "seu", "seus", "si", "soc", "solament", "sols", "som", "sota", "també", "te", "tene", "tenim", "tenir", "teniu", "teu", "tinc", "tot", "últim", "un", "una", "unes", "uns", "ús", "va", "vaig", "van", "vosaltres"]
- CONTRACTIONS = {}
+ ABBREVIATIONS = [].freeze
+ STOP_WORDS = ["a", "abans", "algun", "alguna", "algunes", "alguns", "altre", "amb", "ambdós", "anar", "ans", "aquell", "aquelles", "aquells", "aquí", "bastant", "bé", "cada", "com", "consegueixo", "conseguim", "conseguir", "consigueix", "consigueixen", "consigueixes", "dalt", "de", "des de", "dins", "el", "elles", "ells", "els", "en", "ens", "entre", "era", "erem", "eren", "eres", "es", "és", "éssent", "està", "estan", "estat", "estava", "estem", "esteu", "estic", "ets", "fa", "faig", "fan", "fas", "fem", "fer", "feu", "fi", "haver", "i", "inclòs", "jo", "la", "les", "llarg", "llavors", "mentre", "meu", "mode", "molt", "molts", "nosaltres", "o", "on", "per", "per que", "però", "perquè", "podem", "poden", "poder", "podeu", "potser", "primer", "puc", "quan", "quant", "qui", "sabem", "saben", "saber", "sabeu", "sap", "saps", "sense", "ser", "seu", "seus", "si", "soc", "solament", "sols", "som", "sota", "també", "te", "tene", "tenim", "tenir", "teniu", "teu", "tinc", "tot", "últim", "un", "una", "unes", "uns", "ús", "va", "vaig", "van", "vosaltres"].freeze
+ CONTRACTIONS = {}.freeze
  end
  end
  end
@@ -1,17 +1,23 @@
  module PragmaticTokenizer
  module Languages
  module Common
- PUNCTUATION = ['。', '.', '.', '!', '!', '?', '?', '、', '¡', '¿', '„', '“', '[', ']', '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', ':', ';', '<', '=', '>', '@', '^', '_', '`', "'", '{', '|', '}', '~', '-', '«', '»', '/', '›', '‹', '^', '”']
- PUNCTUATION_MAP = { "。" => "♳", "." => "♴", "." => "♵", "!" => "♶", "!" => "♷", "?" => "♸", "?" => "♹", "、" => "♺", "¡" => "⚀", "¿" => "⚁", "„" => "⚂", "“" => "⚃", "[" => "⚄", "]" => "⚅", "\"" => "☇", "#" => "☈", "$" => "☉", "%" => "☊", "&" => "☋", "(" => "☌", ")" => "☍", "*" => "☠", "+" => "☢", "," => "☣", ":" => "☤", ";" => "☥", "<" => "☦", "=" => "☧", ">" => "☀", "@" => "☁", "^" => "☂", "_" => "☃", "`" => "☄", "'" => "☮", "{" => "♔", "|" => "♕", "}" => "♖", "~" => "♗", "-" => "♘", "«" => "♙", "»" => "♚", "”" => "⚘" }
- SEMI_PUNCTUATION = ['。', '.', '.']
- ROMAN_NUMERALS = ['i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x', 'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx', 'xxi', 'xxii', 'xxiii', 'xxiv', 'xxv', 'xxvi', 'xxvii', 'xxviii', 'xxix', 'xxx', 'xxxi', 'xxxii', 'xxxiii', 'xxxiv', 'xxxv', 'xxxvi', 'xxxvii', 'xxxviii', 'xxxix', 'xl', 'xli', 'xlii', 'xliii', 'xliv', 'xlv', 'xlvi', 'xlvii', 'xlviii', 'xlix', 'l', 'li', 'lii', 'liii', 'liv', 'lv', 'lvi', 'lvii', 'lviii', 'lix', 'lx', 'lxi', 'lxii', 'lxiii', 'lxiv', 'lxv', 'lxvi', 'lxvii', 'lxviii', 'lxix', 'lxx', 'lxxi', 'lxxii', 'lxxiii', 'lxxiv', 'lxxv', 'lxxvi', 'lxxvii', 'lxxviii', 'lxxix', 'lxxx', 'lxxxi', 'lxxxii', 'lxxxiii', 'lxxxiv', 'lxxxv', 'lxxxvi', 'lxxxvii', 'lxxxviii', 'lxxxix', 'xc', 'xci', 'xcii', 'xciii', 'xciv', 'xcv', 'xcvi', 'xcvii', 'xcviii', 'xcix']
- SPECIAL_CHARACTERS = ['®', '©', '™']
- ABBREVIATIONS = []
- STOP_WORDS = []
- CONTRACTIONS = {}
+ PUNCTUATION = ['。', '.', '.', '!', '!', '?', '?', '、', '¡', '¿', '„', '“', '[', ']', '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', ':', ';', '<', '=', '>', '@', '^', '_', '`', "'", '{', '|', '}', '~', '-', '«', '»', '/', '›', '‹', '^', '”'].freeze
+ PUNCTUATION_MAP = { "。" => "♳", "." => "♴", "." => "♵", "!" => "♶", "!" => "♷", "?" => "♸", "?" => "♹", "、" => "♺", "¡" => "⚀", "¿" => "⚁", "„" => "⚂", "“" => "⚃", "[" => "⚄", "]" => "⚅", "\"" => "☇", "#" => "☈", "$" => "☉", "%" => "☊", "&" => "☋", "(" => "☌", ")" => "☍", "*" => "☠", "+" => "☢", "," => "☣", ":" => "☤", ";" => "☥", "<" => "☦", "=" => "☧", ">" => "☀", "@" => "☁", "^" => "☂", "_" => "☃", "`" => "☄", "'" => "☮", "{" => "♔", "|" => "♕", "}" => "♖", "~" => "♗", "-" => "♘", "«" => "♙", "»" => "♚", "”" => "⚘" }.freeze
+ SEMI_PUNCTUATION = ['。', '.', '.'].freeze
+ ROMAN_NUMERALS = ['i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x', 'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx', 'xxi', 'xxii', 'xxiii', 'xxiv', 'xxv', 'xxvi', 'xxvii', 'xxviii', 'xxix', 'xxx', 'xxxi', 'xxxii', 'xxxiii', 'xxxiv', 'xxxv', 'xxxvi', 'xxxvii', 'xxxviii', 'xxxix', 'xl', 'xli', 'xlii', 'xliii', 'xliv', 'xlv', 'xlvi', 'xlvii', 'xlviii', 'xlix', 'l', 'li', 'lii', 'liii', 'liv', 'lv', 'lvi', 'lvii', 'lviii', 'lix', 'lx', 'lxi', 'lxii', 'lxiii', 'lxiv', 'lxv', 'lxvi', 'lxvii', 'lxviii', 'lxix', 'lxx', 'lxxi', 'lxxii', 'lxxiii', 'lxxiv', 'lxxv', 'lxxvi', 'lxxvii', 'lxxviii', 'lxxix', 'lxxx', 'lxxxi', 'lxxxii', 'lxxxiii', 'lxxxiv', 'lxxxv', 'lxxxvi', 'lxxxvii', 'lxxxviii', 'lxxxix', 'xc', 'xci', 'xcii', 'xciii', 'xciv', 'xcv', 'xcvi', 'xcvii', 'xcviii', 'xcix'].freeze
+ SPECIAL_CHARACTERS = ['®', '©', '™'].freeze
+ ABBREVIATIONS = [].freeze
+ STOP_WORDS = [].freeze
+ CONTRACTIONS = {}.freeze
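The placeholder constants above drive a protect-and-restore pattern: PUNCTUATION_MAP appears to swap each punctuation mark for a rare Unicode glyph so it survives whitespace-based splitting and can be mapped back afterwards. A minimal standalone sketch of that round trip (the `protect`/`restore` helpers are hypothetical, using a two-entry subset of the map):

```ruby
# Hypothetical sketch of the placeholder round trip: replace punctuation
# with rare glyphs, tokenize safely, then invert the mapping.
MAP = { "!" => "♶", "," => "☣" }.freeze  # two-entry subset of PUNCTUATION_MAP

def protect(text)
  MAP.reduce(text) { |t, (punct, glyph)| t.gsub(punct, glyph) }
end

def restore(text)
  MAP.reduce(text) { |t, (punct, glyph)| t.gsub(glyph, punct) }
end

protected_text = protect("Hello, world!")  # => "Hello☣ world♶"
restore(protected_text)                    # => "Hello, world!"
```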
+ EMOJI_REGEX = /[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]/
+ PREFIX_EMOJI_REGEX = /(?<=\S)(?=[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}])/
+ POSTFIX_EMOJI_REGEX = /(?<=[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}])(?=\S)/
+ EMOTICON_REGEX = /(?::|;|=)(?:-)?(?:\)|D|P)/
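The new EMOTICON_REGEX is simpler than it looks: an eye (`:`, `;`, or `=`), an optional nose (`-`), and a mouth (`)`, `D`, or `P`). Tried in isolation:

```ruby
# EMOTICON_REGEX as added above: eye, optional nose, mouth.
EMOTICON_REGEX = /(?::|;|=)(?:-)?(?:\)|D|P)/

EMOTICON_REGEX.match?("great job :-)")  # => true
EMOTICON_REGEX.match?("odds are 1:2")   # => false (no mouth character)
```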
 
  class SingleQuotes
  def handle_single_quotes(text)
+ # Convert left quotes to special character except for 'Twas or 'twas
+ text.gsub!(/(\W|^)'(?=.*\w)(?!twas)(?!Twas)/o) { $1 ? $1 + ' ' + PragmaticTokenizer::Languages::Common::PUNCTUATION_MAP["'"] + ' ' : ' ' + PragmaticTokenizer::Languages::Common::PUNCTUATION_MAP["'"] + ' ' } || text
  text.gsub!(/(\W|^)'(?=.*\w)/o, ' ' + PragmaticTokenizer::Languages::Common::PUNCTUATION_MAP["'"]) || text
  # Separate right single quotes
  text.gsub!(/(\w|\D)'(?!')(?=\W|$)/o) { $1 + ' ' + PragmaticTokenizer::Languages::Common::PUNCTUATION_MAP["'"] + ' ' } || text
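The right-single-quote rule above can be exercised in isolation. This standalone sketch substitutes a hypothetical one-entry map for `PragmaticTokenizer::Languages::Common::PUNCTUATION_MAP` (the `/o` flag is dropped here since nothing is interpolated):

```ruby
MAP = { "'" => "☮" }.freeze  # stand-in for PUNCTUATION_MAP["'"]

# A quote followed by a non-word character (or end of string) is split
# off as its own placeholder token; in-word apostrophes are untouched.
def separate_right_quote(text)
  text.gsub(/(\w|\D)'(?!')(?=\W|$)/) { $1 + ' ' + MAP["'"] + ' ' }
end

separate_right_quote("don't")           # => "don't" (in-word apostrophe kept)
separate_right_quote("the dogs' bowls") # trailing quote becomes its own token
```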
@@ -2,9 +2,9 @@ module PragmaticTokenizer
  module Languages
  module Czech
  include Languages::Common
- ABBREVIATIONS = []
- STOP_WORDS = ["ačkoli", "ahoj", "ale", "anebo", "ano", "asi", "aspoň", "během", "bez", "beze", "blízko", "bohužel", "brzo", "bude", "budeme", "budeš", "budete", "budou", "budu", "byl", "byla", "byli", "bylo", "byly", "bys", "čau", "chce", "chceme", "chceš", "chcete", "chci", "chtějí", "chtít", "chut'", "chuti", "co", "čtrnáct", "čtyři", "dál", "dále", "daleko", "děkovat", "děkujeme", "děkuji", "den", "deset", "devatenáct", "devět", "do", "dobrý", "docela", "dva", "dvacet", "dvanáct", "dvě", "hodně", "já", "jak", "jde", "je", "jeden", "jedenáct", "jedna", "jedno", "jednou", "jedou", "jeho", "její", "jejich", "jemu", "jen", "jenom", "ještě", "jestli", "jestliže", "jí", "jich", "jím", "jimi", "jinak", "jsem", "jsi", "jsme", "jsou", "jste", "kam", "kde", "kdo", "kdy", "když", "ke", "kolik", "kromě", "která", "které", "kteří", "který", "kvůli", "má", "mají", "málo", "mám", "máme", "máš", "máte", "mé", "mě", "mezi", "mí", "mít", "mně", "mnou", "moc", "mohl", "mohou", "moje", "moji", "možná", "můj", "musí", "může", "my", "na", "nad", "nade", "nám", "námi", "naproti", "nás", "náš", "naše", "naši", "ne", "ně", "nebo", "nebyl", "nebyla", "nebyli", "nebyly", "něco", "nedělá", "nedělají", "nedělám", "neděláme", "neděláš", "neděláte", "nějak", "nejsi", "někde", "někdo", "nemají", "nemáme", "nemáte", "neměl", "němu", "není", "nestačí", "nevadí", "než", "nic", "nich", "ním", "nimi", "nula", "od", "ode", "on", "ona", "oni", "ono", "ony", "osm", "osmnáct", "pak", "patnáct", "pět", "po", "pořád", "potom", "pozdě", "před", "přes", "přese", "pro", "proč", "prosím", "prostě", "proti", "protože", "rovně", "se", "sedm", "sedmnáct", "šest", "šestnáct", "skoro", "smějí", "smí", "snad", "spolu", "sta", "sté", "sto", "ta", "tady", "tak", "takhle", "taky", "tam", "tamhle", "tamhleto", "tamto", "tě", "tebe", "tebou", "ted'", "tedy", "ten", "ti", "tisíc", "tisíce", "to", "tobě", "tohle", "toto", "třeba", "tři", "třináct", "trošku", "tvá", "tvé", "tvoje", "tvůj", "ty", "určitě", "už", "vám", "vámi", "vás", "váš", "vaše", "vaši", "ve", "večer", "vedle", "vlastně", "všechno", "všichni", "vůbec", "vy", "vždy", "za", "zač", "zatímco", "ze", "že", "aby", "aj", "ani", "az", "budem", "budes", "by", "byt", "ci", "clanek", "clanku", "clanky", "coz", "cz", "dalsi", "design", "dnes", "email", "ho", "jako", "jej", "jeji", "jeste", "ji", "jine", "jiz", "jses", "kdyz", "ktera", "ktere", "kteri", "kterou", "ktery", "ma", "mate", "mi", "mit", "muj", "muze", "nam", "napiste", "nas", "nasi", "nejsou", "neni", "nez", "nove", "novy", "pod", "podle", "pokud", "pouze", "prave", "pred", "pres", "pri", "proc", "proto", "protoze", "prvni", "pta", "re", "si", "strana", "sve", "svych", "svym", "svymi", "take", "takze", "tato", "tema", "tento", "teto", "tim", "timto", "tipy", "toho", "tohoto", "tom", "tomto", "tomuto", "tu", "tuto", "tyto", "uz", "vam", "vas", "vase", "vice", "vsak", "zda", "zde", "zpet", "zpravy", "a", "aniž", "až", "být", "což", "či", "článek", "článku", "články", "další", "i", "jenž", "jiné", "již", "jseš", "jšte", "k", "každý", "kteři", "ku", "me", "ná", "napište", "nechť", "ní", "nové", "nový", "o", "práve", "první", "přede", "při", "s", "sice", "své", "svůj", "svých", "svým", "svými", "také", "takže", "te", "těma", "této", "tím", "tímto", "u", "v", "více", "však", "všechen", "z", "zpět", "zprávy"]
- CONTRACTIONS = {}
+ ABBREVIATIONS = [].freeze
+ STOP_WORDS = ["ačkoli", "ahoj", "ale", "anebo", "ano", "asi", "aspoň", "během", "bez", "beze", "blízko", "bohužel", "brzo", "bude", "budeme", "budeš", "budete", "budou", "budu", "byl", "byla", "byli", "bylo", "byly", "bys", "čau", "chce", "chceme", "chceš", "chcete", "chci", "chtějí", "chtít", "chut'", "chuti", "co", "čtrnáct", "čtyři", "dál", "dále", "daleko", "děkovat", "děkujeme", "děkuji", "den", "deset", "devatenáct", "devět", "do", "dobrý", "docela", "dva", "dvacet", "dvanáct", "dvě", "hodně", "já", "jak", "jde", "je", "jeden", "jedenáct", "jedna", "jedno", "jednou", "jedou", "jeho", "její", "jejich", "jemu", "jen", "jenom", "ještě", "jestli", "jestliže", "jí", "jich", "jím", "jimi", "jinak", "jsem", "jsi", "jsme", "jsou", "jste", "kam", "kde", "kdo", "kdy", "když", "ke", "kolik", "kromě", "která", "které", "kteří", "který", "kvůli", "má", "mají", "málo", "mám", "máme", "máš", "máte", "mé", "mě", "mezi", "mí", "mít", "mně", "mnou", "moc", "mohl", "mohou", "moje", "moji", "možná", "můj", "musí", "může", "my", "na", "nad", "nade", "nám", "námi", "naproti", "nás", "náš", "naše", "naši", "ne", "ně", "nebo", "nebyl", "nebyla", "nebyli", "nebyly", "něco", "nedělá", "nedělají", "nedělám", "neděláme", "neděláš", "neděláte", "nějak", "nejsi", "někde", "někdo", "nemají", "nemáme", "nemáte", "neměl", "němu", "není", "nestačí", "nevadí", "než", "nic", "nich", "ním", "nimi", "nula", "od", "ode", "on", "ona", "oni", "ono", "ony", "osm", "osmnáct", "pak", "patnáct", "pět", "po", "pořád", "potom", "pozdě", "před", "přes", "přese", "pro", "proč", "prosím", "prostě", "proti", "protože", "rovně", "se", "sedm", "sedmnáct", "šest", "šestnáct", "skoro", "smějí", "smí", "snad", "spolu", "sta", "sté", "sto", "ta", "tady", "tak", "takhle", "taky", "tam", "tamhle", "tamhleto", "tamto", "tě", "tebe", "tebou", "ted'", "tedy", "ten", "ti", "tisíc", "tisíce", "to", "tobě", "tohle", "toto", "třeba", "tři", "třináct", "trošku", "tvá", "tvé", "tvoje", "tvůj", "ty", "určitě", "už", "vám", "vámi", "vás", "váš", "vaše", "vaši", "ve", "večer", "vedle", "vlastně", "všechno", "všichni", "vůbec", "vy", "vždy", "za", "zač", "zatímco", "ze", "že", "aby", "aj", "ani", "az", "budem", "budes", "by", "byt", "ci", "clanek", "clanku", "clanky", "coz", "cz", "dalsi", "design", "dnes", "email", "ho", "jako", "jej", "jeji", "jeste", "ji", "jine", "jiz", "jses", "kdyz", "ktera", "ktere", "kteri", "kterou", "ktery", "ma", "mate", "mi", "mit", "muj", "muze", "nam", "napiste", "nas", "nasi", "nejsou", "neni", "nez", "nove", "novy", "pod", "podle", "pokud", "pouze", "prave", "pred", "pres", "pri", "proc", "proto", "protoze", "prvni", "pta", "re", "si", "strana", "sve", "svych", "svym", "svymi", "take", "takze", "tato", "tema", "tento", "teto", "tim", "timto", "tipy", "toho", "tohoto", "tom", "tomto", "tomuto", "tu", "tuto", "tyto", "uz", "vam", "vas", "vase", "vice", "vsak", "zda", "zde", "zpet", "zpravy", "a", "aniž", "až", "být", "což", "či", "článek", "článku", "články", "další", "i", "jenž", "jiné", "již", "jseš", "jšte", "k", "každý", "kteři", "ku", "me", "ná", "napište", "nechť", "ní", "nové", "nový", "o", "práve", "první", "přede", "při", "s", "sice", "své", "svůj", "svých", "svým", "svými", "také", "takže", "te", "těma", "této", "tím", "tímto", "u", "v", "více", "však", "všechen", "z", "zpět", "zprávy"].freeze
+ CONTRACTIONS = {}.freeze
  end
  end
  end
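Beyond the new features, the dominant change across the language files in this release is `.freeze` appended to every constant. Freezing makes the shared arrays and hashes immutable, so neither a caller nor a bug can mutate language data that every tokenizer instance reads. A minimal illustration:

```ruby
# Frozen constants reject mutation at runtime.
ABBREVIATIONS = [].freeze

ABBREVIATIONS.frozen?  # => true

begin
  ABBREVIATIONS << "dr"    # any mutation now raises
rescue RuntimeError => e   # FrozenError on Ruby >= 2.5 (a RuntimeError subclass)
  "rejected: #{e.class}"
end
```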