pragmatic_tokenizer 0.5.0 → 1.0.0
- checksums.yaml +4 -4
- data/README.md +133 -151
- data/lib/pragmatic_tokenizer/ending_punctuation_separator.rb +31 -0
- data/lib/pragmatic_tokenizer/full_stop_separator.rb +38 -0
- data/lib/pragmatic_tokenizer/languages/arabic.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/bulgarian.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/catalan.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/common.rb +14 -8
- data/lib/pragmatic_tokenizer/languages/czech.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/danish.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/deutsch.rb +2 -2
- data/lib/pragmatic_tokenizer/languages/dutch.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/english.rb +2 -2
- data/lib/pragmatic_tokenizer/languages/finnish.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/french.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/greek.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/indonesian.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/italian.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/latvian.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/norwegian.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/persian.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/polish.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/portuguese.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/romanian.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/russian.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/slovak.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/spanish.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/swedish.rb +3 -3
- data/lib/pragmatic_tokenizer/languages/turkish.rb +3 -3
- data/lib/pragmatic_tokenizer/languages.rb +0 -2
- data/lib/pragmatic_tokenizer/post_processor.rb +49 -0
- data/lib/pragmatic_tokenizer/{processor.rb → pre_processor.rb} +35 -98
- data/lib/pragmatic_tokenizer/tokenizer.rb +186 -159
- data/lib/pragmatic_tokenizer/version.rb +1 -1
- metadata +6 -3
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c4834da7c6c1b1d6c614226840bb2fd5ef8b48b6
+  data.tar.gz: 395868d67e973b2a6e9e28b4b9883c95d1746fe6
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: cc69a6f19545c9f5755df5c996e0625f0e65883fea81f01a877d10fce5f5b4eba8931529aecff9afb2ce56f8b993350d9bad15a94a5bb718db4eeafbbe611a29
+  data.tar.gz: f08442f148d59d98d3970e50ccc3bab2d59c1728fb06d9fefe4670ff5b4aca688168c81c30da3a49d1d800d8b398d53e76d9482c120450f2edc90b1b3c174617
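To spot-check the new digests locally, a sketch like the following can recompute them, since a `.gem` file is a plain tar archive whose members include `metadata.gz` and `data.tar.gz` (the gem filename below is an assumption):

```ruby
# Sketch: recompute the digests recorded above for a downloaded gem.
require 'digest'
require 'rubygems/package'

File.open('pragmatic_tokenizer-1.0.0.gem', 'rb') do |io|
  Gem::Package::TarReader.new(io).each do |entry|
    # Only the two members listed in checksums.yaml are of interest.
    next unless ['metadata.gz', 'data.tar.gz'].include?(entry.full_name)
    data = entry.read
    puts "#{entry.full_name} SHA1:   #{Digest::SHA1.hexdigest(data)}"
    puts "#{entry.full_name} SHA512: #{Digest::SHA512.hexdigest(data)}"
  end
end
```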
data/README.md CHANGED

@@ -26,36 +26,70 @@ Or install it yourself as:
 * To specify a language use its two character [ISO 639-1 code](https://www.tm-town.com/languages).
 * Pragmatic Tokenizer will unescape any HTML entities.
 
+**Example Usage**
+```ruby
+text = "\"I said, 'what're you? Crazy?'\" said Sandowsky. \"I can't afford to do that.\""
+
+PragmaticTokenizer::Tokenizer.new(text).tokenize
+# => ["\"", "i", "said", ",", "'", "what're", "you", "?", "crazy", "?", "'", "\"", "said", "sandowsky", ".", "\"", "i", "can't", "afford", "to", "do", "that", ".", "\""]
+
+# You can pass many different options:
+options = {
+  language: :en, # the language of the string you are tokenizing
+  abbreviations: ['a.b', 'a'], # a user-supplied array of abbreviations (downcased with ending period removed)
+  stop_words: ['is', 'the'], # a user-supplied array of stop words (downcased)
+  remove_stop_words: true, # remove stop words
+  contractions: { "i'm" => "i am" }, # a user-supplied hash of contractions (key is the contracted form; value is the expanded form - both the key and value should be downcased)
+  expand_contractions: true, # (i.e. ["isn't"] will change to two tokens ["is", "not"])
+  filter_languages: [:en, :de], # process abbreviations, contractions and stop words for this array of languages
+  punctuation: :none, # see below for more details
+  numbers: :none, # see below for more details
+  remove_emoji: true, # remove any emoji tokens
+  remove_urls: true, # remove any URLs
+  remove_emails: true, # remove any emails
+  remove_domains: true, # remove any domains
+  hashtags: :keep_and_clean, # remove the hashtag prefix
+  mentions: :keep_and_clean, # remove the @ prefix
+  clean: true, # remove some special characters
+  classic_filter: true, # removes dots from acronyms and 's from the end of tokens
+  downcase: false, # do not downcase tokens
+  minimum_length: 3, # remove any tokens less than 3 characters
+  long_word_split: 10 # split tokens longer than 10 characters at hyphens or underscores
+}
+```
+
 **Options**
 
-##### `punctuation`
-**default** = `'all'`
-- `'all'`
-  Does not remove any punctuation from the result.
-- `'semi'`
-  Removes full stops (i.e. periods) ['。', '.', '.'].
-- `'none'`
-  Removes all punctuation from the result.
-- `'only'`
-  Removes everything except punctuation. The returned result is an array of only the punctuation.
+##### `language`
+**default** = `'en'`
+- To specify a language use its two character [ISO 639-1 code](https://www.tm-town.com/languages) as a symbol (i.e. `:en`) or string (i.e. `'en'`)
 
 <hr>
 
-##### `
-**default** = `
-
+##### `abbreviations`
+**default** = `nil`
+- You can pass an array of abbreviations to override or complement the abbreviations that come stored in this gem. Each element of the array should be a downcased String with the ending period removed.
+
+<hr>
+
+##### `stop_words`
+**default** = `nil`
+- You can pass an array of stop words to override or complement the stop words that come stored in this gem. Each element of the array should be a downcased String.
+
+<hr>
+
+##### `contractions`
+**default** = `nil`
+- You can pass a hash of contractions to override or complement the contractions that come stored in this gem. Each key is the contracted form downcased and each value is the expanded form downcased.
 
 <hr>
 
-##### `
+##### `remove_stop_words`
 **default** = `'false'`
 - `true`
-  Removes all
+  Removes all stop words.
 - `false`
-  Does not remove
+  Does not remove stop words.
 
 <hr>
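The options hash shown in the example above is passed as the second argument to the constructor, the same calling convention the 0.5.0 examples further down use. A minimal sketch (the token output is illustrative, not taken from the gem's documentation):

```ruby
# Sketch: passing the new 1.0.0-style options to the tokenizer.
require 'pragmatic_tokenizer'

text = "Hello world! #ruby @someone"
PragmaticTokenizer::Tokenizer.new(text,
  punctuation: :none,        # drop punctuation tokens
  hashtags:    :keep_and_clean, # "#ruby" -> "ruby"
  mentions:    :remove          # "@someone" dropped entirely
).tokenize
# e.g. => ["hello", "world", "ruby"]
```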
@@ -68,180 +102,128 @@
 
 <hr>
 
-##### `
+##### `filter_languages`
+**default** = `nil`
+- You can pass an array of languages for which abbreviations, stop words and contractions should be processed. These languages can be independent of the language of the string you are tokenizing (for example, your text might be German but contain some English stop words that you want to remove). If you supply your own abbreviations, stop words or contractions, they will be merged with the abbreviations, stop words and contractions of any languages you add in this option. You can pass an array of symbols or strings (i.e. `[:en, :de]` or `['en', 'de']`).
+
+<hr>
+
+##### `punctuation`
+**default** = `'all'`
+- `:all`
+  Does not remove any punctuation from the result.
+- `:semi`
+  Removes full stops (i.e. periods) ['。', '.', '.'].
+- `:none`
+  Removes all punctuation from the result.
+- `:only`
+  Removes everything except punctuation. The returned result is an array of only the punctuation.
+
+<hr>
+
+##### `numbers`
+**default** = `'all'`
+- `:all`
+  Does not remove any numbers from the result.
+- `:semi`
+  Removes tokens that consist of only digits.
+- `:none`
+  Removes all tokens that include a number from the result (including Roman numerals).
+- `:only`
+  Removes everything except tokens that include a number.
+
+<hr>
+
+##### `remove_emoji`
 **default** = `'false'`
 - `true`
-  Removes
+  Removes any token that contains an emoji.
 - `false`
   Leaves tokens as is.
 
 <hr>
 
-##### `
+##### `remove_urls`
 **default** = `'false'`
 - `true`
-  Removes any token that contains a
+  Removes any token that contains a URL.
 - `false`
   Leaves tokens as is.
 
 <hr>
 
-##### `
+##### `remove_emails`
 **default** = `'false'`
 - `true`
-  Removes any token that contains a
+  Removes any token that contains an email.
 - `false`
   Leaves tokens as is.
 
 <hr>
 
-##### `
-**default** = `'
-
-**default** = `0`
-The minimum number of characters a token should be.
-
-**Methods**
-
-#### `#tokenize`
-
-**Example Usage**
-```ruby
-text = "\"I said, 'what're you? Crazy?'\" said Sandowsky. \"I can't afford to do that.\""
-
-PragmaticTokenizer::Tokenizer.new(text).tokenize
-# => ["\"", "i", "said", ",", "'", "what're", "you", "?", "crazy", "?", "'", "\"", "said", "sandowsky", ".", "\"", "i", "can't", "afford", "to", "do", "that", ".", "\""]
-
-PragmaticTokenizer::Tokenizer.new(text, remove_stop_words: true).tokenize
-# => ["\"", ",", "'", "what're", "?", "crazy", "?", "'", "\"", "sandowsky", ".", "\"", "afford", ".", "\""]
-
-PragmaticTokenizer::Tokenizer.new(text, punctuation: 'none').tokenize
-# => ["i", "said", "what're", "you", "crazy", "said", "sandowsky", "i", "can't", "afford", "to", "do", "that"]
-
-PragmaticTokenizer::Tokenizer.new(text, punctuation: 'only').tokenize
-# => ["\"", ",", "'", "?", "?", "'", "\"", ".", "\"", ".", "\""]
-
-PragmaticTokenizer::Tokenizer.new(text, punctuation: 'semi').tokenize
-# => ["\"", "i", "said", ",", "'", "what're", "you", "?", "crazy", "?", "'", "\"", "said", "sandowsky", "\"", "i", "can't", "afford", "to", "do", "that", "\""]
-
-PragmaticTokenizer::Tokenizer.new(text, expand_contractions: true).tokenize
-# => ['"', 'i', 'said', ',', "'", 'what', 'are', 'you', '?', 'crazy', '?', "'", '"', 'said', 'sandowsky', '.', '"', 'i', 'cannot', 'afford', 'to', 'do', 'that', '.', '"']
-
-PragmaticTokenizer::Tokenizer.new(text,
-  expand_contractions: true,
-  remove_stop_words: true,
-  punctuation: 'none'
-).tokenize
-# => ["crazy", "sandowsky", "afford"]
-
-text = "The price is $5.50 and it works for 5 hours."
-PragmaticTokenizer::Tokenizer.new(text, remove_numbers: true).tokenize
-# => ["the", "price", "is", "and", "it", "works", "for", "hours", "."]
-
-text = "Hello ______ ."
-PragmaticTokenizer::Tokenizer.new(text, clean: true).tokenize
-# => ["hello", "."]
-
-text = "Let's test the minimum length."
-PragmaticTokenizer::Tokenizer.new(text, minimum_length: 6).tokenize
-# => ["minimum", "length"]
-```
+##### `remove_domains`
+**default** = `'false'`
+- `true`
+  Removes any token that contains a domain.
+- `false`
+  Leaves tokens as is.
 
 <hr>
 
-PragmaticTokenizer::Tokenizer.new(text).urls
-# => ["http://www.example.com"]
-```
+##### `clean`
+**default** = `'false'`
+- `true`
+  Removes tokens consisting of only hyphens, underscores, or periods, as well as some special characters (®, ©, ™). Also removes long tokens or tokens with a backslash.
+- `false`
+  Leaves tokens as is.
 
 <hr>
 
-# => ["cnn.com/europe", "english.alarabiya.net"]
-```
+##### `hashtags`
+**default** = `'keep_original'`
+- `:keep_original`
+  Does not alter the token at all.
+- `:keep_and_clean`
+  Removes the hashtag (#) prefix from the token.
+- `:remove`
+  Removes the token completely.
 
 <hr>
 
-# => ["example@example.com"]
-```
+##### `mentions`
+**default** = `'keep_original'`
+- `:keep_original`
+  Does not alter the token at all.
+- `:keep_and_clean`
+  Removes the mention (@) prefix from the token.
+- `:remove`
+  Removes the token completely.
 
 <hr>
 
-PragmaticTokenizer::Tokenizer.new(text).hashtags
-# => ["#fun", "#hashtags", "#backallofthem"]
-```
+##### `classic_filter`
+**default** = `'false'`
+- `true`
+  Removes dots from acronyms and 's from the end of tokens.
+- `false`
+  Leaves tokens as is.
 
 <hr>
 
-**Example Usage**
-```ruby
-text = "Find me all the @awesome mentions."
-
-PragmaticTokenizer::Tokenizer.new(text).mentions
-# => ["@awesome"]
-```
+##### `downcase`
+**default** = `'true'`
 
 <hr>
 
-**Example Usage**
-```ruby
-text = "Hello ;-) :) 😄"
-
-PragmaticTokenizer::Tokenizer.new(text).emoticons
-# => [";-)", ":)"]
-```
+##### `minimum_length`
+**default** = `0`
+The minimum number of characters a token should be.
 
 <hr>
 
-*†matches all 1012 single-character Unicode Emoji (all except for two-character flags)*
-
-**Example Usage**
-```ruby
-text = "Return the emoji 👿😍😱🐔🌚."
-
-PragmaticTokenizer::Tokenizer.new(text).emoji
-# => ["👿", "😍", "😱", "🐔", "🌚"]
-```
+##### `long_word_split`
+**default** = `nil`
+The number of characters after which a token should be split at hyphens or underscores.
 
 ## Language Support
 
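The 0.5.0 extraction helpers removed above (`#tokenize` aside: `#urls`, `#domains`, `#emails`, `#hashtags`, `#mentions`, `#emoticons`, `#emoji`) have no direct 1.0.0 counterparts; the new option flags only keep, clean, or drop such tokens. A hedged sketch of one way to recover the old behaviour by filtering the token stream yourself (output illustrative):

```ruby
# Sketch: recovering 0.5.0-style hashtag extraction in 1.0.0.
# The default hashtags: :keep_original leaves "#fun" intact as a token.
text   = "Find me all the #fun #hashtags."
tokens = PragmaticTokenizer::Tokenizer.new(text).tokenize
tokens.select { |t| t.start_with?('#') }
# e.g. => ["#fun", "#hashtags"]
```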
data/lib/pragmatic_tokenizer/ending_punctuation_separator.rb ADDED

@@ -0,0 +1,31 @@
+# -*- encoding : utf-8 -*-
+
+module PragmaticTokenizer
+  # This class separates ending punctuation from a token
+  class EndingPunctuationSeparator
+    attr_reader :tokens
+    def initialize(tokens:)
+      @tokens = tokens
+    end
+
+    def separate
+      cleaned_tokens = []
+      tokens.each do |a|
+        split_punctuation = a.scan(/(?<=\S)[。.!!??]+$/)
+        if split_punctuation[0].nil?
+          cleaned_tokens << a
+        else
+          cleaned_tokens << a.tr(split_punctuation[0], '')
+          if split_punctuation[0].length.eql?(1)
+            cleaned_tokens << split_punctuation[0]
+          else
+            split_punctuation[0].split("").each do |s|
+              cleaned_tokens << s
+            end
+          end
+        end
+      end
+      cleaned_tokens
+    end
+  end
+end
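A quick sketch of the new class's behaviour, worked out from the implementation above (note that `String#tr` strips the matched punctuation characters wherever they occur in the token, not only at the end):

```ruby
tokens = ["Hello!!", "there", "world?"]
PragmaticTokenizer::EndingPunctuationSeparator.new(tokens: tokens).separate
# => ["Hello", "!", "!", "there", "world", "?"]
```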
data/lib/pragmatic_tokenizer/full_stop_separator.rb ADDED

@@ -0,0 +1,38 @@
+# -*- encoding : utf-8 -*-
+
+module PragmaticTokenizer
+  # This class separates true full stops while ignoring
+  # periods that are part of an abbreviation
+  class FullStopSeparator
+    attr_reader :tokens, :abbreviations
+    def initialize(tokens:, abbreviations:)
+      @tokens = tokens
+      @abbreviations = abbreviations
+    end
+
+    def separate
+      abbr = {}
+      abbreviations.each do |i|
+        abbr[i] = true
+      end
+      cleaned_tokens = []
+      tokens.each_with_index do |_t, i|
+        if tokens[i + 1] && tokens[i] =~ /\A(.+)\.\z/
+          w = $1
+          unless abbr[Unicode::downcase(w)] || w =~ /\A[a-z]\z/i ||
+                 w =~ /[a-z](?:\.[a-z])+\z/i
+            cleaned_tokens << w
+            cleaned_tokens << '.'
+            next
+          end
+        end
+        cleaned_tokens << tokens[i]
+      end
+      if cleaned_tokens[-1] && cleaned_tokens[-1] =~ /\A(.*\w)\.\z/
+        cleaned_tokens[-1] = $1
+        cleaned_tokens.push '.'
+      end
+      cleaned_tokens
+    end
+  end
+end
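And a sketch of the abbreviation-aware behaviour (`Unicode::downcase` is provided by the unicode gem, presumably covered by the metadata dependency change listed above):

```ruby
# Sentence-final periods are split off; abbreviation periods stay attached.
tokens = ["Mr.", "Smith", "arrived."]
PragmaticTokenizer::FullStopSeparator.new(tokens: tokens, abbreviations: ["mr"]).separate
# => ["Mr.", "Smith", "arrived", "."]
```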
data/lib/pragmatic_tokenizer/languages/arabic.rb CHANGED

@@ -2,9 +2,9 @@ module PragmaticTokenizer
   module Languages
     module Arabic
       include Languages::Common
-      ABBREVIATIONS = ['ا', 'ا. د', 'ا.د', 'ا.ش.ا', 'ا.ش.ا', 'إلخ', 'ت.ب', 'ت.ب', 'ج.ب', 'جم', 'ج.ب', 'ج.م.ع', 'ج.م.ع', 'س.ت', 'س.ت', 'سم', 'ص.ب.', 'ص.ب', 'كج.', 'كلم.', 'م', 'م.ب', 'م.ب', 'ه', 'د']
-      STOP_WORDS = ["فى", "في", "كل", "لم", "لن", "له", "من", "هو", "هي", "قوة", "كما", "لها", "منذ", "وقد", "ولا", "نفسه", "لقاء", "مقابل", "هناك", "وقال", "وكان", "نهاية", "وقالت", "وكانت", "للامم", "فيه", "كلم", "لكن", "وفي", "وقف", "ولم", "ومن", "وهو", "وهي", "يوم", "فيها", "منها", "مليار", "لوكالة", "يكون", "يمكن", "مليون", "حيث", "اكد", "الا", "اما", "امس", "السابق", "التى", "التي", "اكثر", "ايار", "ايضا", "ثلاثة", "الذاتي", "الاخيرة", "الثاني", "الثانية", "الذى", "الذي", "الان", "امام", "ايام", "خلال", "حوالى", "الذين", "الاول", "الاولى", "بين", "ذلك", "دون", "حول", "حين", "الف", "الى", "انه", "اول", "ضمن", "انها", "جميع", "الماضي", "الوقت", "المقبل", "اليوم", "ـ", "ف", "و", "و6", "قد", "لا", "ما", "مع", "مساء", "هذا", "واحد", "واضاف", "واضافت", "فان", "قبل", "قال", "كان", "لدى", "نحو", "هذه", "وان", "واكد", "كانت", "واوضح", "مايو", "ب", "ا", "أ", "،", "عشر", "عدد", "عدة", "عشرة", "عدم", "عام", "عاما", "عن", "عند", "عندما", "على", "عليه", "عليها", "زيارة", "سنة", "سنوات", "تم", "ضد", "بعد", "بعض", "اعادة", "اعلنت", "بسبب", "حتى", "اذا", "احد", "اثر", "برس", "باسم", "غدا", "شخصا", "صباح", "اطار", "اربعة", "اخرى", "بان", "اجل", "غير", "بشكل", "حاليا", "بن", "به", "ثم", "اف", "ان", "او", "اي", "بها", "صفر", "فى"]
-      CONTRACTIONS = {}
+      ABBREVIATIONS = ['ا', 'ا. د', 'ا.د', 'ا.ش.ا', 'ا.ش.ا', 'إلخ', 'ت.ب', 'ت.ب', 'ج.ب', 'جم', 'ج.ب', 'ج.م.ع', 'ج.م.ع', 'س.ت', 'س.ت', 'سم', 'ص.ب.', 'ص.ب', 'كج.', 'كلم.', 'م', 'م.ب', 'م.ب', 'ه', 'د'].freeze
+      STOP_WORDS = ["فى", "في", "كل", "لم", "لن", "له", "من", "هو", "هي", "قوة", "كما", "لها", "منذ", "وقد", "ولا", "نفسه", "لقاء", "مقابل", "هناك", "وقال", "وكان", "نهاية", "وقالت", "وكانت", "للامم", "فيه", "كلم", "لكن", "وفي", "وقف", "ولم", "ومن", "وهو", "وهي", "يوم", "فيها", "منها", "مليار", "لوكالة", "يكون", "يمكن", "مليون", "حيث", "اكد", "الا", "اما", "امس", "السابق", "التى", "التي", "اكثر", "ايار", "ايضا", "ثلاثة", "الذاتي", "الاخيرة", "الثاني", "الثانية", "الذى", "الذي", "الان", "امام", "ايام", "خلال", "حوالى", "الذين", "الاول", "الاولى", "بين", "ذلك", "دون", "حول", "حين", "الف", "الى", "انه", "اول", "ضمن", "انها", "جميع", "الماضي", "الوقت", "المقبل", "اليوم", "ـ", "ف", "و", "و6", "قد", "لا", "ما", "مع", "مساء", "هذا", "واحد", "واضاف", "واضافت", "فان", "قبل", "قال", "كان", "لدى", "نحو", "هذه", "وان", "واكد", "كانت", "واوضح", "مايو", "ب", "ا", "أ", "،", "عشر", "عدد", "عدة", "عشرة", "عدم", "عام", "عاما", "عن", "عند", "عندما", "على", "عليه", "عليها", "زيارة", "سنة", "سنوات", "تم", "ضد", "بعد", "بعض", "اعادة", "اعلنت", "بسبب", "حتى", "اذا", "احد", "اثر", "برس", "باسم", "غدا", "شخصا", "صباح", "اطار", "اربعة", "اخرى", "بان", "اجل", "غير", "بشكل", "حاليا", "بن", "به", "ثم", "اف", "ان", "او", "اي", "بها", "صفر", "فى"].freeze
+      CONTRACTIONS = {}.freeze
     end
   end
 end
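The only change in each of the language modules that follow is that `ABBREVIATIONS`, `STOP_WORDS` and `CONTRACTIONS` are now frozen. A short sketch of what that buys:

```ruby
# Sketch: frozen constants cannot be mutated in place, so the shared
# word lists cannot be corrupted by a caller, and identical frozen
# objects can be deduplicated by the interpreter.
STOP_WORDS = ["is", "the"].freeze

STOP_WORDS << "oops"
# => raises "can't modify frozen Array"
#    (RuntimeError on older Rubies, FrozenError on Ruby >= 2.5)
```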
data/lib/pragmatic_tokenizer/languages/bulgarian.rb CHANGED

@@ -2,9 +2,9 @@ module PragmaticTokenizer
   module Languages
     module Bulgarian
      include Languages::Common
-      ABBREVIATIONS = ["акад", "ал", "б.р", "б.ред", "бел.а", "бел.пр", "бр", "бул", "в", "вж", "вкл", "вм", "вр", "г", "ген", "гр", "дж", "дм", "доц", "др", "ем", "заб", "зам", "инж", "к.с", "кв", "кв.м", "кг", "км", "кор", "куб", "куб.м", "л", "лв", "м", "м.г", "мин", "млн", "млрд", "мм", "н.с", "напр", "пл", "полк", "проф", "р", "рис", "с", "св", "сек", "см", "сп", "срв", "ст", "стр", "т", "т.г", "т.е", "т.н", "т.нар", "табл", "тел", "у", "ул", "фиг", "ха", "хил", "ч", "чл", "щ.д"]
-      STOP_WORDS = ["а", "автентичен", "аз", "ако", "ала", "бе", "без", "беше", "би", "бивш", "бивша", "бившо", "бил", "била", "били", "било", "благодаря", "близо", "бъдат", "бъде", "бяха", "в", "вас", "ваш", "ваша", "вероятно", "вече", "взема", "ви", "вие", "винаги", "внимава", "време", "все", "всеки", "всички", "всичко", "всяка", "във", "въпреки", "върху", "г", "г.", "ги", "главен", "главна", "главно", "глас", "го", "година", "години", "годишен", "д", "да", "дали", "два", "двама", "двамата", "две", "двете", "ден", "днес", "дни", "до", "добра", "добре", "добро", "добър", "докато", "докога", "дори", "досега", "доста", "друг", "друга", "други", "е", "евтин", "едва", "един", "една", "еднаква", "еднакви", "еднакъв", "едно", "екип", "ето", "живот", "за", "забавям", "зад", "заедно", "заради", "засега", "заспал", "затова", "защо", "защото", "и", "из", "или", "им", "има", "имат", "иска", "й", "каза", "как", "каква", "какво", "както", "какъв", "като", "кога", "когато", "което", "които", "кой", "който", "колко", "която", "къде", "където", "към", "лесен", "лесно", "ли", "лош", "м", "май", "малко", "ме", "между", "мек", "мен", "месец", "ми", "много", "мнозина", "мога", "могат", "може", "мокър", "моля", "момента", "му", "н", "на", "над", "назад", "най", "направи", "напред", "например", "нас", "не", "него", "нещо", "нея", "ни", "ние", "никой", "нито", "нищо", "но", "нов", "нова", "нови", "новина", "някои", "някой", "няколко", "няма", "обаче", "около", "освен", "особено", "от", "отгоре", "отново", "още", "пак", "по", "повече", "повечето", "под", "поне", "поради", "после", "почти", "прави", "пред", "преди", "през", "при", "пък", "първата", "първи", "първо", "пъти", "равен", "равна", "с", "са", "сам", "само", "се", "сега", "си", "син", "скоро", "след", "следващ", "сме", "смях", "според", "сред", "срещу", "сте", "съм", "със", "също", "т", "т.н.", "тази", "така", "такива", "такъв", "там", "твой", "те", "тези", "ти", "то", "това", "тогава", "този", "той", "толкова", "точно", "три", "трябва", "тук", "тъй", "тя", "тях", "у", "утре", "харесва", "хиляди", "ч", "часа", "че", "често", "чрез", "ще", "щом", "юмрук", "я", "як"]
-      CONTRACTIONS = {}
+      ABBREVIATIONS = ["акад", "ал", "б.р", "б.ред", "бел.а", "бел.пр", "бр", "бул", "в", "вж", "вкл", "вм", "вр", "г", "ген", "гр", "дж", "дм", "доц", "др", "ем", "заб", "зам", "инж", "к.с", "кв", "кв.м", "кг", "км", "кор", "куб", "куб.м", "л", "лв", "м", "м.г", "мин", "млн", "млрд", "мм", "н.с", "напр", "пл", "полк", "проф", "р", "рис", "с", "св", "сек", "см", "сп", "срв", "ст", "стр", "т", "т.г", "т.е", "т.н", "т.нар", "табл", "тел", "у", "ул", "фиг", "ха", "хил", "ч", "чл", "щ.д"].freeze
+      STOP_WORDS = ["а", "автентичен", "аз", "ако", "ала", "бе", "без", "беше", "би", "бивш", "бивша", "бившо", "бил", "била", "били", "било", "благодаря", "близо", "бъдат", "бъде", "бяха", "в", "вас", "ваш", "ваша", "вероятно", "вече", "взема", "ви", "вие", "винаги", "внимава", "време", "все", "всеки", "всички", "всичко", "всяка", "във", "въпреки", "върху", "г", "г.", "ги", "главен", "главна", "главно", "глас", "го", "година", "години", "годишен", "д", "да", "дали", "два", "двама", "двамата", "две", "двете", "ден", "днес", "дни", "до", "добра", "добре", "добро", "добър", "докато", "докога", "дори", "досега", "доста", "друг", "друга", "други", "е", "евтин", "едва", "един", "една", "еднаква", "еднакви", "еднакъв", "едно", "екип", "ето", "живот", "за", "забавям", "зад", "заедно", "заради", "засега", "заспал", "затова", "защо", "защото", "и", "из", "или", "им", "има", "имат", "иска", "й", "каза", "как", "каква", "какво", "както", "какъв", "като", "кога", "когато", "което", "които", "кой", "който", "колко", "която", "къде", "където", "към", "лесен", "лесно", "ли", "лош", "м", "май", "малко", "ме", "между", "мек", "мен", "месец", "ми", "много", "мнозина", "мога", "могат", "може", "мокър", "моля", "момента", "му", "н", "на", "над", "назад", "най", "направи", "напред", "например", "нас", "не", "него", "нещо", "нея", "ни", "ние", "никой", "нито", "нищо", "но", "нов", "нова", "нови", "новина", "някои", "някой", "няколко", "няма", "обаче", "около", "освен", "особено", "от", "отгоре", "отново", "още", "пак", "по", "повече", "повечето", "под", "поне", "поради", "после", "почти", "прави", "пред", "преди", "през", "при", "пък", "първата", "първи", "първо", "пъти", "равен", "равна", "с", "са", "сам", "само", "се", "сега", "си", "син", "скоро", "след", "следващ", "сме", "смях", "според", "сред", "срещу", "сте", "съм", "със", "също", "т", "т.н.", "тази", "така", "такива", "такъв", "там", "твой", "те", "тези", "ти", "то", "това", "тогава", "този", "той", "толкова", "точно", "три", "трябва", "тук", "тъй", "тя", "тях", "у", "утре", "харесва", "хиляди", "ч", "часа", "че", "често", "чрез", "ще", "щом", "юмрук", "я", "як"].freeze
+      CONTRACTIONS = {}.freeze
     end
   end
 end
data/lib/pragmatic_tokenizer/languages/catalan.rb CHANGED

@@ -2,9 +2,9 @@ module PragmaticTokenizer
   module Languages
     module Catalan
       include Languages::Common
-      ABBREVIATIONS = []
-      STOP_WORDS = ["a", "abans", "algun", "alguna", "algunes", "alguns", "altre", "amb", "ambdós", "anar", "ans", "aquell", "aquelles", "aquells", "aquí", "bastant", "bé", "cada", "com", "consegueixo", "conseguim", "conseguir", "consigueix", "consigueixen", "consigueixes", "dalt", "de", "des de", "dins", "el", "elles", "ells", "els", "en", "ens", "entre", "era", "erem", "eren", "eres", "es", "és", "éssent", "està", "estan", "estat", "estava", "estem", "esteu", "estic", "ets", "fa", "faig", "fan", "fas", "fem", "fer", "feu", "fi", "haver", "i", "inclòs", "jo", "la", "les", "llarg", "llavors", "mentre", "meu", "mode", "molt", "molts", "nosaltres", "o", "on", "per", "per que", "però", "perquè", "podem", "poden", "poder", "podeu", "potser", "primer", "puc", "quan", "quant", "qui", "sabem", "saben", "saber", "sabeu", "sap", "saps", "sense", "ser", "seu", "seus", "si", "soc", "solament", "sols", "som", "sota", "també", "te", "tene", "tenim", "tenir", "teniu", "teu", "tinc", "tot", "últim", "un", "una", "unes", "uns", "ús", "va", "vaig", "van", "vosaltres"]
-      CONTRACTIONS = {}
+      ABBREVIATIONS = [].freeze
+      STOP_WORDS = ["a", "abans", "algun", "alguna", "algunes", "alguns", "altre", "amb", "ambdós", "anar", "ans", "aquell", "aquelles", "aquells", "aquí", "bastant", "bé", "cada", "com", "consegueixo", "conseguim", "conseguir", "consigueix", "consigueixen", "consigueixes", "dalt", "de", "des de", "dins", "el", "elles", "ells", "els", "en", "ens", "entre", "era", "erem", "eren", "eres", "es", "és", "éssent", "està", "estan", "estat", "estava", "estem", "esteu", "estic", "ets", "fa", "faig", "fan", "fas", "fem", "fer", "feu", "fi", "haver", "i", "inclòs", "jo", "la", "les", "llarg", "llavors", "mentre", "meu", "mode", "molt", "molts", "nosaltres", "o", "on", "per", "per que", "però", "perquè", "podem", "poden", "poder", "podeu", "potser", "primer", "puc", "quan", "quant", "qui", "sabem", "saben", "saber", "sabeu", "sap", "saps", "sense", "ser", "seu", "seus", "si", "soc", "solament", "sols", "som", "sota", "també", "te", "tene", "tenim", "tenir", "teniu", "teu", "tinc", "tot", "últim", "un", "una", "unes", "uns", "ús", "va", "vaig", "van", "vosaltres"].freeze
+      CONTRACTIONS = {}.freeze
     end
   end
 end
data/lib/pragmatic_tokenizer/languages/common.rb CHANGED

@@ -1,17 +1,23 @@
 module PragmaticTokenizer
   module Languages
     module Common
-      PUNCTUATION = ['。', '.', '.', '!', '!', '?', '?', '、', '¡', '¿', '„', '“', '[', ']', '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', ':', ';', '<', '=', '>', '@', '^', '_', '`', "'", '{', '|', '}', '~', '-', '«', '»', '/', '›', '‹', '^', '”']
-      PUNCTUATION_MAP = { "。" => "♳", "." => "♴", "." => "♵", "!" => "♶", "!" => "♷", "?" => "♸", "?" => "♹", "、" => "♺", "¡" => "⚀", "¿" => "⚁", "„" => "⚂", "“" => "⚃", "[" => "⚄", "]" => "⚅", "\"" => "☇", "#" => "☈", "$" => "☉", "%" => "☊", "&" => "☋", "(" => "☌", ")" => "☍", "*" => "☠", "+" => "☢", "," => "☣", ":" => "☤", ";" => "☥", "<" => "☦", "=" => "☧", ">" => "☀", "@" => "☁", "^" => "☂", "_" => "☃", "`" => "☄", "'" => "☮", "{" => "♔", "|" => "♕", "}" => "♖", "~" => "♗", "-" => "♘", "«" => "♙", "»" => "♚", "”" => "⚘" }
-      SEMI_PUNCTUATION = ['。', '.', '.']
-      ROMAN_NUMERALS = ['i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x', 'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx', 'xxi', 'xxii', 'xxiii', 'xxiv', 'xxv', 'xxvi', 'xxvii', 'xxviii', 'xxix', 'xxx', 'xxxi', 'xxxii', 'xxxiii', 'xxxiv', 'xxxv', 'xxxvi', 'xxxvii', 'xxxviii', 'xxxix', 'xl', 'xli', 'xlii', 'xliii', 'xliv', 'xlv', 'xlvi', 'xlvii', 'xlviii', 'xlix', 'l', 'li', 'lii', 'liii', 'liv', 'lv', 'lvi', 'lvii', 'lviii', 'lix', 'lx', 'lxi', 'lxii', 'lxiii', 'lxiv', 'lxv', 'lxvi', 'lxvii', 'lxviii', 'lxix', 'lxx', 'lxxi', 'lxxii', 'lxxiii', 'lxxiv', 'lxxv', 'lxxvi', 'lxxvii', 'lxxviii', 'lxxix', 'lxxx', 'lxxxi', 'lxxxii', 'lxxxiii', 'lxxxiv', 'lxxxv', 'lxxxvi', 'lxxxvii', 'lxxxviii', 'lxxxix', 'xc', 'xci', 'xcii', 'xciii', 'xciv', 'xcv', 'xcvi', 'xcvii', 'xcviii', 'xcix']
-      SPECIAL_CHARACTERS = ['®', '©', '™']
-      ABBREVIATIONS = []
-      STOP_WORDS = []
-      CONTRACTIONS = {}
+      PUNCTUATION = ['。', '.', '.', '!', '!', '?', '?', '、', '¡', '¿', '„', '“', '[', ']', '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', ':', ';', '<', '=', '>', '@', '^', '_', '`', "'", '{', '|', '}', '~', '-', '«', '»', '/', '›', '‹', '^', '”'].freeze
+      PUNCTUATION_MAP = { "。" => "♳", "." => "♴", "." => "♵", "!" => "♶", "!" => "♷", "?" => "♸", "?" => "♹", "、" => "♺", "¡" => "⚀", "¿" => "⚁", "„" => "⚂", "“" => "⚃", "[" => "⚄", "]" => "⚅", "\"" => "☇", "#" => "☈", "$" => "☉", "%" => "☊", "&" => "☋", "(" => "☌", ")" => "☍", "*" => "☠", "+" => "☢", "," => "☣", ":" => "☤", ";" => "☥", "<" => "☦", "=" => "☧", ">" => "☀", "@" => "☁", "^" => "☂", "_" => "☃", "`" => "☄", "'" => "☮", "{" => "♔", "|" => "♕", "}" => "♖", "~" => "♗", "-" => "♘", "«" => "♙", "»" => "♚", "”" => "⚘" }.freeze
+      SEMI_PUNCTUATION = ['。', '.', '.'].freeze
+      ROMAN_NUMERALS = ['i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x', 'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx', 'xxi', 'xxii', 'xxiii', 'xxiv', 'xxv', 'xxvi', 'xxvii', 'xxviii', 'xxix', 'xxx', 'xxxi', 'xxxii', 'xxxiii', 'xxxiv', 'xxxv', 'xxxvi', 'xxxvii', 'xxxviii', 'xxxix', 'xl', 'xli', 'xlii', 'xliii', 'xliv', 'xlv', 'xlvi', 'xlvii', 'xlviii', 'xlix', 'l', 'li', 'lii', 'liii', 'liv', 'lv', 'lvi', 'lvii', 'lviii', 'lix', 'lx', 'lxi', 'lxii', 'lxiii', 'lxiv', 'lxv', 'lxvi', 'lxvii', 'lxviii', 'lxix', 'lxx', 'lxxi', 'lxxii', 'lxxiii', 'lxxiv', 'lxxv', 'lxxvi', 'lxxvii', 'lxxviii', 'lxxix', 'lxxx', 'lxxxi', 'lxxxii', 'lxxxiii', 'lxxxiv', 'lxxxv', 'lxxxvi', 'lxxxvii', 'lxxxviii', 'lxxxix', 'xc', 'xci', 'xcii', 'xciii', 'xciv', 'xcv', 'xcvi', 'xcvii', 'xcviii', 'xcix'].freeze
+      SPECIAL_CHARACTERS = ['®', '©', '™'].freeze
+      ABBREVIATIONS = [].freeze
+      STOP_WORDS = [].freeze
+      CONTRACTIONS = {}.freeze
+      EMOJI_REGEX = /[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]/
+      PREFIX_EMOJI_REGEX = /(?<=\S)(?=[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}])/
+      POSTFIX_EMOJI_REGEX = /(?<=[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}])(?=\S)/
+      EMOTICON_REGEX = /(?::|;|=)(?:-)?(?:\)|D|P)/
 
       class SingleQuotes
         def handle_single_quotes(text)
+          # Convert left quotes to special character except for 'Twas or 'twas
+          text.gsub!(/(\W|^)'(?=.*\w)(?!twas)(?!Twas)/o) { $1 ? $1 + ' ' + PragmaticTokenizer::Languages::Common::PUNCTUATION_MAP["'"] + ' ' : ' ' + PragmaticTokenizer::Languages::Common::PUNCTUATION_MAP["'"] + ' ' } || text
           text.gsub!(/(\W|^)'(?=.*\w)/o, ' ' + PragmaticTokenizer::Languages::Common::PUNCTUATION_MAP["'"]) || text
           # Separate right single quotes
           text.gsub!(/(\w|\D)'(?!')(?=\W|$)/o) { $1 + ' ' + PragmaticTokenizer::Languages::Common::PUNCTUATION_MAP["'"] + ' ' } || text
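Of the new constants, `EMOTICON_REGEX` is the easiest to exercise directly; the three emoji regexes embed the same character class in lookarounds so emoji can be split off adjacent text:

```ruby
require 'pragmatic_tokenizer'

# EMOTICON_REGEX matches an eye (:, ; or =), an optional nose (-)
# and a mouth ( ), D or P).
"Hello ;-) :) =D".scan(PragmaticTokenizer::Languages::Common::EMOTICON_REGEX)
# => [";-)", ":)", "=D"]
```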
data/lib/pragmatic_tokenizer/languages/czech.rb CHANGED

@@ -2,9 +2,9 @@ module PragmaticTokenizer
   module Languages
     module Czech
       include Languages::Common
-      ABBREVIATIONS = []
-      STOP_WORDS = ["ačkoli", "ahoj", "ale", "anebo", "ano", "asi", "aspoň", "během", "bez", "beze", "blízko", "bohužel", "brzo", "bude", "budeme", "budeš", "budete", "budou", "budu", "byl", "byla", "byli", "bylo", "byly", "bys", "čau", "chce", "chceme", "chceš", "chcete", "chci", "chtějí", "chtít", "chut'", "chuti", "co", "čtrnáct", "čtyři", "dál", "dále", "daleko", "děkovat", "děkujeme", "děkuji", "den", "deset", "devatenáct", "devět", "do", "dobrý", "docela", "dva", "dvacet", "dvanáct", "dvě", "hodně", "já", "jak", "jde", "je", "jeden", "jedenáct", "jedna", "jedno", "jednou", "jedou", "jeho", "její", "jejich", "jemu", "jen", "jenom", "ještě", "jestli", "jestliže", "jí", "jich", "jím", "jimi", "jinak", "jsem", "jsi", "jsme", "jsou", "jste", "kam", "kde", "kdo", "kdy", "když", "ke", "kolik", "kromě", "která", "které", "kteří", "který", "kvůli", "má", "mají", "málo", "mám", "máme", "máš", "máte", "mé", "mě", "mezi", "mí", "mít", "mně", "mnou", "moc", "mohl", "mohou", "moje", "moji", "možná", "můj", "musí", "může", "my", "na", "nad", "nade", "nám", "námi", "naproti", "nás", "náš", "naše", "naši", "ne", "ně", "nebo", "nebyl", "nebyla", "nebyli", "nebyly", "něco", "nedělá", "nedělají", "nedělám", "neděláme", "neděláš", "neděláte", "nějak", "nejsi", "někde", "někdo", "nemají", "nemáme", "nemáte", "neměl", "němu", "není", "nestačí", "nevadí", "než", "nic", "nich", "ním", "nimi", "nula", "od", "ode", "on", "ona", "oni", "ono", "ony", "osm", "osmnáct", "pak", "patnáct", "pět", "po", "pořád", "potom", "pozdě", "před", "přes", "přese", "pro", "proč", "prosím", "prostě", "proti", "protože", "rovně", "se", "sedm", "sedmnáct", "šest", "šestnáct", "skoro", "smějí", "smí", "snad", "spolu", "sta", "sté", "sto", "ta", "tady", "tak", "takhle", "taky", "tam", "tamhle", "tamhleto", "tamto", "tě", "tebe", "tebou", "ted'", "tedy", "ten", "ti", "tisíc", "tisíce", "to", "tobě", "tohle", "toto", "třeba", "tři", "třináct", "trošku", "tvá", "tvé", "tvoje", "tvůj", "ty", "určitě", "už", "vám", "vámi", "vás", "váš", "vaše", "vaši", "ve", "večer", "vedle", "vlastně", "všechno", "všichni", "vůbec", "vy", "vždy", "za", "zač", "zatímco", "ze", "že", "aby", "aj", "ani", "az", "budem", "budes", "by", "byt", "ci", "clanek", "clanku", "clanky", "coz", "cz", "dalsi", "design", "dnes", "email", "ho", "jako", "jej", "jeji", "jeste", "ji", "jine", "jiz", "jses", "kdyz", "ktera", "ktere", "kteri", "kterou", "ktery", "ma", "mate", "mi", "mit", "muj", "muze", "nam", "napiste", "nas", "nasi", "nejsou", "neni", "nez", "nove", "novy", "pod", "podle", "pokud", "pouze", "prave", "pred", "pres", "pri", "proc", "proto", "protoze", "prvni", "pta", "re", "si", "strana", "sve", "svych", "svym", "svymi", "take", "takze", "tato", "tema", "tento", "teto", "tim", "timto", "tipy", "toho", "tohoto", "tom", "tomto", "tomuto", "tu", "tuto", "tyto", "uz", "vam", "vas", "vase", "vice", "vsak", "zda", "zde", "zpet", "zpravy", "a", "aniž", "až", "být", "což", "či", "článek", "článku", "články", "další", "i", "jenž", "jiné", "již", "jseš", "jšte", "k", "každý", "kteři", "ku", "me", "ná", "napište", "nechť", "ní", "nové", "nový", "o", "práve", "první", "přede", "při", "s", "sice", "své", "svůj", "svých", "svým", "svými", "také", "takže", "te", "těma", "této", "tím", "tímto", "u", "v", "více", "však", "všechen", "z", "zpět", "zprávy"]
-      CONTRACTIONS = {}
+      ABBREVIATIONS = [].freeze
+      STOP_WORDS = ["ačkoli", "ahoj", "ale", "anebo", "ano", "asi", "aspoň", "během", "bez", "beze", "blízko", "bohužel", "brzo", "bude", "budeme", "budeš", "budete", "budou", "budu", "byl", "byla", "byli", "bylo", "byly", "bys", "čau", "chce", "chceme", "chceš", "chcete", "chci", "chtějí", "chtít", "chut'", "chuti", "co", "čtrnáct", "čtyři", "dál", "dále", "daleko", "děkovat", "děkujeme", "děkuji", "den", "deset", "devatenáct", "devět", "do", "dobrý", "docela", "dva", "dvacet", "dvanáct", "dvě", "hodně", "já", "jak", "jde", "je", "jeden", "jedenáct", "jedna", "jedno", "jednou", "jedou", "jeho", "její", "jejich", "jemu", "jen", "jenom", "ještě", "jestli", "jestliže", "jí", "jich", "jím", "jimi", "jinak", "jsem", "jsi", "jsme", "jsou", "jste", "kam", "kde", "kdo", "kdy", "když", "ke", "kolik", "kromě", "která", "které", "kteří", "který", "kvůli", "má", "mají", "málo", "mám", "máme", "máš", "máte", "mé", "mě", "mezi", "mí", "mít", "mně", "mnou", "moc", "mohl", "mohou", "moje", "moji", "možná", "můj", "musí", "může", "my", "na", "nad", "nade", "nám", "námi", "naproti", "nás", "náš", "naše", "naši", "ne", "ně", "nebo", "nebyl", "nebyla", "nebyli", "nebyly", "něco", "nedělá", "nedělají", "nedělám", "neděláme", "neděláš", "neděláte", "nějak", "nejsi", "někde", "někdo", "nemají", "nemáme", "nemáte", "neměl", "němu", "není", "nestačí", "nevadí", "než", "nic", "nich", "ním", "nimi", "nula", "od", "ode", "on", "ona", "oni", "ono", "ony", "osm", "osmnáct", "pak", "patnáct", "pět", "po", "pořád", "potom", "pozdě", "před", "přes", "přese", "pro", "proč", "prosím", "prostě", "proti", "protože", "rovně", "se", "sedm", "sedmnáct", "šest", "šestnáct", "skoro", "smějí", "smí", "snad", "spolu", "sta", "sté", "sto", "ta", "tady", "tak", "takhle", "taky", "tam", "tamhle", "tamhleto", "tamto", "tě", "tebe", "tebou", "ted'", "tedy", "ten", "ti", "tisíc", "tisíce", "to", "tobě", "tohle", "toto", "třeba", "tři", "třináct", "trošku", "tvá", "tvé", "tvoje", "tvůj", "ty", "určitě", "už", "vám", "vámi", "vás", "váš", "vaše", "vaši", "ve", "večer", "vedle", "vlastně", "všechno", "všichni", "vůbec", "vy", "vždy", "za", "zač", "zatímco", "ze", "že", "aby", "aj", "ani", "az", "budem", "budes", "by", "byt", "ci", "clanek", "clanku", "clanky", "coz", "cz", "dalsi", "design", "dnes", "email", "ho", "jako", "jej", "jeji", "jeste", "ji", "jine", "jiz", "jses", "kdyz", "ktera", "ktere", "kteri", "kterou", "ktery", "ma", "mate", "mi", "mit", "muj", "muze", "nam", "napiste", "nas", "nasi", "nejsou", "neni", "nez", "nove", "novy", "pod", "podle", "pokud", "pouze", "prave", "pred", "pres", "pri", "proc", "proto", "protoze", "prvni", "pta", "re", "si", "strana", "sve", "svych", "svym", "svymi", "take", "takze", "tato", "tema", "tento", "teto", "tim", "timto", "tipy", "toho", "tohoto", "tom", "tomto", "tomuto", "tu", "tuto", "tyto", "uz", "vam", "vas", "vase", "vice", "vsak", "zda", "zde", "zpet", "zpravy", "a", "aniž", "až", "být", "což", "či", "článek", "článku", "články", "další", "i", "jenž", "jiné", "již", "jseš", "jšte", "k", "každý", "kteři", "ku", "me", "ná", "napište", "nechť", "ní", "nové", "nový", "o", "práve", "první", "přede", "při", "s", "sice", "své", "svůj", "svých", "svým", "svými", "také", "takže", "te", "těma", "této", "tím", "tímto", "u", "v", "více", "však", "všechen", "z", "zpět", "zprávy"].freeze
+      CONTRACTIONS = {}.freeze
     end
   end
 end