nlp_arabic 0.1.0
- checksums.yaml +7 -0
- data/.gitignore +9 -0
- data/.travis.yml +3 -0
- data/Gemfile +4 -0
- data/README.md +67 -0
- data/Rakefile +10 -0
- data/bin/console +14 -0
- data/bin/setup +7 -0
- data/lib/nlp_arabic.rb +263 -0
- data/lib/nlp_arabic/characters.rb +102 -0
- data/lib/nlp_arabic/version.rb +3 -0
- data/nlp_arabic.gemspec +23 -0
- metadata +82 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 9dce240342285bde2206509493990d37a44143df
  data.tar.gz: b55584ef3a20f4b60f1f0637108149922605f06d
SHA512:
  metadata.gz: 9d92a384d51125411cca0a89479c7889830b92e4c86eaf29241bad88029bf1bc704a916151b36438737d7de4725eb1d39bd151d077685010419e4d8022824598
  data.tar.gz: cd0855407749d5b51d22c43eef77027b53d8974d6ec2dcce27f7edee3577d122cd2c884fa26fbc8f04f09f7d0c0cc3f46d090c0dc504e8aba01d9606f5b7d914
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/README.md
ADDED
@@ -0,0 +1,67 @@
NlpArabic
=========

This gem is intended to contain tools for Arabic Natural Language Processing.
As of version 0.1, this gem allows you to:

1. Clean a text using a stop list. The stop list was generated from tf-idf scores computed on words from over 900 articles. The selected words were then checked and validated by hand, resulting in a stop list of over 270 words.

2. Stem a word or a text. The stemming algorithm used is the ISRI Arabic stemmer. It is described in the following research paper:

[Arabic Stemming without a root dictionary](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1428453&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F9755%2F30835%2F01428453.pdf%3Farnumber%3D1428453)

This root-extraction stemmer is similar to the Khoja stemmer but does not use a root dictionary, which can be laborious to maintain. Also, when the root cannot be found, the ISRI stemmer returns a normalized form rather than the original unmodified form. Overall, the ISRI stemmer has been shown to perform as well as, if not better than, the Khoja stemmer.

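The normalization at the heart of the stemmer (stripping diacritics, collapsing hamza carriers to a bare alef) can be sketched with plain `String#gsub`. The character classes below mirror the ones the gem ships in `lib/nlp_arabic/characters.rb`; `normalize` is a hypothetical helper name for this sketch, not part of the gem's API:

```ruby
# Minimal sketch of the stemmer's normalization steps, assuming the
# same character classes as lib/nlp_arabic/characters.rb.
DIACRITICS = /[\u064b-\u0652\u0640]/ # fathatan..sukun, plus the tatweel
HAMZAAS    = /[\u0621\u0624\u0626]/  # hamza, and waw/yeh carrying a hamza
ALEF       = "\u0627"

def normalize(word)
  word.gsub(DIACRITICS, '').gsub(HAMZAAS, ALEF)
end

# "مُدَرِّسَة" (teacher) loses its short vowels and shadda,
# leaving the bare skeleton "مدرسة":
puts normalize("\u0645\u064f\u062f\u064e\u0631\u0651\u0650\u0633\u064e\u0629")
```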
Installation
============

Add this line to your application's Gemfile:

```ruby
gem 'nlp_arabic'
```

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install nlp_arabic

## Usage

Once installed, you can use it like this:

`NlpArabic.clean_text(text)` will return the text without the stop words.

`NlpArabic.stem(word)` will return the stemmed word.

`NlpArabic.stem_text(text)` will stem an entire text.

`NlpArabic.clean_and_stem(text)` will do both.

`NlpArabic.wash_and_stem(text)` will stem the text, removing stop words and delimiters from it.

`NlpArabic.tokenize_text(text)` will break the text into an array of words and delimiters.

Each step of the ISRI algorithm is coded in a separate function, so you should be able to find the helper you are looking for just by browsing the code.

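Under the hood, cleaning is tokenization followed by `Array#-` against the stop list. A self-contained sketch of that pipeline (the two-entry stop list and the `tokenize`/`clean` helpers here are illustrative stand-ins, not the gem's 270-word list or its API):

```ruby
# Sketch of the tokenize-then-subtract pipeline behind text cleaning.
STOP_SUBSET = ["\u0641\u064a", "\u0645\u0646"].freeze # "في" (in), "من" (from)

def tokenize(text)
  # Split on whitespace and punctuation, keeping captured delimiters,
  # in the spirit of NlpArabic.tokenize_text.
  text.split(/\s|(\?+)|(\.+)|(\u060c+)|(\u061f+)/).delete_if(&:empty?)
end

def clean(text)
  (tokenize(text) - STOP_SUBSET).join(' ')
end

# The stop word "في" is dropped, leaving "ذهب الصباح":
puts clean("\u0630\u0647\u0628 \u0641\u064a \u0627\u0644\u0635\u0628\u0627\u062d")
```

Note that `Array#-` removes every occurrence of a stop word, which is exactly the behavior wanted here.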
Development
===========

After checking out the repo, run `bin/console` for an interactive prompt that will allow you to experiment. For now the gem doesn't have any dependencies, so you don't need to run `bin/setup`.

To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release` to create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).

Contributing
============
You are more than welcome to contribute to this project :) Please try to respect the Ruby style guidelines described [here](https://github.com/bbatsov/ruby-style-guide). The default encoding used is UTF-8.

1. Fork it ( https://github.com/othmanela/nlp_arabic/fork )
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Write unit tests and make sure all of them (including the old ones) pass
4. Commit your changes (`git commit -am 'Add some feature'`)
5. Push to the branch (`git push origin my-new-feature`)
6. Create a new Pull Request
data/Rakefile
ADDED
data/bin/console
ADDED
@@ -0,0 +1,14 @@
#!/usr/bin/env ruby

require "bundler/setup"
require "nlp_arabic"

# You can add fixtures and/or initialization code here to make experimenting
# with your gem easier. You can also use a different console, if you like.

# (If you use this, don't forget to add pry to your Gemfile!)
# require "pry"
# Pry.start

require "irb"
IRB.start
data/bin/setup
ADDED
data/lib/nlp_arabic.rb
ADDED
@@ -0,0 +1,263 @@
require "nlp_arabic/version"
require "nlp_arabic/characters"

module NlpArabic
  def self.stem(word)
    # Stems a word following the steps of the ISRI stemmer
    # Step 1: remove diacritics
    word = remove_diacritics(word)
    # Step 2: normalize hamza, waw and yeh forms to a bare alef
    word = normalize_hamzaas(word)
    # Step 3: remove prefixes of length three, then two
    word = remove_prefix(word)
    # Step 4: remove suffixes of length three, then two
    word = remove_suffix(word)
    # Step 5: remove the connective waw
    word = remove_waw(word)
    # Step 6: convert initial alef variants (optional)
    word = convert_initial_alef(word)
    # Step 7: process words longer than three letters
    if word.length == 4
      word = word_4(word)
    elsif word.length == 5
      word = pattern_53(word)
      word = word_5(word)
    elsif word.length == 6
      word = pattern_6(word)
      word = word_6(word)
    elsif word.length == 7
      word = short_suffix(word)
      word = short_prefix(word) if word.length == 7
      if word.length == 6
        word = pattern_6(word)
        word = word_6(word)
      end
    end
    return word
  end

  def self.clean_text(text)
    # Cleans the text using a stop list
    tokenized_text = NlpArabic.tokenize_text(text)
    clean_text = (tokenized_text - NlpArabic::STOP_LIST)
    return clean_text.join(' ')
  end

  def self.stem_text(text)
    # Only stems the text using the ISRI algorithm
    tokenized_text = NlpArabic.tokenize_text(text)
    for i in (0..(tokenized_text.length - 1))
      tokenized_text[i] = stem(tokenized_text[i]) if NlpArabic.is_alpha(tokenized_text[i])
    end
    return tokenized_text.join(' ')
  end

  def self.clean_and_stem(text)
    # Cleans the text using the stop list, then stems it
    tokenized_text = NlpArabic.tokenize_text(text)
    clean_text = (tokenized_text - NlpArabic::STOP_LIST)
    for i in (0..(clean_text.length - 1))
      clean_text[i] = stem(clean_text[i]) if NlpArabic.is_alpha(clean_text[i])
    end
    return clean_text.join(' ')
  end

  def self.tokenize_text(text)
    return text.split(/\s|(\?+)|(\.+)|(!+)|(\,+)|(\;+)|(\،+)|(\؟+)|(\:+)|(\(+)|(\)+)/).delete_if(&:empty?)
  end

  def self.wash_and_stem(text)
    # Stems the text after removing stop words and delimiters from it
    clean_text = text.gsub(/[._,،\"\':–%\/;·&?؟()\”\“]/, '').split - NlpArabic::STOP_LIST
    new_text = []
    for i in (0..(clean_text.length - 1))
      new_text << stem(clean_text[i]) if NlpArabic.is_alpha(clean_text[i])
    end
    new_text -= NlpArabic::STOP_LIST
    return new_text.join(' ')
  end

  def self.is_alpha(word)
    # Checks if a word contains only alphabetic characters
    return !!word.match(/^[[:alpha:]]+$/)
  end

  def self.remove_na_characters(word)
    # Cleans the word of non-alphabetic characters
    return word.strip.gsub(/[._,،\"\':–%\/;·&?؟()\”\“]/, '')
  end

  def self.remove_diacritics(word)
    # Removes the Arabic diacritics (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun) and the tatweel
    return word.gsub(/#{NlpArabic::DIACRITICS}/, '')
  end

  def self.convert_initial_alef(word)
    # Converts all the alef variants to a bare alef
    return word.gsub(/#{NlpArabic::ALIFS}/, NlpArabic::ALEF)
  end

  def self.normalize_hamzaas(word)
    # Normalizes the hamzaas to an alef
    return word.gsub(/#{NlpArabic::HAMZAAS}/, NlpArabic::ALEF)
  end

  def self.remove_prefix(word)
    # Removes the prefixes of length three, then the prefixes of length two
    if word.length >= 6
      return word[3..-1] if word.start_with?(*NlpArabic::P3)
    end
    if word.length >= 5
      return word[2..-1] if word.start_with?(*NlpArabic::P2)
    end
    return word
  end

  def self.remove_suffix(word)
    # Removes the suffixes of length three, then the suffixes of length two
    if word.length >= 6
      return word[0..-4] if word.end_with?(*NlpArabic::S3)
    end
    if word.length >= 5
      return word[0..-3] if word.end_with?(*NlpArabic::S2)
    end
    return word
  end

  def self.remove_waw(word)
    # Removes the letter و if the word starts with a double waw
    if word.length >= 4
      return word[1..-1] if word.start_with?(*NlpArabic::DOUBLE_WAW)
    end
    return word
  end

  def self.word_4(word)
    # Processes the words of length four
    if NlpArabic::PR4[0].include? word[0]
      return word[1..-1]
    elsif NlpArabic::PR4[1].include? word[1]
      word[1] = ''
    elsif NlpArabic::PR4[2].include? word[2]
      word[2] = ''
    elsif NlpArabic::PR4[3].include? word[3]
      word[3] = ''
    else
      word = short_suffix(word)
      word = short_prefix(word) if word.length == 4
    end
    return word
  end

  def self.word_5(word)
    # Processes the words of length five
    if word.length == 4
      word = word_4(word)
    elsif word.length == 5
      word = pattern_54(word)
    end
    return word
  end

  def self.pattern_53(word)
    # Helper that processes the length-five patterns and extracts the length-three roots.
    # The include? calls are parenthesized so that the && is not swallowed into the argument.
    if NlpArabic::PR53[0].include?(word[2]) && word[0] == NlpArabic::ALEF
      word = word[1] + word[3..-1]
    elsif NlpArabic::PR53[1].include?(word[3]) && word[0] == NlpArabic::MEEM
      word = word[1..2] + word[4]
    elsif NlpArabic::PR53[2].include?(word[0]) && word[4] == NlpArabic::TEH_MARBUTA
      word = word[1..3]
    elsif NlpArabic::PR53[3].include?(word[0]) && word[2] == NlpArabic::TEH
      word = word[1] + word[3..-1]
    elsif NlpArabic::PR53[4].include?(word[0]) && word[2] == NlpArabic::ALEF
      word = word[1] + word[3..-1]
    elsif NlpArabic::PR53[5].include?(word[2]) && word[4] == NlpArabic::TEH_MARBUTA
      word = word[0..1] + word[3]
    elsif NlpArabic::PR53[6].include?(word[0]) && word[1] == NlpArabic::NOON
      word = word[2..-1]
    elsif word[3] == NlpArabic::ALEF && word[0] == NlpArabic::ALEF
      word = word[1..2] + word[4]
    elsif word[4] == NlpArabic::NOON && word[3] == NlpArabic::ALEF
      word = word[0..2]
    elsif word[3] == NlpArabic::YEH && word[0] == NlpArabic::TEH
      word = word[1..3] + word[4]
    elsif word[3] == NlpArabic::WAW && word[0] == NlpArabic::ALEF
      word = word[0] + word[2] + word[4]
    elsif word[2] == NlpArabic::ALEF && word[1] == NlpArabic::WAW
      word = word[0] + word[3..-1]
    elsif word[3] == NlpArabic::YEH_WITH_HAMZA_ABOVE && word[2] == NlpArabic::ALEF
      word = word[0..1] + word[4]
    elsif word[4] == NlpArabic::TEH_MARBUTA && word[1] == NlpArabic::ALEF
      word = word[0] + word[2..3]
    elsif word[4] == NlpArabic::YEH && word[2] == NlpArabic::ALEF
      word = word[0..1] + word[3]
    else
      word = short_suffix(word)
      word = short_prefix(word) if word.length == 5
    end
    return word
  end

  def self.pattern_54(word)
    # Helper that processes the length-five patterns and extracts the length-four roots
    if NlpArabic::PR53[2].include? word[0]
      word = word[1..-1]
    elsif word[4] == NlpArabic::TEH_MARBUTA
      word = word[0..3]
    elsif word[2] == NlpArabic::ALEF
      word = word[0..1] + word[3..-1]
    end
    return word
  end

  def self.word_6(word)
    # Processes the words of length six
    if word.length == 5
      word = pattern_53(word)
      word = word_5(word)
    elsif word.length == 6
      word = pattern_64(word)
    end
    return word
  end

  def self.pattern_6(word)
    # Helper that processes the length-six patterns and extracts the length-three roots
    if word.start_with?(*NlpArabic::IST) || word.start_with?(*NlpArabic::MST)
      word = word[3..-1]
    elsif word[0] == NlpArabic::MEEM && word[3] == NlpArabic::ALEF && word[5] == NlpArabic::TEH_MARBUTA
      word = word[1..2] + word[4]
    elsif word[0] == NlpArabic::ALEF && word[2] == NlpArabic::TEH && word[4] == NlpArabic::ALEF
      word = word[1] + word[3] + word[5]
    elsif word[0] == NlpArabic::ALEF && word[3] == NlpArabic::WAW && word[2] == word[4]
      word = word[1] + word[4..-1]
    elsif word[0] == NlpArabic::TEH && word[2] == NlpArabic::ALEF && word[4] == NlpArabic::YEH
      word = word[1] + word[3] + word[5]
    else
      word = short_suffix(word)
      word = short_prefix(word) if word.length == 6
    end
    return word
  end

  def self.pattern_64(word)
    # Helper that processes the length-six patterns and extracts the length-four roots
    if word[0] == NlpArabic::ALEF && word[4] == NlpArabic::ALEF
      word = word[1..3] + word[5]
    else
      word = word[2..-1]
    end
    return word
  end

  def self.short_prefix(word)
    # Removes the short prefixes
    word = word[1..-1] if word.start_with?(*NlpArabic::P1)
    return word
  end

  def self.short_suffix(word)
    # Removes the short suffixes
    word = word[0..-2] if word.end_with?(*NlpArabic::S1)
    return word
  end
end
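The affix-removal helpers above only strip an affix when enough letters would remain for a root. A runnable sketch of `remove_prefix`, inlining the `P3`/`P2` constants from `characters.rb`:

```ruby
# Prefixes of length three and two, as in NlpArabic::P3 / NlpArabic::P2.
P3 = ["\u0643\u0627\u0644", "\u0628\u0627\u0644", "\u0648\u0644\u0644", "\u0648\u0627\u0644"].freeze
P2 = ["\u0627\u0644", "\u0644\u0644"].freeze

def remove_prefix(word)
  # Strip only when enough letters remain to hold a trilateral root.
  return word[3..-1] if word.length >= 6 && word.start_with?(*P3)
  return word[2..-1] if word.length >= 5 && word.start_with?(*P2)
  word
end

# "والكتاب" (and-the-book) loses the three-letter prefix "وال",
# leaving "كتاب":
puts remove_prefix("\u0648\u0627\u0644\u0643\u062a\u0627\u0628")
```

Short words pass through untouched, which keeps very common two- and three-letter words from being hollowed out.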
data/lib/nlp_arabic/characters.rb
ADDED
@@ -0,0 +1,102 @@
module NlpArabic

  # Stop list
  STOP_LIST = ["\u0648","\u064a\u0643\u0648\u0646","\u0644\u064A\u0633","\u0648\u0644\u064a\u0633","\u0648\u0643\u0627\u0646","\u0643\u0630\u0644\u0643","\u0627\u0644\u062a\u064a","\u0648\u0628\u064a\u0646",
    "\u0639\u0644\u064a\u0647\u0627","\u0639\u0644\u064A","\u0645\u0633\u0627\u0621","\u0627\u0644\u0630\u064a","\u0648\u0643\u0627\u0646\u062a","\u0644\u0643\u0646","\u0648\u0644\u0643\u0646","\u0648\u0627\u0644\u062a\u064a",
    "\u062a\u0643\u0648\u0646","\u0627\u0644\u064a\u0648\u0645","\u0627\u0644\u0644\u0630\u064a\u0646","\u0639\u0644\u064a\u0647","\u0643\u0627\u0646\u062a",
    "\u0644\u0630\u0644\u0643","\u0623\u0645\u0627\u0645","\u0647\u0646\u0627","\u0647\u0646\u0627\u0643","\u0645\u0646\u0647\u0627","\u0645\u0627\u0632\u0627\u0644","\u0644\u0627\u0632\u0627\u0644",
    "\u0644\u0627\u064a\u0632\u0627\u0644","\u0645\u0627\u064a\u0632\u0627\u0644","\u0627\u0635\u0628\u062d","\u0623\u0635\u0628\u062d","\u0623\u0645\u0633\u0649",
    "\u0627\u0645\u0633\u0649","\u0623\u0636\u062d\u0649","\u0627\u0636\u062d\u0649","\u0645\u0627\u0628\u0631\u062d","\u0645\u0627\u0641\u062a\u0626","\u0645\u0627\u0627\u0646\u0641\u0643",
    "\u0644\u0627\u0633\u064a\u0645\u0627","\u0648\u0644\u0627\u064a\u0632\u0627\u0644","\u0627\u0644\u062d\u0627\u0644\u064a","\u0627\u0644\u064a\u0647\u0627","\u0627\u0644\u0630\u064a\u0646","\u0641\u0627\u0646\u0647",
    "\u0648\u0627\u0644\u0630\u064a","\u0648\u0647\u0630\u0627","\u0644\u0647\u0630\u0627","\u0641\u0643\u0627\u0646","\u0633\u062a\u0643\u0648\u0646","\u0627\u0644\u064a\u0647",
    "\u064a\u0645\u0643\u0646","\u0628\u0647\u0630\u0627","\u0627\u0644\u0630\u0649","\u0641\u0649","\u0641\u064a","\u0643\u0644","\u0644\u0645","\u0644\u0646","\u0644\u0647","\u0645\u0646","\u0647\u0648",
    "\u0643\u0645\u0627","\u0644\u0647\u0627","\u0645\u0646\u0630","\u0642\u062F","\u0648\u0642\u062F","\u0648\u0644\u0627","\u0648\u0642\u0627\u0644","\u0648\u0642\u0627\u0644\u062A",
    "\u0644\u0644\u0627\u0645\u0645","\u0641\u064A\u0647","\u0643\u0644\u0645","\u0648\u0641\u064A","\u0648\u0642\u0641","\u0648\u0644\u0645","\u0648\u0645\u0646","\u0648\u0647\u0648","\u0648\u0647\u064A",
    "\u062D\u064A\u062B","\u0627\u0643\u062F","\u0627\u0644\u0627","\u0627\u0645\u0627","\u0627\u0645\u0633","\u0627\u0644\u0633\u0627\u0628\u0642","\u0627\u0644\u062A\u0649","\u0627\u0643\u062B\u0631",
    "\u0627\u064A\u0627\u0631","\u0627\u064A\u0636\u0627","\u0627\u0644\u0630\u0627\u062A\u064A","\u0627\u0644\u0627\u062E\u064A\u0631\u0629","\u0627\u0644\u0627\u0646","\u0627\u0645\u0627\u0645","\u0627\u064A\u0627\u0645",
    "\u062E\u0644\u0627\u0644","\u062D\u0648\u0627\u0644\u0649","\u0630\u0644\u0643","\u062F\u0648\u0646","\u062D\u0648\u0644","\u062D\u064A\u0646","\u0627\u0644\u0641","\u0627\u0644\u0649","\u0648\u062A\u0645",
    "\u0627\u0646\u0647","\u0627\u0648\u0644","\u0636\u0645\u0646","\u0627\u0646\u0647\u0627","\u062C\u0645\u064A\u0639","\u0627\u0644\u0645\u0627\u0636\u064A","\u0627\u0644\u0648\u0642\u062A",
    "\u0627\u0644\u0645\u0642\u0628\u0644","\u0644\u0627","\u0645\u0627","\u0645\u0639","\u0647\u0630\u0627","\u0648\u0627\u062D\u062F","\u0641\u0627\u0646","\u0642\u0627\u0644","\u0643\u0627\u0646",
    "\u0644\u062F\u0649","\u0646\u062D\u0648","\u0647\u0630\u0647","\u0648\u0627\u0646","\u0648\u0627\u0643\u062F","\u0639\u0634\u0631","\u0639\u062F\u062F","\u0639\u062F\u0629","\u0639\u0634\u0631\u0629","\u0639\u062F\u0645",
    "\u0639\u0627\u0645","\u0639\u0627\u0645\u0627","\u0639\u0646","\u0639\u0646\u062F","\u0639\u0646\u062F\u0645\u0627","\u0639\u0644\u0649","\u0633\u0646\u0629","\u0633\u0646\u0648\u0627\u062A","\u062A\u0645","\u0636\u062F",
    "\u0628\u0639\u062F","\u0628\u0639\u0636","\u0627\u0639\u0627\u062F\u0629","\u0627\u0639\u0644\u0646\u062A","\u0628\u0633\u0628\u0628","\u062D\u062A\u0649","\u0627\u0630\u0627","\u0627\u062D\u062F","\u0645\u0645\u0646",
    "\u0627\u062B\u0631","\u063A\u062F\u0627","\u0634\u062E\u0635\u0627","\u0635\u0628\u0627\u062D","\u0627\u0637\u0627\u0631","\u0627\u0631\u0628\u0639\u0629","\u0627\u062E\u0631\u0649","\u0628\u0627\u0646",
    "\u0627\u062C\u0644","\u063A\u064A\u0631","\u0628\u0634\u0643\u0644","\u062D\u0627\u0644\u064A\u0627","\u0628\u0646","\u0628\u0647","\u062B\u0645","\u0627\u0641","\u0627\u0646","\u0627\u0648","\u0627\u064A",
    "\u0628\u0647\u0627","\u0635\u0641\u0631","\u0627\u0644\u062B\u0627\u0646\u064A","\u0627\u0644\u062B\u0627\u0646\u064A\u0629","\u0627\u062F\u0627","\u0627\u0648\u0644\u0627","\u0648\u0644\u0643\u0646\u0647",
    "\u0627\u0644\u0627\u0648\u0644","\u0627\u0644\u0627\u0648\u0644\u0649","\u0628\u064A\u0646","\u0630\u0644\u0643","\u0645\u0645\u0627","\u0631\u063A\u0645","\u0628\u064A","\u0644\u0627\u0646","\u0647\u0644","\u0644\u0648",
    "\u0628\u0645\u0627","\u0627\u0646\u0627","\u062A\u064A","\u0628\u0644\u0627","\u0642\u0628\u0644","\u0627\u0644\u0646","\u064A\u0627\u0647","\u0644\u062F\u064A","\u0628\u0644","\u0644\u0646\u0627","\u0627\u0645",
    "\u0627\u0646\u0646\u0627","\u0644\u0642\u062F","\u062D\u064A\u062A","\u0627\u0630\u0646","\u0627\u0644\u064A","\u0628\u0630\u0644\u0643","\u062E\u0644\u0644","\u062D\u0648\u0644","\u0644\u0643","\u062A\u0645\u0627",
    "\u0644\u0645\u0646","\u0644\u0646\u0647","\u0627\u0644\u0627","\u0627\u064A\u0646","\u0639\u0645\u0627","\u0628\u0643\u0644","\u0648\u0647\u0646\u0627\u0643","\u0646\u0647\u0627",
    "\u0648\u0647\u0630\u0647","\u0648\u0645\u0627","\u0647\u0645\u0627","\u0648\u0647\u0645","\u0644\u0647\u0630\u0647","\u0639\u0646\u0647","\u0645\u062A\u0646","\u0644\u0645\u0627","\u0643\u0645","\u0645\u062A\u0649",
    "\u0647\u0643\u0630\u0627","\u0627\u064A\u0647","\u0644\u0643\u0646\u0647","\u062A\u0645","\u0644\u064A\u0643","\u0648\u0644\u0643","\u0644\u0645\u0630\u0627","\u062C\u062F","\u0641\u0641\u064A","\u062F\u064A","\u0625\u064A",
    "\u0635\u0641\u0631","\u0648\u0627\u062D\u062F","\u0627\u062B\u0646\u0627\u0646","\u062B\u0644\u0627\u062B\u0629","\u0623\u0631\u0628\u0639\u0629","\u062E\u0645\u0633\u0629","\u0633\u062A\u0629","\u0633\u0628\u0639\u0629",
    "\u062B\u0645\u0627\u0646\u064A\u0629","\u062A\u0633\u0639\u0629","\u0639\u0634\u0631\u0629","\u0639\u0634\u0631","\u0623\u062D\u062F",
    "\u0627\u062B\u0646\u0627","\u062B\u0644\u0627\u062B\u0629","\u0623\u0631\u0628\u0639\u0629","\u062E\u0645\u0633\u0629","\u0633\u062A\u0629",
    "\u0633\u0628\u0639\u0629","\u062B\u0645\u0627\u0646\u064A\u0629","\u062A\u0633\u0639\u0629","\u0639\u0634\u0631\u0648\u0646","\u062B\u0644\u0627\u062B\u0648\u0646",
    "\u0623\u0631\u0628\u0639\u0648\u0646","\u062E\u0645\u0633\u0648\u0646","\u0633\u062A\u0648\u0646","\u0633\u0628\u0639\u0648\u0646","\u062B\u0645\u0627\u0646\u0648\u0646","\u062A\u0633\u0639\u0648\u0646","\u0645\u0626\u0629",
    "\u0645\u0627\u0626\u0629","\u0623\u0646\u0627","\u0627\u0646\u062A","\u0627\u0646\u062A\u064E","\u0627\u0646\u062A\u0649","\u0627\u0646\u062A\u0650","\u0647\u0648","\u0647\u064A","\u0646\u062D\u0646","\u0623\u0646\u062A\u0645\u0627",
    "\u0647\u0645\u0627","\u0623\u0646\u062A\u0645","\u0623\u0646\u062A\u0646","\u0647\u0645","\u0647\u0646"].freeze

  # Diacritics (fathatan..sukun) and the tatweel
  DIACRITICS = "[\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0640]"

  # Initial alef variants
  ALIFS = "[\u0622\u0623\u0625\u0671]"

  # Hamzaas
  HAMZAAS = "[\u0621\u0624\u0626]"

  # Affix sets
  # Prefixes of length three
  P3 = ["\u0643\u0627\u0644", "\u0628\u0627\u0644", "\u0648\u0644\u0644", "\u0648\u0627\u0644"].freeze

  # Prefixes of length two
  P2 = ["\u0627\u0644", "\u0644\u0644"].freeze

  # Prefixes of length one
  P1 = ["\u0644", "\u0628", "\u0641", "\u0633", "\u0648", "\u064a", "\u062a", "\u0646", "\u0627"].freeze

  # Suffixes of length three
  S3 = ["\u062a\u0645\u0644", "\u0647\u0645\u0644", "\u062a\u0627\u0646", "\u062a\u064a\u0646", "\u0643\u0645\u0644"].freeze

  # Suffixes of length two
  S2 = ["\u0648\u0646", "\u0627\u062a", "\u0627\u0646", "\u064a\u0646", "\u062a\u0646", "\u0643\u0645", "\u0647\u0646", "\u0646\u0627", "\u064a\u0627",
    "\u0647\u0627", "\u062a\u0645", "\u0643\u0646", "\u0646\u064a", "\u0648\u0627", "\u0645\u0627", "\u0647\u0645"].freeze

  # Suffixes of length one
  S1 = ["\u0629", "\u0647", "\u064a", "\u0643", "\u062a", "\u0627", "\u0646"].freeze

  # Patterns and roots
  # Patterns of length four
  PR4 = { 0 => ["\u0645"],
          1 => ["\u0627"],
          2 => ["\u0627", "\u0648", "\u064A"],
          3 => ["\u0629"] }.freeze

  # Patterns of length five with length-three roots
  PR53 = { 0 => ["\u0627", "\u062a"],
           1 => ["\u0627", "\u064a", "\u0648"],
           2 => ["\u0627", "\u062a", "\u0645"],
           3 => ["\u0645", "\u064a", "\u062a"],
           4 => ["\u0645", "\u062a"],
           5 => ["\u0627", "\u0648"],
           6 => ["\u0627", "\u0645"] }.freeze

  # Letters
  DOUBLE_WAW = "\u0648\u0648"
  ALEF = "\u0627"
  MEEM = "\u0645"
  TEH_MARBUTA = "\u0629"
  TEH = "\u062a"
  NOON = "\u0646"
  YEH = "\u064a"
  WAW = "\u0648"
  YEH_WITH_HAMZA_ABOVE = "\u0626"

  # Stems
  IST = "\u0627\u0633\u062a"
  MST = "\u0645\u0633\u062a"
  MT = "\u0645\u062a"

end
data/nlp_arabic.gemspec
ADDED
@@ -0,0 +1,23 @@
# coding: utf-8
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'nlp_arabic/version'

Gem::Specification.new do |spec|
  spec.name          = "nlp_arabic"
  spec.version       = NlpArabic::VERSION
  spec.authors       = ["Othmane Laousy"]
  spec.email         = ["othmane.laousy@gmail.com"]

  spec.summary       = %q{Natural Language Processing Tools for Arabic}
  spec.description   = %q{This gem is intended to contain tools for Arabic Natural Language Processing.}
  spec.homepage      = "https://github.com/othmanela/nlp_arabic"

  spec.files         = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test)/}) }
  spec.bindir        = "exe"
  spec.executables   = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
  spec.require_paths = ["lib"]

  spec.add_development_dependency "bundler", "~> 1.9"
  spec.add_development_dependency "rake", "~> 10.0"
end
metadata
ADDED
@@ -0,0 +1,82 @@
--- !ruby/object:Gem::Specification
name: nlp_arabic
version: !ruby/object:Gem::Version
  version: 0.1.0
platform: ruby
authors:
- Othmane Laousy
autorequire:
bindir: exe
cert_chain: []
date: 2015-05-11 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '1.9'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '1.9'
- !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '10.0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '10.0'
description: This gem is intended to contain tools for Arabic Natural Language Processing.
email:
- othmane.laousy@gmail.com
executables: []
extensions: []
extra_rdoc_files: []
files:
- ".gitignore"
- ".travis.yml"
- Gemfile
- README.md
- Rakefile
- bin/console
- bin/setup
- lib/nlp_arabic.rb
- lib/nlp_arabic/characters.rb
- lib/nlp_arabic/version.rb
- nlp_arabic.gemspec
homepage: https://github.com/othmanela/nlp_arabic
licenses: []
metadata: {}
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project:
rubygems_version: 2.4.6
signing_key:
specification_version: 4
summary: Natural Language Processing Tools for Arabic
test_files: []