nlp_arabic 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.gitignore +9 -0
- data/.travis.yml +3 -0
- data/Gemfile +4 -0
- data/README.md +67 -0
- data/Rakefile +10 -0
- data/bin/console +14 -0
- data/bin/setup +7 -0
- data/lib/nlp_arabic.rb +263 -0
- data/lib/nlp_arabic/characters.rb +102 -0
- data/lib/nlp_arabic/version.rb +3 -0
- data/nlp_arabic.gemspec +23 -0
- metadata +82 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 9dce240342285bde2206509493990d37a44143df
+  data.tar.gz: b55584ef3a20f4b60f1f0637108149922605f06d
+SHA512:
+  metadata.gz: 9d92a384d51125411cca0a89479c7889830b92e4c86eaf29241bad88029bf1bc704a916151b36438737d7de4725eb1d39bd151d077685010419e4d8022824598
+  data.tar.gz: cd0855407749d5b51d22c43eef77027b53d8974d6ec2dcce27f7edee3577d122cd2c884fa26fbc8f04f09f7d0c0cc3f46d090c0dc504e8aba01d9606f5b7d914
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/README.md
ADDED
@@ -0,0 +1,67 @@
+NlpArabic
+=========
+
+This gem is intended to contain tools for Arabic Natural Language Processing.
+As of version 0.1, this toolkit gem allows you to:
+
+1. Clean a text using a stop list. This stop list was generated using the tf-idf score calculated on words from over 900 articles. The selected words were also checked and validated by hand, which resulted in a stop list of over 270 words.
+
+2. Stem a word or a text. The stemming algorithm used is the ISRI Arabic stemmer. It is described in the following research paper:
+
+[Arabic Stemming without a root dictionary](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1428453&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F9755%2F30835%2F01428453.pdf%3Farnumber%3D1428453)
+
+This root-extraction stemmer is similar to the Khoja stemmer but does not use a root dictionary, which can be laborious to maintain. Also, when the root cannot be found, the ISRI stemmer returns a normalized form rather than the original unmodified word. Overall, the ISRI stemmer has been shown to perform as well as, if not better than, the Khoja stemmer.
+
+
+Installation
+============
+
+Add this line to your application's Gemfile:
+
+```ruby
+gem 'nlp_arabic'
+```
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install nlp_arabic
+
+## Usage
+
+Once installed, you can use it like this:
+
+NlpArabic.clean_text(text) will return the text without the stop words.
+
+NlpArabic.stem(word) will return the stemmed word.
+
+NlpArabic.stem_text(text) will stem an entire text.
+
+NlpArabic.clean_and_stem(text) will do both.
+
+NlpArabic.wash_and_stem(text) will stem the text, removing stop words and delimiters from it.
+
+NlpArabic.tokenize_text(text) will break the text into an array of words and delimiters.
+
+Each step of the ISRI algorithm is coded in a separate function, so you should be able to find the helper function you are looking for just by browsing the code.
+
+Development
+===========
+
+After checking out the repo, run `bin/console` for an interactive prompt that will allow you to experiment. For now the gem doesn't have any runtime dependencies, so you don't need to run `bin/setup`.
+
+To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release` to create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+
+Contributing
+============
+You are more than welcome to contribute to this project :) Please try to respect the Ruby style guidelines described [here](https://github.com/bbatsov/ruby-style-guide). The default encoding used is UTF-8.
+
+1. Fork it ( https://github.com/othmanela/nlp_arabic/fork )
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Write unit tests and make sure all of them (including the old ones) pass
+4. Commit your changes (`git commit -am 'Add some feature'`)
+5. Push to the branch (`git push origin my-new-feature`)
+6. Create a new Pull Request
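The cleaning step described in the README (tokenize the text, then subtract the stop list) can be sketched in plain Ruby. This is a simplified illustration, not the gem's actual implementation: `clean_text_sketch` is a hypothetical name, and `STOP_WORDS` below is a tiny illustrative subset standing in for the gem's full tf-idf-derived list of 270+ words.

```ruby
# Simplified sketch of stop-word cleaning: split the text into tokens,
# drop any token that appears in the stop list, and rejoin with spaces.
# STOP_WORDS is a tiny illustrative subset, not the gem's full list.
STOP_WORDS = ["\u0641\u064a", "\u0645\u0646"].freeze # "fi" (in), "min" (from)

def clean_text_sketch(text)
  tokens = text.split(/[\s?.,!;:()]+/).reject(&:empty?)
  (tokens - STOP_WORDS).join(' ')
end
```

For example, cleaning a phrase containing the stop word في drops it and leaves the content words joined by single spaces.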
data/Rakefile
ADDED
data/bin/console
ADDED
@@ -0,0 +1,14 @@
+#!/usr/bin/env ruby
+
+require "bundler/setup"
+require "nlp_arabic"
+
+# You can add fixtures and/or initialization code here to make experimenting
+# with your gem easier. You can also use a different console, if you like.
+
+# (If you use this, don't forget to add pry to your Gemfile!)
+# require "pry"
+# Pry.start
+
+require "irb"
+IRB.start
data/bin/setup
ADDED
data/lib/nlp_arabic.rb
ADDED
@@ -0,0 +1,263 @@
+require "nlp_arabic/version"
+require "nlp_arabic/characters"
+
+module NlpArabic
+  def self.stem(word)
+    # Stems a word following the steps of the ISRI stemmer
+    # Step 1: remove diacritics
+    word = remove_diacritics(word)
+    # Step 2: normalize hamza, waw and yeh forms to a bare alef
+    word = normalize_hamzaas(word)
+    # Step 3: remove prefixes of size 3, then 2
+    word = remove_prefix(word)
+    # Step 4: remove suffixes of size 3, then 2
+    word = remove_suffix(word)
+    # Step 5: remove the connective waw
+    word = remove_waw(word)
+    # Step 6: convert the initial alef (optional)
+    word = convert_initial_alef(word)
+    # Step 7: if the length of the word is greater than 3
+    if word.length == 4
+      word = word_4(word)
+    elsif word.length == 5
+      word = pattern_53(word)
+      word = word_5(word)
+    elsif word.length == 6
+      word = pattern_6(word)
+      word = word_6(word)
+    elsif word.length == 7
+      word = short_suffix(word)
+      word = short_prefix(word) if word.length == 7
+      if word.length == 6
+        word = pattern_6(word)
+        word = word_6(word)
+      end
+    end
+    return word
+  end
+
+  def self.clean_text(text)
+    # Cleans the text using a stop list
+    tokenized_text = NlpArabic.tokenize_text(text)
+    clean_text = (tokenized_text - NlpArabic::STOP_LIST)
+    return clean_text.join(' ')
+  end
+
+  def self.stem_text(text)
+    # Only stems the text using the ISRI algorithm
+    tokenized_text = NlpArabic.tokenize_text(text)
+    for i in (0..(tokenized_text.length - 1))
+      tokenized_text[i] = stem(tokenized_text[i]) if NlpArabic.is_alpha(tokenized_text[i])
+    end
+    return tokenized_text.join(' ')
+  end
+
+  def self.clean_and_stem(text)
+    # Cleans the text using the stop list, then stems it
+    tokenized_text = NlpArabic.tokenize_text(text)
+    clean_text = (tokenized_text - NlpArabic::STOP_LIST)
+    for i in (0..(clean_text.length - 1))
+      clean_text[i] = stem(clean_text[i]) if NlpArabic.is_alpha(clean_text[i])
+    end
+    return clean_text.join(' ')
+  end
+
+  def self.tokenize_text(text)
+    return text.split(/\s|(\?+)|(\.+)|(!+)|(\,+)|(\;+)|(\،+)|(\؟+)|(\:+)|(\(+)|(\)+)/).delete_if(&:empty?)
+  end
+
+  def self.wash_and_stem(text)
+    # Stems the text after stripping stop words and delimiters from it
+    clean_text = text.gsub(/[._,،\"\':–%\/;·&?؟()\”\“]/, '').split - NlpArabic::STOP_LIST
+    new_text = []
+    for i in (0..(clean_text.length - 1))
+      new_text << stem(clean_text[i]) if NlpArabic.is_alpha(clean_text[i])
+    end
+    new_text -= NlpArabic::STOP_LIST
+    return new_text.join(' ')
+  end
+
+  def self.is_alpha(word)
+    # Checks if a word consists only of alphabetic characters
+    return !!word.match(/^[[:alpha:]]+$/)
+  end
+
+  def self.remove_na_characters(word)
+    # Cleans the word of non-alphanumeric (punctuation) characters
+    return word.strip.gsub(/[._,،\"\':–%\/;·&?؟()\”\“]/, '')
+  end
+
+  def self.remove_diacritics(word)
+    # Removes the Arabic diacritics (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun) and tatweel
+    return word.gsub(/#{NlpArabic::DIACRITICS}/, '')
+  end
+
+  def self.convert_initial_alef(word)
+    # Converts all the types of alef to a bare alef
+    return word.gsub(/#{NlpArabic::ALIFS}/, NlpArabic::ALEF)
+  end
+
+  def self.normalize_hamzaas(word)
+    # Normalizes the hamzaas to an alef
+    return word.gsub(/#{NlpArabic::HAMZAAS}/, NlpArabic::ALEF)
+  end
+
+  def self.remove_prefix(word)
+    # Removes the prefixes of length three, then the prefixes of length two
+    if word.length >= 6
+      return word[3..-1] if word.start_with?(*NlpArabic::P3)
+    end
+    if word.length >= 5
+      return word[2..-1] if word.start_with?(*NlpArabic::P2)
+    end
+    return word
+  end
+
+  def self.remove_suffix(word)
+    # Removes the suffixes of length three, then the suffixes of length two
+    if word.length >= 6
+      return word[0..-4] if word.end_with?(*NlpArabic::S3)
+    end
+    if word.length >= 5
+      return word[0..-3] if word.end_with?(*NlpArabic::S2)
+    end
+    return word
+  end
+
+  def self.remove_waw(word)
+    # Removes the letter و if the word starts with a double waw
+    if word.length >= 4
+      return word[1..-1] if word.start_with?(NlpArabic::DOUBLE_WAW)
+    end
+    return word
+  end
+
+  def self.word_4(word)
+    # Processes the words of length four
+    if NlpArabic::PR4[0].include? word[0]
+      return word[1..-1]
+    elsif NlpArabic::PR4[1].include? word[1]
+      word[1] = ''
+    elsif NlpArabic::PR4[2].include? word[2]
+      word[2] = ''
+    elsif NlpArabic::PR4[3].include? word[3]
+      word[3] = ''
+    else
+      word = short_suffix(word)
+      word = short_prefix(word) if word.length == 4
+    end
+    return word
+  end
+
+  def self.word_5(word)
+    # Processes the words of length five
+    if word.length == 4
+      word = word_4(word)
+    elsif word.length == 5
+      word = pattern_54(word)
+    end
+    return word
+  end
+
+  def self.pattern_53(word)
+    # Helper function that processes the length-five patterns and extracts the length-three roots
+    if NlpArabic::PR53[0].include?(word[2]) && word[0] == NlpArabic::ALEF
+      word = word[1] + word[3..-1]
+    elsif NlpArabic::PR53[1].include?(word[3]) && word[0] == NlpArabic::MEEM
+      word = word[1..2] + word[4]
+    elsif NlpArabic::PR53[2].include?(word[0]) && word[4] == NlpArabic::TEH_MARBUTA
+      word = word[1..3]
+    elsif NlpArabic::PR53[3].include?(word[0]) && word[2] == NlpArabic::TEH
+      word = word[1] + word[3..-1]
+    elsif NlpArabic::PR53[4].include?(word[0]) && word[2] == NlpArabic::ALEF
+      word = word[1] + word[3..-1]
+    elsif NlpArabic::PR53[5].include?(word[2]) && word[4] == NlpArabic::TEH_MARBUTA
+      word = word[0..1] + word[3]
+    elsif NlpArabic::PR53[6].include?(word[0]) && word[1] == NlpArabic::NOON
+      word = word[2..-1]
+    elsif word[3] == NlpArabic::ALEF && word[0] == NlpArabic::ALEF
+      word = word[1..2] + word[4]
+    elsif word[4] == NlpArabic::NOON && word[3] == NlpArabic::ALEF
+      word = word[0..2]
+    elsif word[3] == NlpArabic::YEH && word[0] == NlpArabic::TEH
+      word = word[1..2] + word[4]
+    elsif word[3] == NlpArabic::WAW && word[0] == NlpArabic::ALEF
+      word = word[0] + word[2] + word[4]
+    elsif word[2] == NlpArabic::ALEF && word[1] == NlpArabic::WAW
+      word = word[0] + word[3..-1]
+    elsif word[3] == NlpArabic::YEH_WITH_HAMZA_ABOVE && word[2] == NlpArabic::ALEF
+      word = word[0..1] + word[4]
+    elsif word[4] == NlpArabic::TEH_MARBUTA && word[1] == NlpArabic::ALEF
+      word = word[0] + word[2..3]
+    elsif word[4] == NlpArabic::YEH && word[2] == NlpArabic::ALEF
+      word = word[0..1] + word[3]
+    else
+      word = short_suffix(word)
+      word = short_prefix(word) if word.length == 5
+    end
+    return word
+  end
+
+  def self.pattern_54(word)
+    # Helper function that processes the length-five patterns and extracts the length-four roots
+    if NlpArabic::PR53[2].include? word[0]
+      word = word[1..-1]
+    elsif word[4] == NlpArabic::TEH_MARBUTA
+      word = word[0..3]
+    elsif word[2] == NlpArabic::ALEF
+      word = word[0..1] + word[3..-1]
+    end
+    return word
+  end
+
+  def self.word_6(word)
+    # Processes the words of length six
+    if word.length == 5
+      word = pattern_53(word)
+      word = word_5(word)
+    elsif word.length == 6
+      word = pattern_64(word)
+    end
+    return word
+  end
+
+  def self.pattern_6(word)
+    # Helper function that processes the length-six patterns and extracts the length-three roots
+    if word.start_with?(NlpArabic::IST) || word.start_with?(NlpArabic::MST)
+      word = word[3..-1]
+    elsif word[0] == NlpArabic::MEEM && word[3] == NlpArabic::ALEF && word[5] == NlpArabic::TEH_MARBUTA
+      word = word[1..2] + word[4]
+    elsif word[0] == NlpArabic::ALEF && word[2] == NlpArabic::TEH && word[4] == NlpArabic::ALEF
+      word = word[1] + word[3] + word[5]
+    elsif word[0] == NlpArabic::ALEF && word[3] == NlpArabic::WAW && word[2] == word[4]
+      word = word[1] + word[4..-1]
+    elsif word[0] == NlpArabic::TEH && word[2] == NlpArabic::ALEF && word[4] == NlpArabic::YEH
+      word = word[1] + word[3] + word[5]
+    else
+      word = short_suffix(word)
+      word = short_prefix(word) if word.length == 6
+    end
+    return word
+  end
+
+  def self.pattern_64(word)
+    # Helper function that processes the length-six patterns and extracts the length-four roots
+    if word[0] == NlpArabic::ALEF && word[4] == NlpArabic::ALEF
+      word = word[1..3] + word[5]
+    else
+      word = word[2..-1]
+    end
+    return word
+  end
+
+  def self.short_prefix(word)
+    # Removes the short prefixes
+    return word[1..-1] if word.start_with?(*NlpArabic::P1)
+    return word
+  end
+
+  def self.short_suffix(word)
+    # Removes the short suffixes
+    return word[0..-2] if word.end_with?(*NlpArabic::S1)
+    return word
+  end
+end
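As a rough illustration of Steps 1 and 2 of the stemmer above, diacritic removal and hamza normalization can be sketched as stand-alone Ruby. The character classes mirror the DIACRITICS and HAMZAAS constants defined in characters.rb; `normalize_sketch` is a hypothetical name for illustration only, not part of the gem's API.

```ruby
# Step 1: strip the eight harakat (U+064B..U+0652) and the tatweel (U+0640).
# Step 2: replace hamza forms (U+0621, U+0624, U+0626) with a bare alef.
DIACRITICS_RE = /[\u064b-\u0652\u0640]/
HAMZAAS_RE    = /[\u0621\u0624\u0626]/
ALEF          = "\u0627"

def normalize_sketch(word)
  word.gsub(DIACRITICS_RE, '').gsub(HAMZAAS_RE, ALEF)
end
```

For example, a fully vowelled مُحَمَّد normalizes to the bare محمد, which is what the later affix- and pattern-matching steps operate on.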
data/lib/nlp_arabic/characters.rb
ADDED
@@ -0,0 +1,102 @@
+module NlpArabic
+
+  # Stop list
+  STOP_LIST = ["\u0648","\u064a\u0643\u0648\u0646","\u0644\u064A\u0633","\u0648\u0644\u064a\u0633","\u0648\u0643\u0627\u0646","\u0643\u0630\u0644\u0643","\u0627\u0644\u062a\u064a","\u0648\u0628\u064a\u0646",
+    "\u0639\u0644\u064a\u0647\u0627","\u0639\u0644\u064A","\u0645\u0633\u0627\u0621","\u0627\u0644\u0630\u064a","\u0648\u0643\u0627\u0646\u062a","\u0644\u0643\u0646","\u0648\u0644\u0643\u0646","\u0648\u0627\u0644\u062a\u064a",
+    "\u062a\u0643\u0648\u0646","\u0627\u0644\u064a\u0648\u0645","\u0627\u0644\u0644\u0630\u064a\u0646","\u0639\u0644\u064a\u0647","\u0643\u0627\u0646\u062a",
+    "\u0644\u0630\u0644\u0643","\u0623\u0645\u0627\u0645","\u0647\u0646\u0627","\u0647\u0646\u0627\u0643","\u0645\u0646\u0647\u0627","\u0645\u0627\u0632\u0627\u0644","\u0644\u0627\u0632\u0627\u0644",
+    "\u0644\u0627\u064a\u0632\u0627\u0644","\u0645\u0627\u064a\u0632\u0627\u0644","\u0627\u0635\u0628\u062d","\u0623\u0635\u0628\u062d","\u0623\u0645\u0633\u0649",
+    "\u0627\u0645\u0633\u0649","\u0623\u0636\u062d\u0649","\u0627\u0636\u062d\u0649","\u0645\u0627\u0628\u0631\u062d","\u0645\u0627\u0641\u062a\u0626","\u0645\u0627\u0627\u0646\u0641\u0643",
+    "\u0644\u0627\u0633\u064a\u0645\u0627","\u0648\u0644\u0627\u064a\u0632\u0627\u0644","\u0627\u0644\u062d\u0627\u0644\u064a","\u0627\u0644\u064a\u0647\u0627","\u0627\u0644\u0630\u064a\u0646","\u0641\u0627\u0646\u0647",
+    "\u0648\u0627\u0644\u0630\u064a","\u0648\u0647\u0630\u0627","\u0644\u0647\u0630\u0627","\u0641\u0643\u0627\u0646","\u0633\u062a\u0643\u0648\u0646","\u0627\u0644\u064a\u0647",
+    "\u064a\u0645\u0643\u0646","\u0628\u0647\u0630\u0627","\u0627\u0644\u0630\u0649","\u0641\u0649","\u0641\u064a","\u0643\u0644","\u0644\u0645","\u0644\u0646","\u0644\u0647","\u0645\u0646","\u0647\u0648",
+    "\u0643\u0645\u0627","\u0644\u0647\u0627","\u0645\u0646\u0630","\u0642\u062F","\u0648\u0642\u062F","\u0648\u0644\u0627","\u0648\u0642\u0627\u0644","\u0648\u0642\u0627\u0644\u062A",
+    "\u0644\u0644\u0627\u0645\u0645","\u0641\u064A\u0647","\u0643\u0644\u0645","\u0648\u0641\u064A","\u0648\u0642\u0641","\u0648\u0644\u0645","\u0648\u0645\u0646","\u0648\u0647\u0648","\u0648\u0647\u064A",
+    "\u062D\u064A\u062B","\u0627\u0643\u062F","\u0627\u0644\u0627","\u0627\u0645\u0627","\u0627\u0645\u0633","\u0627\u0644\u0633\u0627\u0628\u0642","\u0627\u0644\u062A\u0649","\u0627\u0643\u062B\u0631",
+    "\u0627\u064A\u0627\u0631","\u0627\u064A\u0636\u0627","\u0627\u0644\u0630\u0627\u062A\u064A","\u0627\u0644\u0627\u062E\u064A\u0631\u0629","\u0627\u0644\u0627\u0646","\u0627\u0645\u0627\u0645","\u0627\u064A\u0627\u0645",
+    "\u062E\u0644\u0627\u0644","\u062D\u0648\u0627\u0644\u0649","\u0630\u0644\u0643","\u062F\u0648\u0646","\u062D\u0648\u0644","\u062D\u064A\u0646","\u0627\u0644\u0641","\u0627\u0644\u0649","\u0648\u062A\u0645",
+    "\u0627\u0646\u0647","\u0627\u0648\u0644","\u0636\u0645\u0646","\u0627\u0646\u0647\u0627","\u062C\u0645\u064A\u0639","\u0627\u0644\u0645\u0627\u0636\u064A","\u0627\u0644\u0648\u0642\u062A",
+    "\u0627\u0644\u0645\u0642\u0628\u0644","\u0644\u0627","\u0645\u0627","\u0645\u0639","\u0647\u0630\u0627","\u0648\u0627\u062D\u062F","\u0641\u0627\u0646","\u0642\u0627\u0644","\u0643\u0627\u0646",
+    "\u0644\u062F\u0649","\u0646\u062D\u0648","\u0647\u0630\u0647","\u0648\u0627\u0646","\u0648\u0627\u0643\u062F","\u0639\u0634\u0631","\u0639\u062F\u062F","\u0639\u062F\u0629","\u0639\u0634\u0631\u0629","\u0639\u062F\u0645",
+    "\u0639\u0627\u0645","\u0639\u0627\u0645\u0627","\u0639\u0646","\u0639\u0646\u062F","\u0639\u0646\u062F\u0645\u0627","\u0639\u0644\u0649","\u0633\u0646\u0629","\u0633\u0646\u0648\u0627\u062A","\u062A\u0645","\u0636\u062F",
+    "\u0628\u0639\u062F","\u0628\u0639\u0636","\u0627\u0639\u0627\u062F\u0629","\u0627\u0639\u0644\u0646\u062A","\u0628\u0633\u0628\u0628","\u062D\u062A\u0649","\u0627\u0630\u0627","\u0627\u062D\u062F","\u0645\u0645\u0646",
+    "\u0627\u062B\u0631","\u063A\u062F\u0627","\u0634\u062E\u0635\u0627","\u0635\u0628\u0627\u062D","\u0627\u0637\u0627\u0631","\u0627\u0631\u0628\u0639\u0629","\u0627\u062E\u0631\u0649","\u0628\u0627\u0646",
+    "\u0627\u062C\u0644","\u063A\u064A\u0631","\u0628\u0634\u0643\u0644","\u062D\u0627\u0644\u064A\u0627","\u0628\u0646","\u0628\u0647","\u062B\u0645","\u0627\u0641","\u0627\u0646","\u0627\u0648","\u0627\u064A",
+    "\u0628\u0647\u0627","\u0635\u0641\u0631","\u0627\u0644\u062B\u0627\u0646\u064A","\u0627\u0644\u062B\u0627\u0646\u064A\u0629","\u0627\u062F\u0627","\u0627\u0648\u0644\u0627","\u0648\u0644\u0643\u0646\u0647",
+    "\u0627\u0644\u0627\u0648\u0644","\u0627\u0644\u0627\u0648\u0644\u0649","\u0628\u064A\u0646","\u0630\u0644\u0643","\u0645\u0645\u0627","\u0631\u063A\u0645","\u0628\u064A","\u0644\u0627\u0646","\u0647\u0644","\u0644\u0648",
+    "\u0628\u0645\u0627","\u0627\u0646\u0627","\u062A\u064A","\u0628\u0644\u0627","\u0642\u0628\u0644","\u0627\u0644\u0646","\u064A\u0627\u0647","\u0644\u062F\u064A","\u0628\u0644","\u0644\u0646\u0627","\u0627\u0645",
+    "\u0627\u0646\u0646\u0627","\u0644\u0642\u062F","\u062D\u064A\u062A","\u0627\u0630\u0646","\u0627\u0644\u064A","\u0628\u0630\u0644\u0643","\u062E\u0644\u0644","\u062D\u0648\u0644","\u0644\u0643","\u062A\u0645\u0627",
+    "\u0644\u0645\u0646","\u0644\u0646\u0647","\u0627\u0644\u0627","\u0627\u064A\u0646","\u0639\u0645\u0627","\u0628\u0643\u0644","\u0648\u0647\u0646\u0627\u0643","\u0646\u0647\u0627",
+    "\u0648\u0647\u0630\u0647","\u0648\u0645\u0627","\u0647\u0645\u0627","\u0648\u0647\u0645","\u0644\u0647\u0630\u0647","\u0639\u0646\u0647","\u0645\u062A\u0646","\u0644\u0645\u0627","\u0643\u0645","\u0645\u062A\u0649",
+    "\u0647\u0643\u0630\u0627","\u0627\u064A\u0647","\u0644\u0643\u0646\u0647","\u062A\u0645","\u0644\u064A\u0643","\u0648\u0644\u0643","\u0644\u0645\u0630\u0627","\u062C\u062F","\u0641\u0641\u064A","\u062F\u064A","\u0625\u064A",
+    "\u0635\u0641\u0631","\u0648\u0627\u062D\u062F","\u0627\u062B\u0646\u0627\u0646","\u062B\u0644\u0627\u062B\u0629","\u0623\u0631\u0628\u0639\u0629","\u062E\u0645\u0633\u0629","\u0633\u062A\u0629","\u0633\u0628\u0639\u0629",
+    "\u062B\u0645\u0627\u0646\u064A\u0629","\u062A\u0633\u0639\u0629","\u0639\u0634\u0631\u0629","\u0639\u0634\u0631","\u0623\u062D\u062F",
+    "\u0627\u062B\u0646\u0627","\u062B\u0644\u0627\u062B\u0629","\u0623\u0631\u0628\u0639\u0629","\u062E\u0645\u0633\u0629","\u0633\u062A\u0629",
+    "\u0633\u0628\u0639\u0629","\u062B\u0645\u0627\u0646\u064A\u0629","\u062A\u0633\u0639\u0629","\u0639\u0634\u0631\u0648\u0646","\u062B\u0644\u0627\u062B\u0648\u0646",
+    "\u0623\u0631\u0628\u0639\u0648\u0646","\u062E\u0645\u0633\u0648\u0646","\u0633\u062A\u0648\u0646","\u0633\u0628\u0639\u0648\u0646","\u062B\u0645\u0627\u0646\u0648\u0646","\u062A\u0633\u0639\u0648\u0646","\u0645\u0626\u0629",
+    "\u0645\u0627\u0626\u0629","\u0623\u0646\u0627","\u0627\u0646\u062A","\u0627\u0646\u062A\u064E","\u0627\u0646\u062A\u0649","\u0627\u0646\u062A\u0650","\u0647\u0648","\u0647\u064A","\u0646\u062D\u0646","\u0623\u0646\u062A\u0645\u0627",
+    "\u0647\u0645\u0627","\u0623\u0646\u062A\u0645","\u0623\u0646\u062A\u0646","\u0647\u0645","\u0647\u0646"].freeze
+
+  # Diacritics
+  DIACRITICS = "[\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0640]"
+
+  # Alifs
+  # Initial alifs
+  ALIFS = "[\u0622\u0623\u0625\u0671]"
+
+  # Hamzaas
+  HAMZAAS = "[\u0621\u0624\u0626]"
+
+  # Affix sets
+  # Prefixes of length three
+  P3 = ["\u0643\u0627\u0644", "\u0628\u0627\u0644", "\u0648\u0644\u0644", "\u0648\u0627\u0644"]
+
+  # Prefixes of length two
+  P2 = ["\u0627\u0644", "\u0644\u0644"].freeze
+
+  # Prefixes of length one
+  P1 = ["\u0644", "\u0628", "\u0641", "\u0633", "\u0648", "\u064a", "\u062a", "\u0646", "\u0627"].freeze
+
+  # Suffixes of length three
+  S3 = ["\u062a\u0645\u0644", "\u0647\u0645\u0644", "\u062a\u0627\u0646", "\u062a\u064a\u0646", "\u0643\u0645\u0644"].freeze
+
+  # Suffixes of length two
+  S2 = ["\u0648\u0646", "\u0627\u062a", "\u0627\u0646", "\u064a\u0646", "\u062a\u0646", "\u0643\u0645", "\u0647\u0646", "\u0646\u0627", "\u064a\u0627",
+    "\u0647\u0627", "\u062a\u0645", "\u0643\u0646", "\u0646\u064a", "\u0648\u0627", "\u0645\u0627", "\u0647\u0645"].freeze
+
+  # Suffixes of length one
+  S1 = ["\u0629", "\u0647", "\u064a", "\u0643", "\u062a", "\u0627", "\u0646"].freeze
+
+  # Patterns and roots
+  # Patterns of length four
+  PR4 = { 0 => ["\u0645"],
+          1 => ["\u0627"],
+          2 => ["\u0627", "\u0648", "\u064A"],
+          3 => ["\u0629"] }.freeze
+
+  # Patterns of length five and length-three roots
+  PR53 = { 0 => ["\u0627", "\u062a"],
+           1 => ["\u0627", "\u064a", "\u0648"],
+           2 => ["\u0627", "\u062a", "\u0645"],
+           3 => ["\u0645", "\u064a", "\u062a"],
+           4 => ["\u0645", "\u062a"],
+           5 => ["\u0627", "\u0648"],
+           6 => ["\u0627", "\u0645"] }.freeze
+
+  # Letters
+  DOUBLE_WAW = "\u0648\u0648"
+  ALEF = "\u0627"
+  MEEM = "\u0645"
+  TEH_MARBUTA = "\u0629"
+  TEH = "\u062a"
+  NOON = "\u0646"
+  YEH = "\u064a"
+  WAW = "\u0648"
+  YEH_WITH_HAMZA_ABOVE = "\u0626"
+
+  # Stems
+  IST = "\u0627\u0633\u062a"
+  MST = "\u0645\u0633\u062a"
+  MT = "\u0645\u062a"
+
+end
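The P3/P2 prefix lists above are applied with length guards: a length-3 prefix is only stripped from words of at least 6 letters, and a length-2 prefix from words of at least 5, so that a plausible root remains. A self-contained sketch of that step, reusing the same prefix values (`remove_prefix_sketch` is a hypothetical name, not the gem's API):

```ruby
# Length-guarded prefix stripping, mirroring remove_prefix in lib/nlp_arabic.rb.
# P3 are three-letter prefixes (e.g. "wal-"), P2 two-letter ones (e.g. "al-").
P3 = ["\u0643\u0627\u0644", "\u0628\u0627\u0644", "\u0648\u0644\u0644", "\u0648\u0627\u0644"].freeze
P2 = ["\u0627\u0644", "\u0644\u0644"].freeze

def remove_prefix_sketch(word)
  return word[3..-1] if word.length >= 6 && word.start_with?(*P3)
  return word[2..-1] if word.length >= 5 && word.start_with?(*P2)
  word
end
```

For example, الكتاب ("the book", 6 letters) loses its ال prefix, while a 3-letter word starting with the same characters is left untouched by the guard.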
data/nlp_arabic.gemspec
ADDED
@@ -0,0 +1,23 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'nlp_arabic/version'
+
+Gem::Specification.new do |spec|
+  spec.name          = "nlp_arabic"
+  spec.version       = NlpArabic::VERSION
+  spec.authors       = ["Othmane Laousy"]
+  spec.email         = ["othmane.laousy@gmail.com"]
+
+  spec.summary       = %q{Natural Language Processing Tools for Arabic}
+  spec.description   = %q{This gem is intended to contain tools for Arabic Natural Language Processing.}
+  spec.homepage      = "https://github.com/othmanela/nlp_arabic"
+
+  spec.files         = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test)/}) }
+  spec.bindir        = "exe"
+  spec.executables   = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+  spec.require_paths = ["lib"]
+
+  spec.add_development_dependency "bundler", "~> 1.9"
+  spec.add_development_dependency "rake", "~> 10.0"
+end
metadata
ADDED
@@ -0,0 +1,82 @@
+--- !ruby/object:Gem::Specification
+name: nlp_arabic
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Othmane Laousy
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2015-05-11 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.9'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.9'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+description: This gem is intended to contain tools for Arabic Natural Language Processing.
+email:
+- othmane.laousy@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- ".gitignore"
+- ".travis.yml"
+- Gemfile
+- README.md
+- Rakefile
+- bin/console
+- bin/setup
+- lib/nlp_arabic.rb
+- lib/nlp_arabic/characters.rb
+- lib/nlp_arabic/version.rb
+- nlp_arabic.gemspec
+homepage: https://github.com/othmanela/nlp_arabic
+licenses: []
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.4.6
+signing_key:
+specification_version: 4
+summary: Natural Language Processing Tools for Arabic
+test_files: []