nlp_arabic 0.1.0

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 9dce240342285bde2206509493990d37a44143df
4
+ data.tar.gz: b55584ef3a20f4b60f1f0637108149922605f06d
5
+ SHA512:
6
+ metadata.gz: 9d92a384d51125411cca0a89479c7889830b92e4c86eaf29241bad88029bf1bc704a916151b36438737d7de4725eb1d39bd151d077685010419e4d8022824598
7
+ data.tar.gz: cd0855407749d5b51d22c43eef77027b53d8974d6ec2dcce27f7edee3577d122cd2c884fa26fbc8f04f09f7d0c0cc3f46d090c0dc504e8aba01d9606f5b7d914
@@ -0,0 +1,9 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
@@ -0,0 +1,3 @@
1
+ language: ruby
2
+ rvm:
3
+ - 2.2.0
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in nlp_arabic.gemspec
4
+ gemspec
@@ -0,0 +1,67 @@
1
+ NlpArabic
2
+ =========
3
+
4
+ This gem is intended to contain tools for Arabic Natural Language Processing.
5
+ As of version 0.1, this toolkit gem allows you to:
6
+
7
+ 1. Clean a text using a stop list. This stop list was generated using the tf-idf score calculated on words from over 900 articles. The selected words were also checked and validated by hand, which resulted in a stop list of over 270 words.
8
+
9
+ 2. Stem a word or a text. The stemming algorithm used is the ISRI Arabic stemmer. It is described in the following research paper:
10
+
11
+ [Arabic Stemming without a root dictionary](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1428453&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F9755%2F30835%2F01428453.pdf%3Farnumber%3D1428453)
12
+
13
+ This root-extraction stemmer is similar to the Khoja stemmer but does not use a root dictionary, which can be laborious to maintain. Also, when the root cannot be found, the ISRI stemmer returns a normalized form rather than the original unmodified form. Overall, the ISRI stemmer has been shown to perform as well as, if not better than, the Khoja stemmer.
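
The two features listed above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the gem's implementation: the module name `ArabicSketch` and its methods are hypothetical, the stop list is a three-word sample, and only diacritic removal and two-letter prefix stripping are shown rather than the full ISRI procedure.

```ruby
# Hypothetical sketch; these names are illustrative and not part of nlp_arabic.
module ArabicSketch
  # Three-word sample stop list: "fi" (in), "min" (from), "ala" (on).
  SAMPLE_STOP_LIST = ["\u0641\u064A", "\u0645\u0646", "\u0639\u0644\u0649"].freeze

  # Tanween, short vowels, shadda, sukun (U+064B..U+0652) and tatweel (U+0640).
  DIACRITICS = /[\u064B-\u0652\u0640]/

  # Two-letter prefixes: alef-lam and lam-lam.
  PREFIXES = ["\u0627\u0644", "\u0644\u0644"].freeze

  # Drops every whitespace-separated token found in the sample stop list.
  def self.remove_stop_words(text)
    (text.split - SAMPLE_STOP_LIST).join(" ")
  end

  # Strips diacritics, then a two-letter prefix from words of five or more letters.
  def self.normalize(word)
    bare = word.gsub(DIACRITICS, "")
    return bare[2..-1] if bare.length >= 5 && bare.start_with?(*PREFIXES)
    bare
  end
end

# "dhahaba fi al-sabah": the middle token is in the sample stop list.
sentence = "\u0630\u0647\u0628 \u0641\u064A \u0627\u0644\u0635\u0628\u0627\u062D"
puts ArabicSketch.remove_stop_words(sentence)

# "al-kitab" written with two diacritics; they and the prefix are removed.
puts ArabicSketch.normalize("\u0627\u0644\u0643\u0650\u062A\u064E\u0627\u0628")
```

Short words are left untouched by the prefix rule so that a genuine root beginning with the same letters is not truncated.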
14
+
15
+
16
+ Installation
17
+ ============
18
+
19
+ Add this line to your application's Gemfile:
20
+
21
+ ```ruby
22
+ gem 'nlp_arabic'
23
+ ```
24
+
25
+ And then execute:
26
+
27
+ $ bundle
28
+
29
+ Or install it yourself as:
30
+
31
+ $ gem install nlp_arabic
32
+
33
+ ## Usage
34
+
35
+ Once installed, you can use it like this:
36
+
37
+ NlpArabic.clean_text(text) will return the text without the stop words.
38
+
39
+ NlpArabic.stem(word) will return the word stemmed.
40
+
41
+ NlpArabic.stem_text(text) will stem an entire text.
42
+
43
+ NlpArabic.clean_and_stem(text) will do both.
44
+
45
+ NlpArabic.wash_and_stem(text) will stem the text removing stop words and delimiters from it.
46
+
47
+ NlpArabic.tokenize_text(text) will break the text into an array of words and delimiters.
48
+
49
+ Each step of the ISRI algorithm is coded in a separate function so you should be able to find the helper function you may be looking for just by browsing the code.
50
+
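
Delimiter-preserving tokenization of the kind described for `tokenize_text` can be sketched with `String#split`, which includes the text matched by capture groups in its result. The helper name `tokenize_sketch` and the single character class are assumptions made for brevity; this is a simplified stand-in, not the gem's own pattern.

```ruby
# Hypothetical helper; a simplified stand-in for a delimiter-preserving tokenizer.
# Whitespace is matched outside the capture group, so it is discarded;
# runs of punctuation (including the Arabic comma U+060C and question
# mark U+061F) are captured, so split returns them as tokens.
DELIMITERS = /\s|([?.!,;:\u060C\u061F]+)/

def tokenize_sketch(text)
  # A delimiter followed by whitespace yields an empty string, so drop those.
  text.split(DELIMITERS).reject(&:empty?)
end

p tokenize_sketch("Hello, world! How are you?")
# => ["Hello", ",", "world", "!", "How", "are", "you", "?"]
```

Keeping the delimiters as separate tokens lets a later step stem only the alphabetic tokens and re-join the rest unchanged.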
51
+ Development
52
+ ===========
53
+
54
+ After checking out the repo, run `bin/console` for an interactive prompt that will allow you to experiment. For now the gem doesn't use any dependencies so you don't need to run `bin/setup`.
55
+
56
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release` to create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
57
+
58
+ Contributing
59
+ ============
60
+ You are more than welcome to contribute to this project :) Please try to respect the Ruby style guidelines described [here](https://github.com/bbatsov/ruby-style-guide). The default encoding used is UTF-8.
61
+
62
+ 1. Fork it ( https://github.com/othmanela/nlp_arabic/fork )
63
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
64
+ 3. Write unit tests and make sure all of them (including the old ones) pass
65
+ 4. Commit your changes (`git commit -am 'Add some feature'`)
66
+ 5. Push to the branch (`git push origin my-new-feature`)
67
+ 6. Create a new Pull Request
@@ -0,0 +1,10 @@
1
+ require "bundler/gem_tasks"
2
+ require "rake/testtask"
3
+
4
+ Rake::TestTask.new do |t|
5
+ t.libs << "test"
6
+ t.test_files = FileList["test/**/*_test.rb"]
7
+ t.verbose = true
8
+ end
9
+
10
+ task default: :test
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "nlp_arabic"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start
@@ -0,0 +1,7 @@
1
+ #!/bin/bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+
5
+ bundle install
6
+
7
+ # Do any other automated setup that you need to do here
@@ -0,0 +1,263 @@
1
+ require "nlp_arabic/version"
2
+ require "nlp_arabic/characters"
3
+
4
+ module NlpArabic
5
+ def self.stem(word)
6
+ # This function stems a word following the steps of ISRI stemmer
7
+ # Step 1: remove diacritics
8
+ word = remove_diacritics(word)
9
+ # Step 2: normalize hamza, waw with hamza and yeh with hamza to bare alef
10
+ word = normalize_hamzaas(word)
11
+ # Step 3: remove prefix of size 3 then 2
12
+ word = remove_prefix(word)
13
+ # Step 4: remove suffix of size 3 then 2
14
+ word = remove_suffix(word)
15
+ # Step 5: remove the connective waw
16
+ word = remove_waw(word)
17
+ # Step 6: convert initial alef (optional)
18
+ word = convert_initial_alef(word)
19
+ # Step 7: apply the length-specific pattern rules to words of length 4 to 7
20
+ if word.length == 4
21
+ word = word_4(word)
22
+ elsif word.length == 5
23
+ word = pattern_53(word)
24
+ word = word_5(word)
25
+ elsif word.length == 6
26
+ word = pattern_6(word)
27
+ word = word_6(word)
28
+ elsif word.length == 7
29
+ word = short_suffix(word)
30
+ word = short_prefix(word) if word.length == 7
31
+ if word.length == 6
32
+ word = pattern_6(word)
33
+ word = word_6(word)
34
+ end
35
+ end
36
+ return word
37
+ end
38
+
39
+ def self.clean_text(text)
40
+ # cleans the text using a stop list
41
+ tokenized_text = NlpArabic.tokenize_text(text)
42
+ clean_text = (tokenized_text - NlpArabic::STOP_LIST)
43
+ return clean_text.join(' ')
44
+ end
45
+
46
+ def self.stem_text(text)
47
+ # Only stems the text using the ISRI algorithm
48
+ tokenized_text = NlpArabic.tokenize_text(text)
49
+ for i in (0..(tokenized_text.length-1))
50
+ tokenized_text[i] = stem(tokenized_text[i]) if NlpArabic.is_alpha(tokenized_text[i])
51
+ end
52
+ return tokenized_text.join(' ')
53
+ end
54
+
55
+ def self.clean_and_stem(text)
56
+ # Cleans the text using the stop list, then stems it
57
+ tokenized_text = NlpArabic.tokenize_text(text)
58
+ clean_text = (tokenized_text - NlpArabic::STOP_LIST)
59
+ for i in (0..(clean_text.length-1))
60
+ clean_text[i]= stem(clean_text[i]) if NlpArabic.is_alpha(clean_text[i])
61
+ end
62
+ return clean_text.join(' ')
63
+ end
64
+
65
+ def self.tokenize_text(text)
66
+ return text.split(/\s|(\?+)|(\.+)|(!+)|(\,+)|(\;+)|(\،+)|(\؟+)|(\:+)|(\(+)|(\)+)/).delete_if(&:empty?)
67
+ end
68
+
69
+ def self.wash_and_stem(text)
70
+ clean_text = text.gsub(/[._,،\"\':–%\/;·&?؟()\”\“]/, '').split - NlpArabic::STOP_LIST
71
+ new_text = []
72
+ for i in (0..(clean_text.length-1))
73
+ new_text << stem(clean_text[i]) if NlpArabic.is_alpha(clean_text[i])
74
+ end
75
+ new_text -= NlpArabic::STOP_LIST
76
+ return new_text.join(' ')
77
+ end
78
+
79
+ def self.is_alpha(word)
80
+ # checks if a word consists only of alphabetic characters
81
+ return !!word.match(/^[[:alpha:]]+$/)
82
+ end
83
+
84
+ def self.remove_na_characters(word)
85
+ # strips surrounding whitespace and removes punctuation characters from the word
86
+ return word.strip.gsub(/[._,،\"\':–%\/;·&?؟()\”\“]/, '')
87
+ end
88
+
89
+ def self.remove_diacritics(word)
90
+ # removes Arabic diacritics (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun) and tatweel
91
+ return word.gsub(/#{NlpArabic::DIACRITICS}/, '')
92
+ end
93
+
94
+ def self.convert_initial_alef(word)
95
+ # converts all the types of ALEF to a bare alef
96
+ return word.gsub(/#{NlpArabic::ALIFS}/, NlpArabic::ALEF)
97
+ end
98
+
99
+ def self.normalize_hamzaas(word)
100
+ # Normalize the hamzaas to an alef
101
+ return word.gsub(/#{NlpArabic::HAMZAAS}/, NlpArabic::ALEF)
102
+ end
103
+
104
+ def self.remove_prefix(word)
105
+ # Removes the prefixes of length three, then the prefixes of length two
106
+ if word.length >= 6
107
+ return word[3..-1] if word.start_with?(*NlpArabic::P3)
108
+ end
109
+ if word.length >= 5
110
+ return word[2..-1] if word.start_with?(*NlpArabic::P2)
111
+ end
112
+ return word
113
+ end
114
+
115
+ def self.remove_suffix(word)
116
+ # Removes the suffixes of length three, then the suffixes of length two
117
+ if word.length >= 6
118
+ return word[0..-4] if word.end_with?(*NlpArabic::S3)
119
+ end
120
+ if word.length >= 5
121
+ return word[0..-3] if word.end_with?(*NlpArabic::S2)
122
+ end
123
+ return word
124
+ end
125
+
126
+ def self.remove_waw(word)
127
+ # Removes the initial و when the word starts with a double waw (وو)
128
+ if word.length >= 4
129
+ return word[1..-1] if word.start_with?(*NlpArabic::DOUBLE_WAW)
130
+ end
131
+ return word
132
+ end
133
+
134
+ def self.word_4(word)
135
+ # Processes the words of length four
136
+ if NlpArabic::PR4[0].include? word[0]
137
+ return word[1..-1]
138
+ elsif NlpArabic::PR4[1].include? word[1]
139
+ word[1] = ''
140
+ elsif NlpArabic::PR4[2].include? word[2]
141
+ word[2] = ''
142
+ elsif NlpArabic::PR4[3].include? word[3]
143
+ word[3] = ''
144
+ else
145
+ word = short_suffix(word)
146
+ word = short_prefix(word) if word.length == 4
147
+ end
148
+ return word
149
+ end
150
+
151
+ def self.word_5(word)
152
+ # Processes the words of length five
153
+ if word.length == 4
154
+ word = word_4(word)
155
+ elsif word.length == 5
156
+ word = pattern_54(word)
157
+ end
158
+ return word
159
+ end
160
+
161
+ def self.pattern_53(word)
162
+ # Helper function that processes the length five patterns and extracts the length three roots
163
+ if NlpArabic::PR53[0].include?(word[2]) && word[0] == NlpArabic::ALEF
164
+ word = word[1] + word[3..-1]
165
+ elsif NlpArabic::PR53[1].include?(word[3]) && word[0] == NlpArabic::MEEM
166
+ word = word[1..2] + word[4]
167
+ elsif NlpArabic::PR53[2].include?(word[0]) && word[4] == NlpArabic::TEH_MARBUTA
168
+ word = word[1..3]
169
+ elsif NlpArabic::PR53[3].include?(word[0]) && word[2] == NlpArabic::TEH
170
+ word = word[1] + word[3..-1]
171
+ elsif NlpArabic::PR53[4].include?(word[0]) && word[2] == NlpArabic::ALEF
172
+ word = word[1] + word[3..-1]
173
+ elsif NlpArabic::PR53[5].include?(word[2]) && word[4] == NlpArabic::TEH_MARBUTA
174
+ word = word[0..1] + word[3]
175
+ elsif NlpArabic::PR53[6].include?(word[0]) && word[1] == NlpArabic::NOON
176
+ word = word[2..-1]
177
+ elsif word[3] == NlpArabic::ALEF && word[0] == NlpArabic::ALEF
178
+ word = word[1..2] + word[4]
179
+ elsif word[4] == NlpArabic::NOON && word[3] == NlpArabic::ALEF
180
+ word = word[0..2]
181
+ elsif word[3] == NlpArabic::YEH && word[0] == NlpArabic::TEH
182
+ word = word[1..2] + word[4]
183
+ elsif word[3] == NlpArabic::WAW && word[1] == NlpArabic::ALEF
184
+ word = word[0] + word[2] + word[4]
185
+ elsif word[2] == NlpArabic::ALEF && word[1] == NlpArabic::WAW
186
+ word = word[0] + word[3..-1]
187
+ elsif word[3] == NlpArabic::YEH_WITH_HAMZA_ABOVE && word[2] == NlpArabic::ALEF
188
+ word = word[0..1] + word[4]
189
+ elsif word[4] == NlpArabic::TEH_MARBUTA && word[1] == NlpArabic::ALEF
190
+ word = word[0] + word[2..3]
191
+ elsif word[4] == NlpArabic::YEH && word[2] == NlpArabic::ALEF
192
+ word = word[0..1] + word[3]
193
+ else
194
+ word = short_suffix(word)
195
+ word = short_prefix(word)if word.length == 5
196
+ end
197
+ return word
198
+ end
199
+
200
+ def self.pattern_54(word)
201
+ # Helper function that processes the length five patterns and extracts the length four roots
202
+ if NlpArabic::PR53[2].include? word[0]
203
+ word = word[1..-1]
204
+ elsif word[4] == NlpArabic::TEH_MARBUTA
205
+ word = word[0..3]
206
+ elsif word[2] == NlpArabic::ALEF
207
+ word = word[0..1] + word[3..-1]
208
+ end
209
+ return word
210
+ end
211
+
212
+ def self.word_6(word)
213
+ # Processes the words of length six
214
+ if word.length == 5
215
+ word = pattern_53(word)
216
+ word = word_5(word)
217
+ elsif word.length == 6
218
+ word = pattern_64(word)
219
+ end
220
+ return word
221
+ end
222
+
223
+ def self.pattern_6(word)
224
+ # Helper function that processes the length six patterns and extracts the length three roots
225
+ if word.start_with?(*NlpArabic::IST) || word.start_with?(*NlpArabic::MST)
226
+ word = word[3..-1]
227
+ elsif word[0] == NlpArabic::MEEM && word[3] == NlpArabic::ALEF && word[5] == NlpArabic::TEH_MARBUTA
228
+ word = word[1..2] + word[4]
229
+ elsif word[0] == NlpArabic::ALEF && word[2] == NlpArabic::TEH && word[4] == NlpArabic::ALEF
230
+ word = word[1] + word[3] + word[5]
231
+ elsif word[0] == NlpArabic::ALEF && word[3] == NlpArabic::WAW && word[2] == word[4]
232
+ word = word[1] + word[4..-1]
233
+ elsif word[0] == NlpArabic::TEH && word[2] == NlpArabic::ALEF && word[4] == NlpArabic::YEH
234
+ word = word[1] + word[3] + word[5]
235
+ else
236
+ word = short_suffix(word)
237
+ word = short_prefix(word) if word.length == 6
238
+ end
239
+ return word
240
+ end
241
+
242
+ def self.pattern_64(word)
243
+ # Helper function that processes the length six patterns and extracts the length four roots
244
+ if word[0] == NlpArabic::ALEF && word[4] == NlpArabic::ALEF
245
+ word = word[1..3] + word[5]
246
+ elsif word.start_with?(NlpArabic::MT)
247
+ word = word[2..-1]
248
+ end
249
+ return word
250
+ end
251
+
252
+ def self.short_prefix(word)
253
+ # Removes the short prefixes
254
+ return word[1..-1] if word.start_with?(*NlpArabic::P1)
255
+ return word
256
+ end
257
+
258
+ def self.short_suffix(word)
259
+ # Removes the short suffixes
260
+ return word[0..-2] if word.end_with?(*NlpArabic::S1)
261
+ return word
262
+ end
263
+ end
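
One Ruby parsing detail matters for condition chains such as those in `pattern_53`: when a method is called without parentheses, everything up to the end of the expression, including a trailing `&&` clause, is taken as its argument. The snippet below is a small standalone demonstration; the variable names are illustrative only.

```ruby
# Illustrative values; not taken from the gem.
letters = ["a", "b"]
word = "xay"

# Parsed as letters.include?(word[1] && (word[0] == "x")),
# i.e. letters.include?(true), which is false for this array.
loose = (letters.include? word[1] && word[0] == "x")

# Explicit parentheses limit the argument to word[1].
strict = letters.include?(word[1]) && word[0] == "x"

puts "without parentheses: #{loose}"
puts "with parentheses:    #{strict}"
```

Writing `include?(...)` with explicit parentheses keeps each membership test separate from the comparison that follows it.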
@@ -0,0 +1,102 @@
1
+ module NlpArabic
2
+
3
+ # Stop List
4
+ STOP_LIST = ["\u0648","\u064a\u0643\u0648\u0646","\u0644\u064A\u0633","\u0648\u0644\u064a\u0633","\u0648\u0643\u0627\u0646","\u0643\u0630\u0644\u0643","\u0627\u0644\u062a\u064a","\u0648\u0628\u064a\u0646",
5
+ "\u0639\u0644\u064a\u0647\u0627","\u0639\u0644\u064A","\u0645\u0633\u0627\u0621","\u0627\u0644\u0630\u064a","\u0648\u0643\u0627\u0646\u062a","\u0644\u0643\u0646","\u0648\u0644\u0643\u0646","\u0648\u0627\u0644\u062a\u064a",
6
+ "\u062a\u0643\u0648\u0646","\u0627\u0644\u064a\u0648\u0645","\u0627\u0644\u0644\u0630\u064a\u0646","\u0639\u0644\u064a\u0647","\u0643\u0627\u0646\u062a",
7
+ "\u0644\u0630\u0644\u0643","\u0623\u0645\u0627\u0645","\u0647\u0646\u0627","\u0647\u0646\u0627\u0643","\u0645\u0646\u0647\u0627","\u0645\u0627\u0632\u0627\u0644","\u0644\u0627\u0632\u0627\u0644",
8
+ "\u0644\u0627\u064a\u0632\u0627\u0644","\u0645\u0627\u064a\u0632\u0627\u0644","\u0627\u0635\u0628\u062d","\u0623\u0635\u0628\u062d","\u0623\u0645\u0633\u0649",
9
+ "\u0627\u0645\u0633\u0649","\u0623\u0636\u062d\u0649","\u0627\u0636\u062d\u0649","\u0645\u0627\u0628\u0631\u062d","\u0645\u0627\u0641\u062a\u0626","\u0645\u0627\u0627\u0646\u0641\u0643",
10
+ "\u0644\u0627\u0633\u064a\u0645\u0627","\u0648\u0644\u0627\u064a\u0632\u0627\u0644","\u0627\u0644\u062d\u0627\u0644\u064a","\u0627\u0644\u064a\u0647\u0627","\u0627\u0644\u0630\u064a\u0646","\u0641\u0627\u0646\u0647",
11
+ "\u0648\u0627\u0644\u0630\u064a","\u0648\u0647\u0630\u0627","\u0644\u0647\u0630\u0627","\u0641\u0643\u0627\u0646","\u0633\u062a\u0643\u0648\u0646","\u0627\u0644\u064a\u0647",
12
+ "\u064a\u0645\u0643\u0646","\u0628\u0647\u0630\u0627","\u0627\u0644\u0630\u0649","\u0641\u0649","\u0641\u064a","\u0643\u0644","\u0644\u0645","\u0644\u0646","\u0644\u0647","\u0645\u0646","\u0647\u0648",
13
+ "\u0643\u0645\u0627","\u0644\u0647\u0627","\u0645\u0646\u0630","\u0642\u062F","\u0648\u0642\u062F","\u0648\u0644\u0627","\u0648\u0642\u0627\u0644","\u0648\u0642\u0627\u0644\u062A",
14
+ "\u0644\u0644\u0627\u0645\u0645","\u0641\u064A\u0647","\u0643\u0644\u0645","\u0648\u0641\u064A","\u0648\u0642\u0641","\u0648\u0644\u0645","\u0648\u0645\u0646","\u0648\u0647\u0648","\u0648\u0647\u064A",
15
+ "\u062D\u064A\u062B","\u0627\u0643\u062F","\u0627\u0644\u0627","\u0627\u0645\u0627","\u0627\u0645\u0633","\u0627\u0644\u0633\u0627\u0628\u0642","\u0627\u0644\u062A\u0649","\u0627\u0643\u062B\u0631",
16
+ "\u0627\u064A\u0627\u0631","\u0627\u064A\u0636\u0627","\u0627\u0644\u0630\u0627\u062A\u064A","\u0627\u0644\u0627\u062E\u064A\u0631\u0629","\u0627\u0644\u0627\u0646","\u0627\u0645\u0627\u0645","\u0627\u064A\u0627\u0645",
17
+ "\u062E\u0644\u0627\u0644","\u062D\u0648\u0627\u0644\u0649","\u0630\u0644\u0643","\u062F\u0648\u0646","\u062D\u0648\u0644","\u062D\u064A\u0646","\u0627\u0644\u0641","\u0627\u0644\u0649","\u0648\u062A\u0645",
18
+ "\u0627\u0646\u0647","\u0627\u0648\u0644","\u0636\u0645\u0646","\u0627\u0646\u0647\u0627","\u062C\u0645\u064A\u0639","\u0627\u0644\u0645\u0627\u0636\u064A","\u0627\u0644\u0648\u0642\u062A",
19
+ "\u0627\u0644\u0645\u0642\u0628\u0644","\u0644\u0627","\u0645\u0627","\u0645\u0639","\u0647\u0630\u0627","\u0648\u0627\u062D\u062F","\u0641\u0627\u0646","\u0642\u0627\u0644","\u0643\u0627\u0646",
20
+ "\u0644\u062F\u0649","\u0646\u062D\u0648","\u0647\u0630\u0647","\u0648\u0627\u0646","\u0648\u0627\u0643\u062F","\u0639\u0634\u0631","\u0639\u062F\u062F","\u0639\u062F\u0629","\u0639\u0634\u0631\u0629","\u0639\u062F\u0645",
21
+ "\u0639\u0627\u0645","\u0639\u0627\u0645\u0627","\u0639\u0646","\u0639\u0646\u062F","\u0639\u0646\u062F\u0645\u0627","\u0639\u0644\u0649","\u0633\u0646\u0629","\u0633\u0646\u0648\u0627\u062A","\u062A\u0645","\u0636\u062F",
22
+ "\u0628\u0639\u062F","\u0628\u0639\u0636","\u0627\u0639\u0627\u062F\u0629","\u0627\u0639\u0644\u0646\u062A","\u0628\u0633\u0628\u0628","\u062D\u062A\u0649","\u0627\u0630\u0627","\u0627\u062D\u062F","\u0645\u0645\u0646",
23
+ "\u0627\u062B\u0631","\u063A\u062F\u0627","\u0634\u062E\u0635\u0627","\u0635\u0628\u0627\u062D","\u0627\u0637\u0627\u0631","\u0627\u0631\u0628\u0639\u0629","\u0627\u062E\u0631\u0649","\u0628\u0627\u0646",
24
+ "\u0627\u062C\u0644","\u063A\u064A\u0631","\u0628\u0634\u0643\u0644","\u062D\u0627\u0644\u064A\u0627","\u0628\u0646","\u0628\u0647","\u062B\u0645","\u0627\u0641","\u0627\u0646","\u0627\u0648","\u0627\u064A",
25
+ "\u0628\u0647\u0627","\u0635\u0641\u0631","\u0627\u0644\u062B\u0627\u0646\u064A","\u0627\u0644\u062B\u0627\u0646\u064A\u0629","\u0627\u062F\u0627","\u0627\u0648\u0644\u0627","\u0648\u0644\u0643\u0646\u0647",
26
+ "\u0627\u0644\u0627\u0648\u0644","\u0627\u0644\u0627\u0648\u0644\u0649","\u0628\u064A\u0646","\u0630\u0644\u0643","\u0645\u0645\u0627","\u0631\u063A\u0645","\u0628\u064A","\u0644\u0627\u0646","\u0647\u0644","\u0644\u0648",
27
+ "\u0628\u0645\u0627","\u0627\u0646\u0627","\u062A\u064A","\u0628\u0644\u0627","\u0642\u0628\u0644","\u0627\u0644\u0646","\u064A\u0627\u0647","\u0644\u062F\u064A","\u0628\u0644","\u0644\u0646\u0627","\u0627\u0645",
28
+ "\u0627\u0646\u0646\u0627","\u0644\u0642\u062F","\u062D\u064A\u062A","\u0627\u0630\u0646","\u0627\u0644\u064A","\u0628\u0630\u0644\u0643","\u062E\u0644\u0644","\u062D\u0648\u0644","\u0644\u0643","\u062A\u0645\u0627",
29
+ "\u0644\u0645\u0646","\u0644\u0646\u0647","\u0627\u0644\u0627","\u0627\u064A\u0646","\u0639\u0645\u0627","\u0628\u0643\u0644","\u0648\u0647\u0646\u0627\u0643","\u0646\u0647\u0627",
30
+ "\u0648\u0647\u0630\u0647","\u0648\u0645\u0627","\u0647\u0645\u0627","\u0648\u0647\u0645","\u0644\u0647\u0630\u0647","\u0639\u0646\u0647","\u0645\u062A\u0646","\u0644\u0645\u0627","\u0643\u0645","\u0645\u062A\u0649",
31
+ "\u0647\u0643\u0630\u0627","\u0627\u064A\u0647","\u0644\u0643\u0646\u0647","\u062A\u0645","\u0644\u064A\u0643","\u0648\u0644\u0643","\u0644\u0645\u0630\u0627","\u062C\u062F","\u0641\u0641\u064A","\u062F\u064A","\u0625\u064A",
32
+ "\u0635\u0641\u0631","\u0648\u0627\u062D\u062F","\u0627\u062B\u0646\u0627\u0646","\u062B\u0644\u0627\u062B\u0629","\u0623\u0631\u0628\u0639\u0629","\u062E\u0645\u0633\u0629","\u0633\u062A\u0629","\u0633\u0628\u0639\u0629",
33
+ "\u062B\u0645\u0627\u0646\u064A\u0629","\u062A\u0633\u0639\u0629","\u0639\u0634\u0631\u0629","\u0639\u0634\u0631","\u0623\u062D\u062F",
34
+ "\u0627\u062B\u0646\u0627","\u062B\u0644\u0627\u062B\u0629","\u0623\u0631\u0628\u0639\u0629","\u062E\u0645\u0633\u0629","\u0633\u062A\u0629",
35
+ "\u0633\u0628\u0639\u0629","\u062B\u0645\u0627\u0646\u064A\u0629","\u062A\u0633\u0639\u0629","\u0639\u0634\u0631\u0648\u0646","\u062B\u0644\u0627\u062B\u0648\u0646",
36
+ "\u0623\u0631\u0628\u0639\u0648\u0646","\u062E\u0645\u0633\u0648\u0646","\u0633\u062A\u0648\u0646","\u0633\u0628\u0639\u0648\u0646","\u062B\u0645\u0627\u0646\u0648\u0646","\u062A\u0633\u0639\u0648\u0646","\u0645\u0626\u0629",
37
+ "\u0645\u0627\u0626\u0629","\u0623\u0646\u0627","\u0627\u0646\u062A","\u0627\u0646\u062A\u064E","\u0627\u0646\u062A\u0649","\u0627\u0646\u062A\u0650","\u0647\u0648","\u0647\u064A","\u0646\u062D\u0646","\u0623\u0646\u062A\u0645\u0627",
38
+ "\u0647\u0645\u0627","\u0623\u0646\u062A\u0645","\u0623\u0646\u062A\u0646","\u0647\u0645","\u0647\u0646"].freeze
39
+
40
+ # Diacritics
41
+ DIACRITICS = "[\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0640]"
42
+
43
+ # Alifs
44
+ # Initial Alifs
45
+ ALIFS = "[\u0622\u0623\u0625\u0671]"
46
+
47
+ # Hamzaas
48
+ HAMZAAS = "[\u0621\u0624\u0626]"
49
+
50
+ # Affix sets
51
+ # Prefixes of length three
52
+ P3 = ["\u0643\u0627\u0644", "\u0628\u0627\u0644", "\u0648\u0644\u0644", "\u0648\u0627\u0644"]
53
+
54
+ # Prefixes of length two
55
+ P2 = ["\u0627\u0644", "\u0644\u0644"].freeze
56
+
57
+ # Prefixes of length one
58
+ P1 = ["\u0644", "\u0628", "\u0641", "\u0633", "\u0648","\u064a", "\u062a", "\u0646", "\u0627"].freeze
59
+
60
+ # Suffixes of length three
61
+ S3 = ["\u062a\u0645\u0644", "\u0647\u0645\u0644","\u062a\u0627\u0646", "\u062a\u064a\u0646","\u0643\u0645\u0644"].freeze
62
+
63
+ # Suffixes of length two
64
+ S2 = ["\u0648\u0646", "\u0627\u062a", "\u0627\u0646","\u064a\u0646", "\u062a\u0646", "\u0643\u0645","\u0647\u0646", "\u0646\u0627", "\u064a\u0627",
65
+ "\u0647\u0627", "\u062a\u0645", "\u0643\u0646","\u0646\u064a", "\u0648\u0627", "\u0645\u0627","\u0647\u0645"].freeze
66
+
67
+ # Suffixes of length one
68
+ S1 = ["\u0629", "\u0647", "\u064a", "\u0643", "\u062a","\u0627", "\u0646"].freeze
69
+
70
+ # Patterns and roots
71
+ # Pattern of length four
72
+ PR4 = { 0 => ["\u0645"],
73
+ 1 => ["\u0627"],
74
+ 2 => ["\u0627", "\u0648", "\u064A"],
75
+ 3 => ["\u0629"]}.freeze
76
+
77
+ # Pattern of length five and length three roots
78
+ PR53 = {0 => ["\u0627", "\u062a"],
79
+ 1 => ["\u0627", "\u064a", "\u0648"],
80
+ 2 => ["\u0627", "\u062a", "\u0645"],
81
+ 3 => ["\u0645", "\u064a", "\u062a"],
82
+ 4 => ["\u0645", "\u062a"],
83
+ 5 => ["\u0627", "\u0648"],
84
+ 6 => ["\u0627", "\u0645"]}.freeze
85
+
86
+ # Letters
87
+ DOUBLE_WAW = "\u0648\u0648"
88
+ ALEF = "\u0627"
89
+ MEEM = "\u0645"
90
+ TEH_MARBUTA = "\u0629"
91
+ TEH = "\u062a"
92
+ NOON = "\u0646"
93
+ YEH = "\u064a"
94
+ WAW = "\u0648"
95
+ YEH_WITH_HAMZA_ABOVE = "\u0626"
96
+
97
+ #STEMS
98
+ IST = "\u0627\u0633\u062a"
99
+ MST = "\u0645\u0633\u062a"
100
+ MT = "\u0645\u062a"
101
+
102
+ end
@@ -0,0 +1,3 @@
1
+ module NlpArabic
2
+ VERSION = "0.1.0"
3
+ end
@@ -0,0 +1,23 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'nlp_arabic/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "nlp_arabic"
8
+ spec.version = NlpArabic::VERSION
9
+ spec.authors = ["Othmane Laousy"]
10
+ spec.email = ["othmane.laousy@gmail.com"]
11
+
12
+ spec.summary = %q{Natural Language Processing Tools for Arabic}
13
+ spec.description = %q{This gem is intended to contain tools for Arabic Natural Language Processing.}
14
+ spec.homepage = "https://github.com/othmanela/nlp_arabic"
15
+
16
+ spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test)/}) }
17
+ spec.bindir = "exe"
18
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_development_dependency "bundler", "~> 1.9"
22
+ spec.add_development_dependency "rake", "~> 10.0"
23
+ end
metadata ADDED
@@ -0,0 +1,82 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: nlp_arabic
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Othmane Laousy
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2015-05-11 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bundler
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '1.9'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '1.9'
27
+ - !ruby/object:Gem::Dependency
28
+ name: rake
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '10.0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '10.0'
41
+ description: This gem is intended to contain tools for Arabic Natural Language Processing.
42
+ email:
43
+ - othmane.laousy@gmail.com
44
+ executables: []
45
+ extensions: []
46
+ extra_rdoc_files: []
47
+ files:
48
+ - ".gitignore"
49
+ - ".travis.yml"
50
+ - Gemfile
51
+ - README.md
52
+ - Rakefile
53
+ - bin/console
54
+ - bin/setup
55
+ - lib/nlp_arabic.rb
56
+ - lib/nlp_arabic/characters.rb
57
+ - lib/nlp_arabic/version.rb
58
+ - nlp_arabic.gemspec
59
+ homepage: https://github.com/othmanela/nlp_arabic
60
+ licenses: []
61
+ metadata: {}
62
+ post_install_message:
63
+ rdoc_options: []
64
+ require_paths:
65
+ - lib
66
+ required_ruby_version: !ruby/object:Gem::Requirement
67
+ requirements:
68
+ - - ">="
69
+ - !ruby/object:Gem::Version
70
+ version: '0'
71
+ required_rubygems_version: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - ">="
74
+ - !ruby/object:Gem::Version
75
+ version: '0'
76
+ requirements: []
77
+ rubyforge_project:
78
+ rubygems_version: 2.4.6
79
+ signing_key:
80
+ specification_version: 4
81
+ summary: Natural Language Processing Tools for Arabic
82
+ test_files: []