nlp_arabic 0.1.0

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 9dce240342285bde2206509493990d37a44143df
4
+ data.tar.gz: b55584ef3a20f4b60f1f0637108149922605f06d
5
+ SHA512:
6
+ metadata.gz: 9d92a384d51125411cca0a89479c7889830b92e4c86eaf29241bad88029bf1bc704a916151b36438737d7de4725eb1d39bd151d077685010419e4d8022824598
7
+ data.tar.gz: cd0855407749d5b51d22c43eef77027b53d8974d6ec2dcce27f7edee3577d122cd2c884fa26fbc8f04f09f7d0c0cc3f46d090c0dc504e8aba01d9606f5b7d914
@@ -0,0 +1,9 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
@@ -0,0 +1,3 @@
1
+ language: ruby
2
+ rvm:
3
+ - 2.2.0
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in nlp_arabic.gemspec
4
+ gemspec
@@ -0,0 +1,67 @@
1
+ NlpArabic
2
+ =========
3
+
4
+ This gem is intended to contain tools for Arabic Natural Language Processing.
5
+ As of version 0.1, this toolkit gem allows you to:
6
+
7
+ 1. Clean a text using a stop list. This stop list was generated from tf-idf scores calculated on words from over 900 articles. The selected words were also checked and validated by hand, which resulted in a stop list of over 270 words.
8
+
9
+ 2. Stem a word or a text. The stemming algorithm used is the ISRI Arabic stemmer. It is described in the following research paper:
10
+
11
+ [Arabic Stemming without a root dictionary](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1428453&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F9755%2F30835%2F01428453.pdf%3Farnumber%3D1428453)
12
+
13
+ This root-extraction stemmer is similar to the Khoja stemmer but does not use a root dictionary, which can be laborious to maintain. Also, when the root cannot be found, the ISRI stemmer returns a normalized form rather than the original unmodified form. Overall, the ISRI stemmer has been shown to perform as well as, if not better than, the Khoja stemmer.
14
+
15
+
16
+ Installation
17
+ ============
18
+
19
+ Add this line to your application's Gemfile:
20
+
21
+ ```ruby
22
+ gem 'nlp_arabic'
23
+ ```
24
+
25
+ And then execute:
26
+
27
+ $ bundle
28
+
29
+ Or install it yourself as:
30
+
31
+ $ gem install nlp_arabic
32
+
33
+ ## Usage
34
+
35
+ Once installed, you can use it like this:
36
+
37
+ NlpArabic.clean_text(text) will return the text without the stop words.
38
+
39
+ NlpArabic.stem(word) will return the word stemmed.
40
+
41
+ NlpArabic.stem_text(text) will stem an entire text.
42
+
43
+ NlpArabic.clean_and_stem(text) will do both.
44
+
45
+ NlpArabic.wash_and_stem(text) will stem the text removing stop words and delimiters from it.
46
+
47
+ NlpArabic.tokenize_text(text) will break the text into an array of words and delimiters.
48
+
49
+ Each step of the ISRI algorithm is coded in a separate function so you should be able to find the helper function you may be looking for just by browsing the code.
50
+
51
+ Development
52
+ ===========
53
+
54
+ After checking out the repo, run `bin/console` for an interactive prompt that will allow you to experiment. For now the gem has no runtime dependencies, so you don't need to run `bin/setup`.
55
+
56
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release` to create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
57
+
58
+ Contributing
59
+ ============
60
+ You are more than welcome to contribute to this project :) Please try to respect the Ruby style guidelines described [here](https://github.com/bbatsov/ruby-style-guide). The default encoding used is UTF-8.
61
+
62
+ 1. Fork it ( https://github.com/othmanela/nlp_arabic/fork )
63
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
64
+ 3. Write unit tests and make sure all of them (including the old ones) pass
65
+ 4. Commit your changes (`git commit -am 'Add some feature'`)
66
+ 5. Push to the branch (`git push origin my-new-feature`)
67
+ 6. Create a new Pull Request
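
The README's description of tokenizing and stop-list cleaning can be sketched without installing the gem. The snippet below is only an illustration: `tokenize`, `clean` and `SAMPLE_STOP_LIST` are hypothetical stand-ins (the sample list is a three-word subset chosen for the example, not the gem's full `STOP_LIST`), and the regular expression mirrors the one in `tokenize_text` with the Arabic comma and question mark written as escapes.

```ruby
# Hypothetical three-word subset of a stop list (fi, min, ala),
# used only for this illustration.
SAMPLE_STOP_LIST = ["\u0641\u064a", "\u0645\u0646", "\u0639\u0644\u0649"].freeze

# Splits on whitespace and keeps runs of punctuation as separate tokens.
# Capture groups make String#split return the delimiters as well.
def tokenize(text)
  pattern = /\s|(\?+)|(\.+)|(!+)|(,+)|(;+)|(\u060C+)|(\u061F+)|(:+)|(\(+)|(\)+)/
  text.split(pattern).reject(&:empty?)
end

# Removes every token that appears in the sample stop list,
# then joins the remaining tokens with single spaces.
def clean(text)
  (tokenize(text) - SAMPLE_STOP_LIST).join(' ')
end

p tokenize("a b, c?")
```

Because array difference removes all occurrences of a stop word, a word that appears several times in the text is dropped every time, and punctuation tokens survive unless they are listed too.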
@@ -0,0 +1,10 @@
1
+ require "bundler/gem_tasks"
2
+ require "rake/testtask"
3
+
4
+ Rake::TestTask.new do |t|
5
+ t.libs << "test"
6
+ t.test_files = FileList["test/**/*_test.rb"]
7
+ t.verbose = true
8
+ end
9
+
10
+ task default: :test
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "nlp_arabic"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start
@@ -0,0 +1,7 @@
1
+ #!/bin/bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+
5
+ bundle install
6
+
7
+ # Do any other automated setup that you need to do here
@@ -0,0 +1,263 @@
1
+ require "nlp_arabic/version"
2
+ require "nlp_arabic/characters"
3
+
4
+ module NlpArabic
5
+ def self.stem(word)
6
+ # This function stems a word following the steps of ISRI stemmer
7
+ # Step 1: remove diacritics
8
+ word = remove_diacritics(word)
9
+ # Step 2: normalize hamza, waw with hamza and yeh with hamza to bare alef
10
+ word = normalize_hamzaas(word)
11
+ # Step 3: remove prefix of size 3 then 2
12
+ word = remove_prefix(word)
13
+ # Step 4: remove suffix of size 3 then 2
14
+ word = remove_suffix(word)
15
+ # Step 5: remove the connective waw
16
+ word = remove_waw(word)
17
+ # Step 6: convert initial alef (optional)
18
+ word = convert_initial_alef(word)
19
+ # Step 7: if the word is longer than 3 letters, process it according to its length
20
+ if word.length == 4
21
+ word = word_4(word)
22
+ elsif word.length == 5
23
+ word = pattern_53(word)
24
+ word = word_5(word)
25
+ elsif word.length == 6
26
+ word = pattern_6(word)
27
+ word = word_6(word)
28
+ elsif word.length == 7
29
+ word = short_suffix(word)
30
+ word = short_prefix(word) if word.length == 7
31
+ if word.length == 6
32
+ word = pattern_6(word)
33
+ word = word_6(word)
34
+ end
35
+ end
36
+ return word
37
+ end
38
+
39
+ def self.clean_text(text)
40
+ # cleans the text using a stop list
41
+ tokenized_text = NlpArabic.tokenize_text(text)
42
+ clean_text = (tokenized_text - NlpArabic::STOP_LIST)
43
+ return clean_text.join(' ')
44
+ end
45
+
46
+ def self.stem_text(text)
47
+ # Only stems the text using the ISRI algorithm
48
+ tokenized_text = NlpArabic.tokenize_text(text)
49
+ for i in (0..(tokenized_text.length-1))
50
+ tokenized_text[i] = stem(tokenized_text[i]) if NlpArabic.is_alpha(tokenized_text[i])
51
+ end
52
+ return tokenized_text.join(' ')
53
+ end
54
+
55
+ def self.clean_and_stem(text)
56
+ # Cleans the text using the stop list, then stems it
57
+ tokenized_text = NlpArabic.tokenize_text(text)
58
+ clean_text = (tokenized_text - NlpArabic::STOP_LIST)
59
+ for i in (0..(clean_text.length-1))
60
+ clean_text[i]= stem(clean_text[i]) if NlpArabic.is_alpha(clean_text[i])
61
+ end
62
+ return clean_text.join(' ')
63
+ end
64
+
65
+ def self.tokenize_text(text)
66
+ return text.split(/\s|(\?+)|(\.+)|(!+)|(\,+)|(\;+)|(\،+)|(\؟+)|(\:+)|(\(+)|(\)+)/).delete_if(&:empty?)
67
+ end
68
+
69
+ def self.wash_and_stem(text)
70
+ clean_text = text.gsub(/[._,،\"\':–%\/;·&?؟()\”\“]/, '').split - NlpArabic::STOP_LIST
71
+ new_text = []
72
+ for i in (0..(clean_text.length-1))
73
+ new_text << stem(clean_text[i]) if NlpArabic.is_alpha(clean_text[i])
74
+ end
75
+ new_text -= NlpArabic::STOP_LIST
76
+ return new_text.join(' ')
77
+ end
78
+
79
+ def self.is_alpha(word)
80
+ # checks whether a word consists only of alphabetic characters
81
+ return !!word.match(/^[[:alpha:]]+$/)
82
+ end
83
+
84
+ def self.remove_na_characters(word)
85
+ # strips surrounding whitespace and removes punctuation characters from the word
86
+ return word.strip.gsub(/[._,،\"\':–%\/;·&?؟()\”\“]/, '')
87
+ end
88
+
89
+ def self.remove_diacritics(word)
90
+ # removes Arabic diacritics (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun) and tatweel
91
+ return word.gsub(/#{NlpArabic::DIACRITICS}/, '')
92
+ end
93
+
94
+ def self.convert_initial_alef(word)
95
+ # converts all the types of ALEF to a bare alef
96
+ return word.gsub(/#{NlpArabic::ALIFS}/, NlpArabic::ALEF)
97
+ end
98
+
99
+ def self.normalize_hamzaas(word)
100
+ # Normalize the hamzaas to an alef
101
+ return word.gsub(/#{NlpArabic::HAMZAAS}/, NlpArabic::ALEF)
102
+ end
103
+
104
+ def self.remove_prefix(word)
105
+ # Removes the prefixes of length three, then the prefixes of length two
106
+ if word.length >= 6
107
+ return word[3..-1] if word.start_with?(*NlpArabic::P3)
108
+ end
109
+ if word.length >= 5
110
+ return word[2..-1] if word.start_with?(*NlpArabic::P2)
111
+ end
112
+ return word
113
+ end
114
+
115
+ def self.remove_suffix(word)
116
+ # Removes the suffixes of length three, then the suffixes of length two
117
+ if word.length >= 6
118
+ return word[0..-4] if word.end_with?(*NlpArabic::S3)
119
+ end
120
+ if word.length >= 5
121
+ return word[0..-3] if word.end_with?(*NlpArabic::S2)
122
+ end
123
+ return word
124
+ end
125
+
126
+ def self.remove_waw(word)
127
+ # Removes the initial و if the word starts with a double و
128
+ if word.length >= 4
129
+ return word[1..-1] if word.start_with?(*NlpArabic::DOUBLE_WAW)
130
+ end
131
+ return word
132
+ end
133
+
134
+ def self.word_4(word)
135
+ # Processes the words of length four
136
+ if NlpArabic::PR4[0].include? word[0]
137
+ return word[1..-1]
138
+ elsif NlpArabic::PR4[1].include? word[1]
139
+ word[1] = ''
140
+ elsif NlpArabic::PR4[2].include? word[2]
141
+ word[2] = ''
142
+ elsif NlpArabic::PR4[3].include? word[3]
143
+ word[3] = ''
144
+ else
145
+ word = short_suffix(word)
146
+ word = short_prefix(word) if word.length == 4
147
+ end
148
+ return word
149
+ end
150
+
151
+ def self.word_5(word)
152
+ # Processes the words of length five
153
+ if word.length == 4
154
+ word = word_4(word)
155
+ elsif word.length == 5
156
+ word = pattern_54(word)
157
+ end
158
+ return word
159
+ end
160
+
161
+ def self.pattern_53(word)
162
+ # Helper function that processes the length five patterns and extracts the length three roots
163
+ if NlpArabic::PR53[0].include?(word[2]) && word[0] == NlpArabic::ALEF
164
+ word = word[1] + word[3..-1]
165
+ elsif NlpArabic::PR53[1].include?(word[3]) && word[0] == NlpArabic::MEEM
166
+ word = word[1..2] + word[4]
167
+ elsif NlpArabic::PR53[2].include?(word[0]) && word[4] == NlpArabic::TEH_MARBUTA
168
+ word = word[1..3]
169
+ elsif NlpArabic::PR53[3].include?(word[0]) && word[2] == NlpArabic::TEH
170
+ word = word[1] + word[3..-1]
171
+ elsif NlpArabic::PR53[4].include?(word[0]) && word[2] == NlpArabic::ALEF
172
+ word = word[1] + word[3..-1]
173
+ elsif NlpArabic::PR53[5].include?(word[2]) && word[4] == NlpArabic::TEH_MARBUTA
174
+ word = word[0..1] + word[3]
175
+ elsif NlpArabic::PR53[6].include?(word[0]) && word[1] == NlpArabic::NOON
176
+ word = word[2..-1]
177
+ elsif word[3] == NlpArabic::ALEF && word[0] == NlpArabic::ALEF
178
+ word = word[1..2] + word[4]
179
+ elsif word[4] == NlpArabic::NOON && word[3] == NlpArabic::ALEF
180
+ word = word[0..2]
181
+ elsif word[3] == NlpArabic::YEH && word[0] == NlpArabic::TEH
182
+ word = word[1..2] + word[4]
183
+ elsif word[3] == NlpArabic::WAW && word[0] == NlpArabic::ALEF
184
+ word = word[0] + word[2] + word[4]
185
+ elsif word[2] == NlpArabic::ALEF && word[1] == NlpArabic::WAW
186
+ word = word[0] + word[3..-1]
187
+ elsif word[3] == NlpArabic::YEH_WITH_HAMZA_ABOVE && word[2] == NlpArabic::ALEF
188
+ word = word[0..1] + word[4]
189
+ elsif word[4] == NlpArabic::TEH_MARBUTA && word[1] == NlpArabic::ALEF
190
+ word = word[0] + word[2..3]
191
+ elsif word[4] == NlpArabic::YEH && word[2] == NlpArabic::ALEF
192
+ word = word[0..1] + word[3]
193
+ else
194
+ word = short_suffix(word)
195
+ word = short_prefix(word)if word.length == 5
196
+ end
197
+ return word
198
+ end
199
+
200
+ def self.pattern_54(word)
201
+ # Helper function that processes the length five patterns and extracts the length four roots
202
+ if NlpArabic::PR53[2].include? word[0]
203
+ word = word[1..-1]
204
+ elsif word[4] == NlpArabic::TEH_MARBUTA
205
+ word = word[0..3]
206
+ elsif word[2] == NlpArabic::ALEF
207
+ word = word[0..1] + word[3..-1]
208
+ end
209
+ return word
210
+ end
211
+
212
+ def self.word_6(word)
213
+ # Processes the words of length six
214
+ if word.length == 5
215
+ word = pattern_53(word)
216
+ word = word_5(word)
217
+ elsif word.length == 6
218
+ word = pattern_64(word)
219
+ end
220
+ return word
221
+ end
222
+
223
+ def self.pattern_6(word)
224
+ # Helper function that processes the length six patterns and extracts the length three roots
225
+ if word.start_with?(*NlpArabic::IST) || word.start_with?(*NlpArabic::MST)
226
+ word = word[3..-1]
227
+ elsif word[0] == NlpArabic::MEEM && word[3] == NlpArabic::ALEF && word[5] == NlpArabic::TEH_MARBUTA
228
+ word = word[1..2] + word[4]
229
+ elsif word[0] == NlpArabic::ALEF && word[2] == NlpArabic::TEH && word[4] == NlpArabic::ALEF
230
+ word = word[1] + word[3] + word[5]
231
+ elsif word[0] == NlpArabic::ALEF && word[3] == NlpArabic::WAW && word[2] == word[4]
232
+ word = word[1] + word[4..-1]
233
+ elsif word[0] == NlpArabic::TEH && word[2] == NlpArabic::ALEF && word[4] == NlpArabic::YEH
234
+ word = word[1] + word[3] + word[5]
235
+ else
236
+ word = short_suffix(word)
237
+ word = short_prefix(word) if word.length == 6
238
+ end
239
+ return word
240
+ end
241
+
242
+ def self.pattern_64(word)
243
+ # Helper function that processes the length six patterns and extracts the length four roots
244
+ if word[0] == NlpArabic::ALEF && word[4] == NlpArabic::ALEF
245
+ word = word[1..3] + word[5]
246
+ elsif word.start_with?(NlpArabic::MT)
247
+ word = word[2..-1]
248
+ end
249
+ return word
250
+ end
251
+
252
+ def self.short_prefix(word)
253
+ # Removes the short prefixes
254
+ word = word[1..-1] if word.start_with?(*NlpArabic::P1)
255
+ return word
256
+ end
257
+
258
+ def self.short_suffix(word)
259
+ # Removes the short suffixes
260
+ word = word[0..-2] if word.end_with?(*NlpArabic::S1)
261
+ return word
262
+ end
263
+ end
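
The length-guarded affix stripping in `remove_prefix` and `remove_suffix` can be illustrated on its own. The following is a minimal sketch rather than the gem's code: `strip_prefix` and `strip_suffix` are hypothetical names, the affix lists are copied from the `P3`, `P2`, `S3` and `S2` constants in `characters.rb`, and the length thresholds (6 for three-letter affixes, 5 for two-letter ones) follow the methods above.

```ruby
# Affix lists copied from the constants of the same name in characters.rb.
P3 = ["\u0643\u0627\u0644", "\u0628\u0627\u0644", "\u0648\u0644\u0644", "\u0648\u0627\u0644"].freeze
P2 = ["\u0627\u0644", "\u0644\u0644"].freeze
S3 = ["\u062a\u0645\u0644", "\u0647\u0645\u0644", "\u062a\u0627\u0646", "\u062a\u064a\u0646", "\u0643\u0645\u0644"].freeze
S2 = ["\u0648\u0646", "\u0627\u062a", "\u0627\u0646", "\u064a\u0646", "\u062a\u0646", "\u0643\u0645",
      "\u0647\u0646", "\u0646\u0627", "\u064a\u0627", "\u0647\u0627", "\u062a\u0645", "\u0643\u0646",
      "\u0646\u064a", "\u0648\u0627", "\u0645\u0627", "\u0647\u0645"].freeze

# Tries three-letter prefixes first, then two-letter ones.
# The length guards keep short words from being reduced below three letters.
def strip_prefix(word)
  return word[3..-1] if word.length >= 6 && word.start_with?(*P3)
  return word[2..-1] if word.length >= 5 && word.start_with?(*P2)
  word
end

# Same idea for suffixes: three-letter suffixes first, then two-letter ones.
def strip_suffix(word)
  return word[0..-4] if word.length >= 6 && word.end_with?(*S3)
  return word[0..-3] if word.length >= 5 && word.end_with?(*S2)
  word
end
```

At most one prefix and one suffix are removed per call, so a word carrying both a three-letter and a two-letter prefix only loses the longer one.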
@@ -0,0 +1,102 @@
1
+ module NlpArabic
2
+
3
+ # Stop List
4
+ STOP_LIST = ["\u0648","\u064a\u0643\u0648\u0646","\u0644\u064A\u0633","\u0648\u0644\u064a\u0633","\u0648\u0643\u0627\u0646","\u0643\u0630\u0644\u0643","\u0627\u0644\u062a\u064a","\u0648\u0628\u064a\u0646",
5
+ "\u0639\u0644\u064a\u0647\u0627","\u0639\u0644\u064A","\u0645\u0633\u0627\u0621","\u0627\u0644\u0630\u064a","\u0648\u0643\u0627\u0646\u062a","\u0644\u0643\u0646","\u0648\u0644\u0643\u0646","\u0648\u0627\u0644\u062a\u064a",
6
+ "\u062a\u0643\u0648\u0646","\u0627\u0644\u064a\u0648\u0645","\u0627\u0644\u0644\u0630\u064a\u0646","\u0639\u0644\u064a\u0647","\u0643\u0627\u0646\u062a",
7
+ "\u0644\u0630\u0644\u0643","\u0623\u0645\u0627\u0645","\u0647\u0646\u0627","\u0647\u0646\u0627\u0643","\u0645\u0646\u0647\u0627","\u0645\u0627\u0632\u0627\u0644","\u0644\u0627\u0632\u0627\u0644",
8
+ "\u0644\u0627\u064a\u0632\u0627\u0644","\u0645\u0627\u064a\u0632\u0627\u0644","\u0627\u0635\u0628\u062d","\u0623\u0635\u0628\u062d","\u0623\u0645\u0633\u0649",
9
+ "\u0627\u0645\u0633\u0649","\u0623\u0636\u062d\u0649","\u0627\u0636\u062d\u0649","\u0645\u0627\u0628\u0631\u062d","\u0645\u0627\u0641\u062a\u0626","\u0645\u0627\u0627\u0646\u0641\u0643",
10
+ "\u0644\u0627\u0633\u064a\u0645\u0627","\u0648\u0644\u0627\u064a\u0632\u0627\u0644","\u0627\u0644\u062d\u0627\u0644\u064a","\u0627\u0644\u064a\u0647\u0627","\u0627\u0644\u0630\u064a\u0646","\u0641\u0627\u0646\u0647",
11
+ "\u0648\u0627\u0644\u0630\u064a","\u0648\u0647\u0630\u0627","\u0644\u0647\u0630\u0627","\u0641\u0643\u0627\u0646","\u0633\u062a\u0643\u0648\u0646","\u0627\u0644\u064a\u0647",
12
+ "\u064a\u0645\u0643\u0646","\u0628\u0647\u0630\u0627","\u0627\u0644\u0630\u0649","\u0641\u0649","\u0641\u064a","\u0643\u0644","\u0644\u0645","\u0644\u0646","\u0644\u0647","\u0645\u0646","\u0647\u0648",
13
+ "\u0643\u0645\u0627","\u0644\u0647\u0627","\u0645\u0646\u0630","\u0642\u062F","\u0648\u0642\u062F","\u0648\u0644\u0627","\u0648\u0642\u0627\u0644","\u0648\u0642\u0627\u0644\u062A",
14
+ "\u0644\u0644\u0627\u0645\u0645","\u0641\u064A\u0647","\u0643\u0644\u0645","\u0648\u0641\u064A","\u0648\u0642\u0641","\u0648\u0644\u0645","\u0648\u0645\u0646","\u0648\u0647\u0648","\u0648\u0647\u064A",
15
+ "\u062D\u064A\u062B","\u0627\u0643\u062F","\u0627\u0644\u0627","\u0627\u0645\u0627","\u0627\u0645\u0633","\u0627\u0644\u0633\u0627\u0628\u0642","\u0627\u0644\u062A\u0649","\u0627\u0643\u062B\u0631",
16
+ "\u0627\u064A\u0627\u0631","\u0627\u064A\u0636\u0627","\u0627\u0644\u0630\u0627\u062A\u064A","\u0627\u0644\u0627\u062E\u064A\u0631\u0629","\u0627\u0644\u0627\u0646","\u0627\u0645\u0627\u0645","\u0627\u064A\u0627\u0645",
17
+ "\u062E\u0644\u0627\u0644","\u062D\u0648\u0627\u0644\u0649","\u0630\u0644\u0643","\u062F\u0648\u0646","\u062D\u0648\u0644","\u062D\u064A\u0646","\u0627\u0644\u0641","\u0627\u0644\u0649","\u0648\u062A\u0645",
18
+ "\u0627\u0646\u0647","\u0627\u0648\u0644","\u0636\u0645\u0646","\u0627\u0646\u0647\u0627","\u062C\u0645\u064A\u0639","\u0627\u0644\u0645\u0627\u0636\u064A","\u0627\u0644\u0648\u0642\u062A",
19
+ "\u0627\u0644\u0645\u0642\u0628\u0644","\u0644\u0627","\u0645\u0627","\u0645\u0639","\u0647\u0630\u0627","\u0648\u0627\u062D\u062F","\u0641\u0627\u0646","\u0642\u0627\u0644","\u0643\u0627\u0646",
20
+ "\u0644\u062F\u0649","\u0646\u062D\u0648","\u0647\u0630\u0647","\u0648\u0627\u0646","\u0648\u0627\u0643\u062F","\u0639\u0634\u0631","\u0639\u062F\u062F","\u0639\u062F\u0629","\u0639\u0634\u0631\u0629","\u0639\u062F\u0645",
21
+ "\u0639\u0627\u0645","\u0639\u0627\u0645\u0627","\u0639\u0646","\u0639\u0646\u062F","\u0639\u0646\u062F\u0645\u0627","\u0639\u0644\u0649","\u0633\u0646\u0629","\u0633\u0646\u0648\u0627\u062A","\u062A\u0645","\u0636\u062F",
22
+ "\u0628\u0639\u062F","\u0628\u0639\u0636","\u0627\u0639\u0627\u062F\u0629","\u0627\u0639\u0644\u0646\u062A","\u0628\u0633\u0628\u0628","\u062D\u062A\u0649","\u0627\u0630\u0627","\u0627\u062D\u062F","\u0645\u0645\u0646",
23
+ "\u0627\u062B\u0631","\u063A\u062F\u0627","\u0634\u062E\u0635\u0627","\u0635\u0628\u0627\u062D","\u0627\u0637\u0627\u0631","\u0627\u0631\u0628\u0639\u0629","\u0627\u062E\u0631\u0649","\u0628\u0627\u0646",
24
+ "\u0627\u062C\u0644","\u063A\u064A\u0631","\u0628\u0634\u0643\u0644","\u062D\u0627\u0644\u064A\u0627","\u0628\u0646","\u0628\u0647","\u062B\u0645","\u0627\u0641","\u0627\u0646","\u0627\u0648","\u0627\u064A",
25
+ "\u0628\u0647\u0627","\u0635\u0641\u0631","\u0627\u0644\u062B\u0627\u0646\u064A","\u0627\u0644\u062B\u0627\u0646\u064A\u0629","\u0627\u062F\u0627","\u0627\u0648\u0644\u0627","\u0648\u0644\u0643\u0646\u0647",
26
+ "\u0627\u0644\u0627\u0648\u0644","\u0627\u0644\u0627\u0648\u0644\u0649","\u0628\u064A\u0646","\u0630\u0644\u0643","\u0645\u0645\u0627","\u0631\u063A\u0645","\u0628\u064A","\u0644\u0627\u0646","\u0647\u0644","\u0644\u0648",
27
+ "\u0628\u0645\u0627","\u0627\u0646\u0627","\u062A\u064A","\u0628\u0644\u0627","\u0642\u0628\u0644","\u0627\u0644\u0646","\u064A\u0627\u0647","\u0644\u062F\u064A","\u0628\u0644","\u0644\u0646\u0627","\u0627\u0645",
28
+ "\u0627\u0646\u0646\u0627","\u0644\u0642\u062F","\u062D\u064A\u062A","\u0627\u0630\u0646","\u0627\u0644\u064A","\u0628\u0630\u0644\u0643","\u062E\u0644\u0644","\u062D\u0648\u0644","\u0644\u0643","\u062A\u0645\u0627",
29
+ "\u0644\u0645\u0646","\u0644\u0646\u0647","\u0627\u0644\u0627","\u0627\u064A\u0646","\u0639\u0645\u0627","\u0628\u0643\u0644","\u0648\u0647\u0646\u0627\u0643","\u0646\u0647\u0627",
30
+ "\u0648\u0647\u0630\u0647","\u0648\u0645\u0627","\u0647\u0645\u0627","\u0648\u0647\u0645","\u0644\u0647\u0630\u0647","\u0639\u0646\u0647","\u0645\u062A\u0646","\u0644\u0645\u0627","\u0643\u0645","\u0645\u062A\u0649",
31
+ "\u0647\u0643\u0630\u0627","\u0627\u064A\u0647","\u0644\u0643\u0646\u0647","\u062A\u0645","\u0644\u064A\u0643","\u0648\u0644\u0643","\u0644\u0645\u0630\u0627","\u062C\u062F","\u0641\u0641\u064A","\u062F\u064A","\u0625\u064A",
32
+ "\u0635\u0641\u0631","\u0648\u0627\u062D\u062F","\u0627\u062B\u0646\u0627\u0646","\u062B\u0644\u0627\u062B\u0629","\u0623\u0631\u0628\u0639\u0629","\u062E\u0645\u0633\u0629","\u0633\u062A\u0629","\u0633\u0628\u0639\u0629",
33
+ "\u062B\u0645\u0627\u0646\u064A\u0629","\u062A\u0633\u0639\u0629","\u0639\u0634\u0631\u0629","\u0639\u0634\u0631","\u0623\u062D\u062F",
34
+ "\u0627\u062B\u0646\u0627","\u062B\u0644\u0627\u062B\u0629","\u0623\u0631\u0628\u0639\u0629","\u062E\u0645\u0633\u0629","\u0633\u062A\u0629",
35
+ "\u0633\u0628\u0639\u0629","\u062B\u0645\u0627\u0646\u064A\u0629","\u062A\u0633\u0639\u0629","\u0639\u0634\u0631\u0648\u0646","\u062B\u0644\u0627\u062B\u0648\u0646",
36
+ "\u0623\u0631\u0628\u0639\u0648\u0646","\u062E\u0645\u0633\u0648\u0646","\u0633\u062A\u0648\u0646","\u0633\u0628\u0639\u0648\u0646","\u062B\u0645\u0627\u0646\u0648\u0646","\u062A\u0633\u0639\u0648\u0646","\u0645\u0626\u0629",
37
+ "\u0645\u0627\u0626\u0629","\u0623\u0646\u0627","\u0627\u0646\u062A","\u0627\u0646\u062A\u064E","\u0627\u0646\u062A\u0649","\u0627\u0646\u062A\u0650","\u0647\u0648","\u0647\u064A","\u0646\u062D\u0646","\u0623\u0646\u062A\u0645\u0627",
38
+ "\u0647\u0645\u0627","\u0623\u0646\u062A\u0645","\u0623\u0646\u062A\u0646","\u0647\u0645","\u0647\u0646"].freeze
39
+
40
+ # Diacritics
41
+ DIACRITICS = "[\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0640]"
42
+
43
+ # Alifs
44
+ # Initial Alifs
45
+ ALIFS = "[\u0622\u0623\u0625\u0671]"
46
+
47
+ # Hamzaas
48
+ HAMZAAS = "[\u0621\u0624\u0626]"
49
+
50
+ # Affix sets
51
+ # Prefixes of length three
52
+ P3 = ["\u0643\u0627\u0644", "\u0628\u0627\u0644", "\u0648\u0644\u0644", "\u0648\u0627\u0644"].freeze
53
+
54
+ # Prefixes of length two
55
+ P2 = ["\u0627\u0644", "\u0644\u0644"].freeze
56
+
57
+ # Prefixes of length one
58
+ P1 = ["\u0644", "\u0628", "\u0641", "\u0633", "\u0648","\u064a", "\u062a", "\u0646", "\u0627"].freeze
59
+
60
+ # Suffixes of length three
61
+ S3 = ["\u062a\u0645\u0644", "\u0647\u0645\u0644","\u062a\u0627\u0646", "\u062a\u064a\u0646","\u0643\u0645\u0644"].freeze
62
+
63
+ # Suffixes of length two
64
+ S2 = ["\u0648\u0646", "\u0627\u062a", "\u0627\u0646","\u064a\u0646", "\u062a\u0646", "\u0643\u0645","\u0647\u0646", "\u0646\u0627", "\u064a\u0627",
65
+ "\u0647\u0627", "\u062a\u0645", "\u0643\u0646","\u0646\u064a", "\u0648\u0627", "\u0645\u0627","\u0647\u0645"].freeze
66
+
67
+ # Suffixes of length one
68
+ S1 = ["\u0629", "\u0647", "\u064a", "\u0643", "\u062a","\u0627", "\u0646"].freeze
69
+
70
+ # Patterns and roots
71
+ # Pattern of length four
72
+ PR4 = { 0 => ["\u0645"],
73
+ 1 => ["\u0627"],
74
+ 2 => ["\u0627", "\u0648", "\u064A"],
75
+ 3 => ["\u0629"]}.freeze
76
+
77
+ # Pattern of length five and length three roots
78
+ PR53 = {0 => ["\u0627", "\u062a"],
79
+ 1 => ["\u0627", "\u064a", "\u0648"],
80
+ 2 => ["\u0627", "\u062a", "\u0645"],
81
+ 3 => ["\u0645", "\u064a", "\u062a"],
82
+ 4 => ["\u0645", "\u062a"],
83
+ 5 => ["\u0627", "\u0648"],
84
+ 6 => ["\u0627", "\u0645"]}.freeze
85
+
86
+ # Letters
87
+ DOUBLE_WAW = "\u0648\u0648"
88
+ ALEF = "\u0627"
89
+ MEEM = "\u0645"
90
+ TEH_MARBUTA = "\u0629"
91
+ TEH = "\u062a"
92
+ NOON = "\u0646"
93
+ YEH = "\u064a"
94
+ WAW = "\u0648"
95
+ YEH_WITH_HAMZA_ABOVE = "\u0626"
96
+
97
+ #STEMS
98
+ IST = "\u0627\u0633\u062a"
99
+ MST = "\u0645\u0633\u062a"
100
+ MT = "\u0645\u062a"
101
+
102
+ end
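
The character classes defined in this file drive the first two stemming steps (diacritic removal and hamza normalization). As a rough, self-contained sketch, assuming the same code points as `DIACRITICS`, `HAMZAAS` and `ALEF`, the two steps could be combined as follows; `normalize` is a hypothetical helper name and is not part of the gem.

```ruby
# Same code points as DIACRITICS in characters.rb: the eight harakat
# (fathatan through sukun) plus tatweel (U+0640).
DIACRITICS_RE = /[\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0640]/
# Same code points as HAMZAAS: hamza, waw with hamza, yeh with hamza.
HAMZAS_RE = /[\u0621\u0624\u0626]/
ALEF = "\u0627"

# Removes diacritics and tatweel, then maps every hamza form to bare alef.
def normalize(word)
  word.gsub(DIACRITICS_RE, '').gsub(HAMZAS_RE, ALEF)
end
```

Storing the classes as compiled regular expressions, instead of strings interpolated into a pattern on every call, is a design choice of this sketch only.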
@@ -0,0 +1,3 @@
1
+ module NlpArabic
2
+ VERSION = "0.1.0"
3
+ end
@@ -0,0 +1,23 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'nlp_arabic/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "nlp_arabic"
8
+ spec.version = NlpArabic::VERSION
9
+ spec.authors = ["Othmane Laousy"]
10
+ spec.email = ["othmane.laousy@gmail.com"]
11
+
12
+ spec.summary = %q{Natural Language Processing Tools for Arabic}
13
+ spec.description = %q{This gem is intended to contain tools for Arabic Natural Language Processing.}
14
+ spec.homepage = "https://github.com/othmanela/nlp_arabic"
15
+
16
+ spec.files = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test)/}) }
17
+ spec.bindir = "exe"
18
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_development_dependency "bundler", "~> 1.9"
22
+ spec.add_development_dependency "rake", "~> 10.0"
23
+ end
metadata ADDED
@@ -0,0 +1,82 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: nlp_arabic
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Othmane Laousy
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2015-05-11 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bundler
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '1.9'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '1.9'
27
+ - !ruby/object:Gem::Dependency
28
+ name: rake
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '10.0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '10.0'
41
+ description: This gem is intended to contain tools for Arabic Natural Language Processing.
42
+ email:
43
+ - othmane.laousy@gmail.com
44
+ executables: []
45
+ extensions: []
46
+ extra_rdoc_files: []
47
+ files:
48
+ - ".gitignore"
49
+ - ".travis.yml"
50
+ - Gemfile
51
+ - README.md
52
+ - Rakefile
53
+ - bin/console
54
+ - bin/setup
55
+ - lib/nlp_arabic.rb
56
+ - lib/nlp_arabic/characters.rb
57
+ - lib/nlp_arabic/version.rb
58
+ - nlp_arabic.gemspec
59
+ homepage: https://github.com/othmanela/nlp_arabic
60
+ licenses: []
61
+ metadata: {}
62
+ post_install_message:
63
+ rdoc_options: []
64
+ require_paths:
65
+ - lib
66
+ required_ruby_version: !ruby/object:Gem::Requirement
67
+ requirements:
68
+ - - ">="
69
+ - !ruby/object:Gem::Version
70
+ version: '0'
71
+ required_rubygems_version: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - ">="
74
+ - !ruby/object:Gem::Version
75
+ version: '0'
76
+ requirements: []
77
+ rubyforge_project:
78
+ rubygems_version: 2.4.6
79
+ signing_key:
80
+ specification_version: 4
81
+ summary: Natural Language Processing Tools for Arabic
82
+ test_files: []