rababa 0.1.0 → 0.1.1

Files changed (41)
  1. checksums.yaml +4 -4
  2. data/.github/workflows/python.yml +81 -0
  3. data/.github/workflows/release.yml +36 -0
  4. data/.github/workflows/ruby.yml +27 -0
  5. data/.gitignore +3 -0
  6. data/.rubocop.yml +1 -1
  7. data/CODE_OF_CONDUCT.md +13 -13
  8. data/README.adoc +80 -0
  9. data/Rakefile +1 -1
  10. data/docs/{research-arabic-diacritization-06-2021.md → research-arabic-diacritization-06-2021.adoc} +52 -37
  11. data/exe/rababa +1 -1
  12. data/lib/README.adoc +95 -0
  13. data/lib/rababa/diacritizer.rb +16 -8
  14. data/lib/rababa/encoders.rb +2 -2
  15. data/lib/rababa/harakats.rb +1 -1
  16. data/lib/rababa/reconcile.rb +1 -33
  17. data/lib/rababa/version.rb +1 -1
  18. data/models-data/README.adoc +6 -0
  19. data/python/README.adoc +211 -0
  20. data/python/config/cbhg.yml +1 -1
  21. data/python/config/test_cbhg.yml +51 -0
  22. data/python/dataset.py +23 -31
  23. data/python/diacritization_model_to_onnx.py +216 -15
  24. data/python/diacritizer.py +35 -31
  25. data/python/log_dir/CA_MSA.base.cbhg/models/README.adoc +2 -0
  26. data/python/log_dir/README.adoc +1 -0
  27. data/python/{requirement.txt → requirements.txt} +1 -1
  28. data/python/setup.py +32 -0
  29. data/python/trainer.py +10 -4
  30. data/python/util/reconcile_original_plus_diacritized.py +2 -0
  31. data/python/util/text_cleaners.py +59 -4
  32. data/rababa.gemspec +1 -1
  33. data/test-datasets/data-arabic-pointing/{Readme.md → README.adoc} +2 -1
  34. metadata +22 -18
  35. data/.github/workflows/main.yml +0 -18
  36. data/README.md +0 -73
  37. data/lib/README.md +0 -82
  38. data/models-data/README.md +0 -6
  39. data/python/README.md +0 -163
  40. data/python/log_dir/CA_MSA.base.cbhg/models/Readme.md +0 -2
  41. data/python/log_dir/README.md +0 -1
data/exe/rababa CHANGED
@@ -10,7 +10,7 @@ parser = Rababa.parser
 
  config_path = parser.has_key?(:config) ? parser[:config] : "config/model.yml"
 
- diacritizer = Rababa::Diacritizer.new(parser[:model_path], config_path)
+ diacritizer = Rababa::Diacritizer.new(parser[:model_path], YAML.load_file(config_path))
 
  if parser.has_key?(:text)
  # run diacritization text if has :text
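The change above moves YAML loading out of the constructor: the caller now parses the config file and hands the resulting Hash to `Rababa::Diacritizer.new`. A minimal sketch of the new calling convention, using a temporary config file (the keys `max_len` and `batch_size` are the ones the diacritizer reads per the diff; the `Diacritizer.new` call itself is left commented out, since it needs an actual ONNX model file):

```ruby
require "yaml"
require "tempfile"

# Write a small config in the shape of config/model.yml.
config_file = Tempfile.new(["model", ".yml"])
config_file.write("max_len: 200\nbatch_size: 32\n")
config_file.close

# The caller now parses the YAML and passes the Hash, not the path:
config = YAML.load_file(config_file.path)
config["max_len"]    # => 200
config["batch_size"] # => 32

# diacritizer = Rababa::Diacritizer.new(model_path, config)  # needs an ONNX model
```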
data/lib/README.adoc ADDED
@@ -0,0 +1,95 @@
+ = Arabic Diacritization in Ruby with Rababa
+
+ == Try out Rababa
+
+ * Install the gem as shown below
+ * Download a Ruby model from the https://github.com/secryst/rababa-models[releases]
+
+ == Usage
+
+ === Install
+
+ [source,sh]
+ ----
+ gem install rababa
+ ----
+
+ === Download the ONNX model
+
+ Please download the `diacritization_model_max_len_200.onnx` model file
+ from https://github.com/secryst/rababa-models/releases/tag/0.1.
+
+ === Running examples
+
+ One can diacritize either single strings:
+
+ [source,sh]
+ ----
+ rababa -t 'قطر' -m diacritization_model_max_len_200.onnx
+ # or when inside the gem directory during development
+ bundle exec exe/rababa -t 'قطر' -m diacritization_model_max_len_200.onnx
+ ----
+
+ Or files such as `data/example.txt` or your own Arabic file (the maximum string
+ length is specified in the model and has to match the `max_len` parameter in
+ `config/model.yml`):
+
+ [source,sh]
+ ----
+ rababa -f data/example.txt -m diacritization_model_max_len_200.onnx
+ # or when inside the gem directory during development
+ bundle exec exe/rababa -f data/example.txt -m diacritization_model_max_len_200.onnx
+ ----
+
+ Generic Arabic texts generally need preprocessing before Rababa can run on them.
+ Sentences can, for instance, be split with
+ https://github.com/Hamza5/Pipeline-diacritizer[Hamza5's Pipeline-diacritizer]:
+
+ ----
+ python __main__.py preprocess source destination
+ ----
+
+ == Training
+
+ === ONNX Models
+
+ They can either be built in the `/python` directory or downloaded from the
+ https://github.com/secryst/rababa-models[releases].
+
+ An ONNX model can also be generated by running the Python
+ https://github.com/interscript/rababa/blob/main/python/diacritization_model_to_onnx.py[code]
+ in this library. This requires going through some of the steps described in the
+ link above.
+
+ === Parameters
+
+ * `-t TEXT`, `--text=TEXT`: text to diacritize
+ * `-f FILE`, `--text_filename=FILE`: path to a file to diacritize
+ * `-m MODEL`, `--model_file=MODEL`: path to the ONNX model (**mandatory**)
+ * `-c CONFIG`, `--config=CONFIG`: path to the config file (default: `config/model.yml`)
+
+ === Config
+
+ ==== Key parameters
+
+ * `max_len`: `200` -- `600`
+
+ ** This parameter has to match the ONNX model built using the
+ https://github.com/interscript/rababa/blob/main/python/diacritization_model_to_onnx.py[code]
+ and following python/README.adoc.
+
+ ** Longer sentences need to be preprocessed, which can be done for
+ instance using https://github.com/Hamza5[Hamza5]'s
+ https://github.com/Hamza5/Pipeline-diacritizer/blob/master/pipeline_diacritizer/pipeline_diacritizer.py[code].
+
+ ** The smaller the value, the faster the network runs.
+
+ * `text_encoder`, corresponding to the https://github.com/interscript/rababa/blob/main/python/util/text_encoders.py[rules]:
+ ** `BasicArabicEncoder`
+ ** `ArabicEncoderWithStartSymbol`
+
+ * `text_cleaner`, corresponding to the https://github.com/interscript/rababa/blob/main/python/util/text_cleaners.py[logics]:
+ ** `basic_cleaners`: collapses redundant whitespace and strips the string
+ ** `valid_arabic_cleaners`: basic cleaning plus filtering to Arabic characters only
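The two cleaners listed above can be sketched in Ruby. This is an illustration only: the reference logic lives in `python/util/text_cleaners.py`, and the Arabic filter is approximated here with the Unicode `\p{Arabic}` property:

```ruby
# basic_cleaners: collapse redundant whitespace and strip the string.
def basic_cleaners(text)
  text.gsub(/\s+/, " ").strip
end

# valid_arabic_cleaners: basic cleaning plus keeping only Arabic characters
# (approximated with the Unicode Arabic script property plus whitespace).
def valid_arabic_cleaners(text)
  basic_cleaners(text).gsub(/[^\p{Arabic}\s]/, "").gsub(/\s+/, " ").strip
end

basic_cleaners("  قطر   العربية ")   # => "قطر العربية"
valid_arabic_cleaners("abc قطر 123") # => "قطر"
```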
data/lib/rababa/diacritizer.rb CHANGED
@@ -1,6 +1,6 @@
 
  # this refers to:
- # https://github.com/interscript/rababa/blob/master/python/diacritizer.py
+ # https://github.com/interscript/rababa/blob/main/python/diacritizer.py
  # as well a drastic simplification of
  # https://github.com/almodhfer/Arabic_Diacritization/blob/master/config_manager.py
 
@@ -16,27 +16,33 @@ module Rababa
  include Rababa::Harakats
  include Rababa::Reconcile
 
- def initialize(onnx_model_path, config_path)
+ def initialize(onnx_model_path, config)
 
  # load inference model from model_path
  @onnx_session = OnnxRuntime::InferenceSession.new(onnx_model_path)
 
  # load config
- @config = YAML.load_file(config_path)
+ @config = config
  @max_length = @config['max_len']
  @batch_size = @config['batch_size']
 
  # instantiate encoder's class
- @encoder = get_text_encoder()
+ @encoder = get_text_encoder
  @start_symbol_id = @encoder.start_symbol_id
 
  end
 
  # preprocess text into indices
  def preprocess_text(text)
- #if (text.length > @max_length)
- # raise ValueError.new('text length larger than max_length')
- #end
+ # if (text.length > @max_length)
+ # raise ValueError.new('text length larger than max_length')
+ # end
+ # hack in absence of preprocessing!
+ if text.length > @max_length
+ text = text[0..@max_length]
+ warn('WARNING:: string cut length > #{@max_length},\n')
+ warn('text:: '+text)
+ end
 
  text = @encoder.clean(text)
  text = remove_diacritics(text)
@@ -47,6 +53,8 @@ module Rababa
 
  # Diacritize single arabic strings
  def diacritize_text(text)
+ """Diacritize single arabic strings"""
+ text = text.strip()
  seq = preprocess_text(text)
 
  # initialize onnx computation
@@ -66,7 +74,7 @@ module Rababa
  def diacritize_file(path)
  texts = []
  File.open(path).each do |line|
- texts.push(line.chomp)
+ texts.push(line.chomp.strip())
  end
 
  # process batches
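Two details in the truncation hack added to `preprocess_text` are worth noting: `text[0..@max_length]` is an inclusive range, so it actually keeps `max_length + 1` characters, and the single-quoted `'#{@max_length}'` string will not interpolate. A standalone sketch of the intended guard (hypothetical helper name, not the gem's code):

```ruby
# Truncate a string to max_length characters, warning when a cut happens.
def truncate_to_max(text, max_length)
  return text if text.length <= max_length

  # Double quotes are needed for #{} interpolation to work.
  warn("WARNING: string cut, length > #{max_length}")
  text[0...max_length] # exclusive range keeps exactly max_length characters
end

truncate_to_max("abcdef", 4) # => "abcd"
truncate_to_max("abc", 4)    # => "abc"
```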
data/lib/rababa/encoders.rb CHANGED
@@ -1,8 +1,8 @@
  """
  corresponds to:
- https://github.com/interscript/rababa/blob/master/python/util/text_encoders.py
+ https://github.com/interscript/rababa/blob/main/python/util/text_encoders.py
  and
- https://github.com/interscript/rababa/blob/master/python/util/text_cleaners.py
+ https://github.com/interscript/rababa/blob/main/python/util/text_cleaners.py
  """
 
  require_relative "arabic_constants"
data/lib/rababa/harakats.rb CHANGED
@@ -14,7 +14,7 @@ module Rababa::Harakats
  char_haraqat = []
 
  while stack.length != 0
- char_haraqat.append(stack.pop)
+ char_haraqat << stack.pop
  end
 
  full_haraqah = char_haraqat.join("")
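The switch from `Array#append` to `<<` is an idiomatic fix: `append` only exists as an alias of `push` since Ruby 2.5, while `<<` works on every Ruby version. A minimal sketch of the drain-a-stack pattern used here:

```ruby
# Draining a stack with << builds the reversed sequence.
stack = ["a", "b", "c"]
chars = []
chars << stack.pop while stack.length != 0
chars.join("") # => "cba"
```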
data/lib/rababa/reconcile.rb CHANGED
@@ -31,7 +31,7 @@ module Rababa::Reconcile
  (idx_ori..d_original.length).each {|i|
  if (c_dia == d_original[i])
  idx_ori = i
- l_map.append([idx_dia, idx_ori])
+ l_map << [idx_dia, idx_ori]
  break
  end
  }
@@ -99,35 +99,3 @@ module Rababa::Reconcile
  end
 
  end
-
-
- """TESTS
- TODO: MOVE TO RSPEC
- d_tests = [{'original' => '# گيله پسمير الجديد 34',
- 'diacritized' => 'يَلِهُ سُمِيْرٌ الجَدِيدُ',
- 'reconciled' => '# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34' },
-
- {'original' => 'abc',
- 'diacritized' => '',
- 'reconciled' => 'abc'},
-
- {'original' => '‘Iz. Ibrāhīm as-Sa‘danī',
- 'diacritized' => '',
- 'reconciled' => '‘Iz. Ibrāhīm as-Sa‘danī'},
-
- {'original' => '26 سبتمبر العقبة',
- 'diacritized' => 'سَبْتَمْبَرِ العَقَبَة',
- 'reconciled' => '26 سَبْتَمْبَرِ العَقَبَة'}]
-
- d_tests.each {|d| \
- if not d['reconciled']==reconcile_strings(d['original'], d['diacritized'])
- raise Exception.new('reconcile string not matched')
- end
- }
-
- or:
- for s in '# گيله پسمير الجديد 34' 'abc' '‘Iz. Ibrāhīm as-Sa‘danī' '26 سبتمبر العقبة'
- do;
- ruby rababa.rb -t $s -m '../models-data/diacritization_model.onnx'
- done
- """
data/lib/rababa/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module Rababa
- VERSION = "0.1.0"
+ VERSION = "0.1.1"
  end
data/models-data/README.adoc ADDED
@@ -0,0 +1,6 @@
+ === Model data dir
+
+ Contains:
+
+ * ONNX data
+ * Pickle sample data
data/python/README.adoc ADDED
@@ -0,0 +1,211 @@
+ = Rababa Python for diacritization
+
+ == Purpose
+
+ Rababa Python is used for both:
+
+ * Training of the Rababa diacritization models
+ * Conversion of non-diacritized Arabic into diacritized Arabic
+ (i.e. running of the Rababa diacritization models)
+
+ == Introduction
+
+ Rababa uses deep learning models for recovering Arabic language diacritics.
+
+ Rababa implements the models described in the paper
+ https://ieeexplore.ieee.org/document/9274427[Effective Deep Learning Models for Automatic Diacritization of Arabic Text] and builds on the implementation from
+ https://github.com/almodhfer/Arabic_Diacritization[almodhfer],
+ which we have selected for this project from the list of alternatives in
+ the README.
+
+ Out of the four models that https://github.com/almodhfer[almodhfer] has
+ implemented, we selected the simplest and most performant ones:
+
+ * The baseline model (`baseline`): consists of 3 bidirectional LSTM layers with
+ optional batch norm layers.
+
+ * The CBHG model (`cbhg`): uses only the encoder of the Tacotron-based model
+ with optional post-LSTM and batch norm layers.
+
+ == Usage
+
+ === Prerequisites
+
+ Python version: 3.6+.
+
+ Set up dependencies with:
+
+ [source,bash]
+ ----
+ pip install -r requirements.txt
+ ----
+
+ === Quickstart
+
+ . Setup prerequisites
+
+ . Download the released model
+ https://github.com/secryst/rababa-models/releases/download/0.1/2000000-snapshot.pt[here]
+ and place under `python/log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt`
+
+ . Single sentences and text can now be diacritized as below:
+
+ [source,bash]
+ ----
+ python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
+ python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
+ ----
+
+ The maximal string length is set in configs at `600`.
+
+ Longer lines need to be broken down, for instance using the library
+ introduced in the link:../lib/README.adoc[Ruby quickstart] section.
+
+ == Training
+
+ === Datasets
+
+ * We have chosen the "Tashkeela processed" corpus of ~2,800,000 sentences:
+ ** https://github.com/interscript/rababa-tashkeela
+
+ Other datasets are discussed in the reviewed literature and in the article
+ referenced above.
+
+ For training, data needs to be stored in the `data/CA_MSA` directory in the
+ following format:
+
+ [source,bash]
+ ----
+ > ls data/CA_MSA/*
+ --> data/CA_MSA/eval.csv data/CA_MSA/train.csv data/CA_MSA/test.csv
+ ----
+
+ For instance:
+
+ [source,bash]
+ ----
+ mkdir -p data/CA_MSA
+ cd data
+ curl -sSL https://github.com/interscript/rababa-tashkeela/archive/refs/tags/v1.0.zip -o tashkeela.zip
+ unzip tashkeela.zip
+ for d in `ls rababa-tashkeela-1.0/tashkeela_val/*`; do cat $d >> CA_MSA/eval.csv; done
+ for d in `ls rababa-tashkeela-1.0/tashkeela_train/*`; do cat $d >> CA_MSA/train.csv; done
+ for d in `ls rababa-tashkeela-1.0/tashkeela_test/*`; do cat $d >> CA_MSA/test.csv; done
+ ----
+
+ Alternatively, the dataset can be downloaded from
+ https://github.com/interscript/rababa-tashkeela[rababa-tashkeela].
+
+ === Load Model
+
+ Trained CBHG models are also available under
+ https://github.com/secryst/rababa-models[releases].
+
+ Models are to be copied, as specified in the link just above, under:
+
+ [source,bash]
+ ----
+ > log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt
+ ----
+
+ === Config Files
+
+ One can adjust the model configurations in the `/config` directory.
+
+ The model configurations cover the layers as well as the dataset to be used
+ and various other options.
+
+ The configuration files are passed explicitly to the commands below.
+
+ === Data Preprocessing
+
+ The original work cited above allows for both raw and preprocessed data.
+
+ We go for the simplest, raw version here:
+
+ * As mentioned above, the corpus must have `test.csv`,
+ `train.csv`, and `eval.csv`.
+
+ * Specify in the config that the data is not preprocessed.
+ In that case, each batch will be processed and the text and diacritics
+ will be extracted from the original text.
+
+ * You also have to specify the text encoder and the cleaner functions.
+ Two text encoders are included: `BasicArabicEncoder` and
+ `ArabicEncoderWithStartSymbol`.
+
+ Moreover, we have one cleaning function, `valid_arabic_cleaners`, which removes
+ all characters except valid Arabic characters: Arabic letters,
+ punctuation, and diacritics.
+
+ === Training
+
+ All model configs are placed in the `config` directory.
+
+ [source,bash]
+ ----
+ python train.py --model "cbhg" --config config/cbhg.yml
+ ----
+
+ The model will report the WER and DER while training, using the
+ `diacritization_evaluation` package. The frequency of calculating WER and
+ DER can be specified in the config file.
+
+ === Testing
+
+ Testing is done in the same way as training.
+ For instance, with the CBHG model on the data in `/data/CA_MSA/test.csv`:
+
+ [source,bash]
+ ----
+ python test.py --model 'cbhg' --config config/cbhg.yml
+ ----
+
+ The last saved model will be loaded unless `test_data_path` is specified
+ in the config. The test file is expected to have the correct diacritization!
+
+ If the test file name is different from `test.csv`, you
+ can set it in the config via `test_file_name`.
+
+ === Diacritize text or files
+
+ Single sentences or files can be processed. The code outputs the diacritized
+ text or lines.
+
+ [source,bash]
+ ----
+ python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
+ python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
+ ----
+
+ === Convert the CBHG Python model to ONNX
+
+ The last model stored during training is automatically chosen, and the ONNX
+ model is saved to a hardcoded location:
+
+ * `../models-data/diacritization_model.onnx`
+
+ ==== Run
+
+ [source,bash]
+ ----
+ python diacritization_model_to_onnx.py
+ ----
+
+ ==== Important parameters
+
+ They are hardcoded at the beginning of the script:
+
+ * `max_len`:
+ ** maximal string length; the initial model value is given in the config.
+ ** this parameter allows tuning the model speed and size!
+ ** the Ruby link:../lib/README.adoc[README] points to resources for preprocessing
+
+ * `batch_size`:
+ ** the value is given by the original model and its training.
+ ** this constrains how the ONNX model can be put in production:
+ *** if > 1, single lines involve redundant computations
+ *** if > 1, files are processed in batches.
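The `batch_size` constraint above can be illustrated with a sketch (hypothetical helper, not project code): a model exported with a fixed batch size must receive inputs of exactly that shape, so the final short batch needs padding, which is why a batch size above 1 wastes computation on single lines:

```ruby
# Group lines into fixed-size batches, padding the final batch with empty
# strings so every ONNX call would see the same input shape.
def to_fixed_batches(lines, batch_size)
  lines.each_slice(batch_size).map do |batch|
    batch + [""] * (batch_size - batch.length) # pad the final, short batch
  end
end

batches = to_fixed_batches(["s1", "s2", "s3"], 2)
# => [["s1", "s2"], ["s3", ""]]
```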