rababa 0.1.0 → 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (41)
  1. checksums.yaml +4 -4
  2. data/.github/workflows/python.yml +81 -0
  3. data/.github/workflows/release.yml +36 -0
  4. data/.github/workflows/ruby.yml +27 -0
  5. data/.gitignore +3 -0
  6. data/.rubocop.yml +1 -1
  7. data/CODE_OF_CONDUCT.md +13 -13
  8. data/README.adoc +80 -0
  9. data/Rakefile +1 -1
  10. data/docs/{research-arabic-diacritization-06-2021.md → research-arabic-diacritization-06-2021.adoc} +52 -37
  11. data/exe/rababa +1 -1
  12. data/lib/README.adoc +95 -0
  13. data/lib/rababa/diacritizer.rb +16 -8
  14. data/lib/rababa/encoders.rb +2 -2
  15. data/lib/rababa/harakats.rb +1 -1
  16. data/lib/rababa/reconcile.rb +1 -33
  17. data/lib/rababa/version.rb +1 -1
  18. data/models-data/README.adoc +6 -0
  19. data/python/README.adoc +211 -0
  20. data/python/config/cbhg.yml +1 -1
  21. data/python/config/test_cbhg.yml +51 -0
  22. data/python/dataset.py +23 -31
  23. data/python/diacritization_model_to_onnx.py +216 -15
  24. data/python/diacritizer.py +35 -31
  25. data/python/log_dir/CA_MSA.base.cbhg/models/README.adoc +2 -0
  26. data/python/log_dir/README.adoc +1 -0
  27. data/python/{requirement.txt → requirements.txt} +1 -1
  28. data/python/setup.py +32 -0
  29. data/python/trainer.py +10 -4
  30. data/python/util/reconcile_original_plus_diacritized.py +2 -0
  31. data/python/util/text_cleaners.py +59 -4
  32. data/rababa.gemspec +1 -1
  33. data/test-datasets/data-arabic-pointing/{Readme.md → README.adoc} +2 -1
  34. metadata +22 -18
  35. data/.github/workflows/main.yml +0 -18
  36. data/README.md +0 -73
  37. data/lib/README.md +0 -82
  38. data/models-data/README.md +0 -6
  39. data/python/README.md +0 -163
  40. data/python/log_dir/CA_MSA.base.cbhg/models/Readme.md +0 -2
  41. data/python/log_dir/README.md +0 -1
data/exe/rababa CHANGED
@@ -10,7 +10,7 @@ parser = Rababa.parser
 
 config_path = parser.has_key?(:config) ? parser[:config] : "config/model.yml"
 
-diacritizer = Rababa::Diacritizer.new(parser[:model_path], config_path)
+diacritizer = Rababa::Diacritizer.new(parser[:model_path], YAML.load_file(config_path))
 
 if parser.has_key?(:text)
   # run diacritization text if has :text
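The change above moves YAML parsing out of the diacritizer: the CLI now hands a parsed config Hash to the constructor instead of a file path. A minimal sketch of the new calling contract, using a hypothetical stand-in class (the real `Rababa::Diacritizer` also needs the `onnxruntime` gem):

```ruby
require "yaml"

# Stand-in mirroring only the constructor signature of the new API:
# the caller parses YAML and passes the resulting Hash.
class ToyDiacritizer
  attr_reader :max_length, :batch_size

  def initialize(_onnx_model_path, config)
    @max_length = config["max_len"]
    @batch_size = config["batch_size"]
  end
end

# The real CLI uses YAML.load_file(config_path); an inline string
# keeps this sketch self-contained.
config = YAML.safe_load("max_len: 200\nbatch_size: 32\n")
d = ToyDiacritizer.new("model.onnx", config)
```

Passing the parsed Hash rather than a path lets callers build configs programmatically instead of always reading from disk.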
data/lib/README.adoc ADDED
@@ -0,0 +1,95 @@
+= Arabic Diacritization in Ruby with Rababa
+
+== Try out Rababa
+
+* Install the gem as shown below
+* Download a Ruby model from the https://github.com/secryst/rababa-models[releases]
+
+== Usage
+
+=== Install
+
+[source,sh]
+----
+gem install rababa
+----
+
+=== Download the ONNX model
+
+Please download the `diacritization_model_max_len_200.onnx` model file
+from https://github.com/secryst/rababa-models/releases/tag/0.1.
+
+=== Running examples
+
+One can diacritize either single strings:
+
+[source,sh]
+----
+rababa -t 'قطر' -m diacritization_model_max_len_200.onnx
+# or, when inside the gem directory during development:
+bundle exec exe/rababa -t 'قطر' -m diacritization_model_max_len_200.onnx
+----
+
+Or files, such as `data/example.txt` or your own Arabic file (the maximum
+string length is specified in the model and has to match the `max_len`
+parameter in `config/model.yml`):
+
+[source,sh]
+----
+rababa -f data/example.txt -m diacritization_model_max_len_200.onnx
+# or, when inside the gem directory during development:
+bundle exec exe/rababa -f data/example.txt -m diacritization_model_max_len_200.onnx
+----
+
+In general, generic Arabic texts have to be preprocessed before running
+Rababa, e.g. split at sentence beginnings. This can be done for instance with
+https://github.com/Hamza5/Pipeline-diacritizer[Pipeline-diacritizer]:
+
+----
+python __main__.py preprocess source destination
+----
+
+== Training
+
+=== ONNX Models
+
+They can either be built in the `/python` directory of this repository or
+downloaded from the https://github.com/secryst/rababa-models[releases].
+
+Alternatively, an ONNX model can be generated by running the Python
+https://github.com/interscript/rababa/blob/main/python/diacritization_model_to_onnx.py[conversion script]
+in this library. This requires going through some of the steps described in
+the link above.
+
+=== Parameters
+
+* text to diacritize: `-t TEXT`, `--text=TEXT`
+* path to a file to diacritize: `-f FILE`, `--text_filename=FILE`
+* path to the ONNX model (*mandatory*): `-m MODEL`, `--model_file=MODEL`
+* path to the config file (default: `config/model.yml`): `-c CONFIG`, `--config=CONFIG`
+
+=== Config
+
+==== Key parameters
+
+* `max_len`: `200` -- `600`
+
+** This parameter has to match the ONNX model built using the
+https://github.com/interscript/rababa/blob/main/python/diacritization_model_to_onnx.py[conversion script]
+and following python/README.adoc.
+
+** Longer sentences need to be preprocessed, which can be done for
+instance using https://github.com/Hamza5[Hamza5]'s
+https://github.com/Hamza5/Pipeline-diacritizer/blob/master/pipeline_diacritizer/pipeline_diacritizer.py[code].
+
+** The smaller the value, the faster the network runs.
+
+* `text_encoder`, corresponding to the https://github.com/interscript/rababa/blob/main/python/util/text_encoders.py[encoding rules]:
+** `BasicArabicEncoder`
+** `ArabicEncoderWithStartSymbol`
+
+* `text_cleaner`, corresponding to the https://github.com/interscript/rababa/blob/main/python/util/text_cleaners.py[cleaning logic]:
+** `basic_cleaners`: collapse redundant whitespace and strip the string
+** `valid_arabic_cleaners`: basic cleaning plus keeping only valid Arabic words
data/lib/rababa/diacritizer.rb CHANGED
@@ -1,6 +1,6 @@
 
 # this refers to:
-# https://github.com/interscript/rababa/blob/master/python/diacritizer.py
+# https://github.com/interscript/rababa/blob/main/python/diacritizer.py
 # as well a drastic simplification of
 # https://github.com/almodhfer/Arabic_Diacritization/blob/master/config_manager.py
 
@@ -16,27 +16,33 @@ module Rababa
     include Rababa::Harakats
     include Rababa::Reconcile
 
-    def initialize(onnx_model_path, config_path)
+    def initialize(onnx_model_path, config)
 
       # load inference model from model_path
       @onnx_session = OnnxRuntime::InferenceSession.new(onnx_model_path)
 
       # load config
-      @config = YAML.load_file(config_path)
+      @config = config
       @max_length = @config['max_len']
       @batch_size = @config['batch_size']
 
       # instantiate encoder's class
-      @encoder = get_text_encoder()
+      @encoder = get_text_encoder
       @start_symbol_id = @encoder.start_symbol_id
 
     end
 
     # preprocess text into indices
     def preprocess_text(text)
-      #if (text.length > @max_length)
-      #  raise ValueError.new('text length larger than max_length')
-      #end
+      # if (text.length > @max_length)
+      #   raise ValueError.new('text length larger than max_length')
+      # end
+      # hack in absence of preprocessing!
+      if text.length > @max_length
+        text = text[0..@max_length]
+        warn("WARNING: string cut, length > #{@max_length}\n")
+        warn("text: " + text)
+      end
 
       text = @encoder.clean(text)
       text = remove_diacritics(text)
@@ -47,6 +53,8 @@ module Rababa
 
       # Diacritize single arabic strings
       def diacritize_text(text)
+        # diacritize a single Arabic string
+        text = text.strip
         seq = preprocess_text(text)
 
         # initialize onnx computation
@@ -66,7 +74,7 @@ module Rababa
       def diacritize_file(path)
         texts = []
         File.open(path).each do |line|
-          texts.push(line.chomp)
+          texts.push(line.chomp.strip)
         end
 
         # process batches
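The diacritizer now truncates over-long input instead of raising, since no sentence-splitting preprocessing exists on the Ruby side yet. A standalone sketch of that guard, using a hypothetical `truncate_with_warning` helper (the gem does this inline in `preprocess_text`):

```ruby
# Cut text down to max_length characters, warning on stderr when a cut
# happens; strings that already fit are returned unchanged.
def truncate_with_warning(text, max_length)
  return text if text.length <= max_length

  warn("WARNING: string cut, length > #{max_length}")
  text[0...max_length] # exclusive range: keep exactly max_length chars
end
```

Note the double-quoted warning string: single quotes would print the literal characters `#{max_length}` instead of interpolating the value.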
data/lib/rababa/encoders.rb CHANGED
@@ -1,8 +1,8 @@
 """
 corresponds to:
-https://github.com/interscript/rababa/blob/master/python/util/text_encoders.py
+https://github.com/interscript/rababa/blob/main/python/util/text_encoders.py
 and
-https://github.com/interscript/rababa/blob/master/python/util/text_cleaners.py
+https://github.com/interscript/rababa/blob/main/python/util/text_cleaners.py
 """
 
 require_relative "arabic_constants"
data/lib/rababa/harakats.rb CHANGED
@@ -14,7 +14,7 @@ module Rababa::Harakats
     char_haraqat = []
 
     while stack.length != 0
-      char_haraqat.append(stack.pop)
+      char_haraqat << stack.pop
     end
 
     full_haraqah = char_haraqat.join("")
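The `append` → `<<` change above is stylistic: `Array#append` is an alias of `#push` (Ruby 2.5+), and the shovel operator is the idiomatic way to push a single element. A self-contained sketch of the same drain-a-stack loop:

```ruby
# Pop elements off a stack one by one and collect them, reversing
# their order in the process (last pushed comes out first).
stack = [3, 2, 1]
drained = []
drained << stack.pop until stack.empty?
# drained is now [1, 2, 3] and stack is empty
```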
data/lib/rababa/reconcile.rb CHANGED
@@ -31,7 +31,7 @@ module Rababa::Reconcile
     (idx_ori..d_original.length).each {|i|
       if (c_dia == d_original[i])
         idx_ori = i
-        l_map.append([idx_dia, idx_ori])
+        l_map << [idx_dia, idx_ori]
         break
       end
     }
@@ -99,35 +99,3 @@ module Rababa::Reconcile
     end
 
   end
-
-
-"""TESTS
-TODO: MOVE TO RSPEC
-d_tests = [{'original' => '# گيله پسمير الجديد 34',
-            'diacritized' => 'يَلِهُ سُمِيْرٌ الجَدِيدُ',
-            'reconciled' => '# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34' },
-
-           {'original' => 'abc',
-            'diacritized' => '',
-            'reconciled' => 'abc'},
-
-           {'original' => '‘Iz. Ibrāhīm as-Sa‘danī',
-            'diacritized' => '',
-            'reconciled' => '‘Iz. Ibrāhīm as-Sa‘danī'},
-
-           {'original' => '26 سبتمبر العقبة',
-            'diacritized' => 'سَبْتَمْبَرِ العَقَبَة',
-            'reconciled' => '26 سَبْتَمْبَرِ العَقَبَة'}]
-
-d_tests.each {|d| \
-  if not d['reconciled']==reconcile_strings(d['original'], d['diacritized'])
-    raise Exception.new('reconcile string not matched')
-  end
-}
-
-or:
-for s in '# گيله پسمير الجديد 34' 'abc' '‘Iz. Ibrāhīm as-Sa‘danī' '26 سبتمبر العقبة'
-do;
-  ruby rababa.rb -t $s -m '../models-data/diacritization_model.onnx'
-done
-"""
data/lib/rababa/version.rb CHANGED
@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 
 module Rababa
-  VERSION = "0.1.0"
+  VERSION = "0.1.1"
 end
data/models-data/README.adoc ADDED
@@ -0,0 +1,6 @@
+=== Model data dir
+
+Contains:
+
+* ONNX data
+* Pickle sample data
data/python/README.adoc ADDED
@@ -0,0 +1,211 @@
+= Rababa Python for diacritization
+
+== Purpose
+
+Rababa Python is used for both:
+
+* Training of the Rababa diacritization models
+* Conversion of non-diacritized Arabic into diacritized Arabic
+  (i.e. running of the Rababa diacritization models)
+
+== Introduction
+
+Rababa uses deep learning models for recovering Arabic language diacritics.
+
+Rababa implements the models described in the paper
+https://ieeexplore.ieee.org/document/9274427[Effective Deep Learning Models for Automatic Diacritization of Arabic Text] and builds on the model implementations from
+https://github.com/almodhfer/Arabic_Diacritization[almodhfer],
+which we have selected for this project from the list of alternatives in
+the README.
+
+Out of the four models that https://github.com/almodhfer[almodhfer] has
+implemented, we selected the simplest and most performant ones:
+
+* The baseline model (`baseline`): consists of 3 bidirectional LSTM layers
+  with optional batch norm layers.
+
+* The CBHG model (`cbhg`): uses only the encoder of the Tacotron-based model
+  with optional post-LSTM and batch norm layers.
+
+== Usage
+
+=== Prerequisites
+
+Python version: 3.6+.
+
+Set up dependencies with:
+
+[source,bash]
+----
+pip install -r requirements.txt
+----
+
+=== Quickstart
+
+. Set up the prerequisites
+
+. Download the released model
+https://github.com/secryst/rababa-models/releases/download/0.1/2000000-snapshot.pt[here]
+and place it under `python/log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt`
+
+. Single sentences and text files can now be diacritized as below:
+
+[source,bash]
+----
+python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
+python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
+----
+
+The maximal string length is set in the configs at `600`.
+
+Longer lines need to be broken down, for instance using the library
+introduced in the link:../lib/README.adoc[Ruby quickstart] section.
+
+== Training
+
+=== Datasets
+
+* We have chosen the "Tashkeela processed" corpus (~2,800,000 sentences):
+** https://github.com/interscript/rababa-tashkeela
+
+Other datasets are discussed in the reviewed literature and in the article
+referenced above.
+
+For training, data needs to be stored in the `data/CA_MSA` directory in the
+following format:
+
+[source,bash]
+----
+> ls data/CA_MSA/*
+--> data/CA_MSA/eval.csv  data/CA_MSA/train.csv  data/CA_MSA/test.csv
+----
+
+For instance:
+
+[source,bash]
+----
+mkdir -p data/CA_MSA
+cd data
+curl -sSL https://github.com/interscript/rababa-tashkeela/archive/refs/tags/v1.0.zip -o tashkeela.zip
+unzip tashkeela.zip
+for d in `ls rababa-tashkeela-1.0/tashkeela_val/*`; do cat $d >> CA_MSA/eval.csv; done
+for d in `ls rababa-tashkeela-1.0/tashkeela_train/*`; do cat $d >> CA_MSA/train.csv; done
+for d in `ls rababa-tashkeela-1.0/tashkeela_test/*`; do cat $d >> CA_MSA/test.csv; done
+----
+
+Alternatively, the dataset can be downloaded directly from
+https://github.com/interscript/rababa-tashkeela[rababa-tashkeela].
+
+=== Load Model
+
+Alternatively, trained CBHG models are available under
+https://github.com/secryst/rababa-models[releases].
+
+Models are to be copied, as specified in the link just above, to:
+
+[source,bash]
+----
+> log_dir/CA_MSA.base.cbhg/models/2000000-snapshot.pt
+----
+
+=== Config Files
+
+One can adjust the model configurations in the `/config` directory.
+
+The model configurations cover the layers, but also the dataset to be used
+and various other options.
+
+The configuration files are passed explicitly to the applications below.
+
+=== Data Preprocessing
+
+The original work cited above allows for both raw and preprocessed data.
+
+We go for the simplest, raw version here:
+
+* As mentioned above, the corpus must have `test.csv`, `train.csv`, and
+  `eval.csv`.
+
+* Specify in the config that the data is not preprocessed.
+  In that case, each batch will be processed and the text and diacritics
+  will be extracted from the original text.
+
+* You also have to specify the text encoder and the cleaner functions.
+  Two text encoders are included: `BasicArabicEncoder` and
+  `ArabicEncoderWithStartSymbol`.
+
+Moreover, we have one cleaning function, `valid_arabic_cleaners`, which
+removes all characters except valid Arabic ones: Arabic letters, punctuation,
+and diacritics.
+
+=== Training
+
+All model configs are placed in the `config` directory.
+
+[source,bash]
+----
+python train.py --model "cbhg" --config config/cbhg.yml
+----
+
+The model will report the WER and DER while training using the
+`diacritization_evaluation` package. The frequency of calculating WER and
+DER can be specified in the config file.
+
+=== Testing
+
+Testing is done in the same way as training.
+For instance, with the CBHG model on the data in `/data/CA_MSA/test.csv`:
+
+[source,bash]
+----
+python test.py --model 'cbhg' --config config/cbhg.yml
+----
+
+The model will load the last saved snapshot unless `test_data_path` is
+specified in the config. The test file is expected to contain the correct
+diacritization!
+
+If the test file name is different from `test.csv`, you can set it in the
+config as `test_file_name`.
+
+=== Diacritize text or files
+
+Single sentences or files can be processed. The output is the diacritized
+text or lines.
+
+[source,bash]
+----
+python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text 'قطر'
+python diacritize.py --model_kind "cbhg" --config config/cbhg.yml --text_file relative_path_to_text_file
+----
+
+=== Convert the CBHG Python model to ONNX
+
+The last model stored during training is automatically chosen, and the ONNX
+model is saved to a hardcoded location:
+
+* `../models-data/diacritization_model.onnx`
+
+==== Run
+
+[source,bash]
+----
+python diacritization_model_to_onnx.py
+----
+
+==== Important parameters
+
+They are hardcoded at the beginning of the script:
+
+* `max_len`:
+** must match the maximal string length; the initial model value is given in the config.
+** this parameter allows tuning the model speed and size!
+** the Ruby link:../lib/README.adoc[README] points to resources for preprocessing.
+
+* `batch_size`:
+** the value is given by the original model and its training.
+** this constrains how the ONNX model can be put into production:
+*** if > 1, single lines involve redundant computations
+*** if > 1, files are processed in batches
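The READMEs in this release reference several config keys (`max_len`, `batch_size`, `text_encoder`, `text_cleaner`, `test_file_name`). A hypothetical minimal fragment in the spirit of `config/cbhg.yml` -- key names come from the text above, values are illustrative only, not the shipped defaults:

```yaml
# Illustrative config sketch; see config/cbhg.yml in the repository
# for the actual keys and values used by the released models.
max_len: 600                             # 200 -- 600; must match the ONNX export
batch_size: 32                           # fixed by the original model's training
text_encoder: ArabicEncoderWithStartSymbol
text_cleaner: valid_arabic_cleaners
test_file_name: test.csv                 # override when the test file differs
```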